NAME

URI::Find - Find URIs in arbitrary text

SYNOPSIS

use URI::Find;

$how_many_found = find_uris($text, \&callback);

DESCRIPTION

This module does one thing: it finds URIs and URLs in plain text. It finds them quickly and it finds them all (or at least everything URI::URL considers a URI). It employs a series of heuristics to:

Find schemeless URIs (i.e. www.foo.com)
Avoid picking up trailing characters from the text
Avoid picking up URL-like things such as Perl module names
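
The second heuristic above can be illustrated with a small sketch. This is not the module's actual heuristic, only a plausible simplification: split_trailing() is a hypothetical helper, and the set of characters it treats as trailing punctuation is an assumption chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT URI::Find's actual heuristic: splits a candidate
# into the part kept as the URI and the trailing punctuation left in the text.
sub split_trailing {
    my ($candidate) = @_;
    my $trailing = '';

    # Remove a run of sentence punctuation at the very end of the candidate.
    if ($candidate =~ s/([.,;:!?]+)\z//) {
        $trailing = $1;
    }
    return ($candidate, $trailing);
}

my ($uri, $rest) = split_trailing("http://www.foo.com/index.html.");
print "URI: $uri, left in text: '$rest'\n";
```

Only punctuation at the very end is removed, so the dots inside the host name and file name are untouched.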

Functions

URI::Find exports one function, find_uris(). It takes two arguments: the first is the text string to search, and the second is a function reference. It returns the number of URIs found.

The function is a callback which is called on each URI found. It is passed two arguments: the first is a URI::URL object representing the URI found, and the second is the original text of the URI found. The return value of the callback replaces the original URI in the text.
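
The callback contract described above can be sketched without the module itself. The following is a minimal illustration, not URI::Find's implementation. replace_matches() is a hypothetical helper, and its pattern is a deliberately naive stand-in for the module's heuristics. It takes a reference to the text (an assumption made for clarity). It also passes the matched text for both callback arguments, where find_uris() would pass a URI::URL object first.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of URI::Find: mimics the callback contract
# with a naive pattern in place of the module's real heuristics.
sub replace_matches {
    my ($text_ref, $callback) = @_;
    my $count = 0;

    # Each match is handed to the callback; whatever the callback
    # returns is substituted back into the text.
    $$text_ref =~ s{(https?://[^\s<>"]+)}{
        $count++;
        $callback->($1, $1);
    }gex;

    return $count;
}

my $text  = "See http://www.foo.com/ and http://www.bar.org/docs today";
my $found = replace_matches(\$text, sub {
    my ($uri, $orig_uri) = @_;
    return "[$orig_uri]";
});

print "$found found: $text\n";
```

A callback that returns its second argument unchanged leaves the text as it was, which is why the examples that only inspect URIs return $orig_uri.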

EXAMPLES

Simply print the original URI text found and the normalized representation.

find_uris($text, 
          sub {
              my($uri, $orig_uri) = @_;
              print "The text '$orig_uri' represents '$uri'\n";
              return $orig_uri;
          });

Check each URI in the document to see if it exists.

use LWP::Simple;
find_uris($text,
          sub {
              my($uri, $orig_uri) = @_;
              if( head $uri ) {
                  print "$orig_uri is okay\n";
              }
              else {
                  print "$orig_uri cannot be found\n";
              }
              return $orig_uri;
          });

Wrap each URI found in an HTML anchor.

find_uris($text,
          sub {
              my($uri, $orig_uri) = @_;
              return qq|<a href="$uri">$orig_uri</a>|;
          });

CAVEATS, BUGS, ETC...

RFC 2396 Appendix E suggests using the form '<http://www.foo.com>' or '<URL:http://www.foo.com>' when putting URLs in plain text. URI::Find accommodates this suggestion and considers the entire thing (brackets and all) to be part of the URL found. This means that when find_uris() sees '<URL:http://www.foo.com>' it will hand that entire string to your callback, not just the URL.
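
A callback that needs only the URL can separate it from the wrapper itself. The sketch below is one possible approach, not something the module provides: split_brackets() is a hypothetical helper, and it assumes the wrapper is exactly '<' or '<URL:' at the start and '>' at the end.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of URI::Find: splits the original text handed
# to a callback into an optional '<' or '<URL:' prefix, the bare URL, and an
# optional '>' suffix.
sub split_brackets {
    my ($orig_uri) = @_;
    if ($orig_uri =~ /\A(<(?:URL:)?)(.*)(>)\z/s) {
        return ($1, $2, $3);
    }

    # No wrapper present: the whole string is the URL.
    return ('', $orig_uri, '');
}

my ($open, $bare, $close) = split_brackets('<URL:http://www.foo.com>');
print "bare URL: $bare\n";
```

Returning the prefix and suffix alongside the URL lets the callback put the brackets back around whatever replacement it builds.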

NOTE: The prototype on find_uris() is already getting annoying to me. I might remove it in a future version.

SEE ALSO

URI::URL, URI, RFC 2396 (especially Appendix E)

AUTHOR

Michael G Schwern <schwern@pobox.com> with insight from Uri Gutman, Greg Bacon and Jeff Pinyan.