NAME
URI::Find - Find URIs in arbitrary text
SYNOPSIS
use URI::Find;
$how_many_found = find_uris($text, \&callback);
DESCRIPTION
This module does one thing: it finds URIs and URLs in plain text. It finds them quickly, and it finds them all (or at least everything URI::URL considers a URI). It employs a series of heuristics to:
- Find schemeless URIs (e.g. www.foo.com)
- Avoid picking up trailing characters from the text
- Avoid picking up URL-like things such as Perl module names
Functions
URI::Find exports one function, find_uris(). It takes two arguments: the first is the text string to search, and the second is a function reference. As the SYNOPSIS shows, it returns the number of URIs found.
The function reference is a callback which is called once for each URI found. It is passed two arguments: the first is a URI::URL object representing the URI found, and the second is the original text of the URI as it appeared. The return value of the callback replaces the original URI in the text.
EXAMPLES
Simply print the original URI text found and the normalized representation.
find_uris($text,
          sub {
              my($uri, $orig_uri) = @_;
              print "The text '$orig_uri' represents '$uri'\n";
              return $orig_uri;
          });
Check each URI in the document to see if it exists.
use LWP::Simple;
find_uris($text,
          sub {
              my($uri, $orig_uri) = @_;
              if( head $uri ) {
                  print "$orig_uri is okay\n";
              }
              else {
                  print "$orig_uri cannot be found\n";
              }
              return $orig_uri;
          });
Wrap each URI found in an HTML anchor.
find_uris($text,
          sub {
              my($uri, $orig_uri) = @_;
              return qq|<a href="$uri">$orig_uri</a>|;
          });
CAVEATS, BUGS, ETC...
RFC 2396 Appendix E suggests using the form '<http://www.foo.com>' or '<URL:http://www.foo.com>' when putting URLs in plain text. URI::Find accommodates this suggestion and considers the entire thing (brackets and all) to be part of the URL found. This means that when find_uris() sees '<URL:http://www.foo.com>' it will hand that entire string to your callback as the original text, not just the URL.
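If the callback wants only the bare URL text, it can strip the wrapper itself. The following is a minimal sketch, not part of URI::Find: strip_uri_wrapper() is a hypothetical helper name, and it assumes the wrapper is exactly one of the two forms described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by URI::Find): remove the
# RFC 2396 Appendix E wrapper, '<...>' or '<URL:...>', from the
# original text handed to a callback. Unwrapped text is returned
# unchanged.
sub strip_uri_wrapper {
    my($orig_uri) = @_;
    (my $bare = $orig_uri) =~ s/^<(?:URL:)?(.*)>$/$1/si;
    return $bare;
}

# All three lines print http://www.foo.com
print strip_uri_wrapper('<URL:http://www.foo.com>'), "\n";
print strip_uri_wrapper('<http://www.foo.com>'), "\n";
print strip_uri_wrapper('http://www.foo.com'), "\n";
```

Inside a callback this would typically be applied to $orig_uri before building the replacement text, for example when the brackets should stay outside a generated HTML anchor.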
NOTE: The prototype on find_uris() is already getting annoying to me. I might remove it in a future version.
SEE ALSO
URI::URL, URI, RFC 2396 (especially Appendix E)
AUTHOR
Michael G Schwern <schwern@pobox.com> with insight from Uri Gutman, Greg Bacon and Jeff Pinyan.