NAME
Email::Extractor::Utils - Set of functions that can be useful when building web crawlers
VERSION
version 0.01
SYNOPSIS
use Email::Extractor::Utils qw( looks_like_url looks_like_file get_file_uri load_addr_to_str );
# or use Email::Extractor::Utils qw[:ALL];
$Email::Extractor::Utils::Verbose = 1;
my $text = load_addr_to_str($url);
$Email::Extractor::Utils::Assets
List of asset extensions, used by "drop_asset_links" in Email::Extractor::Utils
To see the default list of assets:
perl -Ilib -E "use Email::Extractor::Utils qw(:ALL); use Data::Dumper; print Dumper $Email::Extractor::Utils::Assets;"
load_addr_to_str
Accept a URI or a file path and return a string with its content
my $text = load_addr_to_str($url);
my $text = load_addr_to_str($path_to_file);
The function accepts either an http(s) URI or a file path
Dies if the file does not exist
Returns $resp->content even if the URL does not exist
Can also be used in tests when you need to mock HTTP requests
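The dispatch described above can be sketched as follows. This is only an illustration, not the module's code: slurp_addr is a hypothetical name, and HTTP::Tiny (a core module) is an assumed stand-in for whatever HTTP client the module uses internally.

```perl
use strict;
use warnings;
use HTTP::Tiny;    # core since Perl 5.14; assumed stand-in for the module's HTTP client

# Hypothetical helper: return the content behind an http(s) URL
# or a local file path as a single string.
sub slurp_addr {
    my ($addr) = @_;

    if ( $addr =~ m{\Ahttps?://}i ) {
        my $resp = HTTP::Tiny->new->get($addr);
        # Mirror the documented behaviour: hand back the body as-is,
        # without dying on an unsuccessful response.
        return $resp->{content};
    }

    # Local file: die when it cannot be opened, as documented above.
    open my $fh, '<', $addr or die "Cannot open $addr: $!";
    local $/;    # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;
    return $text;
}
```

Because file paths are accepted too, a test can point such a function at a saved HTML fixture instead of a live URL.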
get_abs_path
Return the absolute path of a file given relative to the current working directory
get_file_uri
Make an absolute path from a path relative to cwd and return it as a file URI that passes Regexp::Common::URI::file validation
get_file_uri('/test') # 'file:///root/test' if cwd is /root
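A minimal sketch of the idea, assuming Unix-style paths. file_uri_for is a hypothetical name, and the use of File::Spec->rel2abs is an assumption rather than the module's actual implementation. Unlike the example above, this sketch leaves an already absolute path unchanged.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: resolve a path against the current working
# directory and prefix it with the file:// scheme.
sub file_uri_for {
    my ($path) = @_;
    my $abs = File::Spec->rel2abs($path);    # relative paths are resolved against cwd
    return "file://$abs";
}

print file_uri_for('test.html'), "\n";
```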
looks_like_url
looks_like_url('http://example.com') # 1
looks_like_url('https://example.com') # 1
looks_like_url('/root/somefolder') # 0
Detect whether a link is an http or https URL
Uses Regexp::Common::URI::http
Return:
0 if the provided string is not a URL
URL without the query part (see https://metacpan.org/pod/Regexp::Common::URI::http#$7) if the provided string is a URL
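The return convention described above (0 or the URL without its query) can be illustrated with a simplified check. url_or_zero is a hypothetical name, and the plain regex below is an assumption that is much looser than Regexp::Common::URI::http.

```perl
use strict;
use warnings;

# Hypothetical, simplified check; the module relies on
# Regexp::Common::URI::http, which is considerably stricter.
sub url_or_zero {
    my ($str) = @_;
    # Require an http(s) scheme, a non-empty host and no whitespace.
    return 0 unless $str =~ m{\Ahttps?://[^\s/?#]+\S*\z}i;
    ( my $no_query = $str ) =~ s/[?#].*//s;    # drop query string and fragment
    return $no_query;
}

print url_or_zero('https://example.com/page?id=1'), "\n";
print url_or_zero('/root/somefolder'), "\n";
```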
looks_like_rel_link
Return true if the link looks like a relative URL, otherwise return false
looks_like_file
looks_like_file('http://example.com') # 0
looks_like_file('file:///root/somefolder') # 1
Detect whether a string is a file URI or not
Uses Regexp::Common::URI::file
absolutize_links_array
Make all links in array absolute
my $res = absolutize_links_array( $links, 'http://example.com' );
$links must be an ARRAYREF; an ARRAYREF is also returned
remove_external_links
my $res = remove_external_links( $links, 'http://example.com' ); # leave only links on http://example.com
Relative links stay untouched
$links must be an ARRAYREF; an ARRAYREF is also returned
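The filtering rule can be sketched like this. keep_internal is a hypothetical name and the prefix comparison is an assumption about how "external" is decided. Note that a protocol-relative link such as //other.org/x has no scheme and would be kept by this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: keep links that start with $base or that have
# no scheme at all (relative links); drop everything else.
sub keep_internal {
    my ( $links, $base ) = @_;
    $base =~ s{/+\z}{};    # ignore trailing slashes on the base
    my @kept = grep {
        !m{\A[a-z][a-z0-9+.-]*:}i             # relative link: no scheme
          || m{\A\Q$base\E(?:[/?#]|\z)}i      # absolute link on the same site
    } @$links;
    return \@kept;
}

my $res = keep_internal(
    [ 'http://example.com/about', '/contacts', 'http://other.org/' ],
    'http://example.com',
);
print "$_\n" for @$res;    # prints http://example.com/about and /contacts
```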
drop_asset_links
my $res = drop_asset_links($links)
Keep only links that do not point to assets. Query params are removed as well
$links must be an ARRAYREF; an ARRAYREF is also returned
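A rough sketch of extension-based filtering. drop_assets is a hypothetical name and the extension list is illustrative only; the real default list lives in $Email::Extractor::Utils::Assets and may differ.

```perl
use strict;
use warnings;

# Illustrative extension list only (an assumption, not the module's default).
my @asset_ext = qw(css js png jpg jpeg gif ico svg);

# Hypothetical helper: strip query params, then drop links whose
# path ends in one of the asset extensions.
sub drop_assets {
    my ($links) = @_;
    my $ext_re = join '|', map { quotemeta($_) } @asset_ext;
    my @kept;
    for my $link (@$links) {
        ( my $clean = $link ) =~ s/\?.*//s;       # remove query params first
        next if $clean =~ m{\.(?:$ext_re)\z}i;    # skip links ending in an asset extension
        push @kept, $clean;
    }
    return \@kept;
}

my $pages = drop_assets( [ '/style.css?v=2', '/contacts?ref=1', '/logo.PNG' ] );
print "$_\n" for @$pages;    # prints /contacts
```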
drop_anchor_links
my $res = drop_anchor_links($links)
Keep only links that are not anchors to the same page (an anchor link looks like #rec31047364)
$links must be an ARRAYREF; an ARRAYREF is also returned
remove_query_params
Remove GET query params from provided links array
my $res = remove_query_params($links)
$links must be an ARRAYREF; an ARRAYREF is also returned
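One way to express this, as a sketch: strip_query is a hypothetical name, and cutting everything from the first "?" (including any trailing fragment) is an assumption about the intended behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: return a new ARRAYREF with everything from
# the first "?" onwards removed from each link.
sub strip_query {
    my ($links) = @_;
    # s///r returns the modified copy and leaves the input untouched (Perl 5.14+)
    return [ map { s/\?.*//sr } @$links ];
}

my $clean = strip_query( [ 'http://example.com/a?x=1&y=2', '/b' ] );
print "$_\n" for @$clean;    # prints http://example.com/a and /b
```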
find_all_links
Find all links and return the href attributes of <a> tags
Return ARRAYREF
find_links_by_text
Find all <a> tags containing a particular text and return their href values
If no search text is specified, return all links
Currently not used in the Email::Extractor project because it has unexpected behaviour (see tests)
Return ARRAYREF
TO-DO: try to implement this method with HTML::LinkExtor
DESCRIPTION
Set of useful utilities that work with HTML and URLs
AUTHOR
Pavel Serikov <pavelsr@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2018 by Pavel Serikov.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.