NAME

HTML::Miner - This module 'mines' (hopefully) useful information from a URL or HTML snippet.

VERSION

Version 0.05

SYNOPSIS

HTML::Miner 'mines' (hopefully) useful information from a URL or HTML snippet. It can do the following:

Find all links and, for each link, extract:

  URL Title
  URL href
  URL Anchor Text
  URL Domain
  URL Protocol
  URL URI
  URL Absolute location

Find all images and, for each image, extract:

  IMG Source URL
  IMG Absolute Source URL
  IMG Source Domain

Extract meta elements such as:

  Page Title
  Page Description
  Page Keywords
  Page RSS Feeds

Find the final destination URL of a potentially redirecting URL.

Find all JS and CSS files used within the HTML and, if required, their absolute URLs.

Example ( Object Oriented Usage )

use HTML::Miner;

my $html = "some html";
# or $html = do{local $/;<DATA>}; with __DATA__ provided

my $html_miner = HTML::Miner->new ( 

  CURRENT_URL                   => 'www.perl.org'   , 
  CURRENT_URL_HTML              => $html 

);


my $meta_data =  $html_miner->get_meta_elements()   ;
my $links     = $html_miner->get_links()            ;
my $images    = $html_miner->get_images()           ;

my ( $clear_url, $protocol, $domain, $uri ) = $html_miner->break_url();  

my $css_and_js =  $html_miner->get_page_css_and_js() ;

my $destination = HTML::Miner::get_redirect_destination( "redirectingurl_here.html" ) ;

my $abs_url = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../about/" );

Example ( Direct access of Methods )

use HTML::Miner;

my $html = "some html";
# or $html = do{local $/;<DATA>}; with __DATA__ provided

my $url = "http://www.perl.org";

my $meta_data  = HTML::Miner::get_meta_elements( $url, $html ) ;
my $links      = HTML::Miner::get_links( $url, $html )         ;
my $images     = HTML::Miner::get_images( $url, $html )        ;

my ( $clear_url, $protocol, $domain, $uri ) = HTML::Miner::break_url( $url );  

my $css_and_js = HTML::Miner::get_page_css_and_js( 
       URL                       =>    $url   , 
       HTML                      =>    $html  ,   # optional
       CONVERT_URLS_TO_ABS       =>    1      ,   # optional, 0 or 1; default is 1
);

my $destination = HTML::Miner::get_redirect_destination( "redirectingurl_here.html" ) ;

my $abs_url = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../about/" );

Testing HTML

__DATA__

  <html>
  <head>
      <title>SiteTitle</title>
      <meta name="description" content="desc of site" />
      <meta name="keywords"    content="kw1, kw2, kw3" />
      <link rel="alternate" type="application/atom+xml" title="Title" href="http://www.my_domain_to_mine.com/feed/atom/" />
      <link rel="alternate" type="application/rss+xml" title="Title" href="http://www.othersite.com/feed/" />
      <link rel="alternate" type="application/rdf+xml" title="Title" href="my_domain_to_mine.com/feed/" /> 
      <link rel="alternate" type="text/xml" title="Title" href="http://www.other.org/feed/rss/" />
      <script type="text/javascript" src="http://static.myjsdomain.com/frameworks/barlesque.js"></script>
      <script type="text/javascript" src="http://js.revsci.net/gateway/gw.js?csid=J08781"></script>
      <script type="text/javascript" src="/about/other.js"></script>
      <link rel="stylesheet" type="text/css" href="http://static.mycssdomain.com/frameworks/style/main.css"  />
  </head>
  <body>
  
  <a href="http://linkone.com">Link1</a>
  <a href="link2.html" TITLE="title2" >Link2</a>
  <a href="/link3">Link3</a>
  
  
  <img src="http://my_domain_to_mine.com/logo_plain.jpg" >
  <img alt="image2" src="http://my_domain_to_mine.com/image2.jpg" />
  <img src="http://my_other.com/image3.jpg" alt="link3">
  <img src="image3.jpg" alt="link3">
  
  
  </body>
  </html>

Example Output:

my $meta_data =  $html_miner->get_meta_elements() ;

# $meta_data->{ TITLE }             =>   "SiteTitle"
# $meta_data->{ DESC }              =>   "desc of site"
# $meta_data->{ KEYWORDS }->[0]     =>   "kw1"
# $meta_data->{ RSS }->[0]->{TYPE}  =>   "application/atom+xml"



my $links = $html_miner->get_links();

# $links->[0]->{ DOMAIN }         =>   "linkone.com"
# $links->[0]->{ ANCHOR }         =>   "Link1"
# $links->[2]->{ ABS_URL   }      =>   "http://my_domain_to_mine.com/link3"
# $links->[1]->{ DOMAIN_IS_BASE } =>   1
# $links->[1]->{ TITLE }          =>   "title2"



my $images = $html_miner->get_images();

# $images->[0]->{ IMG_LOC }     =>  "http://my_domain_to_mine.com/logo_plain.jpg"
# $images->[2]->{ ALT }         =>  "link3"
# $images->[0]->{ IMG_DOMAIN }  =>  "my_domain_to_mine.com"
# $images->[3]->{ ABS_LOC }     =>  "http://my_domain_to_mine.com/image3.jpg"



my $css_and_js =  $html_miner->get_page_css_and_js(
     CONVERT_URLS_TO_ABS       =>    0
);

# $css_and_js will contain:
#    {
#      CSS => [
#          "http://static.mycssdomain.com/frameworks/style/main.css",
#          "/rel_cssfile.css",
#        ],
#      JS  => [
#          "http://static.myjsdomain.com/frameworks/barlesque.js",
#          "http://js.revsci.net/gateway/gw.js?csid=J08781",
#          "/about/rel_jsfile.js",
#        ],
#    }


my $css_and_js =  $html_miner->get_page_css_and_js(
     CONVERT_URLS_TO_ABS       =>    1
);

# $css_and_js will contain:
#    {
#      CSS => [
#          "http://static.mycssdomain.com/frameworks/style/main.css",
#          "http://www.perl.org/rel_cssfile.css",
#        ],
#      JS  => [
#          "http://static.myjsdomain.com/frameworks/barlesque.js",
#          "http://js.revsci.net/gateway/gw.js?csid=J08781",
#          "http://www.perl.org/about/rel_jsfile.js",
#        ],
#    }



my ( $clear_url, $protocol, $domain, $uri ) = $html_miner->break_url();  

# $clear_url   =>  "http://my_domain_to_mine.com/my_page_to_mine.pl"
# $protocol    =>  "http"
# $domain      =>  "my_domain_to_mine.com"
# $uri         =>  "/my_page_to_mine.pl"


my $destination = HTML::Miner::get_redirect_destination( "redirectingurl_here.html" );
# $destination    => 'redirected_to'



my $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../about/" );
# $out    => "http://www.perl.com/about/"

$out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/index.html", "index2.html" );
# $out    => "http://www.perl.com/help/faq/index2.html"

$out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../index.html" );
# $out    => "http://www.perl.com/index.html"

$out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "/about/" );
# $out    => "http://www.perl.com/about/"

$out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "http://othersite.com" );
# $out    => "http://othersite.com/"

EXPORT

This module does not export anything through @EXPORT. It does, however, export the following functions on request through @EXPORT_OK:

get_links
get_absolute_url
break_url
get_redirect_destination
get_images
get_meta_elements
get_page_css_and_js

FUNCTIONS

Unless noted otherwise in its own section, each of the following functions is available both directly and through the HTML::Miner object.

Constructor new

The constructor validates the input data and retrieves the HTML of the URL if the HTML is not provided.

The constructor takes the following parameters:

my $foo = HTML::Miner->new ( 

    ## Required. new() will croak if this is not provided. 
    CURRENT_URL                   => 'www.site_i_am_crawling.com/page_i_am_crawling.html'   , 
    ## Optional. The HTML is retrieved from CURRENT_URL if this is not provided. 
    CURRENT_URL_HTML              => 'long string here'                                     ,  
    ## Optional. The default is used if this is not provided. 
    USER_AGENT                    => 'Perl_HTML_Miner/$VERSION'                             ,  
    ## Optional. The default is used if this is not provided. 
    TIMEOUT                       => 5                                                      ,

    DEBUG                         => 0                                                       

);

get_links

This function extracts all URLs from a web page.

Syntax:

  When called on an HTML::Miner object :

         $return_element = $html_miner->get_links();

  When called directly                 :

         $return_element = get_links( $url, $optionally_html_of_url );

  The direct call is intended to be a simplified version of the OO call 
      and so does not allow customization of the user agent and so on.

Output:

This function (regardless of how it is called) returns a reference to an array of hashes whose structure is as follows:

$->Array( 
   Hash->{ 
       "URL"             => "extracted url"                        ,
       "ABS_EXISTS"      => "0_if_abs_url_extraction_failed"       , 
       "ABS_URL"         => "absolute_location_of_extracted_url"   ,
       "TITLE"           => "title_of_this_url"                    , 
       "ANCHOR"          => "anchor_text_of_this_url"              ,
       "DOMAIN"          => "domain_of_this_url"                   ,
       "DOMAIN_IS_BASE"  => "1_if_this_domain_same_as_base_domain" ,
       "PROTOCOL"        => "protocol_of_this_domain"              ,
       "URI"             => "URI_of_this_url"                      ,
   }, 
     ... 
)

So, to access the title of the second URL found (the order is maintained), you would use:

$return_element->[1]->{ TITLE }

NOTE:

If ABS_EXISTS is 0, then DOMAIN, DOMAIN_IS_BASE, PROTOCOL and URI will be undefined.

What if I want to extract URLs from an HTML snippet and don't care about the URL of that page?

Simply pass any placeholder string as the URL and ignore everything except
   URL, TITLE and ANCHOR.
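
As a minimal sketch of working with this return value, the snippet below walks a hand-built array of hashes shaped like the structure documented above. The sample data and the helper name external_links are illustrative assumptions and are not part of HTML::Miner; in real use $links would come from get_links.

```perl
use strict;
use warnings;

# Hand-built sample shaped like the documented get_links() return value.
# All values are made up for illustration.
my $links = [
    {
        URL      => 'http://linkone.com',  ABS_EXISTS     => 1,
        ABS_URL  => 'http://linkone.com/', TITLE          => '',
        ANCHOR   => 'Link1',               DOMAIN         => 'linkone.com',
        PROTOCOL => 'http',                DOMAIN_IS_BASE => 0,
        URI      => '/',
    },
    {
        URL      => 'link2.html',          ABS_EXISTS     => 1,
        ABS_URL  => 'http://www.example.com/link2.html',
        TITLE    => 'title2',              ANCHOR         => 'Link2',
        DOMAIN   => 'www.example.com',     DOMAIN_IS_BASE => 1,
        PROTOCOL => 'http',                URI            => '/link2.html',
    },
    # An entry whose absolute URL could not be built: only URL, TITLE
    # and ANCHOR are meaningful here.
    { URL => 'unresolvable', ABS_EXISTS => 0, TITLE => '', ANCHOR => 'Link3' },
];

# Hypothetical helper: anchor text and absolute URL of every link that
# resolved and points away from the base domain.
sub external_links {
    my ($links) = @_;
    return [
        map  { +{ ANCHOR => $_->{ANCHOR}, ABS_URL => $_->{ABS_URL} } }
        grep { $_->{ABS_EXISTS} && !$_->{DOMAIN_IS_BASE} }
        @$links
    ];
}

my $external = external_links($links);
printf "%s -> %s\n", $_->{ANCHOR}, $_->{ABS_URL} for @$external;
```

Checking ABS_EXISTS before reading DOMAIN_IS_BASE follows the note above: the domain fields are undefined whenever the absolute URL could not be built.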

get_page_css_and_js

This function extracts all CSS style sheets and JS script files used on a web page.

Syntax:

  When called on an HTML::Miner object :

         $return_element = $html_miner->get_page_css_and_js(
              CONVERT_URLS_TO_ABS       =>    1      # optional, 0 or 1; default is 1
         );

  When called directly                 :

         $return_element = get_page_css_and_js( 
              URL                       =>    $url                     , 
              HTML                      =>    $optionally_html_of_url  ,   
              CONVERT_URLS_TO_ABS       =>    1                        ,   # optional, 0 or 1; default is 1
         );

  The direct call is intended to be a simplified version of the OO call 
      and so does not allow customization of the user agent and so on.

Output:

This function (regardless of how it is called) returns a reference to a hash with the keys CSS and JS, each holding an array of the extracted URLs:

$->HASH->{ 
      "CSS"   => Array( "extracted url1", "extracted url2", .. ),
      "JS"    => Array( "extracted url1", "extracted url2", .. ),
  }

So, to access the URL of the second CSS style sheet found (again, the order is maintained), you would use:

$$return_element{ "CSS" }[1];

or, equivalently:

my @css_data = @{ $return_element->{ "CSS" } }   ;
my $second_css_url_found = $css_data[1]          ;

What if I want to extract CSS and JS links from an HTML snippet and don't care about the URL of that page?

Simply set CONVERT_URLS_TO_ABS to 0 and the URLs are returned as they appear in the HTML.
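
The following sketch shows one way to walk the returned hash of arrays. The sample data and the helper name count_relative are assumptions made for illustration; they are not part of HTML::Miner, and the sample mimics a call made with CONVERT_URLS_TO_ABS set to 0.

```perl
use strict;
use warnings;

# Hand-built sample shaped like the documented return value.
# All values are made up for illustration.
my $css_and_js = {
    CSS => [ 'http://static.example.com/main.css', '/rel_cssfile.css' ],
    JS  => [ 'http://static.example.com/app.js',   '/about/rel_jsfile.js' ],
};

# Hypothetical helper: for each key (CSS, JS), count the entries that
# do not start with http:// or https://.
sub count_relative {
    my ($assets) = @_;
    my %count;
    for my $type ( sort keys %$assets ) {
        # grep in scalar context returns the number of matches
        $count{$type} = grep { !m{^https?://}i } @{ $assets->{$type} };
    }
    return \%count;
}

# Arrow syntax for the second CSS entry
my $second_css = $css_and_js->{CSS}[1];
print "Second CSS: $second_css\n";

my $relative = count_relative($css_and_js);
print "$_ relative: $relative->{$_}\n" for sort keys %$relative;
```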

get_absolute_url

This function takes two arguments: the base URL of the page within whose HTML a second (possibly relative) URL was found, and that second URL. It returns the absolute location of the second URL.

Example:

my $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../about/" );

Will return:

      http://www.perl.com/about/

NOTE:

This function cannot be called on the HTML::Miner object. 
The function get_links does this for all URLs found on a webpage.
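
To make the resolution rules concrete, here is a rough pure-Perl sketch of how a relative URL can be resolved against a base. It is not HTML::Miner's implementation: the function name resolve_url is hypothetical, it assumes http when the base has no scheme, it ignores query strings and fragments, and it returns an already absolute URL unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT HTML::Miner's code: resolve $rel against $base.
sub resolve_url {
    my ( $base, $rel ) = @_;

    # An already absolute URL is returned unchanged.
    return $rel if $rel =~ m{^[a-z][a-z0-9+.-]*://}i;

    # Assumption: a base without a scheme is treated as http.
    $base = "http://$base" unless $base =~ m{^[a-z][a-z0-9+.-]*://}i;
    my ( $root, $path ) = $base =~ m{^([a-z][a-z0-9+.-]*://[^/]+)(/.*)?$}i;
    $path = '/' unless defined $path;

    my $joined;
    if ( $rel =~ m{^/} ) {
        # Root-relative: replaces the whole path of the base.
        $joined = $rel;
    }
    else {
        # Otherwise drop the file name of the base, keep its directory.
        ( my $dir = $path ) =~ s{[^/]*$}{};
        $joined = $dir . $rel;
    }

    # Collapse "." and ".." segments; never climb above the root.
    my @out;
    for my $seg ( split m{/}, $joined, -1 ) {
        if    ( $seg eq '..' ) { pop @out if @out > 1 }
        elsif ( $seg eq '.' )  { next }
        else                   { push @out, $seg }
    }
    return $root . join( '/', @out );
}

print resolve_url( 'www.perl.com/help/faq/', '../../about/' ), "\n";
# prints http://www.perl.com/about/
print resolve_url( 'www.perl.com/help/faq/index.html', 'index2.html' ), "\n";
# prints http://www.perl.com/help/faq/index2.html
```

For production use, a dedicated URL library is a safer choice than a hand-rolled resolver like this one.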

break_url

This function, given a URL, returns the domain, protocol, URI and the input URL in its 'standard' form.

It is called on the HTML::Miner object as follows:

my ( $clear_url, $protocol, $domain, $uri ) = $html_miner->break_url();

NOTE: This will return the details of the 'CURRENT_URL'.

It is called directly as follows:

my ( $clear_url, $protocol, $domain, $uri ) = HTML::Miner::break_url( 'www.perl.org/help/faq/' );

Example returned Values:

 Input

      www.perl.org/help/faq

 Output
   
      clean_url --> http://www.perl.org/help/faq/
      protocol  --> http
      domain    --> www.perl.org
      uri       --> help/faq/

get_redirect_destination

This function takes a URL that may redirect to another URL, possibly through a chain of several redirects, and returns the final destination URL.

This function requires web access each time it is called.

This function cannot be called on the HTML::Miner object.

Example:

my $destination_url = HTML::Miner::get_redirect_destination( 
   'http://rss.cnn.com/~r/rss/edition_world/~3/403863461/index.html' , 
   'optional_user_agent',
   'optional_timeout'
);

$destination_url will contain:

   "http://edition.cnn.com/2008/WORLD/americas/09/26/russia.chavez/index.html?eref=edition_world"

get_images

This function extracts all images from a web page.

Syntax:

  When called on an HTML::Miner object :

         $return_element = $html_miner->get_images();

  When called directly                 :

         $return_element = get_images( $url, $optionally_html_of_url );

  The direct call is intended to be a simplified version of the OO call 
      and so does not allow customization of the user agent and so on.

Output:

This function (regardless of how it is called) returns a reference to an array of hashes whose structure is as follows:

$->Array( 
   Hash->{ 
       "IMG_LOC"         => "extracted_image"                        ,
       "ALT"             => "alt_text_of_this_image"                 ,
       "ABS_EXISTS"      => "0_if_abs_url_extraction_failed"         , 
       "ABS_LOC"         => "absolute_location_of_extracted_image"   ,
       "IMG_DOMAIN"      => "domain_of_this_image"                   ,
       "DOMAIN_IS_BASE"  => "1_if_this_domain_same_as_base_domain"   ,
   }, 
     ...
)

So, to access the alt text of the second image found (the order is maintained), you would use:

$return_element->[1]->{ ALT }

NOTE:

If ABS_EXISTS is 0, then IMG_DOMAIN and DOMAIN_IS_BASE will be undefined.

What if I want to extract images from an HTML snippet and don't care about the URL of that page?

Simply pass any placeholder string as the URL and ignore the absolute locations and, of course, the domains; IMG_LOC and ALT are still usable.
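
As an illustration of the structure above, the sketch below groups image locations by domain while skipping entries whose absolute location could not be determined. The sample data and the helper name images_by_domain are assumptions for this example and are not part of HTML::Miner.

```perl
use strict;
use warnings;

# Hand-built sample shaped like the documented get_images() return value.
# All values are made up for illustration.
my $images = [
    {
        IMG_LOC    => 'http://www.example.com/logo.jpg', ALT => '',
        ABS_EXISTS => 1, ABS_LOC => 'http://www.example.com/logo.jpg',
        IMG_DOMAIN => 'www.example.com', DOMAIN_IS_BASE => 1,
    },
    {
        IMG_LOC    => 'http://cdn.example.net/banner.jpg', ALT => 'banner',
        ABS_EXISTS => 1, ABS_LOC => 'http://cdn.example.net/banner.jpg',
        IMG_DOMAIN => 'cdn.example.net', DOMAIN_IS_BASE => 0,
    },
    # Absolute location unavailable: only IMG_LOC and ALT are meaningful.
    { IMG_LOC => 'image3.jpg', ALT => 'photo', ABS_EXISTS => 0 },
];

# Hypothetical helper: map each image domain to the absolute locations
# of its images.
sub images_by_domain {
    my ($images) = @_;
    my %by_domain;
    for my $img (@$images) {
        next unless $img->{ABS_EXISTS};    # IMG_DOMAIN is undefined otherwise
        push @{ $by_domain{ $img->{IMG_DOMAIN} } }, $img->{ABS_LOC};
    }
    return \%by_domain;
}

my $grouped = images_by_domain($images);
for my $domain ( sort keys %$grouped ) {
    print "$domain: ", scalar @{ $grouped->{$domain} }, " image(s)\n";
}

# Alt text of the second image, as described above
print "Second alt text: $images->[1]->{ALT}\n";
```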

get_meta_elements

This function retrieves the following meta elements for a given URL (or HTML snippet):

Page Title
Meta Description
Meta Keywords
Page RSS Feeds

It is called through the HTML::Miner object as follows:

$return_hash = $html_miner->get_meta_elements( );

It is called directly as follows:

$return_hash = HTML::Miner::get_meta_elements( 
                                URL   => "url_of_page"  ,
                                HTML  => "html_of_page"
                            );

Note: The above function requires either the HTML or the URL. If the 
      HTML is not provided then the URL is used to retrieve the HTML.
      If neither is provided this function will croak.

      Again, this function does not allow customization of the user agent
      and timeout when called directly.

In either case the returned hash reference has the following structure:

$return_hash = { 
           TITLE     =>   'title_of_page'         ,
           DESC      =>   'description_of_page'   ,
           KEYWORDS  =>   
                'reference to an array of keywords' ,
           RSS       => 
                'reference to an array of hashes of RSS links', as below
 }


$return_hash->{ RSS } = [
         {
           TYPE      => 'eg: application/atom+xml',
           TITLE     => 'Title of this RSS Feed'  ,
           URL       => 'URL of this RSS Feed'
         },
             ...
]
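
A short sketch of reading this structure follows. The sample data and the helper name feeds_of_type are illustrative assumptions, not part of HTML::Miner; in real use the hash reference would come from get_meta_elements.

```perl
use strict;
use warnings;

# Hand-built sample shaped like the documented return value.
# All values are made up for illustration.
my $meta = {
    TITLE    => 'SiteTitle',
    DESC     => 'desc of site',
    KEYWORDS => [ 'kw1', 'kw2', 'kw3' ],
    RSS      => [
        {
            TYPE  => 'application/atom+xml',
            TITLE => 'Atom feed',
            URL   => 'http://www.example.com/feed/atom/',
        },
        {
            TYPE  => 'application/rss+xml',
            TITLE => 'RSS feed',
            URL   => 'http://www.example.org/feed/',
        },
    ],
};

# Hypothetical helper: URLs of all feeds with the given MIME type.
sub feeds_of_type {
    my ( $meta, $type ) = @_;
    return [
        map  { $_->{URL} }
        grep { $_->{TYPE} eq $type } @{ $meta->{RSS} || [] }
    ];
}

print "Title: $meta->{TITLE}\n";
print "Keywords: ", join( ', ', @{ $meta->{KEYWORDS} } ), "\n";

my $atom = feeds_of_type( $meta, 'application/atom+xml' );
print "Atom feed: $_\n" for @$atom;
```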

_get_url_html

This is an internal function and is not to be used externally.

_convert_to_valid_url

This is an internal function and is not to be used externally.

AUTHOR

Harish T Madabushi, <harish.tmh at gmail.com>

BUGS

Please report any bugs or feature requests to bug-html-miner at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Miner. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc HTML::Miner

You can also look for information at:

RT: CPAN's request tracker

http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-Miner

AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/HTML-Miner

CPAN Ratings

http://cpanratings.perl.org/d/HTML-Miner

Search CPAN

http://search.cpan.org/dist/HTML-Miner/

ACKNOWLEDGEMENTS

Thanks to user ultranerds from http://perlmonks.org/?node_id=721567 for suggesting and helping with JS and CSS extraction.

COPYRIGHT & LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

