NAME
Web::PageMeta - get page open-graph / meta data
SYNOPSIS
Async fetch of previews and images:

    use feature 'say';
    use Future;
    use Future::Utils qw( fmap_void );
    use Web::PageMeta;

    my @urls = qw(
    );
    my @page_views = map { Web::PageMeta->new(url => $_) } @urls;
    Future->wait_all(map { $_->fetch_image_data_ft } @page_views)->get;
    foreach my $pv (@page_views) {
        say 'title> ' . $pv->title;
        say 'img_size> ' . length($pv->image_data);
    }

    # alternatively, instead of Future->wait_all()
    fmap_void(
        sub { return $_[0]->fetch_image_data_ft },
        foreach    => [@page_views],
        concurrent => 3
    )->get;
DESCRIPTION
Fetches web page meta data, including but not limited to open-graph tags. It can be used in both synchronous and async code.
If a download returns any HTTP status code other than 200, an HTTP::Exception is thrown.
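A minimal sketch of catching that exception in synchronous code. The URL and the helper name describe_fetch_error are hypothetical, and the sketch assumes that the thrown object provides the code() and status_message() methods of HTTP::Exception objects.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use Web::PageMeta;

# Hypothetical helper: turn whatever eval caught into a one-line message.
sub describe_fetch_error {
    my ($err) = @_;

    # Duck-typed check; assumes HTTP::Exception objects offer
    # code() and status_message().
    if (blessed($err) && $err->can('code') && $err->can('status_message')) {
        return sprintf('HTTP %s %s', $err->code, $err->status_message);
    }
    return "non-HTTP error: $err";
}

# Placeholder URL, chosen only for illustration.
my $wmeta = Web::PageMeta->new(url => 'https://example.com/no-such-page');

my $title;
my $ok = eval { $title = $wmeta->title; 1 };
if ($ok) {
    print 'title> ', $title // '', "\n";
}
else {
    warn describe_fetch_error($@), "\n";
}
```

Capturing the success flag from eval, rather than testing $@ directly, avoids misreading an exception object that evaluates to false.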
ACCESSORS
new
Constructor, only "url" is required.
url
HTTP url to fetch data from.
timeout
In addition to the AnyEvent::HTTP timeout, the elapsed time is also checked while the data is being downloaded, and the fetch dies once the limit is exceeded. Default is 5 minutes.
max_size
Dies when the document or image size exceeds this limit. Default is 100MB.
user_agent
User-Agent header to use for HTTP requests. Defaults to the one sent by Chrome 89.0.4389.90.
extra_headers
HashRef with extra http request headers.
cookie_jar
Accepts an optional HTTP::Cookies compatible object that must provide a get_cookies() method. If set, HTTP cookie headers are sent with each request.
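The constructor options above can be combined in a single call. The sketch below uses a placeholder URL and made-up values, and it assumes that "timeout" is given in seconds and "max_size" in bytes, which the descriptions above do not state.

```perl
use strict;
use warnings;
use Web::PageMeta;

my $wmeta = Web::PageMeta->new(
    url           => 'https://example.com/',          # placeholder URL
    timeout       => 30,                              # assumed unit: seconds
    max_size      => 5 * 1024 * 1024,                 # assumed unit: bytes
    user_agent    => 'ExamplePreviewBot/0.1',         # hypothetical agent string
    extra_headers => { 'Accept-Language' => 'en' },   # extra request headers
);

# Print the configured URL to confirm the object was built.
print $wmeta->url, "\n";
```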
title
Returns title of the page.
description
Returns description of the page.
canonical_url
Returns the open-graph url. If it is not present, returns "url".
image
Returns image location of the page.
image_data
Returns image binary data of "image" link.
Throws a 404 exception if there is no "image" link.
page_meta
Returns hash ref with all open-graph data.
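In synchronous code the accessors above can be called one after another. A rough sketch with a placeholder URL; which keys show up in "page_meta" depends on the page being fetched.

```perl
use strict;
use warnings;
use feature 'say';
use Web::PageMeta;

# Placeholder URL, chosen only for illustration.
my $wmeta = Web::PageMeta->new(url => 'https://example.com/');

# Each accessor may be undefined for a given page, hence the // '' fallbacks.
say 'title> ',       $wmeta->title         // '';
say 'description> ', $wmeta->description   // '';
say 'canonical> ',   $wmeta->canonical_url // '';
say 'image> ',       $wmeta->image         // '';

# page_meta is documented to return a hash ref; list whatever it holds.
my $meta = $wmeta->page_meta;
say "$_ => ", $meta->{$_} // '' for sort keys %$meta;
```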
extra_scraper
Web::Scraper::LibXML object used to fetch the image, title or description from a location other than the default one.

    use Web::Scraper::LibXML;
    use Web::PageMeta;

    my $escraper = scraper {
        process_first '.slider .camera_wrap div', 'image' => '@data-src';
    };
    my $wmeta = Web::PageMeta->new(
        # "url" is required as well; it is not shown here
        extra_scraper => $escraper,
    );
page_body_hdr
Returns an array ref with the page [$body, $headers]. This can be useful for post-processing or for special/additional data extraction.
Only the text/html content-type is accepted for fetching.
fetch_page_meta_ft
Returns a future object for fetching page meta data. See "ASYNC USE". On done, the "page_meta" hash is returned.
fetch_image_data_ft
Returns a future object for fetching image data. See "ASYNC USE". On done, the "image_data" scalar is returned.
fetch_page_body_hdr_ft
Returns a future object for fetching page content and headers. See "ASYNC USE". On done, the "page_body_hdr" array ref is returned.
ASYNC USE
To run multiple page meta data or image HTTP requests in parallel, or to use the module in async programs, use "fetch_page_meta_ft" and "fetch_image_data_ft", which return Future objects. See "SYNOPSIS" or t/02_async.t for sample use.
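A sketch of fetching meta data for several pages in parallel with "fetch_page_meta_ft" and reporting failures per page. The URLs are placeholders, and the sketch assumes, as in "SYNOPSIS", that calling get on the returned futures blocks until they are ready.

```perl
use strict;
use warnings;
use feature 'say';
use Future;
use Web::PageMeta;

# Placeholder URLs, chosen only for illustration.
my @urls    = ('https://example.com/', 'https://example.org/');
my @pages   = map { Web::PageMeta->new(url => $_) } @urls;
my @futures = map { $_->fetch_page_meta_ft } @pages;

# wait_all becomes ready once every future is done or failed;
# unlike needs_all, it does not fail when one of them fails.
Future->wait_all(@futures)->get;

for my $i (0 .. $#pages) {
    my $f = $futures[$i];
    if ($f->is_done) {
        # The data is already fetched, so this accessor does not block.
        my $meta = $pages[$i]->page_meta;
        say $urls[$i], ' => ', join(', ', sort keys %$meta);
    }
    else {
        # In scalar context failure() returns just the exception.
        my $why = $f->failure // 'cancelled';
        say $urls[$i], ' failed: ', $why;
    }
}
```

Using wait_all rather than needs_all keeps one failing page from hiding the results of the others.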
SEE ALSO
AUTHOR
Jozef Kutej, <jkutej at cpan.org>
LICENSE AND COPYRIGHT
Copyright 2021 jkutej@cpan.org
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.