NAME
TaskPipe::Task_Scrape - Base TaskPipe class for scraping a webpage
DESCRIPTION
This is the standard building block for creating a webpage-scraping task. To create one, inherit from TaskPipe::Task_Scrape using the following package format:
package TaskPipe::Task_Scrape_MyScraper;
use Moose;
use Web::Scraper;
extends 'TaskPipe::Task_Scrape';
has test_pinterp => (is => 'ro', isa => 'ArrayRef[HashRef]', default => sub{[
    {
        url => 'https://www.example.com/some-test-url',
        headers => {
            Referer => 'https://www.example.com/some-referer-url'
        }
    }
]});
has ws => (is => 'ro', isa => 'Web::Scraper', default => sub{
    scraper {
        process 'div.some-class', 'results' => 'TEXT';
        result 'results';
    }
});
sub post_process { # may or may not be necessary, depending
                   # on what is returned by ws
    my ($self,$results) = @_;
    # do something with the results returned from the web scraper
    return $results;
}
test_pinterp allows you to specify test data, which you can run the task against by typing
taskpipe test task --name=Scrape_MyScraper
at the command line.
It is assumed that you want to use Web::Scraper to scrape your page. If so, just define a ws attribute as above. See the Web::Scraper manpage for more information on how to define a scraper.
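As a rough illustration of a ws definition, the sketch below uses a nested scraper so that each matched element becomes a hashref and the overall result is an arrayref. The HTML, the selectors li.item, h2 and span.price, and the title and price keys are all assumptions made for the example; none of them come from TaskPipe. It runs standalone, outside a task class, so the scraper is held in a plain variable rather than a ws attribute.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup: the class names and fields are illustrative only.
my $html = <<'HTML';
<ul>
  <li class="item"><h2>First</h2><span class="price">10</span></li>
  <li class="item"><h2>Second</h2><span class="price">20</span></li>
</ul>
HTML

# 'results[]' collects one entry per matching node; the nested scraper
# turns each entry into a hashref with 'title' and 'price' keys.
my $ws = scraper {
    process 'li.item', 'results[]' => scraper {
        process 'h2',         title => 'TEXT';
        process 'span.price', price => 'TEXT';
    };
    result 'results';
};

# scrape() accepts an HTML string; it returns undef here if nothing matched.
my $results = $ws->scrape($html);

printf "%s: %s\n", $_->{title}, $_->{price} for @{ $results || [] };
```

In a task class, the same scraper { ... } block would presumably go inside the default sub of the ws attribute.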
Your task needs to return an arrayref of results, each result being a hashref. Ideally ws returns this directly, but sometimes Web::Scraper cannot be made to return results in this format. To correct the format (for example, to remove records from the data), include a post_process subroutine. post_process receives the output from ws; do what is needed, and make sure you return your results arrayref at the end.
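The kind of correction post_process might perform can be sketched with a small standalone helper. This is only an assumption about what a typical cleanup looks like: normalize_results is a hypothetical function name, and the text key is an arbitrary choice for wrapping plain strings. It uses core Perl only, so it can be tried without TaskPipe installed.

```perl
use strict;
use warnings;

# Hypothetical helper: coerce whatever the scraper produced into an
# arrayref of hashrefs. The key name 'text' is an illustrative choice.
sub normalize_results {
    my ($results) = @_;

    # Nothing matched: return an empty list rather than undef.
    return [] unless defined $results;

    # A single value becomes a one-element list.
    $results = [$results] unless ref $results eq 'ARRAY';

    my @records;
    for my $item (@$results) {
        next unless defined $item;

        # Records that are already hashrefs are kept as they are.
        if (ref $item eq 'HASH') {
            push @records, $item;
            next;
        }

        # Other reference types are dropped in this sketch.
        next if ref $item;

        # Trim whitespace from plain strings and skip empty ones.
        (my $text = $item) =~ s/^\s+|\s+$//g;
        next unless length $text;
        push @records, { text => $text };
    }
    return \@records;
}

# Inside a task this could be called from post_process:
#   sub post_process {
#       my ($self, $results) = @_;
#       return normalize_results($results);
#   }

my $records = normalize_results([ ' first ', '', undef, { text => 'kept' } ]);
print scalar(@$records), " records\n";
```

Keeping the cleanup in a plain function makes it easy to exercise against sample data before wiring it into a task.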
AUTHOR
Tom Gracey <tomgracey@gmail.com>
COPYRIGHT AND LICENSE
Copyright (c) Tom Gracey 2018
TaskPipe is free software, licensed under
The GNU Public License Version 3