NAME

TaskPipe::Task_Scrape - Base TaskPipe class for scraping a webpage

DESCRIPTION

This is the standard building block for creating a webpage-scraping task. To use it, inherit from TaskPipe::Task_Scrape using the following package format:

   package TaskPipe::Task_Scrape_MyScraper;

   use Moose;
   use Web::Scraper;
   extends 'TaskPipe::Task_Scrape';

   has test_pinterp => (is => 'ro', isa => 'ArrayRef[HashRef]', default => sub{[

       {
           url => 'https://www.example.com/some-test-url',
           headers => {
               Referer => 'https://www.example.com/some-referer-url'
           }
       }
   
   ]});


   has ws => (is => 'ro', isa => 'Web::Scraper', default => sub{
       scraper {
           process 'div.some-class', 'results' => 'TEXT';
           result 'results'
       }
   });

   sub post_process {  # may or may not be necessary, depending
                       # on what is returned by ws

       my ($self,$results) = @_;

       # do something with the results returned from the web scraper

       return $results;
   }

test_pinterp allows you to specify test data which you can run the task against by typing

taskpipe test task --name=Scrape_MyScraper

at the command line.

The task assumes you will use a Web::Scraper object to scrape your page. If so, define a ws attribute as above. See the Web::Scraper manpage for more information on how to define a Web::Scraper.
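As a rough illustration of a scraper definition, the sketch below builds a Web::Scraper that collects one hashref per matched element and runs it against an inline HTML string outside of TaskPipe. The HTML, the CSS selectors, and the field names (title, price) are made up for this example. In a real task the same scraper block would sit inside the default sub of the ws attribute.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical scraper: one hashref per div.item, gathered into 'results'
my $ws = scraper {
    process 'div.item', 'results[]' => scraper {
        process 'h2',         title => 'TEXT';
        process 'span.price', price => 'TEXT';
    };
    result 'results';
};

# Made-up test page, used here instead of a live URL
my $html = <<'HTML';
<html><body>
<div class="item"><h2>First</h2><span class="price">10</span></div>
<div class="item"><h2>Second</h2><span class="price">20</span></div>
</body></html>
HTML

# scrape() accepts an HTML string and returns the 'results' arrayref
my $results = $ws->scrape($html);

printf "%s costs %s\n", $_->{title}, $_->{price} for @$results;
```

The 'results[]' key with a nested scraper is what makes Web::Scraper build a list of hashrefs rather than a single value.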

Your task must return an arrayref of results, each result being a hashref. Ideally ws returns this directly. When Web::Scraper cannot be persuaded to return results in this format, include a post_process subroutine to make format corrections (remove records from the data, etc.). post_process receives the output from ws. Do what is needed, and make sure you return your results arrayref at the end.
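A minimal sketch of such a post_process follows, written as a plain subroutine so it can be tried without TaskPipe. It assumes ws may hand back a single string (as the 'TEXT' example above would), a list of strings, or a list of hashrefs. The key name text is hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical post_process: coerce scraper output into an arrayref of hashrefs
sub post_process {
    my ($self, $results) = @_;

    # Normalise "nothing" and "single value" into an arrayref
    $results = []         unless defined $results;
    $results = [$results] unless ref($results) eq 'ARRAY';

    my @clean;
    for my $item (@$results) {
        # Wrap bare strings in a hashref; 'text' is an assumed key name
        my $rec = ref($item) eq 'HASH' ? $item : { text => $item };

        # Drop records whose text is missing or only whitespace
        next unless defined $rec->{text} && $rec->{text} =~ /\S/;

        # Trim surrounding whitespace without altering the input record
        (my $text = $rec->{text}) =~ s/^\s+|\s+$//g;
        push @clean, { %$rec, text => $text };
    }

    return \@clean;
}

# Called as a plain function here; inside a task it would be a method
my $out = post_process(undef, ['  first  ', '', { text => 'second' }]);
print "$_->{text}\n" for @$out;
```

Building new hashrefs instead of editing the incoming ones keeps the scraper output untouched, which can make a task easier to debug.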

AUTHOR

Tom Gracey <tomgracey@gmail.com>

COPYRIGHT AND LICENSE

TaskPipe is free software, licensed under

The GNU Public License Version 3
