NAME
Sport::Analytics::NHL::Scraper - Scrape and crawl the NHL website for data
SYNOPSIS
Scrape and crawl the NHL website for data
    use Sport::Analytics::NHL::Scraper;
    my $schedules = crawl_schedule({
        start_season => 2016,
        stop_season  => 2017
    });
    ...
    my $contents = crawl_game(
        { season => 2011, stage => 2, season_id => 1 }, # game 2011020001 in NHL accounting
        { game_files => [qw(BS PL)], retries => 2 },
    );
IMPORTANT VARIABLE
The @GAME_FILES variable contains the specific definitions for the report types. Right now only the boxscore JavaScript has any meaningful non-default definitions; the PB feed seems to have become unavailable.
FUNCTIONS
scrape
-
A wrapper around the LWP::Simple::get() call for retrying and control.
Arguments: hash reference containing
 * url => URL to access
 * retries => Number of retries
 * validate => sub reference to validate the download
Returns: the content if both the download and the validation are successful, undef otherwise.
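The retry-and-validate behaviour described above can be sketched as follows. This is a minimal illustration, not the module's actual code: get_with_retries and its fetch argument are hypothetical names, the default of 3 is an assumption, and retries is treated here as the total number of attempts. The fetcher is injectable so that the sketch can be tried without network access; when none is given it falls back to LWP::Simple::get.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Sport::Analytics::NHL::Scraper.
#   url      => URL to access
#   retries  => number of attempts (assumed default: 3)
#   validate => optional code ref; a true return value accepts the content
#   fetch    => optional code ref taking a URL and returning content or undef
#               (assumption: defaults to LWP::Simple::get, loaded on demand)
sub get_with_retries {
    my ($args) = @_;
    my $fetch = $args->{fetch} || do {
        require LWP::Simple;
        \&LWP::Simple::get;
    };
    my $retries  = $args->{retries}  // 3;
    my $validate = $args->{validate} || sub { 1 };

    for my $attempt (1 .. $retries) {
        my $content = $fetch->($args->{url});
        return $content if defined $content && $validate->($content);
    }
    return;    # undef in scalar context: every attempt failed
}

# Usage with a stub fetcher that fails twice and then succeeds.
my $tries  = 0;
my $result = get_with_retries({
    url      => 'http://example.invalid/schedule.json',
    retries  => 3,
    fetch    => sub { return ++$tries < 3 ? undef : '{"ok":1}' },
    validate => sub { $_[0] =~ /^\{/ },    # accept only JSON-object-like text
});
print defined $result ? "got content after $tries tries\n" : "gave up\n";
```

Keeping the validation inside the loop means that a download that succeeds but returns unusable content is retried in the same way as a failed download.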
crawl_schedule
-
Crawls the NHL schedule. The schedule is requested through a minimalistic live API first (this only works for post-2010 seasons), then through the general /api/
Arguments: hash reference containing
 * start_season => the first season to crawl
 * stop_season => the last season to crawl
Returns: hash reference of seasonal schedules where seasons are the keys, and decoded JSONs are the values.
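Since the return value is keyed by season, a caller will usually want to walk it in chronological order. The sketch below uses a stand-in hash instead of a live crawl; the inner structure of each value is invented for illustration, and seasons_in_order is a hypothetical helper, not a function of this module.

```perl
use strict;
use warnings;

# Stand-in for the value returned by crawl_schedule(): seasons are the keys.
# The inner structure is invented; the real decoded JSON depends on the
# endpoint that answered.
my $schedules = {
    2017 => { games => [ 'game C' ] },
    2016 => { games => [ 'game A', 'game B' ] },
};

# Hypothetical helper: return the season keys in chronological order.
sub seasons_in_order {
    my ($sched) = @_;
    return sort { $a <=> $b } keys %{$sched};
}

for my $season (seasons_in_order($schedules)) {
    my $decoded = $schedules->{$season};
    printf "season %d: decoded JSON is a %s reference\n",
        $season, ref $decoded;
}
```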
get_game_url_args
-
Sets the arguments to populate the game URL for a given report type and game.
Arguments:
 * document name, currently one of qw(BS PB RO ES GS PL)
 * game hashref containing
   - season => YYYY
   - stage => 2|3
   - season_id => NNNN
Returns: a configured list of arguments for the URL.
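The three fields of the game hashref map onto the 10-digit game number used in "NHL accounting" (2011020001 in the SYNOPSIS). The following sketch shows that composition only; nhl_game_id is a hypothetical helper and is not claimed to be how get_game_url_args builds its list.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: compose the 10-digit
# game number from season (YYYY), stage (2|3) and season_id (NNNN).
sub nhl_game_id {
    my ($game) = @_;
    return sprintf '%04d%02d%04d',
        $game->{season}, $game->{stage}, $game->{season_id};
}

my $id = nhl_game_id({ season => 2011, stage => 2, season_id => 1 });
print "$id\n";    # prints 2011020001
```

Passing season_id as a plain number and letting sprintf do the zero padding avoids writing literals with leading zeros, which Perl would read as octal.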
crawl_game
-
Crawls the data for the given game.
Arguments:
game data as hashref:
 * season => YYYY
 * stage => 2|3
 * season_id => NNNN
options hashref:
 * game_files => hashref of types of reports that are requested
 * force => 0|1 force overwrite of files already present in the system
 * retries => N number of retries for every get call
crawl_player
-
Crawls the data for an NHL player given his NHL id. First, the API call is made, and the JSON is retrieved. Unfortunately, the JSON does not contain the draft information, so another call to the HTML page is made to complete the information. The merged information is stored in a JSON file at the ROOT_DATA_DIR/players/$ID.json path.
Arguments:
 * player's NHL id
 * options hashref:
   - data_dir - root data dir location
   - playerfile_expiration - how long the saved playerfile should be trusted
   - force - crawl the player regardless
Returns: the path to the saved file
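The interplay of data_dir, playerfile_expiration and force can be sketched as a freshness check on the saved file. Both helpers below are hypothetical, and the sketch assumes that playerfile_expiration is expressed in seconds, which the description above does not state.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: build the ROOT_DATA_DIR/players/$ID.json path.
sub player_file_path {
    my ($data_dir, $id) = @_;
    return File::Spec->catfile($data_dir, 'players', "$id.json");
}

# Hypothetical helper: decide whether a player should be crawled again.
# Assumption: playerfile_expiration is a number of seconds.
sub player_needs_crawl {
    my ($file, $opts) = @_;
    return 1 if $opts->{force};     # force overrides any saved file
    return 1 unless -f $file;       # nothing has been saved yet
    my $mtime = (stat $file)[9];    # last modification time (epoch seconds)
    return (time() - $mtime) > $opts->{playerfile_expiration} ? 1 : 0;
}

# Example with an arbitrary id and a relative data directory.
my $file = player_file_path('data', 8471214);
print "would crawl player\n"
    if player_needs_crawl($file, { playerfile_expiration => 86400 });
```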
crawl_rotoworld_injuries
-
Crawls the RotoWorld.com injuries page to detect the injuries.
Arguments: none
Returns: a list of hashes, each containing a player name, the injury status, and the injury type.
crawl_injured_players
-
Currently only contains a call to crawl_rotoworld_injuries (q.v.).
AUTHOR
More Hockey Stats, <contact at morehockeystats.com>
BUGS
Please report any bugs or feature requests to contact at morehockeystats.com, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Sport::Analytics::NHL::Scraper. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Sport::Analytics::NHL::Scraper
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
https://rt.cpan.org/NoAuth/Bugs.html?Dist=Sport::Analytics::NHL::Scraper
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
https://cpanratings.perl.org/d/Sport::Analytics::NHL::Scraper
Search CPAN