NAME

WebFetch - Perl module to download/fetch and save information from the Web

VERSION

version 0.15.6

SYNOPSIS

use WebFetch;

DESCRIPTION

The WebFetch module is a framework for downloading information from the web, and for saving or re-displaying it. It provides a generalized interface for saving to a file while keeping the previous version as a backup. This is mainly intended for use in a cron job to acquire periodically updated information.

WebFetch allows the user to specify a source and destination, and the input and output formats. It is possible to write new Perl modules to the WebFetch API in order to add more input and output formats.

The currently-provided input formats are Atom, RSS, WebFetch "SiteNews" files and raw Perl data structures.

The currently-provided output formats are RSS, WebFetch "SiteNews" files, the Perl Template Toolkit, and export into a TWiki site.

Some modules which were specific to pre-RSS/Atom web syndication formats have been deprecated. Those modules can be found in the CPAN archive in WebFetch 0.10. Those modules are no longer compatible with changes in the current WebFetch API.

INSTALLATION

After unpacking the module sources from the tar file, run

perl Makefile.PL

make

make install

Or from a CPAN shell you can simply type "install WebFetch" and it will download, build and install it for you.

If you need help setting up a separate area to install the modules (i.e. if you don't have write permission where perl keeps its modules) then see the Perl FAQ.

To begin using the WebFetch modules, you will need to test your fetch operations manually, put them into a crontab, and then use server-side include (SSI) or a similar server configuration to include the files in a live web page.

MANUALLY TESTING A FETCH OPERATION

Select a directory which will be the storage area for files created by WebFetch. This is an important administrative decision - keep the volatile automatically-generated files in their own directory so they'll be separated from manually-maintained files.

Choose the specific WebFetch-derived modules that do the work you want. See their particular manual/web pages for details on command-line arguments. Test run them first before committing to a crontab.

SETTING UP CRONTAB ENTRIES

If needed, see the manual pages for crontab(1), crontab(5) and any web sites or books on Unix system administration.

Since WebFetch command lines are usually very long, the user may prefer to make one or more scripts as front-ends so crontab entries aren't so big.

Try not to run crontab entries too often - be aware if the site you're accessing has any resource constraints, and how often their information gets updated. If they request users not to access a feed more often than a certain interval, respect it. (It isn't hard to find violators in server logs.) If in doubt, try every 30 minutes until more information becomes available.

WebFetch FUNCTIONS AND METHODS

The following function definitions assume $obj is a blessed reference to a module that is derived from (inherits from) WebFetch.

WebFetch->version()

Returns the version number of WebFetch, or of any subclass which inherits the method.

When running code within a source-code development workspace, it returns "00-dev" to avoid warnings about undefined values. Release version numbers are assigned and added by the build system upon release, and are not available when running directly from a source code repository.

WebFetch->config( $key, [$value])

This class method is the read/write accessor to WebFetch's key/value configuration store. If $value is not provided (or is undefined) then this is a read accessor, returning the value of the configuration entry named by $key. If $value is defined then this is a write accessor, assigning $value to the configuration entry named by $key.

WebFetch->has_config($key)

This class method returns a boolean value which is true if the configuration entry named by $key exists in the WebFetch key/value configuration store. Otherwise it returns false.

WebFetch->del_config($key)

This class method deletes the configuration entry named by $key.

WebFetch->import_config(\%hashref)

This class method imports all the key/value pairs from %hashref into the WebFetch configuration.

WebFetch->keys_config()

This class method returns a list of the keys in the WebFetch configuration store. This method was made for testing purposes. That is currently its only foreseen use case.
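The accessor semantics described above can be illustrated with a small stand-in class. This is only a sketch: ConfigSketch is a hypothetical package written for this example and is not WebFetch's implementation, and the return value of a write is an assumption. With WebFetch installed, the same calls would be made on the WebFetch class itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the configuration store; not WebFetch's own code.
package ConfigSketch;

my %config;

# Read accessor when $value is undefined, write accessor otherwise.
# Returning the stored value after a write is an assumption of this sketch.
sub config {
    my ( $class, $key, $value ) = @_;
    $config{$key} = $value if defined $value;
    return $config{$key};
}

sub has_config    { my ( $class, $key ) = @_; return exists $config{$key} ? 1 : 0; }
sub del_config    { my ( $class, $key ) = @_; delete $config{$key}; return; }
sub import_config { my ( $class, $hashref ) = @_; @config{ keys %$hashref } = values %$hashref; return; }
sub keys_config   { return keys %config; }

package main;

ConfigSketch->import_config( { Usage => "--example", retries => 3 } );
ConfigSketch->config( "retries", 5 );              # write
my $retries = ConfigSketch->config("retries");     # read
ConfigSketch->del_config("Usage");
my @remaining = sort { $a cmp $b } ConfigSketch->keys_config();
print "retries=$retries keys=@remaining\n";
```

One consequence of the read/write rule is that an undefined value cannot be stored through config(); del_config() is the way to remove an entry.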

WebFetch::module_register( $module, @capabilities );

This function allows a Perl module to register itself with the WebFetch API as able to perform various capabilities.

For subclasses of WebFetch, it can be called as a class method:

__PACKAGE__->module_register( @capabilities );

For the $module parameter, the Perl module should provide its own name, usually via the __PACKAGE__ string.

@capabilities is an array of strings as needed to list the capabilities which the module performs for the WebFetch API.

If any entry of @capabilities is a hash reference, its key/value pairs are all imported into the WebFetch configuration and become accessible via the config() method. For more readable code, a hashref parameter should not be used more than once, though doing so would work. Also for readability, it is recommended to make the hashref the first parameter when this feature is used.

Except for the config hashref, parameters must be strings as follows.

The currently-recognized capabilities are "cmdline", "input" and "output". "filter", "save" and "storage" are reserved for future use. The function will save all the capability names that the module provides, without checking whether any code will use them.

For example, the WebFetch::Output::TT module registers itself like this: __PACKAGE__->module_register( "cmdline", "output:tt" ); meaning that it defines additional command-line options, and it provides an output format handler for the "tt" format, the Perl Template Toolkit.
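The handling of a mixed argument list (configuration hashref plus capability strings) can be sketched as follows. The helper split_capabilities is hypothetical and only mirrors the behavior described above; it is not part of WebFetch, and the "output:xyz" capability and Usage string are made-up values.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WebFetch: separates config hashrefs
# from capability-name strings in a module_register()-style argument list.
sub split_capabilities {
    my @args = @_;
    my ( %config, @capabilities );
    foreach my $arg (@args) {
        if ( ref($arg) eq 'HASH' ) {
            # key/value pairs that would be imported into the config store
            @config{ keys %$arg } = values %$arg;
        }
        else {
            # strings are kept as capability names, without validation
            push @capabilities, $arg;
        }
    }
    return ( \%config, \@capabilities );
}

my ( $config, $caps ) = split_capabilities(
    { Usage => "[--example-opt value]" },    # hashref first, for readability
    "cmdline", "output:xyz",
);
print "capabilities: @$caps\n";
print "config keys: ", join( ", ", sort keys %$config ), "\n";
```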

fetch_main

This function is exported into the main package. For all modules which registered with an "input" capability for the requested file format at the time this is called, it will call the run() function on behalf of each of the packages.

$obj = WebFetch::new( param => "value", [...] )

Generally, the new function should be inherited and used from a derived class. However, WebFetch provides an AUTOLOAD function which will catch wayward function calls from a subclass, and redirect them to the appropriate function in the calling class, if it exists.

The AUTOLOAD feature is needed because, for example, when an object is instantiated in a WebFetch::Input::* class, it will later be passed to a WebFetch::Output::* class, whose data method functions can be accessed this way as if the WebFetch object had become a member of that class.

$obj->init( ... )

This is called from the new function that modules inherit from WebFetch. If subclasses override it, they should still call it before completion. It takes "name" => "value" pairs which are all placed verbatim as attributes in $obj.

$obj->set_param(key, value)

This sets a value under the given key in the WebFetch object.

Some keys are intercepted to be grouped into their own sub-hierarchy. The keys "locale" and "time_zone" are placed in a "datetime_settings" hash under the object.

If the parameter is one of the intercepted values but the destination hierarchy already exists as a non-hash value, then it throws an exception.

The method does not return a value. If it doesn't throw an exception, the call succeeded.
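The interception rule can be sketched in plain Perl. The function set_param_sketch below is a hypothetical re-creation of the behavior described above, operating on a bare hash reference. It is not WebFetch's code, and the exception text is made up.

```perl
use strict;
use warnings;

# Keys that are grouped into a sub-hash, per the description above.
my %intercepted = ( locale => "datetime_settings", time_zone => "datetime_settings" );

# Hypothetical re-creation of the described behavior; not WebFetch's code.
sub set_param_sketch {
    my ( $obj, $key, $value ) = @_;
    if ( my $group = $intercepted{$key} ) {
        # Refuse to overwrite an existing non-hash value at the destination.
        if ( exists $obj->{$group} and ref( $obj->{$group} ) ne 'HASH' ) {
            die "cannot set $key: $group is not a hash\n";
        }
        $obj->{$group}{$key} = $value;
    }
    else {
        $obj->{$key} = $value;
    }
    return;
}

my $obj = {};
set_param_sketch( $obj, "dir",       "/tmp/webfetch" );
set_param_sketch( $obj, "time_zone", "UTC" );

# A conflicting non-hash value triggers the exception.
my $broken = { datetime_settings => "oops" };
my $ok = eval { set_param_sketch( $broken, "locale", "en-US" ); 1 };
print "exception caught: $@" unless $ok;
```

Since there is no return value to check, callers that want to survive a failure would wrap the call in eval, as the last lines show.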

WebFetch::mod_load ( $class )

This specifies a WebFetch module (Perl class) which needs to be loaded. In case of an error, it throws an exception.

WebFetch::run

This function can be called by the main::fetch_main function provided by WebFetch or by another user function. This handles command-line processing for some standard options, calling the module-specific fetch function and WebFetch's $obj->save function to save the contents to one or more files.

The standard command-line options are as follows:

--dir directory

(required) the directory in which to write output files

--group group

(optional) the group ID to set the output file(s) to

--mode mode

(optional) the file mode (permissions) to set the output file(s) to

--save_file save-file-path

(optional) save a copy of the fetched info in the file named by this parameter. The contents of the file are determined by the --dest_format parameter. If --dest_format isn't defined but only one module has registered a file format for saving, then that will be used by default.

--quiet

(optional) suppress printed warnings for HTTP errors (applies only to modules which use the WebFetch::get() function) in case they are not desired for cron outputs

--debug

(optional) print verbose debugging outputs, only useful for developers adding new WebFetch-based modules or finding/reporting a bug in an existing module

Modules derived from WebFetch may add their own command-line options that WebFetch::run() will use by defining a WebFetch configuration entry called "Options", containing the name/value pairs defined in Perl's Getopt::Long module. Derived modules can also add to the command-line usage error message by defining a configuration entry called "Usage" with a string of the additional parameters, as they should appear in the usage message. See the WebFetch->module_register() and WebFetch->config() class methods for setting configuration entries.
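For readers unfamiliar with Getopt::Long, the following sketch parses options like those listed above from an array. It calls Getopt::Long directly rather than going through WebFetch::run(). The option type suffixes and the --example_opt option are assumptions made for illustration, and the exact layout WebFetch expects in its "Options" entry is not shown here.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Option specs in Getopt::Long syntax. The "=s" (string value) types are
# assumptions; --example_opt stands in for a module-specific option.
my @standard       = ( "dir=s", "group=s", "mode=s", "save_file=s", "quiet", "debug" );
my @module_options = ("example_opt=s");

# A sample command line, as it would appear in @ARGV.
my @argv = ( "--dir", "/tmp/webfetch", "--quiet", "--example_opt", "demo" );

my %opt;
GetOptionsFromArray( \@argv, \%opt, @standard, @module_options )
    or die "usage: ... --dir directory [--example_opt value]\n";

# --dir is documented as required.
die "--dir is required\n" unless defined $opt{dir};
print "dir=$opt{dir} quiet=$opt{quiet} example_opt=$opt{example_opt}\n";
```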

For backward compatibility, WebFetch also looks for @Options and $Usage in the calling module's symbol table if they aren't found in the WebFetch configuration. However this method is deprecated and should not be used in new code. Perl coding best practices have evolved to recommend against using package variables in the years since the API was first defined.

$obj->do_actions

do_actions was added in WebFetch 0.10 as part of the WebFetch Embedding API. Upon entry to this function, $obj must contain the following attributes:

data

is a reference to a hash containing the following three (required) keys:

fields

is a reference to an array containing the names of the fetched data fields in the order they appear in the records of the data array. This is necessary to define what each field is called because any kind of data can be fetched from the web.

wk_names

is a reference to a hash which maps from a key string with a "well-known" (to WebFetch) field type to a field name used in this table. The well-known names are defined as follows:

title

a one-liner banner or title text (plain text, no HTML tags)

url

URL or file path (as appropriate) to the news source

id

unique identifier string for the entry

date

a date stamp (and optional timestamp), which must be program-readable as ISO 8601 date/time format (via DateTime::Format::ISO8601), Unix date command output (via Date::Calc's Parse_Date() function) or as "YYYY-MM-DD" date string format. For backward compatibility, "YYYYMMDD" format is also accepted, though technically that format was deprecated from ISO 8601 in 2004. If the date cannot be parsed by these methods, either translate it to ISO 8601 when your module captures it or do not define this well-known field.

summary

a paragraph of summary text in HTML

comments

number of comments/replies at the news site (plain text, no HTML tags)

author

a name, handle or login name representing the author of the news item (plain text, no HTML tags)

category

a word or short phrase representing the category, topic or department of the news item (plain text, no HTML tags)

location

a location associated with the news item (plain text, no HTML tags)

The field names for this table are defined in the fields array.

The hash only maps for the fields available in the table. If no field representing a given well-known name is present in the data fields, that well-known name key must not be defined in this hash.

records

an array containing the data records. Each record is itself a reference to an array of strings which are the data fields. This is effectively a two-dimensional array or a table.

Only one table-type set of data is permitted per fetch operation. If more are needed, they should be arranged as separate fetches with different parameters.
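A minimal sketch of such a data structure follows. The field names, URLs and record contents are made up for illustration, and wk_value is a hypothetical helper showing how a well-known name resolves to a column through wk_names and fields.

```perl
use strict;
use warnings;

# Example data; field names and record contents are made up for illustration.
my %data = (
    fields   => [ "headline", "link", "posted" ],
    wk_names => {
        title => "headline",    # well-known name => field name in this table
        url   => "link",
        date  => "posted",
    },
    records => [
        [ "First story",  "https://example.com/1", "2022-08-05" ],
        [ "Second story", "https://example.com/2", "2022-08-06" ],
    ],
);

# Map each field name to its column index within a record.
my %index;
@index{ @{ $data{fields} } } = 0 .. $#{ $data{fields} };

# Hypothetical helper: look up a record's value by well-known name.
sub wk_value {
    my ( $record, $wk_name ) = @_;
    my $field = $data{wk_names}{$wk_name};
    return unless defined $field;    # this table has no such field
    return $record->[ $index{$field} ];
}

foreach my $record ( @{ $data{records} } ) {
    printf "%s <%s>\n", wk_value( $record, "title" ), wk_value( $record, "url" );
}
```

Note that "author" and the other unused well-known names are simply absent from wk_names, as required when the table has no matching field.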

actions

is a reference to a hash. The hash keys are names for handler functions. The WebFetch core provides internal handler functions called fmt_handler_html (for HTML output), fmt_handler_xml (for XML output) and fmt_handler_wf (for WebFetch::General format). However, WebFetch modules may provide additional format handler functions of their own by prepending "fmt_handler_" to the key string used in the actions hash.

The values are array references containing "action specs", which are themselves arrays of parameters that will be passed to the handler functions for generating output in a specific format. There may be more than one entry for a given format if multiple outputs with different parameters are needed.

The presence of values in this field means that output is to be generated in the specified format. These would have been chosen by the WebFetch module that created them, possibly by default settings or by a command-line argument that directed a specific output format to be used.

For each valid action spec, a separate "savable" (contents to be placed in a file) will be generated from the contents of the data variable.

The valid (but all optional) keys are

html

the value must be a reference to an array which specifies all the HTML generation (html_gen) operations that will take place upon the data. Each entry in the array is itself an array reference, containing the following parameters for a call to html_gen():

filename

a file name or path string (relative to the WebFetch output directory unless a full path is given) for output of HTML text.

params

a hash reference containing optional name/value parameters for the HTML format handler.

filter_func

(optional) a reference to code that, given a reference to an entry in @{$self->{data}{records}}, returns true (1) or false (0) for whether it will be included in the HTML output. By default, all records are included.

sort_func

(optional) a reference to code that, given references to two entries in @{$self->{data}{records}}, returns the sort comparison value for the order they should be in. By default, no sorting is done and all records (subject to filtering) are accepted in order.

format_func

(optional) a reference to code that, given a reference to an entry in @{$self->{data}{records}}, stores a savable representation of the string.
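The following sketch shows one "html" action spec with filter and sort callbacks, and applies them to a small record table in plain Perl. It assumes that filter_func and sort_func are supplied as keys of the params hash. The record layout, file name and the loop that applies the callbacks are illustrative and are not taken from WebFetch's handler code.

```perl
use strict;
use warnings;

# Records as [ title, url, date ]; the data is made up for illustration.
my @records = (
    [ "Older story", "https://example.com/1", "2022-08-05" ],
    [ "Draft",       "",                      "2022-08-07" ],
    [ "Newer story", "https://example.com/2", "2022-08-06" ],
);

my %actions = (
    html => [
        [
            "news.html",    # filename, relative to the output directory
            {
                # keep only records that have a URL
                filter_func => sub { my ($rec) = @_; return $rec->[1] ne "" ? 1 : 0; },

                # newest first, comparing ISO-style date strings
                sort_func => sub { my ( $x, $y ) = @_; return $y->[2] cmp $x->[2]; },
            },
        ],
    ],
);

# Apply the callbacks the way a format handler plausibly would.
my %output;
foreach my $spec ( @{ $actions{html} } ) {
    my ( $filename, $params ) = @$spec;
    my @selected = grep { $params->{filter_func}->($_) } @records;
    @selected = sort { $params->{sort_func}->( $a, $b ) } @selected;
    $output{$filename} = [ map { $_->[0] } @selected ];
    print "$filename: ", join( ", ", @{ $output{$filename} } ), "\n";
}
```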

Additional valid keys may be created by modules that inherit from WebFetch by supplying a method/function named with "fmt_handler_" preceding the string used for the key. For example, for an "xyz" format, the handler function would be fmt_handler_xyz. The value (the "action spec") of the hash entry must be an array reference. Within that array are "action spec entries", each of which is a reference to an array containing the list of parameters that will be passed verbatim to the fmt_handler_xyz function.

When the format handler function returns, it is expected to have created entries in the $obj->{savables} array (even if they only contain error messages explaining a failure), which will be used by $obj->save() to save the files and print the error messages.

For coding examples, use the fmt_handler_* functions in WebFetch.pm itself.

$obj->fetch

This function must be provided by each derived module to perform the fetch operation specific to that module. It will be called from new() so you should not call it directly. Your fetch function should extract some data from somewhere and place it in HTML or other meaningful form in the "savable" array.

TODO: cleanup references to WebFetch 0.09 and 0.10 APIs.

Upon entry to this function, $obj must contain the following attributes:

dir

The name of the directory to save in. (If called from the command-line, this will already have been provided by the required --dir parameter.)

savable

a reference to an array where the "savable" items will be placed by the $obj->fetch function. (You only need to provide an array reference - other WebFetch functions can write to it.)

In WebFetch 0.10 and later, this parameter should no longer be supplied by the fetch function (unless you wish to use 0.09 backward compatibility) because it is filled in by the do_actions after the fetch function is completed based on the data and actions variables that are set in the fetch function. (See below.)

Each entry of the savable array is a hash reference with the following attributes:

file

file name to save in

content

scalar w/ entire text or raw content to write to the file

group

(optional) group setting to apply to file

mode

(optional) file permissions to apply to file

Contents of savable items may be generated directly by derived modules or with WebFetch's html_gen, html_savable or raw_savable functions. These functions will set the group and mode parameters from the object's own settings, which in turn could have originated from the WebFetch command-line if this was called that way.
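As a sketch of the structure only, the following builds a savable array by hand. A bare hash stands in for a WebFetch-derived object, add_savable is a hypothetical helper, and the group name and mode string are placeholder values whose expected format is not specified here.

```perl
use strict;
use warnings;

# A bare hash stands in for a WebFetch-derived object in this sketch.
my $obj = { dir => "/tmp/webfetch", savable => [] };

# Hypothetical helper that appends one savable entry after checking
# that the two non-optional attributes are present.
sub add_savable {
    my ( $self, %entry ) = @_;
    foreach my $required (qw(file content)) {
        die "savable entry needs '$required'\n" unless defined $entry{$required};
    }
    push @{ $self->{savable} }, {%entry};
    return;
}

add_savable( $obj, file => "headlines.html", content => "<ul>\n<li>Example</li>\n</ul>\n" );
add_savable(
    $obj,
    file    => "headlines.txt",
    content => "Example\n",
    group   => "www",     # placeholder group
    mode    => "0644",    # placeholder mode
);

printf "%d file(s) queued: %s\n", scalar @{ $obj->{savable} },
    join( ", ", map { $_->{file} } @{ $obj->{savable} } );
```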

Note that the fetch function's requirements changed in WebFetch 0.10. The old requirement (0.09 and earlier) is supported for backward compatibility.

In WebFetch 0.09 and earlier, upon exit from this function, the $obj->savable array must contain one entry for each file to be saved. More than one array entry means more than one file to save. The WebFetch infrastructure will save them, retaining backup copies and setting file modes as needed.

Beginning in WebFetch 0.10, the "WebFetch embedding" capability was introduced. In order to do this, the captured data of the fetch function had to be externalized where other Perl routines could access it. So the fetch function now only populates data structures (including code references necessary to process the data.)

Upon exit from the function, the following variables must be set in $obj:

data

is a reference to a hash which will be used by the do_actions function. (See above.)

actions

is a reference to a hash which will be used by the do_actions function. (See above.)

$obj->get

This WebFetch utility function will get a URL and return a reference to a scalar with the retrieved contents. Upon entry to this function, $obj must contain the following attributes:

source

the URL to get

quiet

a flag which, when set to a non-zero (true) value, suppresses printing of HTTP request errors on STDERR

$obj->wf_export ( $filename, $fields, $lines, [ $comment, [ $param ]] )

In WebFetch 0.10 and later, this should be used only in format handler functions. See do_actions() for details.

This WebFetch utility function generates contents for a WebFetch export file, which can be placed on a web server to be read by other WebFetch sites. The WebFetch::General module reads this format. $obj->wf_export has the following parameters:

$filename

the file to save the WebFetch export contents to; this will be placed in the savable record with the contents so the save function knows where to write them

$fields

a reference to an array containing a list of the names of the data fields (in each entry of the @$lines array)

$lines

a reference to an array of arrays; the outer array contains each line of the exported data; the inner array is a list of the fields within that line corresponding in index number to the field names in the $fields array

$comment

(optional) a human-readable string comment (probably describing the purpose of the format and the definitions of the fields used) to be placed at the top of the exported file

$param

(optional) a reference to a hash of global parameters for the exported data. This is currently unused but reserved for future versions of WebFetch.

$obj->html_gen( $filename, $format_func, $links )

In WebFetch 0.10 and later, this should be used only in format handler functions. See do_actions() for details.

This WebFetch utility function generates some common formats of HTML output used by WebFetch-derived modules. The HTML output is stored in the $obj->{savable} array, for which all the files in that array can later be saved by the $obj->save function. It has the following parameters:

$filename

the file name to save the generated contents to; this will be placed in the savable record with the contents so the save function knows where to write them

$format_func

a reference to code that formats each entry in @$links into a line of HTML

$links

a reference to an array of arrays of parameters for &$format_func; each entry in the outer array is contents for a separate HTML line and a separate call to &$format_func

Upon entry to this function, $obj must contain the following attributes:

num_links

number of lines/links to display

savable

reference to an array of hashes which this function will use as storage for filenames and contents to save (you only need to provide an array reference - the function will write to it)

See $obj->fetch for details on the contents of the savable parameter

table_sections

(optional) if present, this specifies the number of table columns to use; the number of links from num_links will be divided evenly between the columns

style

(optional) a hash reference with style parameter names/values that can modify the behavior of the function to use different HTML styles. The recognized values are enumerated with WebFetch's --style command line option. (When they reach this point, they are no longer a comma-delimited string - WebFetch or another module has parsed them into a hash with the style name as the key and the integer 1 for the value.)

url

(optional) an alternative URL to fetch from. In WebFetch modules that fetch from a URL, this will override the default URL in the module. In other modules, it has no effect but its presence won't cause an error.

$obj->html_savable( $filename, $content )

In WebFetch 0.10 and later, this should be used only in format handler functions. See do_actions() for details.

This WebFetch utility function stores pre-generated HTML in a new entry in the $obj->{savable} array, for later writing to a file. It's basically a simple wrapper that puts HTML comments warning that it's machine-generated around the provided HTML text. This is generally a good idea so that neophyte webmasters (and you know there are a lot of them in the world :-) will see the warning before trying to manually modify your automatically-generated text.

See $obj->fetch for details on the contents of the savable parameter

$obj->raw_savable( $filename, $content )

In WebFetch 0.10 and later, this should be used only in format handler functions. See do_actions() for details.

This WebFetch utility function stores any raw content and a filename in the $obj->{savable} array, in preparation for writing to that file. (The actual save operation may also automatically include keeping backup files and setting the group and mode of the file.)

See $obj->fetch for details on the contents of the savable parameter

$obj->direct_fetch_savable( $filename, $source )

This should be used only in format handler functions. See do_actions() for details.

This adds a task for the save function to fetch a URL and save it verbatim in a file. This can be used to download links contained in a news feed.

$obj->no_savables_ok

This can be used by an output function which handles its own intricate output operation (such as WebFetch::Output::TWiki). If the savables array is empty, it would cause an error. Using this function drops a note in it which basically says that's OK.

$obj->save

This WebFetch utility function goes through all the entries in the $obj->{savable} array and saves their contents, providing several services such as keeping backup copies, and setting the group and mode of the file, if requested to do so.

If you call a WebFetch-derived module from the command-line run() or fetch_main() functions, this will already be done for you. Otherwise you will need to call it after populating the savable array with one entry per file to save.

Upon entry to this function, $obj must contain the following attributes:

dir

directory to save files in

savable

names and contents for files to save

See $obj->fetch for details on the contents of the savable parameter

WebFetch::parse_date([{locale => "locale", time_zone => "time zone"}], $raw_time_str)

This parses a time string into a time or date structure which can be used by gen_timestamp() or anchor_timestr().

If the string can be parsed as a simple date in the format of YYYY-MM-DD or YYYYMMDD, it returns an array of parameters which can be passed to DateTime->new(). Given in this context, gen_timestamp() or anchor_timestr() recognize that means this is only a date with no time. (DateTime would fill in a time for midnight, which could be shifted by hours if a timezone is added, making a date-only condition nearly impossible to detect.)

If the time can be parsed by DateTime::Format::ISO8601, that result is returned.

If the time can be parsed by Date::Calc's Parse_Date(), a date-only array result is returned as above.

If the string can't be parsed, it returns undef.
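The date-only branch can be sketched without any non-core modules. The function parse_simple_date below is a hypothetical illustration of recognizing the two simple formats and producing name/value pairs of the kind accepted by DateTime->new(). It does not validate month or day ranges, it returns an empty list rather than undef on failure, and it does not reproduce the ISO 8601 or Date::Calc fallbacks.

```perl
use strict;
use warnings;

# Hypothetical helper covering only the date-only branch described above.
# Returns name/value pairs suitable for DateTime->new(), or an empty list.
sub parse_simple_date {
    my ($str) = @_;
    if ( $str =~ /\A(\d{4})-(\d{2})-(\d{2})\z/ or $str =~ /\A(\d{4})(\d{2})(\d{2})\z/ ) {
        # adding 0 strips leading zeros, e.g. "08" becomes 8
        return ( year => $1 + 0, month => $2 + 0, day => $3 + 0 );
    }
    return;    # not a simple date; the other parsers would be tried next
}

my %date    = parse_simple_date("2022-08-05");
my %compact = parse_simple_date("20220805");
my @none    = parse_simple_date("Aug 5 2022");

print join( " ", map { "$_=$date{$_}" } sort keys %date ), "\n";
```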

WebFetch::gen_timestamp([{locale => "locale", time_zone => "time zone"}], $time_ref)

This takes a reference received from parse_date() above and returns a string with the date in current locale format.

anchor_timestr([{time_zone => "time zone"}], $time_ref)

This takes a reference received from parse_date() above and returns a timestamp string which can be used as a hypertext link anchor, such as in HTML. The string will be the numbers from the date, and possibly the time of day, delimited by dashes '-'. If a time zone is provided, it will be used.

For example, August 5, 2022 at 19:30 becomes "2022-08-05-19-30-00".
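The string format can be reproduced with sprintf. The function anchor_sketch below is a hypothetical helper taking name/value date parts; it is not WebFetch's implementation and ignores time zones. The date-only form shown is an assumption based on the description above.

```perl
use strict;
use warnings;

# Hypothetical helper, not WebFetch's implementation: joins date parts,
# and time parts when an hour is given, with dashes.
sub anchor_sketch {
    my (%t) = @_;
    my $str = sprintf "%04d-%02d-%02d", @t{qw(year month day)};
    if ( defined $t{hour} ) {
        # missing minute or second defaults to zero
        $str .= sprintf "-%02d-%02d-%02d", $t{hour}, $t{minute} // 0, $t{second} // 0;
    }
    return $str;
}

my $with_time = anchor_sketch( year => 2022, month => 8, day => 5, hour => 19, minute => 30 );
my $date_only = anchor_sketch( year => 2022, month => 8, day => 5 );
print "$with_time\n$date_only\n";
```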

AUTOLOAD functionality

When a WebFetch input object is passed to an output class, operations on $self would not usually work. WebFetch subclasses are considered to be cooperating with each other. So WebFetch provides AUTOLOAD functionality to catch undefined function calls for its subclasses. If the calling class provides a function by the name that was attempted, then it will be redirected there.

WRITING WebFetch-DERIVED MODULES

The easiest way to make a new WebFetch-derived module is to start from the module closest to your fetch operation and modify it. Make sure to change all of the following:

fetch function

The fetch function is the meat of the operation. Get the desired info from a local file or remote site and place the contents that need to be saved in the savable parameter.

module name

Be sure to catch and change them all.

file names

The code and documentation may refer to output files by name.

module parameters

Change the URL, number of links, etc as necessary.

command-line parameters

If you need to add command-line parameters, set both the Options and Usage configuration parameters when your module calls module_register(). Don't forget to add documentation for your command-line options and remove old documentation for any you removed.

When adding documentation, if the existing formatting isn't enough for your changes, there's more information about Perl's POD ("plain old documentation") embedded documentation format at http://www.cpan.org/doc/manual/html/pod/perlpod.html

authors

Do not modify the names unless instructed to do so. The maintainers have discretion whether one's contributions are significant enough to qualify as a co-author.

Please consider contributing any useful changes back to the WebFetch project at maint@webfetch.org.

ACKNOWLEDGEMENTS

WebFetch was written by Ian Kluft. Send patches, bug reports, suggestions and questions to maint@webfetch.org.

Some changes in versions 0.12-0.13 (Aug-Sep 2009) were made for and sponsored by Twiki Inc (formerly TWiki.Net).

LICENSE

WebFetch is Open Source software licensed under the GNU General Public License Version 3. See https://www.gnu.org/licenses/gpl-3.0-standalone.html.

SEE ALSO

Included in WebFetch module: WebFetch::Input::PerlStruct, WebFetch::Input::SiteNews, WebFetch::Output::Dump, WebFetch::Data::Config, WebFetch::Data::Record, WebFetch::Data::Store

Modules separated to contain external module dependencies: WebFetch::Input::Atom, WebFetch::RSS, WebFetch::Output::TT, WebFetch::Output::TWiki

Source code repository: https://github.com/ikluft/WebFetch

BUGS AND LIMITATIONS

Please report bugs via GitHub at https://github.com/ikluft/WebFetch/issues

Patches and enhancements may be submitted via a pull request at https://github.com/ikluft/WebFetch/pulls

AUTHOR

Ian Kluft <https://github.com/ikluft>

COPYRIGHT AND LICENSE

This software is Copyright (c) 1998-2022 by Ian Kluft.

This is free software, licensed under:

The GNU General Public License, Version 3, June 2007