NAME
HTML::TableParser - Extract data from an HTML table
SYNOPSIS
use HTML::TableParser;
$p = HTML::TableParser->new( \@reqs, \%attr );
$p->parse_file( 'foo.html' );
@reqs = (
    {
        id    => 1,                   # table id
        cols  => [ 'Object Type' ],   # column name exact match
        colre => [ qr/object/ ],      # column name RE match
        obj   => $obj,                # method callbacks
    },
    {
        id    => 1.1,                 # id for embedded table
        hdr   => \&header,            # function callback
        row   => \&row,               # function callback
        start => \&start,             # function callback
        end   => \&end,               # function callback
        udata => { Snack => 'Food' }, # arbitrary user data
    }
);
# create parser object
$p = HTML::TableParser->new( \@reqs,
    { Decode => 1, Trim => 1, Chomp => 1 } );
$p->parse_file( 'foo.html' );
# function callbacks
sub start {
    my ( $id, $line, $udata ) = @_;
    #...
}
sub end {
    my ( $id, $line, $udata ) = @_;
    #...
}
sub header {
    my ( $id, $line, $cols, $udata ) = @_;
    #...
}
sub row {
    my ( $id, $line, $cols, $udata ) = @_;
    #...
}
DESCRIPTION
HTML::TableParser uses HTML::Parser to extract data from an HTML table. The data is returned via a series of user defined callback functions or methods. Specific tables may be selected either by a unique table id or by matching against the column names. Multiple tables may be parsed simultaneously in the document.
Table Selection
There are several ways to indicate which tables in the HTML document you want to extract data from:
- id
-
Each table is given a unique id relative to its parent based upon its order and nesting. The first top level table has id 1, the second 2, etc. The first table nested in table 1 has id 1.1, the second 1.2, etc. The first table nested in table 1.1 has id 1.1.1, etc.
- column name exact match
-
exact matches against one or more column names
- column name RE match
-
matches column names against one or more regular expressions.
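The id numbering scheme can be illustrated with a small standalone sketch. It does not use HTML::TableParser at all; the nested-array representation of a document and the helper name table_ids are assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Each table is modelled as an array reference holding its nested tables
# (a hypothetical representation, for illustration only).
sub table_ids {
    my ( $tables, $prefix ) = @_;
    my @ids;
    my $n = 0;
    for my $table (@$tables) {
        $n++;
        # Top level tables are numbered 1, 2, ...; a nested table appends
        # its position to the id of its parent.
        my $id = defined $prefix ? "$prefix.$n" : "$n";
        push @ids, $id, table_ids( $table, $id );
    }
    return @ids;
}

# Two top level tables; the first contains two nested tables, and the
# first of those contains one more.
my @document = ( [ [ [] ], [] ], [] );
print join( ' ', table_ids( \@document ) ), "\n";    # prints "1 1.1 1.1.1 1.2 2"
```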
Data Extraction
As the parser traverses the table, it will pass data to user provided callback functions or methods after it has digested particular structures in the table. The start, end, hdr, and row callbacks are passed the table id (as described above), the line number in the HTML source where the table was found, and a reference to any table specific user provided data; the warn and new callbacks take slightly different arguments (see Callback API).
- Table Start
-
The start callback is invoked when a matched table has been found.
- Table End
-
The end callback is invoked after a matched table has been parsed.
- Header
-
The hdr callback is invoked after the table header has been read in. Some tables do not use the <th> tag to indicate a header, so this function may not be called. It is passed the column names.
- Row
-
The row callback is invoked after a row in the table has been read. It is passed the column data.
- Warn
-
The warn callback is invoked when a non-fatal error occurs during parsing. Fatal errors croak.
- New
-
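The new callback names the class method that is called to create a new object when HTML::TableParser has been asked (via the class attribute of a table request) to create an object for each matched table.
Callbacks may be functions, methods, or a mixture of both. If any callbacks are methods, an object (or a class from which to create one) must be supplied in the table request passed to the constructor.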
Callback API
The callbacks are invoked as follows:
start( $tbl_id, $line_no, $udata );
end( $tbl_id, $line_no, $udata );
hdr( $tbl_id, $line_no, \@col_names, $udata );
row( $tbl_id, $line_no, \@data, $udata );
warn( $tbl_id, $message, $udata );
new( $tbl_id, $udata );
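The signatures above can be exercised with a short script. This is a minimal sketch, not a definitive recipe: the sample HTML, the callback names (on_start, on_hdr, on_row, on_end) and the udata contents are all made up for the example, and eof is called on the assumption that it is inherited from HTML::Parser.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Sample document (made up for this example).
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
HTML

my ( @names, @rows );

sub on_start {
    my ( $tbl_id, $line_no, $udata ) = @_;
    print "table $tbl_id ($udata->{label}) starts at line $line_no\n";
}

sub on_hdr {
    my ( $tbl_id, $line_no, $col_names, $udata ) = @_;
    @names = @$col_names;
}

sub on_row {
    my ( $tbl_id, $line_no, $data, $udata ) = @_;
    push @rows, [@$data];    # copy the cells rather than keep the reference
}

sub on_end {
    my ( $tbl_id, $line_no, $udata ) = @_;
    print "table $tbl_id done, ", scalar @rows, " row(s) read\n";
}

my @reqs = (
    {
        id    => 1,
        start => \&on_start,
        hdr   => \&on_hdr,
        row   => \&on_row,
        end   => \&on_end,
        udata => { label => 'fruit' },    # hypothetical user data
    },
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse($html);
$p->eof;    # flush anything still buffered by HTML::Parser

print join( ', ', @names ), "\n";
print join( ', ', @$_ ), "\n" for @rows;
```

Collecting the rows in the callback and printing them afterwards keeps the callbacks small; a real application might instead process each row as it arrives.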
Data Cleanup
There are several cleanup operations that may be performed:
- Chomp
-
chomp() the data
- Decode
-
Run the data through HTML::Entities::decode.
- Trim
-
remove leading and trailing white space.
Data Organization
Column names are derived from cells delimited by the <th> and </th> tags. Some tables have header cells which span one or more columns or rows to make things look nice. HTML::TableParser determines the actual number of columns used and provides column names for each column, repeating names for spanned columns and concatenating spanned rows and columns. For example, if the table header looks like this:
 +----+--------+----------+-------------+-------------------+
 |    |        | Eq J2000 |             | Velocity/Redshift |
 | No | Object |----------| Object Type |-------------------|
 |    |        | RA | Dec |             | km/s |  z  | Qual |
 +----+--------+----------+-------------+-------------------+
The columns will be:
No
Object
Eq J2000 RA
Eq J2000 Dec
Object Type
Velocity/Redshift km/s
Velocity/Redshift z
Velocity/Redshift Qual
Row data are derived from cells delimited by the <td> and </td> tags. Cells which span more than one column or row are handled correctly, i.e. the values are duplicated in the appropriate places.
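To see how spanned header cells are reported, one can feed the parser a cut-down version of such a header. The following is a sketch under stated assumptions: the HTML fragment and its cell values are invented for the example, and eof is assumed to be inherited from HTML::Parser. Going by the description above, the names printed should be No, Eq J2000 RA and Eq J2000 Dec.

```perl
use strict;
use warnings;
use HTML::TableParser;

# A two-row header: "No" spans both header rows, "Eq J2000" spans two
# columns (fragment made up for this example).
my $html = <<'HTML';
<table>
  <tr><th rowspan="2">No</th><th colspan="2">Eq J2000</th></tr>
  <tr><th>RA</th><th>Dec</th></tr>
  <tr><td>1</td><td>a</td><td>b</td></tr>
</table>
HTML

my ( @cols, @data );

my $p = HTML::TableParser->new(
    [
        {
            id  => 1,
            hdr => sub { @cols = @{ $_[2] } },            # column names
            row => sub { push @data, [ @{ $_[2] } ] },    # cell values
        },
    ],
    { Trim => 1 },
);
$p->parse($html);
$p->eof;

print "$_\n" for @cols;
```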
METHODS
- new
-
$p = HTML::TableParser->new( \@reqs, \%attr );
This is the class constructor. It is passed a list of table requests as well as attributes which specify defaults for common operations.
Table Requests
A table request is a hash whose elements select the identification method, the callbacks, and any table-specific data cleanup.
Elements used to identify the table are
- id
-
a scalar containing the table id to match. If it is the string 'DEFAULT' it will match every table.
- cols
-
an arrayref containing the column names to match, or a scalar containing a single column name
- colre
-
an arrayref containing the regular expressions to match, or a scalar containing a single regular expression
More than one of these may be used for a single table request. A request may match more than one table. By default a request is used only once (even the DEFAULT id match!). Set the MultiMatch attribute to enable multiple matches per request.
When attempting to match a table, the following steps are taken:
The table id is compared to the requests which contain an explicit id match. The first such match is used (in the order given in the passed array).
If no explicit id match is found, column name matches are attempted. The first such match is used (in the order given in the passed array).
If no column name match is found (or there were none requested), the request with an id match of DEFAULT is used.
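The order of these steps can be sketched in plain Perl. This is an illustration of the documented order only, not the module's implementation: the helper names select_request and as_list are hypothetical, the name key in the sample requests is not a real request element, a request is assumed to match if any one of its column names or expressions matches, and the once-only/MultiMatch behaviour is ignored.

```perl
use strict;
use warnings;

# Accept either an array reference or a single scalar, as the cols and
# colre request elements do.
sub as_list {
    my ($value) = @_;
    return () unless defined $value;
    return ref $value eq 'ARRAY' ? @$value : ($value);
}

# Hypothetical helper: return the request selected for a table with the
# given id and column names, or nothing if no request applies.
sub select_request {
    my ( $reqs, $id, $cols ) = @_;

    # Step 1: explicit id matches, in the order given.
    for my $req (@$reqs) {
        next unless defined $req->{id} && $req->{id} ne 'DEFAULT';
        return $req if $req->{id} eq $id;
    }

    # Step 2: column name matches (exact or regular expression).
    for my $req (@$reqs) {
        my @exact = as_list( $req->{cols} );
        my @re    = as_list( $req->{colre} );
        for my $col (@$cols) {
            return $req if grep { $col eq $_ } @exact;
            return $req if grep { $col =~ $_ } @re;
        }
    }

    # Step 3: fall back to a DEFAULT request, if there is one.
    for my $req (@$reqs) {
        return $req if defined $req->{id} && $req->{id} eq 'DEFAULT';
    }
    return;
}

my @reqs = (
    { id    => 2,                name => 'by id' },
    { cols  => 'Object Type',    name => 'by column' },
    { colre => [qr/velocity/i],  name => 'by regex' },
    { id    => 'DEFAULT',        name => 'default' },
);

print select_request( \@reqs, '2', ['Anything'] )->{name}, "\n";             # by id
print select_request( \@reqs, '1', [ 'No', 'Object Type' ] )->{name}, "\n";  # by column
print select_request( \@reqs, '1', ['Velocity/Redshift z'] )->{name}, "\n";  # by regex
print select_request( \@reqs, '1.1', ['Misc'] )->{name}, "\n";               # default
```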
Callbacks
Callback functions are specified with the callback attributes start, end, hdr, row, and warn. They should be set to code references, i.e.
%table_req = ( ..., start => \&start_func, end => \&end_func )
To use methods, specify the object with the obj key, and the method names via the callback attributes, which should be set to strings. If you don't specify method names they will default to (you guessed it) start, end, hdr, row, and warn.
$obj = SomeClass->new();
# ...
%table_req_1 = ( ..., obj => $obj );
%table_req_2 = ( ..., obj => $obj, start => 'start', end => 'end' );
You can also have HTML::TableParser create a new object for you for each table by specifying the class attribute. By default the constructor is assumed to be the class new() method; if not, specify it using the new attribute:
use MyClass;
%table_req = ( ..., class => 'MyClass', new => 'mynew' );
To use a function instead of a method for a particular callback, set the callback attribute to a code reference:
%table_req = ( ..., obj => $obj, end => \&end_func );
You don't have to provide all the callbacks. You should not use both obj and class in the same table request.
You can specify arbitrary data to be passed to the callback functions via the udata attribute:
%table_req = ( ..., udata => \%hash_of_my_special_stuff )
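As an illustration of method callbacks, the following sketch passes an object via obj and mixes in one function callback. The class RowCollector, the sample HTML and the udata contents are hypothetical; the methods are named hdr and row so that the default method names apply, and each receives the object followed by the arguments listed under Callback API. The call to eof is assumed to be inherited from HTML::Parser.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Hypothetical handler class whose method names match the default
# callback names, so they need not be listed in the table request.
package RowCollector;

sub new {
    my ($class) = @_;
    return bless { cols => [], rows => [] }, $class;
}

sub hdr {
    my ( $self, $tbl_id, $line_no, $col_names, $udata ) = @_;
    $self->{cols} = [@$col_names];
}

sub row {
    my ( $self, $tbl_id, $line_no, $data, $udata ) = @_;
    push @{ $self->{rows} }, [@$data];
}

package main;

# Sample document (made up for this example).
my $html = <<'HTML';
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
HTML

my $collector = RowCollector->new;
my $finished  = 0;

my @reqs = (
    {
        id    => 1,
        obj   => $collector,           # hdr and row are called as methods
        end   => sub { $finished++ },  # a plain function for this callback
        udata => { source => 'example' },
    },
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse($html);
$p->eof;

print join( ', ', @{ $collector->{cols} } ), "\n";
print scalar @{ $collector->{rows} }, " row(s) collected\n";
```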
Data cleanup operations may be specified uniquely for each table. The available keys are Chomp, Decode, and Trim.
Attributes
The %attr hash provides default values for some of the table request attributes, namely the data cleanup operations (Chomp, Decode, Trim), and the multi match attribute MultiMatch, i.e.,
$p = HTML::TableParser->new( \@reqs, { Chomp => 1 } );
will set Chomp on for all of the table requests, unless overridden by them.
Decode defaults to on; all of the others default to off.
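The interaction between the defaults in %attr and a per-request setting can be sketched as follows. The sample HTML is made up, and eof is assumed to be inherited from HTML::Parser; the expectation, based on the description above, is that the first table's cell is trimmed while the second keeps its surrounding white space.

```perl
use strict;
use warnings;
use HTML::TableParser;

# Two identical tables (made up for this example).
my $html = <<'HTML';
<table><tr><td>  padded  </td></tr></table>
<table><tr><td>  padded  </td></tr></table>
HTML

my %seen;
my $collect = sub {
    my ( $tbl_id, $line_no, $data, $udata ) = @_;
    $seen{$tbl_id} = $data->[0];    # remember the first cell of the row
};

my @reqs = (
    { id => 1, row => $collect },               # inherits Trim => 1
    { id => 2, row => $collect, Trim => 0 },    # overrides the default
);

my $p = HTML::TableParser->new( \@reqs, { Trim => 1 } );
$p->parse($html);
$p->eof;

printf "table %s: [%s]\n", $_, $seen{$_} for sort keys %seen;
```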
- parse_file
-
This is the same function as in HTML::Parser.
- parse
-
This is the same function as in HTML::Parser.
AUTHOR
Diab Jerius (djerius@cfa.harvard.edu)