HTML::TableExtract Examples

Each table is labeled in the first row with its coordinates in terms of depth and count, both of which start at 0. Some of the tables have headers in the second row; although in this example these header cells are in fact <th> tags, header cells can be either <th> or <td>. The remaining cells in the table show the row and column of that cell, along with the table coordinates: depth,count:row,column. Rows and columns begin at 0 as well, so the table label and headers, if present, will affect these cell coordinates.

In the illustrations of what is extracted from these tables, content in italics is notational in nature; it was not actually extracted from the tables. In particular, whenever headers are used for extraction, the order in which the headers were provided is noted by listing the headers, but the header row is not actually extracted from the target table.

It might be helpful to open a new browser window with this table visible so that the table can be easily examined when scrolling through the examples.

Table (0,0)
0,0:1,0
    Table (1,0)
    East  Central  West
    1,0:2,0  1,0:2,1  1,0:2,2
    1,0:3,0  1,0:3,2
    1,0:4,0  1,0:4,2
    1,0:5,0  1,0:5,1  1,0:5,2
0,0:1,1
    Table (1,1)
    Left  Middle  Right
    1,1:2,0  1,1:2,1  1,1:2,2
    1,1:3,0  1,1:3,1  1,1:3,2
    1,1:4,0  1,1:4,1  1,1:4,2
    1,1:5,0  1,1:5,1  1,1:5,2
0,0:2,0
    Table (1,2)
    Left  Right
    1,2:2,0
        Table (2,0)
        Pacific  Atlantic
        2,0:2,0  2,0:2,1
        2,0:3,0  2,0:3,1
    1,2:2,1
        Table (2,1)
        Lefty  Righty
        2,1:2,0  2,1:2,1
        2,1:3,0  2,1:3,1
    1,2:3,0  1,2:3,1
    1,2:4,0  1,2:4,1
    1,2:5,0  1,2:5,1
0,0:2,1
    Table (1,3)
    Pacific  Plains  Atlantic
    1,3:2,0  1,3:2,1  1,3:2,2
    1,3:3,1  1,3:3,2
    1,3:4,0  1,3:4,2
    1,3:5,0  1,3:5,2

Example 1
use HTML::TableExtract;

# $html_string holds the HTML of the example document shown above
my $te = HTML::TableExtract->new( headers => [qw(Right Left)] );
$te->parse($html_string);

Result:
Extracted from table (1,1)
Order: Right, Left
1,1:2,2  1,1:2,0
1,1:3,2  1,1:3,0
1,1:4,2  1,1:4,0
1,1:5,2  1,1:5,0
Extracted from table (2,1)
Order: Right, Left
2,1:2,1  2,1:2,0
2,1:3,1  2,1:3,0
Extracted from table (1,2)
Order: Right, Left
1,2:2,1  1,2:2,0
1,2:3,1  1,2:3,0
1,2:4,1  1,2:4,0
1,2:5,1  1,2:5,0


With headers, depth and count are irrelevant; all tables with columns matching those headers are extracted. Headers are matched as case-insensitive, non-anchored regular expressions, which is why table (2,1), whose headers are Lefty and Righty, is extracted here as well. Columns are automatically rearranged into the order in which the headers were provided, so in this case we have reversed left and right. The row in which the headers were found and all rows above it are ignored; only the rows beneath the headers are extracted. Only the columns that line up with specific headers are retained.
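The matching and reordering described above can be sketched in plain Perl. This is an illustration only, not the module's own code: `match_columns` is a hypothetical helper, and it simplifies by letting two patterns claim the same column. The header rows and the data row are taken from tables (1,1) and (2,1) on this page.

```perl
use strict;
use warnings;

# Hypothetical helper: map each wanted header pattern to the index of the
# first column whose header text matches it (case-insensitive, unanchored).
# Returns an empty list unless every pattern finds a column.
sub match_columns {
    my ($header_row, @patterns) = @_;
    my @indices;
    for my $pattern (@patterns) {
        my ($found) = grep { $header_row->[$_] =~ /$pattern/i }
                      0 .. $#$header_row;
        return () unless defined $found;
        push @indices, $found;
    }
    return @indices;
}

# Header rows as they appear in tables (1,1) and (2,1).
my @cols_11 = match_columns([qw(Left Middle Right)], qw(Right Left));
my @cols_21 = match_columns([qw(Lefty Righty)],      qw(Right Left));

# Reorder one data row of table (1,1) into the requested header order.
my @row       = ('1,1:2,0', '1,1:2,1', '1,1:2,2');
my @reordered = @row[@cols_11];

print "@cols_11\n";     # 2 0
print "@cols_21\n";     # 1 0  ("Right" matches "Righty", "Left" matches "Lefty")
print "@reordered\n";   # 1,1:2,2 1,1:2,0
```

Because the patterns are not anchored, "Right" also selects the Righty column, which mirrors the second result block of Example 1.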
Example 2
my $te = HTML::TableExtract->new( headers => [qw(Lefty Righty)] );
$te->parse($html_string);

Result:
Extracted from table (2,1)
Order: Lefty, Righty
2,1:2,0  2,1:2,1
2,1:3,0  2,1:3,1


Using basic header extraction, tables can be reliably extracted from a document no matter how the HTML changes around them or how deeply nested they are.
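The examples on this page stop at the `parse` call. As a minimal sketch of how the extracted rows might then be read back, the following uses the `tables`, `rows` and `coords` methods as I understand them from the module's documentation. The HTML string and the `@extracted` array are made up for illustration; the target table is deliberately nested inside a layout table.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical document: the target table sits inside a layout table.
my $html_string = <<'HTML';
<table><tr><td>
  <table>
    <tr><th>Left</th><th>Right</th></tr>
    <tr><td>a</td><td>b</td></tr>
    <tr><td>c</td><td>d</td></tr>
  </table>
</td></tr></table>
HTML

my $te = HTML::TableExtract->new( headers => [qw(Right Left)] );
$te->parse($html_string);
$te->eof;    # tell the underlying HTML::Parser that input is complete

my @extracted;
for my $ts ($te->tables) {
    print "Extracted from table (", join(',', $ts->coords), ")\n";
    for my $row ($ts->rows) {
        # Cells without content may be undef, so substitute an empty string.
        my @cells = map { defined $_ ? $_ : '' } @$row;
        push @extracted, \@cells;
        print join(' | ', @cells), "\n";
    }
}
```

The nested table should be found by its headers alone, with no reference to its depth or count, and its columns should come back in the Right, Left order that was requested.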
Example 3
my @tes = (
    HTML::TableExtract->new( headers => [qw(Pacific Plains Atlantic)] ),
    HTML::TableExtract->new( headers => [qw(Atlantic Pacific Plains)] ),
    HTML::TableExtract->new( headers => [qw(Atlantic Plains)] ),
    HTML::TableExtract->new( headers => [qw(Plains Pacific)] ),
);
# Parse the same document once with each extractor
$_->parse($html_string) for @tes;

Result:
Extracted from table (1,3)
Order: Pacific, Plains, Atlantic
1,3:2,0  1,3:2,1  1,3:2,2
1,3:3,1  1,3:3,2
1,3:4,0  1,3:4,2
1,3:5,0  1,3:5,2
Extracted from table (1,3)
Order: Atlantic, Pacific, Plains
1,3:2,2  1,3:2,0  1,3:2,1
1,3:3,2  1,3:3,1
1,3:4,2  1,3:4,0
1,3:5,2  1,3:5,0
Extracted from table (1,3)
Order: Atlantic, Plains
1,3:2,2  1,3:2,1
1,3:3,2  1,3:3,1
1,3:4,2
1,3:5,2
Extracted from table (1,3)
Order: Plains, Pacific
1,3:2,1  1,3:2,0
1,3:3,1
1,3:4,0
1,3:5,0


The tables above represent different ways of extracting information from the same table using headers; notice how the column order is automatically adjusted to reflect the order in which the headers were provided.

Gridmapping preserves the columns that you see in a browser. Tables are actually HTML tree structures, so when cell spans are involved, the "grid" is an illusion. Gridmapping superimposes a grid structure of 1x1 cells over the table and reports each cell in the column where it appears visually. Note that the cell coordinates in this case represent these grid coordinates rather than tree coordinates.
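To make the idea of a superimposed grid concrete, here is a small plain-Perl sketch. It is not the module's implementation: `gridmap` is a hypothetical helper, the three-row table is invented, and placing a spanning cell's text in its top-left slot while leaving the other covered slots undefined is an assumption made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: lay tree-ordered cells onto a grid of 1x1 slots.
# Each cell is a hash with text and optional rowspan/colspan (default 1).
sub gridmap {
    my (@tree_rows) = @_;
    my (@grid, @taken);
    for my $r (0 .. $#tree_rows) {
        $grid[$r] ||= [];
        my $c = 0;
        for my $cell (@{ $tree_rows[$r] }) {
            $c++ while $taken[$r][$c];    # skip slots covered by earlier spans
            my $rs = $cell->{rowspan} || 1;
            my $cs = $cell->{colspan} || 1;
            for my $dr (0 .. $rs - 1) {
                $taken[$r + $dr][$c + $_] = 1 for 0 .. $cs - 1;
            }
            $grid[$r][$c] = $cell->{text};
            $c += $cs;
        }
    }
    return @grid;
}

# Invented table: two cells span more than one row.
my @tree = (
    [ { text => 'r0c0', rowspan => 2 }, { text => 'r0c1' }, { text => 'r0c2' } ],
    [ { text => 'r1c1', rowspan => 2 }, { text => 'r1c2' } ],
    [ { text => 'r2c0' }, { text => 'r2c2' } ],
);

my @grid = gridmap(@tree);
for my $row (@grid) {
    print join(' ', map { defined $_ ? $_ : '-' } @$row), "\n";
}
# r0c0 r0c1 r0c2
# - r1c1 r1c2
# r2c0 - r2c2
```

In tree order the second row holds only two cells, so 'r1c1' is its first element; on the grid it lands in column 1, where a browser would draw it.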
Example 4
my @tes = (
    HTML::TableExtract->new( depth => 1, count => 3 ),
    HTML::TableExtract->new( depth => 1, count => 3, gridmap => 0 ),
);
# Parse the same document once with each extractor
$_->parse($html_string) for @tes;

Result:
Extracted from table (1,3)
Table (1,3)
Pacific  Plains  Atlantic
1,3:2,0  1,3:2,1  1,3:2,2
1,3:3,1  1,3:3,2
1,3:4,0  1,3:4,2
1,3:5,0  1,3:5,2
Extracted from table (1,3)
Table (1,3)
Pacific  Plains  Atlantic
1,3:2,0  1,3:2,1  1,3:2,2
1,3:3,1  1,3:3,2
1,3:4,0  1,3:4,2
1,3:5,0  1,3:5,2


Here we target the same table using depth and count. Taken together, depth and count uniquely specify a table in an HTML document, though they tie the extraction to the document's structure more than headers do. Notice also that the entire table is retrieved, not just the columns beneath the headers. In the first extraction, gridmapping is enabled by default. In the second, it is explicitly disabled in order to illustrate the tree ordering of cells.
Example 5
my $te = HTML::TableExtract->new( depth => 2 );
$te->parse($html_string);

Result:
Extracted from table (2,0)
Table (2,0)
Pacific  Atlantic
2,0:2,0  2,0:2,1
2,0:3,0  2,0:3,1
Extracted from table (2,1)
Table (2,1)
Lefty  Righty
2,1:2,0  2,1:2,1
2,1:3,0  2,1:3,1


When only a depth is specified, all tables at that depth are returned.
Example 6
my $te = HTML::TableExtract->new( count => 1 );
$te->parse($html_string);

Result:
Extracted from table (1,1)
Table (1,1)
Left  Middle  Right
1,1:2,0  1,1:2,1  1,1:2,2
1,1:3,0  1,1:3,1  1,1:3,2
1,1:4,0  1,1:4,1  1,1:4,2
1,1:5,0  1,1:5,1  1,1:5,2
Extracted from table (2,1)
Table (2,1)
Lefty  Righty
2,1:2,0  2,1:2,1
2,1:3,0  2,1:3,1


When only a count is specified, all tables at that count from each depth are returned. In this example, the second table within each depth is extracted (both depth and count begin with 0).
Example 7
my $te = HTML::TableExtract->new( count => 1, headers => [qw(Left Middle Right)] );
$te->parse($html_string);

Result:
Extracted from table (1,1)
Order: Left, Middle, Right
1,1:2,0  1,1:2,1  1,1:2,2
1,1:3,0  1,1:3,1  1,1:3,2
1,1:4,0  1,1:4,1  1,1:4,2
1,1:5,0  1,1:5,1  1,1:5,2


When constraints are specified together, each has veto power over whether a table is extracted. In this case, the same two tables as in the prior example matched the count, but the header constraint discarded the one without the proper headers.
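The veto behaviour can be modelled in a few lines of plain Perl. The sketch below is an assumption-laden illustration rather than the module's logic: `wanted` and `coords_matching` are hypothetical helpers, and the table list simply restates the coordinates and header rows of the tables on this page.

```perl
use strict;
use warnings;

# The tables on this page: coordinates plus header row (undef if none).
my @tables = (
    { depth => 0, count => 0, headers => undef },
    { depth => 1, count => 0, headers => [qw(East Central West)] },
    { depth => 1, count => 1, headers => [qw(Left Middle Right)] },
    { depth => 1, count => 2, headers => [qw(Left Right)] },
    { depth => 2, count => 0, headers => [qw(Pacific Atlantic)] },
    { depth => 2, count => 1, headers => [qw(Lefty Righty)] },
    { depth => 1, count => 3, headers => [qw(Pacific Plains Atlantic)] },
);

# Hypothetical predicate: every constraint that is present must pass,
# so any single failing constraint vetoes the table.
sub wanted {
    my ($t, %want) = @_;
    return 0 if defined $want{depth} && $t->{depth} != $want{depth};
    return 0 if defined $want{count} && $t->{count} != $want{count};
    for my $pattern (@{ $want{headers} || [] }) {
        return 0 unless grep { /$pattern/i } @{ $t->{headers} || [] };
    }
    return 1;
}

# Return the coordinates of every table that survives all constraints.
sub coords_matching {
    my %want = @_;
    return map  { "($_->{depth},$_->{count})" }
           grep { wanted($_, %want) } @tables;
}

print join(' ', coords_matching(count => 1)), "\n";
# (1,1) (2,1)
print join(' ', coords_matching(count => 1, headers => [qw(Left Middle Right)])), "\n";
# (1,1)
```

The first call reproduces the selection in Example 6; adding the header constraint removes table (2,1) because nothing in its header row matches "Middle".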