NAME
HTML::Untemplate - undo what the template engine does
VERSION
version 0.002
DESCRIPTION
Despite the similar name, this distribution is not directly related to HTML::Template. Instead, it attempts to reverse the templating process, whatever template engine was used.
Why?
Suppose you have a CMS. A typical CMS works roughly like this (data flows from top to bottom):
RDBMS
scripting language
HTML
HTTP server
(...)
HTTP agent
layout engine
screen
user
Consider the first 3 steps: RDBMS => scripting language => HTML
This is "applying a template".
Now, consider this: HTML => scripting language => RDBMS
I would call that "un-applying a template", or "untemplate" :)
The practical application of this set of tools is to assist in the creation of web scrapers.
CLI tools
xpathify
The xpathify tool flattens the HTML tree into a key/value list:
<!DOCTYPE html>
<html>
<head>
<title>Hello HTML</title>
</head>
<body>
<h1>Hello World!</h1>
<p>This is a sample HTML</p>
Beware!
<p>HTML is <b>not</b> XML!</p>
Have a nice day.
</body>
</html>
Becomes:
/html/head/title/text() Hello HTML
/html/body/h1/text() Hello World!
/html/body/p[1]/text() This is a sample HTML
/html/body/text() Beware!
/html/body/p[2]/text() HTML is
/html/body/p[2]/b/text() not
/html/body/p[2]/text() XML!
/html/body/text() Have a nice day.
The keys are in XPath format, and the values are the corresponding content from the HTML tree. In theory, it should be possible to reassemble the HTML tree from the flat key/value list this tool generates.
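The flattening described above can be sketched in a few lines. The following is only an illustration of the idea, written in Python with its standard library; it is not how xpathify or HTML::Linear is implemented. The names Node, TreeBuilder, flatten and xpathify are hypothetical, and the sketch assumes that a position index such as p[2] is added only when an element has same-named siblings. It also ignores attributes, whereas the real tool emits attribute predicates and attribute values as well.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (a deliberately short, assumed list).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A minimal element node; children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a small tree out of the parser events."""

    def __init__(self):
        super().__init__()
        self.root = Node(None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:  # whitespace-only text is dropped
            self.current.children.append(text)


def flatten(node, path=""):
    """Yield (xpath, text) pairs in document order."""
    totals = Counter(c.tag for c in node.children if isinstance(c, Node))
    seen = Counter()
    for child in node.children:
        if isinstance(child, str):
            yield (path + "/text()", child)
            continue
        seen[child.tag] += 1
        step = child.tag
        if totals[child.tag] > 1:  # index only when siblings share the tag
            step += "[%d]" % seen[child.tag]
        yield from flatten(child, path + "/" + step)


def xpathify(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return list(flatten(builder.root))


if __name__ == "__main__":
    sample = "<html><body><p>HTML is <b>not</b> XML!</p></body></html>"
    for key, value in xpathify(sample):
        print(key, value)
```

Building the tree first, rather than printing paths while parsing, is what allows the sketch to decide whether a step needs an index: that depends on siblings which may not have been seen yet.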
untemplate
The untemplate tool flattens a set of HTML documents using the algorithm from xpathify. Then, it strips the key/value pairs shared by all of them. What remains consists of the original values that were fed into the template engine.
This is what the result actually looks like for some simple real-world examples (quotes 1839 and 2486 from bash.org):
/html/head/title/text()
bash_org_1839 QDB: Quote #1839
bash_org_2486 QDB: Quote #2486
/html/body/form[@name='tsearch']/center/table[1]/tr/td[2]/font/b/text()
bash_org_1839 Quote #1839
bash_org_2486 Quote #2486
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a/@href
bash_org_1839 ?1839
bash_org_2486 ?2486
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a/b/text()
bash_org_1839 #1839
bash_org_2486 #2486
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a[@class='qa'][1]/@href
bash_org_1839 ./?le=cc8456a913b26eb7364e4e9a94348d04&rox=1839
bash_org_2486 ./?le=cc8456a913b26eb7364e4e9a94348d04&rox=2486
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/text()
bash_org_1839 (245)
bash_org_2486 (230)
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a[@class='qa'][2]/@href
bash_org_1839 ./?le=cc8456a913b26eb7364e4e9a94348d04&sox=1839
bash_org_2486 ./?le=cc8456a913b26eb7364e4e9a94348d04&sox=2486
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a[@class='qa'][3]/@href
bash_org_1839 ./?le=cc8456a913b26eb7364e4e9a94348d04&sux=1839
bash_org_2486 ./?le=cc8456a913b26eb7364e4e9a94348d04&sux=2486
/html/body/p/center[1]/table/tr/td[1]/p[@class='qt']/text()
bash_org_1839 <maff> who needs showers when you've got an assortment of feminine products
bash_org_2486 <R`:#heroin> Is this for recovery or indulgence?
/html/body/p/center[2]/table/tr[2]/td[@class='footertext'][1]/text()
bash_org_1839 8.3642
bash_org_2486 0.0016
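The stripping step can be illustrated with a small sketch. It is written in Python purely for illustration and makes no claim about how the untemplate tool works internally: it assumes each document has already been flattened into a list of (key, value) pairs, and it treats a key as part of the template only when every document has it with identical values. The function name untemplate and the input layout are hypothetical.

```python
from collections import defaultdict


def untemplate(documents):
    """Keep only the keys whose values differ between documents.

    documents maps a document name to its list of (xpath, value) pairs.
    Returns {xpath: {document name: tuple of values}}.
    """
    per_key = defaultdict(dict)
    for name, pairs in documents.items():
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)  # a key may repeat within one document
        for key, values in grouped.items():
            per_key[key][name] = tuple(values)

    result = {}
    for key, by_doc in per_key.items():
        in_every_doc = len(by_doc) == len(documents)
        if in_every_doc and len(set(by_doc.values())) == 1:
            continue  # identical everywhere: assumed to come from the template
        result[key] = by_doc
    return result


if __name__ == "__main__":
    docs = {
        "bash_org_1839": [
            ("/html/head/title/text()", "QDB: Quote #1839"),
            ("/html/body/h1/text()", "QDB"),
        ],
        "bash_org_2486": [
            ("/html/head/title/text()", "QDB: Quote #2486"),
            ("/html/body/h1/text()", "QDB"),
        ],
    }
    for key, by_doc in untemplate(docs).items():
        print(key)
        for name, values in by_doc.items():
            print(name, *values)
```

In this made-up input the h1 text is the same in both documents and is dropped, while the title differs and is kept, which mirrors the shape of the listing above.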
Modules
These may be used to serialize/flatten HTML documents:
HTML::Linear - represent HTML::Tree as a flat list
HTML::Linear::Element - represent elements to populate HTML::Linear
HTML::Linear::Path - represent paths inside HTML::Tree
SEE ALSO
AUTHOR
Stanislaw Pusep <stas@sysd.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2012 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.