NAME
urhtml_stats - Show complexity metric and other stats for web page
SYNOPSIS
urhtml_stats [uri]
EXAMPLE
urhtml_stats http://perl.org
DESCRIPTION
Given a URI, parses it as HTML and prints a complexity metric and other statistics. The complexity metric is the average depth (or nesting level), in elements, of a character, divided by the logarithm of the length of the HTML.
Other statistics follow, formatted as an HTML table. There is a row for each element type, with the following columns:
The maximum nesting depth of that element (this time only taking into account nesting within that particular element).
A count of the elements of that kind in the document.
The total number of characters in elements of that type. This counts characters in nested elements multiple times. For example, if a page contains a table within a table, characters in the inner table will be counted twice: once as characters in the outer table and again as characters in the inner table.
The average size of elements of this type, in characters.
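The four per-element columns can be illustrated with a short Python sketch. This is not the code in urhtml_stats or Marpa::UrHTML. It is a rough approximation built on Python's standard html.parser, and it rests on several assumptions: the input is well formed (every non-void element is explicitly closed), only text characters are counted rather than markup, and the names ElementStats and stats are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never take a closing tag (assumption: the usual HTML void set).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ElementStats(HTMLParser):
    """Hypothetical helper: gathers per-element statistics."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                  # stack of currently open elements
        self.count = defaultdict(int)        # number of elements, per type
        self.max_nesting = defaultdict(int)  # deepest nesting within own type
        self.chars = defaultdict(int)        # text characters, per type

    def handle_starttag(self, tag, attrs):
        self.count[tag] += 1
        if tag in VOID:
            # A void element cannot contain itself, so its nesting is 1.
            self.max_nesting[tag] = max(self.max_nesting[tag], 1)
            return
        self.open_tags.append(tag)
        # Nesting counts only open elements of this same type.
        nesting = self.open_tags.count(tag)
        self.max_nesting[tag] = max(self.max_nesting[tag], nesting)

    def handle_endtag(self, tag):
        # Assumes well-formed input: the end tag closes the innermost element.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Every open element is charged, so text inside a table within a
        # table is counted twice for "table".
        for tag in self.open_tags:
            self.chars[tag] += len(data)


def stats(html):
    """Return {tag: (max_nesting, count, total_chars, average_size)}."""
    parser = ElementStats()
    parser.feed(html)
    parser.close()
    return {
        tag: (parser.max_nesting[tag], n, parser.chars[tag],
              parser.chars[tag] // n)
        for tag, n in parser.count.items()
    }


if __name__ == "__main__":
    nested = ("<table><tr><td><table><tr><td>abc</td></tr></table>"
              "</td></tr></table>")
    for tag, row in sorted(stats(nested).items()):
        print(tag, *row)
```

The average uses integer division, which is consistent with the truncated averages in the sample output (for example, 3634 characters in 56 a elements is shown as 64).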
Here is the first part of the output for http://perl.org.
http://perl.org
Complexity = 0.746
Element  Maximum  Number of  Size in     Average
         Nesting  Elements   Characters  Size
a        1        56         3634        64
body     1        1          12171       12171
div      5        30         33605       1120
em       1        1          13          13
h1       1        1          60          60
h4       1        11         932         84
THE COMPLEXITY METRIC
I originally was tempted to call the complexity metric a "quality metric", but decided that was going too far. Well-designed websites often have low numbers, but high numbers don't mean low quality -- it depends on what the mission is, and how well complexity is being leveraged to serve that mission.
To obtain the complexity metric, the nesting depth of the average character is divided by the logarithm of the length of the HTML. The idea is that as a web page grows, all else being equal, it is reasonable for the nesting depth to grow logarithmically, but no faster.
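The calculation can be sketched in Python as follows. This is an illustration only, not the code in urhtml_stats or Marpa::UrHTML. It uses Python's standard html.parser and makes assumptions the text above does not settle: the input is well formed, only text characters (not markup) are averaged, the length of the HTML is the length of the whole source string, and the logarithm is the natural logarithm. The names DepthCounter and complexity are hypothetical.

```python
import math
from html.parser import HTMLParser

# Elements that never take a closing tag; they do not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DepthCounter(HTMLParser):
    """Hypothetical helper: sums the nesting depth of every text character."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of currently open elements
        self.depth_sum = 0  # sum of depths over all text characters
        self.chars = 0      # number of text characters seen

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-formed input; the guard only prevents negative depth.
        if tag not in VOID and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        self.depth_sum += self.depth * len(data)
        self.chars += len(data)


def complexity(html):
    """Average character depth divided by log(length of the HTML)."""
    parser = DepthCounter()
    parser.feed(html)
    parser.close()
    if parser.chars == 0 or len(html) < 2:
        return 0.0
    average_depth = parser.depth_sum / parser.chars
    # Assumption: natural logarithm; the base is not stated in the text.
    return average_depth / math.log(len(html))


if __name__ == "__main__":
    page = "<html><body><p>abcd</p></body></html>"
    print(f"Complexity = {complexity(page):.3f}")
```

Because of these assumptions, the numbers this sketch produces should not be expected to match the program's own output for the same page.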
How seriously should you take any of this? I am frankly not sure. The main purpose of this program was not to analyze web pages, but to draw attention to the underlying technology. Speaking of which ...
PURPOSE
This program is a demo of a demo. Its purpose is to show how easy it is to write applications which look at the structure of web pages using Marpa::UrHTML.
Determining the structure of an HTML document has in the past been considered a very difficult programming task, requiring lots of special-case coding. Marpa::UrHTML was written in a few days, and the resulting grammar and code are very natural and straightforward.
Other parsers may be preferable to Marpa::UrHTML as parsers. They had better be perfect, because the code in them is excruciatingly difficult. The logic in Marpa::UrHTML, as the documentation will show, is very straightforward. It is much easier to understand, and therefore would be much easier to change, than previous approaches.
The transparency of Marpa::UrHTML, in turn, comes from the flexibility and power of Marpa, the underlying parser. Marpa is a general BNF parser based on a new algorithm derived from Jay Earley's.
AUTHOR
Jeffrey Kegler
BUGS
Please report any bugs or feature requests to bug-parse-marpa at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Marpa. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Marpa
You can also look for information at:
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
RT: CPAN's request tracker
Search CPAN
ACKNOWLEDGMENTS
The starting template for this code was HTML::TokeParser, by Gisle Aas.
LICENSE AND COPYRIGHT
Copyright 2007-2009 Jeffrey Kegler, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.