NAME
HTML::WikiConverter - An HTML to wiki markup converter
SYNOPSIS
use HTML::WikiConverter;
my $wc = new HTML::WikiConverter( dialect => 'MediaWiki' );
print $wc->html2wiki( $html );
DESCRIPTION
HTML::WikiConverter converts HTML source into a variety of wiki markups, called wiki "dialects".
METHODS
- $wc = new HTML::WikiConverter( dialect => '...', [ %attrs ] )
Returns a converter for the specified dialect. Dies if 'dialect' is not provided or is not installed on your system. Additional parameters are optional and can be specified in %attrs:
    base_uri    URI to use for converting relative URIs to absolute ones
    wiki_uri    URI used in determining which links are wiki links. For example, the English Wikipedia would use 'http://en.wikipedia.org/wiki/'
- $wiki = $wc->html2wiki( $html )
Converts the HTML source into wiki markup for the current dialect.
- $html = $wc->parsed_html
Returns an HTML representation of the most recently parsed syntax tree. Use this to see how your input HTML was parsed internally, which is often useful for debugging.
- $base_uri = $wc->base_uri( [ $new_base_uri ] )
Gets or sets the 'base_uri' option used for converting relative to absolute URIs.
- $wiki_uri = $wc->wiki_uri( [ $new_wiki_uri ] )
Gets or sets the 'wiki_uri' option used for determining which links are links to wiki pages.
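A short usage sketch combining the constructor options and accessors described above. The HTML fragment and the example.com URIs are made up for illustration, and the MediaWiki dialect is assumed to be installed; no output is shown because it depends on the dialect version.

```perl
use strict;
use warnings;
use HTML::WikiConverter;

# Assumes the MediaWiki dialect module is installed
my $wc = HTML::WikiConverter->new(
    dialect  => 'MediaWiki',
    base_uri => 'http://www.example.com/docs/',  # illustrative value
    wiki_uri => 'http://www.example.com/wiki/',  # illustrative value
);

my $html = '<p>See the <a href="/wiki/Sandbox">sandbox</a> and <b>be bold</b>.</p>';
print $wc->html2wiki( $html ), "\n";

# Inspect how the input was parsed, e.g. when the output looks wrong
print $wc->parsed_html, "\n";

# Accessors return the current value, or set a new one when given an argument
print $wc->base_uri, "\n";
$wc->wiki_uri( 'http://en.wikipedia.org/wiki/' );
```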
UTILITY METHODS
These methods are for use only by dialect modules.
- $wiki = $wc->get_elem_contents( $node )
Converts the contents of $node (i.e. its children) into wiki markup.
- $title = $wc->get_wiki_page( $url )
Attempts to extract the title of a wiki page from the given URL, returning the title on success and undef on failure. If 'wiki_uri' is empty, this method always returns undef.
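As a rough illustration of this behaviour, and not the module's actual code, the following sketch extracts a page title by stripping the 'wiki_uri' prefix from a URL. The function name extract_wiki_title() is hypothetical, and plain prefix matching is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_wiki_page(); assumes that a wiki link is
# any URL that begins with the 'wiki_uri' prefix.
sub extract_wiki_title {
    my( $wiki_uri, $url ) = @_;

    # An empty or missing 'wiki_uri' never matches
    return unless defined $wiki_uri and length $wiki_uri;
    return unless defined $url;

    # The URL must begin with the prefix and have something after it
    return unless index( $url, $wiki_uri ) == 0;
    my $title = substr( $url, length($wiki_uri) );
    return length($title) ? $title : undef;
}

my $wiki_uri = 'http://en.wikipedia.org/wiki/';
my $title = extract_wiki_title( $wiki_uri, 'http://en.wikipedia.org/wiki/Perl' );
print defined $title ? "$title\n" : "(not a wiki link)\n";
```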
- $bool = $wc->is_camel_case( $str )
Returns true if the string in $str is in CamelCase, false otherwise. Code is taken from CGI::Kwiki's formatting module.
- $attr_str = $wc->get_attr_str( $node, @attrs )
Returns a string containing the specified attributes in the given node. The returned string is suitable for insertion into an HTML tag.
DIALECTS
HTML::WikiConverter can convert HTML into markup for a variety of wiki engines. The markup used by a particular engine is called a wiki markup dialect. Support for a dialect is added by installing a dialect module, which provides the rules for how HTML is converted into that dialect's wiki markup.
Dialect modules are registered in the HTML::WikiConverter:: namespace and are usually given names in CamelCase. For example, the rules for the MediaWiki dialect are provided in HTML::WikiConverter::MediaWiki, and those for PhpWiki in HTML::WikiConverter::PhpWiki.
Supported dialects
HTML::WikiConverter supports conversions for the following dialects:
Kwiki
MediaWiki
MoinMoin
PhpWiki
PmWiki
UseMod
While under most conditions each will produce satisfactory wiki markup, the complete syntactic sugar of each dialect has not yet been implemented. Suggestions, especially in the form of patches, are very welcome.
Of these, the MediaWiki dialect is probably the most complete. I am a Wikipediholic, after all. :-)
Conversion rules
To interface with HTML::WikiConverter, dialect modules must define a single rules() class method. It returns a reference to a hash of rules that specify how individual HTML elements are converted to wiki markup. The following rules are recognized:
start
end
preserve
attributes
replace
alias
block
line_format
line_prefix
trim
trim_leading
trim_trailing
For example, the following rules() method could be used for a wiki dialect that uses *asterisks* for bold and _underscores_ for italic text:
sub rules {
    return {
        b => { start => '*', end => '*' },
        i => { start => '_', end => '_' }
    };
}
To add <strong> and <em> as aliases of <b> and <i>, use the 'alias' rule:
sub rules {
    return {
        b      => { start => '*', end => '*' },
        strong => { alias => 'b' },
        i      => { start => '_', end => '_' },
        em     => { alias => 'i' }
    };
}
(If you specify the 'alias' rule, no other rules are allowed.)
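Putting these pieces together, a complete dialect module might look like the sketch below. The dialect name 'StarWiki' is hypothetical, and the sketch assumes that defining rules() is all a dialect module needs, as stated above.

```perl
package HTML::WikiConverter::StarWiki;  # hypothetical dialect name
use strict;
use warnings;

# Class method returning a reference to the hash of conversion rules
sub rules {
    return {
        b      => { start => '*', end => '*' },
        strong => { alias => 'b' },
        i      => { start => '_', end => '_' },
        em     => { alias => 'i' },
    };
}

1;
```

Saved as HTML/WikiConverter/StarWiki.pm somewhere in @INC, such a module would presumably be selected by passing dialect => 'StarWiki' to the constructor.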
Many wiki dialects separate paragraphs and other block-level elements with a blank line. To indicate this, use the 'block' keyword:
p => { block => 1 }
(Note that if a block-level element is nested inside another block-level element, blank lines are only added to the outermost block-level element.)
However, many such wiki engines require that the text of a paragraph be contained on a single line, or that a paragraph contain no blank lines. These formatting options can be specified using the 'line_format' keyword, which can be assigned the value 'single', 'multi', or 'blocks'.
If the element must be contained on a single line, then the 'line_format' option should be 'single'. If the element can span multiple lines, but there can be no blank lines contained within, then it should be 'multi'. If blank lines (which delimit blocks) are allowed, then it should be 'blocks'. For example, paragraphs are specified like so in the MediaWiki dialect:
p => { block => 1, line_format => 'multi', trim => 1 }
The 'trim' option indicates that leading and trailing whitespace should be stripped from the paragraph before other rules are processed. You can use 'trim_leading' and 'trim_trailing' if you only want whitespace trimmed from one end of the content.
Some multi-line elements require that each line of output be prefixed with a particular string. For example, preformatted text in the MediaWiki dialect is prefixed with one or more spaces. This is specified using the 'line_prefix' option:
pre => { block => 1, line_prefix => ' ' }
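To make the effect of 'line_prefix' concrete, the sketch below prepends a string to every line of a block of text. The helper apply_line_prefix() is hypothetical and only approximates what the converter does internally.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend $prefix to each line of $text.
# The -1 limit keeps trailing empty lines instead of discarding them.
sub apply_line_prefix {
    my( $text, $prefix ) = @_;
    return join "\n", map { "$prefix$_" } split /\n/, $text, -1;
}

my $pre = "sub hello {\n    print 'hi';\n}";
print apply_line_prefix( $pre, ' ' ), "\n";
```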
In some cases, conversion from HTML to wiki markup is as simple as string replacement. When you want to replace a tag and its contents with a particular string, use the 'replace' option. For example, in the PhpWiki dialect, three percent signs '%%%' represent a line break (<br>), hence the rule:
br => { replace => '%%%' }
(If you specify the 'replace' option, no other options are allowed.)
Finally, many wiki dialects allow a subset of HTML in their markup, such as for superscripts, subscripts, and text centering. HTML tags may be preserved using the 'preserve' option. For example, to allow the <font> tag in wiki markup, one might say:
font => { preserve => 1 }
(The 'preserve' rule cannot be combined with the 'start' or 'end' rules.)
Preserved tags may also specify a whitelist of attributes that are allowed to pass through from HTML to wiki markup. This is done with the 'attributes' option:
font => { preserve => 1, attributes => [ qw/ face size / ] }
(The 'attributes' rule must be used in conjunction with the 'preserve' rule.)
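The whitelist can be pictured as a filter over a tag's attributes. The sketch below is a hypothetical illustration, not the module's code: preserved_start_tag() and its plain-hash representation of attributes are assumptions, and attribute values are not HTML-escaped.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild a start tag, keeping only whitelisted attributes
sub preserved_start_tag {
    my( $tag, $attrs, $whitelist ) = @_;
    my @pairs = map  { sprintf( '%s="%s"', $_, $attrs->{$_} ) }
                grep { defined $attrs->{$_} } @$whitelist;
    return '<' . join( ' ', $tag, @pairs ) . '>';
}

# The 'color' attribute is not whitelisted, so it is dropped
my %attrs = ( size => '2', color => 'red' );
print preserved_start_tag( 'font', \%attrs, [ qw/ face size / ] ), "\n";
```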
Dynamic rules
Instead of simple strings, you may use coderefs as option values for the 'start', 'end', 'replace', and 'line_prefix' rules. If you do, the code will be called with three arguments: 1) the current HTML::WikiConverter instance, 2) the current HTML::Element node, and 3) the rules for that node (as a hashref).
Specifying rules dynamically is often useful for handling nested elements. For example, the MoinMoin dialect uses the following rules for lists:
ul => { line_format => 'multi', block => 1, line_prefix => ' ' },
li => { start => \&_li_start, trim_leading => 1 },
ol => { alias => 'ul' }
It then defines _li_start() like so:
sub _li_start {
    my( $wc, $node, $rules ) = @_;
    my $bullet = '';
    $bullet = '*'  if $node->parent->tag eq 'ul';
    $bullet = '1.' if $node->parent->tag eq 'ol';
    return "\n$bullet ";
}
This ensures that every unordered list item is prefixed with '*' and every ordered list item is prefixed with '1.', per the MoinMoin markup. It also ensures that each list item is on a separate line and that there is a space between the prefix and the content of the list item.
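To see what _li_start() returns without running a full conversion, it can be called directly with stand-in objects. In the sketch below, the StubNode class is hypothetical and implements only the parent() and tag() methods that _li_start() uses; real nodes are HTML::Element objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML::Element, with just tag() and parent()
package StubNode;
sub new    { my( $class, %args ) = @_; return bless { %args }, $class }
sub tag    { $_[0]{tag} }
sub parent { $_[0]{parent} }

package main;

sub _li_start {
    my( $wc, $node, $rules ) = @_;
    my $bullet = '';
    $bullet = '*'  if $node->parent->tag eq 'ul';
    $bullet = '1.' if $node->parent->tag eq 'ol';
    return "\n$bullet ";
}

my $ul_item = StubNode->new( tag => 'li', parent => StubNode->new( tag => 'ul' ) );
my $ol_item = StubNode->new( tag => 'li', parent => StubNode->new( tag => 'ol' ) );

# _li_start() ignores the converter and rules arguments, so undef is passed
print _li_start( undef, $ul_item, undef );  # a newline followed by "* "
print _li_start( undef, $ol_item, undef );  # a newline followed by "1. "
print "\n";
```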
Rule validation
Certain rule combinations are not allowed. For example, the 'replace' and 'alias' rules cannot be combined with any other rules, and 'attributes' can only be specified alongside 'preserve'. Invalid rule combinations will trigger an error when the dialect module is loaded.
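The checks described here could be sketched as follows. The validate_rules() function is hypothetical and covers only the combinations named in this document; it collects messages rather than raising an error, so all problems can be reported at once.

```perl
use strict;
use warnings;

# Hypothetical validator; returns a list of error messages (empty if valid)
sub validate_rules {
    my( $rules ) = @_;
    my @errors;
    for my $tag ( sort keys %$rules ) {
        my %opts = %{ $rules->{$tag} };

        # 'alias' and 'replace' must each appear alone
        for my $solo ( qw/ alias replace / ) {
            push @errors, "$tag: '$solo' cannot be combined with other rules"
                if exists $opts{$solo} and scalar( keys %opts ) > 1;
        }
        push @errors, "$tag: 'preserve' cannot be combined with 'start' or 'end'"
            if exists $opts{preserve} and ( exists $opts{start} or exists $opts{end} );
        push @errors, "$tag: 'attributes' requires 'preserve'"
            if exists $opts{attributes} and not exists $opts{preserve};
    }
    return @errors;
}

my @errors = validate_rules( {
    b    => { start => '*', end => '*' },           # valid
    br   => { replace => '%%%', block => 1 },       # invalid: 'replace' not alone
    font => { attributes => [ qw/ face size / ] },  # invalid: no 'preserve'
} );
print "$_\n" for @errors;
```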
Preprocessing
The first step in converting HTML source to wiki markup is to parse the HTML into a syntax tree using HTML::TreeBuilder. It is often useful for dialects to preprocess the tree prior to converting it into wiki markup. Dialects that elect to preprocess the tree do so by defining a preprocess_node() class method, which will be called on each node of the tree (traversal is done in pre-order). The method receives three arguments: 1) the dialect's package name, 2) the current HTML::WikiConverter instance, and 3) the current HTML::Element node being traversed. It may modify the node or decide to ignore it. The return value of the preprocess_node() method is not used.
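A minimal sketch of a preprocess_node() method, using the argument order described above. The dialect name 'StarWiki' and both clean-up steps are purely illustrative; tag() and attr() are standard HTML::Element methods, and passing undef as the value to attr() removes the attribute.

```perl
package HTML::WikiConverter::StarWiki;  # hypothetical dialect name
use strict;
use warnings;

# Called once per node, in pre-order; the return value is ignored
sub preprocess_node {
    my( $pkg, $wc, $node ) = @_;

    # Illustrative only: drop inline styles before conversion
    $node->attr( 'style', undef ) if defined $node->attr('style');

    # Illustrative only: rename <strike> to <s> so one rule covers both
    $node->tag('s') if $node->tag eq 'strike';

    return;
}

1;
```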
Because they are so commonly needed, two preprocessing steps are automatically carried out by HTML::WikiConverter, regardless of the dialect: 1) relative URIs in images and links are converted to absolute URIs (based upon the 'base_uri' parameter), and 2) ignorable text (e.g. between </td> and <td>) is discarded.
SEE ALSO
HTML::TreeBuilder
HTML::Element
AUTHOR
David J. Iberri <diberri@yahoo.com>
COPYRIGHT
Copyright (c) 2004-2005 David J. Iberri
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html