NAME

HTML::GenToc - Generate/insert anchors and a Table of Contents (ToC) for HTML documents.

SYNOPSIS

  use HTML::GenToc;

  # create a new object with the default options
  my $toc = new HTML::GenToc();

  # or create one with explicit options
  $toc = new HTML::GenToc(title=>"Table of Contents",
			  toc_file=>$my_toc_file,
			  toc_entry=>{
			    H1=>1,
			    H2=>2
			  },
			  toc_end=>{
			    H1=>'/H1',
			    H2=>'/H2'
			  }
    );

  # add further arguments
  $toc->args(toc_tag=>"BODY",
	     toc_tag_replace=>0,
    );

  # generate anchors for a file
  $toc->generate_anchors(infile=>$html_file,
			 overwrite=>0,
    );

  # generate a ToC from a file
  $toc->generate_toc(infile=>$html_file,
		     footer=>$footer_file,
		     header=>$header_file
    );

DESCRIPTION

HTML::GenToc allows you to specify "significant elements" that will be hyperlinked to in a "Table of Contents" (ToC) for a given set of HTML documents. It does not require those documents to be strict HTML, which makes it suitable for use with templates and meta-languages such as WML.

The generated ToC is a multi-level list containing links to the significant elements. HTML::GenToc places the link to each significant element at the level specified by the user.

Example:

If H1s are specified as level 1, then they appear in the first-level list of the ToC. If H2s are specified as level 2, then they appear in a second-level list in the ToC.

Information on the significant elements and the level at which they should occur is passed in to the methods used by this object, or one can use the defaults.

There are two phases to the ToC generation. The first phase is to put suitable anchors into the HTML documents, and the second phase is to generate the ToC from HTML documents which have anchors in them for the ToC to link to.

For more information on controlling the contents of the created ToC, see "Formatting the ToC".

HTML::GenToc also supports the ability to incorporate the ToC into the HTML document itself via the inline option. See "Inlining the ToC" for more information.

In order to support linking to significant elements, HTML::GenToc inserts anchors into the significant elements. One can use HTML::GenToc as a filter, outputting the result to another file, or one can overwrite the original file, with the original backed up with a suffix (default: "org") appended to the filename.

METHODS

All arguments can be set when the object is created, and further options can be set on any method (though some may not make sense). Methods take their arguments either as a hash, or as a reference to an array (though the array usage is deprecated and is here for backwards compatibility only).

The arguments get treated differently depending on whether they are given in a hash or a reference to an array. When the arguments are in a hash, the argument-keys are expected to have values matching those required for that argument -- whether that be a boolean, a string, a reference to an array or a reference to a hash. These will replace any value for that argument that might have been there before.

When the arguments are in a reference to an array, it is treated as if it were a command line:

  • Option names are expected to start with '--' or '-'.

  • Boolean options are set to true as soon as the option is given (no value is expected to follow). A boolean option with the word "no" prepended sets the option to false.

  • String options are expected to have a string value following.

  • Options which are internally arrays or hashes are cumulative; that is, the value following the --option is added to the current set for that option. To add more, repeat the --option with the next value. To reset the option to empty, give the special value "CLEAR".

(You can see why I want to phase this out -- it makes the code more complicated, ambiguous and prone to error.)
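The array-style rules above can be sketched in a few lines of plain Perl. This is only an illustration of the described behaviour, not the module's own code: the helper name fold_array_args and the %type table are hypothetical, and the real module keeps its own list of option types.

```perl
use strict;
use warnings;

# Hypothetical option-type table (an assumption for this sketch).
my %type = (
    overwrite => 'bool',
    quiet     => 'bool',
    bak       => 'string',
    infile    => 'array',
    toc_entry => 'hash',
);

# Fold command-line style arguments into a plain hash of settings.
sub fold_array_args {
    my ($settings, $args) = @_;
    my @queue = @$args;
    while (@queue) {
        my $opt = shift @queue;
        (my $name = $opt) =~ s/^--?//;    # strip the leading '-' or '--'
        if (exists $type{$name}) {
            my $kind = $type{$name};
            if ($kind eq 'bool') {
                $settings->{$name} = 1;   # no value follows a boolean
                next;
            }
            my $value = shift @queue;     # every other kind takes a value
            if ($kind eq 'string') {
                $settings->{$name} = $value;
            } elsif ($kind eq 'array') {
                if ($value eq 'CLEAR') { $settings->{$name} = [] }
                else { push @{ $settings->{$name} }, $value }
            } else {                      # hash option given as 'key=value'
                if ($value eq 'CLEAR') { $settings->{$name} = {} }
                else {
                    my ($k, $v) = split /=/, $value, 2;
                    $settings->{$name}{$k} = $v;
                }
            }
        } elsif ($name =~ /^no(.+)$/ && ($type{$1} // '') eq 'bool') {
            $settings->{$1} = 0;          # 'no' prefix turns a boolean off
        } else {
            die "unknown option: $opt\n";
        }
    }
    return $settings;
}

my $settings = fold_array_args({}, [
    '--infile', 'a.html', '--infile', 'b.html',
    '--toc_entry', 'H1=1', '--toc_entry', 'H2=2',
    '--overwrite', '--noquiet', '--bak', 'orig',
]);
```

Known option names are checked before the "no" prefix, so an option whose own name begins with "no" is not mistaken for a negated boolean. In hash form none of this parsing is needed, which is one reason the hash form is preferred.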

Method -- new

    $toc = new HTML::GenToc();

    $toc = new HTML::GenToc(\@args); # deprecated!

    $toc = new HTML::GenToc(toc_entry=>\%my_toc_entry,
	toc_end=>\%my_toc_end,
	bak=>'bak',
    	...
        );

Creates a new HTML::GenToc object.

The arguments can either be a hash, or (deprecated) a reference to an array of arguments. These arguments will be used in invocations of other methods.

See the other methods for possible arguments.

Method -- args

$toc->args(\@args); # deprecated!

$toc->args(infile=>['myfile.html', 'thatfile.html']);

Updates the current arguments/options of the HTML::GenToc object. Takes either a hash, or (deprecated) a reference to an array of arguments, which will be used in invocations of other methods.

In hash form, each key is just the bare name of the argument. In array form, each argument name must have a '-' or '--' in front of it.

Common Options

The following arguments apply to both generating anchors and generating table-of-contents phases, so they are shown here, rather than repeating them for each method.

bak

bak => string

If the input files are being overwritten (overwrite is on), copy each original file to "filename.string". If the value is empty, no backup file is written. (default: org)

debug

debug => 1

Enable verbose debugging output. Used for debugging this module; in other words, don't bother. (default:off)

file_strings

file_strings => \@content_of_files

If this is set, then the files in the infile array are not read, but the content of this array is used instead, with the names in the infile array just used as names. The infile array and the file_strings array must be the same size. (see also set_file_strings)

(this option is not supported in the old array format)

infile

infile => \@files

infile => ['index.html']

Input file(s). This expects a reference to an array of files.

'--infile', $file (array version, deprecated)

If the arguments are a reference to an array (the old way) then a single filename is expected; if you want to process more than one file in this form, just add another --infile, $filename to the array of arguments. In the arrayref form, use the special name "CLEAR" to clear the current array of input files, if you want to process a different file.

(see in_string and file_strings also)

(default:undefined)

in_string

in_string => $string

Use the given string instead of reading in the content of the first infile. The first infile is then taken just to be the filename to use in referring to this content. This is useful if one is using this module as part of other processing; the filename given could be the name of the final output file which hasn't been written yet. Note that it is still necessary to give at least one infile name when using this option.

notoc_match

notoc_match => string

If there are individual tags you don't wish to include in the table of contents, even though they match the "significant elements", they will be skipped when this pattern matches the content inside the tag itself (not the body of the element). Such tags are excluded both when generating anchors and when generating the ToC. (default: class="notoc")
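As a rough illustration of what "inside the tag, not the body" means, the following sketch tests the pattern against the text of a start tag only. The helper name is_excluded is hypothetical, and whether the module matches case-sensitively is not stated here, so this sketch simply uses the pattern as given.

```perl
use strict;
use warnings;

my $notoc_match = 'class="notoc"';    # the documented default

# True if the start tag's own text (not the element body) matches.
sub is_excluded {
    my ($start_tag) = @_;
    return $start_tag =~ /$notoc_match/ ? 1 : 0;
}

# The first heading would be left out of the ToC, the second kept.
my $skipped = is_excluded('<H2 class="notoc">');
my $kept    = is_excluded('<H2>');
```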

overwrite

overwrite => 1

Overwrite the input file with the output. If this is in effect, outfile and toc_file are ignored. Used in generate_anchors for creating the anchors "in place" and in generate_toc if the inline option is in effect. (default:off)

quiet

quiet => 1

Suppress informative messages. (default: off)

toc_end

toc_end => \%toc_end_data

%toc_end_data = ( tag1 => endtag1, tag2 => endtag2 );

toc_end => { H1 => '/H1', H2 => '/H2' }

For defining significant elements. The tag is the HTML tag which marks the start of the element. The endtag is the HTML tag which marks the end of the element. When matching in the input file, case is ignored (but make sure that all your tag options referring to the same tag are exactly the same!).

'--toc_end', 'tag=endtag'

For the (deprecated) array-ref form of arguments, this is a cumulative hash argument; if you wish to clear the default, give '--toc_end', 'CLEAR' to do so.

(default: H1=/H1 H2=/H2)

toc_entry

toc_entry => \%toc_entry_data

%toc_entry_data = ( tag1 => level1, tag2 => level2 );

toc_entry => { H1 => 1, H2 => 2 }

For defining significant elements. The tag is the HTML tag which marks the start of the element. The level is what level the tag is considered to be. The value of level must be numeric, and non-zero. If the value is negative, consecutive entries represented by the significant_element will be separated by the value set by the entrysep option.

'--toc_entry', 'tag=level'

For the (deprecated) array-ref form of arguments, this is a cumulative hash argument; if you wish to clear the default, give '--toc_entry', 'CLEAR' to do so.

(default: H1=1 H2=2)
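To make the relationship between toc_entry and toc_end concrete, here is a minimal sketch that finds significant elements in a string and pairs each with its level. It is not how HTML::GenToc itself parses documents; the helper name significant_elements is hypothetical, the matching is a simple regular expression, and negative levels are not handled.

```perl
use strict;
use warnings;

my %toc_entry = (H1 => 1, H2 => 2);          # tag => level
my %toc_end   = (H1 => '/H1', H2 => '/H2');  # tag => end tag

# Collect [level, text] pairs for each significant element in $html.
sub significant_elements {
    my ($html) = @_;
    my $starts = join '|', map { quotemeta($_) } keys %toc_entry;
    my @found;
    while ($html =~ /<($starts)\b[^>]*>/gi) {
        my $tag = uc $1;                     # case is ignored in the input
        my $end = quotemeta $toc_end{$tag};
        # /c keeps the search position if the end tag is missing
        if ($html =~ /\G(.*?)<$end\s*>/gcis) {
            (my $text = $1) =~ s/<[^>]*>//g; # drop markup nested in the element
            push @found, [ $toc_entry{$tag}, $text ];
        }
    }
    return @found;
}

my $html  = '<h1>Intro</h1><p>x</p><H2 class="a">Sub <em>one</em></H2><h1>End</h1>';
my @found = significant_elements($html);

# Print a plain-text outline, indented by level.
print '  ' x ($_->[0] - 1), "- $_->[1]\n" for @found;
```

Note that the keys of both hashes are spelled identically (H1, H2), which is the point of the warning above about keeping tag options exactly the same.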

tocmap

tocmap => file

ToC map file defining significant elements. (This is deprecated, and is only here for backwards compatibility with htmltoc.)

This overrides any previously set toc_entry, toc_end, toc_before and toc_after options. It is inadvisable to use both tocmap and toc_entry/toc_end/toc_before/toc_after options in the same call, as it is undefined as to which one will override the other.

See "ToC Map File" for further information.

to_string

to_string => 1

Return the modified HTML output as a string. This does not override other methods of output, except in the case where the user hasn't specified any other method of output; then the default method (STDOUT) is overridden to just output to the string.

Method -- setting

my $of = $toc->setting('outfile');

Get the value of a given option.

Method -- generate_anchors

$toc->generate_anchors(outfile=>"index2.html");

my $result_str = $toc->generate_anchors(to_string=>1);

Generates anchors for the significant elements in the HTML documents. Takes either a hash, or (deprecated) a reference to an array of arguments. These arguments will be used to influence this method's behaviour (and if arguments have already been set earlier, they also will be taken into account).

See "Common Options" for the common options which can be passed into this method.

The following arguments apply only to generating anchors.

outfile

outfile => file

File to write the output to. This is where the modified be-anchored HTML output goes to. Note that it doesn't make sense to use this option if you are processing more than one file. If you give '-' as the filename, then output will go to STDOUT. (default: STDOUT)

set_file_strings

set_file_strings => 1

If set_file_strings is set (even if file_strings is not set) then the transformed output of each file is placed in $toc->setting('file_strings'), replacing any previous content there. Note that if this is set, it has the effect of quietly setting file_strings for a subsequent call to generate_anchors or generate_toc.

(this option is not supported in the old array format)

useorg

useorg => 1

Use pre-existing backup files as the input source; that is, files of the form infile.bak, where bak is the value of the bak option (see infile and bak).

Method -- generate_toc

    $toc->generate_toc(title=>"Contents",
	toc_file=>'toc.html');

    my $result_str = $toc->generate_toc(title=>"The Contents",
	to_string=>1);

Generates a Table of Contents (ToC) for the significant elements in the HTML documents. Takes either a hash, or (deprecated) a reference to an array of arguments. These arguments will be used to influence this method's behaviour (and if arguments have already been set earlier, they also will be taken into account).

See "Common Options" for the common options which can be passed into this method.

The following arguments apply only to generating a table-of-contents.

entrysep

entrysep => string

Separator string for non-<li> item entries (default: ", ")

footer

footer => file

File containing footer text for ToC.

header

header => file

File containing header text for ToC.

inline

inline => 1

Put ToC in document at a given point. See "Inlining the ToC" for more information.

ol

ol => 1

Use an ordered list for level 1 ToC entries.

ol_num_levels

ol_num_levels => 2

The number of levels deep the OL listing will go if ol is true. If set to zero, will use an ordered list for all levels. (default:1)

(this option is not supported in the old array format)

textonly

textonly => 1

Use only text content in significant elements.

title

title => string

Title for ToC page (if not using header or inline or toc_only) (default: "Table of Contents")

toc_after

toc_after => \%toc_after_data

%toc_after_data = ( tag1 => suffix1, tag2 => suffix2 );

toc_after => { H2=>'</em>' }

For defining layout of significant elements in the ToC.

If the arguments are in hash form, this expects a reference to a hash of tag=>suffix pairs.

The tag is the HTML tag which marks the start of the element. The suffix is what is required to be appended to the Table of Contents entry generated for that tag.

'--toc_after', 'tag=suffix' (array version, deprecated)

If the arguments are in arrayref form, this is a cumulative argument; each instance of --toc_after, value in the array adds another pair to the internal hash; if you wish to clear it, give '--toc_after', 'CLEAR' to do so.

(default: undefined)

toc_before

toc_before => \%toc_before_data

%toc_before_data = ( tag1 => prefix1, tag2 => prefix2 );

toc_before=>{ H2=>'<em>' }

For defining the layout of significant elements in the ToC. The tag is the HTML tag which marks the start of the element. The prefix is what is required to be prepended to the Table of Contents entry generated for that tag.

'--toc_before', 'tag=prefix' (array version, deprecated)

For the array-ref form of arguments, this is a cumulative hash argument; if you wish to clear it, give '--toc_before', 'CLEAR' to do so.

(default: undefined)

toc_file

toc_file => file

File to write the table-of-contents output to. If you give '-' as the filename, then output will go to STDOUT. (default: STDOUT)

toclabel

toclabel => string

HTML text that labels the ToC. Always used. (default: "<H1>Table of Contents</H1>")

toc_tag

toc_tag => string

If a ToC is to be included inline, this is the pattern which is used to match the tag where the ToC should be put. This can be a start-tag, an end-tag or a comment, but the < should be left out; that is, if you want the ToC to be placed after the BODY tag, then give "BODY". If you want a special comment tag to mark where the ToC should go, then include the comment marks, for example: "!--toc--" (default: BODY)

toc_tag_replace

toc_tag_replace => 1

In conjunction with toc_tag, this is a flag to say whether the given tag should be replaced, or if the ToC should be put after the tag. This can be useful if your toc_tag is a comment and you don't need it after you have the ToC in place. (default:false)

toc_only

toc_only => 1

Output only the Table of Contents, that is, the Table of Contents plus the toclabel. If there is a header or a footer, these will also be output.

If toc_only is false and inline is not true, then a suitable HTML page header will be output if there is no header, and an HTML page footer will be output if there is no footer.

(default:false)

FILE FORMATS

ToC Map File

For backwards compatibility with htmltoc, this method of specifying significant elements for the ToC is retained. It is, however, deprecated and will be removed in a future version.

The ToC map file allows you to specify what significant elements to include in the ToC, at what level they should appear in the ToC, and any text to include before and/or after the ToC entry. The format of the map file is as follows:

significant_element:level:sig_element_end:before_text,after_text
significant_element:level:sig_element_end:before_text,after_text
...

Each line of the map file contains a series of fields separated by the ':' character. The definition of each field is as follows:

  • significant_element

    The tag name of the significant element. Example values are H1, H2, H5. This field is case-insensitive.

  • level

    What level the significant element occupies in the ToC. This value must be numeric, and non-zero. If the value is negative, consecutive entries represented by the significant_element will be separated by the value set by the entrysep option.

  • sig_element_end (Optional)

    The tag name that signifies the termination of the significant_element.

    Example: The DT tag is a marker in HTML and not a container. However, one can index DT sections of a definition list by using the value DD in the sig_element_end field (this does assume that each DT has a DD following it).

    If the sig_element_end is empty, then the corresponding end tag of the specified significant_element is used. Example: If H1 is the significant_element, then the program looks for a "</H1>" for terminating the significant_element.

    Caution: the sig_element_end value should not contain the '<' and '>' tag delimiters. If you want the sig_element_end to be the end tag of an element other than the significant_element, then use "/element_name".

    The sig_element_end field is case-insensitive.

  • before_text,after_text (Optional)

    This is literal text that will be inserted before and/or after the ToC entry for the given significant_element. The before_text is separated from the after_text by the ',' character (which implies a comma cannot be contained in the before/after text). See examples following for the use of this field.

In the map file, the first two fields MUST be specified.
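The field layout described above can be sketched as a small parser that turns map-file text into four hashes, one each for the level, end tag, before text and after text of every significant element. This is an illustrative sketch only, not the module's parser: the function name parse_tocmap is hypothetical, and it assumes that a '#' never appears inside a field.

```perl
use strict;
use warnings;

# Parse ToC map text into four hashes keyed by (upper-cased) tag name.
sub parse_tocmap {
    my ($map_text) = @_;
    my (%entry, %end, %before, %after);
    for my $line (split /\n/, $map_text) {
        $line =~ s/#.*$//;          # drop a trailing comment
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $line eq '';
        my ($tag, $level, $end, $fmt) = split /:/, $line, 4;
        die "map line needs a tag and a non-zero numeric level: $line\n"
            unless defined $level && $level =~ /^-?[1-9]\d*$/;
        $tag = uc $tag;             # the field is case-insensitive
        $entry{$tag} = $level;
        # an empty end field means the element's own end tag
        $end{$tag} = (defined $end && $end ne '') ? uc($end) : "/$tag";
        if (defined $fmt && $fmt ne '') {
            my ($pre, $post) = split /,/, $fmt, 2;
            $before{$tag} = $pre  if defined $pre  && $pre  ne '';
            $after{$tag}  = $post if defined $post && $post ne '';
        }
    }
    return (\%entry, \%end, \%before, \%after);
}

my ($entry, $end, $before, $after) = parse_tocmap(<<'MAP');
# A ToC map file that can work for Glossary type documents
H1:1::<STRONG>,</STRONG>   # level 1 entries in bold
H2:2
DT:3:DD:<EM>,</EM>
MAP
```

The four hashes have the same shape as the toc_entry, toc_end, toc_before and toc_after options, which is why the hash options can replace the map file.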

Following are a few examples to help illustrate how a ToC map file works.

EXAMPLE 1

The following map file reflects the default mapping used if no map file is explicitly specified:

# Default mapping
# Comments can be inserted in the map file via the '#' character
H1:1 # H1 are level 1 ToC entries
H2:2 # H2 are level 2 ToC entries

EXAMPLE 2

The following map file makes use of the before/after text fields:

# A ToC map file that adds some formatting
H1:1::<STRONG>,</STRONG>      # Make level 1 ToC entries <STRONG>
H2:2::<EM>,</EM>              # Make level 2 entries <EM>
H3:3                          # Make level 3 entries as is

EXAMPLE 3

The following map file tries to index definition terms:

# A ToC map file that can work for Glossary type documents
H1:1
H2:2
DT:3:DD:<EM>,</EM>    # Assumes document has a DD for each DT, otherwise ToC
                   # will get entries with a lot of text.

DETAILS

Formatting the ToC

The toc_entry and other related options give you control over how the ToC entries look, but there are other options to affect the final appearance of the ToC file created.

With the header option, the contents of the given file will be prepended before the generated ToC. This allows you to have introductory text, or any other text, before the ToC.

Note:

If you use the header option, make sure the file specified contains the opening HTML tag, the HEAD element (containing the TITLE element), and the opening BODY tag. However, these tags/elements should not be in the header file if the inline option is used. See "Inlining the ToC" for information on what the header file should contain for inlining the ToC.

With the toclabel option, the contents of the given string will be prepended before the generated ToC (but after any text taken from a header file).

With the footer option, the contents of the file will be appended after the generated ToC.

Note:

If you use the footer, make sure it includes the closing BODY and HTML tags (unless, of course, you are using the inline option).

If the header option is not specified, the appropriate starting HTML markup will be added, unless the toc_only option is specified. If the footer option is not specified, the appropriate closing HTML markup will be added, unless the toc_only option is specified.

If you do not want or need to deal with header and footer files, you can instead specify the title of the ToC file with the title option, and a heading, or label, to put before the list of ToC entries with the toclabel option. Both options have default values.

If you do not want HTML page tags to be supplied, and just want the ToC itself, then specify the toc_only option. If there are no header or footer files, then this will simply output the contents of toclabel and the ToC itself.

Inlining the ToC

The ability to incorporate the ToC directly into an HTML document is supported via the inline option.

Inlining will be done on the first file in the list of files processed, and will only be done if that file contains an opening tag matching the toc_tag value.

If overwrite is true, then the first file in the list will be overwritten, with the generated ToC inserted at the appropriate spot. Otherwise a modified version of the first file is output to either STDOUT or to the output file defined by the toc_file option.

The options toc_tag and toc_tag_replace are used to determine where and how the ToC is inserted into the output.

Example 1

    # this is the default
    $toc->args(toc_tag => 'BODY',
	toc_tag_replace => 0);

This will put the generated ToC after the BODY tag of the first file. If the header option is specified, then the contents of the specified file are inserted after the BODY tag. If the toclabel option is not empty, then the text specified by the toclabel option is inserted. Then the ToC is inserted, and finally, if the footer option is specified, it inserts the footer. Then the rest of the input file follows as it was before.

Example 2

    $toc->args(toc_tag => '!--toc--',
	toc_tag_replace => 1);

This will put the generated ToC in place of the first comment of the form <!--toc-->; that comment will be replaced by the ToC (in the order header, toclabel, ToC, footer), followed by the rest of the input file.

Note:

The header file should not contain the beginning HTML tag and HEAD element since the HTML file being processed should already contain these tags/elements.
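The difference between the two examples can be sketched with a pair of substitutions. This is a simplified illustration, not the module's implementation: the function inline_toc and its argument names html and toc are hypothetical, and toc_tag is treated here as a literal string rather than a pattern.

```perl
use strict;
use warnings;

# Place $arg{toc} at the first tag matching toc_tag, either after
# the tag (toc_tag_replace false) or instead of it (true).
sub inline_toc {
    my (%arg) = @_;
    my $tag  = quotemeta $arg{toc_tag};    # literal match in this sketch
    my $html = $arg{html};
    if ($arg{toc_tag_replace}) {
        $html =~ s/<${tag}[^>]*>/$arg{toc}/i;
    } else {
        $html =~ s/(<${tag}[^>]*>)/$1$arg{toc}/i;
    }
    return $html;
}

# Example 1: keep the BODY tag and put the ToC after it.
my $after_body = inline_toc(
    html            => '<html><body class="x"><h1>A</h1></body></html>',
    toc_tag         => 'BODY',
    toc_tag_replace => 0,
    toc             => '<ul><li>A</li></ul>',
);

# Example 2: replace a marker comment with the ToC.
my $replaced = inline_toc(
    html            => "<body>\n<!--toc-->\n<h1>A</h1>",
    toc_tag         => '!--toc--',
    toc_tag_replace => 1,
    toc             => 'TOC',
);
```

In both cases only the first matching tag is touched, and the rest of the input is passed through unchanged.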

NOTES

  • HTML::GenToc is smart enough to detect anchors inside significant elements. If the anchor defines the NAME attribute, HTML::GenToc uses the value. Else, it adds its own NAME attribute to the anchor.

  • The TITLE element is treated specially if specified in the toc_entry option. It is illegal to insert anchors (A) into TITLE elements. Therefore, HTML::GenToc will actually link to the filename itself instead of the TITLE element of the document.

  • HTML::GenToc will ignore a significant element if it does not contain any non-whitespace characters. A warning message is generated if such a condition exists.

  • If you have a sequence of significant elements that change in a slightly disordered fashion, such as H1 -> H3 -> H2 or even H2 -> H1, HTML::GenToc deals with this to create a list which is still good HTML. However, if you are using an ordered list to that depth, you will get strange numbering, as an extra list element will have been inserted to nest the elements at the correct level.

    For example (H2 -> H1 with ol_num_levels=1):

        1. 
    	* My H2 Header
        2. My H1 Header

    For example (H1 -> H3 -> H2 with ol_num_levels=0 and H3 also being significant):

        1. My H1 Header
    	1. 
    	    1. My H3 Header
    	2. My H2 Header
        2. My Second H1 Header

    In cases such as this it may be better not to use the ol option.

LIMITATIONS

  • HTML::GenToc is not very efficient (memory and speed), and can be extremely slow for large documents.

  • Invalid markup will be generated if a significant element is contained inside of an anchor. For example:

    <A NAME="foo"><H1>The FOO command</H1></A>

    will be converted to (if H1 is a significant element),

    <A NAME="foo"><H1><A NAME="The">The</A> FOO command</H1></A>

    which is illegal since anchors cannot be nested.

    It is better style to put anchor statements within the element to be anchored. For example, the following is preferred:

    <H1><A NAME="foo">The FOO command</A></H1>

    HTML::GenToc will detect the "foo" NAME and use it.

  • NAME attributes without quotes are not recognized.

BUGS

Tell me about them.

PREREQUISITES

HTML::SimpleParse
Data::Dumper (only for debugging purposes)

EXPORT

None by default.

SEE ALSO

perl(1) htmltoc(1) hypertoc(1)

AUTHOR

Kathryn Andersen http://www.katspace.com based on htmltoc by Earl Hood ehood AT medusa.acs.uci.edu

COPYRIGHT

Copyright (C) 1994-1997 Earl Hood, ehood AT medusa.acs.uci.edu
Copyright (C) 2002-2004 Kathryn Andersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
