NAME

MKDoc::XML::Tokenizer - Tokenize XML the REX way

SYNOPSIS

  my $tokens = MKDoc::XML::Tokenizer->process_data ($some_xml);
  foreach my $token (@{$tokens})
  {
      print "'" . $token->as_string() . "' is text\n" if (defined $token->is_text());
      print "'" . $token->as_string() . "' is a self closing tag\n" if ($token->is_tag_self_close());
      print "'" . $token->as_string() . "' is an opening tag\n" if ($token->is_tag_open());
      print "'" . $token->as_string() . "' is a closing tag\n" if ($token->is_tag_close());
      print "'" . $token->as_string() . "' is a processing instruction\n" if ($token->is_pi());
      print "'" . $token->as_string() . "' is a declaration\n" if ($token->is_declaration());
      print "'" . $token->as_string() . "' is a comment\n" if ($token->is_comment());
      print "'" . $token->as_string() . "' is a tag\n" if ($token->is_tag());
      print "'" . $token->as_string() . "' is a pseudo-tag (NOT text and NOT tag)\n" if ($token->is_pseudotag());
      print "'" . $token->as_string() . "' is a leaf token (NOT opening tag)\n" if ($token->is_leaf());
  }

SUMMARY

MKDoc::XML::Tokenizer is a module which uses Robert D. Cameron's REX technique to parse XML. The regex is wrapped here for display; ignore the line breaks:

  [^<]+|<(?:!(?:--(?:[^-]*-(?:[^-][^-]*-)*->?)?|\[CDATA\[(?:[^\]]*](?:[^\]]+])
  *]+(?:[^\]>][^\]]*](?:[^\]]+])*]+)*>)?|DOCTYPE(?:[ \n\t\r]+(?:[A-Za-z_:]|[^\
  x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x00-\x7F])*(?:[ \n\t\r]+(?:(?:[A-Za-z_:]|[^\
  x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x00-\x7F])*|"[^"]*"|'[^']*'))*(?:[ \n\t\r]+)
  ?(?:\[(?:<(?:!(?:--[^-]*-(?:[^-][^-]*-)*->|[^-](?:[^\]"'><]+|"[^"]*"|'[^']*'
  )*>)|\?(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x00-\x7F])*(?:\?>|[\
  n\r\t ][^?]*\?+(?:[^>?][^?]*\?+)*>))|%(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0
  -9_:.-]|[^\x00-\x7F])*;|[ \n\t\r]+)*](?:[ \n\t\r]+)?)?>?)?)?|\?(?:(?:[A-Za-z
  _:]|[^\x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x00-\x7F])*(?:\?>|[\n\r\t ][^?]*\?+(?
  :[^>?][^?]*\?+)*>)?)?|/(?:(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0-9_:.-]|[^\x
  00-\x7F])*(?:[ \n\t\r]+)?>?)?|(?:(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0-9_:.
  -]|[^\x00-\x7F])*(?:[ \n\t\r]+(?:[A-Za-z_:]|[^\x00-\x7F])(?:[A-Za-z0-9_:.-]|
  [^\x00-\x7F])*(?:[ \n\t\r]+)?=(?:[ \n\t\r]+)?(?:"[^<"]*"|'[^<']*'))*(?:[ \n\
  t\r]+)?/?>?)?)

That's right. One big regex, and it works rather well.
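
To illustrate the idea without reading the full expression, here is a minimal sketch of single-regex tokenizing. It is not the REX expression above and not this module's implementation: `$SIMPLE_RE` and `simple_tokenize` are hypothetical names, and the pattern is deliberately simplified (no CDATA, no DOCTYPE, no handling of malformed markup).

```perl
use strict;
use warnings;

# A deliberately simplified pattern, NOT the REX expression quoted above.
# Alternatives are tried in order, so comments and processing instructions
# are recognised before the catch-all tag alternative.
my $SIMPLE_RE = qr{
    [^<]+          # a run of text up to the next '<'
  | <!--.*?-->     # a comment (may contain '>')
  | <\?.*?\?>      # a processing instruction
  | <[^>]*>        # any other markup: opening, closing or self-closing tag
}xs;

# Hypothetical helper: split a string into raw token strings and
# return an array reference, mirroring the shape of process_data().
sub simple_tokenize {
    my ($xml) = @_;
    my @tokens = $xml =~ /($SIMPLE_RE)/g;
    return \@tokens;
}

my $tokens = simple_tokenize ('<p class="x">Hello <br/>world<!-- a > b --></p>');
print "[$_]\n" for @{$tokens};
```

Because every alternative consumes at least one character and the match is repeated with `/g`, joining the tokens of well-formed input gives back the original string. The real REX expression follows the same principle but covers the full XML grammar.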

API

my $tokens = MKDoc::XML::Tokenizer->process_data ($some_xml);

Splits $some_xml into a list of MKDoc::XML::Token objects and returns an array reference to the list of tokens.

my $tokens = MKDoc::XML::Tokenizer->process_file ('/some/file.xml');

Same as MKDoc::XML::Tokenizer->process_data ($some_xml), except that it reads $some_xml from '/some/file.xml'.

NOTES

MKDoc::XML::Tokenizer works with MKDoc::XML::Token objects, which can be used directly when building a full tree is not necessary. If you need to build a tree, look at MKDoc::XML::TreeBuilder.
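
As an example of a task that needs no tree, the sketch below keeps only the text tokens of a document. It assumes nothing beyond the `process_data`, `is_text` and `as_string` methods shown in the SYNOPSIS; `text_only` and `strip_markup` are hypothetical helper names, not part of the module.

```perl
use strict;
use warnings;

# Concatenate the text tokens of a token list, dropping all markup.
# Uses 'defined' as in the SYNOPSIS, so a text token such as "0" is kept.
sub text_only {
    my ($tokens) = @_;
    return join '',
           map  { $_->as_string() }
           grep { defined $_->is_text() } @{$tokens};
}

# Hypothetical convenience wrapper: tokenize a string, keep only its text.
# The module is loaded on demand so text_only() can be reused on its own.
sub strip_markup {
    my ($xml) = @_;
    require MKDoc::XML::Tokenizer;
    return text_only (MKDoc::XML::Tokenizer->process_data ($xml));
}
```

A call such as `strip_markup ('<p>Hello <b>world</b></p>')` would then go through the token list once, with no tree built along the way.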

AUTHOR

Author: Jean-Michel Hiver <jhiver@mkdoc.com>

This module is free software and is distributed under the same license as Perl itself. Use it at your own risk.
