NAME
Text::Delimited::Marpa - Extract delimited text sequences from strings
Synopsis
This is scripts/synopsis.pl:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Text::Delimited::Marpa ':constants';
    # -----------
    my(%count)  = (fail => 0, success => 0, total => 0);
    my($parser) = Text::Delimited::Marpa -> new
    (
        open    => '/*',
        close   => '*/',
        options => print_errors | print_warnings | mismatch_is_fatal,
    );
    my(@text) =
    (
        q|Start /* One /* Two /* Three */ Four */ Five */ Finish|,
    );
    my($result);
    my($text);
    for my $i (0 .. $#text)
    {
        $count{total}++;
        $text = $text[$i];
        print "Parsing |$text|. pos: ", $parser -> pos, '. length: ', $parser -> length, "\n";
        $result = $parser -> parse(text => \$text);
        print "Parse result: $result (0 is success)\n";
        if ($result == 0)
        {
            $count{success}++;
            print join("\n", @{$parser -> tree -> tree2string}), "\n";
        }
    }
    $count{fail} = $count{total} - $count{success};
    print "\n";
    print 'Statistics: ', join(', ', map{"$_ => $count{$_}"} sort keys %count), ". \n";
This is the output of synopsis.pl:
Parsing |Start /* One /* Two /* Three */ Four */ Five */ Finish|. pos: 0. length: 0
Parse result: 0 (0 is success)
root. Attributes: {end => "0", length => "0", start => "0", text => "", uid => "0"}
|--- span. Attributes: {end => "44", length => "37", start => "8", text => " One /* Two /* Three */ Four */ Five ", uid => "1"}
  |--- span. Attributes: {end => "36", length => "22", start => "15", text => " Two /* Three */ Four ", uid => "2"}
    |--- span. Attributes: {end => "28", length => "7", start => "22", text => " Three ", uid => "3"}
Statistics: fail => 0, success => 1, total => 1.
See also scripts/tiny.pl and scripts/traverse.pl.
Description
Text::Delimited::Marpa provides a Marpa::R2-based parser for extracting delimited text sequences from strings. The text between the delimiters is stored as nodes in a tree managed by Tree. The delimiters, and the text outside the delimiters, are not saved in the tree.
Nested strings with the same delimiters are saved as daughters of their enclosing strings' tree nodes. As you can see from the output just above, this nesting process is repeated as many times as the delimiters themselves are nested.
You can ignore the nested, delimited strings by processing only the daughters of the tree's root node.
This module is a companion to Text::Balanced::Marpa. The differences are discussed in the "FAQ" below.
See the "FAQ" for various topics, including:
- o UTF8 handling
-
See t/utf8.t.
- o Escaping delimiters within the text
-
See t/escapes.t.
- o Options to make mismatched delimiters fatal errors
-
See t/escapes.t and t/perl.delimiters.t.
- o Processing the tree-structured output
-
See scripts/traverse.pl.
- o Emulating Text::Xslate's use of '<:' and ':>'
-
See t/colons.t and t/percents.t.
- o Skipping (leading) characters in the input string
-
See t/skip.prefix.t.
- o Implementing hard-to-read text strings as delimiters
-
See t/silly.delimiters.t.
Distributions
This module is available as a Unix-style distro (*.tgz).
See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing distros.
Installation
Install Text::Delimited::Marpa as you would any Perl module:
Run:
cpanm Text::Delimited::Marpa
or run:
sudo cpan Text::Delimited::Marpa
or unpack the distro, and then either:
perl Build.PL
./Build
./Build test
sudo ./Build install
or:
perl Makefile.PL
make (or dmake or nmake)
make test
make install
Constructor and Initialization
new() is called as my($parser) = Text::Delimited::Marpa -> new(k1 => v1, k2 => v2, ...).
It returns a new object of type Text::Delimited::Marpa.
Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. "text([$stringref])"]):
- o close => $string
-
The closing delimiter.
A value for this option is mandatory.
Default: None.
- o length => $integer
-
The maximum length of the input string to process.
This parameter works in conjunction with the pos parameter.
length can also be used as a key in the hash passed to "parse([%hash])".
See the "FAQ" for details.
Default: Calls Perl's length() function on the input string.
- o next_few_limit => $integer
-
This controls how many characters are printed when displaying 'the next few chars'.
It only affects debug output.
Default: 20.
- o open => $string
-
The opening delimiter.
See the "FAQ" for details and warnings.
A value for this option is mandatory.
Default: None.
- o options => $bit_string
-
This allows you to turn on various options.
options can also be used as a key in the hash passed to "parse([%hash])".
Default: 0 (nothing is fatal).
See the "FAQ" for details.
- o pos => $integer
-
The offset within the input string at which to start processing.
This parameter works in conjunction with the length parameter.
pos can also be used as a key in the hash passed to "parse([%hash])".
See the "FAQ" for details.
Note: The first character in the input string is at pos == 0.
Default: 0.
- o text => $stringref
-
This is a reference to the string to be parsed. A stringref is used to avoid copying what could potentially be a very long string.
text can also be used as a key in the hash passed to "parse([%hash])".
Default: \''.
Methods
bnf()
Returns a string containing the grammar constructed based on user input.
close()
Get the closing delimiter.
See also "open()".
See the "FAQ" for details and warnings.
'close' is a parameter to "new()". See "Constructor and Initialization" for details.
delimiter_action()
Returns a hashref, where the keys are delimiters and the values are either 'open' or 'close'.
error_message()
Returns the last error or warning message set.
Error messages always start with 'Error: '. Messages never end with "\n".
Parsing error strings is not a good idea, even though this module's format for them is fixed.
See "error_number()".
error_number()
Returns the last error or warning number set.
Warnings have values < 0, and errors have values > 0.
If the value is > 0, the message has the prefix 'Error: ', and if the value is < 0, it has the prefix 'Warning: '. If this is not the case, it's a reportable bug.
Possible values for error_number() and error_message():
- o 0 => ""
-
This is the default value.
- o 1/-1 => "The # of open delimiters ($lexeme) does not match the # of close delimiters. Left over: $integer"
-
If "error_number()" returns 1, it's an error, and if it returns -1 it's a warning.
You can set the option mismatch_is_fatal to make it fatal.
- o 2/-2 => (Not used)
- o 3/-3 => "Ambiguous parse. Status: $status. Terminals expected: a, b, ..."
-
This message is only produced when the parse is ambiguous.
If "error_number()" returns 3, it's an error, and if it returns -3 it's a warning.
You can set the option ambiguity_is_fatal to make it fatal.
- o 4 => "Backslash is forbidden as a delimiter character"
-
This preempts some types of sabotage.
This message can never be just a warning message.
- o 5 => "Single-quotes are forbidden in multi-character delimiters"
-
This limitation is due to the syntax of Marpa's DSL.
This message can never be just a warning message.
- o 6/-6 => "Parse exhausted"
-
If "error_number()" returns 6, it's an error, and if it returns -6 it's a warning.
You can set the option exhaustion_is_fatal to make it fatal.
- o 7 => 'Single-quote is forbidden as an escape character'
-
This limitation is due to the syntax of Marpa's DSL.
This message can never be just a warning message.
- o 8 => "There must be at least 1 pair of open/close delimiters"
-
This message can never be just a warning message.
- o 10 => "Unexpected event name 'xyz'"
-
Marpa has triggered an event and its name is not in the hash of event names derived from the BNF.
This message can never be just a warning message.
- o 11 => "The code does not handle these events simultaneously: a, b, ..."
-
The code is written to handle single events at a time, or in rare cases, 2 events at the same time. But here, multiple events have been triggered and the code cannot handle the given combination.
This message can never be just a warning message.
See "error_message()".
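The sign convention above can be sketched in a few lines of plain Perl. The helper name classify_error_number() is hypothetical and is not part of the module; it only restates the rule that 0 means no message, positive values are errors, and negative values are warnings.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Delimited::Marpa): classify a
# value returned by error_number() using the documented sign convention.
sub classify_error_number
{
    my($number) = @_;

    return 'ok'    if ($number == 0);
    return 'error' if ($number > 0);

    return 'warning';
}

# Sample values taken from the list above (1/-1 and 6/-6 come in pairs).
for my $number (0, 1, -1, 6, -6)
{
    printf "%2d => %s\n", $number, classify_error_number($number);
}
```

In real code the argument would presumably come from $parser -> error_number after a call to parse().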
escape_char()
Get the escape char.
known_events()
Returns a hashref where the keys are event names and the values are 1.
length([$integer])
Here, the [] indicate an optional parameter.
Get or set the length of the input string to process.
See also the "FAQ" and "pos([$integer])".
'length' is a parameter to "new()". See "Constructor and Initialization" for details.
matching_delimiter()
Returns a hashref where the keys are opening delimiters and the values are the corresponding closing delimiters.
new()
See "Constructor and Initialization" for details on the parameters accepted by "new()".
next_few_chars($stringref, $offset)
Returns a substring of the string referenced by $stringref, starting at $offset, for use in debug messages.
See "next_few_limit([$integer])".
next_few_limit([$integer])
Here, the [] indicate an optional parameter.
Get or set the number of characters called 'the next few chars', which are printed during debugging.
'next_few_limit' is a parameter to "new()". See "Constructor and Initialization" for details.
open()
Get the opening delimiter.
See also "close()".
See the "FAQ" for details and warnings.
'open' is a parameter to "new()". See "Constructor and Initialization" for details.
options([$bit_string])
Here, the [] indicate an optional parameter.
Get or set the option flags.
For typical usage, see scripts/synopsis.pl.
See the "FAQ" for details.
'options' is a parameter to "new()". See "Constructor and Initialization" for details.
parse([%hash])
Here, the [] indicate an optional parameter.
This is the only method the user needs to call. All data can be supplied when calling "new()".
You can of course call other methods (e.g. "text([$stringref])") after calling "new()" but before calling parse().
The optional hash takes these ($key => $value) pairs (exactly the same as for "new()"):
- o length => $integer
- o options => $bit_string
- o pos => $integer
- o text => $stringref
Note: If a value is passed to parse(), it takes precedence over any value with the same key passed to "new()", and over any value previously passed to the method whose name is $key. Further, the value passed to parse() is always passed to the corresponding method (i.e. whose name is $key), meaning any subsequent call to that method returns the value passed to parse().
Returns 0 for success and 1 for failure.
If the value is 1, you should call "error_number()" to find out what happened.
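As a minimal sketch of the calling pattern described above, the following passes the text to parse() rather than to new(), and then checks the return value. The input string is made up for illustration; every method called here is one documented on this page.

```perl
use strict;
use warnings;

use Text::Delimited::Marpa ':constants';

my($parser) = Text::Delimited::Marpa -> new
(
    open    => '/*',
    close   => '*/',
    options => mismatch_is_fatal,
);

# Illustrative input. The text is supplied via parse(), not new().
my($text)   = 'Start /* One */ Finish';
my($result) = $parser -> parse(text => \$text);

if ($result == 0)
{
    print join("\n", @{$parser -> tree -> tree2string}), "\n";
}
else
{
    # On failure, ask the parser what went wrong.
    print 'Parse failed. Number: ', $parser -> error_number,
        '. Message: ', $parser -> error_message, "\n";
}
```

Because the value passed to parse() is forwarded to text(), a later call to $parser -> text should return the same reference.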
pos([$integer])
Here, the [] indicate an optional parameter.
Get or set the offset within the input string at which to start processing.
See also the "FAQ" and "length([$integer])".
'pos' is a parameter to "new()". See "Constructor and Initialization" for details.
text([$stringref])
Here, the [] indicate an optional parameter.
Get or set a reference to the string to be parsed.
'text' is a parameter to "new()". See "Constructor and Initialization" for details.
tree()
Returns an object of type Tree, which holds the parsed data.
Obviously, it only makes sense to call tree() after calling parse().
See scripts/traverse.pl for sample code which processes this tree's nodes.
FAQ
What are the differences between Text::Balanced::Marpa and Text::Delimited::Marpa?
I think this is shown most clearly by getting the 2 modules to process the same string. So, using this as input:
'a <:b <:c:> d:> e <:f <: g <:h:> i:> j:> k'
Output from Text::Balanced::Marpa's scripts/tiny.pl:
(# 2)   |          1         2         3         4         5         6         7         8         9
        |0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
Parsing |Skip me ->a <:b <:c:> d:> e <:f <: g <:h:> i:> j:> k|. pos: 10. length: 42
Parse result: 0 (0 is success)
root. Attributes: {text => "", uid => "0"}
|--- text. Attributes: {text => "a ", uid => "1"}
|--- open. Attributes: {text => "<:", uid => "2"}
| |--- text. Attributes: {text => "b ", uid => "3"}
| |--- open. Attributes: {text => "<:", uid => "4"}
| | |--- text. Attributes: {text => "c", uid => "5"}
| |--- close. Attributes: {text => ":>", uid => "6"}
| |--- text. Attributes: {text => " d", uid => "7"}
|--- close. Attributes: {text => ":>", uid => "8"}
|--- text. Attributes: {text => " e ", uid => "9"}
|--- open. Attributes: {text => "<:", uid => "10"}
| |--- text. Attributes: {text => "f ", uid => "11"}
| |--- open. Attributes: {text => "<:", uid => "12"}
| | |--- text. Attributes: {text => " g ", uid => "13"}
| | |--- open. Attributes: {text => "<:", uid => "14"}
| | | |--- text. Attributes: {text => "h", uid => "15"}
| | |--- close. Attributes: {text => ":>", uid => "16"}
| | |--- text. Attributes: {text => " i", uid => "17"}
| |--- close. Attributes: {text => ":>", uid => "18"}
| |--- text. Attributes: {text => " j", uid => "19"}
|--- close. Attributes: {text => ":>", uid => "20"}
|--- text. Attributes: {text => " k", uid => "21"}
Output from Text::Delimited::Marpa's scripts/tiny.pl:
(# 2)   |          1         2         3         4         5         6         7         8         9
        |0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
Parsing |Skip me ->a <:b <:c:> d:> e <:f <: g <:h:> i:> j:> k|. pos: 10. length: 42
Parse result: 0 (0 is success)
root. Attributes: {end => "0", length => "0", start => "0", text => "", uid => "0"}
|--- span. Attributes: {end => "22", length => "9", start => "14", text => "b <:c:> d", uid => "1"}
| |--- span. Attributes: {end => "18", length => "1", start => "18", text => "c", uid => "2"}
|--- span. Attributes: {end => "47", length => "18", start => "30", text => "f <: g <:h:> i:> j", uid => "3"}
  |--- span. Attributes: {end => "43", length => "10", start => "34", text => " g <:h:> i", uid => "4"}
    |--- span. Attributes: {end => "39", length => "1", start => "39", text => "h", uid => "5"}
Another example, using the same input string, but manually processing the tree nodes. Parent-daughter relationships are here represented by indentation.
Output from Text::Balanced::Marpa's scripts/traverse.pl:
        |          1         2         3         4         5
        |012345678901234567890123456789012345678901234567890
Parsing |a <:b <:c:> d:> e <:f <: g <:h:> i:> j:> k|.
Span  Text
   1  |a |
   2  |<:|
   3    |b |
   4    |<:|
   5      |c|
   6    |:>|
   7    | d|
   8  |:>|
   9  | e |
  10  |<:|
  11    |f |
  12    |<:|
  13      | g |
  14      |<:|
  15        |h|
  16      |:>|
  17      | i|
  18    |:>|
  19    | j|
  20  |:>|
  21  | k|
Output from Text::Delimited::Marpa's scripts/traverse.pl:
        |          1         2         3         4         5
        |012345678901234567890123456789012345678901234567890
Parsing |a <:b <:c:> d:> e <:f <: g <:h:> i:> j:> k|.
Span  Start  End  Length  Text
   1      4   12       9  |b <:c:> d|
   2      8    8       1    |c|
   3     20   37      18  |f <: g <:h:> i:> j|
   4     24   33      10    | g <:h:> i|
   5     29   29       1      |h|
How do I ignore embedded strings which have the same delimiters as their containing strings?
You can ignore the nested, delimited strings by processing only the daughters of the tree's root node.
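A sketch of that approach, using the same input string as the comparison above. It assumes that children() and meta() from the Tree module behave as they do for any Tree object; those two calls belong to Tree, not to this module.

```perl
use strict;
use warnings;

use Text::Delimited::Marpa;

my($text)   = 'a <:b <:c:> d:> e <:f <: g <:h:> i:> j:> k';
my($parser) = Text::Delimited::Marpa -> new(open => '<:', close => ':>');
my($result) = $parser -> parse(text => \$text);

my(@top_level);

if ($result == 0)
{
    # Visit only the root's daughters. Spans nested more deeply are
    # daughters of these nodes, so they are never visited.
    for my $node ($parser -> tree -> children)
    {
        push @top_level, ${$node -> meta}{text};
    }
}

print "|$_|\n" for @top_level;
```

Going by the traverse.pl output shown earlier, this should report the two outermost spans, 'b <:c:> d' and 'f <: g <:h:> i:> j', with their nested delimiters left in place in the text.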
Where are the error messages and numbers described?
See "error_message()" and "error_number()".
How do I escape delimiters?
By backslash-escaping the first character of all open and close delimiters which appear in the text.
As an example, if the delimiters are '<:' and ':>', this means you have to escape all the '<' chars and all the colons in the text.
The backslash is preserved in the output.
If you don't want to use backslash for escaping, or can't, you can pass a different escape character to "new()".
See t/escapes.t.
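The escaping rule can be sketched as a small pre-processing step in plain Perl. The helper escape_delimiter_chars() is hypothetical and is not provided by the module; it simply puts the escape character in front of every occurrence of the first character of either delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Delimited::Marpa).
sub escape_delimiter_chars
{
    my($text, $open, $close, $escape) = @_;
    $escape = '\\' if (! defined $escape); # Default to a backslash.

    # Collect the first character of each delimiter.
    my(%first);
    $first{substr($_, 0, 1)} = 1 for ($open, $close);

    # Build a character class from those characters.
    my($class) = join('', map{quotemeta($_)} sort keys %first);

    $text =~ s/([$class])/$escape$1/g;

    return $text;
}

# With '<:' and ':>', every '<' and every ':' is escaped.
print escape_delimiter_chars('a <:b:> c', '<:', ':>'), "\n";
```

Note that this escapes every such character, so it is only suitable for text in which none of the delimiters are meant to be active.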
How do the length and pos parameters to new() work?
The recognizer - an object of type Marpa::R2::Scanless::R - is called in a loop, like this:
    for
    (
        $pos = $self -> recce -> read($stringref, $pos, $length);
        $pos < $length;
        $pos = $self -> recce -> resume($pos)
    )
"pos([$integer])" and "length([$integer])" can be used to initialize $pos and $length.
Note: The first character in the input string is at pos == 0.
See https://metacpan.org/pod/distribution/Marpa-R2/pod/Scanless/R.pod#read for details.
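To see which part of the input a given pos and length select, the window can be reproduced with Perl's substr(). This sketch uses the values from the tiny.pl output above (pos: 10, length: 42) and assumes, as in Marpa's read(), that length counts characters starting at pos.

```perl
use strict;
use warnings;

my($text)   = 'Skip me ->a <:b <:c:> d:> e <:f <: g <:h:> i:> j:> k';
my($pos)    = 10; # Offset of the first character to process.
my($length) = 42; # Number of characters to process, counted from $pos.

# The part of the string the parser is asked to look at.
my($window) = substr($text, $pos, $length);

print "Window: |$window|\n";
```

The 10 skipped characters are 'Skip me ->', which is why the first span in that output starts at offset 14 rather than 4.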
Does this package support Unicode/UTF8?
Yes. See t/escapes.t and t/utf8.t.
Does this package handle Perl delimiters (e.g. q|..|, qq|..|, qr/../, qw/../)?
See t/perl.delimiters.t.
Warning: Calling mutators after calling new()
The only mutator which works after calling new() is "text([$stringref])".
In particular, you can't call "escape_char()", "open()" or "close()" after calling "new()". This is because parameters passed to new() are interpolated into the grammar before parsing begins. And that's why the docs for those methods all say 'Get the...' and not 'Get and set the...'.
To make the code work, you would have to manually call _validate_open_close(). But even then a lot of things would have to be re-initialized to give the code any hope of working.
What is the format of the 'open' and 'close' parameters to new()?
Each of these parameters takes a string as a value, and these are used as the opening and closing delimiter pair.
See scripts/synopsis.pl and scripts/tiny.pl.
What are the possible values for the 'options' parameter to new()?
Firstly, to make these constants available, you must say:
use Text::Delimited::Marpa ':constants';
Secondly, more detail on errors and warnings can be found at "error_number()".
Thirdly, for usage of these option flags, see t/angle.brackets.t, t/colons.t, t/escapes.t, t/percents.t and scripts/tiny.pl.
Now the flags themselves:
- o nothing_is_fatal
-
This is the default.
nothing_is_fatal has the value of 0.
- o print_errors
-
Print errors if this flag is set.
print_errors has the value of 1.
- o print_warnings
-
Print various warnings if this flag is set:
- o The ambiguity status and terminals expected, if the parse is ambiguous
- o See "error_number()" for other warnings which might be printed
-
Ambiguity is not, in and of itself, an error. But see the ambiguity_is_fatal option, below.
It's tempting to call this option warnings, but Perl already has use warnings, so I didn't.
print_warnings has the value of 2.
- o print_debugs
-
Print extra stuff if this flag is set.
print_debugs has the value of 4.
- o mismatch_is_fatal
-
This means a fatal error occurs when the number of open delimiters does not match the number of close delimiters.
mismatch_is_fatal has the value of 8.
- o ambiguity_is_fatal
-
This makes "error_number()" return 3 rather than -3.
ambiguity_is_fatal has the value of 16.
- o exhaustion_is_fatal
-
This makes "error_number()" return 6 rather than -6.
exhaustion_is_fatal has the value of 32.
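Since each flag is a distinct power of 2, flags are combined with | and tested with &. The sketch below uses a local hash, %flag, as a stand-in for the exported constants so that it runs without the module; the values are the ones listed above.

```perl
use strict;
use warnings;

# Stand-in for the constants exported by ':constants' (an assumption
# made so this sketch runs on its own). Values as documented above.
my(%flag) =
(
    nothing_is_fatal    => 0,
    print_errors        => 1,
    print_warnings      => 2,
    print_debugs        => 4,
    mismatch_is_fatal   => 8,
    ambiguity_is_fatal  => 16,
    exhaustion_is_fatal => 32,
);

# The same combination as used in scripts/synopsis.pl.
my($options) = $flag{print_errors} | $flag{print_warnings} | $flag{mismatch_is_fatal};

print "options = $options\n";

# Report each non-zero flag, in order of value.
for my $name (sort{$flag{$a} <=> $flag{$b} } grep{$flag{$_} } keys %flag)
{
    printf "%-20s %s\n", $name, ($options & $flag{$name}) ? 'set' : 'clear';
}
```

With the real module you would write print_errors | print_warnings | mismatch_is_fatal directly, as in the synopsis.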
How do I print the tree built by the parser?
See "Synopsis".
How do I make use of the tree built by the parser?
See scripts/traverse.pl.
How is the parsed data held in RAM?
The parsed output is held in a tree managed by Tree.
The tree always has a root node, which has nothing to do with the input data. So, even an empty input string will produce a tree with 1 node. This root has an empty hashref associated with it.
Nodes have a name and a hashref of attributes.
The name indicates the type of node. Names are one of these literals:
- o root
- o span
The (key => value) pairs in the hashref are:
- o end => $integer
-
The offset into the original stringref at which the current span of text ends.
- o length => $integer
-
The number of characters in the current span.
- o start => $integer
-
The offset into the original stringref at which the current span of text starts.
- o text => $string
-
If the node name is 'span', $string is the verbatim text from the document.
Verbatim means, for example, that backslashes in the input are preserved.
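The relationship between these attributes can be checked against the synopsis output using only substr(). This sketch copies the three span nodes from that output by hand (no parser is run) and confirms that text is the substring at start with the given length, and that end is start + length - 1.

```perl
use strict;
use warnings;

my($text) = 'Start /* One /* Two /* Three */ Four */ Five */ Finish';

# Attribute values copied from the synopsis output.
my(@span) =
(
    {start => 8,  end => 44, length => 37, text => ' One /* Two /* Three */ Four */ Five '},
    {start => 15, end => 36, length => 22, text => ' Two /* Three */ Four '},
    {start => 22, end => 28, length => 7,  text => ' Three '},
);

my($mismatch) = 0;

for my $span (@span)
{
    my($slice) = substr($text, $$span{start}, $$span{length});
    my($ok)    = ($slice eq $$span{text})
                    && ($$span{end} == $$span{start} + $$span{length} - 1);

    $mismatch++ if (! $ok);

    printf "%2d .. %2d: %s\n", $$span{start}, $$span{end}, $ok ? 'consistent' : 'inconsistent';
}
```

In other words, start and end are both inclusive offsets into the original string, and the delimiters themselves lie just outside that range.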
What is the homepage of Marpa?
http://savage.net.au/Marpa.html.
That page has a long list of links.
How do I run author tests?
This runs both standard and author tests:
shell> perl Build.PL; ./Build; ./Build authortest
TODO
- o Advanced error reporting
-
See https://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2014/11/delimiter.html.
Perhaps this could be a sub-class?
- o I18N support for error messages
- o An explicit test program for parse exhaustion
See Also
Tree and Tree::Persist.
Machine-Readable Change Log
The file Changes was converted into Changelog.ini by Module::Metadata::Changes.
Version Numbers
Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.
Thanks
Thanks to Jeffrey Kegler, who wrote Marpa and Marpa::R2.
And thanks to rns (Ruslan Shvedov) for writing the grammar for double-quoted strings used in MarpaX::Demo::SampleScripts's scripts/quoted.strings.02.pl. I adapted it to HTML (see scripts/quoted.strings.05.pl in that module), and then incorporated the grammar into GraphViz2::Marpa, and - after more extensions - into this module.
Lastly, thanks to Robert Rothenberg for Const::Exporter, a module which works the same way Perl does.
Repository
https://github.com/ronsavage/Text-Delimited-Marpa
Support
Email the author, or log a bug on RT:
https://rt.cpan.org/Public/Dist/Display.html?Name=Text::Delimited::Marpa.
Author
Text::Delimited::Marpa was written by Ron Savage <ron@savage.net.au> in 2014.
Marpa's homepage: http://savage.net.au/Marpa.html.
My homepage: http://savage.net.au/.
Copyright
Australian copyright (c) 2015, Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License 2.0, a copy of which is available at:
http://opensource.org/licenses/alphabetical.