Name

MarpaX::Demo::StringParser - Conditional preservation of whitespace while parsing

Synopsis

Typical usage:

perl -Ilib scripts/parse.pl -d '[noddy]{color:blue}' -r 1 -v 1 -t output.tokens

Complex graphs work too: Try -d '[node.1]{a:b;c:d}->{e:f;}->{g:h}[node.2]{i:j}->[node.3]{k:l}'

The following refer to data shipped with the distro:

perl -Ilib scripts/parse.pl -i data/node.04.ge -r 1 -t node.04.tokens
diff data/node.04.tokens node.04.tokens

You can use scripts/parse.sh to simplify this process:

scripts/parse.sh data/node.04.ge node.04.tokens -r 1

See the demo page for sample output.

Also, there is an article based on this module.

Description

This module is a demonstration of how to use Marpa::R2's capabilities to have the parser within Marpa call back to code in your own module, to handle certain cases where you don't want Marpa's default processing to occur.

A classic case of this is when you wish to preserve whitespace in some contexts, but also want Marpa to discard whitespace in all other contexts.

Specifically, MarpaX::Demo::StringParser is a cut-down version of Graph::Easy::Marpa V 2.00. It provides a Marpa-based parser for parts of Graph::Easy-style graph definitions, whereas Graph::Easy::Marpa handles the whole language.

Installation

Install MarpaX::Demo::StringParser as you would for any Perl module:

Run:

cpanm MarpaX::Demo::StringParser

or run:

sudo cpan MarpaX::Demo::StringParser

or unpack the distro, and then either:

perl Build.PL
./Build
./Build test
sudo ./Build install

or:

perl Makefile.PL
make (or dmake or nmake)
make test
make install

Scripts Shipped with this Module

All scripts are shipped in the scripts/ directory.

o copy.config.pl

This is for use by the author. It just copies the config file out of the distro, so the script generate.index.pl (which uses HTML template stuff) can find it.

o find.config.pl

This cross-checks the output of copy.config.pl.

o ge2tokens.pl

This transforms all data/*.ge files into their corresponding data/*.tokens files.

o generate.demo.sh

This runs:

o perl -Ilib scripts/ge2tokens.pl
o perl -Ilib ~/bin/ge2svg.pl

See the article mentioned in the Synopsis for this script.

o perl -Ilib scripts/generate.index.pl

And then copies the demo output to my dev web server's doc root, where I can inspect it.

o generate.index.pl

This constructs a web page containing all the html/*.svg files.

o parse.pl

This runs a parse on a single input file. Run 'parse.pl -h' for details.

o parse.sh

This simplifies running parse.pl.

o pod2html.sh

This converts all lib/*.pm files into their corresponding *.html versions, for proof-reading and uploading to my real web site.

Constructor and Initialization

new() is called as my($parser) = MarpaX::Demo::StringParser -> new(k1 => v1, k2 => v2, ...).

It returns a new object of type MarpaX::Demo::StringParser.

Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. description($graph)]):

o description => '[node.1]->[node.2]'

Specify a string for the graph definition.

You are strongly encouraged to surround this string with '...' to protect it from your shell if using this module directly from the command line.

See also the 'input_file' key which reads the graph from a file.

The 'description' key takes precedence over the 'input_file' key.

o input_file => $graph_file_name

Read the graph definition from this file.

See also the 'description' key, which supplies the graph as a string (e.g. from the command line).

The whole file is slurped in as 1 graph.

The first lines of the file can start with /^\s*#/, and will be discarded as comments.

The 'description' key takes precedence over the 'input_file' key.

o report_tokens => $Boolean

Calls "report()" to report, via the log, the items recognized by the parser.

o token_file => $file_name

The name of the CSV file in which parsed tokens are to be saved.

If '', the file is not written.

Default: ''.

o verbose => $integer

Prints more (1, 2) or less (0) progress messages.

Methods

attribute_list($attribute_string)

Returns nothing.

Processes the attribute string returned by Marpa::R2 when it pauses during the processing of a set of attributes.

Then, pushes this set of attributes onto a stack.

The stack's elements are documented below in "FAQ" under "How is the parsed graph stored in RAM?".
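
To make the idea concrete, here is a minimal sketch of how an attribute string might be split into the attribute hashrefs described in the FAQ. It is not the module's implementation: split_attributes() is a hypothetical helper, the count key is omitted, quotes are left in place rather than stripped, and the '<...>' and '<<table>...</table>>' forms are not handled. An unquoted apostrophe in a value would be treated as an opening quote.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MarpaX::Demo::StringParser.
# Splits '{key: value; key: value}' on ';' characters which are neither
# escaped with '\' nor inside a "..." or '...' pair.
sub split_attributes {
    my ($text) = @_;

    # Remove the surrounding '{' and '}' delimiters.
    $text =~ s/^\s*\{//;
    $text =~ s/\}\s*$//;

    my @pairs;
    my $current = '';
    my $quote;    # The quote character we are inside, if any.
    my @chars = split //, $text;

    while (@chars) {
        my $c = shift @chars;

        if ($c eq '\\' && @chars) {
            # Keep an escape sequence such as '\;' intact.
            $current .= $c . shift @chars;
        }
        elsif (defined $quote) {
            $quote = undef if $c eq $quote;
            $current .= $c;
        }
        elsif ($c eq '"' || $c eq "'") {
            $quote = $c;
            $current .= $c;
        }
        elsif ($c eq ';') {
            push @pairs, $current;
            $current = '';
        }
        else {
            $current .= $c;
        }
    }

    push @pairs, $current;

    my @items;

    for my $pair (@pairs) {
        # Keys must match /^[a-zA-Z_]+$/; empty pairs (e.g. after a
        # trailing ';') are skipped.
        next unless $pair =~ /^\s*([a-zA-Z_]+)\s*:\s*(.*?)\s*$/s;

        push @items, { name => $1, type => 'attribute', value => $2 };
    }

    return \@items;
}

for my $item (@{ split_attributes('{color: red; label: Green node}') }) {
    print "$item->{name} => $item->{value}\n";
}
```

Scanning character by character keeps the escape and quote rules in one place, at the cost of speed. The module itself leaves this work to Marpa::R2 and its pause mechanism.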

description([$graph])

Here, the [] indicate an optional parameter.

Gets or sets the graph string to be parsed.

The value supplied to the description() method takes precedence over the value read from the input file.

Also, description is an option to new().

format($item)

Returns a string containing a nicely formatted version of the keys and values of the hashref $item.

$item must be an element of the stack of tokens output by the parse.

The stack's elements are documented below in "FAQ" under "How is the parsed graph stored in RAM?".

generate_token_file($file_name)

Returns nothing.

Writes a CSV file of tokens output by the parse if new() was called with the token_file option.

get_graph_from_command_line()

If the caller has requested a graph be parsed from the command line, with the description option to new(), get it now.

Called as appropriate by run().

get_graph_from_file()

If the caller has requested a graph be parsed from a file, with the input_file option to new(), get it now.

Called as appropriate by run().

grammar()

Returns an object of type Marpa::R2::Scanless::G.

graph([$graph])

Here, the [] indicate an optional parameter.

Gets or sets the value of the graph definition string.

graph_text([$graph])

Here, the [] indicate an optional parameter.

Returns the value of the graph definition string, from either the command line or a file.

input_file([$graph_file_name])

Here, the [] indicate an optional parameter.

Gets or sets the name of the file to read the graph definition from.

See also the description() method.

The whole file is slurped in as 1 graph.

The first lines of the file can start with /^\s*#/, and will be discarded as comments.

The value supplied to the description() method takes precedence over the value read from the input file.

Also, input_file is an option to new().
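
The handling of comments and line-breaks can be sketched as follows. This is an illustration only, not the module's code: graph_text_from_lines() is a hypothetical helper, and reading 'the first lines' as 'the leading run of comment lines' is an assumption. Joining lines with a single space follows the FAQ entry on line-breaks.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MarpaX::Demo::StringParser.
sub graph_text_from_lines {
    my @lines = @_;

    chomp @lines;

    # Discard only the leading comment lines, i.e. those matching /^\s*#/.
    shift @lines while @lines && $lines[0] =~ /^\s*#/;

    # Each line-break becomes a single space.
    return join ' ', @lines;
}

my @sample = (
    "# A comment\n",
    "  # Another comment\n",
    "[node.1]\n",
    "-> {label: x}\n",
    "[node.2]\n",
);

print graph_text_from_lines(@sample), "\n";
```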

node()

Returns nothing.

Processes the node name string returned by Marpa::R2 when it pauses during the processing of '[' ... ']'.

Then, pushes this node name onto a stack.

The stack's elements are documented below in the "FAQ" under "How is the parsed graph stored in RAM?".
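
As an illustration of the whitespace and quoting rules given in the FAQ for node names, here is a small sketch. It assumes the text between '[' and ']' has already been extracted; clean_node_name() is a hypothetical helper, not the module's own code, and it does not deal with escaped characters.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MarpaX::Demo::StringParser.
sub clean_node_name {
    my ($name) = @_;

    # Leading and trailing spaces are not preserved...
    $name =~ s/^\s+//;
    $name =~ s/\s+$//;

    # ...unless quoted. A balanced pair of single- or double-quotes
    # is stripped, and whatever is inside them is kept as-is.
    if ($name =~ /^(["'])(.*)\1$/s) {
        $name = $2;
    }

    return $name;
}

for my $raw ('     From here     ', '"[node]"', 'node 1', '') {
    print '[', clean_node_name($raw), "]\n";
}
```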

parser()

Returns an object of type Marpa::R2::Scanless::R.

process()

Does the real work. Called by run() after processing the user's options.

renumber_items()

Ensures each item in the stack has a sequential number 1 .. N.

report()

Reports (prints) the list of items recognized by the parser.

report_tokens([0 or 1])

The [] indicate an optional parameter.

Gets or sets the value which determines whether or not to report the items recognized by the parser.

Also, report_tokens is an option to new().

run()

This is the only method the caller needs to call. All parameters are supplied to new().

Returns 0 for success and 1 for failure.
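
A minimal sketch of calling the module from your own code, using only the options to new() documented above. The wrapper parse_description() is a hypothetical name introduced for this example. It returns undef when the module is not installed, so the snippet can be run anywhere.

```perl
use strict;
use warnings;

# Hypothetical wrapper, introduced for this example only.
# Returns the result of run(): 0 for success and 1 for failure.
# Returns undef if MarpaX::Demo::StringParser cannot be loaded.
sub parse_description {
    my ($description) = @_;

    return undef unless eval { require MarpaX::Demo::StringParser; 1 };

    my $parser = MarpaX::Demo::StringParser->new(
        description   => $description,
        report_tokens => 1,
        verbose       => 0,
    );

    return $parser->run;
}

my $result = parse_description('[node.1]{color:blue}->[node.2]');

if (! defined $result) {
    print "MarpaX::Demo::StringParser is not installed\n";
}
elsif ($result == 0) {
    print "Parse succeeded\n";
}
else {
    print "Parse failed\n";
}
```

Since run() returns 0 for success, a script can pass the result straight to exit(), which matches the usual shell convention.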

verbose([0 .. 2])

The [] indicate an optional parameter.

Gets or sets the value which determines how many progress reports are printed.

Also, verbose is an option to new().

FAQ

Does this module handle utf8?

Yes. See the last sample on the demo page.

In simple terms, what is the grammar you parse?

It's a cut-down version of the DOT language used by AT&T's dot program. See http://graphviz.org.

Firstly, a summary:

Element        Syntax
---------------------
Edge names     Either '->' or '--'
---------------------
Node names     1: Delimited by '[' and ']'.
               2: May be quoted with " or '.
               3: Escaped characters, using '\', are allowed.
               4: Internal spaces in node names are preserved even if not quoted.
---------------------
Attributes     1: Delimited by '{' and '}'.
               2: Within that, any number of "key : value" pairs separated by ';'.
               3: Values may be quoted with " or ' or '<...>' or '<<table>...</table>>'.
               4: Escaped characters, using '\', are allowed.
               5: Internal spaces in attribute values are preserved even if not quoted.
---------------------

Note: Both edges and nodes can have attributes.

See the demo page for many samples.

And now the details:

o Comments

The first lines of the input file can start with /^\s*#/, and will be discarded as comments.

o Line-breaks

These are converted into a single space.

o Nodes

Nodes are delimited by the quote characters '[' and ']'.

Within the quotes, any printable character can be used for a node's name.

Some literals - ']', '"', "'" - can be used in the node's value, but they must satisfy one of these conditions:

o Escaped using '\'

Eg: \].

o Placed inside " ... "
o Placed inside ' ... '

Internal spaces are preserved within a node's name, but leading and trailing spaces are not (unless quoted).

Lastly, the node's name can be empty. I.e.: You use '[]' in the input stream to create an anonymous node.

Samples:

[]
[node.1]
[node 1]
[[node\]]
["[node]"]
[     From here     ] -> [     To there     ]

Note: Node names quoted with a balanced pair of single- or double-quotes will have those quotes stripped.

o Edges

Edge names are either '->' or '--'.

No other edge names are accepted.

Samples:

->
--

o Attributes

Both nodes and edges can have any number of attributes.

Attributes are delimited by the quote characters '{' and '}'.

These attributes are listed immediately after their owning node or edge.

Each attribute consists of a key:value pair, where ':' must appear literally.

These key:value pairs must be separated by the ';' character.

The values for 'key' are reserved words used by Graphviz's attributes. These keys satisfy the regexp /^[a-zA-Z_]+$/.

For the 'value', any printable character can be used.

Some escape sequences are reserved by Graphviz.

Some literals - ';', '}', '<', '>', '"', "'" - can be used in the attribute's value, but they must satisfy one of these conditions:

o Escaped using '\'.

Eg: \;, \}, etc.

o Placed inside " ... "
o Placed inside ' ... '
o Placed inside <...>

This does not mean you can use <<Some text>>. See the next point.

o Placed inside <<table> ... </table>>

Using this construct allows you to use HTML entities such as &amp;, &lt;, &gt; and &quot;.

Internal spaces are preserved within an attribute's value, but leading and trailing spaces are not (unless quoted).

Samples:

[node.1] {color: red; label: Green node}
-> {penwidth: 5; label: From Here to There}
[node.2]
-> {label: "A literal semicolon '\;' in a label"}

Note: That '\;' does not actually need those single-quote characters, since it is within a set of double-quotes.

Note: Attribute values quoted with a balanced pair of single- or double-quotes will have those quotes stripped.

o Graphs

Graphs are sequences of nodes and edges, in any order.

The sample given just above for attributes is in fact a single graph.

A sample:

[node]
[node] ->
-> {label: Start} -> {color: red} [node.1] {color: green} -> [node.2]
[node.1] [node.2] [node.3]

For more samples, see the data/*.ge files shipped with the distro.

How is the parsed graph stored in RAM?

Items are stored in an arrayref managed by Set::Array.

This arrayref is available via the "items()" method.

Each element in the array is a hashref, listed here in alphabetical order by type.

Note: Items are numbered from 1 up.

o Attributes

An attribute can belong to a node or an edge. An attribute definition of '{color: red;}' would produce a hashref of:

{
	count => $n,
	name  => 'color',
	type  => 'attribute',
	value => 'red',
}

An attribute definition of '{color: red; shape: circle;}' will produce 2 hashrefs, i.e. 2 sequential elements in the arrayref:

{
	count => $n,
	name  => 'color',
	type  => 'attribute',
	value => 'red',
}

{
	count => $n + 1,
	name  => 'shape',
	type  => 'attribute',
	value => 'circle',
}

o Edges

An edge definition of '->' would produce a hashref of:

{
	count => $n,
	name  => '->',
	type  => 'edge',
	value => '',
}

o Nodes

A node definition of '[Name]' would produce a hashref of:

{
	count => $n,
	name  => 'Name',
	type  => 'node',
	value => '',
}

A node can have a definition of '[]', which means it has no name. Such nodes are called anonymous (or invisible) because while they take up space in the output stream, they have no printable or visible characters if the output stream is turned into a graph by Graphviz's dot program.

Each anonymous node will produce at least these 2 hashrefs, the node itself and 1 attribute:

{
	count => $n,
	name  => '',
	type  => 'node',
	value => '',
}

{
	count => $n + 1,
	name  => 'color',
	type  => 'attribute',
	value => 'invis',
}

You can of course give your anonymous nodes any attributes, but they will always be forced to have the attribute shown above.

Node names are case-sensitive in dot, but that does not matter within the context of this module.
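
Because attributes are listed immediately after their owning node or edge, a consumer of this structure can attach each attribute to the most recent node or edge. The following sketch shows one way to do that. The sample items are written by hand in the format documented above rather than produced by a real parse, and group_attributes() is a hypothetical helper.

```perl
use strict;
use warnings;

# Hand-written items in the documented format, roughly what
# '[node.1]{color: red}->{penwidth: 5}[node.2]' might produce.
my @items = (
    { count => 1, name => 'node.1',   type => 'node',      value => ''    },
    { count => 2, name => 'color',    type => 'attribute', value => 'red' },
    { count => 3, name => '->',       type => 'edge',      value => ''    },
    { count => 4, name => 'penwidth', type => 'attribute', value => '5'   },
    { count => 5, name => 'node.2',   type => 'node',      value => ''    },
);

# Hypothetical helper, not part of MarpaX::Demo::StringParser.
sub group_attributes {
    my ($items) = @_;
    my @owners;

    for my $item (@$items) {
        if ($item->{type} eq 'attribute') {
            # Skip an attribute which has no preceding node or edge.
            next unless @owners;

            $owners[-1]{attributes}{ $item->{name} } = $item->{value};
        }
        else {
            push @owners, {
                attributes => {},
                name       => $item->{name},
                type       => $item->{type},
            };
        }
    }

    return \@owners;
}

for my $owner (@{ group_attributes(\@items) }) {
    my $attributes = join ', ',
        map { "$_=$owner->{attributes}{$_}" } sort keys %{ $owner->{attributes} };

    print "$owner->{type} '$owner->{name}'", ($attributes ? " ($attributes)" : ''), "\n";
}
```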

Where are the actions named in the grammar?

In MarpaX::Demo::StringParser::Actions.

How did you generate the html/*.svg files?

With a private script which uses Graph::Easy::Marpa::Renderer::GraphViz2 V 2.00. This script is not shipped in order to avoid a dependency on that module. Also, another private script which validates Build.PL and Makefile.PL would complain about the missing dependency.

See the demo page for details.

Machine-Readable Change Log

The file Changes was converted into Changelog.ini by Module::Metadata::Changes.

Version Numbers

Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.

Support

Email the author, or log a bug on RT:

https://rt.cpan.org/Public/Dist/Display.html?Name=MarpaX::Demo::StringParser.

Author

MarpaX::Demo::StringParser was written by Ron Savage <ron@savage.net.au> in 2013.

Home page: http://savage.net.au/index.html.

Copyright

Australian copyright (c) 2013, Ron Savage.

All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License, a copy of which is available at:
http://www.opensource.org/licenses/index.html