NAME
MarpaX::Languages::SVG::Parser
- A nested SVG parser, using XML::SAX and Marpa::R2
Synopsis
#!/usr/bin/env perl
use strict;
use warnings;
use MarpaX::Languages::SVG::Parser;
# ---------------------------------
my(%option) =
(
	input_file_name => 'data/ellipse.01.svg',
);
my($parser) = MarpaX::Languages::SVG::Parser -> new(%option);
my($result) = $parser -> run;
die "Parse failed\n" if ($result == 1);
for my $item (@{$parser -> items -> print})
{
	print sprintf "%-16s %-16s %s\n", $$item{type}, $$item{name}, $$item{value};
}
This script ships as scripts/synopsis.pl. Run it as:
shell> perl -Ilib scripts/synopsis.pl
See also scripts/parse.file.pl for code which takes command line parameters. For help, run:
shell> perl -Ilib scripts/parse.file.pl -h
Description
MarpaX::Languages::SVG::Parser uses XML::SAX and Marpa::R2 to parse SVG into an array of hashrefs.
XML::SAX parses the input file, and then certain tags' attribute values are parsed by Marpa::R2. Each attribute treated specially has its own BNF. This is why it's called nested parsing.
Examples of these special cases are the path's 'd' attribute and the 'transform' attribute of various tags.
SVG graphs of the attribute-specific BNFs are shipped in html/*.svg.
See the "FAQ" for details.
Installation
Install MarpaX::Languages::SVG::Parser as you would for any Perl module:
Run:
cpanm MarpaX::Languages::SVG::Parser
or run:
sudo cpan MarpaX::Languages::SVG::Parser
or unpack the distro, and then either:
perl Build.PL
./Build
./Build test
sudo ./Build install
or:
perl Makefile.PL
make (or dmake or nmake)
make test
make install
Constructor and Initialization
new() is called as my($parser) = MarpaX::Languages::SVG::Parser -> new(k1 => v1, k2 => v2, ...).
It returns a new object of type MarpaX::Languages::SVG::Parser.
Key-value pairs accepted in the parameter list (see also the corresponding methods [e.g. "encoding([$encoding])"]):
- o encoding => $string
-
$string takes values such as 'utf-8', and the code converts this into '<:encoding(utf-8)'.
File::Slurp's read_file() is used to read the SVG file specified by the input_file_name option. The encoding option is passed to read_file() via its binmode option. However, setting it will rarely be necessary.
See data/ellipse.01.svg and data/utf8.01.svg for sample data. They are both processed by scripts/parse.file.pl without needing to change the default encoding (which is the empty string).
shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg
shell> perl -Ilib scripts/parse.file.pl -i data/utf8.01.svg
Default: ''.
- o input_file_name => $string
-
This names the input file to be parsed.
When calling "run(%args)" this is an SVG file (e.g. data/*.svg).
But when calling "test(%args)", this is a text file (e.g. data/*.dat).
This option is mandatory.
Default: ''.
- o logger => aLog::HandlerObject
-
By default, an object of type Log::Handler is created which prints to STDOUT, but given the default setting (maxlevel => 'notice'), nothing is actually printed.
See maxlevel and minlevel below.
Set logger to '' (the empty string) to stop a logger being created.
Default: undef.
- o maxlevel => logOption1
-
This option affects Log::Handler objects.
See the Log::Handler::Levels docs.
Since the "report()" method is always called and outputs at log level info, the first of these produces no output, whereas the second lists all the parse results. The third adds a tiny bit to the output.
shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg
shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg -max info
shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg -max debug
The extra output produced by debug includes the input file name and the string which Marpa::R2 is trying to parse. This helps debug the BNFs themselves.
Default: 'notice'.
- o minlevel => logOption2
-
This option affects Log::Handler objects.
See the Log::Handler::Levels docs.
Default: 'error'.
No lower levels are used.
- o output_file_name => $string
-
This names the CSV file to be written.
Note: This name is only used when calling "run(%args)". It is of course ignored when calling "test(%args)".
If not set, nothing is written.
The encoding option is not used if you choose to create a CSV file with the output_file_name option, because Text::CSV::Encoded 'just works'.
See data/circle.01.csv and data/utf8.01.csv, which were created by running:
shell> perl -Ilib scripts/parse.file.pl -i data/circle.01.svg -o data/circle.01.csv
shell> perl -Ilib scripts/parse.file.pl -i data/utf8.01.svg -o data/utf8.01.csv
Default: ''.
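As a rough illustration of the conversion described for the encoding option above, here is a minimal sketch using only core Perl. The sub encoding_to_layer() is a hypothetical name, not part of the module, and Perl's built-in open() stands in for File::Slurp's read_file(), so the module's own code may well differ.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use File::Temp qw(tempfile);

# Hypothetical helper (not part of the module): turn an encoding name
# such as 'utf-8' into an input layer such as '<:encoding(utf-8)'.

sub encoding_to_layer
{
	my($encoding) = @_;

	# The documented default is the empty string, i.e. no explicit layer.

	return '<' if (! defined($encoding) || $encoding eq '');
	return "<:encoding($encoding)";
}

# Write a small utf-8 sample file so the sketch is self-contained.

my($fh, $file_name) = tempfile(UNLINK => 1);

binmode($fh, ':encoding(utf-8)');
print $fh "<text>caf\x{e9}</text>\n";
close $fh;

# Read it back through the layer built from the encoding name.

my($layer) = encoding_to_layer('utf-8');

open(my $in, $layer, $file_name) || die "Can't open $file_name: $!";
my($content) = do{local $/; <$in>};
close $in;

print "Layer: $layer. Characters read: ", length($content), "\n";
```

With the default encoding (the empty string) the helper returns a plain '<', so the file is read without any explicit layer.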
Methods
attribute($attribute)
Get or set the name of the attribute being processed.
This is only used in testing, in calls from scripts/test.file.pl and (indirectly) scripts/test.fileset.pl.
It is needed because the test files, data/*.dat, do not contain tag/attribute names, and hence the code needs to be told explicitly which attribute it is parsing.
Note: attribute is a parameter to new().
encoding([$encoding])
Here, the [] indicate an optional parameter.
Get or set the encoding to be used by File::Slurp's binmode option.
Note: encoding is a parameter to new().
get_encoding(%args)
Allows "run(%args)" and "test(%args)" to accept the encoding in various ways.
%args is a hash with this optional (key => value) pair:
So, the parser object can be set up via any of these:
Used internally.
input_file_name([$string])
Here, the [] indicate an optional parameter.
Get or set the name of the file to parse.
When calling "run(%args)" this is an SVG file (e.g. data/*.svg).
But when calling "test(%args)", this is a text file (e.g. data/*.dat).
Note: input_file_name is a parameter to new().
item_count([$new_value])
Here, the [] indicate an optional parameter.
Get or set the counter used to populate the count key in the hashref in the array of parsed tokens.
Used internally.
See the "FAQ" for details.
items()
Returns the instance of Set::Array which manages the array of hashrefs holding the parsed tokens.
$object -> items -> print returns an array ref.
See "Synopsis" in MarpaX::Languages::SVG::Parser for sample code.
See also "new_item($type, $name, $value)".
log($level, $s)
Calls $self -> logger -> log($level => $s) if ($self -> logger).
logger([$log_object])
Here, the [] indicate an optional parameter.
Get or set the log object.
$log_object must be a Log::Handler-compatible object.
To disable logging, just set logger to the empty string.
Note: logger is a parameter to new().
maxlevel([$string])
Here, the [] indicate an optional parameter.
Get or set the value used by the logger object.
This option is only used if an object of type Log::Handler is created. See Log::Handler::Levels.
Note: maxlevel is a parameter to new().
minlevel([$string])
Here, the [] indicate an optional parameter.
Get or set the value used by the logger object.
This option is only used if an object of type Log::Handler is created. See Log::Handler::Levels.
Note: minlevel is a parameter to new().
new()
This method is auto-generated by Moo.
new_item($type, $name, $value)
Pushes another hashref onto the stack managed by $self -> items.
See the "FAQ" for details.
output_file_name([$string])
Here, the [] indicate an optional parameter.
Get or set the name of the (optional) CSV file to write.
Note: output_file_name is a parameter to new().
report()
Prints a nicely-formatted report of the items array via the logger.
run(%args)
The method which does all the work.
%args is a hash with this optional (key => value) pair:
Returns 0 for a successful parse and 1 for failure.
The code dies if Marpa::R2 itself can't parse the given string.
See also "test(%args)".
save()
Save the parsed tokens to a CSV file, but only if an output file name was provided in the call to "new()" or to "output_file_name([$string])".
test(%args)
This method is used by scripts/test.file.pl, and hence (indirectly) by scripts/test.fileset.pl, which calls scripts/test.file.pl, to run tests.
%args is a hash with this optional (key => value) pair:
Returns 0 for a successful parse and 1 for failure.
See also "run(%args)".
Files Shipped with this Module
Data Files
These are all shipped in the data/ directory.
- o *.log
-
The logs of running this on each *.svg file:
shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.02.svg -max debug > data/ellipse.02.log
The *.log files are generated by scripts/svg2.log.pl.
- o circle.01.csv
-
Output from scripts/parse.file.pl
- o circle.01.svg
-
Test data for scripts/parse.file.pl
- o d.bnf
-
This is the grammar for the 'd' attribute of the 'path' tag.
Note: The module does not read this file. A copy of the grammar is stored at the end of the source code for MarpaX::Languages::SVG::Parser::SAXHandler, and read by Data::Section::Simple.
- o d.*.dat
-
Fake data to test d.bnf.
Input for scripts/test.file.pl.
- o html/d.svg
-
This is the graph of the grammar d.bnf.
It was generated by scripts/bnf2graph.pl.
- o ellipse.*.svg
-
Test data for scripts/parse.file.pl
- o line.01.svg
-
Test data for scripts/parse.file.pl
- o points.bnf
-
This grammar is for both the polygon and polyline 'points' attributes.
- o points.*.dat
-
Fake data to test points.bnf.
Input for scripts/test.file.pl.
- o polygon.01.svg
-
Test data for scripts/parse.file.pl
- o polyline.01.svg
-
Test data for scripts/parse.file.pl
- o preserveAspectRatio.bnf
-
This grammar is for the 'preserveAspectRatio' attribute of various tags.
- o preserveAspectRatio.*.dat
-
Fake data to test preserveAspectRatio.bnf.
Input for scripts/test.file.pl.
- o preserveAspectRatio.01.svg
-
Test data for scripts/parse.file.pl
- o html/preserveAspectRatio.svg
-
This is the graph of the grammar preserveAspectRatio.bnf.
It was generated by scripts/bnf2graph.pl.
- o rect.*.svg
-
Test data for scripts/parse.file.pl
- o transform.bnf
-
This grammar is for the 'transform' attribute of various tags.
- o transform.*.dat
-
Fake data to test transform.bnf.
Input for scripts/test.file.pl.
- o utf8.01.csv
-
Output from scripts/parse.file.pl
- o utf8.01.log
-
The log of running:
shell> perl -Ilib scripts/parse.file.pl -i data/utf8.01.svg -max debug > data/utf8.01.log
- o utf8.01.svg
-
Test data for scripts/parse.file.pl
- o viewBox.bnf
-
This grammar is for the 'viewBox' attribute of various tags.
- o viewBox.*.dat
-
Fake data to test viewBox.bnf.
Input for scripts/test.file.pl.
- o html/viewBox.svg
-
This is the graph of the grammar viewBox.bnf.
It was generated by scripts/bnf2graph.pl.
Scripts
These are all shipped in the scripts/ directory.
- o bnf2graph.pl
-
Finds all data/*.bnf files and converts them into html/*.svg.
shell> perl -Ilib scripts/bnf2graph.pl
Requires MarpaX::Grammar::GraphViz2.
- o copy.config.pl
-
This is for use by the author. It just copies the config file out of the distro, so the script generate.demo.pl (which uses HTML template stuff) can find it.
- o find.config.pl
-
This cross-checks the output of copy.config.pl.
- o float.pl
-
This was posted by Jean-Damien Durand on the Marpa Google Group, as a demonstration of a grammar for parsing floats and hex numbers.
- o generate.demo.pl
-
Run by generate.demo.sh.
Input files are data/*.bnf and html/*.svg. Output files are html/*.html.
- o generate.demo.sh
-
Runs generate.demo.pl and then copies html/* to my web server's doc dir ($DR).
- o number.pl
-
This also was posted by Jean-Damien Durand on the Marpa Google Group, as a demonstration of a grammar for parsing floats and integers, and binary, octal and hex numbers.
- o parse.file.pl
-
This is the script you'll probably use most frequently. Run with '-h' for help.
- o pod2html.sh
-
This lets me quickly proof-read edits to the docs.
- o svg2log.pl
-
Runs parse.file.pl on each data/*.svg file and saves the output in data/*.log.
- o synopsis.pl
-
The code as per the "Synopsis".
- o t/test.fake.data.t
-
A test script. It parses data/*.dat, which are not SVG files, but just contain attribute value data.
- o t/test.real.data.t
-
A test script. It parses data/*.svg, which are SVG files, and compares the output to the shipped files data/*.log.
- o test.file.pl
-
This runs the code on a single test file (data/*.dat, not an svg file). Try:
shell> perl -Ilib scripts/test.file.pl -a d -i data/d.30.dat -max debug
- o test.fileset.pl
-
This runs the code on a set of files (data/d.*.dat, data/points.*.dat or data/transform.*.dat). Try:
shell> perl -Ilib scripts/test.fileset.pl -a transform -max debug
- o t/version.t
-
A test script.
FAQ
See also "FAQ" in MarpaX::Languages::SVG::Parser::Actions.
What exactly does this module do?
It parses SVG files (using XML::SAX), and applies special parsing (using Marpa::R2) to certain attributes of certain tags.
The output is an array of hashrefs, whose structure is described below.
Which SVG attributes are treated specially by this module?
- o d
-
This is the 'd' attribute of the 'path' tag.
- o points
-
This is the 'points' attribute of both the 'polygon' and 'polyline' tags.
- o preserveAspectRatio
-
Various tags can have the 'preserveAspectRatio' attribute.
- o transform
-
Various tags can have the 'transform' attribute.
- o viewBox
-
Various tags can have a 'viewBox' attribute.
Each of these special cases has its own Marpa-style BNF.
SVG graphs of the attribute-specific BNFs are shipped in html/*.svg.
Where are the specs for SVG and the BNFs?
W3C's SVG specs. In particular, see paths and shapes.
The BNFs have been translated into the syntax used by Marpa::R2. See Marpa::R2::Scanless::DSL for details.
These BNFs are actually stored at the end of the source code of MarpaX::Languages::SVG::Parser::SAXHandler, and loaded one at a time into Marpa using that fine module Data::Section::Simple.
Also, the BNFs are shipped in data/*.bnf, and in html/*.svg.
Is the stuff at the start of the SVG file preserved in the array?
If by 'stuff' you mean:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
Then, no.
I could not get the xml_decl etc events to fire using XML::SAX V 0.99 and XML::SAX::ParserFactory V 1.01.
Why don't you capture comments?
Because Perl instantly segfaults if I try. Code tried in SAXHandler.pm:
sub comment
{
	my($self, $element) = @_;
	my($comment) = $$element{Data};
	$self -> log(debug => "Comment: $comment"); # Prints ok.
	$self -> new_item('comment', '-', $comment); # Segfaults.
} # End of comment.
Hence - No comment.
How do I get access to this array?
The "Synopsis" contains a runnable program, which ships as scripts/synopsis.pl.
How is the parser's output stored in RAM?
It is stored in an array of hashrefs managed by the Set::Array module.
The hashref structure is documented in the next item.
Using Set::Array is much simpler than using an arrayref. Compare:
$self -> items -> push
({
	count => $self -> item_count,
	name  => $name,
	type  => $type,
	value => $value,
});
With:
$self -> items([]);
...
my($araref) = $self -> items;
push @$araref,
{
	count => $self -> item_count,
	name  => $name,
	type  => $type,
	value => $value,
};
$self -> items($araref);
What exactly is the structure of the hashrefs output by the parser?
Firstly, since the following text may be confusing, the very next item in this FAQ, "Annotated output", is designed to clarify things.
Also, it may be necessary to study data/*.log to fully grasp this structure.
Each hashref has these (key => value) pairs:
- o count => $integer
-
This is the position of the hashref within the array, counting from 1.
- o name => $string
-
-
If the type's value matches /^(attribute|tag)$/, then this is the tag name or attribute name from the SVG.
Note: The SAX parser used, XML::SAX, outputs these names with a '{}' prefix. The code strips this prefix.
However, for other items, where the '{...}' is not empty, the specific string is left intact. See data/utf8.01.log for this sample:
Item  Type       Name                                  Value
   1  tag        svg                                   open
   2  attribute  {http://www.w3.org/2000/xmlns/}xlink  http://www.w3.org/1999/xlink
...
You have been warned.
- o Parser-generated tokens
-
In the case that the current array element has been generated by parsing the value of an attribute, the name's value depends on the value of the type field.
In all such cases, the array contains a hashref with the type 'raw', and with the value being the attribute's original value.
The elements which follow the one of type 'raw' are the output of Marpa parsing the value.
-
- o type => $string
-
This key can take the following values:
- o attribute
-
This is an attribute for the most-recently opened tag.
The name and value fields are for an attribute which has not been specially parsed.
The next element in the array is necessarily another token from the SVG.
See raw for the other case (i.e. compared to attribute).
- o boolean
-
The value must be 0 or 1.
The name field in this case will be a counter of parameters for the preceding command (see next point).
- o command
-
The name field is the letter (Mm, ..., Zz) for the command itself. In these cases, the value is '-'.
Note: As of V 1.01, in the hashref returned by the action sub 'command', the value is actually an arrayref of the command's parameters. In V 1.00, the name was '-' and the value was the command letter. This change was made when I stopped pushing hashrefs onto a stack, and converted the return value of the sub from scalar to hashref.
- o content
-
This is the text content for the most recently opened, but still unclosed, tag. It may be the empty string. Likewise, it may contain any number of newlines, since it's copied faithfully from the input *.svg file.
It will actually be followed by an array element flagging the closing of the tag it belongs to.
- o float
-
Any float.
The name field in this case will be a counter of parameters for the preceding command.
- o integer
-
Any integer, but probably always 0, because of the way Marpa handles the BNF.
The name field in this case will be a counter of parameters for the preceding command.
- o raw
-
The name and value fields are for an attribute which has been specially parsed.
The next element in the array is necessarily not another token from the SVG.
Rather, the array elements following this one are output from the Marpa-based parse of the value in the current hashref's value key.
What this means is that if you are scanning the array, and detect a type of raw, all elements in the array (after this one), up to the next item of type =~ /^(attribute|content|raw|tag)$/, must be parameters output from the parse of the value in the current hashref's value key.
There is one exception to the claim that 'The next element in the array is necessarily not another token from the SVG.' Consider:
<polygon points="350,75 379,161 469,161 397,215 423,301 350,250 277,301 303,215 231,161 321,161z" />
The 'z' (which itself takes no parameters) at the end of the points is the last thing output for this tag, so the close tag item will be the next array element.
See attribute for the other case (i.e. compared to raw).
- o tag
-
The name and value fields are for a tag.
The name is the name of the tag, and the value is 'open' or 'close'.
- o value => $string
-
The interpretation of this string depends on the value of the type key. Basically:
In the case of tags, this string is either 'open' or 'close'.
In the case of attributes, it is the attribute's value.
In the case of parsed attributes, it is an SVG command or one of that command's parameters.
See the next FAQ item for details.
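The '{}' prefix handling described under the name key above can be sketched as follows. clean_name() is a hypothetical helper, not the module's own code. It assumes only what is stated above: an empty '{}' prefix is removed, and any non-empty '{...}' prefix is left intact.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (not part of the module): strip the empty '{}'
# prefix which XML::SAX puts on names, but leave '{...}' alone when
# the braces are not empty.

sub clean_name
{
	my($name) = @_;

	$name =~ s/^\{\}//;

	return $name;
}

for my $name ('{}svg', '{}stroke-width', '{http://www.w3.org/2000/xmlns/}xlink')
{
	print clean_name($name), "\n";
}
```

The first two names come out as 'svg' and 'stroke-width', while the third is printed unchanged, matching the sample from data/utf8.01.log.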
Annotated output
Here is a fragment of data/ellipse.02.svg:
<path d="M300,200 h-150 a150,150 0 1,0 150,-150 z"
fill="red" stroke="blue" stroke-width="5" />
And here is the output from the built-in reporting mechanism (see data/ellipse.02.log):
Item  Type       Name          Value
   1  tag        svg           open
...
  27  tag        path          open
  28  raw        d             M300,200 h-150 a150,150 0 1,0 150,-150 z
  29  command    M             -
  30  float      1             300
  31  float      2             200
  32  command    h             -
  33  float      1             -150
  34  command    a             -
  35  float      1             150
  36  float      2             150
  37  integer    3             0
  38  boolean    4             1
  39  boolean    5             0
  40  float      6             150
  41  float      7             -150
  42  command    z             -
  43  attribute  fill          red
  44  attribute  stroke        blue
  45  attribute  stroke-width  5
  46  content    path
  47  tag        path          close
...
  66  tag        svg           close
Let's go thru it:
- o Item 27 is the open tag for the path
-
Type: tag Name: path Value: open
- o Item 28 is the path's 1st attribute, 'd'
-
Type: raw Name: d Value: M300,200 h-150 a150,150 0 1,0 150,-150 z
But since the type is raw we know both that it's an attribute, and that it must be followed by the parsed output of that value.
Note: Attributes are reported in sorted order, but the parameters after parsing the attributes' values cannot be, because drawing the coordinates of the value is naturally order-dependent.
- o Item 29
-
Type: command Name: M Value: '-'
This in turn is followed by its respective parameters, if any.
Note: 'Z' and 'z' have no parameters.
- o Items 30 .. 31
-
Two floats. Commas are discarded in the parsing of all special values.
Also, you'll notice they are numbered for your convenience by the name key in their hashrefs.
- o Item 32
-
Type: command Name: h Value: '-'
- o Item 33
-
This is the float which belongs to 'h'.
- o Item 34
-
Type: command Name: a Value: '-'
- o Items 35 .. 41
-
The 7 parameters of the 'a' command. You'll notice the parser calls 0 an integer rather than a float. SVG does not care, and neither should you. But, since the code knows it is, it might as well tell you.
The two Boolean flags are picked up explicitly, and the code tells you that, too.
- o Item 42
-
Type: command Name: z Value: '-'
As stated, it has no following parameters.
- o Items 43 .. 46
-
The remaining attributes of the 'path' (items 43 .. 45), followed by its empty text content (item 46). None of these are treated specially.
- o Item 47 is the close tag for the path
-
Type: tag Name: path Value: close
And, yes, this does mean self-closing tags, such as 'path', have 2 items in the array, with values of 'open' and 'close'. This allows code scanning the array to know absolutely where the data for the tag finishes.
Why did you use XML::SAX::ParserFactory to parse the SVG?
I find the SAX mechanism for handling XML particularly easy to work with.
I did start with XML::Rules, a great module, for the debugging of the BNFs, but the problem is that too many tags shared attributes (see 'transform' etc above), which made the code awkward.
Also, that module triggers a callback for closing a tag before triggering the call to process the attributes defined by the opening of that tag. This adds yet more complexity.
How are file encodings handled?
See "Constructor and Initialization" for a discussion of setting the encoding for the input file. Normally this is not necessary. Both iso-8859-1 and utf-8 encoded test files 'just work'.
For output, scripts/parse.file.pl uses the pragma:
use open qw(:std :utf8); # Undeclared streams in UTF-8.
This is needed if reading files encoded in utf-8, such as data/utf8.01.svg, and at the same time trying to print the parsed results to the screen by calling "maxlevel([$string])" with $string set to info or debug.
Without this pragma, data/utf8.01.svg gives you the dread 'Wide character in print...' message.
The pragma is not in the module because it's global, and the end user's program may not want it at all.
Lastly, I have unilaterally set the utf8 attribute used by Log::Handler. This is harmless for non-utf-8 files, and is vital for data/utf8.01.svg and similar end-user files. It allows the log output (STDOUT) to be redirected. And indeed, this is what some of the tests do.
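For end-user programs which do not want the global pragma, one possible alternative is to put an encoding layer on STDOUT alone using Perl's built-in binmode(). This is only a sketch of that idea. It is not something the module or its scripts do, and the sample string is made up for the purpose.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Put a utf-8 layer on STDOUT only, instead of using the global pragma
# 'use open qw(:std :utf8)'. Other file handles are not affected.

binmode(STDOUT, ':encoding(UTF-8)');

# A made-up sample containing a character above 0xFF, which would
# otherwise trigger 'Wide character in print...'.

my($value) = "caf\x{e9} \x{263a}";

print "Value: $value\n";
```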
TODO
This lists some possibly nice-to-have items, but none of them are important:
- o Store BNFs in an array
-
This could be done by reading them once using Data::Section::Simple, in MarpaX::Languages::SVG::Parser::SAXHandler, and caching them, rather than re-reading them each time a BNF is required.
- o Re-write grammars to do left-recursion
-
Well, Jeffrey suggested this, but I don't have the skills (yet).
Machine-Readable Change Log
The file Changes was converted into Changelog.ini by Module::Metadata::Changes.
Version Numbers
Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.
Support
Email the author, or log a bug on RT:
https://rt.cpan.org/Public/Dist/Display.html?Name=MarpaX::Languages::SVG::Parser.
Credits
The BNFs are partially based on the W3C's SVG specs, and partially (for numbers) on 2 programs posted by Jean-Damien Durand to the Marpa Google group. The thread is titled 'Space (\s) problems with my grammar'.
Note: Some posts (as of 2013-10-16) in that thread can't be displayed. This may be a temporary issue. See scripts/float.pl and scripts/number.pl for Jean-Damien's original code, which were of considerable help to me.
Specifically, I use number.pl for integers and floats, with these adjustments:
- o The code did not handle negative numbers, but an optional sign was already defined, so that was easy
- o The code did not handle 0
- o The code included hex and octal and binary numbers, which I did not need
Author
MarpaX::Languages::SVG::Parser was written by Ron Savage <ron@savage.net.au> in 2013.
Home page: http://savage.net.au/.
Copyright
Australian copyright (c) 2013, Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License 2.0, a copy of which is available at:
http://www.opensource.org/licenses/index.html