NAME
Genealogy::Gedcom::Reader::Lexer - An OS-independent lexer for GEDCOM data
Synopsis
Run scripts/lex.pl -help.
A typical run would be:
perl -Ilib scripts/lex.pl -i data/royal.ged -r 1 -s 1
Turn on debugging prints with:
perl -Ilib scripts/lex.pl -i data/royal.ged -r 1 -s 1 -max debug
royal.ged was downloaded from http://www.vjet.f2s.com/ftree/download.html. It's more up-to-date than the one shipped with Gedcom.
Various sample GEDCOM files may be found in the data/ directory in the distro.
Description
Genealogy::Gedcom::Reader::Lexer provides a lexer for GEDCOM data.
See the GEDCOM Specification Ged551-5.pdf.
Installation
Install Genealogy::Gedcom as you would for any Perl module:
Run:
cpanm Genealogy::Gedcom
or run:
sudo cpan Genealogy::Gedcom
or unpack the distro, and then either:
perl Build.PL
./Build
./Build test
sudo ./Build install
or:
perl Makefile.PL
make (or dmake or nmake)
make test
make install
Constructor and Initialization
new() is called as my($lexer) = Genealogy::Gedcom::Reader::Lexer -> new(k1 => v1, k2 => v2, ...).
It returns a new object of type Genealogy::Gedcom::Reader::Lexer.
Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. input_file()]):
- o input_file => $gedcom_file_name
-
Read the GEDCOM data from this file.
Default: ''.
- o locale => $a_locale_name
-
Specify the locale for DateTime objects.
Default: 'en_AU'.
- o logger => $logger_object
-
Specify a logger object.
To disable logging, just set logger to the empty string.
Default: An object of type Log::Handler.
- o maxlevel => $level
-
This option is only used if the lexer creates an object of type Log::Handler. See Log::Handler::Levels.
Default: 'info'.
Log levels are, from highest (i.e. most output) to lowest: 'debug', 'info', 'warning', 'error'. No lower levels are used.
- o minlevel => $level
-
This option is only used if the lexer creates an object of type Log::Handler. See Log::Handler::Levels.
Default: 'error'.
- o report_items => $Boolean
-
- o 0 => Report nothing
- o 1 => Call "report()" to report, via the log, the items recognized by the lexer
-
This output is at log level 'info'.
Default: 0.
- o strict => $Boolean
-
Specifies lax or strict string length checking during validation.
- o 0 => String lengths can be 0, allowing blank NOTE etc records.
- o 1 => String lengths must be > 0, as per the GEDCOM Specification Ged551-5.pdf.
-
Note: A string of length 1 - e.g. '0' - might still be an error.
Default: 0.
The upper lengths on strings are always as per the GEDCOM Specification Ged551-5.pdf. See "get_max_length($id, $line)" for details.
String lengths out of range (as with all validation failures) are reported as log messages at level 'warning'.
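The way 'strict' interacts with the length limits can be pictured with a small sketch. The helper length_ok() and the maximum of 120 are assumptions made for illustration only; they are not part of the module, and the real limits come from "get_max_length($id, $line)".

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Genealogy::Gedcom::Reader::Lexer.
# $min plays the role of the 'strict' option (0 or 1).
# $max stands in for whatever get_max_length() would return.
sub length_ok
{
	my($data, $min, $max) = @_;
	my($length)           = length($data);

	return ($length >= $min) && ($length <= $max) ? 1 : 0;
}

my($max) = 120; # Illustrative value only.

print 'Lax, blank data:    ', length_ok('', 0, $max), "\n";
print 'Strict, blank data: ', length_ok('', 1, $max), "\n";
print 'Lax, 121 chars:     ', length_ok('x' x 121, 0, $max), "\n";
```

In the lexer itself a failed check is not fatal; it is logged at level 'warning'.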
Methods
check_date($id, $line)
Checks the date field in the input arrayref $line, $$line[4].
$id identifies what type of record the $line is expected to be.
check_length($id, $line)
Checks the length of the data component (after the tag) on the input arrayref $line, $$line[4].
$id identifies what type of record the $line is expected to be.
cross_check_xrefs
Ensure that all xrefs point to existing records.
See "What validation is performed?" in FAQ for details.
get_gedcom_from_file()
If the caller has requested GEDCOM data be read from a file, with the input_file option to new(), this method reads that file.
Called as appropriate by "run()", if you do not supply data with "gedcom_data([$gedcom_data])".
gedcom_data([$gedcom_data])
The [] indicate an optional parameter.
Get or set the arrayref of GEDCOM records to be processed.
This is normally only used internally, but can be used to bypass reading from a file.
Note: If supplying data this way rather than via the file, you must strip newlines etc on every line, as well as leading and trailing blanks.
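A minimal sketch of that clean-up step follows. The helper clean_lines() is hypothetical (it is not provided by the module); it only shows one way to strip line endings and surrounding blanks before handing an arrayref to gedcom_data().

```perl
use strict;
use warnings;

# Hypothetical helper: prepare raw lines for gedcom_data().
sub clean_lines
{
	my(@raw) = @_; # A copy, so the caller's data is not modified.
	my(@clean);

	for my $line (@raw)
	{
		$line =~ s/[\r\n]+$//; # Strip the line ending, whatever the OS.
		$line =~ s/^\s+//;     # Strip leading blanks.
		$line =~ s/\s+$//;     # Strip trailing blanks.

		push @clean, $line;
	}

	return [@clean];
}

my($data) = clean_lines("0 HEAD\r\n", "  1 SOUR Example  \n", "0 TRLR\n");

print "[$_]\n" for @$data;

# The arrayref could then be supplied with $lexer -> gedcom_data($data).
```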
get_max_length($id, $line)
Get the maximum string length of the data component (after the tag) on the given $line.
$id identifies what type of record the $line is expected to be.
get_min_length($id, $line)
Get the minimum string length of the data component (after the tag) on the given $line.
Currently, this value is actually the value of strict(), i.e. 0 or 1.
$id identifies what type of record the $line is expected to be.
input_file([$gedcom_file_name])
Here, the [] indicate an optional parameter.
Get or set the name of the file to read the GEDCOM data from.
items()
Returns an object of type Set::Array, which is an arrayref of items output by the lexer.
See the "FAQ" for details.
locale([$a_locale_name])
Here, the [] indicate an optional parameter.
Get or set the name of the locale to use for DateTime objects.
log($level, $s)
Calls $self -> logger -> $level($s).
logger([$logger_object])
Here, the [] indicate an optional parameter.
Get or set the logger object.
To disable logging, just set logger to the empty string.
maxlevel([$string])
Here, the [] indicate an optional parameter.
Get or set the value used by the logger object.
This option is only used if the lexer creates an object of type Log::Handler. See Log::Handler::Levels.
minlevel([$string])
Here, the [] indicate an optional parameter.
Get or set the value used by the logger object.
This option is only used if the lexer creates an object of type Log::Handler. See Log::Handler::Levels.
push_item($line, $type)
Pushes a hashref of components of the $line, with type $type, onto the arrayref of items returned by "items()".
See the "FAQ" for details.
renumber_items()
Scan the arrayref of hashrefs returned by items() and ensure each 'count' field is correct, i.e. equal to the array index + 1.
This is done in case array elements have been combined, e.g. when processing CONCs and CONTs for NOTEs.
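The idea can be sketched in a few lines. The @items array below is a hypothetical stand-in for the lexer's items, not data produced by the module.

```perl
use strict;
use warnings;

# Hypothetical items, as they might look after two elements were combined.
my(@items) =
(
	{count => 1, tag => 'HEAD'},
	{count => 2, tag => 'NOTE'},
	{count => 4, tag => 'TRLR'}, # Gap left where item 3 was merged into item 2.
);

# Reset each count to array index + 1.
$items[$_]{count} = $_ + 1 for 0 .. $#items;

print join(', ', map{"$$_{tag} => $$_{count}"} @items), "\n";
```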
report()
Report, via the log, the list of items recognized by the lexer.
report_items([0 or 1])
The [] indicate an optional parameter.
Get or set the value which determines whether or not to report the items recognised by the lexer.
run()
This is the only method the caller needs to call. All parameters are supplied to new(), or via previous calls to various methods.
Returns 0 for success and 1 for failure.
strict([0 or 1])
The [] indicate an optional parameter.
Get or set the value which determines whether 0 or 1 is used as the minimum string length.
FAQ
How are user-defined tags handled?
In the same way as GEDCOM tags.
They are identified by a leading '_', and otherwise follow the same syntax as GEDCOM tags. That is:
- o At level 0, they match /(_?(?:[A-Z]{3,4}))/.
- o At level > 0, they match /(_?(?:ADR[123]|[A-Z]{3,5}))/.
Each user-defined tag is stand-alone, meaning they can't be extended with CONC or CONT tags in the way some GEDCOM tags can.
See data/sample.4.ged.
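The two patterns can be tried out directly. In this sketch they are anchored with ^ and $ so that the whole token must match; that anchoring, and the helper name tag_matches(), are assumptions of the example rather than a statement about the module's internals.

```perl
use strict;
use warnings;

# The patterns quoted above, anchored for this sketch.
my($level_0_tag) = qr/^(_?(?:[A-Z]{3,4}))$/;
my($level_n_tag) = qr/^(_?(?:ADR[123]|[A-Z]{3,5}))$/;

# Hypothetical helper: does $tag look valid at the given level?
sub tag_matches
{
	my($level, $tag) = @_;

	return $tag =~ ($level == 0 ? $level_0_tag : $level_n_tag) ? 1 : 0;
}

for my $test ([0, 'INDI'], [0, '_LOC'], [0, 'ADR1'], [1, 'ADR1'], [1, '_MILT'], [1, '_PRIMARY'])
{
	printf "Level %d, tag %-8s => %d\n", @$test, tag_matches(@$test);
}
```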
How are CONC and CONT tags handled?
Nothing is done with them, meaning e.g. text flowing from a NOTE (say) onto a CONC or CONT is not concatenated.
Currently then, even GEDCOM tags are stand-alone.
How is the lexed data stored in RAM?
Items are stored in an arrayref. This arrayref is available via the "items()" method.
This method returns the same data as does "items()" in Genealogy::Gedcom::Reader.
Each element in the array is a hashref of the form:
{
count => $n,
data => $a_string,
level => $n,
line_count => $n,
tag => $a_tag,
type => $a_string,
xref => $a_string,
}
Key-value pairs are:
- o count => $n
-
Items are numbered from 1 up, so this is the array index + 1.
Note: Blank lines in the input file are skipped.
- o data => $a_string
-
This is any data associated with the tag.
Given the GEDCOM record:
1 NAME Given Name /Surname/
then data will be 'Given Name /Surname/', i.e. the text after the tag.
Given the GEDCOM record:
1 SUBM @SUBM1@
then data will be 'SUBM1'.
As with xref (below), the '@' characters are stripped.
- o level => $n
-
This is the level from the GEDCOM data.
- o line_count => $n
-
This is the line number from the GEDCOM data.
- o tag => $a_tag
-
This is the GEDCOM tag.
- o type => $a_string
-
This is a string indicating what broad class the tag refers to. Values:
- o (Empty string)
-
Used for various cases.
- o Address
- o Concat
- o Continue
- o Date
-
If the type is 'Date', then it has been successfully parsed.
If parsing failed, the value will be 'Invalid date'.
- o Event
- o Family
- o File name
- o Header
- o Individual
- o Invalid date
-
This value means the lexer tried to parse the date and failed.
If parsing had succeeded, the value would be 'Date'.
- o Link to FAM
- o Link to INDI
- o Link to OBJE
- o Link to SUBM
- o Multimedia
- o Note
- o Place
- o Repository
- o Source
- o Submission
- o Submitter
- o Trailer
- o xref => $a_string
-
Given the GEDCOM record:
0 @I82@ INDI
then xref will be 'I82'.
As with data (above), the '@' characters are stripped.
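To make the hashref layout concrete, here is a rough sketch which splits one GEDCOM line into those fields. The function line_to_item() and its regexp are hypothetical, not the module's own parser. The 'type' is left empty because it is assigned by the lexer's tag handling, and 'count' is passed in because blank lines are skipped.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own parser.
sub line_to_item
{
	my($line, $count, $line_count) = @_;

	# Level, optional @xref@, tag, optional data.
	my($level, $xref, $tag, $data) = $line =~ /^(\d+)\s+(?:\@([^\@]+)\@\s+)?(\w+)(?:\s+(.*))?$/
		or die "Syntax error on line $line_count: $line\n";

	$xref = '' if (! defined $xref);
	$data = '' if (! defined $data);
	$data =~ s/^\@(.+)\@$/$1/; # Strip '@' from a pointer such as @SUBM1@.

	my(%item) =
	(
		count      => $count,
		data       => $data,
		level      => $level,
		line_count => $line_count,
		tag        => $tag,
		type       => '', # Decided by the lexer, so left empty here.
		xref       => $xref,
	);

	return \%item;
}

my($count) = 0;

for my $line ('0 @I82@ INDI', '1 NAME Given Name /Surname/', '1 SUBM @SUBM1@')
{
	$count++;

	my($item) = line_to_item($line, $count, $count);

	print join(', ', map{"$_ => '$$item{$_}'"} qw/level xref tag data/), "\n";
}
```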
What validation is performed?
There is no perfect answer as to what should be a warning and what should be an error.
So, the author's philosophy is that unrecoverable states are errors, and the code calls 'die'. See "Under what circumstances does the code call 'die'?".
Consequently, the log level 'error' is not used. All validation failures are logged at level 'warning', leaving interpretation up to the user. See "How does logging work?".
Details:
- o Cross-references
-
Each xref (pointer) is checked to ensure it points to an xref which exists. Each dangling xref is only reported once.
- o Dates are validated
- o Duplicate xrefs
-
Xrefs which are (potentially) pointed to are checked for uniqueness.
- o String lengths
-
Maximum string lengths are checked as per the GEDCOM Specification Ged551-5.pdf.
Minimum string lengths are checked as per the value of the 'strict' option to new().
- o Strict 'v' Mandatory
-
Validation is mandatory, even with the 'strict' option set to 0. 'strict' only affects the minimum string length acceptable.
- o Tag nesting
-
Tag nesting is validated by the mechanism of nested method calls, with each method (called tag_*) knowing what tags it handles, and with each nested call handling its own tags.
This process starts with the call to tag_lineage(0, $line) in method "run()".
-
The lexer reports the first unexpected tag, i.e. one which is not a GEDCOM tag and does not start with '_'.
All validation failures are reported as log messages at level 'warning'.
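The cross-reference and duplicate checks can be sketched as two passes over the items. This is an illustration only: the sample items are invented, and the assumption that every pointer is an item whose type starts with 'Link to ' and whose target is held in 'data' is mine, not something the module guarantees.

```perl
use strict;
use warnings;

# Hypothetical items in the form described under "How is the lexed data stored in RAM?".
my(@items) =
(
	{tag => 'INDI', type => 'Individual',   xref => 'I1', data => ''},
	{tag => 'FAMS', type => 'Link to FAM',  xref => '',   data => 'F1'},
	{tag => 'FAMC', type => 'Link to FAM',  xref => '',   data => 'F9'},
	{tag => 'FAM',  type => 'Family',       xref => 'F1', data => ''},
	{tag => 'CHIL', type => 'Link to INDI', xref => '',   data => 'I1'},
	{tag => 'HUSB', type => 'Link to INDI', xref => '',   data => 'I7'},
	{tag => 'WIFE', type => 'Link to INDI', xref => '',   data => 'I7'},
	{tag => 'INDI', type => 'Individual',   xref => 'I1', data => ''},
);

my(%defined, %duplicate, %dangling);

# Pass 1: collect the xrefs which records define, noting any seen twice.
for my $item (grep{length $$_{xref} } @items)
{
	$duplicate{$$item{xref} } = 1 if ($defined{$$item{xref} }++);
}

# Pass 2: every pointer must match a defined xref.
# Using a hash means each dangling xref is reported once only.
for my $item (grep{$$_{type} =~ /^Link to /} @items)
{
	$dangling{$$item{data} } = 1 if (! $defined{$$item{data} });
}

print "Dangling xref: $_\n" for sort keys %dangling;
print "Duplicate xref: $_\n" for sort keys %duplicate;
```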
What other validation is planned?
Here are some suggestions from the mailing list:
- o Mandatory sub-tags
-
This means check that each tag has all its mandatory sub-tags.
- o Natural (not step-) parent must be older than child
- o Prior art
-
Many such checks are possible. E.g. Attribute type (p 43 of GEDCOM Specification) must be one of: CAST | EDUC | NATI | OCCU | PROP | RELI | RESI | TITL | FACT.
What other features are planned?
Here are some suggestions from the mailing list:
How does logging work?
- o Debugging
-
When new() is called as new(maxlevel => 'debug'), each method entry is logged at level 'debug'.
This has the effect of tracing all code which processes tags.
Since the default value of 'maxlevel' is 'info', all this output is suppressed by default. Such output is mainly for the author's benefit.
- o Log levels
-
Log levels are, from highest (i.e. most output) to lowest: 'debug', 'info', 'warning', 'error'. No lower levels are used. See Log::Handler::Levels.
'maxlevel' defaults to 'info' and 'minlevel' defaults to 'error'. In this way, levels 'info' and 'warning' are reported by default.
Currently, level 'error' is not used. Fatal errors cause 'die' to be called, since they are unrecoverable. See "Under what circumstances does the code call 'die'?".
- o Reporting
-
When new() is called as new(report_items => 1), the items are logged at level 'info'.
- o Validation failures
-
These are reported at level 'warning'.
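The effect of 'minlevel' and 'maxlevel' can be illustrated with a tiny filter. The ranking in %rank and the helper is_reported() are assumptions of this sketch; they cover only the four levels named above and do not reproduce Log::Handler's internal numbering.

```perl
use strict;
use warnings;

# Rank the four levels named above; a higher rank means more output.
my(%rank) = (error => 1, warning => 2, info => 3, debug => 4);

# Hypothetical helper: would a message at $level be output?
sub is_reported
{
	my($level, $minlevel, $maxlevel) = @_;

	return ($rank{$level} >= $rank{$minlevel}) && ($rank{$level} <= $rank{$maxlevel}) ? 1 : 0;
}

for my $level (qw/debug info warning error/)
{
	printf "%-7s default: %d, with maxlevel => 'debug': %d\n",
		$level, is_reported($level, 'error', 'info'), is_reported($level, 'error', 'debug');
}
```

With the defaults, 'debug' is the only level filtered out; 'error' would pass the filter, but the lexer never logs at that level.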
Under what circumstances does the code call 'die'?
- o When there is a typo in the field name passed in to check_length()
-
This is a programming error.
- o When an input file is not specified
-
This is a user (run time) error.
- o When there is a syntax error in a GEDCOM record
-
This is a user (data preparation) error.
How do I change the version of the GEDCOM grammar supported?
By sub-classing.
What file charsets are supported?
ASCII - i.e. nothing else has been tested.
The code really ought to support ANSEL (a superset of ASCII), ASCII, UTF-8 and UTF-16 (known to GEDCOM as UNICODE).
TODO
- o Test input file for binary
- o Test input file for non-ASCII character sets
- o Test input file for size 0
- o Tighten validation
Machine-Readable Change Log
The file CHANGES was converted into Changelog.ini by Module::Metadata::Changes.
Version Numbers
Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.
References
- o The original Perl Gedcom
- o GEDCOM
-
- o http://www.tamurajones.net/FTWTEXT.xhtml
-
This is apparently the worst offender she's seen. Search that page for 'tags'.
- o http://www.tamurajones.net/GenoPro2011.xhtml
- o http://www.tamurajones.net/GenoPro2007.xhtml
- o http://www.tamurajones.net/TheFTWTEXTProblem.xhtml
- o Other articles on Tamura's site
- o Other projects
-
Many of these are discussed on Tamura's site.
- o http://bettergedcom.wikispaces.com/
- o http://www.ngsgenealogy.org/cs/GenTech_Projects
- o http://gdmxml.fugal.net/
- o http://www.cosoft.org/genxml/
- o http://www.sunflower.com/~billk/GEDC/
- o http://ancestorsnow.blogspot.com/2011/07/vged.html
- o http://www.tamurajones.net/GEDCOMValidation.xhtml
- o http://webtrees.net/
- o http://swoodbridge.com/Genealogy/lifelines/
- o http://deadendssoftware.blogspot.com/
- o http://www.legacyfamilytree.com/
- o https://devnet.familysearch.org/docs/api-overview
The Gedcom Mailing List
Contact perl-gedcom-help@perl.org.
Support
Email the author, or log a bug on RT:
https://rt.cpan.org/Public/Dist/Display.html?Name=Genealogy::Gedcom.
Author
Genealogy::Gedcom::Reader::Lexer was written by Ron Savage <ron@savage.net.au> in 2011.
Home page: http://savage.net.au/index.html.
Copyright
Australian copyright (c) 2011, Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License, a copy of which is available at:
http://www.opensource.org/licenses/index.html