NAME
RDF::RDFa::Parser - RDFa parser using XML::LibXML.
SYNOPSIS
use RDF::RDFa::Parser;
$parser = RDF::RDFa::Parser->new($xhtml, $baseuri);
$parser->consume;
$graph = $parser->graph;
VERSION
0.22
Note: version 0.20 introduced major incompatibilities with 0.0x and 0.1x.
PUBLIC METHODS
- $p = RDF::RDFa::Parser->new($xhtml, $baseuri, \%options, $storage)
-
This method creates a new RDF::RDFa::Parser object and returns it.
The $xhtml variable may contain an XHTML/XML string, or an XML::LibXML::Document. If a string, the document is parsed using XML::LibXML::Parser, which will throw an exception if it is not well-formed. RDF::RDFa::Parser does not catch the exception.
The base URI is used to resolve relative URIs found in the document.
Options (mostly booleans) [default in brackets]:
* alt_stylesheet - Magic rel="alternate stylesheet". [0]
* auto_config - See section "Auto Config" [0]
* embedded_rdfxml - Find plain RDF/XML chunks within document. [0] 0=no, 1=handle, 2=skip.
* full_uris - Support full URIs in CURIE-only attributes. [0]
* keywords - THIS WILL VOID YOUR WARRANTY!
* prefix_attr - Support @prefix rather than just @xmlns:*. [0]
* prefix_bare - Support CURIEs with no colon+suffix. [0]
* prefix_empty - URI for empty prefix. ['http://www.w3.org/1999/xhtml/vocab#']
* prefix_nocase - Ignore case-sensitivity of CURIE prefixes. [0]
* safe_anywhere - Allow Safe CURIEs in @rel/@rev/etc. [0]
* tdb_service - Use thing-described-by.org to name bnodes. [0]
* use_rtnlx - Use RDF::Trine::Node::Literal::XML. [0] 0=no, 1=if available.
* xhtml_base - Process <base> element. [1] 0=no, 1=yes, 2=use it for RDF/XML too
* xhtml_elements - Process <head> and <body> specially. [1]
* xhtml_lang - Support @lang rather than just @xml:lang. [0]
* xml_base - Support for 'xml:base' attribute. [0] 0=only RDF/XML; 1=except @href/@src; 2=always.
* xml_lang - Support for 'xml:lang' attribute. [1]
The default options attempt to stick to the XHTML+RDFa spec as rigidly as possible.
$storage is an RDF::Trine::Storage object. If undef, then a new temporary store is created.
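As a rough sketch of how the constructor arguments fit together, the following overrides two of the documented defaults and leaves the rest alone. The helper name parse_rdfa is hypothetical and only wraps the calls described in this document; the module is loaded lazily inside the helper, so the snippet compiles even where RDF::RDFa::Parser is not installed.

```perl
use strict;
use warnings;

# Options to change; anything not listed keeps its documented default.
my %options = (
    xhtml_lang  => 1,   # honour @lang as well as @xml:lang
    prefix_attr => 1,   # honour @prefix as well as @xmlns:*
);

# Hypothetical helper: parse a document and return the resulting model.
# $storage may be undef, in which case a temporary store is created.
sub parse_rdfa {
    my ($xhtml, $base_uri, $opts, $storage) = @_;
    require RDF::RDFa::Parser;   # loaded only when the helper is called
    my $parser = RDF::RDFa::Parser->new($xhtml, $base_uri, $opts, $storage);
    $parser->consume;
    return $parser->graph;
}

# Typical use:
#   my $model = parse_rdfa($xhtml, 'http://example.com/doc', \%options);
```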
- $p->xhtml
-
Returns the XHTML source of the document being parsed.
- $p->uri
-
Returns the base URI of the document being parsed. This will usually be the same as the base URI provided to the constructor, but may differ if the document contains a <base> HTML element.
Optionally it may be passed a parameter - an absolute or relative URI - in which case it returns the same URI which it was passed as a parameter, but as an absolute URI, resolved relative to the document's base URI.
This seems like two unrelated functions, but if you consider the consequence of passing a relative URI consisting of a zero-length string, it in fact makes sense.
- $p->dom
-
Returns the parsed XML::LibXML::Document.
- $p->set_callbacks(\&func1, \&func2)
-
Set callbacks for handling RDF triples extracted from the RDFa document. The first function is called when a triple is generated taking the form of (resource, resource, resource). The second function is called when a triple is generated taking the form of (resource, resource, literal).
The parameters passed to the first callback function are:
* A reference to the RDF::RDFa::Parser object
* A reference to the XML::LibXML element being parsed
* Subject URI or bnode
* Predicate URI
* Object URI or bnode
* Graph URI or bnode (if named graphs feature is enabled)
The parameters passed to the second callback function are:
* A reference to the RDF::RDFa::Parser object
* A reference to the XML::LibXML element being parsed
* Subject URI or bnode
* Predicate URI
* Object literal
* Datatype URI (possibly undef or '')
* Language (possibly undef or '')
* Graph URI or bnode (if named graphs feature is enabled)
In place of either or both functions you can use the string 'print' which sets the callback to a built-in function which prints the triples to STDOUT as Turtle. Either or both can be set to undef, in which case, no callback is called when a triple is found.
Beware that for literal callbacks, sometimes both a datatype *and* a language will be passed. (This goes beyond the normal RDF data model.)
set_callbacks must be used before consume.
IMPORTANT - CHANGED IN VERSION 0.20 - callback functions should return true if they wish to prevent the triple from being added to the parser's built-in model; false otherwise.
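The parameter order and the return-value convention can be sketched with a pair of callbacks like the ones below. The sub names and the two collecting arrays are hypothetical. The first callback returns false so its triples still reach the built-in model, while the second returns true to keep literal triples out of it.

```perl
use strict;
use warnings;

my @resource_triples;   # filled by the first callback
my @literal_triples;    # filled by the second callback

# Called for (resource, resource, resource) triples.
sub on_resource_triple {
    my ($parser, $element, $subject, $predicate, $object, $graph) = @_;
    push @resource_triples, [ $subject, $predicate, $object ];
    return 0;   # false: the triple is still added to the built-in model
}

# Called for (resource, resource, literal) triples.
sub on_literal_triple {
    my ($parser, $element, $subject, $predicate, $literal,
        $datatype, $language, $graph) = @_;
    # Datatype and language may each be undef or ''.
    my $tag = (defined $language && length $language) ? $language : '';
    push @literal_triples, [ $subject, $predicate, $literal, $tag ];
    return 1;   # true: keep this triple out of the built-in model
}

# Hypothetical wiring; the callbacks are registered before consume.
sub parse_with_callbacks {
    my ($xhtml, $base_uri) = @_;
    require RDF::RDFa::Parser;   # loaded only when the helper is called
    my $parser = RDF::RDFa::Parser->new($xhtml, $base_uri);
    $parser->set_callbacks(\&on_resource_triple, \&on_literal_triple);
    $parser->consume;
    return $parser->graph;
}
```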
- $p->named_graphs($xmlns, $attribute, $attributeType)
-
RDF::RDFa::Parser allows for one RDFa document to generate multiple graphs. A graph is created by enclosing it in an element with an attribute with XML namespace $xmlns and local name $attribute.
Each graph is given a URI - if $attributeType is the string 'id', then the URI is generated by treating the attribute like an 'id' attribute - i.e. the URI is the document's base URI, followed by a hash, followed by the attribute value. If $attributeType is the string 'about', then the URI is generated by treating the attribute like an 'about' attribute - i.e. it is treated as an absolute or relative URI, with safe CURIEs being allowed too. If the $attributeType is omitted, then the default behaviour is 'about'.
Calling this method with no parameters will disable the named graph feature. Named graphs are disabled by default.
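To make the 'id' rule above concrete, here is a small sketch. The function graph_uri_from_id only mirrors the rule as described (base URI, then a hash, then the attribute value) and is not the module's own code. The helper parse_named_graphs and the namespace and attribute name shown in the usage comment are likewise hypothetical.

```perl
use strict;
use warnings;

# Illustrative only: the 'id' rule as described in the text.
sub graph_uri_from_id {
    my ($base_uri, $attribute_value) = @_;
    return $base_uri . '#' . $attribute_value;
}

# Hypothetical helper: enable named graphs, parse, return all graphs.
sub parse_named_graphs {
    my ($xhtml, $base_uri, $xmlns, $attribute) = @_;
    require RDF::RDFa::Parser;   # loaded only when the helper is called
    my $parser = RDF::RDFa::Parser->new($xhtml, $base_uri);
    $parser->named_graphs($xmlns, $attribute, 'id');   # before consume
    $parser->consume;
    return $parser->graphs;   # hashref: graph name => RDF::Trine::Model
}

# Typical use, with an assumed namespace and attribute name:
#   my $graphs = parse_named_graphs(
#       $xhtml, 'http://example.com/doc',
#       'http://example.com/graphing#', 'graph');
```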
named_graphs must be used before consume.
- $p->thing_described_by(1)
-
RDF::RDFa::Parser has a feature that allows it to use thing-described-by.org to create URIs for some blank nodes. It is disabled by default. This function can be used to turn it on (1) or off (0). May be called without a parameter, which just returns the current status.
thing_described_by must be used before consume.
THIS FUNCTION IS DEPRECATED. PASS AN OPTION TO THE CONSTRUCTOR INSTEAD.
- $p->consume
-
The document is parsed for RDFa. Nothing of interest is returned by this function, but the triples extracted from the document are passed to the callbacks as each one is found.
- $p->graph( [ $graph_name ] )
-
Without a graph name, this method will return an RDF::Trine::Model object with all statements of the full graph. As per the RDFa specification, it will always return an unnamed graph containing all the triples of the RDFa document. If the model contains multiple graphs, all triples will be returned unless a graph name is specified.
It will also take an optional graph URI as argument, and return an RDF::Trine::Model tied to a temporary storage with all triples in that graph.
It makes sense to call consume before calling graph. Otherwise you'll just get an empty graph.
- $p->graphs
-
Will return a hashref of all named graphs, where the graph name is a key and the value is an RDF::Trine::Model tied to a temporary storage.
It makes sense to call consume before calling graphs. Otherwise you'll just get an empty hashref.
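One way to walk the hashref returned by graphs is sketched below. The helper name graph_sizes is hypothetical, and the sketch assumes each value responds to size, which RDF::Trine::Model provides for counting statements.

```perl
use strict;
use warnings;

# Hypothetical helper: given the hashref returned by $p->graphs, list
# each graph name with its statement count, sorted by graph name.
sub graph_sizes {
    my ($graphs) = @_;
    return map { [ $_, $graphs->{$_}->size ] } sort keys %$graphs;
}

# Typical use, after $p->consume:
#   printf "%s: %d triples\n", @$_ for graph_sizes($p->graphs);
```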
UTILITY METHOD
- RDF::RDFa::Parser::keywords();
-
Without any options, gets an empty structure for keywords. Passing additional strings adds certain bundles of predefined keywords to the structure.
my $keyword_structure = RDF::RDFa::Parser::keywords( 'xhtml', 'xfn', 'grddl');
A keyword structure may be provided as an option when creating a new RDF::RDFa::Parser object. You probably want to leave this alone unless you know what you're doing.
Bundles include: rdfa, html5, html4, html32, iana, grddl, xfn.
CONSTANTS
- RDF::RDFa::Parser::OPTS_XHTML
-
Suggested options hashref for parsing XHTML.
- RDF::RDFa::Parser::OPTS_HTML4
-
Suggested options hashref for parsing HTML 4.
- RDF::RDFa::Parser::OPTS_HTML5
-
Suggested options hashref for parsing HTML5.
- RDF::RDFa::Parser::OPTS_SVG
-
Suggested options hashref for parsing SVG.
- RDF::RDFa::Parser::OPTS_XML
-
Suggested options hashref for parsing generic XML.
AUTO CONFIG
RDF::RDFa::Parser has a lot of different options that can be switched on and off. Sometimes it might be useful to allow the page being parsed to control some of the options. If you switch on the 'auto_config' option, pages can do this.
A page can set options using a specially crafted <meta> tag:
<meta name="http://search.cpan.org/dist/RDF-RDFa-Parser/#auto_config"
content="xhtml_lang=1&keywords=rdfa+html5+html4+html32" />
Note that the content attribute is an application/x-www-form-urlencoded string (which must then be HTML-escaped of course). Semicolons may be used instead of ampersands, as these tend to look nicer:
<meta name="http://search.cpan.org/dist/RDF-RDFa-Parser/#auto_config"
content="xhtml_lang=1;keywords=rdfa+html5+html4+html32" />
Any option allowed in the constructor may be given using auto config, except 'use_rtnlx', and of course 'auto_config' itself. As named graphs cannot currently be configured using the constructor, they are also not supported with auto config.
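To show how such a content string breaks down into options, here is a minimal decoder. It is an illustration written for this document, not the module's own parsing code, and the name decode_auto_config is hypothetical. It splits on ampersands or semicolons, undoes the form encoding, and drops the two options that pages are not allowed to set.

```perl
use strict;
use warnings;

# Illustrative decoder, not the module's own code.
sub decode_auto_config {
    my ($content) = @_;
    my %options;
    for my $pair (split /[&;]/, $content) {
        next unless length $pair;
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                               # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX encodes a byte
        }
        # Options that may not be set from within the page.
        next if $name eq 'use_rtnlx' || $name eq 'auto_config';
        $options{$name} = $value;
    }
    return \%options;
}

my $opts = decode_auto_config('xhtml_lang=1;keywords=rdfa+html5+html4+html32');
# $opts->{xhtml_lang} is '1'
# $opts->{keywords}   is 'rdfa html5 html4 html32'
```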
BUGS
RDF::RDFa::Parser 0.21 passed all approved tests in the XHTML+RDFa test suite at the time of its release.
RDF::RDFa::Parser 0.22 (used in conjunction with HTML::HTML5::Parser 0.01 and HTML::HTML5::Sanity 0.01) additionally passes all approved tests in the HTML4+RDFa and HTML5+RDFa test suites at the time of its release; except test cases 0113 and 0121, which the author of this module believes mandate incorrect HTML parsing.
Please report any bugs to http://rt.cpan.org/.
Common gotchas:
Is your XML well-formed?
Despite having several options for dealing with HTML+RDFa, this package uses a strict XML parser. If you need to deal with tag soup, you'll need to parse it into an XML::LibXML::Document yourself (e.g. using HTML::HTML5::Parser) and then pass the XML::LibXML::Document to this package's constructor function.
Are your namespaces set correctly?
Does your document have 'xmlns="http://www.w3.org/1999/xhtml"' on the root element? If not, some aspects of this package's behaviour may be unexpected. If you parsed the document using HTML::HTML5::Parser you may need to run it through HTML::HTML5::Sanity.
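The tag soup workaround described above might look roughly like this. The helper name parse_tag_soup is hypothetical. The sketch assumes HTML::HTML5::Parser's parse_string method returns an XML::LibXML::Document, and that the OPTS_HTML5 constant can be called as a function to obtain the suggested options hashref. Depending on the document, the DOM may also need a pass through HTML::HTML5::Sanity, which is not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: parse tag-soup HTML first, then extract RDFa.
sub parse_tag_soup {
    my ($html, $base_uri) = @_;

    # Loaded lazily so this file compiles without the modules installed.
    require HTML::HTML5::Parser;
    require RDF::RDFa::Parser;

    # HTML::HTML5::Parser tolerates malformed markup and returns a DOM.
    my $dom = HTML::HTML5::Parser->new->parse_string($html);

    # Pass the DOM (not the string) plus the suggested HTML5 options.
    my $parser = RDF::RDFa::Parser->new(
        $dom, $base_uri, RDF::RDFa::Parser::OPTS_HTML5());
    $parser->consume;
    return $parser->graph;
}
```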
SEE ALSO
XML::LibXML, RDF::Trine, HTML::HTML5::Parser, HTML::HTML5::Sanity.
AUTHOR
Toby Inkster <tobyink@cpan.org> with contributions from Kjetil Kjernsmo <kjetilk@cpan.org>.
COPYRIGHT
Copyright 2008, 2009 Toby Inkster
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.