NAME
HTML::DublinCore - Extract Dublin Core metadata from HTML
SYNOPSIS
use HTML::DublinCore;
## pass HTML to constructor
my $dc = HTML::DublinCore->new( $html );
## get the title element and print its content
my $title = $dc->element( 'Title' );
print "title: ", $title->content(), "\n";
## get the same title content in one step
print "title: ", $dc->element( 'Title' )->content(), "\n";
## list context will retrieve all of a particular element
foreach my $element ( $dc->element( 'Creator' ) ) {
    print "creator: ", $element->content(), "\n";
}
## qualified dublin core
my $creation = $dc->element( 'Date.created' )->content();
DESCRIPTION
HTML::DublinCore is a module for easily extracting Dublin Core metadata that is embedded in HTML documents. The Dublin Core is a small set of metadata elements for describing information resources. Dublin Core is typically stored in the <HEAD> of an HTML document using the <META> tag. For more information on embedding Dublin Core in HTML, see RFC 2731 http://www.ietf.org/rfc/rfc2731
HTML::DublinCore lets you extract and work with the Dublin Core metadata found in a particular HTML document. For definitions of the individual Dublin Core elements, see http://www.dublincore.org/documents/dces/
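As a rough illustration of the RFC 2731 convention, the sketch below embeds a few Dublin Core <META> tags in a sample page and reads them back. The sample HTML and its values are invented for this example, and the parse step is guarded so the script still runs where HTML::DublinCore is not installed.

```perl
use strict;
use warnings;

# Invented sample page: element names carry a "DC." prefix in the name
# attribute, and a qualifier follows after a further dot.
my $html = <<'HTML';
<html>
<head>
<title>Sample page</title>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
<meta name="DC.Title" content="A Sample Page">
<meta name="DC.Creator" content="First Author">
<meta name="DC.Creator" content="Second Author">
<meta name="DC.Date.created" content="2003-05-01">
</head>
<body></body>
</html>
HTML

# Load the module at run time so that a missing installation is reported
# with a warning instead of a compile-time failure.
if ( eval { require HTML::DublinCore; 1 } ) {
    my $dc = HTML::DublinCore->new( $html );
    print "title: ", $dc->element( 'Title' )->content(), "\n";
    print "creator: ", $_->content(), "\n" for $dc->element( 'Creator' );
    print "created: ", $dc->element( 'Date.created' )->content(), "\n";
}
else {
    warn "HTML::DublinCore is not installed; skipping the parse step\n";
}
```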
METHODS
new()
Constructor. Pass it the HTML content you want to extract metadata from.
$dc = HTML::DublinCore->new( $html );
element()
Returns the HTML::DublinCore::Element object(s) matching the element name you pass in. In scalar context element() returns the first matching element found; in list context it returns all matching elements (since Dublin Core elements are repeatable).
## create HTML::DublinCore object from HTML
my $dc = HTML::DublinCore->new( $html );
## retrieve first title element
my $element = $dc->element( 'Title' );
my $title = $element->content();
## shorthand object chaining to extract element content
$title = $dc->element( 'Title' )->content();
## retrieve all creator elements
my @creators = $dc->element( 'Creator' );
You can also retrieve qualified elements in a similar fashion.
my $date = $dc->element( 'Date.created' )->content();
To facilitate chaining, element() returns an empty HTML::DublinCore::Element object when the requested element does not exist.
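The context-sensitive return and the empty-object fallback described for element() can be sketched in plain Perl with wantarray. The Sketch::Element and Sketch::Record classes below are hypothetical stand-ins written for this illustration; they are not the module's source code, and the real implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an element: just a name and some content.
package Sketch::Element;
sub new     { my ( $class, %args ) = @_; return bless {%args}, $class }
sub content { return $_[0]{content} }    # undef for an empty element

# Hypothetical stand-in for the record that holds the elements.
package Sketch::Record;
sub new {
    my ( $class, @elements ) = @_;
    return bless { elements => [@elements] }, $class;
}

sub element {
    my ( $self, $name ) = @_;
    my @found = grep { $_->{name} eq $name } @{ $self->{elements} };

    # List context: hand back every match.
    return @found if wantarray;

    # Scalar context: first match, or an empty object so that a chained
    # ->content() call does not die on an undefined value.
    return $found[0] || Sketch::Element->new();
}

package main;

my $record = Sketch::Record->new(
    Sketch::Element->new( name => 'Creator', content => 'Ada' ),
    Sketch::Element->new( name => 'Creator', content => 'Grace' ),
);

my $first = $record->element('Creator');             # scalar context
my @all   = $record->element('Creator');             # list context
my $none  = $record->element('Title')->content();    # safe chain
```

In this sketch the content of an empty element is undef, so callers that print it under warnings may still want to check for definedness first.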
elements()
Returns all the Dublin Core elements found as HTML::DublinCore::Element objects which you can then manipulate further.
my $dc = HTML::DublinCore->new( $html );
foreach my $element ( $dc->elements() ) {
    print "name=", $element->name(), "\n";
    print "content=", $element->content(), "\n";
}
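One way to manipulate the result of elements() further is to group the content values by element name. The helper below is a minimal sketch: group_by_name is a hypothetical name introduced here, and it relies only on the elements(), name(), and content() methods shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference that maps each element
# name to an array reference of the content values found under that name.
# $dc can be any object that provides elements(), such as the one returned
# by HTML::DublinCore->new().
sub group_by_name {
    my ($dc) = @_;
    my %by_name;
    foreach my $element ( $dc->elements() ) {
        push @{ $by_name{ $element->name() } }, $element->content();
    }
    return \%by_name;
}
```

It could then be called as group_by_name( HTML::DublinCore->new( $html ) ), with the keys of the returned hash being whatever name() reports for each element.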
title()
Returns an HTML::DublinCore::Element object for the title element. You can then retrieve its content, qualifier, scheme, and language like so:
my $dc = HTML::DublinCore->new( $html );
my $title = $dc->title();
print "content: ",$title->content(),"\n";
print "qualifier: ",$title->qualifier(),"\n";
print "scheme: ",$title->scheme(),"\n";
print "language: ",$title->language(),"\n";
Since there can be multiple instances of a particular element type (title, creator, subject, etc.), you can retrieve all of the title elements by calling title() in list context.
my @titles = $dc->title();
foreach my $title ( @titles ) {
    print "title: ", $title->content(), "\n";
}
creator()
Retrieve creator information in the same manner as title().
subject()
Retrieve subject information in the same manner as title().
description()
Retrieve description information in the same manner as title().
publisher()
Retrieve publisher information in the same manner as title().
contributor()
Retrieve contributor information in the same manner as title().
date()
Retrieve date information in the same manner as title().
type()
Retrieve type information in the same manner as title().
format()
Retrieve format information in the same manner as title().
identifier()
Retrieve identifier information in the same manner as title().
source()
Retrieve source information in the same manner as title().
language()
Retrieve language information in the same manner as title().
relation()
Retrieve relation information in the same manner as title().
coverage()
Retrieve coverage information in the same manner as title().
rights()
Retrieve rights information in the same manner as title().
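Because the fifteen accessors share one calling convention, they can be driven from a list of method names. The sketch below uses a hypothetical helper, dc_summary, and assumes that each accessor returns an empty list in list context when the document has no element of that type.

```perl
use strict;
use warnings;

# The fifteen accessor names documented above.
my @DC_ACCESSORS = qw(
    title creator subject description publisher contributor date type
    format identifier source language relation coverage rights
);

# Hypothetical helper: returns one "accessor: content" string for every
# element reachable through the convenience accessors of $dc.
sub dc_summary {
    my ($dc) = @_;
    my @lines;
    foreach my $accessor (@DC_ACCESSORS) {
        # Calling a method by name; the foreach supplies list context,
        # so repeated elements are all visited.
        foreach my $element ( $dc->$accessor() ) {
            push @lines, "$accessor: " . $element->content();
        }
    }
    return @lines;
}
```

With a parsed document in $dc, print "$_\n" for dc_summary( $dc ) would list the metadata in the order of the accessor names.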
asHtml()
Serialize your Dublin Core metadata as HTML <META> tags.
print $dc->asHtml();
TODO
More comprehensive tests.
Handle HTML entities properly.
Collect error messages so they can be reported out of the object.
SEE ALSO
HTML::DublinCore::Element
Dublin Core http://www.dublincore.org/
RFC 2731 http://www.ietf.org/rfc/rfc2731
HTML::Parser
perl4lib http://www.rice.edu/perl4lib
AUTHOR
Ed Summers <ehs@pobox.com>
COPYRIGHT AND LICENSE
Copyright 2003 by Ed Summers
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.