NAME
PDF::Data - Manipulate PDF files and objects as data structures
VERSION
version v1.2.0
SYNOPSIS
use PDF::Data;
DESCRIPTION
This module can read and write PDF files, and represents PDF objects as
data structures that can be readily manipulated.
METHODS
new
my $pdf = PDF::Data->new(-compress => 1, -minify => 1);
Constructor to create an empty PDF::Data object instance. Any arguments
passed to the constructor are treated as key/value pairs, and included
in the $pdf hash object returned from the constructor. When the PDF
file data is generated, this hash is written to the PDF file as the
trailer dictionary. However, hash keys starting with "-" are ignored
when writing the PDF file, as they are considered to be flags or
metadata.
For example, $pdf->{-compress} is a flag which controls whether or not
streams will be compressed when generating PDF file data. This flag can
be set in the constructor (as shown above), or set directly on the
object.
The $pdf->{-minify} flag controls whether or not to save space in the
generated PDF file data by removing comments and extra whitespace from
content streams. This flag can be used along with $pdf->{-compress} to
make the generated PDF file data even smaller, but this transformation
is not reversible.
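The rule that keys starting with "-" stay out of the trailer dictionary can be pictured with a plain hash. This is only a sketch of the idea, not the module's code; the hash contents and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a PDF::Data object: one trailer entry plus
# two flags (the values are illustrative only).
my %pdf = (
    Root      => { Type => '/Catalog' },
    -compress => 1,
    -minify   => 1,
);

# Keys without a leading "-" would be written to the trailer dictionary;
# keys with a leading "-" are flags or metadata and would be skipped.
my @trailer_keys = sort grep { !/^-/ } keys %pdf;
my @flag_keys    = sort grep {  /^-/ } keys %pdf;

print "trailer: @trailer_keys\n";    # trailer: Root
print "flags:   @flag_keys\n";       # flags:   -compress -minify
```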
clone
my $pdf_clone = $pdf->clone;
Deep copy the entire PDF::Data object.
new_page
my $page = $pdf->new_page;
my $page = $pdf->new_page('LETTER');
my $page = $pdf->new_page(8.5, 11);
Create a new page object with the specified size (in inches).
Alternatively, certain page sizes may be specified using one of the
known keywords: "LETTER" for U.S. Letter size (8.5" x 11"), "LEGAL" for
U.S. Legal size (8.5" x 14"), or "A0" through "A8" for ISO A-series
paper sizes. The default page size is U.S. Letter size (8.5" x 11").
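Page sizes are given here in inches, while PDF's default user space is measured in points (1/72 inch). The sketch below shows that conversion for a page rectangle. It assumes the page box is stored in points as [left, bottom, right, top]; whether new_page performs the conversion this way is an assumption, and inches_to_box is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a page size in inches to a rectangle in
# PDF points (72 points per inch), with the origin at the lower left.
sub inches_to_box {
    my ($width, $height) = @_;
    return [0, 0, $width * 72, $height * 72];
}

my $letter = inches_to_box(8.5, 11);
my $legal  = inches_to_box(8.5, 14);

print "LETTER: @$letter\n";    # LETTER: 0 0 612 792
print "LEGAL:  @$legal\n";     # LEGAL:  0 0 612 1008
```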
copy_page
my $copied_page = $pdf->copy_page($page);
Deep copy a single page object.
append_page
$page = $pdf->append_page($page);
Append the specified page object to the end of the PDF page tree.
read_pdf
my $pdf = PDF::Data->read_pdf($file, %args);
Read a PDF file and parse it with $pdf->parse_pdf(), returning a new
object instance. Any streams compressed with the /FlateDecode filter
will be automatically decompressed. Unless the $pdf->{-decompress} flag
is set, the same streams will be automatically recompressed when
generating PDF file data.
parse_pdf
my $pdf = PDF::Data->parse_pdf($data, %args);
Used by $pdf->read_pdf() to parse the raw PDF file data and create a
new object instance. This method can also be called directly instead of
calling $pdf->read_pdf() if the PDF file data comes from a source other
than a regular file.
write_pdf
$pdf->write_pdf($file, $time);
Generate and write a new PDF file from the current state of the
PDF::Data object.
The optional $time parameter may be used to specify the modification
timestamp to save in the PDF metadata and to set the file modification
timestamp of the output file. If not defined, it defaults to the
current time. If $time is defined but false (zero or empty string),
this method will skip setting the modification time in the PDF
metadata, and skip setting the timestamp on the output file.
pdf_file_data
my $pdf_file_data = $pdf->pdf_file_data($time);
Generate PDF file data from the current state of the PDF data
structure, suitable for writing to an output PDF file. This method is
used by the $pdf->write_pdf() method to generate the raw string of
bytes to be written to the output PDF file. This data can be directly
used (e.g. as a MIME attachment) without the need to actually write a
PDF file to disk.
The optional $time parameter may be used to specify the modification
timestamp to save in the PDF metadata. If not specified, it defaults to
the current time. If a false value is specified, this method will skip
setting the modification time in the PDF metadata.
dump_pdf
$pdf->dump_pdf($file, $mode);
Dump the PDF internal structure and data for debugging. If the $mode
parameter is "outline", dump only the PDF internal structure without
the data.
dump_outline
$pdf->dump_outline($file);
Dump an outline of the PDF internal structure for debugging. (This
method simply calls the $pdf->dump_pdf() method with the $mode
parameter specified as "outline".)
merge_content_streams
my $stream = $pdf->merge_content_streams($array_of_streams);
Merge multiple content streams into a single content stream.
find_bbox
$pdf->find_bbox($content_stream, $new);
Analyze a content stream to determine the correct bounding box for the
content stream. The current implementation was purpose-built for a
specific use case and should not be expected to work correctly for most
content streams.
The $content_stream parameter may be a stream object or a string
containing the raw content stream data.
The current algorithm breaks the content stream into lines, skips over
various "neutral" lines and examines the coordinates specified for
certain PDF drawing operators: "m" (moveto), "l" (lineto), "v"
(curveto, initial point replicated), "y" (curveto, final point
replicated), and "c" (curveto, all points specified).
The minimum and maximum X and Y coordinates seen for these drawing
operators are used to determine the bounding box (left, bottom, right,
top) for the content stream. The bounding box and equivalent rectangle
(left, bottom, width, height) are printed.
If the $new boolean parameter is set, an updated content stream is
generated with the coordinates adjusted to move the lower left corner
of the bounding box to (0, 0). This would be better done by translating
the transformation matrix.
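The scan described above can be sketched in a few lines. This is a simplified illustration, not the module's implementation: sketch_bbox is a hypothetical helper, it treats every line that is not "numbers followed by m, l, v, y or c" as neutral, and it counts curve control points as part of the box.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a content stream line by line and track the
# extreme coordinates used by the path operators m, l, v, y and c.
sub sketch_bbox {
    my ($content) = @_;
    my ($left, $bottom, $right, $top);
    for my $line (split /\n/, $content) {
        # Only lines of the form "<numbers> <operator>" are examined.
        next unless $line =~ /^\s*((?:-?\d*\.?\d+\s+)+)([mlvyc])\s*$/;
        my @coords = split ' ', $1;
        next if @coords % 2;    # operands must come in x/y pairs
        while (my ($x, $y) = splice @coords, 0, 2) {
            $left   = $x if !defined $left   || $x < $left;
            $right  = $x if !defined $right  || $x > $right;
            $bottom = $y if !defined $bottom || $y < $bottom;
            $top    = $y if !defined $top    || $y > $top;
        }
    }
    return defined $left ? [$left, $bottom, $right, $top] : undef;
}

my $content = join "\n",
    'q', '10 20 m', '110 20 l', '110 220 l', '10 220 l', 'h', 'f', 'Q';
my ($x0, $y0, $x1, $y1) = @{ sketch_bbox($content) };

print "bbox: $x0 $y0 $x1 $y1\n";                        # bbox: 10 20 110 220
print "rect: $x0 $y0 ", $x1 - $x0, " ", $y1 - $y0, "\n"; # rect: 10 20 100 200
```

Moving the lower left corner to (0, 0) then amounts to a translation by (-$x0, -$y0), whether applied to each coordinate or to the transformation matrix.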
new_bbox
$new_content = $pdf->new_bbox($content_stream);
This method simply calls the $pdf->find_bbox() method above with $new
set to 1.
timestamp
my $timestamp = $pdf->timestamp($time);
my $now = $pdf->timestamp;
Generate a timestamp in PDF internal format. If $time is omitted, the
current time is used.
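For orientation, PDF date strings have the general shape D:YYYYMMDDHHmmSS followed by a time zone designator. The sketch below produces such a string in UTC with the core POSIX module. It is an assumption that this matches the module's output exactly (the module may use local time with a numeric offset); pdf_date is a hypothetical helper.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: format an epoch time as a PDF date string in UTC.
# Defaults to the current time when no argument is given.
sub pdf_date {
    my ($time) = @_;
    $time = time unless defined $time;
    return strftime('D:%Y%m%d%H%M%SZ', gmtime($time));
}

print pdf_date(0), "\n";    # D:19700101000000Z
print pdf_date(), "\n";     # current time, same format
```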
UTILITY METHODS
round
my @numbers = $pdf->round(@numbers);
Round numeric values to 12 significant digits to avoid floating-point
rounding error and remove trailing zeroes.
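One common way to get this effect in Perl is to format with "%.12g" and convert the result back to a number. The sketch below shows that approach; it is an assumption that the module works this way internally, and round12 is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: round each value to 12 significant digits.
# "%.12g" drops digits beyond the 12th, and adding 0 turns the string
# back into a number, which also removes any trailing zeroes.
sub round12 {
    return map { sprintf('%.12g', $_) + 0 } @_;
}

my @rounded = round12(0.1 + 0.2, 1 / 3, 2.50);
print "@rounded\n";    # 0.3 0.333333333333 2.5
```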
concat_matrix
my $matrix = $pdf->concat_matrix($transformation_matrix, $original_matrix);
Concatenate a transformation matrix with an original matrix, returning
a new matrix. This is for arrays of 6 elements representing standard
3x3 transformation matrices as used by PostScript and PDF.
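A 6-element array [a, b, c, d, e, f] stands for the 3x3 matrix with rows (a b 0), (c d 0) and (e f 1). The sketch below multiplies two such arrays, assuming the usual PDF convention that the result is the transformation matrix times the original matrix. That the module uses exactly this order is an assumption, and concat6 is a hypothetical stand-in.

```perl
use strict;
use warnings;

# Hypothetical helper: product T x M of two 6-element matrices, using
# the row-vector convention of PostScript and PDF.
sub concat6 {
    my ($t, $m) = @_;
    my ($a1, $b1, $c1, $d1, $e1, $f1) = @$t;
    my ($a2, $b2, $c2, $d2, $e2, $f2) = @$m;
    return [
        $a1 * $a2 + $b1 * $c2,          # a
        $a1 * $b2 + $b1 * $d2,          # b
        $c1 * $a2 + $d1 * $c2,          # c
        $c1 * $b2 + $d1 * $d2,          # d
        $e1 * $a2 + $f1 * $c2 + $e2,    # e
        $e1 * $b2 + $f1 * $d2 + $f2,    # f
    ];
}

# Translate by (10, 20), then scale by (2, 3).
my $combined = concat6([1, 0, 0, 1, 10, 20], [2, 0, 0, 3, 0, 0]);
print "@$combined\n";    # 2 0 0 3 20 60
```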
invert_matrix
my $inverse = $pdf->invert_matrix($matrix);
Calculate the inverse of a matrix, if possible. Returns undef if the
matrix is not invertible.
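For a 6-element matrix [a, b, c, d, e, f], the inverse exists when the determinant a*d - b*c is non-zero. The sketch below works through that formula; invert6 is a hypothetical helper written for illustration, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: invert a 6-element matrix, or return undef when
# the determinant is zero and no inverse exists.
sub invert6 {
    my ($a, $b, $c, $d, $e, $f) = @{ $_[0] };
    my $det = $a * $d - $b * $c;
    return undef if $det == 0;
    return [
         $d / $det,
        -$b / $det,
        -$c / $det,
         $a / $det,
        ($c * $f - $d * $e) / $det,
        ($b * $e - $a * $f) / $det,
    ];
}

# Scale by (2, 4), then translate by (10, 20); the inverse undoes both.
my $inverse = invert6([2, 0, 0, 4, 10, 20]);
print defined invert6([1, 2, 2, 4, 0, 0]) ? "invertible\n" : "singular\n";
```

With this input, the inverse maps (X, Y) to (0.5 * X - 5, 0.25 * Y - 5), which reverses the original mapping of (x, y) to (2 * x + 10, 4 * y + 20).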
translate
my $matrix = $pdf->translate($x, $y);
Returns a 6-element transformation matrix representing translation of
the origin to the specified coordinates.
scale
my $matrix = $pdf->scale($x, $y);
Returns a 6-element transformation matrix representing scaling of the
coordinate space by the specified horizontal and vertical scaling
factors.
rotate
my $matrix = $pdf->rotate($angle);
Returns a 6-element transformation matrix representing counterclockwise
rotation of the coordinate system by the specified angle (in degrees).
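The three matrices have the standard forms used by PostScript and PDF. The sketch below builds them as plain array references; the helper names are hypothetical, and it is an assumption that the module returns arrays laid out exactly like this.

```perl
use strict;
use warnings;

# Hypothetical helpers returning 6-element matrices [a, b, c, d, e, f].
sub translate6 { my ($x, $y) = @_; return [1, 0, 0, 1, $x, $y] }
sub scale6     { my ($x, $y) = @_; return [$x, 0, 0, $y, 0, 0] }

sub rotate6 {
    my ($degrees) = @_;
    my $radians = $degrees * 4 * atan2(1, 1) / 180;    # pi = 4 * atan2(1, 1)
    my ($cos, $sin) = (cos $radians, sin $radians);
    return [$cos, $sin, -$sin, $cos, 0, 0];            # counterclockwise
}

print join(' ', @{ translate6(10, 20) }), "\n";    # 1 0 0 1 10 20
print join(' ', @{ scale6(2, 3) }), "\n";          # 2 0 0 3 0 0
print join(' ', map { sprintf '%.3f', $_ } @{ rotate6(90) }), "\n";
```

For a 90 degree rotation the cosine terms come out as tiny non-zero floating-point values rather than exact zeroes, which is one reason to round results before writing them to a PDF.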
INTERNAL METHODS
validate
$pdf->validate;
Used by $pdf->new(), $pdf->parse_pdf() and $pdf->write_pdf() to
validate some parts of the PDF structure. Currently, $pdf->validate()
uses $pdf->validate_key() to verify that the document catalog and page
tree root node exist and have the correct type, and that the page tree
root node has no parent node. Then it calls $pdf->validate_page_tree()
to validate the entire page tree.
By default, validation errors are output as warnings, but the
$pdf->{-validate} flag can be set to make them fatal.
validate_page_tree
my $count = $pdf->validate_page_tree($path, $page_tree_node);
Used by $pdf->validate(), and called by itself recursively, to validate
the PDF page tree and its subtrees. The $path parameter specifies the
logical path from the root of the PDF::Data object to the page subtree,
and the $page_tree_node parameter specifies the actual page tree node
data structure represented by that logical path. $pdf->validate()
initially calls $pdf->validate_page_tree() with "Root/Pages" for $path
and $pdf->{Root}{Pages} for $page_tree_node.
Each child of the page tree node (in $page_tree_node->{Kids}) should be
another page tree node for a subtree or a single page node. In either
case, the parameters used for the next method call will be "$path\[$i]"
for $path (e.g. "Root/Pages[0][1]") and $page_tree_node->{Kids}[$i] for
$page_tree_node (e.g. $pdf->{Root}{Pages}{Kids}[0]{Kids}[1]). These
parameters are passed to either $pdf->validate_page_tree() recursively
(if the child is a page tree node) or to $pdf->validate_page() (if the
child is a page node).
After validating the page tree, $pdf->validate_resources() will be
called to validate the page tree's resources, if any.
If the count of pages in the page tree is incorrect, it will be fixed.
This method returns the total number of pages in the specified page
tree.
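The recursion and path construction described above can be sketched on a plain nested hash. This is an illustration only: walk_page_tree is a hypothetical helper, it assumes page tree nodes carry a Type of "/Pages", and it performs none of the actual validation.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a page tree, build the logical path of each
# child, count the pages and correct each node's Count entry.
sub walk_page_tree {
    my ($path, $node, $paths) = @_;
    my $count = 0;
    my $kids  = $node->{Kids} || [];
    for my $i (0 .. $#$kids) {
        my $kid      = $kids->[$i];
        my $kid_path = "$path\[$i]";
        if (($kid->{Type} || '') eq '/Pages') {
            $count += walk_page_tree($kid_path, $kid, $paths);    # subtree
        } else {
            push @$paths, $kid_path;                              # single page
            $count++;
        }
    }
    $node->{Count} = $count;    # fix the count if it was wrong
    return $count;
}

my $pages = {
    Type  => '/Pages',
    Count => 7,    # deliberately wrong
    Kids  => [
        { Type => '/Pages', Kids => [ { Type => '/Page' }, { Type => '/Page' } ] },
        { Type => '/Page' },
    ],
};

my @paths;
my $total = walk_page_tree('Root/Pages', $pages, \@paths);
print "$total pages: @paths\n";
# 3 pages: Root/Pages[0][0] Root/Pages[0][1] Root/Pages[1]
```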
validate_page
$pdf->validate_page($path, $page);
Used by $pdf->validate_page_tree() to validate a single page of the
PDF. The $path parameter specifies the logical path from the root of
the PDF::Data object to the page, and the $page parameter specifies the
actual page data structure represented by that logical path.
This method will call $pdf->merge_content_streams() to merge the
content streams into a single content stream (if $page->{Contents} is
an array), then it will call $pdf->validate_content_stream() to
validate the page's content stream.
After validating the page, $pdf->validate_resources() will be called to
validate the page's resources, if any.
validate_resources
$pdf->validate_resources($path, $resources);
Used by $pdf->validate_page_tree(), $pdf->validate_page() and
$pdf->validate_xobject() to validate associated resources. The $path
parameter specifies the logical path from the root of the PDF::Data
object to the resources, and the $resources parameter specifies the
actual resources data structure represented by that logical path.
This method will call $pdf->validate_xobjects() for
$resources->{XObject}, if set.
validate_xobjects
$pdf->validate_xobjects($path, $xobjects);
Used by $pdf->validate_resources() to validate form XObjects in the
resources. The $path parameter specifies the logical path from the root
of the PDF::Data object to the hash of form XObjects, and the $xobjects
parameter specifies the actual hash of form XObjects represented by
that logical path.
This method simply loops across all the form XObjects in $xobjects and
calls $pdf->validate_xobject() for each of them.
validate_xobject
$pdf->validate_xobject($path, $xobject);
Used by $pdf->validate_xobjects() to validate a form XObject. The $path
parameter specifies the logical path from the root of the PDF::Data
object to the form XObject, and the $xobject parameter specifies the
actual form XObject represented by that logical path.
This method verifies that $xobject is a stream and $xobject->{Subtype}
is "/Form", then calls $pdf->validate_content_stream() with $xobject to
validate the form XObject content stream, then calls
$pdf->validate_resources() to validate the form XObject's resources, if
any.
validate_content_stream
$pdf->validate_content_stream($path, $stream);
Used by $pdf->validate_page() and $pdf->validate_xobject() to validate
a content stream. The $path parameter specifies the logical path from
the root of the PDF::Data object to the content stream, and the $stream
parameter specifies the actual content stream represented by that
logical path.
This method calls $pdf->parse_objects() to make sure that the content
stream can be parsed. If the $pdf->{-minify} flag is set,
$pdf->minify_content_stream() will be called with the array of parsed
objects to minify the content stream.
minify_content_stream
$pdf->minify_content_stream($stream, $objects);
Used by $pdf->validate_content_stream() to minify a content stream. The
$stream parameter specifies the content stream to be modified, and the
optional $objects parameter specifies a reference to an array of parsed
objects as returned by $pdf->parse_objects().
This method calls $pdf->parse_objects() to populate the $objects
parameter if unspecified, then it calls $pdf->generate_content_stream()
to generate a minimal content stream for the array of objects, with no
comments and only the minimum amount of whitespace necessary to parse
the content stream correctly. (Obviously, this means that this
transformation is not reversible.)
Currently, this method also performs a sanity check by running the
replacement content stream through $pdf->parse_objects() and comparing
the entire list of objects returned against the original list of
objects to ensure that the replacement content stream is equivalent to
the original content stream.
generate_content_stream
my $data = $pdf->generate_content_stream($objects);
Used by $pdf->minify_content_stream() to generate a minimal content
stream to replace the original content stream. The $objects parameter
specifies a reference to an array of parsed objects as returned by
$pdf->parse_objects(). These objects will be used to generate the new
content stream.
For each object in the array, this method will call an appropriate
serialization method: $pdf->serialize_dictionary() for dictionary
objects, $pdf->serialize_array() for array objects, or
$pdf->serialize_object() for other objects. After serializing all the
objects, the newly-generated content stream data is returned.
serialize_dictionary
$pdf->serialize_dictionary($stream, $hash);
Used by $pdf->generate_content_stream(), $pdf->serialize_dictionary()
(recursively) and $pdf->serialize_array() to serialize a hash as a
dictionary object. The $stream parameter specifies a reference to a
string containing the data for the new content stream being generated,
and the $hash parameter specifies the hash reference to be serialized.
This method will serialize all the key-value pairs of $hash, prefixing
each key in the hash with "/" to serialize the key as a name object,
and calling an appropriate serialization routine for each value in the
hash: $pdf->serialize_dictionary() for dictionary objects (recursive
call), $pdf->serialize_array() for array objects, or
$pdf->serialize_object() for other objects.
serialize_array
$pdf->serialize_array($stream, $array);
Used by $pdf->generate_content_stream(), $pdf->serialize_dictionary()
and $pdf->serialize_array() (recursively) to serialize an array. The
$stream parameter specifies a reference to a string containing the data
for the new content stream being generated, and the $array parameter
specifies the array reference to be serialized.
This method will serialize all the array elements of $array, calling an
appropriate serialization routine for each element of the array:
$pdf->serialize_dictionary() for dictionary objects,
$pdf->serialize_array() for array objects (recursive call), or
$pdf->serialize_object() for other objects.
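The mutual recursion between the dictionary, array and simple-object serializers can be sketched as a single function that dispatches on the reference type. This is a simplified illustration with a hypothetical name: keys are sorted only to make the output predictable, and every simple object gets a leading space, whereas the module decides spacing more carefully.

```perl
use strict;
use warnings;

# Hypothetical serializer: $out is a reference to the output string.
sub serialize_any {
    my ($out, $object) = @_;
    if (ref $object eq 'HASH') {
        $$out .= '<<';
        for my $key (sort keys %$object) {
            $$out .= "/$key";                        # key becomes a name object
            serialize_any($out, $object->{$key});    # value: dispatch again
        }
        $$out .= '>>';
    } elsif (ref $object eq 'ARRAY') {
        $$out .= '[';
        serialize_any($out, $_) for @$object;        # each element: dispatch again
        $$out .= ']';
    } else {
        $$out .= " $object";                         # pre-serialized simple object
    }
    return;
}

my $data = '';
serialize_any(\$data, { Type => '/Page', MediaBox => [0, 0, 612, 792] });
print "$data\n";    # <</MediaBox[ 0 0 612 792]/Type /Page>>
```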
serialize_object
$pdf->serialize_object($stream, $object);
Used by $pdf->generate_content_stream(), $pdf->serialize_dictionary()
and $pdf->serialize_array() to serialize a simple object. The $stream
parameter specifies a reference to a string containing the data for the
new content stream being generated, and the $object parameter specifies
the pre-serialized object to be serialized to the specified content
stream data.
This method will strip leading and trailing whitespace from the
pre-serialized object if the $pdf->{-minify} flag is set, then append a
newline to ${$stream} if appending the pre-serialized object would
exceed 255 characters for the last line, then append a space to
${$stream} if necessary to parse the object correctly, then append the
pre-serialized object to ${$stream}.
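The append logic described above can be sketched as follows. The details are assumptions made for illustration: append_token is a hypothetical helper, it always strips whitespace, it reserves one character for a possible separating space when checking the line limit, and it treats a space as necessary only when neither neighboring character is a PDF delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper: append one pre-serialized object to the stream
# data in $$out, keeping lines within $max characters (default 255).
sub append_token {
    my ($out, $token, $max) = @_;
    $max = 255 unless defined $max;
    $token =~ s/^\s+|\s+$//g;                     # strip surrounding whitespace
    my ($last_line) = $$out =~ /([^\n]*)\z/;      # text after the final newline
    if (length($last_line) && length($last_line) + 1 + length($token) > $max) {
        $$out .= "\n";                            # start a new line instead
    } elsif ($$out =~ /[^\s\[\]<>(){}\/%]\z/ && $token =~ /^[^\[\]<>(){}\/%]/) {
        $$out .= ' ';                             # two regular characters would run together
    }
    $$out .= $token;
    return;
}

my $stream = '';
append_token(\$stream, $_) for 'BT', '/F1', ' 12 ', 'Tf';
print "$stream\n";    # BT/F1 12 Tf
```

No space is inserted before "/F1" because the slash already marks the start of a new token.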
validate_key
$pdf->validate_key($hash, $key, $value, $label);
Used by $pdf->validate() to validate specific hash key values.
get_hash_node
my $hash = $pdf->get_hash_node($path);
Used by $pdf->validate_key() to get a hash node from the PDF structure
by path.
parse_objects
my @objects = $pdf->parse_objects($objects, $data, $offset);
Used by $pdf->parse_pdf() to parse PDF objects into Perl
representations.
parse_data
my @objects = $pdf->parse_data($data);
Uses $pdf->parse_objects() to parse PDF objects from standalone PDF
data.
filter_stream
$pdf->filter_stream($stream);
Used by $pdf->parse_objects() to inflate compressed streams.
compress_stream
$new_stream = $pdf->compress_stream($stream);
Used by $pdf->write_object() to compress streams if enabled. This is
controlled by the $pdf->{-compress} flag, which is set automatically
when reading a PDF file with compressed streams, but must be set
manually for PDF files created from scratch, either in the constructor
arguments or after the fact.
resolve_references
$object = $pdf->resolve_references($objects, $object);
Used by $pdf->parse_pdf() to replace parsed indirect object references
with direct references to the objects in question.
write_indirect_objects
my $xrefs = $pdf->write_indirect_objects($pdf_file_data, $objects, $seen);
Used by $pdf->write_pdf() to write all indirect objects to a string of
new PDF file data.
enumerate_indirect_objects
$pdf->enumerate_indirect_objects($objects);
Used by $pdf->write_indirect_objects() to identify which objects in the
PDF data structure need to be indirect objects.
enumerate_shared_objects
$pdf->enumerate_shared_objects($objects, $seen, $ancestors, $object);
Used by $pdf->enumerate_indirect_objects() to find objects which are
already shared (referenced from multiple objects in the PDF data
structure).
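Finding shared objects comes down to noticing that the same reference is reachable from more than one place. The sketch below does this with the core Scalar::Util module by remembering the address of every hash and array it visits. It illustrates the idea only; find_shared is a hypothetical helper, not the module's algorithm.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Hypothetical helper: return every hash or array that is reachable from
# $root through more than one reference.
sub find_shared {
    my ($root) = @_;
    my (%seen, %shared);
    my @todo = ($root);
    while (@todo) {
        my $object = pop @todo;
        next unless ref $object eq 'HASH' || ref $object eq 'ARRAY';
        my $id = refaddr $object;
        if ($seen{$id}++) {
            $shared{$id} = $object;    # reached a second time: shared
            next;                      # do not descend again (also stops cycles)
        }
        push @todo, ref $object eq 'HASH' ? values %$object : @$object;
    }
    return values %shared;
}

# One font dictionary referenced from the resources of two pages.
my $font = { Type => '/Font', BaseFont => '/Helvetica' };
my $root = {
    Pages => {
        Kids => [
            { Resources => { Font => { F1 => $font } } },
            { Resources => { Font => { F1 => $font } } },
        ],
    },
};

my @shared = find_shared($root);
print scalar(@shared), " shared object(s)\n";    # 1 shared object(s)
```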
add_indirect_objects
$pdf->add_indirect_objects($objects, @objects);
Used by $pdf->enumerate_indirect_objects() and
$pdf->enumerate_shared_objects() to add objects to the list of indirect
objects to be written out.
write_object
$pdf->write_object($pdf_file_data, $objects, $seen, $object, $indent);
Used by $pdf->write_indirect_objects(), and called by itself
recursively, to write direct objects out to the string of new PDF file
data.
dump_object
my $output = $pdf->dump_object($object, $label, $seen, $indent, $mode);
Used by $pdf->dump_pdf(), and called by itself recursively, to dump (or
outline) the specified PDF object.