NAME
PDF::Tiny - Minimal Lightweight PDF Library
VERSION
Version 0.03
This is an alpha version. The API is still subject to change.
SYNOPSIS
use PDF::Tiny;
# Count pages
my $filename = "filename.pdf";
my $pdf = new PDF::Tiny $filename;
my $page_count = $pdf->page_count;
print "$filename has $page_count page", 's'x($page_count!=1), ".\n";
# Extract pages
my $source_pdf = new PDF::Tiny "other.pdf";
my $new_pdf = new PDF::Tiny version => $source_pdf->version;
# Get just the first three
for (0..2) {
$new_pdf->import_page($source_pdf, $_);
}
$new_pdf->print(fh => \*STDOUT); # or filename => "out.pdf"
# Change title (low-level stuff)
my $pdf = new PDF::Tiny "foo.pdf";
$pdf->trailer->[1]->{Info}
||= PDF::Tiny::make_ref($pdf->add_obj(PDF::Tiny::make_dict({})));
$pdf->vivify_obj('str', '/Info', '/Title')->[1] = "Tie Tool";
$pdf->append;
DESCRIPTION
This is a very lightweight (and limited) PDF parser. If you need to do some simple PDF processing on a web server with limited RAM and CPU, and if slurping the entire file into memory is not an option, this module may well be for you, at the cost of far less functionality than other solutions out there.
This module really does assume you know something about the PDF format. This documentation includes a brief section on "THE PDF FILE FORMAT", if you are too lazy to read the specification.
LIMITATIONS
PDF 1.5 (Acrobat 6) cross-reference and object streams are not supported. If you save your PDFs in 1.4 (Acrobat 5) format, they will work.
Encrypted PDFs are not supported.
There are no interfaces whatsoever for decompressing and manipulating content streams (although you can get to them). If you want to do that, you have to do it all yourself.
Append mode only modifies the existing file. It will not write to a new file. Duplicate the file first if this is a problem.
There is no capability (yet) in this module for handling text strings. The title example in the "SYNOPSIS" is limited to ASCII unless you read the PDF spec and encode the string yourself.
There are very few high-level functions. If you know your way around the PDF spec, you can accomplish a lot with this module. If not, your mileage may vary (but do read "THE PDF FILE FORMAT", below). (I am open to suggestions for additions, but they need to be fairly general and tiny enough to implement in a Tiny module.)
There is not much error checking. PDFs are generally assumed to be well-formed. If they are not, or if you use the low-level functions incorrectly, you may get errors that are hard to debug.
INTERFACE
This section uses many terms that are explained in "THE PDF FILE FORMAT" and the "GLOSSARY", below.
Constructor
You can create a PDF from scratch:
$pdf = new PDF::Tiny;
Or open an existing file:
$pdf = new PDF::Tiny $filename;
The constructor also accepts hash-style arguments:
$pdf = new PDF::Tiny
filename => "foo.pdf",
empty => 0,
version => "1.4",
;
If filename is absent, it is assumed that a PDF file is being created from scratch.
empty is a boolean parameter (ignored if a file name is given). When it is false (the default), the new PDF::Tiny object will contain a root and a pages object (the latter containing a Kids array and a Count of 0). If empty is true, the root and the pages array will be absent. Only use this if you are going to add the root yourself. (A PDF without a root and a page tree is not well-formed. High-level methods may not work until you have added those.)
version specifies the version of the PDF format. It is ignored when opening an existing file. It defaults to 1.4 when creating a PDF from scratch. If you are going to make a PDF file that contains pages from another PDF, you should probably set it to the version of the other PDF file.
High-Level Methods
- page_count
-
Returns the number of pages in the PDF.
- delete_page
-
$pdf->delete_page(7);
Deletes the specified page. Pages are numbered from 0. Negative numbers count from the end. -1 means the last page.
- import_page
-
$pdf->import_page($source_pdf, $num, $offset)
Imports the specified page from another PDF. $offset specifies where to put it in the new PDF. 0 means before the first page. -1 means before the last page. undef means at the very end.
- append
-
Appends the modifications to the end of an existing PDF file. (PDFs support incremental updates.) This does not work with PDFs created from scratch.
The PDF::Tiny object is not usable after you call append. This is for speed’s sake. It is not worth doing extra work to keep an object functional that may not even be used again. Just re-open the file if you need to continue doing stuff after calling append.
append can be handy if you are batch-processing huge files and making tiny changes (e.g., changing the title), but there are some gotchas. See "modified". (The gotchas do not apply if you are only using high-level methods to make changes.)
- print
-
$pdf->print(fh => $handle);
$pdf->print(filename => "foo.pdf");
Produces a PDF file from scratch. Orphaned objects (indirect objects not referenced anywhere) are not included. However, no objects are renumbered, so you may get a bloated cross-reference table. (See also import_obj.) If a filehandle is passed, it is not closed afterwards.
Since the file is not actually read into memory in full, print needs access to the original file. It cannot read and clobber it at the same time. So filename must not be the file that was originally opened, unless you deleted it before calling print.
- version
-
An lvalue accessor function returning (optionally setting) the version of the PDF file format.
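The page-numbering conventions used by delete_page and import_page can be sketched in plain Perl. Both helpers below are hypothetical names invented for illustration and are not part of PDF::Tiny. Returning undef for an out-of-range page number is this sketch's own choice; the module's behaviour in that case is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::Tiny): turn a page number as
# accepted by delete_page into an absolute 0-based index.
# Negative numbers count from the end, so -1 is the last page.
sub resolve_page_index {
    my ($num, $page_count) = @_;
    my $index = $num < 0 ? $page_count + $num : $num;
    # Assumption of this sketch: out-of-range numbers yield undef.
    return undef if $index < 0 || $index >= $page_count;
    return $index;
}

# Hypothetical helper: turn import_page's $offset into the position
# before which the new page goes. undef means "at the very end".
sub resolve_insert_offset {
    my ($offset, $page_count) = @_;
    return $page_count unless defined $offset;
    return $offset < 0 ? $page_count + $offset : $offset;
}

print resolve_page_index(-1, 10), "\n";         # last page of a 10-page PDF
print resolve_insert_offset(undef, 10), "\n";   # after the last page
```

The two helpers differ only at the end of the range: a page index must name an existing page, whereas an insertion offset may equal the page count.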
Low-Level Methods
- xref
-
Returns a reference to a hash containing the cross-reference table. If you really know what you are doing, you can modify it. (Don’t.)
# obj num    offset in the file
{ '1 0' => 324345, ... }
Newly-created PDFs have an empty, unused cross-reference table.
- modified
-
$pdf->modified;                    # get the hash
$pdf->modified("1 0");             # mark 1 0 as modified
$pdf->modified("/Info", "/Title"); # mark the indirect object containing
                                   # the title as modified
This method returns a reference to a hash of modified indirect objects. This is used to determine which objects need to be included in an incremental update. If you pass arguments, the object in question will be marked as modified.
If you are doing an incremental update and modifying objects by hand, you will need to call this for every object that is modified. The exceptions are as follows:
Objects that are imported with import_obj
Objects added with add_obj
Objects returned by vivify_obj
Any changes made by the high-level methods
All of those methods themselves mark the objects as modified.
The arguments follow the same format as described below for "get_obj", except that the first argument must be an object id ("1 0") or a trailer entry ("/Info").
Warning: Getting this right can be tricky. If a modified object is not marked as such, the changes made will not be saved by append. If the file is small enough, it might be better to use the print method and avoid this pitfall.
The hash format is as follows:
# obj num     value should always be true
{ '1 0' => 1, '2017 0' => 1, ... }
(Yes, you can edit the hash manually.)
- objects
-
Returns a reference to a hash of all parsed indirect objects. See "GUTS".
{ '1 0' => $obj, '2 0' => $obj, ... }
- trailer
-
Returns the PDF trailer dictionary as a parsed object.
- read_obj
-
$pdf->read_obj("4 0")
Reads the specified object from the file and stores it as a token sequence (see "GUTS") in the objects hash. The stored object is also returned.
If the object already exists in memory, it is simply returned.
- get_obj
-
$pdf->get_obj("4 0")
Reads an indirect object from the file (if necessary), and parses and returns it. If it is already in the objects hash, it is simply returned. If the object cannot be found, nothing is returned. See also "vivify_obj".
$pdf->get_obj("4 0", "/Pages", "/Kids", 3);
Dereferences several levels of objects. In this example, "4 0" is probably the root object (a dictionary), with a Pages entry (also a dictionary), with a Kids entry (an array), and it returns element 3 of the Kids array. If element 3 is a reference, the object it points to is returned.
The slash is not optional. It is used to distinguish between dictionary and array elements. The characters following the slash must not be escaped (whereas in PDF source they can be escaped).
$pdf->get_obj("/Root", "/Pages", "/Kids");
The first argument may be a dictionary element, in which case the lookup begins at the trailer dictionary.
$root = $pdf->get_obj("/Root");
$pdf->get_obj($root, "/Pages", "/Kids");
The first argument may also be a parsed object.
$pdf->get_obj($root);
If you pass just a single parsed object, it will be returned, unless it is actually an indirect reference, in which case the object will be looked up and returned.
- vivify_obj
-
$pdf->vivify_obj($type, ...)
The first argument must be one of the types listed in "GUTS". The remaining arguments are those accepted by get_obj. An object of the specified type will be vivified, along with all the intervening objects (whose type, array or dictionary, is determined by whether the element begins with a slash).
Any object returned by vivify_obj (whether vivified or not) will be marked as modified, under the assumption that you are going to modify it.
- get_page
-
$pdf->get_page($num)
Returns the parsed object associated with the page in question. Pages are numbered from 0. Negative numbers count from the end. -1 means the last page.
- import_obj
-
$pdf->import_obj($other_pdf, $object)
Imports an object, and all other objects it references, from another PDF file. (This entails making sure that each imported object gets renumbered to a number that the destination $pdf is not already using.) This method keeps track of which indirect objects have been imported already, so it can be called multiple times without duplicating the same objects. (It also means that subsequent changes to objects in the source PDF will go unnoticed.)
$object may be a string or a parsed object.
If $object is a string, it must be an object id ("1 0"). The return value will also be an object id.
If it is a parsed object, the object itself will not be added to the $pdf’s list of indirect objects, because it is assumed that it will be inserted directly inside another object. (Or you can pass the return value to add_obj.) All objects it references though (by numeric id) will be imported. The object itself will be cloned and the new value returned (with all references to other objects updated).
Warning: If you try to import pages from another PDF document with this, watch out for the ‘/Parent’ link from each page to its parent page array. You’ll end up pulling in the parent page array, too, bloating your PDF with page data that will not be displayed.
This can also be used to renumber all the objects in a PDF (excluding orphans), to avoid bloated cross-reference tables (but this does entail reading the entire file into memory):
my $new_pdf = new PDF::Tiny version => $old_pdf->version, empty => 1;
my $new_trailer = $new_pdf->trailer->[1];
my $old_trailer = $old_pdf->trailer->[1];
$new_trailer->{Root}
  = $new_pdf->import_obj($old_pdf, $old_trailer->{Root});
$new_trailer->{Info}
  = $new_pdf->import_obj($old_pdf, $old_trailer->{Info})
    if $old_trailer->{Info};
- add_obj
-
$pdf->add_obj($parsed_obj);
Adds a new indirect object to the PDF and returns the id that got used.
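The path syntax accepted by get_obj (a leading slash for a dictionary entry, a bare integer for an array element, references followed at every step) can be illustrated with a small stand-alone walker. This is a sketch under stated assumptions, not the module's implementation: walk_path, deref and the %objects table are hypothetical, the table is already fully parsed, and streams and the 'flat' and 'tokens' forms are ignored.

```perl
use strict;
use warnings;

# A toy object table in the parsed-object format described in "GUTS".
my %objects = (
    '4 0' => [ 'dict', { Type  => [ 'name', 'Catalog' ],
                         Pages => [ 'ref',  '3 0' ] } ],
    '3 0' => [ 'dict', { Kids  => [ 'array', [ [ 'ref', '5 0' ],
                                               [ 'ref', '6 0' ] ] ],
                         Count => [ 'num', 2 ] } ],
    '5 0' => [ 'dict', { Type => [ 'name', 'Page' ] } ],
    '6 0' => [ 'dict', { Type => [ 'name', 'Page' ] } ],
);

# Hypothetical helper: follow indirect references until a
# non-reference object (or nothing) is reached.
sub deref {
    my ($objects, $obj) = @_;
    $obj = $objects->{ $obj->[1] }
        while $obj && $obj->[0] eq 'ref';
    return $obj;
}

# Hypothetical helper: follow a get_obj-style path. "/Name" selects
# a dictionary entry; a bare integer selects an array element.
sub walk_path {
    my ($objects, $start, @path) = @_;
    my $obj = ref $start ? $start : $objects->{$start};
    for my $step (@path) {
        $obj = deref($objects, $obj) or return;
        if ($step =~ m{^/(.*)}s) {
            return unless $obj->[0] eq 'dict';
            $obj = $obj->[1]{$1};
        }
        else {
            return unless $obj->[0] eq 'array';
            $obj = $obj->[1][$step];
        }
    }
    return deref($objects, $obj);
}

my $page = walk_path(\%objects, '4 0', '/Pages', '/Kids', 1);
print $page->[1]{Type}[1], "\n";    # the Type name of the second kid
```

Because the slash is what tells a dictionary key from an array index, a path element such as "3" and one such as "/3" select different things.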
Functions
None of these functions is exported. Call each one with a PDF::Tiny:: prefix.
- tokenize
-
(Spelt tokenise on Thursdays.)
PDF::Tiny::tokenize($string)
PDF::Tiny::tokenize($string, $delimiter_re, \&more)
A low-level function used to break a piece of PDF source into a sequence of tokens. Returns a list of strings. Whitespace is stripped, so if you want to join it back into a string, use the join_tokens function.
The $string passed as an argument is consumed. It must not be read-only.
The second argument is a regular expression matching the token to stop on (e.g., qr/^endobj\z/). The third argument is a function that is expected to read more into $string if the ending delimiter is not found. These two arguments are used internally when parsing PDF files.
- join_tokens
-
PDF::Tiny::join_tokens(@tokens)
Joins a list of tokens into a string, supplying necessary whitespace.
- parse_string
-
PDF::Tiny::parse_string($string)
PDF::Tiny::parse_string($string, $delimiter_re)
Turns a string of tokens into a parsed object. If $delimiter_re is supplied, any token that matches it will be the last token processed.
- parse_tokens
-
PDF::Tiny::parse_tokens(@tokens)
PDF::Tiny::parse_tokens(@tokens, $delimiter_re)
Turns a list of tokens into a parsed object. If $delimiter_re is supplied, any token that matches it will be the last token processed.
- serialize
-
(Also serialise.)
PDF::Tiny::serialize($parsed_obj)
Serializes a parsed object. If the object is a stream, the stream content is not serialized (to avoid copying a potentially large stream). The serialized output contains everything up to and including the word ‘stream’ and the line feed that follows.
- make_bool
-
PDF::Tiny::make_bool(1) # or 0
Returns a parsed boolean object.
- make_num
-
PDF::Tiny::make_num(1)
PDF::Tiny::make_num(1.1)
Returns a parsed number object.
- make_str
-
PDF::Tiny::make_str("yuhu")
Returns a parsed string object.
- make_name
-
PDF::Tiny::make_name("Catalog")
Returns a parsed name object. The name must be given without the initial slash.
- make_array
-
PDF::Tiny::make_array([...])
Returns a parsed array object that references the very same array passed to it.
- make_dict
-
PDF::Tiny::make_dict({...})
Returns a parsed dictionary (hash) object that references the very same hash passed to it.
- make_stream
-
PDF::Tiny::make_stream($dict, $content)
Returns a parsed stream object. $dict must be a parsed dictionary object. $content contains the content of the stream.
- make_null
-
PDF::Tiny::make_null()
Returns a parsed null object (the same one every time).
- make_ref
-
PDF::Tiny::make_ref("1 0")
Returns a parsed indirect object reference.
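To show what "supplying necessary whitespace" can mean when joining tokens, here is a minimal stand-alone sketch. It is not the module's join_tokens; join_minimal is a hypothetical name, and it uses a plain space as the separator, which is an assumption of this sketch. A separator is inserted only where neither of the two touching characters is a PDF delimiter.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the module's join_tokens(): insert a space
# only where two neighbouring tokens would otherwise run together,
# i.e. where neither touching character is a PDF delimiter.
sub join_minimal {
    my @tokens = @_;
    my $delim = qr/[()<>\[\]{}\/%]/;
    my $out = '';
    for my $token (@tokens) {
        $out .= ' '
            if length($out)
            && substr($out, -1) !~ $delim
            && substr($token, 0, 1) !~ $delim;
        $out .= $token;
    }
    return $out;
}

print join_minimal(qw( << /Type /Catalog /Pages 2 0 R >> )), "\n";
# prints: <</Type/Catalog/Pages 2 0 R>>
```

Names need no separator before them because the slash is itself a delimiter; only the run "/Pages 2 0 R" needs spaces.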
THE PDF FILE FORMAT
The body of a PDF file consists of a collection of what are called objects, in no particular order. Each has a numeric id that consists of two numbers separated by space. Here is an example of what they look like:
23 0 obj
<< /Type /Catalog /Pages 3 0 R >>
endobj
This is object number 23 0. The << and >> indicate a dictionary object (like a hash). The value of the ‘Type’ entry is the name ‘Catalog’ (a slash before it indicates a name, or identifier). The value of the ‘Pages’ entry is a reference to object number 3 0.
Everything is an object.
In PDF parlance, even something as simple as a number is called an object. An object embedded directly inside another one (such as the number in << /Length 1 >>) is called a direct object. An object with a numeric ID (such as 23 0) is called an indirect object. An object ID followed by the keyword R is called an indirect reference.
Since the indirect objects can be in any order, there follows a cross-reference table giving the exact location in the file of each indirect object.
After the cross-reference table is the trailer, an example of which is the following:
trailer
<< /Size 34 /Root 23 0 R /Info 1 0 R
/ID [ <e2ca9df8c15ea42d17d5d724f61808b1>
<e2ca9df8c15ea42d17d5d724f61808b1> ] >>
startxref
7749
%%EOF
Try opening an existing PDF file in a text editor. To see the structure of a PDF file, start with the trailer dictionary. Metadata are stored in the ‘Info’ entry, which here refers to object 1 0. For the actual document structure, you want to look at the ‘Root’ entry, which points to the document root or catalogue. That is object 23 0, shown above. If you find object 3 0 you will see that it is a dictionary containing a ‘Kids’ array of references to page objects, etc.
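In the PDF format, the number after the startxref keyword is the byte offset of the most recent cross-reference section, so a reader can start from the end of the file and work backwards. The following is a minimal sketch of that first step, operating on a string that holds the tail of a file. find_startxref is a hypothetical helper written for illustration, not PDF::Tiny's own code.

```perl
use strict;
use warnings;

# The tail of a PDF file, modelled on the trailer example above.
my $tail = join "\n",
    'trailer',
    '<< /Size 34 /Root 23 0 R /Info 1 0 R >>',
    'startxref',
    '7749',
    '%%EOF',
    '';

# Hypothetical helper: pull out the byte offset that follows the
# last "startxref" keyword before the final %%EOF marker.
sub find_startxref {
    my ($text) = @_;
    my ($offset) = $text =~ /.*startxref\s+(\d+)\s+%%EOF\s*\z/s;
    return $offset;    # undef if the text does not end like a PDF
}

print find_startxref($tail), "\n";    # prints 7749
```

The greedy leading `.*` makes the pattern pick the last startxref in the text, which matters for files that have been incrementally updated and therefore contain several.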
Now, the actual content of pages is in a different language from the PDF structure, which is described here. It goes inside a stream object, referenced by the pages, which is usually compressed with Deflate encoding. (This module does not handle streams per se, but its low-level functions will allow you to get to them.) Even though the language for drawing pages is not the same as that used for PDF structure, it follows the same tokenization rules, so you can use this module’s tokenize function if you are writing your own stream-processing code.
GUTS
Guts of the PDF::Tiny objects.
Nothing is an object.
By that I mean that PDF::Tiny does not use Perl objects to represent PDF objects. Rather, it uses array refs. These are referred to throughout this documentation as parsed objects.
A parsed object looks like this:
[ $type, $value ]
[ 'num', 3 ] # a number
[ 'dict', {...} ] # a PDF dictionary (i.e., hash)
[ 'str', 'foo' ] # The string 'foo'
[ 'array', [...] ] # An array
[ 'bool', 1 ] # A boolean
[ 'name', "Root" ] # The name (identifier) /Root
[ 'ref', "2 0" ] # A reference to an indirect object
[ 'null' ] # This special value has one element
[ 'stream', $dict, $content ] # This exception has a parsed dictionary
# object for element 1 and the stream
# content for element 2
The values of dictionaries and arrays are also parsed objects.
The value of an indirect reference (not really an object) consists of two integers without leading zeroes (except 0 itself) separated by a space. Even though "000\n001 R" is a valid reference in PDF syntax, this module always parses it as "0 1", which is important since it is used as a hash key.
The various make_* functions can be used to create these.
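The normalisation of object IDs described above (no leading zeroes, exactly one space) can be sketched as follows. normalize_obj_id is a hypothetical helper for illustration only; the module does this internally when parsing, and its own code may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDF::Tiny: reduce the two integers
# of an object ID to the canonical "N G" form used for hash keys.
sub normalize_obj_id {
    my ($raw) = @_;
    my ($num, $gen) = $raw =~ /\A\s*(\d+)\s+(\d+)\s*\z/
        or return;
    # Adding 0 drops leading zeroes; join supplies the single space.
    return join ' ', $num + 0, $gen + 0;
}

print normalize_obj_id("000\n001"), "\n";    # prints: 0 1
```

If you build object IDs by hand from text taken out of a PDF file, passing them through something like this first avoids hash lookups that silently miss.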
There are also two special cases, which are handled transparently like the other objects (and converted into them if necessary by get_obj):
[ 'flat', '.......' ] # A flattened (serialized) object
[ 'tokens', [$token1, $token2, ...]] # A sequence of tokens
So the following three are equivalent:
[ 'dict', { Type => [ 'name', 'Catalog' ], Pages => ['ref', '2 0'] } ]
[ 'flat', '<</Type/Catalog/Pages 2 0 R>>' ]
[ 'tokens', [qw '<< /Type /Catalog /Pages 2 0 R >>'] ]
(This does not apply to the trailer. Do not flatten the trailer.)
Also, streams cannot be stored in flat or tokenized format, but their dictionaries can:
[ 'stream', ['dict', { Length => ['num', 9] }], 'scream!!!' ]
[ 'stream', ['tokens', ['<<','/Length','9','>>']], 'scream!!!' ]
[ 'stream', ['flat', '<</Length 9>>'], 'scream!!!' ]
# Invalid:
[ 'flat', "<</Length 9>>stream\nscream!!!" ]
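To make the relationship between the 'dict' form and the 'flat' form concrete, here is a minimal sketch that turns a parsed object into PDF source text. It is not the module's serialize function: to_pdf_source is a hypothetical name, streams and the 'flat' and 'tokens' forms are left out, names containing special characters are not escaped, and dictionary keys are sorted only so that the output is predictable.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the module's serialize(): convert a parsed
# object (see the table above) into PDF source text.
sub to_pdf_source {
    my ($obj) = @_;
    my ($type, $value) = @$obj;
    return 'null'                       if $type eq 'null';
    return ($value ? 'true' : 'false')  if $type eq 'bool';
    return "$value"                     if $type eq 'num';
    return "/$value"                    if $type eq 'name';
    return "$value R"                   if $type eq 'ref';
    if ($type eq 'str') {
        # Escape backslashes and parentheses in a literal string.
        (my $escaped = $value) =~ s/([\\()])/\\$1/g;
        return "($escaped)";
    }
    if ($type eq 'array') {
        return '[ ' . join(' ', map { to_pdf_source($_) } @$value) . ' ]';
    }
    if ($type eq 'dict') {
        return '<< '
            . join(' ', map { "/$_ " . to_pdf_source($value->{$_}) }
                        sort keys %$value)
            . ' >>';
    }
    die "unsupported type: $type";
}

my $catalog = [ 'dict', { Type  => [ 'name', 'Catalog' ],
                          Pages => [ 'ref', '2 0' ] } ];
print to_pdf_source($catalog), "\n";
# prints: << /Pages 2 0 R /Type /Catalog >>
```

The output uses more whitespace than the 'flat' example above, but both spellings describe the same dictionary.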
GLOSSARY
TODO
This section is incomplete and may be finished in version 0.02. Then again, it may not.
- direct object
-
A PDF object embedded directly inside another PDF object.
- indirect object
-
A PDF object with an ID, that gets referenced by its ID.
- indirect reference
-
A parsed object containing an object ID, representing a PDF sequence such as "1 0 R".
- object ID
-
Two numbers separated by a space; e.g. "1 0". While PDF syntax allows initial zeroes and any whitespace, PDF::Tiny does not internally. All functions expecting an object ID require exactly one space and no leading zeroes (except for 0 itself).
- parsed object
-
This term is used throughout this documentation to refer to the array-ref form that all PDF objects take when parsed by this module. It is used even if the object was created from scratch, and not the result of parsing.
- ram hog
-
A mythical creature with the horns of a sheep and the snout of a swine. This term is also used to refer to memorivorous software. This module aims not to be such. Of course if you access every object in a huge PDF, you can defeat that aim.
OTHER PDF MODULES
Why yet another PDF module, considering how many there are? Most of the other solutions were insufficiently lightweight for my needs. (In particular, I needed to write a web service that would serve a single page at a time from a collection of PDFs, some of which are 200MB. I needed it to be very responsive.)
PDF::API2 (probably the best PDF module), CAM::PDF, and PDF::Extract all read the entire file into memory.
Text::PDF is quite fast compared with the others, but I could not figure out how to use it, except to get a page count.
PDF::Reuse is fast, but it has trouble with some PDF files, and it provides no page count feature (something I needed).
That said, there is no reason why you could not use PDF::Tiny in conjunction with other modules. CAM::PDF, for example, can generate PDFs from scratch fairly efficiently, but is slow at extracting pages from large PDFs. You could use it to generate a PDF, and then use PDF::Tiny to add pages afterwards from some other PDF. (Or extract pages with PDF::Tiny to a small PDF, and then import them with CAM::PDF.)
Compatibility
Unfortunately, many of the modules mentioned above do not fully understand PDF syntax, or interpret the spec too strictly, such that they are unable to read certain PDFs. I have a large scanned book in the form of a PDF produced by ABBYY FineReader. I tried rewriting it with PDF::Tiny->new($old)->print(filename => $new)
, and then I tested both PDFs with the above modules. The results:
                  |      PDF producer
PDF reader        | FineReader | PDF::Tiny
------------------+------------+----------
CAM::PDF 1.60     | no         | yes
PDF::API2 2.031   | yes        | no
PDF::Tiny 0.02    | yes        | yes
Text::PDF 0.31    | no         | no
PDF::Reuse 0.39   | yes*       | no
PDF::Extract 3.04 | yes        | no
Adobe Acrobat     | yes        | yes
Apple Preview     | yes        | yes
* It has trouble with the cross-reference table, such that it may or
may not be able to extract the information you want. It happened
to work for my purposes, but was slow and produced a bloated file.
(The bug is fixed in the git repository and may be gone by the
time you read this.)
Part of the reason for the large number of noes is that PDF::Tiny tries to get the files as compact as possible as fast as is possible with a reasonably small amount of code. To avoid reaching the PDF line length limit (which means entering a more complex and slower code path), it emits line breaks between tokens wherever whitespace is mandatory. It is probably the only PDF producer that does that.
I have filed bug reports against all the modules that have a no in either column. I hope I do not have to slow down PDF::Tiny to work with these other modules.
(If this proves to be a problem for anyone, let me know, and I can change the way it outputs whitespace.)
Benchmarks
Okay, so I took the PDF mentioned above (169.5 MB in size, containing 253 scanned pages) and benchmarked (1) fetching a page count and (2) extracting a single page, which are the two tasks for which I am using PDF::Tiny. The benchmark code (which uses Dumbbench) can be found in the benchmark file in the distribution. The results (on a 2.8 GHz Intel Core 2 Duo):
Module            | Page count | Page extraction | Resulting file size
------------------+------------+-----------------+--------------------
PDF::Tiny 0.02    | 0.018914 s | 0.10225 s       | 952 KB
CAM::PDF 1.60     | 0.6923 s   | 1.1849 s        | 995 KB
PDF::API2 2.031   | 3.585 s    | 4.613 s         | 953 KB
PDF::Reuse 0.39   | N/A        | 15.36 s         | 169.5 MB
PDF::Extract 3.04 | N/A        | 158.54 s        | 954 KB
Some explanation of the numbers: CAM::PDF does not renumber objects, so it ends up with a bloated 40K cross-reference table. PDF::Reuse drags in the entire contents of the source PDF but only includes one of its pages in the page tree.
PREREQUISITES
perl 5.10 or higher
BUGS
Probably lots. Most of the limitations could be considered bugs. Most of the limitations could also be considered features, because they make the module Tiny.
Okay, there is at least one real bug: Currently PDF strings are limited in length, because the tokenizer only reads more data from the file in between tokens.
This module is badly in need of tests. (Or it needs to be tested badly.) No doubt more tests will uncover bugs.
Please send bug reports to bug-PDF-Tiny@rt.cpan.org.
AUTHOR & COPYRIGHT
Copyright (C) 2017 Father Chrysostomos <sprout [at] cpan [dot] org>
This program is free software; you may redistribute it, modify it or both under the same terms as perl. The full text of the license can be found in the LICENSE file included with this module.