NAME
Renard::Curie::Data::PDF - Retrieve PDF image and text data via MuPDF's mutool
VERSION
version 0.001
FUNCTIONS
_call_mutool
_call_mutool( @args )
Helper function which calls mutool with the contents of the @args array.
Returns the captured STDOUT of the call.
This function dies if mutool exits unsuccessfully.
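The behaviour described above (run a command, capture STDOUT, die on failure) can be sketched with core Perl only. This is a minimal illustration, not the module's actual implementation: the name run_and_capture is hypothetical, and the module itself may use a different capture mechanism.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this module): run a command given as a
# list, return its captured STDOUT, and die if the command fails.
sub run_and_capture {
    my (@cmd) = @_;

    # List-form pipe open bypasses the shell, so arguments need no quoting.
    open( my $fh, '-|', @cmd )
        or die "Could not start '$cmd[0]': $!";

    # Slurp everything the child writes to STDOUT.
    my $stdout = do { local $/; <$fh> };
    $stdout = '' unless defined $stdout;

    # For a piped open, close() returns false when the child exits with a
    # non-zero status; the raw wait status is then in $?.
    close($fh)
        or die "Command '@cmd' failed (wait status $?)";

    return $stdout;
}
```

For a mutool call, the command list would start with 'mutool' followed by the contents of @args.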
get_mutool_pdf_page_as_png
get_mutool_pdf_page_as_png($pdf_filename, $pdf_page_no)
This function returns a PNG stream that renders page number $pdf_page_no of the PDF file $pdf_filename.
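A caller may want to sanity-check the returned data before handing it to an image loader. The sketch below assumes the return value is the raw bytes of a PNG file; the helper name looks_like_png is hypothetical and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $bytes starts with the fixed 8-byte
# signature that begins every PNG file.
sub looks_like_png {
    my ($bytes) = @_;
    my $png_signature = "\x89PNG\r\n\x1a\n";
    return defined $bytes
        && length($bytes) >= 8
        && substr( $bytes, 0, 8 ) eq $png_signature;
}
```

For example, looks_like_png( get_mutool_pdf_page_as_png($pdf_filename, $pdf_page_no) ) should be true for a successfully rendered page.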
get_mutool_text_stext_raw
get_mutool_text_stext_raw($pdf_filename, $pdf_page_no)
This function returns an XML string that contains structured text from page number $pdf_page_no of the PDF file $pdf_filename.
The XML format is defined by the output of mutool and looks like this (for page 23 of the pdf_reference_1-7.pdf file):
  <document name="test-data/test-data/PDF/Adobe/pdf_reference_1-7.pdf">
    <page width="531" height="666">
      <block bbox="261.18 616.16394 269.77765 625.2532">
        <line bbox="261.18 616.16394 269.77765 625.2532">
          <span bbox="261.18 616.16394 269.77765 625.2532" font="MyriadPro-Semibold" size="7.98">
            <char bbox="261.18 616.16394 265.50037 625.2532" x="261.18" y="623.2582" c="2"/>
            <char bbox="265.50037 616.16394 269.77765 625.2532" x="265.50037" y="623.2582" c="3"/>
          </span>
        </line>
      </block>
      <block bbox="225.78 88.20229 305.18158 117.93829">
        <line bbox="225.78 88.20229 305.18158 117.93829">
          <span bbox="225.78 88.20229 305.18158 117.93829" font="MyriadPro-Bold" size="24">
            <char bbox="225.78 88.20229 239.5176 117.93829" x="225.78" y="111.93829" c="P"/>
            <char bbox="239.5176 88.20229 248.4552 117.93829" x="239.5176" y="111.93829" c="r"/>
            <char bbox="248.4552 88.20229 261.1128 117.93829" x="248.4552" y="111.93829" c="e"/>
            <char bbox="261.1128 88.20229 269.28238 117.93829" x="261.1128" y="111.93829" c="f"/>
            <char bbox="269.28238 88.20229 281.93997 117.93829" x="269.28238" y="111.93829" c="a"/>
            <char bbox="281.93997 88.20229 292.50958 117.93829" x="281.93997" y="111.93829" c="c"/>
            <char bbox="292.50958 88.20229 305.18158 117.93829" x="292.50958" y="111.93829" c="e"/>
          </span>
        </line>
      </block>
    </page>
  </document>
Simplified, the high-level structure looks like:
  <page> -> [list of blocks]
    <block> -> [list of blocks]
      a block is either:
        - stext
            <line> -> [list of lines] (all have same baseline)
              <span> -> [list of spans] (horizontal spaces over a line)
                <char> -> [list of chars]
        - image
            TODO
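To illustrate the span and char levels of the structure above, the following sketch reassembles the text of each span from the c attributes of its char elements. It assumes the raw XML has the shape of the sample shown earlier, uses regular expressions rather than a real XML parser, and does not decode XML entities; the name span_texts is hypothetical and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: return one string per <span>, built by joining the
# "c" attribute of every <char> inside that span, in document order.
sub span_texts {
    my ($xml) = @_;
    my @texts;

    # Visit each <span>...</span> element in turn.
    while ( $xml =~ m{<span\b[^>]*>(.*?)</span>}sg ) {
        my $span_body = $1;

        # Collect the character carried by each <char .../> element.
        my @chars = $span_body =~ m{<char\b[^>]*\bc="([^"]*)"}g;
        push @texts, join( '', @chars );
    }

    return @texts;
}
```

Applied to the sample output above, this would yield the strings "23" and "Preface", one per span.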
get_mutool_text_stext_xml
get_mutool_text_stext_xml($pdf_filename, $pdf_page_no)
Returns a HashRef of the structured text from page number $pdf_page_no of the PDF file $pdf_filename.
See the function get_mutool_text_stext_raw for details on the structure of this data.
get_mutool_page_info_raw
get_mutool_page_info_raw($pdf_filename)
Returns an XML string of the page bounding boxes of PDF file $pdf_filename.
The data is in the form:
  <document>
    <page pagenum="1">
      <MediaBox l="0" b="0" r="531" t="666" />
      <CropBox l="0" b="0" r="531" t="666" />
      <Rotate v="0" />
    </page>
    <page pagenum="2">
      ...
    </page>
  </document>
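As an illustration of how the bounding-box attributes relate to page dimensions, the sketch below computes the width (r - l) and height (t - b) of each page's MediaBox. It assumes the raw XML has exactly the attribute order shown above, uses regular expressions rather than a real XML parser, and ignores Rotate; the name media_box_sizes is hypothetical and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: return an ArrayRef of HashRefs, one per page, with
# the page number and the MediaBox width and height.
sub media_box_sizes {
    my ($xml) = @_;
    my @sizes;

    # Visit each <page pagenum="N">...</page> element in turn.
    while ( $xml =~ m{<page\s+pagenum="(\d+)">(.*?)</page>}sg ) {
        my ( $pagenum, $body ) = ( $1, $2 );

        # Skip pages without a MediaBox in the expected form.
        next
            unless $body =~
            m{<MediaBox\s+l="([^"]+)"\s+b="([^"]+)"\s+r="([^"]+)"\s+t="([^"]+)"};

        push @sizes,
            {
            pagenum => $pagenum,
            width   => $3 - $1,    # right minus left
            height  => $4 - $2,    # top minus bottom
            };
    }

    return \@sizes;
}
```

For the first page in the sample above, this gives a width of 531 and a height of 666, matching the width and height attributes of the page element in the structured text output.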
get_mutool_page_info_xml
get_mutool_page_info_xml($pdf_filename)
Returns a HashRef containing the page bounding boxes of PDF file $pdf_filename.
See function get_mutool_page_info_raw for information on the structure of the data.
get_mutool_outline_simple
get_mutool_outline_simple($pdf_filename)
Returns the outline of the PDF file $pdf_filename as an ArrayRef[HashRef] which corresponds to the items attribute of Renard::Curie::Model::Outline.
AUTHOR
Project Renard
COPYRIGHT AND LICENSE
This software is copyright (c) 2016 by Project Renard.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.