NAME
EBook::Tools::Unpack - An object class for unpacking E-book files into their component parts and metadata
SYNOPSIS
use EBook::Tools::Unpack;
my $unpacker = EBook::Tools::Unpack->new(
    'file'     => $filename,
    'dir'      => $dir,
    'encoding' => $encoding,
    'format'   => $format,
    'raw'      => $raw,
    'author'   => $author,
    'title'    => $title,
    'opffile'  => $opffile,
    'tidy'     => $tidy,
    'nosave'   => $nosave,
);
$unpacker->unpack;
or, more simply:
use EBook::Tools::Unpack;
my $unpacker = EBook::Tools::Unpack->new('file' => 'mybook.prc');
$unpacker->unpack;
DEPENDENCIES
Perl Modules
HTML::Tree
Image::Size
List::MoreUtils
P5-Palm
Palm::Doc
CONSTRUCTOR
new(%args)
Instantiates a new EBook::Tools::Unpack object.
Arguments
file
The file to unpack. Specifying this is mandatory.
dir
The directory to unpack into. If not specified, defaults to the basename of the file.
encoding
If specified, overrides the encoding to use when unpacking. This is normally detected from the file and does not need to be specified.
Valid values are '1252' (specifying Windows-1252) and '65001' (specifying UTF-8).
key
The decryption key to use if necessary (not yet implemented)
keyfile
The file holding the decryption keys to use if necessary (not yet implemented)
language
If specified, overrides the detected language information.
opffile
The name of the file in which the metadata will be stored. If not specified, defaults to the value of dir with .opf appended.
raw
If set to true, this forces no corrections to be done on any extracted text and a lot of raw, unparsed, unmodified data to be dumped into the directory along with everything else. It's useful for debugging exactly what was in the file being unpacked, and (when combined with nosave) reducing the time needed to extract parsed data from an ebook container without actually unpacking it.
author
Overrides the detected author name.
title
Overrides the detected title.
tidy
If set to true, the unpacker will run tidy on any HTML output files to convert them to valid XHTML. Be warned that this can occasionally change the formatting, as Tidy isn't very forgiving on certain common tricks (such as empty <pre> elements with style elements) that abuse the standard.
nosave
If set to true, the unpacker will run through all of the unpacking steps except those that actually write to the disk. This is useful for testing, but also (particularly when combined with raw) can be used for extracting parsed data from an ebook container without actually unpacking it.
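As an illustration of the encoding argument described above, the following is a minimal sketch of how the two numeric codes could be mapped to names understood by the core Encode module. The mapping table and the helper name decode_book_text are assumptions made for this example; they are not taken from the module's source.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed mapping from the numeric codes to Encode encoding names.
my %encoding_name = (
    '1252'  => 'cp1252',
    '65001' => 'UTF-8',
);

# Hypothetical helper: decode raw book bytes according to a numeric code.
sub decode_book_text {
    my ($bytes, $code) = @_;
    die "No encoding code supplied\n" unless defined $code;
    my $name = $encoding_name{$code}
        or die "Unsupported encoding code '$code'\n";
    return decode($name, $bytes);
}

# The same four-character word stored in each encoding.
print length(decode_book_text("caf\xE9",     '1252')),  "\n";
print length(decode_book_text("caf\xC3\xA9", '65001')), "\n";
```

Any other code is rejected rather than guessed at, which mirrors the documented restriction to these two values.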
ACCESSOR METHODS
See "new()" for more details on what some of these mean. Note that some values cannot be autodetected until an unpack method executes.
author
dir
file
filebase
In scalar context, this is the basename of file. In list context, it actually returns the basename, directory, and extension as per fileparse from File::Basename.
format
key
keyfile
language
This returns the language specified by the user, if any. It remains undefined if the user has not requested that a language code be set even if a language was autodetected.
opffile
raw
title
This returns the title specified by the user, if any. It remains undefined if the user has not requested a title be set even if a title was autodetected.
detected
This returns a hash containing the autodetected metadata, if any.
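The context-sensitive behaviour described for filebase can be sketched with wantarray and fileparse. This is only an approximation under stated assumptions: filebase_sketch is a hypothetical free-standing function (the real accessor is an object method), the suffix pattern is a guess, and the last two lines show how the documented defaults for dir and opffile could be derived from the result.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical stand-in for the filebase accessor.
sub filebase_sketch {
    my ($file) = @_;
    # Treat everything from the last dot onward as the extension (assumed).
    my ($base, $dir, $ext) = fileparse($file, qr/\.[^.]*/);
    return wantarray ? ($base, $dir, $ext) : $base;
}

my $file = 'books/mybook.prc';

my $base = filebase_sketch($file);                    # scalar context
my ($name, $path, $suffix) = filebase_sketch($file);  # list context

# Documented defaults: dir is the basename, opffile is dir plus ".opf".
my $default_dir = $base;
my $default_opf = $default_dir . '.opf';

print "$base | $name $path $suffix | $default_dir $default_opf\n";
```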
MODIFIER METHODS
detect_format()
Attempts to automatically detect the format of the input file. Croaks if it can't. This both sets the object internal values and returns a two-scalar list, where the first scalar is the detected format and the second is a string that may contain additional detected information (such as a title or version).
This is automatically called by "new()" if the format argument is not specified.
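One way such detection could work for Palm database files is to inspect the type and creator fields of the PDB header. The sketch below assumes the standard PDB layout (32-byte name, type and creator at offsets 60 and 64) and the identifiers BOOKMOBI and TEXtREAd; the function name and the format labels 'mobi' and 'palmdoc' are hypothetical and may not match what the module actually uses.

```perl
use strict;
use warnings;
use Carp qw(croak);

# PDB type+creator strings mapped to assumed format labels.
my %pdb_ident = (
    'BOOKMOBI' => 'mobi',
    'TEXtREAd' => 'palmdoc',
);

# Hypothetical detector working on the first 68 bytes of a file.
sub detect_format_sketch {
    my ($header) = @_;
    croak "Need at least 68 bytes of PDB header"
        unless defined $header && length($header) >= 68;

    # Z32: NUL-terminated database name; x28: skip to offset 60;
    # a8: four bytes of type followed by four bytes of creator.
    my ($name, $ident) = unpack 'Z32 x28 a8', $header;

    my $format = $pdb_ident{$ident}
        or croak "Unable to detect format from identifier '$ident'";

    # Two-scalar list: the format, then additional detected information.
    return ($format, $name);
}

my $fake_header = pack 'a32 x28 a8', 'My_Book', 'BOOKMOBI';
my ($format, $info) = detect_format_sketch($fake_header);
print "$format: $info\n";
```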
detect_from_mobi_headers()
Detects metadata values from the MOBI headers retrieved via "unpack_mobi_header()" and "unpack_mobi_exth()" and places them into the detected attribute.
gen_opf(%args)
This generates an OPF file from detected and specified metadata. It does not honor the nosave flag, and will always write its output.
Normally this is called automatically from inside the unpack methods, but it can also be called manually after an unpack to write an OPF anyway if the nosave flag was set.
Returns the filename of the OPF file.
Arguments
opffile
(optional) If specified, this overrides the object attribute opffile, and determines the filename to use for the generated OPF file. If not specified, and the object attribute opffile has somehow been cleared (the attribute is set during "new()"), it will be generated by looking at the htmlfile argument. If no value can be found, the method croaks. If a value was found somewhere other than the object attribute opffile, then the object attribute is updated to match.
textfile
(optional) The file containing the main text of the document. If specified, the method will attempt to split metadata out of the file and add whatever remains to the manifest of the OPF.
unpack()
This is a dispatcher for the specific unpacking methods needed to unpack a particular format. Unless you feel a need to override the unpacking method specified or detected during object construction, it is probably better to call this than the specific unpacking methods.
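The dispatcher idea can be illustrated with a small lookup table from format name to method name. Everything in this sketch is hypothetical: My::Unpacker is a stand-in class rather than EBook::Tools::Unpack, the format labels are assumed, and the unpacking methods are stubs that only return a string.

```perl
use strict;
use warnings;

{
    package My::Unpacker;    # hypothetical stand-in class
    use Carp qw(croak);

    # Assumed format labels mapped to the methods that handle them.
    my %dispatch = (
        mobi    => 'unpack_mobi',
        palmdoc => 'unpack_palmdoc',
    );

    sub new {
        my ($class, %args) = @_;
        return bless { %args }, $class;
    }

    # Stub unpackers; real ones would extract the book contents.
    sub unpack_mobi    { return 'unpacked as mobi' }
    sub unpack_palmdoc { return 'unpacked as palmdoc' }

    # Look up the method for the object's format and call it.
    sub unpack {
        my ($self) = @_;
        my $format = defined $self->{format} ? $self->{format} : '';
        my $method = $dispatch{$format}
            or croak "No unpacking method for format '$format'";
        return $self->$method();
    }
}

print My::Unpacker->new(format => 'mobi')->unpack(), "\n";
```

Keeping the table separate from the methods means a new format only needs one new entry and one new method.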
unpack_mobi()
Unpacks Mobipocket (.prc / .mobi) files.
unpack_mobi_record0($data)
Converts the information in the header data of PDB record 0 to entries inside the datahashes attribute.
Keys
The following keys are added to datahashes:
palm
Information from "unpack_palmdoc_header()"
mobi
Information from "unpack_mobi_header()"
mobiexth
Information from "unpack_mobi_exth()"
unpack_palmdoc()
Unpacks PalmDoc / AportisDoc (.pdb) files.
usedir()
Changes the current working directory to the directory specified by the object, creating it if necessary.
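A minimal sketch of the create-then-change-directory behaviour, using only core modules. The function name usedir_sketch and the choice to return the resulting working directory are assumptions for this example; the real method takes the directory from the object rather than from an argument.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Cwd qw(getcwd);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical stand-in for usedir(): create the directory if needed,
# then make it the current working directory.
sub usedir_sketch {
    my ($dir) = @_;
    make_path($dir) unless -d $dir;
    chdir $dir
        or croak "Unable to change working directory to '$dir': $!";
    return getcwd();
}

# Demonstrate inside a scratch directory, then return to where we started.
my $start   = getcwd();
my $scratch = tempdir(CLEANUP => 1);
print usedir_sketch("$scratch/mybook"), "\n";
chdir $start or die "Unable to return to '$start': $!";
```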
PROCEDURES
No procedures are exported by default, and in fact since the final module location for some of these procedures has not yet been finalized, none are even exportable.
Consider these to be private subroutines and use at your own risk.
fix_mobi_html(%args)
Takes raw Mobipocket output text and replaces the custom tags and file position anchors.
Arguments
textref
A reference to the raw document text. The procedure croaks if this is not supplied.
encoding
The encoding of the raw document text. Valid values are '1252' (Windows-1252) and '65001' (UTF-8). If not specified, '1252' will be assumed.
filename
The name of the output HTML file (used in generating hrefs). The procedure croaks if this is not supplied.
nonewlines
If this is set to true, the procedure will not attempt to insert newlines for readability. This will leave the output in a single unreadable line, but has the advantage of reducing the processing time, especially useful if tidy is going to be run on the output anyway.
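To give a feel for what replacing file position anchors involves, here is a rough sketch that is not the module's implementation. It assumes that links in the raw text look like <a filepos=0000123> and that the number is an offset into that same text; the function name and the generated id scheme (fileposN) are made up for this example. Anchors are inserted from the highest offset down so that earlier offsets stay valid.

```perl
use strict;
use warnings;

# Hypothetical helper: turn filepos links into ordinary href/id pairs.
sub fix_filepos_sketch {
    my ($textref, $filename) = @_;

    # Collect every distinct target offset mentioned by a filepos link.
    my %targets;
    while (${$textref} =~ /<a\s+filepos="?(\d+)"?/gi) {
        $targets{ $1 + 0 } = 1;
    }

    # Insert anchors starting from the end so offsets are not shifted.
    for my $pos (sort { $b <=> $a } keys %targets) {
        next if $pos > length ${$textref};
        substr(${$textref}, $pos, 0, qq{<a id="filepos$pos"></a>});
    }

    # Rewrite the links themselves, dropping leading zeroes.
    ${$textref} =~ s{<a\s+filepos="?0*(\d+)"?}{<a href="$filename#filepos$1"}gi;
    return;
}

my $body = 'Go</a><p>x</p>';
my $pos  = 19 + length $body;    # the link tag below is 19 characters long
my $text = sprintf('<a filepos=%07d>', $pos) . $body . '<p>Target</p>';
fix_filepos_sketch(\$text, 'out.html');
print "$text\n";
```

A real implementation would also have to cope with offsets that land inside a tag and with the custom Mobipocket elements, which this sketch ignores.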
hexstring($bindata)
Takes as an argument a scalar containing a sequence of binary bytes. Returns a string converting each octet of the data to its two-digit hexadecimal equivalent. There is no leading "0x" on the string.
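The described conversion can be sketched with the builtin unpack and the H* template. The name hexstring_sketch is used to make clear this is an illustration rather than the module's own code, and the lowercase output is a property of unpack, not something the documentation above specifies.

```perl
use strict;
use warnings;

# Illustrative equivalent: two hex digits per octet, no leading "0x".
sub hexstring_sketch {
    my ($bindata) = @_;
    return unpack 'H*', $bindata;
}

print hexstring_sketch("\x00\xFFAB"), "\n";
```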
unpack_mobi_exth($headerdata)
Takes as an argument a scalar containing the variable-length Mobipocket EXTH data from the first record. Returns an array of hashes, each hash containing the data from one EXTH record with values from that data keyed to recognizable names.
If $headerdata
doesn't appear to be an EXTH header, carps a warning and returns an empty list.
See:
http://wiki.mobileread.com/wiki/MOBI
Hash keys
type
A numeric value indicating the type of EXTH data in the record. See package variable %exthtypes.
length
The length of the data value in bytes.
data
The data of the record.
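The record structure above can be illustrated with a short parser. This sketch assumes the EXTH layout described on the MobileRead wiki: a 4-byte 'EXTH' identifier, a 4-byte header length, a 4-byte record count, then records consisting of a 4-byte type, a 4-byte record length that includes those 8 bytes, and the data. The function name is hypothetical and the real procedure may differ in detail.

```perl
use strict;
use warnings;
use Carp qw(carp);

# Hypothetical EXTH parser returning a list of hashes with the
# keys type, length and data.
sub unpack_exth_sketch {
    my ($headerdata) = @_;
    $headerdata = '' unless defined $headerdata;

    my ($ident, $length, $count) = unpack 'a4 N N', $headerdata;
    unless (defined $ident && $ident eq 'EXTH') {
        carp "Data does not appear to be an EXTH header";
        return;
    }
    $count = 0 unless defined $count;

    my @records;
    my $offset = 12;    # first record follows the 12-byte EXTH preamble
    for (1 .. $count) {
        last if $offset + 8 > length $headerdata;
        my ($type, $reclen) = unpack "x$offset N N", $headerdata;
        last if $reclen < 8;    # malformed record, stop rather than loop
        my $data = substr $headerdata, $offset + 8, $reclen - 8;
        push @records,
            { type => $type, length => length($data), data => $data };
        $offset += $reclen;
    }
    return @records;
}

# Build a one-record EXTH block by hand and parse it back.
my $record = pack('N N', 100, 8 + 3) . 'Zed';
my $exth   = pack('a4 N N', 'EXTH', 12 + length($record), 1) . $record;
for my $rec (unpack_exth_sketch($exth)) {
    print "$rec->{type} $rec->{length} $rec->{data}\n";
}
```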
unpack_mobi_header($headerdata)
Takes as an argument a scalar containing the variable-length Mobipocket-specific header data from the first record. Returns a hash containing values from that data keyed to recognizable names.
See:
http://wiki.mobileread.com/wiki/MOBI
keys
The returned hash will have the following keys (documented in the order in which they are encountered in the header):
identifier
-
This should always be the string 'MOBI'. If it isn't, the procedure croaks.
headerlength
-
This is the size of the complete header. If this value is different from the length of the argument, the procedure croaks.
type
-
A numeric code indicating what category of Mobipocket file this is.
encoding
-
A numeric code representing the encoding. Expected values are '1252' (for Windows-1252) and '65001' (for UTF-8).
The procedure carps a warning if an unexpected value is encountered.
uniqueid
-
This is thought to be a unique ID for the book, but its actual use is unknown.
Use with caution. This key may be renamed in the future if more information is found.
version
-
This is thought to be the Mobipocket format version. A second version code shows up again later as version2, which is usually the same on unprotected books but different on DRMd books.
Use with caution. This key may be renamed in the future if more information is found.
reserved
-
40 bytes of reserved data.
Use with caution. This key may be renamed in the future if more information is found.
nontextrecord
-
This is thought to be an index to the first PDB record other than the header record that does not contain the book text.
Use with caution. This key may be renamed in the future if more information is found.
titleoffset
-
Offset in record 0 (not from start of file) of the full title of the book.
titlelength
-
Length in bytes of the full title of the book
unknownlanguage
-
16 bits of unknown data thought to be related to the book language.
Use with caution. This key may be renamed in the future if more information is found.
region
-
The specific region of language. See %mobilangcodes for an exact map of values.
The bottom two bits of this value appear to be unused (i.e. all values are multiples of 4).
language
-
A main language code. See %mobilangcodes for an exact map of values.
unknowndilanguage
-
16 bits of unknown data thought to be related to the dictionary input language.
Use with caution. This key may be renamed in the future if more information is found.
dictionaryinregion
-
The specific region of dictionaryinlanguage. See %mobilangcodes for an exact map of values.
dictionaryinlanguage
-
The language code for the DictionaryInLanguage element. See %mobilangcodes for an exact map of values.
unknowndolanguage
-
16 bits of unknown data thought to be related to the dictionary output language.
Use with caution. This key may be renamed in the future if more information is found.
dictionaryoutregion
-
The specific region of dictionaryoutlanguage. See %mobilangcodes for an exact map of values.
dictionaryoutlanguage
-
The language code for the DictionaryOutLanguage element. See %mobilangcodes for an exact map of values.
version2
-
This is another Mobipocket format version related to DRM. If no DRM is present, it should be the same as version.
Use with caution. This key may be renamed in the future if more information is found.
imagerecord
-
This is thought to be an index to the first record containing image data.
Use with caution. This key may be renamed in the future if more information is found.
unknown96
-
Unsigned long int (32-bit) at offset 96.
Use with caution. This key may be renamed in the future if more information is found.
unknown100
-
Unsigned long int (32-bit) at offset 100.
Use with caution. This key may be renamed in the future if more information is found.
unknown104
-
Unsigned long int (32-bit) at offset 104.
Use with caution. This key may be renamed in the future if more information is found.
unknown108
-
Unsigned long int (32-bit) at offset 108.
Use with caution. This key may be renamed in the future if more information is found.
exthflags
-
A 32-bit bitfield related to the Mobipocket EXTH data. If bit 6 (0x40) is set, then there is at least one EXTH record.
unknown116
-
36 bytes of unknown data at offset 116. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
drmcode
-
A number thought to be related to DRM. If present and no DRM is set, contains either the value 0xFFFFFFFF (normal books) or 0x00000000 (samples). This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown156
-
20 bytes of unknown data at offset 156, usually zeroes. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown176
-
16 bits of unknown data at offset 176. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown178
-
16 bits of unknown data at offset 178. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown180
-
32 bits of unknown data at offset 180. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown184
-
32 bits of unknown data at offset 184. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown188
-
32 bits of unknown data at offset 188. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown192
-
32 bits of unknown data at offset 192. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown196
-
32 bits of unknown data at offset 196. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown200
-
Unknown data of unknown length running to the end of the header. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
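As a partial illustration of the keys listed above, the sketch below decodes only the first six fields and, when the header is long enough, exthflags. The offsets follow the order and sizes given in the list (so exthflags sits at offset 112, between unknown108 and unknown116), and all integers are assumed to be big-endian. The function name and the has_exth convenience key are hypothetical and not part of the documented return value.

```perl
use strict;
use warnings;
use Carp qw(carp croak);

# Hypothetical partial decoder for the Mobipocket-specific header.
sub mobi_header_sketch {
    my ($headerdata) = @_;
    croak "Header data too short"
        unless defined $headerdata && length($headerdata) >= 24;

    my %header;
    @header{qw(identifier headerlength type encoding uniqueid version)}
        = unpack 'a4 N N N N N', $headerdata;

    croak "Identifier is not 'MOBI'"
        unless $header{identifier} eq 'MOBI';
    croak "Header length does not match the data supplied"
        unless $header{headerlength} == length $headerdata;
    carp "Unexpected encoding '$header{encoding}'"
        unless $header{encoding} == 1252 || $header{encoding} == 65001;

    # exthflags is only present if the header reaches offset 116.
    if (length($headerdata) >= 116) {
        $header{exthflags} = unpack 'x112 N', $headerdata;
        # Bit 6 (0x40) set means at least one EXTH record follows.
        $header{has_exth} = ($header{exthflags} & 0x40) ? 1 : 0;
    }
    return %header;
}

# Build a 116-byte header by hand: six fields, padding, then exthflags.
my $fake = pack('a4 N N N N N', 'MOBI', 116, 2, 65001, 12345, 6)
         . ("\0" x 88)
         . pack('N', 0x50);
my %h = mobi_header_sketch($fake);
print "$h{identifier} $h{encoding} $h{has_exth}\n";
```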
unpack_palmdoc_header
Takes as an argument a scalar containing the 16 bytes of the PalmDoc header (also used by Mobipocket). Returns a hash containing those values keyed to recognizable names.
See:
http://wiki.mobileread.com/wiki/DOC#PalmDOC
and
http://wiki.mobileread.com/wiki/MOBI
keys
The returned hash will have the following keys:
compression
Possible values:
A warning will be carped if an unknown value is found.
textlength
Uncompressed length of book text in bytes
textrecords
Number of PDB records used for book text
recordsize
Maximum size of each record containing book text. This should always be 2048 (for some Mobipocket files) or 4096 (for everything else). A warning will be carped if it isn't.
unused
Two bytes that should always be zero. A warning will be carped if they aren't.
Note that the current position component of the header is discarded.
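The 16-byte header can be decoded with a single unpack call. This sketch assumes the big-endian field order given on the MobileRead wiki pages referenced above (compression, two unused bytes, text length, record count, record size, then four bytes of current position). The function name and the table of compression names are assumptions for illustration, not values taken from this documentation.

```perl
use strict;
use warnings;
use Carp qw(carp croak);

# Assumed meanings of the compression codes, for display only.
my %compression_name = (
    1     => 'none',
    2     => 'PalmDoc',
    17480 => 'HuffDic',
);

# Hypothetical decoder for the 16-byte PalmDoc header.
sub palmdoc_header_sketch {
    my ($headerdata) = @_;
    croak "Expected 16 bytes of PalmDoc header"
        unless defined $headerdata && length($headerdata) == 16;

    my %header;
    # The final 4 bytes (current position) are deliberately not unpacked.
    @header{qw(compression unused textlength textrecords recordsize)}
        = unpack 'n n N n n', $headerdata;

    carp "Unknown compression value '$header{compression}'"
        unless exists $compression_name{ $header{compression} };
    carp "Unexpected record size '$header{recordsize}'"
        unless $header{recordsize} == 2048 || $header{recordsize} == 4096;
    carp "Unused bytes are not zero"
        unless $header{unused} == 0;

    return %header;
}

my %palm = palmdoc_header_sketch(pack 'n n N n n N', 2, 0, 100_000, 25, 4096, 0);
print "$compression_name{ $palm{compression} } $palm{textlength} $palm{textrecords}\n";
```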
BUGS/TODO
DRM isn't handled. Infrastructure to support this via an external plug-in module may eventually be built, but it will never become part of the main module for legal reasons.
Mobipocket HuffDic encoding (used mostly on dictionaries) isn't supported yet.
Not all Mobipocket data is understood, so a conversion from OPF to Mobipocket .prc back to OPF will not result in all data being retained. Patches welcome.
Mobipocket EXTH subjectcode records may not end up attached to the correct subject element if the number of subject records differs from the number of subjectcode records. This is because the Mobipocket format leaves the EXTH subjectcode records completely unlinked from the subject records, and there is no way to detect if a subject with no associated subjectcode comes before a subject with an associated subjectcode.
Fortunately, this should rarely be a problem with real data, as Mobipocket Creator only allows a single subject to be set, and the only other way to have a subjectcode attached to a subject is to manually edit the OPF file and insert an additional dc:Subject element with a BASICCode attribute.
Mobipocket has indicated that they may move data currently in their custom elements and attributes to the standard <meta> elements in a future release, so this problem may become moot then.
Unit tests are incomplete.
Documentation is incomplete. Accessors in particular could use some cleaning up.
Setter methods for object attributes still need to be implemented.
Palm::Doc is currently used for extraction, with a lot of code in this module dedicated to extracting information that it can't. It may be better to split out that code into a dedicated module to replace Palm::Doc completely.
PDB Bookmarks aren't supported. This is a weakness inherited from Palm::Doc, and will take a while to fix.
Import/extraction/unpacking is currently limited to PalmDoc and Mobipocket. Extraction from eReader and Microsoft Reader (.lit) is also eventually planned. Other formats may follow from there.
AUTHOR
Zed Pobre <zed@debian.org>
COPYRIGHT
Copyright 2008 Zed Pobre
Licensed to the public under the terms of the GNU GPL, version 2