NAME
EBook::Tools::Mobipocket - Components related to the Mobipocket format.
SYNOPSIS
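A minimal usage sketch, assuming only the methods documented below plus Load() inherited from Palm::PDB. The helper name sketch_extract_book and the file names passed to it are hypothetical, and the require is wrapped in eval so that the sketch degrades quietly when the module is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: extracts the text and images of a Mobipocket file.
# Returns the number of images written, or nothing if the module is absent.
sub sketch_extract_book {
    my ($infile, $htmlfile) = @_;
    return unless eval { require EBook::Tools::Mobipocket; 1 };

    my $mobi = EBook::Tools::Mobipocket->new();
    $mobi->Load($infile);                      # inherited from Palm::PDB
    $mobi->fix_html(filename => $htmlfile);    # same name as for write_text()
    $mobi->write_text($htmlfile);
    return $mobi->write_images();
}
```

A call such as sketch_extract_book('book.prc', 'book.html') passes the same output filename to fix_html() and write_text(), which the documentation below requires for consistent internal links.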
DEPENDENCIES
Compress::Zlib
HTML::Tree
Image::Size
List::MoreUtils
P5-Palm
CONSTRUCTOR
new()
Instantiates a new EBook::Tools::Mobipocket object.
ACCESSOR METHODS
text()
Returns the text of the file.
MODIFIER METHODS
These methods have two naming/capitalization schemes -- methods directly related to the subclassing of Palm::PDB use its MethodName capitalization style. Any other methods are lowercase_with_underscores for consistency with the rest of EBook::Tools.
ParseRecord(%record)
Parses PDB records, updating the object attributes. This method is called automatically on every database record during Load().
ParseRecord0($data)
Parses the header record and places the parsed values into the hashrefs $self->{header}{palm}, $self->{header}{mobi}, and $self->{header}{exth} by calling "parse_palmdoc_header()", "parse_mobi_header()", and "parse_mobi_exth()" respectively.
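As a rough illustration of how the header record divides into these three parts, the sketch below slices raw record 0 data using core Perl only. It assumes the layout described on the MobileRead wiki: a 16-byte PalmDOC header, then the MOBI header (whose second field is its own length), then an optional EXTH block. The helper sketch_split_record0 is hypothetical and is not how this module is necessarily implemented.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of EBook::Tools::Mobipocket.
# Assumes: 16-byte PalmDOC header, then 'MOBI' + 32-bit big-endian header
# length, then an optional 'EXTH' block carrying its own 32-bit length.
sub sketch_split_record0 {
    my ($data) = @_;
    die "record 0 too short\n" if !defined $data || length($data) < 24;

    my %parts = (palm => substr($data, 0, 16));

    my ($id, $mobilength) = unpack('a4 N', substr($data, 16, 8));
    die "no MOBI identifier\n" unless $id eq 'MOBI';
    die "MOBI header truncated\n" if 16 + $mobilength > length $data;
    $parts{mobi} = substr($data, 16, $mobilength);

    # Whatever follows the MOBI header may start with an EXTH block.
    my $rest = substr($data, 16 + $mobilength);
    if (length($rest) >= 8 && substr($rest, 0, 4) eq 'EXTH') {
        my $exthlength = unpack('N', substr($rest, 4, 4));
        $parts{exth} = substr($rest, 0, $exthlength);
    }
    return \%parts;
}
```

Each slice could then be handed to the matching parse_* procedure.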
ParseRecordImage(\$dataref)
Parses image records, updating object attributes, most notably adding the image data to the hash $self->{imagedata}, adding the image filename to $self->{recindexlinks}, and incrementing $self->{recindex}.
Takes as an argument a reference to the record data. Croaks if it isn't provided, or isn't a reference.
This is called automatically by "ParseRecord()" and "ParseResource()" as needed.
ParseRecordText(\$dataref)
Parses text records, updating object attributes, most notably appending text to $self->{text}. Takes as an argument a reference to the record data.
This is called automatically by "ParseRecord()" and "ParseResource()" as needed.
fix_html(%args)
Takes raw Mobipocket text and replaces the custom tags and file position anchors.
Arguments
filename
The name of the output HTML file (used in generating hrefs). The procedure croaks if this is not supplied.
nonewlines
(optional) If this is set to true, the procedure will not attempt to insert newlines for readability. This leaves the output as a single unreadable line, but reduces processing time, which is especially useful if tidy is going to be run on the output anyway.
fix_html_filepos()
Takes the raw HTML text of the object and replaces the filepos anchors. This has to be called before any other action that modifies the text, or the filepos positions will not be valid.
Returns 1 if successful, undef if there was no text to fix.
This is called automatically by "fix_html()".
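The ordering constraint above follows from filepos values being byte offsets into the unmodified text. The sketch below is one way to illustrate the idea, not the module's actual implementation: it inserts anchors from the highest offset downward so that earlier insertions do not shift later targets, and only then rewrites the links. The helper name and the id naming scheme (fileposNNN) are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of EBook::Tools::Mobipocket.
# Expects a byte string; returns the fixed text, or nothing if no text.
sub sketch_fix_filepos {
    my ($text) = @_;
    return unless defined $text && length $text;

    # Collect the distinct target offsets while the text is still untouched.
    my %seen;
    my @positions = grep { !$seen{$_}++ }
                    map  { $_ + 0 } $text =~ /filepos="?(\d+)"?/g;

    # Insert from the highest offset down so lower offsets stay valid.
    for my $pos (sort { $b <=> $a } @positions) {
        next if $pos > length $text;
        substr($text, $pos, 0, qq{<a id="filepos$pos"></a>});
    }

    # Only now rewrite the links, since this changes the text length.
    $text =~ s/filepos="?0*(\d+)"?/href="#filepos$1"/g;
    return $text;
}
```

Rewriting the links first would change the length of the text ahead of every target, which is the failure the ordering rule guards against.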
write_images()
Writes each image record to the disk.
Returns the number of images written.
write_text($filename)
Writes the book text to disk with the given filename. This filename must match the filename given to "fix_html()" for the internal links to be consistent.
Croaks if $filename is not specified.
Returns 1 on success, or undef if there was no text to write.
write_unknown_records()
Writes each unidentified record to disk with a filename in the format of 'raw-record-####', where #### is the record number (not the record ID).
Returns the number of records written.
parse_mobi_exth($headerdata)
Takes as an argument a scalar containing the variable-length Mobipocket EXTH data from the first record. Returns an array of hashes, each hash containing the data from one EXTH record with values from that data keyed to recognizable names.
If $headerdata doesn't appear to be an EXTH header, carps a warning and returns an empty list.
See:
http://wiki.mobileread.com/wiki/MOBI
Hash keys
type
A numeric value indicating the type of EXTH data in the record. See package variable %exthtypes.
length
The length of the data value in bytes.
data
The data of the record.
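To make the shape of the returned list concrete, here is a minimal sketch of an EXTH walker written with core Perl only. It assumes the layout given on the MobileRead wiki: the string 'EXTH', a 32-bit header length, a 32-bit record count, then records made of a 32-bit type, a 32-bit length that includes those 8 bytes, and the data. The helper sketch_parse_exth is hypothetical and does not map types to names as the real procedure does.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of EBook::Tools::Mobipocket.
# Returns a list of hashrefs with the keys type, length and data.
sub sketch_parse_exth {
    my ($headerdata) = @_;
    return () unless defined $headerdata;

    my ($id, $headerlength, $count) = unpack('a4 N N', $headerdata);
    return () unless defined $count && $id eq 'EXTH';

    my @records;
    my $offset = 12;                     # first record follows the 12-byte preamble
    for (1 .. $count) {
        last if $offset + 8 > length $headerdata;
        my ($type, $reclength) = unpack('N N', substr($headerdata, $offset, 8));
        last if $reclength < 8;          # malformed record, stop rather than loop
        my $data = substr($headerdata, $offset + 8, $reclength - 8);
        push @records, { type => $type, length => length($data), data => $data };
        $offset += $reclength;
    }
    return @records;
}
```

Here the length key holds the size of the data alone, matching the description above, rather than the on-disk record length.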
parse_mobi_header($headerdata)
Takes as an argument a scalar containing the variable-length Mobipocket-specific header data from the first record. Returns a hash containing values from that data keyed to recognizable names.
See:
http://wiki.mobileread.com/wiki/MOBI
keys
The returned hash will have the following keys (documented in the order in which they are encountered in the header):
identifier
-
This should always be the string 'MOBI'. If it isn't, the procedure croaks.
headerlength
-
This is the size of the complete header. If this value is different from the length of the argument, the procedure croaks.
type
-
A numeric code indicating what category of Mobipocket file this is.
encoding
-
A numeric code representing the encoding. Expected values are '1252' (for Windows-1252) and '65001' (for UTF-8).
The procedure carps a warning if an unexpected value is encountered.
uniqueid
-
This is thought to be a unique ID for the book, but its actual use is unknown.
Use with caution. This key may be renamed in the future if more information is found.
version
-
This is thought to be the Mobipocket format version. A second version code shows up again later as version2, which is usually the same on unprotected books but different on DRMd books.
Use with caution. This key may be renamed in the future if more information is found.
reserved
-
40 bytes of reserved data.
Use with caution. This key may be renamed in the future if more information is found.
firstimagerecord
-
This is thought to be an index to the first record containing image data. If there are no images in the book, this value will be 4294967295 (0xffffffff).
Use with caution. This key may be renamed in the future if more information is found.
titleoffset
-
Offset in record 0 (not from start of file) of the full title of the book.
titlelength
-
Length in bytes of the full title of the book.
languageunknown
-
16 bits of unknown data thought to be related to the book language.
Use with caution. This key may be renamed in the future if more information is found.
language
-
A pseudo-IANA language code string representing the main book language (i.e. the value of <dc:language>). See %mobilangcodes for an exact map of raw values to this string and notes on non-compliant results.
dilanguageunknown
-
16 bits of unknown data thought to be related to the dictionary input language.
Use with caution. This key may be renamed in the future if more information is found.
dilanguage
-
A pseudo-IANA language code string for the DictionaryInLanguage element. See %mobilangcodes for an exact map of raw values to this string and notes on non-compliant results.
dolanguageunknown
-
16 bits of unknown data thought to be related to the dictionary output language.
Use with caution. This key may be renamed in the future if more information is found.
dolanguage
-
A pseudo-IANA language code string for the DictionaryOutLanguage element. See %mobilangcodes for an exact map of raw values to this string and notes on non-compliant results.
version2
-
This is another Mobipocket format version related to DRM. If no DRM is present, it should be the same as version.
Use with caution. This key may be renamed in the future if more information is found.
nontextrecord
-
This is thought to be an index to the first PDB record other than the header record that does not contain the book text.
Use with caution. This key may be renamed in the future if more information is found.
unknown96
-
Unsigned long int (32-bit) at offset 96.
Use with caution. This key may be renamed in the future if more information is found.
unknown100
-
Unsigned long int (32-bit) at offset 100.
Use with caution. This key may be renamed in the future if more information is found.
unknown104
-
Unsigned long int (32-bit) at offset 104.
Use with caution. This key may be renamed in the future if more information is found.
unknown108
-
Unsigned long int (32-bit) at offset 108.
Use with caution. This key may be renamed in the future if more information is found.
exthflags
-
A 32-bit bitfield related to the Mobipocket EXTH data. If bit 6 (0x40) is set, then there is at least one EXTH record.
unknown116
-
36 bytes of unknown data at offset 116. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
drmcode
-
A number thought to be related to DRM. If present and no DRM is set, contains either the value 0xFFFFFFFF (normal books) or 0x00000000 (samples). This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown156
-
20 bytes of unknown data at offset 156, usually zeroes. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown176
-
16 bits of unknown data at offset 176. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
lastimagerecord
-
This is thought to be an index to the last record containing image data. If there are no images in the book, this value will be 65535 (0xffff).
Use with caution. This key may be renamed in the future if more information is found.
unknown180
-
32 bits of unknown data at offset 180. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
fcisrecord
-
This is thought to be an index to a 'FCIS' record, so named because those are always the first four characters when the record data is decompressed using uncompress_palmdoc().
This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown188
-
32 bits of unknown data at offset 188. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
flisrecord
-
This is thought to be an index to a 'FLIS' record, so named because those are always the first four characters when the record data is decompressed using uncompress_palmdoc().
This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown196
-
32 bits of unknown data at offset 196. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
unknown200
-
Unknown data of unknown length running to the end of the header. This value will be undefined if the header data was not long enough to contain it.
Use with caution. This key may be renamed in the future if more information is found.
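As a small illustration of the key list above, the sketch below reads the first six keys and the EXTH flag from raw header data. It assumes that each of these fields is a 32-bit big-endian unsigned integer and that exthflags sits at offset 112, between unknown108 and unknown116. The helper name and the has_exth key are hypothetical and are not returned by the real procedure.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of EBook::Tools::Mobipocket.
# Assumes every field read here is a 32-bit big-endian unsigned integer.
sub sketch_peek_mobi_header {
    my ($headerdata) = @_;
    die "header too short\n"
        if !defined $headerdata || length($headerdata) < 24;

    my %h;
    @h{qw(identifier headerlength type encoding uniqueid version)}
        = unpack('a4 N5', $headerdata);
    die "identifier is not 'MOBI'\n" unless $h{identifier} eq 'MOBI';

    # exthflags follows unknown108, so it is read from offset 112.
    if (length($headerdata) >= 116) {
        $h{exthflags} = unpack('N', substr($headerdata, 112, 4));
        $h{has_exth}  = ($h{exthflags} & 0x40) ? 1 : 0;   # illustrative key only
    }
    return %h;
}
```

Checking the length before reading each later field mirrors the documented behaviour of leaving keys undefined when the header is too short to contain them.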
parse_mobi_language($languagecode, $regioncode)
Takes the integer values $languagecode and $regioncode unpacked from the Mobipocket header and returns a language string mostly (but not entirely) conformant to the IANA language subtag registry codes.
Croaks if $languagecode is not provided. If $languagecode isn't recognized, a warning is carped and the sub returns undef. If $regioncode is not provided or not recognized, it is disregarded and the base language string (with no region or script) is returned. Note that 0,0 is a recognized code returning an empty string.
See %mobilangcodes for an exact map of values. Note that the bottom two bits of the region code appear to be unused (i.e. the values are all multiples of 4).
unpack_mobi_language($data)
Takes as an argument 4 bytes of data. If fewer than 4 bytes are provided, the sub croaks. If more are provided, a debug warning is issued, but the sub continues.
In scalar context returns a language string mostly (but not entirely) conformant to the IANA language subtag registry codes.
In list context, returns the language string, an unknown code integer, a region code integer, and a language code integer, with the last three being directly unpacked values.
See %mobilangcodes
for an exact map of values. Note that the bottom two bits of the region code appear to be unused (i.e. the values are all multiples of 4). The unknown code integer appears to be unused, and is generally zero.
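The list-context return values can be pictured with a short unpack sketch. It assumes the 4 bytes are laid out as a 16-bit unknown value followed by one region byte and one language byte, in that order. This layout is inferred from the return order described above and is not confirmed here, and the helper name and sample bytes are hypothetical. No lookup against %mobilangcodes is attempted.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of EBook::Tools::Mobipocket.
# Assumed layout: 16-bit unknown value, region byte, language byte.
sub sketch_unpack_language_codes {
    my ($data) = @_;
    die "need 4 bytes of data\n" if !defined $data || length($data) < 4;
    my ($unknowncode, $regioncode, $languagecode) = unpack('n C C', $data);
    return ($unknowncode, $regioncode, $languagecode);
}

# Sample bytes chosen for illustration only.
my ($unknown, $region, $language)
    = sketch_unpack_language_codes("\x00\x00\x04\x09");
printf "language=%d region=%d (multiple of 4: %s) unknown=%d\n",
    $language, $region, ($region % 4 == 0 ? 'yes' : 'no'), $unknown;
```

The modulo check reflects the observation above that the bottom two bits of the region code appear to be unused.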
BUGS AND LIMITATIONS
Mobipocket HuffDic encoding (used mostly on dictionaries) isn't supported yet.
Not all Mobipocket data is understood, so a conversion from OPF to Mobipocket .prc back to OPF will not result in all data being retained. Patches welcome.
Mobipocket EXTH subjectcode records may not end up attached to the correct subject element if the number of subject records differs from the number of subjectcode records. This is because the Mobipocket format leaves the EXTH subjectcode records completely unlinked from the subject records, and there is no way to detect if a subject with no associated subjectcode comes before a subject with an associated subjectcode.
Fortunately, this should rarely be a problem with real data, as Mobipocket Creator only allows a single subject to be set, and the only other way to have a subjectcode attached to a subject is to manually edit the OPF file and insert an additional dc:Subject element with a BASICCode attribute.
Mobipocket has indicated that they may move data currently in their custom elements and attributes to the standard <meta> elements in a future release, so this problem may become moot then.
AUTHOR
Zed Pobre <zed@debian.org>
LICENSE AND COPYRIGHT
Copyright 2008 Zed Pobre
Licensed to the public under the terms of the GNU GPL, version 2