NAME
MsOffice::Word::Surgeon::PackagePart - Operations on a single part within the ZIP package of a docx document
DESCRIPTION
This class is part of MsOffice::Word::Surgeon; it encapsulates operations for a single package part within the ZIP package of a .docx document. It is mostly used for the document part, which contains the XML representation of the main document body. However, other parts such as headers, footers, footnotes, etc. have the same internal representation, so the same operations can be invoked on them.
METHODS
new
my $part = MsOffice::Word::Surgeon::PackagePart->new(
surgeon => $surgeon,
part_name => $name,
);
Constructor for a new part object. This is called internally from MsOffice::Word::Surgeon; it is not meant to be called directly by clients.
Constructor arguments
- surgeon: a weak reference to the main surgeon object
- part_name: ZIP member name of this part
Other attributes
Other attributes, which are not passed through the constructor but are generated lazily on demand, are:
- contents: the XML contents of this part
- runs: a decomposition of the XML contents into a collection of MsOffice::Word::Surgeon::Run objects.
- relationships: an arrayref of Office relationships associated with this part. This information comes from a .rels member in the ZIP archive, named after the name of the package part. Array indices correspond to relationship numbers. Array values are hashrefs with keys:
  - rId: the full relationship id
  - num: the numeric part of rId
  - type: the full reference to the XML schema for this relationship
  - short_type: only the last word of the type, e.g. 'image', 'style', etc.
  - target: designation of the target within the ZIP file. The prefix 'word/' must be added to obtain a complete ZIP member name.
- images: a hashref of images within this package part. Keys of the hash are image titles, because this is the only information visible to MsWord users: when in Word, within the image formatting panel, the "properties" tab has a field for editing the image title. Images without a title are not accessible through the current Perl module. Values of the hash are ZIP member names for the corresponding image representations in .png format.
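To illustrate how the keys of a relationship entry relate to each other, the following sketch rebuilds num, short_type and the full ZIP member name from a hand-written hashref. The sample values are hypothetical, and the extraction code is an assumption made for illustration; it is not taken from the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical relationship entry, shaped like the hashrefs described above
my $rel = {
  rId    => 'rId7',
  type   => 'http://schemas.openxmlformats.org/officeDocument/2006/relationships/image',
  target => 'media/image1.png',
};

# Numeric part of the relationship id
my ($num) = $rel->{rId} =~ /(\d+)$/;

# Last word of the type URI
my ($short_type) = $rel->{type} =~ m{(\w+)$};

# Complete ZIP member name: the 'word/' prefix is added to the target
my $zip_member = "word/$rel->{target}";

print "$num $short_type $zip_member\n";
```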
Contents restitution
contents
Returns a Perl string with the current internal XML representation of the part contents.
original_contents
Returns a Perl string with the XML representation of the part contents, as it was in the ZIP archive before any modification.
indented_contents
Returns an indented version of the XML contents, suitable for inspection in a text editor. This is produced by "toString" in XML::LibXML::Document and therefore is returned as an encoded byte string, not a Perl string.
plain_text
Returns the text contents of the part, without any markup. Paragraphs and breaks are converted to newlines, all other formatting instructions are ignored.
runs
Returns a list of MsOffice::Word::Surgeon::Run objects. Each of these objects holds an XML fragment; joining all fragments restores the complete document.
my $contents = join "", map {$_->as_xml} $self->runs;
Modifying contents
cleanup_XML
$part->cleanup_XML;
Applies several other methods for removing unnecessary nodes within the internal XML. This method successively calls "reduce_all_noises", "unlink_fields", "suppress_bookmarks" and "merge_runs".
reduce_noise
$part->reduce_noise($regex1, $regex2, ...);
This method is used for removing unnecessary information in the XML markup. It applies the given list of regexes to the whole document, suppressing matches. The final result is put back into $self->contents. Regexes may be given either as qr/.../ references, or as names of builtin regexes (described below). Regexes are applied to the whole XML contents, not only to run nodes.
noise_reduction_regex
my $regex = $part->noise_reduction_regex($regex_name);
Returns the builtin regex corresponding to the given name. Known regexes are:
proof_checking => qr(<w:(?:proofErr[^>]+|noProof/)>),
revision_ids => qr(\sw:rsid\w+="[^"]+"),
complex_script_bold => qr(<w:bCs/>),
page_breaks => qr(<w:lastRenderedPageBreak/>),
language => qr(<w:lang w:val="[^/>]+/>),
empty_run_props => qr(<w:rPr></w:rPr>),
soft_hyphens => qr(<w:softHyphen/>),
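The effect of these regexes can be tried in plain Perl, without the module. The sketch below copies three of the builtin regexes listed above and suppresses their matches in a hypothetical XML fragment. The substitution loop is an assumption about how "suppressing matches" could be done, not the module's actual code.

```perl
use strict;
use warnings;

# Three builtin regexes copied from the list above, applied in this order
my @noise = (
  qr(<w:(?:proofErr[^>]+|noProof/)>),   # proof_checking
  qr(\sw:rsid\w+="[^"]+"),              # revision_ids
  qr(<w:rPr></w:rPr>),                  # empty_run_props
);

# Hypothetical fragment, as MsWord might produce it
my $xml = '<w:r w:rsidR="00A1B2C3"><w:rPr><w:noProof/></w:rPr><w:t>Hello</w:t></w:r>'
        . '<w:proofErr w:type="spellStart"/>';

# Suppress every match of every regex
my $cleaned = $xml;
$cleaned =~ s/$_//g for @noise;

print "$cleaned\n";
```

Order matters in this sketch: the run properties only become empty, and therefore removable by empty_run_props, once the noProof element has been suppressed.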
reduce_all_noises
$part->reduce_all_noises;
Applies all regexes from the previous method.
unlink_fields
my $names_of_ASK_fields = $part->unlink_fields;
Removes all fields from the part, just leaving the current value stored in each field. This is the equivalent of performing Ctrl-Shift-F9 on the whole document.
The return value is an arrayref of names of ASK fields within the document. Such names should then be passed to the "suppress_bookmarks" method (see below).
suppress_bookmarks
$part->suppress_bookmarks(@names_to_erase);
Removes bookmarks markup in the part. This is useful because MsWord may silently insert bookmarks in unexpected places; therefore some searches within the text may fail because of such bookmarks.
By default, this method only removes the bookmarks markup, leaving intact the contents of the bookmark. However, when the name of a bookmark belongs to the list @names_to_erase, the contents are also removed. Currently this is used for suppressing ASK fields, because such fields contain bookmark content that is never displayed by MsWord.
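As a rough picture of the default behaviour (markup removed, contents kept), the toy example below strips bookmark start and end markers from a hypothetical fragment with a single regex. This regex is an assumption for illustration only; the module's own handling, in particular erasing the contents of named bookmarks, is more involved.

```perl
use strict;
use warnings;

# Hypothetical paragraph in which MsWord has inserted a bookmark
my $xml = '<w:p><w:bookmarkStart w:id="0" w:name="_GoBack"/>'
        . '<w:r><w:t>Hello</w:t></w:r>'
        . '<w:bookmarkEnd w:id="0"/></w:p>';

# Toy removal of the bookmark markers; the text between them is left intact
(my $stripped = $xml) =~ s{<w:bookmark(?:Start|End)\b[^>]*/>}{}g;

print "$stripped\n";
```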
merge_runs
$part->merge_runs(no_caps => 1); # optional arg
Walks through all runs of text within the document, trying to merge adjacent runs when possible (i.e. when both runs have the same properties, and there is no other XML node in between).
This operation is a prerequisite before performing replace operations, because documents edited in MsWord often have run boundaries across sentences or even in the middle of words; so regex searches can only be successful if those artificial boundaries have been removed.
If the argument no_caps => 1 is present, the merge operation will also convert runs with the w:caps property, putting all letters into uppercase and removing the property; this makes more merges possible.
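The reason why merging is a prerequisite for searching can be shown with a toy example in plain Perl. The fragment is hypothetical, and the "merge" is a naive string substitution that only works because both runs have no properties at all; the real method compares run properties before merging.

```perl
use strict;
use warnings;

# Hypothetical fragment: the word "Hello" is split across two runs
my $xml = '<w:r><w:t>Hel</w:t></w:r><w:r><w:t>lo world</w:t></w:r>';

# Searching each text node separately finds nothing
my @texts        = $xml =~ m{<w:t>([^<]*)</w:t>}g;
my $found_before = grep /Hello/, @texts;

# Toy merge: glue two adjacent property-less runs together
(my $merged = $xml) =~ s{</w:t></w:r><w:r><w:t>}{}g;

# After the merge, the same search succeeds
my @merged_texts = $merged =~ m{<w:t>([^<]*)</w:t>}g;
my $found_after  = grep /Hello/, @merged_texts;

print "matches before merge: $found_before, after merge: $found_after\n";
```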
replace
$part->replace($pattern, $replacement, %replacement_args);
Replaces all occurrences of $pattern regex within the text nodes by the given $replacement. This is not exactly like a search-replace operation performed within MsWord, because the search does not cross boundaries of text nodes. In order to maximize the chances of successful replacements, the "cleanup_XML" method is automatically called before starting the operation.
The argument $pattern can be either a string or a reference to a regular expression. It should not contain any capturing parentheses, because that would interfere with text splitting operations.
The argument $replacement can be either a fixed string, or a reference to a callback subroutine that will be called for each match.
The %replacement_args hash can be used to pass information to the callback subroutine. That hash will be enriched with three entries:
- matched: The string that has been matched by $pattern.
- run: The run object in which this text resides.
- xml_before: The XML fragment (possibly empty) found before the matched text.
The callback subroutine may return either plain text or structured XML. See "SYNOPSIS" in MsOffice::Word::Surgeon::Run for an example of a replacement callback.
The following special keys within %replacement_args are interpreted by the replace() method itself, and therefore are not passed to the callback subroutine:
- keep_xml_as_is: if true, no call is made to the "cleanup_XML" method before performing the replacements
- dont_overwrite_contents: if true, the internal XML contents are not modified in place; the new XML after performing replacements is merely returned to the caller.
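The restriction on capturing parentheses in $pattern can be understood from the behaviour of Perl's split. The sketch below assumes that text is split on the pattern wrapped in a single capturing group, so that matched and unmatched pieces alternate. This is an assumption about the implementation, made for illustration, and the sample text is hypothetical.

```perl
use strict;
use warnings;

my $text = 'order 42 and order 7';

# Pattern without capturing groups: split returns alternating
# unmatched and matched pieces
my $pattern = qr/\d+/;
my @fields  = split /($pattern)/, $text, -1;

# Pattern with an inner capturing group: split also returns the inner
# capture after each match, which breaks the alternation
my $bad_pattern = qr/(\d)\d*/;
my @bad_fields  = split /($bad_pattern)/, $text, -1;

printf "%d fields without inner capture, %d with\n",
       scalar @fields, scalar @bad_fields;
```

When grouping is needed inside a pattern, a non-capturing group such as (?:...) avoids the extra fields.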
replace_image
$part->replace_image($image_title, $image_PNG_content);
Replaces an existing PNG image with a new image. All features of the old image will be preserved (size, positioning, border, etc.) -- only the image itself will be replaced. The $image_title must correspond to the title set in Word through the image formatting panel, "properties" tab, "title" field.
add_image
my $rId = $part->add_image($image_PNG_content);
Stores the given PNG image within the ZIP file, adds it as a relationship to the current part, and returns the relationship id. This operation is not sufficient to make the image visible in Word: it just stores the image, but you still have to insert a proper drawing node in the contents XML, using the $rId. Future versions of this module may offer helper methods for that purpose; currently it must be done by hand.
AUTHOR
Laurent Dami, <dami AT cpan DOT org>
COPYRIGHT AND LICENSE
Copyright 2019-2022 by Laurent Dami.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.