NAME

ODF::lpOD_Helper - fix and enhance ODF::lpOD

SYNOPSIS

use feature 'unicode_strings';
use ODF::lpOD;
use ODF::lpOD_Helper qw/:chars :DEFAULT/;

my $doc = odf_get_document("/path/to/something.xml");
my $body = $doc->get_body;

# Replace "{famous author}" with "Stephen King" in large, red, bold text,
# regardless of how the text is segmented.
$body->Hreplace("{famous author}",
                [["bold", size => "24pt", color => "red"], "Stephen King"]
               );

The following functions are exported by default:

Hautomatic_style 
Hcommon_style
self_or_parent
fmt_match fmt_node fmt_tree
The REPL_* constants used by the Hreplace method.

DESCRIPTION

ODF::lpOD_Helper enables transparent Unicode support and provides higher-level text search & replace which can match segmented text including tabs, newlines, and multiple spaces.

Styles may be specified with a high-level notation and the necessary ODF styles are automatically created and fonts registered.

ODF::lpOD by itself can be inconvenient for text operations because

  1. Method arguments must be passed as encoded binary octets, rather than character strings (see 'man perlunicode').

  2. search() cannot match segmented strings, and so cannot match text which LibreOffice has fragmented for its own internal purposes (such as "record changes"), nor can searches match tabs, newlines or multiple spaces.

  3. replace() cannot replace text stored in multiple segments, and will store \t, \n, or consecutive spaces embedded in a single #PCDATA node rather than using the special ODF objects.

ODF::lpOD_Helper also fixes a bug causing spurious "Unknown method DESTROY" warnings (https://rt.cpan.org/Public/Bug/Display.html?id=97977).

The ':chars' import tag

This makes all ODF::lpOD methods accept and return character strings rather than encoded binary.

You will always want this unless your application really, really needs to pass un-decoded octets directly between file/network resources and ODF::lpOD without looking at the data along the way. It is not enabled by default, to avoid breaking old programs. See ODF::lpOD_Helper::Unicode.

Currently :chars has global effect, but it might someday become scoped; to be safe, put "use ODF::lpOD_Helper ':chars'" at the top of every file.
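The difference between character strings and encoded octets can be seen with the core Encode module alone. The following is a minimal sketch that does not call ODF::lpOD at all; the sample string is an arbitrary illustration. Without ':chars' the octet form is what ODF::lpOD methods expect, and with ':chars' the character form is used.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A Perl character string: "Café" followed by a space and U+2615.
my $chars  = "Caf\x{e9} \x{2615}";

# The same text as UTF-8 encoded binary octets.
my $octets = encode("UTF-8", $chars);

# Characters and octets differ in length for non-ASCII text.
my $n_chars  = length($chars);
my $n_octets = length($octets);

# Decoding the octets recovers the original character string.
my $roundtrip = decode("UTF-8", $octets);

print "$n_chars characters, $n_octets octets\n";
print "round trip ok\n" if $roundtrip eq $chars;
```

Regex character classes, length(), and substr() only behave as expected on the character form, which is why working with decoded strings is usually preferable.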

METHODS

"Hxxx" methods are installed into appropriate ODF::lpOD packages so they can be called the same way as native ODF::lpOD methods ('H' denotes extensions from ODF::lpOD_Helper).

@matches = $context->Hsearch($expr)

$match = $context->Hsearch($expr, OPTIONS)

Finds $expr within the "virtual text" of paragraphs below $context (or $context itself if it is a paragraph).

The "virtual text" is the concatenation of all leaf nodes in the paragraph, treating the special tab, newline, and space objects as if they stored normal text.

Each match must be contained within a paragraph, but may include any number of segments and need not start or end on segment boundaries. A match may encompass leaves under different spans.

$expr may be a plain string or qr/regex/s (the /s option allows '.' to match \n). Spaces, tabs, and newlines in $expr will match the corresponding special ODF objects as well as regular text.

$context may be a paragraph or an ancestor such as a table cell, or even the document body; all contained paragraphs are searched.

OPTIONS may be

multi  => BOOL    # Allow multiple matches? (FALSE by default)

offset => NUMBER  # Starting position within the combined virtual
                  # texts of all paragraphs in $context

A hash is returned for each match:

{
  match    => The matched virtual text
  segments => [ leaf nodes containing the match ]
  offset   => Offset of match in the first segment's virtual text
  end      => Offset+1 of end of match in the last segment's v.t.
  voffset  => Offset of match in the combined virtual texts
  vend     => Offset+1 of match-end in the combined virtual texts
}

       Para.#1 ║ Paragraph #2 containing a match  │
       (ignord)║  spread over several segments    │
               ║                                  │
               ║                                  │
       ------------match voffset---►┊             │
       --------match vend---------------------►┊  │
               ║                    ┊          ┊  │
               ║              match ┊   match  ┊  │
               ║             ║-off-►┊ ║--end--►┊  │
       ╓──╥────╥──╥────╥─────╥──────┬─╥────────┬──╖
       ║xx║xxxx║xx║xxxx║xx...║......**║*MATCH**...║
       ║xx║xxxx║xx║xxxx║xxSEA║RCHED VI║IRTUAL TEXT║
       ╙──╨────╨──╨────╨──┼──╨────────╨───────────╜
       ┊─OPTION 'offset'─►┊

Note: text:tab and text:newline objects count as one virtual character; if the last segment is a text:s (which can represent several consecutive spaces), then 'end' will be the number of virtual spaces included in the match.
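The relationship between segments, virtual text, and the offsets in a match hash can be modeled in plain Perl. The sketch below is only an illustration of the description above, not the module's implementation: the leaf records, vtext(), and model_search() are hypothetical stand-ins, and no ODF::lpOD objects are involved.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for one paragraph's leaf nodes (NOT ODF::lpOD objects).
my @leaves = (
  { kind => "text", text => "Dear {famous" },
  { kind => "s",    count => 2 },          # like text:s holding two spaces
  { kind => "text", text => "author}," },
  { kind => "tab" },                       # like text:tab
  { kind => "text", text => "hello" },
);

# Virtual text of one leaf: special objects act like the characters they represent.
sub vtext {
  my ($leaf) = @_;
  return $leaf->{text}        if $leaf->{kind} eq "text";
  return " " x $leaf->{count} if $leaf->{kind} eq "s";
  return "\t"                 if $leaf->{kind} eq "tab";
  return "\n"                 if $leaf->{kind} eq "newline";
  die "unknown leaf kind '$leaf->{kind}'";
}

# Search the concatenated virtual text and describe the first match.
sub model_search {
  my ($regex, @segs) = @_;
  my @texts   = map { vtext($_) } @segs;
  my $virtual = join "", @texts;
  return unless $virtual =~ $regex;
  my ($voffset, $vend) = ($-[0], $+[0]);

  my (@hit, $first_start, $last_start);
  my $pos = 0;
  for my $i (0 .. $#segs) {
    my ($start, $stop) = ($pos, $pos + length $texts[$i]);
    $pos = $stop;
    next if $stop <= $voffset or $start >= $vend;  # segment lies outside the match
    $first_start //= $start;
    $last_start = $start;
    push @hit, $segs[$i];
  }
  return {
    match    => substr($virtual, $voffset, $vend - $voffset),
    segments => \@hit,
    offset   => $voffset - $first_start,  # relative to the first segment
    end      => $vend - $last_start,      # relative to the last segment
    voffset  => $voffset,
    vend     => $vend,
  };
}

my $m = model_search(qr/\{famous\s+author\}/, @leaves);
printf "voffset=%d vend=%d offset=%d end=%d segments=%d\n",
       @$m{qw(voffset vend offset end)}, scalar @{ $m->{segments} };
```

Here the match starts inside the first leaf, covers the whole two-space leaf, and ends inside the third leaf, so 'offset' and 'end' are measured within different segments while 'voffset' and 'vend' are measured from the start of the virtual text.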

RETURNS:

In array context, zero or more hashrefs.

In scalar context, a hashref, or undef if there was no match; croaks if there were multiple matches.
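A minimal usage sketch of both calling styles follows. It assumes ODF::lpOD's odf_new_document, odf_create_paragraph, and append_element behave as in that module's documentation; the sample text and variable names are illustrative only.

```perl
use strict;
use warnings;
use ODF::lpOD;
use ODF::lpOD_Helper qw/:chars :DEFAULT/;

# Build a small in-memory text document (sample text is illustrative).
my $doc  = odf_new_document("text");
my $body = $doc->get_body;
$body->append_element(odf_create_paragraph(text => "One fish, two fish"));
$body->append_element(odf_create_paragraph(text => "Red fish, blue fish"));

# List context with multi => 1: one hashref per match.
my @matches = $body->Hsearch(qr/\w+ fish/, multi => 1);
for my $m (@matches) {
  print "found '$m->{match}' at voffset $m->{voffset}\n";
}

# Scalar context: a single hashref, or undef when nothing matches.
my $none = $body->Hsearch("no such text");
print "no match\n" unless defined $none;
```

Because the scalar form croaks on multiple matches, it suits placeholders that are expected to occur at most once.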

$context->Hreplace($expr, [content], OPTIONS)

$context->Hreplace($expr, sub{...}, OPTIONS)

Search and replace. $expr is a string or qr/regex/s as with Hsearch.

In the first form, each matched substring in the virtual text is replaced with [content] and the number of matches is returned.

In the second form, the specified sub is called for each match, passing a match hashref (see Hsearch) as the only argument.

The sub must return in one of the following ways:

return(REPL_CONTINUE)
return(REPL_CONTINUE, expr => $newexpr)

  Nothing is done to the matched text; searching continues,
  optionally with a new search target.

return(REPL_SUBST_CONTINUE, [content]) or
return(REPL_SUBST_CONTINUE, [content], expr => $newexpr)

  The matched text is replaced by [content] and searching continues.

return(REPL_SUBST_STOP, [content], optRESULTS)

  The matched text is replaced with [content] and then "Hreplace"
  terminates, returning optRESULTS if provided, otherwise
  the total number of matches.

return(REPL_STOP, optRESULTS)

  "Hreplace" just terminates.

If the sub does not specify any return value(s), then Hreplace returns the number of matches.
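The callback form might be used as in the following sketch, which replaces only the first occurrence of a placeholder and leaves later ones alone. The document setup relies on ODF::lpOD's odf_new_document, odf_create_paragraph, and append_element; the placeholder text and the $seen counter are illustrative assumptions.

```perl
use strict;
use warnings;
use ODF::lpOD;
use ODF::lpOD_Helper qw/:chars :DEFAULT/;

my $doc  = odf_new_document("text");
my $body = $doc->get_body;
$body->append_element(
  odf_create_paragraph(text => "Price: {amount}; again: {amount}"));

my $seen = 0;
$body->Hreplace(qr/\{amount\}/, sub {
  my ($match) = @_;        # match hashref, as described for Hsearch
  $seen++;
  # Substitute the first occurrence, then keep searching.
  return (REPL_SUBST_CONTINUE, ["42.00"]) if $seen == 1;
  # Leave every later occurrence untouched.
  return (REPL_CONTINUE);
});

print "callback ran $seen time(s)\n";
```

Returning REPL_SUBST_STOP instead of REPL_SUBST_CONTINUE on the first match would end the search immediately after the substitution.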

Content Specification

The [content] argument is a list of zero or more elements, each of which is either

  • A text string which may include spaces, tabs and newlines, or

  • A reference to [list of format properties]

Each [list of format properties] describes an automatic character style which will be applied only to the immediately-following text string.

Format properties may be any of the key => value pairs accepted by odf_create_style, as well as these single-item abbreviations:

"center"      means  align => "center"
"left"        means  align => "left"
"right"       means  align => "right"
"bold"        means  weight => "bold"
"italic"      means  style => "italic"
"oblique"     means  style => "oblique"
"normal"      means  style => "normal", weight => "normal"
"roman"       means  style => "normal"
"small-caps"  means  variant => "small-caps"
"normal-caps" means  variant => "normal", #??

<NUM>         means  size => "<NUM>pt",   # bare number means point size
"<NUM>pt"     means  size => "<NUM>pt",
"<NUM>%"      means  size => "<NUM>%",    # only in 'common' styles!

Internally, an ODF "automatic" Style is created for each unique combination of properties, re-using styles when possible. Fonts are automatically registered.

An ODF Style which already exists (or will be created) may be used by passing a single special property list

[style-name => "name of style"]

$context->Hinsert_content([content], OPTIONS)

This works like ODF::lpOD::Element::insert_element() except the possibly-multiple segments to be inserted are described by a high-level [content] specification (as described for Hreplace).

The segment(s) actually inserted will include spans and the special ODF objects representing tabs, spaces and newlines as implied by the characters in [content].

OPTIONS may contain:

position => ...  # default is FIRST_CHILD

chomp => BOOL    # remove \n from end of content, if any

The new content is inserted at the indicated position relative to $context.

If multiple segments are inserted, the first one will be at the indicated position and the others will be immediately-following siblings of the first.

Returns nothing.

Empty elements are deleted (or not inserted).
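As a sketch of how a [content] specification combines plain strings with format property lists, the following inserts mixed content into an empty paragraph and then searches across the resulting segments. The document setup uses ODF::lpOD's odf_new_document, odf_create_paragraph, and append_element; the text itself is illustrative.

```perl
use strict;
use warnings;
use ODF::lpOD;
use ODF::lpOD_Helper qw/:chars :DEFAULT/;

my $doc  = odf_new_document("text");
my $body = $doc->get_body;
my $para = odf_create_paragraph();
$body->append_element($para);

# Each [list of format properties] applies only to the string right after it.
$para->Hinsert_content(
  [ "Plain text, then ",
    ["bold", color => "red"], "bold red text",
    "\tand a tab.\n",
  ],
  chomp => 1,     # drop the trailing newline from the content
);

# The styled text sits in a different segment from the text before it,
# yet Hsearch can match across that boundary.
my $m = $para->Hsearch(qr/then bold/);
print "matched across segments\n" if defined $m;
```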

$node->self_or_parent($tag)

Returns $node or its nearest ancestor which matches $tag (a "gi", XML::Twig's term for an element tag name).

Currently this throws an exception if neither $node nor any ancestor matches $tag.

$context->descendants_pruned($cond, $prune_cond)

Similar to XML::Twig's descendants method, but omits descendants of items which match $prune_cond. An undef condition matches all items.

For example

@nodes = $some_paragraph->descendants_pruned(undef, qr/^text:[ph]$/);

would return all the nodes below $some_paragraph including any nested paragraph or heading nodes, but excluding the contents of those nested containers (nested paragraphs can occur, for example, in Frames in an outer paragraph).

Note: The XPath subset supported by XML::Twig does not allow this kind of filtering.

$context->gen_style_name($family, SUFFIX)

$context->gen_table_name(SUFFIX)

Generate a style or table name not currently in use.

In the case of a style, the $family must be specified ("text", "table", etc.).

SUFFIX is an optional string which will be appended to a generated unique name (to make it easier for humans to recognize).

$context may be the document itself or any Element.

FUNCTIONS (not methods)

Hautomatic_style($context, $family, properties...)

Find or create an 'automatic' (i.e. functionally anonymous) style with the specified properties.

Styles are re-used when possible, so the returned style object should not be modified because it might be shared.

$family must be "text" or another supported style family name (TODO: specify)

Property list items are as described for Hreplace.

Hcommon_style($context, $family, properties...)

Create a 'common' (i.e. named by the user) style from high-level properties.

The name, which must not name an existing style, is given by name => "STYLENAME" somewhere in the property list.
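The two style functions, together with gen_style_name, might be used as in this sketch. It assumes only the signatures given above plus ODF::lpOD's odf_new_document; the "Emphasis" suffix and the chosen properties are illustrative.

```perl
use strict;
use warnings;
use ODF::lpOD;
use ODF::lpOD_Helper qw/:chars :DEFAULT/;

my $doc  = odf_new_document("text");
my $body = $doc->get_body;

# Find or create an automatic text style. Treat the result as read-only,
# because the same style object may be shared with other content.
my $auto = Hautomatic_style($body, "text", "bold", color => "red");

# Pick an unused style name with a recognizable suffix, then create a
# common (named) style using the high-level property notation.
my $name = $doc->gen_style_name("text", "Emphasis");
Hcommon_style($body, "text", name => $name, "italic", size => "12pt");

print "new common style: $name\n";
```

Generating the name first avoids a collision with an existing style, which Hcommon_style does not permit.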

hashtostring($hashref)

Returns a single string representing the keys and values of a hash.

fmt_node($node)

Format a single node for debug messages, without a final newline.

fmt_tree($top)

Format a node and all of its children (sans final newline).

fmt_match($matchhash)

Format a match hashref for debug messages (sans final newline).

HISTORY

The original ODF::lpOD_Helper was written in 2012. The code was reworked and this manual written in 2023. The API changed with version 3.000.

As of Feb 2023, ODF::lpOD is not actively maintained (last updated in 2014, v1.126), and is now unusable as-is because of the warning mentioned above. With ODF::lpOD_Helper, ODF::lpOD is once again an extremely useful tool.

AUTHOR

Jim Avera (jim.avera AT gmail dot com)

LICENSE

ODF::lpOD (v1.126) may be used under the GPL 3 or Apache 2.0 license.

ODF::lpOD_Helper is in the Public Domain (or CC0 license), but requires ODF::lpOD to function so as a practical matter use must comply with ODF::lpOD's license.