NAME
File::TTX - Utilities for dealing with TRADOS TTX files
VERSION
Version 0.04
SYNOPSIS
TRADOS has been more or less the definitive set of translation tools for over a decade; more to the point, they're the tools I use most. There are two basic modes used by TRADOS to interact with documents. The first is in Word documents, which is not addressed in this module. The second is with TagEditor, which has TTX files as its native file format. TTX files are a breed of XML, so they're actually pretty easy to work with.
    use File::TTX;
    my $foo = File::TTX->load('myfile.ttx');
    # ... do stuff with it ...
    $foo->write();
Each TTX consists of a header and body text. The header contains various information about the file that you can read and write; the text is, well, the text of the document. Before translation, the body is just plain text, but as you work, TagEditor splits it into segments, each of which is translated in isolation. (The paradigm here is that if you re-encounter a segment, or something similar to one you've already done, the translation memory will provide you with the translation, either writing it automatically if it's identical, or at least presenting it to you to speed things up if it's just similar.)
A common mode is to read things with a script, build a TTX, and write it out for translation with TagEditor. Here are the kinds of functions you'd use for that:
    use File::TTX;
    my $ttx = File::TTX->new();
    $ttx->append_text("This is a sentence.\n");
    $ttx->append_mark("test mark");
    $ttx->append_text("\n");
    $ttx->append_text("This is another sentence.\n");
    $ttx->write("my.ttx");
After translation, you can use the marks to find out where you are in the file (they'll be skipped during translation without being removed from the file).
There are two basic modes for content extraction: either you want to scan all content, or you're just interested in the segments so you can toss them into an Excel spreadsheet or something. These work pretty much the same. To scan all elements, you use content_elements as follows; it returns a list of File::TTX::Content elements, documented below, which are really just XML::Snap elements with a little extra sugar for convenience.
    use File::TTX;
    my $ttx = File::TTX->load('myfile.ttx');
    foreach my $piece ($ttx->content_elements) {
        if ($piece->type eq 'mark') {
            # something
        } else {
            print $piece->translated . "\n";
        }
    }
To do a more data-oriented extraction, you'd want the segments function, and the loop would look more like this:
    foreach my $s ($ttx->segments) {
        print $s->source . " - " . $s->translated . "\n";
    }
Clear? Sure it is.
Here's another example: a filter to strip all pre-translated content out of a TTX in case you want a new, un-pre-translated copy.
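    use File::TTX;
    my $in = $ARGV[0];
    my $outf = $in;
    # This example expects a name like foo.xls.ttx; stop if the name doesn't
    # match, so the output can never overwrite the input.
    $outf =~ s/\.xls\.ttx$/-stripped.xls.ttx/
        or die "Expected a filename ending in .xls.ttx\n";
    my $ttx = File::TTX->load($in);
    my $out = File::TTX->new(from=>$ttx);
    # Copy only the source side of each piece, leaving any translation behind.
    foreach my $piece ($ttx->content_elements) {
        $out->append_copy($piece->source_xml);
    }
    $out->write($outf);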
It should be easy to see how you can expand that filter idea into nearly anything you need.
There are still plenty of gaps in this API! I plan to extend it as I run into new use cases. I'd be overjoyed to hear about yours.
CREATING A TTX OBJECT
new()
The new function creates a blank TTX so you can build whatever you want and write it out. If you've already got an XML::Snap structure (that's the library used internally for XML representation here) then you can pass it in and it will be broken down into useful structural components for the element access functions.
load()
The load function loads an existing TTX. The object remembers where the file came from, so you don't have to give the filename again when you write it (assuming you write it, of course).
TRADOS is nice enough to sometimes provide us with TTX that is illegal XML, so load() has to read your entire file into memory and sanitize it of illegal characters before the XML parser sees it. Unfortunately, this means File::TTX works from different input than the TRADOS native tools do, but as long as your TTX isn't generated from a Word document with soft hyphens in it, you ought to be OK.
FILE MANIPULATION
write($file)
Writes a TTX out to disk; the $file can be omitted if you used load to make the object and you want the file written back to the same place.
HEADER ACCESS
Here are a bunch of functions to access and/or modify different things in the header. Pass any of them a value to set that value.
CreationTool(), CreationDate(), CreationToolVersion()
These are in the ToolSettings part of the header. Mostly you don't care about them.
SourceDocumentPath(), OEncoding(), TargetLanguage(), PlugInInfo(), SourceLanguage(), SettingsPath(), SettingsRelativePath(), DataType(), SettingsName(), TargetDefaultFont()
These are in the UserSettings part of the header. Frankly, mostly you don't care about these either, but here we're getting into the reason for this module, like writing a quick script to read or change the source and target languages of TTX files.
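The quick language script mentioned above might look something like the sketch below. The retarget helper, the script name in the usage comment, and the example language code are illustrative assumptions rather than part of File::TTX; only load, write, SourceLanguage and TargetLanguage come from this documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: read the languages recorded in the header and,
# if a new target language is given, store it in the header.
sub retarget {
    my ($ttx, $new_target) = @_;
    my $source = $ttx->SourceLanguage();    # no argument: read the value
    my $old    = $ttx->TargetLanguage();
    $ttx->TargetLanguage($new_target) if defined $new_target;   # argument: set it
    return ($source, $old);
}

# Usage (assumed script name): perl retarget.pl myfile.ttx DE-DE
if (@ARGV) {
    require File::TTX;
    my ($file, $new_target) = @ARGV;
    my $ttx = File::TTX->load($file);
    my ($source, $old) = retarget($ttx, $new_target);
    printf "%s: source %s, target was %s\n",
        $file, $source // '(none)', $old // '(none)';
    $ttx->write() if defined $new_target;   # load() remembered the filename
}
```

Because write() with no argument goes back to the file that was loaded, it is probably wise to try this on a copy first.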
copy_header ($source)
Copies the header information from another TTX into this one.
slang(), tlang()
These are quicker versions of SourceLanguage and TargetLanguage; they cache the values for repeated use (and they do get used repeatedly). The drawback is that they're slower for files without a source or target language defined, but this doesn't happen all that often. At least I hope not.
WRITING TO THE BODY
append_text($string)
Append a string to the end of the body. It's the caller's responsibility to terminate the line.
append_segment($source, $target, $match, $slang, $tlang, $origin)
Appends a segment to the body. Only $source and $target are required; $match defaults to 0, and $slang and $tlang (the source and target languages) default to the master values in the header. Note that TagEditor really doesn't like you to mix languages, but who am I to stand in your way in this matter? Finally, $origin defaults to unspecified. TagEditor sets it to "manual"; probably "Align" is another value, but I haven't verified that.
If the header doesn't actually have a source or target language, and you specify one or the other here, it will be written to the header as the default source or target language.
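As a rough sketch of how append_segment might be used, here is a script that turns a tab-separated list of source/target pairs into a pre-translated TTX. The append_pairs helper, the file names, the match value of 100 and the language codes are all placeholders chosen for illustration; the argument order follows the signature given above.

```perl
use strict;
use warnings;

# Hypothetical helper: append each [source, target] pair as one segment
# per line. $match, $slang and $tlang are passed straight through.
sub append_pairs {
    my ($ttx, $pairs, $match, $slang, $tlang) = @_;
    my $count = 0;
    for my $pair (@$pairs) {
        my ($source, $target) = @$pair;
        next unless defined $source && defined $target;   # both are required
        $ttx->append_segment($source, $target, $match, $slang, $tlang);
        $ttx->append_text("\n");    # terminating the line is our job
        $count++;
    }
    return $count;
}

# Usage (assumed script name): perl pairs2ttx.pl pairs.tsv out.ttx
if (@ARGV == 2) {
    require File::TTX;
    my ($in, $out) = @ARGV;
    open my $fh, '<:encoding(UTF-8)', $in or die "Cannot read $in: $!";
    my @pairs;
    while (my $line = <$fh>) {
        chomp $line;
        push @pairs, [ split /\t/, $line, 2 ];
    }
    close $fh;
    my $ttx = File::TTX->new();
    # 'EN-US' and 'DE-DE' are placeholder language codes.
    my $n = append_pairs($ttx, \@pairs, 100, 'EN-US', 'DE-DE');
    $ttx->write($out);
    print "Wrote $n segment(s) to $out\n";
}
```

Passing the languages explicitly means a freshly created TTX gets them written to its header, as described above.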
append_mark($string, $tag)
Appends a non-opening, non-closing tag to the body. (External style, e.g. text in Word that doesn't get translated.) This is useful for setting marks for script coordination, which is why I call it append_mark.
The default appearance is "text", but you can add $tag if you want something else.
append_open_tag($string, $tag), append_close_tag ($string, $tag)
Appends an opening or closing tag. Here, the $tag is required. (Well, it will default to 'cf' if you screw up. But don't.)
append_copy, copy_all
If you have an XML piece from another TTX, you can append a copy of it directly into this TTX. Note that the "XML piece" from the source or translated side of a segment may actually be a list (because a segment may contain tags and text). The copy_all method copies the contents of another TTX's body tag into the current TTX, and can filter along the way.
READING FROM THE BODY
Since a TTX is structured data, not just text, reading from it consists of iterating across its child elements. These elements are XML::Snap elements due to the underlying XML nature of the TTX file. I suppose some convenience functions might be a good idea, but frankly it's so easy to use the XML::Snap functions (well, I did write XML::Snap) that I haven't needed any so far. This might be a place to watch for further details.
content_elements()
Returns all the top-level content elements in a list. Depending on the structure of the TTX and the tool used to build it, this level may not include all segments (I've had segmented TTX with the segments embedded in top-level formatting elements).
segments()
Returns a list of just the segments in the body. Useful for data extraction.
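For the spreadsheet case, a minimal sketch might dump the segments as tab-separated text, which Excel can open. The segments_to_tsv helper and the script name are assumptions made for this example; it relies only on segments, source and translated.

```perl
use strict;
use warnings;

# Hypothetical helper: build one tab-separated line per segment.
sub segments_to_tsv {
    my ($ttx) = @_;
    my $tsv = '';
    for my $s ($ttx->segments) {
        my $source     = $s->source;
        my $translated = $s->translated;
        for ($source, $translated) {
            $_ = '' unless defined $_;
            s/[\t\r\n]+/ /g;    # keep each segment on a single row
        }
        $tsv .= "$source\t$translated\n";
    }
    return $tsv;
}

# Usage (assumed script name): perl ttx2tsv.pl myfile.ttx > myfile.tsv
if (@ARGV) {
    require File::TTX;
    binmode STDOUT, ':encoding(UTF-8)';
    print segments_to_tsv(File::TTX->load($ARGV[0]));
}
```

Tabs and line breaks inside a segment are flattened to spaces here, which loses a little information but keeps the rows intact.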
MISCELLANEOUS STUFF
date_now()
Formats the current time the way TTX likes it.
File::TTX::Content
This helper class wraps the XML::Snap parts returned by content_elements, providing a little more comfort when working with them.
rebless($xml)
Called on an XML::Snap element to rebless it as a File::TTX::Content element. This is a class method.
type()
Returns the type of content piece. The possible answers are 'text', 'open', 'close', 'segment', and 'mark'.
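One simple use of type is to get a quick profile of a file. The sketch below counts the top-level content pieces by type; count_types and the script name are illustrative names, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: tally the top-level content pieces by type.
sub count_types {
    my ($ttx) = @_;
    my %count;
    $count{ $_->type }++ for $ttx->content_elements;
    return \%count;
}

# Usage (assumed script name): perl ttxtypes.pl myfile.ttx
if (@ARGV) {
    require File::TTX;
    my $count = count_types(File::TTX->load($ARGV[0]));
    printf "%-8s %d\n", $_, $count->{$_} for sort keys %$count;
}
```

Since content_elements only returns top-level elements, segments nested inside formatting elements would not show up in these counts.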
tag()
Returns (or sets) the tag or mark text of a tag or mark.
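Marks written with append_mark can be found again by checking type and tag. The following sketch collects the content pieces between a named mark and the next mark; pieces_in_section and the script name are hypothetical, and the sketch assumes tag returns the same string that was given to append_mark.

```perl
use strict;
use warnings;

# Hypothetical helper: return the pieces that follow the mark named $mark,
# stopping at the next mark (or at the end of the body).
sub pieces_in_section {
    my ($ttx, $mark) = @_;
    my ($found, @pieces);
    for my $piece ($ttx->content_elements) {
        if ($piece->type eq 'mark') {
            last if $found;                               # next mark ends the section
            $found = 1 if ($piece->tag() // '') eq $mark; # our mark starts it
            next;
        }
        push @pieces, $piece if $found;
    }
    return @pieces;
}

# Usage (assumed script name): perl section.pl myfile.ttx "test mark"
if (@ARGV == 2) {
    require File::TTX;
    my ($file, $mark) = @ARGV;
    my $ttx = File::TTX->load($file);
    print $_->translated . "\n" for pieces_in_section($ttx, $mark);
}
```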
translated(), translated_xml()
Returns the translated content of a segment, or just the content for anything else. Use with care. The _xml variant returns the underlying XML object - use with even more care.
write_translated($thing)
If not called on a segment, does nothing at all. Eventually, of course, it will have to be possible to identify a text area and segment it, but this is not that function.
If called on a segment with a string, deletes whatever may be in the segment's translated half, creates an XML::Snap text object from the string, and inserts said object. If called on a segment with an XML::Snap object, insert it. If called with a list of things, inserts one after the other with the same rules.
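A minimal sketch of the string form of write_translated: fill in translations from a lookup table keyed on the exact source text. The apply_glossary helper, the glossary entries and the script name are made up for this example, and the sketch assumes that changes made through the objects returned by segments end up in the file when it is written.

```perl
use strict;
use warnings;

# Hypothetical helper: overwrite the translation of every segment whose
# source text appears in the glossary. Returns the number of segments changed.
sub apply_glossary {
    my ($ttx, $glossary) = @_;
    my $changed = 0;
    for my $s ($ttx->segments) {
        my $source = $s->source;
        next unless defined $source && exists $glossary->{$source};
        $s->write_translated($glossary->{$source});
        $changed++;
    }
    return $changed;
}

# Usage (assumed script name): perl glossary.pl in.ttx out.ttx
if (@ARGV == 2) {
    require File::TTX;
    my ($in, $out) = @ARGV;
    my $ttx = File::TTX->load($in);
    my %glossary = ('Yes' => 'Ja', 'No' => 'Nein');   # placeholder entries
    my $n = apply_glossary($ttx, \%glossary);
    $ttx->write($out);
    print "$n segment(s) updated\n";
}
```

Writing to a separate output file keeps the original untouched while you check the result in TagEditor.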
source(), source_xml()
Returns the source content of a segment, or just the content for anything else. The _xml variant returns the XML object, so you get the tag structure if it's a complex source segment.
write_source($thing)
Works just like write_translated, except that it writes to the source, which the TRADOS tools won't let you do. Use with care.
match()
Returns and/or sets the recorded match percent of a segment (or 0 if it's not a segment).
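The match value makes it easy to pick out the segments that still need a translator's attention. This sketch lists everything below a given match percentage; segments_below, the threshold of 100 and the script name are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: segments whose recorded match is below $threshold.
sub segments_below {
    my ($ttx, $threshold) = @_;
    return grep { ($_->match() || 0) < $threshold } $ttx->segments;
}

# Usage (assumed script name): perl lowmatch.pl myfile.ttx
if (@ARGV) {
    require File::TTX;
    my $ttx = File::TTX->load($ARGV[0]);
    for my $s (segments_below($ttx, 100)) {
        my $match  = $s->match() || 0;
        my $source = $s->source;
        printf "%3d%%  %s\n", $match, $source;
    }
}
```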
source_lang(), translated_lang()
Returns and/or sets the source or target language of a segment (or nothing if it's not a segment).
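Since TagEditor dislikes mixed languages, a quick check before handing a file over may be worthwhile. The sketch below flags segments whose languages differ from the header values. The helper names and the script name are hypothetical, and the sketch assumes that a segment with no language of its own returns an undefined or empty value and simply inherits the header's.

```perl
use strict;
use warnings;

# True if $value is set and differs (ignoring case) from $master.
sub differs {
    my ($value, $master) = @_;
    return 0 unless defined $value && length $value;
    return uc($value) ne uc($master // '') ? 1 : 0;
}

# Hypothetical helper: segments whose source or target language does not
# match the master values in the header.
sub odd_language_segments {
    my ($ttx) = @_;
    my $slang = $ttx->SourceLanguage();
    my $tlang = $ttx->TargetLanguage();
    return grep {
        differs($_->source_lang(), $slang) || differs($_->translated_lang(), $tlang)
    } $ttx->segments;
}

# Usage (assumed script name): perl langcheck.pl myfile.ttx
if (@ARGV) {
    require File::TTX;
    my $ttx = File::TTX->load($ARGV[0]);
    for my $s (odd_language_segments($ttx)) {
        my $source = $s->source;
        printf "%s/%s: %s\n",
            $s->source_lang() // '', $s->translated_lang() // '', $source;
    }
}
```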
Other things we'll want
XML::Snap doesn't support the full range of XML manipulation in its current incarnation, so I'll need to revisit it. I also don't need all this functionality today, but here's what the content handler should be able to do:
- Segment non-segmented text, replacing a chunk or series of chunks (in case neighboring text chunks don't cover a full segment) with a segment or a segment-plus-extra-text.
- Translate a segment, i.e. replace the translated content.
- Modify the source of a segment (just in case).
If you are actually using Perl to access TTX files and would like to do these things, then by all means drop me a line and tell me to get the lead out.
AUTHOR
Michael Roberts, <michael at vivtek.com>
BUGS
Please report any bugs or feature requests to bug-file-ttx at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-TTX. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc File::TTX
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright 2010 Michael Roberts.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.