NAME
File::TTX - Utilities for dealing with TRADOS TTX files
VERSION
Version 0.02
SYNOPSIS
TRADOS has been more or less the definitive set of translation tools for over a decade; more to the point, they're the tools I use most. TRADOS interacts with documents in two basic modes. The first works inside Word documents, which this module doesn't address. The second is TagEditor, which has TTX files as its native file format. TTX files are a breed of XML, so they're actually pretty easy to work with.
use File::TTX;
my $foo = File::TTX->load('myfile.ttx');
# ... do stuff with it ...
$foo->write();
Each TTX consists of a header and body text. The header contains various information about the file that you can read and write; the text is, well, the text of the document. Before translation, the body is just plain text, but as you work, TagEditor splits it into segments, each of which is translated in isolation. (The paradigm here is that if you re-encounter a segment, or something similar to one you've already done, the translation memory will provide you with the translation, either writing it automatically if it's identical, or at least presenting it to you to speed things up if it's just similar.)
A common mode is to read things with a script, build a TTX, and write it out for translation with TagEditor. Here are the kinds of functions you'd use for that:
use File::TTX;
my $ttx = File::TTX->new();
$ttx->append_text("This is a sentence.\n");
$ttx->append_mark("test mark");
$ttx->append_text("\n");
$ttx->append_text("This is another sentence.\n");
$ttx->write ("my.ttx");
After translation, you can use the marks to find out where you are in the file (they'll be skipped during translation without being removed from the file).
There are two basic modes for content extraction; either you want to scan all content, or you're just interested in the segments so you can toss them into an Excel spreadsheet or something. These work pretty much the same; to scan all elements, you use content_elements as follows; it returns a list of File::TTX::Content elements, documented below, which are really just XML::xmlapi elements with a little extra sugar for convenience.
use File::TTX;
my $ttx = File::TTX->load('myfile.ttx');
foreach my $piece ($ttx->content_elements) {
    if ($piece->type eq 'mark') {
        # something
    } elsif ($piece->type eq 'segment') {
        print $piece->translated . "\n";
    }
}
To do a more data-oriented extraction, you'd want the segments function, and the loop would look more like this:
foreach my $s ($ttx->segments) {
    print $s->source . " - " . $s->translated . "\n";
}
Clear? Sure it is.
There are still plenty of gaps in this API; I plan to extend it as I run into new use cases. Now that I've actually put the darned thing on CPAN, I won't lose it in the meantime, so I won't have to rewrite it. Again. This is the fourth time, if you're keeping count. (And of course, since this is the first time I've managed to upload it, you can't be keeping count.)
CREATING A TTX OBJECT
new()
The new function creates a blank TTX so you can build whatever you want and write it out. If you've already got an XML::xmlapi structure (that's the library used internally for XML representation here) then you can pass it in and it will be broken down into useful structural components for the element access functions.
load()
The load function loads an existing TTX. Said file will remember where it came from, so you don't have to give the filename again when you write it (assuming you write it, of course).
FILE MANIPULATION
write($file)
Writes a TTX out to disk; the $file can be omitted if you used load to make the object and you want to write the file back to the same place.
HEADER ACCESS
Here are a bunch of functions to access and/or modify different things in the header. Pass any of them a value to set that value.
CreationTool(), CreationDate(), CreationToolVersion()
These are in the ToolSettings part of the header. Mostly you don't care about them.
SourceDocumentPath(), OEncoding(), TargetLanguage(), PlugInInfo(), SourceLanguage(), SettingsPath(), SettingsRelativePath(), DataType(), SettingsName(), TargetDefaultFont()
These are in the UserSettings part of the header. Frankly, mostly you don't care about these either, but here we're getting into the reason for this module, like writing a quick script to read or change the source and target languages of TTX files.
slang(), tlang()
These are quicker versions of SourceLanguage and TargetLanguage; they cache the values for repeated use (and they do get used repeatedly). The drawback is that they're slower for files without a source or target language defined, but that doesn't happen all that often. At least I hope not.
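The quick language-changing script mentioned above might look roughly like the following. This is only a sketch that assumes the accessors behave as described in this section; the helper name set_languages and the language codes 'EN-US' and 'DE-DE' are illustrative assumptions, not values taken from this documentation.

```perl
use strict;
use warnings;
use File::TTX;

# Hypothetical helper: change the header languages of one TTX file in place.
# Returns the previous source and target languages (either may be undefined).
sub set_languages {
    my ($file, $source, $target) = @_;
    my $ttx = File::TTX->load($file);

    # Called without an argument, a header accessor reads the current value.
    my $old_source = $ttx->SourceLanguage();
    my $old_target = $ttx->TargetLanguage();

    # Called with an argument, it sets the value.
    $ttx->SourceLanguage($source);
    $ttx->TargetLanguage($target);

    # No filename needed: the object remembers where it was loaded from.
    $ttx->write();
    return ($old_source, $old_target);
}
```

To run it over every file named on the command line, something like set_languages($_, 'EN-US', 'DE-DE') for @ARGV; would do.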
WRITING TO THE BODY
append_text($string)
Append a string to the end of the body. It's the caller's responsibility to terminate the line.
append_segment($source, $target, $match, $slang, $tlang, $origin)
Appends a segment to the body. Only $source and $target are required; $match defaults to 0, and $slang and $tlang (the source and target languages) default to the master values in the header. Note that TagEditor really doesn't like you to mix languages, but who am I to stand in your way in this matter? Finally, $origin defaults to unspecified. TagEditor sets it to "manual"; probably "Align" is another value, but I haven't verified that.
If the header doesn't actually have a source or target language, and you specify one or the other here, it will be written to the header as the default source or target language.
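As a sketch of how this fits together, here is one way to build a pre-translated TTX from a list of source/target pairs. The sample sentences, the language codes 'EN-US' and 'DE-DE', the use of 100 to mean an exact match, and the output filename are all assumptions made for illustration.

```perl
use strict;
use warnings;
use File::TTX;

# Hypothetical sample data: source/target pairs from some earlier job.
my @pairs = (
    [ 'Good morning.', 'Guten Morgen.' ],
    [ 'Thank you.',    'Danke.'        ],
);

my $ttx = File::TTX->new();
for my $pair (@pairs) {
    my ($source, $target) = @$pair;

    # Match percent 100, explicit languages, $origin left unspecified.
    # Since a new TTX has no languages in its header yet, these should
    # also end up there as the defaults, per the note above.
    $ttx->append_segment($source, $target, 100, 'EN-US', 'DE-DE');

    # Segments don't terminate the line; that's still the caller's job.
    $ttx->append_text("\n");
}

$ttx->write('pretranslated.ttx');   # hypothetical output filename
```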
append_mark($string, $tag)
Appends a non-opening, non-closing tag to the body. (External style, e.g. text in Word that doesn't get translated.) This is useful for setting marks for script coordination, which is why I call it append_mark.
The default appearance is "text", but you can add $tag if you want something else.
append_open_tag($string, $tag), append_close_tag($string, $tag)
Appends an opening or closing tag. Here, the $tag is required. (Well, it will default to 'cf' if you screw up. But don't.)
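A small sketch of wrapping a run of text in a tag pair follows. Reading $string as the underlying markup and $tag as the short label TagEditor displays is my assumption, as are the markup strings and the output filename; check the result in TagEditor before relying on it.

```perl
use strict;
use warnings;
use File::TTX;

my $ttx = File::TTX->new();

$ttx->append_text("This is ");

# Assumed reading: first argument is the markup, second is the tag label.
$ttx->append_open_tag('<cf bold="on">', 'cf');
$ttx->append_text("important");
$ttx->append_close_tag('</cf>', 'cf');

$ttx->append_text(".\n");

$ttx->write('tagged.ttx');   # hypothetical output filename
```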
READING FROM THE BODY
Since a TTX is structured data, not just text, reading from it consists of iterating across its child elements. These elements are XML::xmlapi elements due to the underlying XML nature of the TTX file. I suppose some convenience functions might be a good idea, but frankly it's so easy to use the XML::xmlapi functions (well, I did write XML::xmlapi) that I haven't needed any so far. This might be a place to watch for further details.
content_elements()
Returns all the content elements in a list. Text may be broken up into multiple chunks, depending on how it was added.
segments()
Returns a list of just the segments in the body. Useful for data extraction.
MISCELLANEOUS STUFF
date_now()
Formats the current time the way TTX likes it.
File::TTX::Content
This helper class wraps the XML::xmlapi parts returned by content_elements, providing a little more comfort when working with them.
rebless($xml)
Called on an XML::xmlapi element to rebless it as a File::TTX::Content element. This is a class method.
type()
Returns the type of content piece. The possible answers are 'text', 'open', 'close', 'segment', and 'mark'.
tag()
Returns (or sets) the tag or mark text of a tag or mark.
translated()
Returns the translated content of a segment, or just the content for anything else. Use with care.
source()
Returns the source content of a segment, or just the content for anything else.
match()
Returns and/or sets the recorded match percent of a segment (or 0 if it's not a segment).
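Putting source, translated and match together, a spreadsheet-friendly export could be sketched like this. The helper name segment_rows and the tab-separated layout are my own choices for illustration, not part of the module.

```perl
use strict;
use warnings;
use File::TTX;

# Hypothetical helper: return one tab-separated row per segment,
# in the order source, translation, match percent.
sub segment_rows {
    my ($ttx) = @_;
    my @rows;
    for my $s ($ttx->segments) {
        my @fields = ($s->source, $s->translated, $s->match);

        # Flatten tabs and line breaks so each segment stays on one row.
        s/[\t\r\n]+/ /g for @fields;

        push @rows, join("\t", @fields);
    }
    return @rows;
}
```

Used as print "$_\n" for segment_rows(File::TTX->load('myfile.ttx')); this should give output that can be pasted or imported into a spreadsheet.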
Other things we'll want
XML::xmlapi doesn't support the full range of XML manipulation in its current incarnation, so I'll need to revisit it. I also don't need all of this functionality today, but here's what the content handler should be able to do:
- Segment non-segmented text, replacing a chunk or series of chunks (in case neighboring text chunks don't cover a full segment) with a segment or a segment-plus-extra-text.
- Translate a segment, i.e. replace the translated content.
- Modify the source of a segment (just in case).
- See and set the source and target languages of a segment.
If you are actually using Perl to access TTX files and would like to do these things, then by all means drop me a line and tell me to get the lead out.
AUTHOR
Michael Roberts, <michael at vivtek.com>
BUGS
Please report any bugs or feature requests to bug-file-ttx at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-TTX. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc File::TTX
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright 2010 Michael Roberts.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.