NAME
Unicode::Truncate - Unicode-aware efficient string truncation
SYNOPSIS
use utf8;
use Unicode::Truncate;
truncate_egc("hello world", 7);
## returns "hell…";
truncate_egc("hello world", 7, '');
## returns "hello w"
truncate_egc('深圳', 7);
## returns "深…"
truncate_egc("née Jones", 5)'
## returns "n…" (not "ne…", even in NFD)
truncate_egc("\xff", 10)
## throws exception:
## "input string not valid UTF-8 (detected at byte offset 0 in truncate_egc)"
my $str = "hello world";
truncate_egc_inplace($str, 8)
## $str is now "hello…";
DESCRIPTION
This module is for truncating UTF-8 encoded Unicode text to particular byte lengths while inflicting the least amount of data corruption possible. The resulting truncated string will be no longer than your specified number of bytes (after UTF-8 encoding).
All truncated strings will continue to be valid UTF-8: it won't cut in the middle of a UTF-8 encoded code-point. Furthermore, if your text contains combining diacritical marks, this module will not cut in between a diacritical mark and the base character. It will in general try to preserve what users perceive as whole characters, with as little as possible mutilation at the truncation site.
The truncate_egc
function truncates only between extended grapheme clusters (as defined by Unicode TR29 version 7.0.0).
The truncate_egc_inplace
function is identical to truncate_egc
except that the input string will be modified so that no copying occurs. If you pass in a read-only value it will throw an exception.
Eventually I'd like to support other boundaries such as words and sentences. Those functions will be named truncate_word
and so on.
RATIONALE
Of course in a perfect world we would only need to worry about the amount of space some text takes up on the screen, in the real world we often have to or want to make sure things fit within certain byte size capacity limits. Many databases, network protocols, and file-formats require honouring byte-length restrictions. Even if they automatically truncate for you, are they doing it properly and consistently? On many file-systems, file and directory names are subject to byte-size limits. Many APIs that use C structs have fixed limits as well. You may even wish to do things like guarantee that a collection of news headlines will fit in a single ethernet packet.
I knew I had to write this module after I asked Tom Christiansen about the best way to truncate unicode to fit in fixed-byte fields and he got angry and told me to never do that. :)
Why not just use substr
on a string before UTF-8 encoding it? The main problem with that is the number of bytes that an encoded string will consume is not known until after you encode it. It depends on how many "high" code-points are in the string, how "high" those code-points are, the normalisation form chosen, and (relatedly) how many combining marks are used. Even with perl unicode strings (ie before encoding), using substr
will cut in front of combining marks.
Truncating post-encoding may result in invalid UTF-8 partials at the end of your string, as well as cutting in front of combining marks.
One interesting aspect of unicode's combining marks is that there is no specified limit to the number of combining marks that can be applied. So in some interpretations a single character/grapheme/whatever can take up an arbitrarily large number of bytes. However, there are various recommendations such as the Unicode UAX15-D3 "stream-safe" limit of 30. Reportedly the largest known "legitimate" use is a 1 base + 8 combining marks grapheme used in a Tibetan script.
ELLIPSIS
When a string is truncated, truncate_egc
indicates this by appending an ellipsis. The length of the truncated content including the ellipsis is guaranteed to be no greater than the byte size limit you specified.
By default the ellipsis is the character U+2026 (…) however you can use any other string by passing it in as the third argument. The ellipsis string must not contain invalid UTF-8 (it can be encoded or can contain perl high-code points, up to you). Note the default ellipsis consumes 3 bytes in UTF-8 encoding which is the same as 3 periods in a row.
IMPLEMENTATION
This module uses the ragel state machine compiler to parse/validate UTF-8 and to determine the presence of combining characters. Ragel is nice because we can determine the truncation location with a single pass through the data in an optimised C loop.
One of the requirements of this module was to additionally validate UTF-8 encoding. This is so you can run it against strings with or without having decoded them with Encode::decode
first. This module will throw exceptions if the strings to be truncated aren't UTF-8. This property lets us minimise the amount of times a user-supplied string is "decoded". With this module, you can accept an arbitrary string from a web request (say), validate that it is UTF-8, truncate it if necessary, and write it out to a DB, all with only a single pass over the data.
As mentioned, this module will not scan further than it needs to in order to determine the truncation location. So creating a short truncation of a really long string doesn't require traversing the entire string. However, this module won't validate that the bytes beyond its truncation location are valid UTF-8.
Another purpose of this module is to be a "proof of concept" for Inline::Module::LeanDist and Inline::Filters::Ragel. This distribution concept was of course heavily inspired by Inline::Module.
SEE ALSO
Although efficient, as discussed above, substr
will not be able to give you a guaranteed byte-length output (if done pre-encoding) and will corrupt text (pre or post-encoding).
There are several similar modules such as Text::Truncate, String::Truncate, Text::Elide but they are all essentially wrappers around substr
and are subject to its limitations.
A reasonable "99%" solution is to encode your string as UTF-8, truncate at the byte-level with substr
, decode with Encode::FB_QUIET
, and then re-encode it to UTF-8. This will ensure that the output is always valid UTF-8, but will still risk corrupting unicode text that contains combining marks.
Ricardo Signes suggested an algorithm using Unicode::GCString which would also be correct but likely less efficient.
It may be possible to use the regexp engine's \X
combined with (?{})
in some way but I haven't been able to figure that out.
BUGS
Of course I can't test this module on all the writing systems of the world so I don't know the severity of the corruption in all situations. It's possible that the corruption can be minimised in additional ways without sacrificing the simplicity or efficiency of the algorithm. If you have any ideas please let me know and I'll try to incorporate them.
Eventually I'd like to truncate on other boundaries specified by unicode, such as word, sentence, and line.
It would be nice to be able to apply an EGC limit such as 30.
This module doesn't handle the UTF-16 surrogate range in the grapheme properties files because Encode::encode
isn't encoding them the way I'd need them to. That's OK because these aren't valid UTF-8 anyway.
Perl internally supports characters outside what is officially unicode. This module only works with the official UTF-8 range so if you are using this perl extension (perhaps for some sort of non-unicode sentinel value) this module will throw an exception indicating invalid UTF-8 encoding (which is more of a feature than a bug given this module's primary purpose of validating and truncating untrusted, user-provided text).
AUTHOR
Doug Hoyte, <doug@hcsw.org>
COPYRIGHT & LICENSE
Copyright 2014-2017 Doug Hoyte.
This module is licensed under the same terms as perl itself.