NAME

Unicode::LineBreak - UAX #14 Unicode Line Breaking Algorithm

SYNOPSIS

use Unicode::LineBreak;
$lb = Unicode::LineBreak->new();
$broken = $lb->break($string);

DESCRIPTION

Unicode::LineBreak performs the Line Breaking Algorithm described in Unicode Standard Annex #14 [UAX #14]. The East_Asian_Width informative property defined by Annex #11 [UAX #11] is also taken into account to determine breaking positions.

Terminology

The following terms are used for convenience.

A mandatory break is an obligatory line break defined by the core rules and performed regardless of surrounding characters. An arbitrary break is a line break that the core rules allow and that the user may choose to perform. Arbitrary breaks include the direct breaks and indirect breaks defined by [UAX #14].

Alphabetic characters are characters between which line breaks are usually not allowed, unless other characters provide break opportunities. Ideographic characters are characters that usually allow line breaks both before and after themselves. [UAX #14] classifies most alphabetic characters as AL and most ideographic characters as ID (these terms are inaccurate from the point of view of grammatology). In several scripts, breaking positions cannot be determined from individual characters, so a dictionary-based heuristic is used.

The number of columns of a string is not always equal to the number of characters it contains: each character is either wide, narrow or nonspacing, occupying 2, 1 or 0 columns, respectively. Several characters may be either wide or narrow depending on the context in which they are used. Characters may be given other widths by customization.
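
The column arithmetic described above can be approximated with core Perl alone. The following is only a sketch and not this module's implementation: the helper name approx_columns is hypothetical, and it ignores context-dependent (ambiguous) widths and any customization.

```perl
use strict;
use warnings;

# Hypothetical helper: rough column count of a character string.
# Wide (W) and fullwidth (F) characters count as 2 columns,
# nonspacing, control and format characters as 0, all others as 1.
sub approx_columns {
    my ($str) = @_;
    my $cols = 0;
    for my $ch (split //, $str) {
        if ($ch =~ /[\p{Mn}\p{Me}\p{Cc}\p{Cf}\p{Zl}\p{Zp}]/) {
            next;                    # nonspacing: 0 columns
        } elsif ($ch =~ /[\p{Ea=W}\p{Ea=F}]/) {
            $cols += 2;              # wide: 2 columns
        } else {
            $cols += 1;              # narrow: 1 column
        }
    }
    return $cols;
}

print approx_columns("abc"), "\n";                       # 3 characters, 3 columns
print approx_columns("\x{65E5}\x{672C}\x{8A9E}"), "\n";  # 3 characters, 6 columns
print approx_columns("e\x{301}"), "\n";                  # 2 characters, 1 column
```

The second string is three CJK ideographs and the third is a letter followed by a combining acute accent, which is why their column counts differ from their character counts.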

PUBLIC INTERFACE

Line Breaking

new ([KEY => VALUE, ...])

Constructor. For the KEY => VALUE pairs, see "Options".

$self->break (STRING)

Instance method. Breaks the Unicode string STRING and returns the result.

$self->break_partial (STRING)

Instance method. Same as break() but accepts incremental input. Pass undef as the STRING argument to indicate that the input is complete.
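
A minimal sketch of incremental use, assuming the text arrives in pieces (the sample chunks are made up) and that each call returns whatever output is ready so far:

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new(ColumnsMax => 20);

# Sample data: input arriving piece by piece, e.g. from a stream.
my @chunks = (
    "The quick brown fox jumps ",
    "over the lazy dog and ",
    "keeps on running.",
);

my $output = '';
$output .= $lb->break_partial($_) for @chunks;
$output .= $lb->break_partial(undef);   # undef: input is complete

print $output, "\n";
```

The final call with undef matters: without it, the text still held back by the object would never be returned.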

$self->config (KEY)
$self->config (KEY => VALUE, ...)

Instance method. Gets or updates the configuration. For the KEY => VALUE pairs, see "Options".
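
As a short illustration of both call forms, using only option names listed under "Options":

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new();

# One argument: read the current value of an option.
my $before = $lb->config('ColumnsMax');

# KEY => VALUE pairs: update one or more options.
$lb->config(ColumnsMax => 40, Format => "TRIM");

print "ColumnsMax was $before, now ", $lb->config('ColumnsMax'), "\n";
```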

$self->copy

Copy constructor. Creates a copy of the object instance.

Getting Information

context ([Charset => CHARSET], [Language => LANGUAGE])

Function. Gets the language/region context used by character set CHARSET or language LANGUAGE.

$self->eawidth (STRING)

Instance method. Gets the East_Asian_Width property of the first character of the Unicode string STRING. See "Constants" for the returned values. EA_Z means a nonspacing (zero-width) character. The property value A (ambiguous) is resolved to another appropriate value.

$self->lbclass (STRING)

Instance method. Gets the line breaking property (class) of the first character of the Unicode string STRING. See "Constants" for the returned values. Classes AI, SA, SG and XX are resolved to other appropriate classes. However, when word segmentation for South East Asian writing systems is enabled, characters of supported scripts (currently Thai only) are kept as SA.

$self->lbrule (BEFORE, AFTER)

Instance method. Gets the line breaking action between class BEFORE and class AFTER. See "Constants" for the returned values.

Note: This method might not give an appropriate value for classes BK, CM, CR, LF, NL and SP, and won't give a meaningful value for classes AI, SA, SG and XX.

$self->strsize (LEN, PRE, SPC, STR)
$self->strsize (LEN, PRE, SPC, STR, MAX)

Instance method. When MAX is not specified, calculates the number of columns of the Unicode string PRE.SPC.STR based on the character widths defined by [UAX #11]. When a positive value MAX is specified, returns the number of characters of SUBSTR, the longest substring of STR for which the number of columns of PRE.SPC.SUBSTR does not exceed MAX.
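
A brief sketch of both call forms. It passes 0 and empty strings for LEN, PRE and SPC, as the sizing example in "Calculating String Size" does; the sample string is made up.

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb  = Unicode::LineBreak->new();
my $str = "Hello, world";

# Without MAX: number of columns occupied by the string.
my $cols = $lb->strsize(0, '', '', $str);

# With MAX: number of characters of $str that fit within 5 columns.
my $chars = $lb->strsize(0, '', '', $str, 5);

print "columns: $cols, characters fitting in 5 columns: $chars\n";
```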

Options

The "new" and "config" methods accept the following pairs.

CharactersMax => NUMBER

Maximum possible number of characters in one line, not counting trailing SPACEs and the newline sequence. Note that the number of characters generally doesn't represent the length of a line. Default is 998. Should not be 0.

ColumnsMin => NUMBER

Minimum number of columns that an arbitrarily broken line may include, not counting trailing SPACEs and the newline sequence. Default is 0.

ColumnsMax => NUMBER

Maximum number of columns a line may include, not counting trailing SPACEs and the newline sequence; in other words, the maximum length of a line. Default is 76.

See also "UrgentBreaking" option and "User-Defined Breaking Behaviors".

Context => CONTEXT

Specify language/region context. Currently available contexts are "EASTASIAN" and "NONEASTASIAN". Default context is "NONEASTASIAN".

Format => METHOD

Specify the method to format broken lines.

"DEFAULT"

Default method. Only inserts a newline at arbitrary breaking positions.

"NEWLINE"

Inserts newline sequences, or replaces existing ones, using the sequence specified by the "Newline" option, and removes SPACEs preceding newline sequences or end-of-text. Then appends a newline at end of text if there is none.

"TRIM"

Inserts a newline at arbitrary breaking positions. Removes SPACEs preceding newline sequences.

Subroutine reference

See "Formatting Lines".

HangulAsAL => "YES" | "NO"

Treat hangul syllables and conjoining jamos as alphabetic characters (AL). Default is "NO".

LegacyCM => "YES" | "NO"

Treat a combining character preceded by a SPACE as an isolated combining character (ID). As of Unicode 5.0, such use of SPACE is not recommended. Default is "YES".

Newline => STRING

Unicode string to be used for newline sequence. Default is "\n".
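
As an illustration of how "Format" and "Newline" combine, the following sketch folds a sample sentence (made up for this example) at 30 columns and ends every line with CRLF:

```perl
use strict;
use warnings;
use Unicode::LineBreak;

my $lb = Unicode::LineBreak->new(
    ColumnsMax => 30,
    Format     => "NEWLINE",   # normalize newlines, drop trailing SPACEs
    Newline    => "\r\n",      # sequence used for every line ending
);

my $output = $lb->break(
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.   ");

print $output;
```

According to the description of "NEWLINE", the trailing SPACEs of the input are removed and a final newline sequence is appended.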

SizingMethod => METHOD

Specify the method to calculate the size of a string. The following options are available.

"DEFAULT"

Default method. strsize() will be used.

Subroutine reference

See "Calculating String Size".

See also "TailorEA" option.

TailorEA => [ ORD => CLASS, ... ]

Tailor the classification of the East_Asian_Width property. ORD is the UCS scalar value of a character or an array reference of such values. CLASS is one of the East_Asian_Width property values N, Na, W, F and H (see "Constants").

By default, no tailoring is applied. See also "Tailoring Character Properties".

TailorLB => [ ORD => CLASS, ... ]

Tailor the classification of the line breaking property. ORD is the UCS scalar value of a character or an array reference of such values. CLASS is one of the line breaking classes (see "Constants").

By default, no tailoring is applied. See also "Tailoring Character Properties".

UrgentBreaking => METHOD

Specify the method to handle lines that exceed the limits. The following options are available.

"CROAK"

Print an error message and die.

"FORCE"

Force a break within the excessive fragment.

"NONBREAK"

Default method. Does not break the excessive fragment.

Subroutine reference

See "User-Defined Breaking Behaviors".

UserBreaking => [METHOD, ...]

Specify user-defined line breaking behavior(s). The following methods are available.

"NONBREAKURI"

Do not break URIs.

"BREAKURI"

Break URIs according to a rule suitable for printed materials. For more details see [CMOS], sections 6.17 and 7.11.

[ REGEX, SUBREF ]

Sequences matching the regular expression REGEX are broken by the subroutine referenced by SUBREF. For more details see "User-Defined Breaking Behaviors".

Constants

EA_Na, EA_N, EA_A, EA_W, EA_H, EA_F, EA_Z

Index values to specify the 6 East_Asian_Width property values defined by [UAX #11], and EA_Z to specify nonspacing.

LB_BK, LB_CR, LB_LF, LB_NL, LB_SP, LB_OP, LB_CL, LB_CP, LB_QU, LB_GL, LB_NS, LB_EX, LB_SY, LB_IS, LB_PR, LB_PO, LB_NU, LB_AL, LB_ID, LB_IN, LB_HY, LB_BA, LB_BB, LB_B2, LB_CB, LB_ZW, LB_CM, LB_WJ, LB_H2, LB_H3, LB_JL, LB_JV, LB_JT, LB_SG, LB_AI, LB_SA, LB_XX

Index values to specify the 37 line breaking properties (classes) defined by [UAX #14].

Note: Property value CP was introduced by Unicode 5.2.0.

MANDATORY, DIRECT, INDIRECT, PROHIBITED

4 values to specify line breaking behaviors: mandatory break; both direct break and indirect break are allowed; indirect break is allowed but direct break is prohibited; break is prohibited.

Unicode::LineBreak::SouthEastAsian::supported

Flag to determine whether word segmentation for South East Asian writing systems is enabled. If this feature is enabled, the flag is set to a non-empty string. Otherwise, it is undef.

N.B.: The current release supports only the Thai script of the modern Thai language.

UNICODE_VERSION

A string specifying the version of the Unicode standard this module refers to.

CUSTOMIZATION

Formatting Lines

If you specify a subroutine reference as the value of the "Format" option, it should accept three arguments:

MODIFIED = &subroutine(SELF, EVENT, STR);

SELF is a Unicode::LineBreak object, EVENT is a string identifying the context in which the subroutine was called, and STR is a fragment of the Unicode string preceding or following the breaking position.

EVENT |When Fired           |Value of STR
-----------------------------------------------------------------
"sot" |Beginning of text    |Fragment of first line
"sop" |After mandatory break|Fragment of next line
"sol" |After arbitrary break|Fragment of continued line
""    |Just before any      |Complete line without trailing
      |breaks               |SPACEs
"eol" |Arbitrary break      |SPACEs preceding breaking position
"eop" |Mandatory break      |Newline and its preceding SPACEs
"eot" |End of text          |SPACEs (and newline) at end of
      |                     |text
-----------------------------------------------------------------

The subroutine should return the modified text fragment, or may return undef to indicate that no modification occurred. Note that a modification in the context of "sot", "sop" or "sol" may affect the decision of subsequent breaking positions, while a modification in the other contexts won't.

Note: As of release 1.003, string arguments are Unicode::GCString objects. See "CAVEAT" in Unicode::GCString.

For example, the following code folds lines, removing trailing spaces:

sub fmt {
    if ($_[1] =~ /^eo/) {
        return "\n";
    }
    return undef;
}
my $lb = Unicode::LineBreak->new(Format => \&fmt);
$output = $lb->break($text);

User-Defined Breaking Behaviors

When a line generated by an arbitrary break is expected to violate the limits given by CharactersMax, ColumnsMin or ColumnsMax, an urgent break may be performed on the subsequent string. If you specify a subroutine reference as the value of the "UrgentBreaking" option, it should accept five arguments:

BROKEN = &subroutine(SELF, LEN, PRE, SPC, STR);

SELF is a Unicode::LineBreak object, LEN is the size of the preceding string, PRE is the preceding Unicode string, SPC is additional SPACEs and STR is the Unicode string to be broken.

The subroutine should return an array of the fragments into which STR is broken.

Note: As of release 1.003, string arguments are Unicode::GCString objects. See "CAVEAT" in Unicode::GCString.

For example, the following code inserts hyphens into the names of several chemical substances (such as Titin) so that they may be folded:

sub hyphenize {
    return map {$_ =~ s/yl$/yl-/; $_} split /(\w+?yl(?=\w))/, $_[4];
}
my $lb = Unicode::LineBreak->new(UrgentBreaking => \&hyphenize);
$output = $lb->break("Methionylthreonylthreonylglutaminylarginyl...");
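
To see what such a subroutine hands back, the same split-and-map logic can be run on a plain string outside the module. This is only an illustration: the helper name hyphenize_demo and the sample word are made up, and the empty fields that split produces between adjacent matches are dropped for readability.

```perl
use strict;
use warnings;

# Hypothetical stand-alone variant of the hyphenation logic above.
sub hyphenize_demo {
    my ($str) = @_;
    return grep { length }                              # drop empty fields
           map  { my $s = $_; $s =~ s/yl$/yl-/; $s }    # "...yl" -> "...yl-"
           split /(\w+?yl(?=\w))/, $str;                # cut after each "yl"
}

print join(' ', hyphenize_demo("Methionylthreonylglutamine")), "\n";
# prints "Methionyl- threonyl- glutamine"
```

Each returned fragment is a candidate line fragment, so a break may fall after any of the inserted hyphens.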

If you specify a [REGEX, SUBREF] array reference as an item of the "UserBreaking" option, the subroutine should accept two arguments:

BROKEN = &subroutine(SELF, STR);

SELF is a Unicode::LineBreak object and STR is the Unicode string matched by REGEX.

The subroutine should return an array of the fragments into which STR is broken.

For example, the following code breaks HTTP URLs using the [CMOS] rule:

my $url = qr{http://[\x21-\x7E]+}i;
sub breakurl {
    my $self = shift;
    my $str = shift;
    return split m{(?<=[/]) (?=[^/]) |
                   (?<=[^-.]) (?=[-~.,_?\#%=&]) |
                   (?<=[=&]) (?=.)}x, $str;
}
my $lb = Unicode::LineBreak->new(UserBreaking => [$url, \&breakurl]);
$output = $lb->break($string);
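
The splitting rule can also be tried on its own, without the module. In the sketch below the helper name breakurl_demo and the sample URL are made up; the pattern is the one used in the example above.

```perl
use strict;
use warnings;

# Hypothetical stand-alone variant of breakurl(): takes only the string.
sub breakurl_demo {
    my ($str) = @_;
    return split m{(?<=[/]) (?=[^/]) |
                   (?<=[^-.]) (?=[-~.,_?\#%=&]) |
                   (?<=[=&]) (?=.)}x, $str;
}

my $url   = "http://example.com/path?q=1";
my @parts = breakurl_demo($url);
print join(' | ', @parts), "\n";
# prints "http:// | example | .com/ | path | ?q | = | 1"
```

Because every alternative in the pattern is zero-width, no characters are consumed: joining the fragments always gives back the original URL.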

Calculating String Size

If you specify a subroutine reference as the value of the "SizingMethod" option, it will be called with five or six arguments:

COLS = &subroutine(SELF, LEN, PRE, SPC, STR);

CHARS = &subroutine(SELF, LEN, PRE, SPC, STR, MAX);

SELF is a Unicode::LineBreak object, LEN is the size of the preceding string, PRE is the preceding Unicode string, SPC is additional SPACEs and STR is the Unicode string to be processed.

In the first form, the subroutine should return the calculated number of columns of PRE.SPC.STR. The number of columns need not be an integer: the unit may be freely chosen, but it should be the same as that of the "ColumnsMin" and "ColumnsMax" options.

In the second form, the subroutine should return the maximum number of Unicode characters of a substring SUBSTR of STR for which the number of columns of PRE.SPC.SUBSTR does not exceed MAX. This form is used when the "UrgentBreaking" option is set to "FORCE". If you don't wish to implement the second form, return undef.

Note: As of release 1.003, string arguments are Unicode::GCString objects. See "CAVEAT" in Unicode::GCString.

For example, the following code processes lines with tab stops at every eight columns:

    sub tabbedsizing {
        my ($self, $cols, $pre, $spc, $str, $max) = @_;
        # The second (MAX) form is not implemented.
        return undef if $max;

        my $spcstr = $spc.$str;
        # Consume leading runs of SPACEs followed by TABs.
        while ($spcstr =~ s/^( *)(\t+)//) {
            $cols += length($1);
            # Advance to the tab stop reached by this run of TABs.
            $cols += length($2) * 8 - $cols % 8;
        }
        # Size the remainder with the default method.
        $cols += $self->strsize(0, '', '', $spcstr);
        return $cols;
    }
    my $lb = Unicode::LineBreak->new(TailorLB => [ord("\t") => LB_SP],
                                     SizingMethod => \&tabbedsizing);
    $output = $lb->break($string);
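
The tab-stop arithmetic used in the loop above can be checked in isolation. The helper name after_tabs is hypothetical; it only restates the expression from the example.

```perl
use strict;
use warnings;

# Hypothetical helper: column reached after $ntabs TABs that start at
# column $cols, with tab stops at every 8 columns.
sub after_tabs {
    my ($cols, $ntabs) = @_;
    return $cols + $ntabs * 8 - $cols % 8;
}

print after_tabs(0, 1), "\n";   # 8
print after_tabs(3, 1), "\n";   # 8
print after_tabs(8, 1), "\n";   # 16
print after_tabs(3, 2), "\n";   # 16
```

The first TAB moves to the next multiple of 8 and each further TAB adds a full 8 columns, which is what subtracting $cols % 8 once achieves.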

Tailoring Character Properties

Character properties may be tailored by the "TailorLB" and "TailorEA" options. Some constants are defined for convenience of tailoring.

Line Breaking Properties

By default, several hiragana, katakana and characters corresponding to kana are treated as nonstarters (NS). When the following pair(s) are specified as the value of the "TailorLB" option, these characters are treated as normal ideographic characters (ID).

KANA_NONSTARTERS() => LB_ID

All of the characters below.

IDEOGRAPHIC_ITERATION_MARKS() => LB_ID

Ideographic iteration marks. U+3005 IDEOGRAPHIC ITERATION MARK, U+303B VERTICAL IDEOGRAPHIC ITERATION MARK, U+309D HIRAGANA ITERATION MARK, U+309E HIRAGANA VOICED ITERATION MARK, U+30FD KATAKANA ITERATION MARK and U+30FE KATAKANA VOICED ITERATION MARK.

N.B. Some of them are neither hiragana nor katakana.

KANA_SMALL_LETTERS() => LB_ID
KANA_PROLONGED_SOUND_MARKS() => LB_ID

Hiragana or katakana small letters: Hiragana small letters U+3041 A, U+3043 I, U+3045 U, U+3047 E, U+3049 O, U+3063 TU, U+3083 YA, U+3085 YU, U+3087 YO, U+308E WA, U+3095 KA, U+3096 KE. Katakana small letters U+30A1 A, U+30A3 I, U+30A5 U, U+30A7 E, U+30A9 O, U+30C3 TU, U+30E3 YA, U+30E5 YU, U+30E7 YO, U+30EE WA, U+30F5 KA, U+30F6 KE. Katakana phonetic extensions U+31F0 KU - U+31FF RO. Halfwidth katakana small letters U+FF67 A - U+FF6F TU.

Hiragana or katakana prolonged sound marks: U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK and U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK.

N.B. These letters are optionally treated either as nonstarters or as normal ideographic characters. See [JIS X 4051] 6.1.1.

N.B. U+3095, U+3096, U+30F5, U+30F6 are considered to be neither hiragana nor katakana.

MASU_MARK() => LB_ID

U+303C MASU MARK.

N.B. Although this character is not kana, it is usually regarded as an abbreviation of the sequence of hiragana "ます" or katakana "マス", MA and SU.

N.B. This character is classified as nonstarter (NS) by [UAX #14] and as Class 13 (corresponding to ID) by [JIS X 4051].

East_Asian_Width Properties

Some particular letters of the Latin, Greek and Cyrillic scripts have the ambiguous (A) East_Asian_Width property. Thus, these characters are treated as wide in the "EASTASIAN" context. By specifying TailorEA => [ AMBIGUOUS_*() => EA_N ], those characters are always treated as narrow.

AMBIGUOUS_ALPHABETICS() => EA_N

Treat all of the characters below as East_Asian_Width neutral (N).

AMBIGUOUS_CYRILLIC() => EA_N
AMBIGUOUS_GREEK() => EA_N
AMBIGUOUS_LATIN() => EA_N

Treat letters of the Cyrillic, Greek and Latin scripts having ambiguous (A) width as neutral (N).

On the other hand, although several characters have occasionally been rendered as wide characters by a number of implementations for East Asian character sets, they are given the narrow (Na) East_Asian_Width property just because they have fullwidth (F) compatibility characters. By specifying TailorEA as below, those characters are treated as ambiguous, that is, wide in the "EASTASIAN" context.

QUESTIONABLE_NARROW_SIGNS() => EA_A

U+00A2 CENT SIGN, U+00A3 POUND SIGN, U+00A5 YEN SIGN (or yuan sign), U+00A6 BROKEN BAR, U+00AC NOT SIGN, U+00AF MACRON.

Configuration File

Built-in defaults of option parameters for the "new" and "config" methods can be overridden by a configuration file: Unicode/LineBreak/Defaults.pm. For more details, read Unicode/LineBreak/Defaults.pm.sample.

BUGS

Please report bugs or buggy behaviors to the developer.

CPAN Request Tracker: http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-LineBreak.

VERSION

See Unicode::LineBreak::Version.

Development versions of this module may be found at http://hatuka.nezumi.nu/repos/Unicode-LineBreak/.

Conformance to Standards

The character properties this module is based on are defined by the Unicode Standard version 5.2.0.

This module is intended to implement UAX14-C2.

  • Some ideographic characters may be treated either as NS or as ID by choice.

  • Hangul syllables and conjoining jamos may be treated as either ID or AL by choice.

  • Characters assigned to AI may be resolved to either AL or ID by choice.

  • Character(s) assigned to CB are not resolved.

  • When word segmentation for South East Asian writing systems is not supported, characters assigned to SA are resolved to AL, except that characters that have the Grapheme_Cluster_Break property value Extend or SpacingMark are resolved to CM.

  • Characters assigned to SG or XX are resolved to AL.

  • Code points in the following UCS ranges are given fixed property values even if they have not been assigned any characters.

    Ranges             | lbclass()  | eawidth()  | Description
    -------------------------------------------------------------
    U+3400..U+4DBF     | ID         | W          | CJK ideographs
    U+4E00..U+9FFF     | ID         | W          | CJK ideographs
    U+D800..U+DFFF     | AL (SG)    | N          | Surrogates
    U+E000..U+F8FF     | AL (XX)    | F or N (A) | Private use
    U+F900..U+FAFF     | ID         | W          | CJK ideographs
    U+20000..U+2FFFD   | ID         | W          | CJK ideographs
    U+30000..U+3FFFD   | ID         | W          | CJK ideographs
    U+F0000..U+FFFFD   | AL (XX)    | F or N (A) | Private use
    U+100000..U+10FFFD | AL (XX)    | F or N (A) | Private use
    Other unassigned   | AL (XX)    | N          | Unassigned
    -------------------------------------------------------------

  • Characters belonging to General Category Mn, Me, Cc, Cf, Zl or Zp have the property value Z (nonspacing) defined by this module, regardless of East_Asian_Width property values assigned by [UAX #11].

REFERENCES

[CMOS]

The Chicago Manual of Style, 15th edition. University of Chicago Press, 2003.

[JIS X 4051]

JIS X 4051:2004 日本語文書の組版方法 (Formatting Rules for Japanese Documents). Japanese Standards Association, 2004.

[UAX #11]

A. Freytag (2008-2009). Unicode Standard Annex #11: East Asian Width, Revision 17-19. http://unicode.org/reports/tr11/.

[UAX #14]

A. Freytag and A. Heninger (2008-2009). Unicode Standard Annex #14: Unicode Line Breaking Algorithm, Revision 22-24. http://unicode.org/reports/tr14/.

SEE ALSO

Text::LineFold, Text::Wrap.

AUTHOR

Copyright (C) 2009 Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.