NAME
Text::KnuthPlass - Breaks paragraphs into lines using the TeX (Knuth-Plass) algorithm
SYNOPSIS
To use with plain text, with an indentation of 2 characters. NOTE that you should also set the shrinkability of spaces to 0 in the new() call:
use Text::KnuthPlass;
my $typesetter = Text::KnuthPlass->new(
'indent' => 2, # two characters,
# set space shrinkability to 0
'space' => { 'width' => 3, 'stretch' => 6, 'shrink' => 0 },
# can let 'measure' default to character count
# default line lengths to 78 characters
);
my @lines = $typesetter->typeset($paragraph);
...
for my $line (@lines) {
for my $node (@{$line->{'nodes'}}) {
if ($node->isa("Text::KnuthPlass::Box")) {
# a Box is a word or word fragment (no hyphen on fragment)
print $node->value();
} elsif ($node->isa("Text::KnuthPlass::Glue")) {
# a Glue is (at least) a single space, but you can look at
# the line's 'ratio' to insert additional spaces to
# justify the line. we also are glossing over the skipping
# of any final glue at the end of the line
print " ";
}
# ignoring Penalty (word split point) within line
}
if ($line->{'nodes'}[-1]->is_penalty()) { print "-"; }
print "\n";
}
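The comments in the loop above mention using the line's ratio to insert additional spaces for justification. The following is a minimal sketch of one way that could be done for monospaced output. It uses plain hashes as stand-ins for the Box and Glue objects, and the helper name render_line and the rounding strategy are assumptions made for illustration, not part of the module.

```perl
use strict;
use warnings;

# Render one line of monospaced text, widening each glue by ratio * stretch
# (or narrowing it by ratio * shrink when the ratio is negative).
# Nodes here are plain hashes standing in for Box and Glue objects.
sub render_line {
    my ($nodes, $ratio) = @_;
    my $out   = '';
    my $ideal = 0;    # exact (possibly fractional) position so far
    for my $node (@$nodes) {
        if ($node->{type} eq 'box') {
            $out   .= $node->{value};
            $ideal += length($node->{value});
        } elsif ($node->{type} eq 'glue') {
            my $adjust = $ratio < 0 ? $node->{shrink} : $node->{stretch};
            $ideal += $node->{width} + $ratio * $adjust;
            # Round the running position so rounding errors do not pile up.
            my $spaces = int($ideal + 0.5) - length($out);
            $spaces = 1 if $spaces < 1;    # never run two words together
            $out .= ' ' x $spaces;
        }
    }
    return $out;
}

my @nodes = (
    { type => 'box',  value => 'The' },
    { type => 'glue', width => 1, stretch => 0.5, shrink => 0 },
    { type => 'box',  value => 'quick' },
    { type => 'glue', width => 1, stretch => 0.5, shrink => 0 },
    { type => 'box',  value => 'fox' },
);

# Natural width is 13; a ratio of 3 adds 1.5 per glue, for a width of 16.
print render_line(\@nodes, 3), "|\n";
print render_line(\@nodes, 0), "|\n";
```

Rounding the running position, rather than each glue separately, keeps the line at its intended total width even when individual glue widths are fractional.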
To use with PDF::Builder: (also PDF::API2)
my $text = $page->text();
$text->font($font, 12);
$text->leading(13.5);
my $t = Text::KnuthPlass->new(
'indent' => 2*$text->text_width('M'), # 2 ems
'measure' => sub { $text->text_width(shift) },
'linelengths' => [235] # points
);
my @lines = $t->typeset($paragraph);
my $y = 500; # PDF decreases y down the page
for my $line (@lines) {
my $x = 50; # left margin
for my $node (@{$line->{'nodes'}}) {
$text->translate($x,$y);
if ($node->isa("Text::KnuthPlass::Box")) {
# a Box is a word or word fragment (no hyphen on fragment)
$text->text($node->value());
$x += $node->width();
} elsif ($node->isa("Text::KnuthPlass::Glue")) {
# a Glue is a variable-width space
$x += $node->width() + $line->{'ratio'} *
($line->{'ratio'} < 0 ? $node->shrink(): $node->stretch());
# we also are glossing over the skipping
# of any final glue at the end of the line
}
# ignoring Penalty (word split point) within line
}
# explicitly add a hyphen at a line-ending split word
if ($line->{'nodes'}[-1]->is_penalty()) { $text->text("-"); }
$y -= $text->leading(); # go to next line down
}
METHODS
$t = Text::KnuthPlass->new(%opts)
The constructor takes a number of options. The most important ones are:
- measure
-
A subroutine reference to determine the width of a piece of text. This defaults to length(shift), which is what you want if you're typesetting plain monospaced text. You will need to change this to plug into your font metrics if you're doing something graphical. For PDF::Builder (also PDF::API2), this would be the advancewidth() method (alias text_width()), which returns the width of a string (in the present font and size) in points.
'measure' => sub { length(shift) }, # default, for character output
'measure' => sub { $text->advancewidth(shift) }, # PDF::Builder/API2
- linelengths
-
This is an array of line lengths. For instance, [30,40,50] will typeset a triangle-shaped piece of text with three lines. What if the text spills over to more than three lines? In that case, the final value in the array is used for all further lines. So to typeset an ordinary block-shaped column of text, you only need specify an array with one value: the default is [78]. Note that this default would be the character count, rather than points (as needed by PDF::Builder or PDF::API2).
'linelengths' => [$lw, $lw, $lw-6, $lw-6, $lw],
This would set the first two lines in the paragraph to $lw length, the next two to 6 less (such as for a float inset), and finally back to full length. At each line, the first element is consumed, but the last element is never removed. Any paragraph indentation set will result in a shorter-appearing first line, which actually has blank space at its beginning. Start output of the first line at the same x value as you do the other lines.
Setting linelengths in the new() (constructor) call resets the internal line length list to the new elements, overwriting anything that was already there (such as any remaining line lengths left over from a previous typeset() call). Subsequent typeset() calls will continue to consume the existing line length list, until the last element is reached. You can either reset the list for the next paragraph with the typeset() call, or call the line_lengths() method to get or set the list.
- indent
-
This sets the global (default) paragraph indentation, unless overridden on a per-paragraph basis by an indent entry in a typeset() call. The units are the same as for measure and linelengths. A "Box" of value '' and width of indent is inserted before the first node of the paragraph. Your rendering code should know how to handle this by starting at the same x coordinate as other lines, and then moving right (or left) by the indicated amount.
'indent' => 2, # 2 character indentation
'indent' => 2*$text->text_width('M'), # 2 ems indentation
'indent' => -3, # 3 character OUTdent
If the value is negative, a negative-width space Box is added. The overall line will be longer than other lines, by that amount. Again, your rendering code should handle this in a similar manner as with a positive indentation, but move left by the indicated amount. Be careful to have your starting x value far enough to the right that text will not end up being written off-page.
- tolerance
-
How much leeway we have in leaving wider spaces than the algorithm would prefer. The tolerance is the maximum ratio glue expansion value to tolerate in a possible solution, before discarding this solution as so infeasible as to be a waste of time to pursue further. Most of the time, the tolerance is going to have a value in the 1 to 3 range. One approach is to try with tolerance => 1, and if no successful layout is found, try again with 2, and then 3 and perhaps even 4.
- hyphenator
-
An object which hyphenates words. If you have the Text::Hyphen product installed (which is highly recommended), then a Text::Hyphen object is instantiated by default; if not, an object of the class Text::KnuthPlass::DummyHyphenator is instantiated - this simply finds no hyphenation points at all. So to turn hyphenation off, set
'hyphenator' => Text::KnuthPlass::DummyHyphenator->new()
To typeset non-English text, pass in a Text::Hyphen-like object which responds to the hyphenate method, returning a list of hyphen positions for that particular language (native Text::Hyphen defaults to American English hyphenation rules). (See Text::Hyphen for the interface.)
- space
-
Fine tune space (glue) width, stretchability, and shrinkability.
'space' => { 'width' => 3, 'stretch' => 6, 'shrink' => 9 },
For typesetting constant width text or output to a text file (characters), we suggest setting the shrink value to 0. This prevents the glue spaces from being shrunk to less than one character wide, which could result in either no spaces between words, or overflow into the right margin.
'space' => { 'width' => 3, 'stretch' => 6, 'shrink' => 0 },
- infinity
-
The default value for infinity is, as is customary in TeX, 10000. While this is a far cry from the real infinity, so long as it is substantially larger than any other demerit or penalty, it should take precedence in calculations. Both positive and negative infinity are used in the code for various purposes, including a +inf penalty for something absolutely forbidden, and -inf for something absolutely required (such as a line break at the end of a paragraph).
'infinity' => 10000,
- hyphenpenalty
-
Sets the penalty for an end-of-line hyphen; the default is 50. You may want to try a somewhat higher value, such as 100 or more, if you see too much hyphenation on output. Remember that excessively short lines are prone to splitting words and being hyphenated, no matter what the penalty is.
'hyphenpenalty' => 50,
There does not appear to be anything in the code to find and prevent multiple contiguous (adjacent) hyphenated lines, nor to prevent the penultimate (next-to-last) line from being hyphenated, nor to prevent the hyphenation of a line where you anticipate the paragraph to be split between columns. Something may be done in the future about these three special cases, which are considered to not be good typesetting.
- demerits
-
Various demerits used in calculating penalties, including fitness, which is used when line tightness (ratio) changes by more than one class between two lines.
'demerits' => { 'line' => 10, 'flagged' => 100, 'fitness' => 3000 },
There may be other options for fine-tuning the output. If you know your way around TeX, dig into the source to find out what they are. At some point, this package will support additional tuning by allowing the setting of more parameters which are currently hard-coded. Please let us know if you found any more parameters that would be useful to allow additional tuning!
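The retry strategy suggested under tolerance (try 1, then 2, then 3 and perhaps 4) can be sketched as a small loop. This is only an illustration under stated assumptions: the helper name typeset_with_retry is hypothetical, the typesetting step is passed in as a code reference so the sketch runs without the module, and a failed layout is assumed to come back as an empty list.

```perl
use strict;
use warnings;

# Try each tolerance in turn; return the first one that yields lines.
# $typeset_at is a caller-supplied code ref (hypothetical) that takes a
# tolerance and returns the list of lines, or an empty list on failure.
sub typeset_with_retry {
    my ($typeset_at, @tolerances) = @_;
    for my $tolerance (@tolerances) {
        my @lines = $typeset_at->($tolerance);
        return ($tolerance, @lines) if @lines;
    }
    return;    # no tolerance produced a layout
}

# With the real module the callback might look like this (untested sketch):
#   sub { Text::KnuthPlass->new('tolerance' => shift)->typeset($paragraph) }
# Here a stand-in pretends that layouts only succeed at tolerance 3 and up.
my $stand_in = sub {
    my ($tolerance) = @_;
    return $tolerance >= 3 ? (+{ 'nodes' => [], 'ratio' => 0 }) : ();
};

my ($used, @lines) = typeset_with_retry($stand_in, 1 .. 4);
print defined $used ? "laid out at tolerance $used\n" : "no layout found\n";
```

Starting low and only loosening the tolerance when necessary keeps the spacing as tight as the paragraph allows.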
$t->typeset($paragraph_string, %opts)
This is the main interface to the algorithm, made up of the constituent parts below. It takes a paragraph of text and returns a list of lines (array of hashes) if suitable breakpoints could be found.
The typesetter currently allows several options:
- indent
-
Override the global paragraph indentation value just for this paragraph. This can be useful for instances such as not indenting the first paragraph in a section.
'indent' => 0, # default set in new() is 2ems
- linelengths
-
The array of line lengths may be set here, in typeset. As with new(), it will override whatever existing line lengths array is left over from earlier operations.
Possibly (in the future) many other global settings set in new() may be overridden on a per-paragraph basis in typeset().
The returned list has the following structure:
(
{ 'nodes' => \@nodes, 'ratio' => $ratio },
{ 'nodes' => \@nodes, 'ratio' => $ratio },
...
)
The node list in each element will be a list of objects. Each object will be either Text::KnuthPlass::Box, Text::KnuthPlass::Glue or Text::KnuthPlass::Penalty. See below for more on these.
The ratio is the amount of stretch or shrink which should be applied to each glue element in this line. The corrected width of each glue node should be:
$node->width() + $line->{'ratio'} *
($line->{'ratio'} < 0 ? $node->shrink() : $node->stretch());
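As a worked example of the formula above, the sketch below applies it to made-up numbers and checks that the adjusted glue widths bring a line to its target width. The glue is a plain hash rather than a Glue object, the helper name adjusted_glue_width is hypothetical, and the ratio is computed here with the usual Knuth-Plass definition (shortfall divided by total stretch, or excess divided by total shrink), which is an assumption about how the returned ratio is derived.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Corrected width of one glue, given the line's ratio.
sub adjusted_glue_width {
    my ($glue, $ratio) = @_;
    return $glue->{width}
         + $ratio * ($ratio < 0 ? $glue->{shrink} : $glue->{stretch});
}

my @box_widths = (40, 55, 30);                     # three words, in points
my $glue       = { width => 3, stretch => 6, shrink => 2 };
my $glue_count = @box_widths - 1;                  # one glue between words
my $natural    = sum(@box_widths) + $glue_count * $glue->{width};

for my $target (137, 131, 129) {
    my $diff = $target - $natural;
    # Total room available: stretch if the line is short, shrink if long.
    my $room  = $glue_count * ($diff < 0 ? $glue->{shrink} : $glue->{stretch});
    my $ratio = $room ? $diff / $room : 0;
    my $each  = adjusted_glue_width($glue, $ratio);
    my $total = sum(@box_widths) + $glue_count * $each;
    printf "target %d: ratio %+.2f, glue %.2f, total %.2f\n",
        $target, $ratio, $each, $total;
}
```

A positive ratio widens every glue in proportion to its stretch, and a negative ratio narrows it in proportion to its shrink, so the sum of box and glue widths lands on the target line length.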
Each box, glue or penalty node has a width attribute. Boxes have values, which are the text which went into them (including a wide null blank for paragraph indentation, a special case); glue has stretch and shrink to determine how much it should vary in width. That should be all you need for basic typesetting; for more, see the source, and see the original Knuth-Plass paper in "Digital Typography".
Why typeset rather than something like linesplit? Per "ACKNOWLEDGEMENTS", this code is ported from the Javascript product typeset.
This method is a thin wrapper around the three methods below.
$t->line_lengths()
- @list = $t->line_lengths() # Get
- $t->line_lengths(@list) # Set
-
Get or set the linelengths list of allowed line lengths. This permits you to do more elaborate operations on this array than simply replacing (resetting) it, as done in the new() and typeset() methods. For example, at the bottom of a page, you might cancel any further inset for a float, by deleting all but the last element of the list.
my @temp_LL = $t->line_lengths();
# cancel remaining line shortening
splice(@temp_LL, 0, scalar(@temp_LL)-1);
$t->line_lengths(@temp_LL);
On a "Set" request, you must have at least one length element in the list. If the list is empty, it is assumed to be a "Get" request.
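The way the line length list is consumed (first element used up at each line, last element never removed) can be modelled in a few lines. This sketch only mimics the behaviour described in this document with a plain array; the helper name next_line_length is hypothetical and the code is not taken from the module's internals.

```perl
use strict;
use warnings;

# Return the length for the next line, consuming the first element of the
# list unless it is the only one left.
sub next_line_length {
    my ($lengths) = @_;    # array ref, modified in place
    return @$lengths > 1 ? shift @$lengths : $lengths->[0];
}

my $lw      = 235;
my @lengths = ($lw, $lw, $lw - 6, $lw - 6, $lw);

# Set three lines, as if the page ended in the middle of the inset.
my @used = map { next_line_length(\@lengths) } 1 .. 3;
print "used: @used\n";
print "remaining: @lengths\n";

# Cancel the rest of the inset by keeping only the last element.
splice(@lengths, 0, @lengths - 1);
print "after cancel: @lengths\n";

# Every further line now gets the full length.
my @more = map { next_line_length(\@lengths) } 1 .. 2;
print "next lines: @more\n";
```

Because the last element is never consumed, a one-element list behaves as a constant line length for the rest of the text.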
$t->break_text_into_nodes($paragraph_string, %opts)
This turns a paragraph into a list of box/glue/penalty nodes. It's fairly basic, and designed to be overridden in a subclass. It should also support multiple justification styles (centering, ragged right, etc.) but this will come in a future release; right now, it just does full justification.
'style' => "string_name"
- "justify"
-
Fully justify the text (flush left and right). This is the default, and currently the only choice implemented.
- "left"
-
Not yet implemented. This will be flush left, ragged right (reversed for RTL scripts).
- "right"
-
Not yet implemented. This will be flush right, ragged left (reversed for RTL scripts).
- "center"
-
Implemented, but not yet fully tested. This is centered text within the indicated line width.
If you are doing clever typography or using non-Western languages you may find that you will want to break text into nodes yourself, and pass the list of nodes to the methods below, instead of using this method.
break
This implements the main body of the algorithm; it turns a list of nodes (produced from the above method) into a list of breakpoint objects.
@lines = $t->breakpoints_to_lines(\@breakpoints, \@nodes)
And this takes the breakpoints and the nodes, and assembles them into lines.
boxclass()
glueclass()
penaltyclass()
For subclassers.
AUTHOR
originally written by Simon Cozens, <simon at cpan.org>
since 2020, maintained by Phil Perry
ACKNOWLEDGEMENTS
This module is a Perl translation (originally by Simon Cozens) of Bram Stein's "Typeset" Javascript Knuth-Plass implementation.
BUGS
Please report any bugs or feature requests to the issues section of https://github.com/PhilterPaper/Text-KnuthPlass.
Do NOT under ANY circumstances open a PR (Pull Request) to report a bug. It is a waste of both your and our time and effort. Open a regular ticket (issue), and attach a Perl (.pl) program illustrating the problem, if possible. If you believe that you have a program patch, and offer to share it as a PR, we may give the go-ahead. Unsolicited PRs may be closed without further action.
COPYRIGHT & LICENSE
Copyright (c) 2011 Simon Cozens.
Copyright (c) 2020-2022 Phil M Perry.
This program is released under the following license: Perl, GPL