How we implemented TeX's line-breaking algorithm in a PDF engine
Greedy line breaking produces ugly justified text. Knuth and Plass solved this in 1981. Here's how we brought their algorithm into a modern Rust PDF engine.
Justified text is one of those things that looks trivial until you try to implement it. You have a line of text. You have a target width. Stretch the spaces until the line fills the width. Done.
Except it looks terrible. Some lines have three words with huge gaps between them. Others are packed tight. Adjacent lines have wildly different spacing, creating rivers of whitespace that flow down the page. Every PDF you've seen with bad justified text was produced by an engine using the same naive approach: fill each line greedily, then stretch the spaces.
TeX doesn't have this problem. Documents typeset by TeX have justified text that looks effortless. The spacing is even. The rivers are gone. That's not an accident. It's the result of an algorithm that Donald Knuth and Michael Plass published in 1981, and it's the algorithm we just shipped in Forme.

The greedy approach and why it fails
Most text layout engines use greedy line breaking. The logic is simple: add words to the current line until the next word doesn't fit, then break and start a new line.
Line 1: [word] [word] [word] [word] <- fits
Line 2: [word] [word] <- next word overflows, break here
Line 3: [word] [word] [word] [word] [word] <- fits tightly
For left-aligned text this is fine. Nobody notices the ragged right edge. But for justified text, each line must fill the exact target width. The greedy algorithm doesn't consider what happens on line 2 when it commits to the break on line 1. It can't. It only looks at one line at a time.
The result: line 2 has two words stretched across 500 points. Line 3 has five words packed into the same space. The spacing ratio between those lines is 4:1. Your eye notices.
Knuth-Plass: the paragraph-level view
The insight from Knuth and Plass is simple: don't break lines one at a time. Consider the entire paragraph at once and find the set of break points that minimizes the total ugliness across all lines.
This is a dynamic programming problem. Each potential break point has a cost (called "demerits") based on how much the spaces on that line need to stretch or shrink. The algorithm finds the combination of breaks with the lowest total demerits.
The data model is three types of items:
- Boxes: fixed-width content (word fragments, characters). These don't stretch.
- Glue: stretchable/shrinkable space. Each glue has a natural width, a stretch amount, and a shrink amount. Word spaces are glue.
- Penalties: potential break points with a cost. Hyphenation points have a positive penalty (discouraged but allowed). Forced breaks have a penalty of negative infinity (must break here).
A sentence becomes a sequence of items:
Box("The") Glue(space) Box("quick") Glue(space) Box("brown") Glue(space) Box("fox")
The algorithm walks through the items, maintaining a set of "active" break points. At each potential break, it computes an adjustment ratio: how much the spaces on that line would need to stretch or shrink to hit the target width. If the ratio is within tolerance, it's a feasible break. The algorithm tracks the best break for each "fitness class" (tight, normal, loose, very loose) and penalizes adjacent lines with very different tightness.
Implementation in Rust
The core of the algorithm is about 200 lines. Here's the shape of it:
pub fn find_breaks(items: &[Item], config: &Config) -> Option<Vec<LineSolution>> {
let mut breakpoints: Vec<Breakpoint> = Vec::new();
let mut active: Vec<usize> = Vec::new();
// Seed with a dummy breakpoint at the start of the paragraph
breakpoints.push(Breakpoint { item_index: 0, line: 0, prev: None, .. });
active.push(0);
let mut total_width = 0.0;
let mut total_stretch = 0.0;
let mut total_shrink = 0.0;
for (i, item) in items.iter().enumerate() {
if is_feasible_break(item, i, &items) {
for &a in &active {
let ratio = compute_adjustment_ratio(
total_width - breakpoints[a].total_width,
total_stretch - breakpoints[a].total_stretch,
config.line_width,
);
if ratio >= -1.0 && ratio <= config.tolerance {
// Feasible break - compute demerits and record it
let demerits = compute_demerits(ratio, item);
// ... track best break per fitness class
}
}
// Create new breakpoints, deactivate overfull ones
}
// Update running totals
}
// Backtrack from the best final breakpoint to build the solution
backtrack(&breakpoints, &active)
}
The adjustment ratio is the key value. A ratio of 0 means the line is exactly the right width. Positive means the spaces need to stretch. Negative means they need to shrink. A ratio below -1 means the line is overfull even with maximum shrinking. A ratio above the tolerance threshold (we use 2.0) means the line would need to stretch too much.
Demerits are computed as (1 + 100 * |ratio|^3 + penalty)^2, with extra penalties for consecutive hyphenated lines and adjacent lines of very different tightness. The cubic term means slightly-off lines are cheap but very-off lines are expensive. This is what produces even spacing: the algorithm prefers three lines that are each a little stretched over two perfect lines and one that's terrible.
The paragraph-ending trick
One subtle detail: the last line of a paragraph shouldn't be justified. TeX handles this by appending a special sequence to every paragraph: a zero-width glue with infinite stretchability, followed by a forced break penalty.
// Final forced break (the paragraph must end)
items.push(Item::Glue { width: 0.0, stretch: 1e6, shrink: 0.0 });
items.push(Item::Penalty { penalty: f64::NEG_INFINITY, .. });
The infinite stretch means the last line always has an adjustment ratio near zero, regardless of how much content is on it. This means the last line is treated as perfectly fitting, so it gets zero demerits. During reconstruction, we skip justification for the final line, leaving it left-aligned. This is what you expect from typeset text, and it falls naturally out of the algorithm.
Justified spacing in reconstruction
Once the algorithm finds optimal break points, reconstruction walks the items between each pair of breaks and adjusts the glue widths:
for glue in line_glues {
let adjusted_width = if ratio >= 0.0 {
glue.width + ratio * glue.stretch
} else {
glue.width + ratio * glue.shrink
};
// Position the next character at x + adjusted_width
}
The adjustment ratio is the same value the algorithm computed during optimization. Positive ratio stretches glue proportionally to its stretchability. Negative ratio shrinks it proportionally to its shrinkability. The result is that all spaces on a line are adjusted by the same proportion, which looks natural.
Unicode line breaking (UAX #14)
The algorithm needs to know where it can break. Our old code checked for ASCII spaces, hyphens, and tabs. That works for English. It doesn't work for:
- CJK text, where you can break between any two ideographs
- Em dashes, where a break after is conventional
- Zero-width spaces (U+200B), an explicit invisible break point
- Line separators (U+2028), paragraph separators (U+2029)
- Dozens of other Unicode-defined break opportunities
We replaced hardcoded break detection with the unicode-linebreak crate, which implements the full UAX #14 algorithm and returns break opportunities at every character position: Mandatory, Allowed, or nothing.
These break opportunities feed directly into build_items(). Allowed breaks become zero-width penalties (free to take). Mandatory breaks become negative-infinity penalties (forced). Spaces still become glue (the space character is both a break opportunity and stretchable whitespace).
Multi-language hyphenation
Good line breaking needs hyphenation. Without it, the algorithm has fewer places to break, which leads to wider spacing variation. With it, the algorithm can split long words at syllable boundaries, giving it more options to balance lines evenly.
We use the hypher crate, which implements the Knuth-Liang hyphenation algorithm (also from TeX). It supports 35+ languages. To use the right dictionary, we added a lang property to the style system that cascades like CSS:
<Document lang="de">
<Page>
<Text style={{ hyphens: 'auto' }}>
Donaudampfschifffahrtsgesellschaftskapitän
</Text>
</Page>
</Document>
The lang property on <Document> cascades to all child nodes. A child can override it with its own lang. This maps to the BCP 47 language tag, which we resolve to a hypher dictionary:
fn resolve_hypher_lang(lang: Option<&str>) -> Option<hypher::Lang> {
let tag = lang.unwrap_or("en");
match tag.split('-').next().unwrap_or(tag) {
"en" => Some(hypher::Lang::English),
"de" => Some(hypher::Lang::German),
"fr" => Some(hypher::Lang::French),
// ... 35+ languages
_ => None, // unsupported → no algorithmic hyphenation
}
}
In the item list, hyphenation points become flagged penalties with a positive cost (we use 50). This means the algorithm will choose hyphenation when it produces significantly better overall spacing, but won't hyphenate when it's unnecessary. Consecutive hyphenated lines get extra demerits (3000) to avoid the "ladder" effect of three or four hyphens in a row.
Fallback
The Knuth-Plass algorithm can fail. If the text is too wide for the line and there's no feasible combination of breaks within the tolerance, find_breaks() returns None. This can happen with very narrow columns, very long words, or extreme font sizes.
When it fails, we fall back to the greedy algorithm. This is the right tradeoff: KP produces better results when it can find a solution, and greedy always produces a result (even if the spacing isn't ideal). The fallback is invisible to the user.
if let Some(solutions) = knuth_plass::find_breaks(&items, &config) {
knuth_plass::reconstruct_lines(&solutions, &items, &chars, &widths, max_width, justify)
} else {
self.break_into_lines(/* greedy fallback */)
}
What changed
Forme now uses optimal line breaking for all text, not just justified. Left-aligned, right-aligned, and centered text all benefit from better break decisions. The difference is most visible with justified text, where the spacing is noticeably more even, but it also reduces orphaned short lines in left-aligned text.
The integration touches three layers:
- UAX #14 break opportunities replace hardcoded ASCII break detection
- Knuth-Plass replaces the greedy line breaker for all alignments
- Justified text gets proper inter-word spacing from the adjustment ratios
<Text style={{ textAlign: 'justify', hyphens: 'auto' }}>
The quick brown fox jumps over the lazy dog. Pack my box with five
dozen liquor jugs. How vexingly quick daft zebras jump.
</Text>
If you're using textAlign: 'justify' in Forme, it now produces the kind of output you'd expect from a typesetting engine. Because it is one.