Why Your Redacted PDFs Aren't Actually Redacted (And How to Fix It)
Most PDF redaction tools just draw a black box over text. The data is still there. Here's how true content-stream redaction works and how to do it programmatically.
PDF redaction fails more often than it works. A 2022 study found that 65% of PDFs with visual redactions still contained the underlying text in the file. The black box is cosmetic.
You've seen it in the news. A law firm releases court documents with sensitive information "redacted." A journalist opens the PDF, selects the blacked-out text, and pastes it into a text editor. The information is right there.
This isn't a rare edge case. It's the default behavior of most PDF redaction tools.
How PDFs Actually Work
A PDF file is not an image. It's a structured document containing a content stream -- a sequence of operators that tell a PDF renderer what to draw and where.
A simple text object, made up of a few text operators, looks like this:
BT
/F1 12 Tf
100 700 Td
(John Smith) Tj
ET
This tells the renderer: use font F1 at size 12, move to position (100, 700), and draw the text "John Smith."
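To make the structure concrete, here is a minimal TypeScript sketch that reads the same values back out of that text object. The function name `parseSimpleTextObject` and the `TextRun` shape are made up for illustration, and the regular expressions only cover the simple case shown above (one `Tf`, one `Td`, one literal string with no escapes or nested parentheses). A real content stream needs a proper tokenizer.

```typescript
// Illustrative only: parses ONE simple text object like the example above.
// Assumes literal strings without escapes; not a general PDF parser.
interface TextRun {
  font: string;
  size: number;
  x: number;
  y: number;
  text: string;
}

function parseSimpleTextObject(stream: string): TextRun | null {
  // "/F1 12 Tf" selects font resource F1 at size 12
  const font = /\/(\S+)\s+([\d.]+)\s+Tf/.exec(stream);
  // "100 700 Td" moves the text position
  const position = /(-?[\d.]+)\s+(-?[\d.]+)\s+Td/.exec(stream);
  // "(John Smith) Tj" shows a literal string
  const show = /\(([^()\\]*)\)\s*Tj/.exec(stream);
  if (!font || !position || !show) return null;
  return {
    font: font[1],
    size: Number(font[2]),
    x: Number(position[1]),
    y: Number(position[2]),
    text: show[1],
  };
}

const sampleTextObject = 'BT\n/F1 12 Tf\n100 700 Td\n(John Smith) Tj\nET';
const parsedRun = parseSimpleTextObject(sampleTextObject);
console.log(parsedRun);
```

The point is that the name is plain data inside the stream, so any program that can read the operators can read the text.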
When most tools "redact" a PDF, they draw a black rectangle over the text. The rectangle covers it visually. But the text operator is still in the content stream. The text is still there. Anyone can select it, copy it, or extract it programmatically.
The Three Ways PDF Redaction Fails
1. Visual overlay only
The most common failure. A black rectangle is drawn over the text. The underlying text operators are untouched.
How to check: Open the PDF in Chrome. Try to select text in the redacted area. If you can select it, the redaction failed.
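The same check can be sketched in code. The TypeScript below imitates an overlay-only tool by appending a filled black rectangle to a content stream, then runs a naive text extractor over the result. Both helper names (`overlayBlackBox`, `extractShownText`) are hypothetical, and the extractor only understands literal strings shown with `Tj`, so treat it as a demonstration of the failure mode rather than a real extraction tool.

```typescript
// Illustrative only: imitates an overlay-style "redaction" on a content stream.
function overlayBlackBox(stream: string, x: number, y: number, w: number, h: number): string {
  // "0 0 0 rg" sets a black fill colour, "re" adds a rectangle path, "f" fills it
  return `${stream}\n0 0 0 rg\n${x} ${y} ${w} ${h} re\nf`;
}

// Naive extractor: collects every literal string shown with Tj.
function extractShownText(stream: string): string[] {
  const showOperator = /\(([^()\\]*)\)\s*Tj/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = showOperator.exec(stream)) !== null) {
    found.push(match[1]);
  }
  return found;
}

const visibleStream = 'BT\n/F1 12 Tf\n100 700 Td\n(John Smith) Tj\nET';
const overlaidStream = overlayBlackBox(visibleStream, 95, 695, 80, 16);
const leakedText = extractShownText(overlaidStream);
console.log(leakedText);
```

The rectangle is drawn after the text, so it hides the glyphs on screen, but the extractor still returns the original string.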
2. Metadata not scrubbed
Even if the visible text is removed, the PDF's metadata may still contain sensitive information -- author name, creation date, edit history, comments, and in some cases the original text content.
How to check: Open the PDF in Acrobat or a PDF inspector. Look at Document Properties → Description. Check for comments and annotations.
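For a quick scripted version of this check, the sketch below scans a PDF's raw text for the standard document information keys (Title, Author, Subject, Keywords, Creator, Producer). The helper name `findInfoEntries` is an assumption for illustration. It only sees values stored as uncompressed literal strings, so it will miss XMP metadata, hex or UTF-16 strings, and anything inside compressed object streams, and it may also pick up `/Title` entries that belong to bookmarks. A clean result from this sketch does not prove the metadata is gone.

```typescript
// Illustrative only: finds Info-dictionary style entries stored as plain literal strings.
const INFO_KEYS = ['Title', 'Author', 'Subject', 'Keywords', 'Creator', 'Producer'];

function findInfoEntries(pdfText: string): Record<string, string> {
  const entries: Record<string, string> = {};
  for (const key of INFO_KEYS) {
    // Matches e.g. "/Author (Jane Doe)"; no support for escapes or nested parentheses
    const match = new RegExp(`/${key}\\s*\\(([^()\\\\]*)\\)`).exec(pdfText);
    if (match) entries[key] = match[1];
  }
  return entries;
}

// In practice pdfText would be the file's bytes decoded one-to-one (for example as Latin-1).
const sampleInfoObject = '1 0 obj\n<< /Author (Jane Doe) /Producer (ExampleWriter 1.0) >>\nendobj';
const infoEntries = findInfoEntries(sampleInfoObject);
console.log(infoEntries);
```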
3. Incorrect layer handling
PDFs can have multiple content layers. A redaction that removes text from one layer may leave it intact on another.
What True PDF Redaction Looks Like
True redaction removes the text at the content stream level.
The process:
- Decompress the content stream (most PDFs compress streams with FlateDecode)
- Tokenize the stream into operators
- Find and remove text operators that match the redaction target
- Recompress the stream
- Draw a black rectangle over the redacted area
- Scrub document metadata
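The decompress, remove, and recompress steps can be sketched with Node's built-in `zlib`, since FlateDecode data is zlib-format deflate. This is a simplified illustration and not Forme's implementation: the function name `redactStreamBody` is hypothetical, it works on a single stream body rather than a whole PDF, it only recognises literal strings shown with `Tj`, and it drops the entire show operator when the string contains the target. Drawing the rectangle and scrubbing metadata are left out.

```typescript
import { Buffer } from 'node:buffer';
import { deflateSync, inflateSync } from 'node:zlib';

// Illustrative only: redacts matching Tj operators inside ONE FlateDecode stream body.
function redactStreamBody(compressed: Uint8Array, target: string): Uint8Array {
  // 1. Decompress (FlateDecode is zlib-wrapped deflate)
  const content = inflateSync(compressed).toString('latin1');
  // 2-3. Find "(...) Tj" show operators and remove those containing the target
  const showOperator = /\(([^()\\]*)\)\s*Tj/g;
  const cleaned = content.replace(showOperator, (whole: string, text: string) =>
    text.indexOf(target) !== -1 ? '' : whole
  );
  // 4. Recompress
  return deflateSync(Buffer.from(cleaned, 'latin1'));
}

const originalBody = 'BT\n/F1 12 Tf\n100 700 Td\n(John Smith) Tj\nET';
const compressedBody = deflateSync(Buffer.from(originalBody, 'latin1'));
const redactedBody = redactStreamBody(compressedBody, 'John Smith');
const redactedContent = inflateSync(redactedBody).toString('latin1');
console.log(redactedContent);
```

A real implementation also has to update the stream's `/Length` entry and the file's cross-reference offsets after the stream changes size, and it has to handle text split across several operators or shown with `TJ` arrays.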
After true redaction, there is no text to select, copy, or extract. The content doesn't exist in the file.
Why Adobe Acrobat's Redaction Is Overkill for Developers
Adobe Acrobat does support true redaction -- it's under Tools → Redact. But it requires a paid Acrobat Pro subscription ($19.99/month), a desktop app, and a manual workflow.
For developers who need to redact documents programmatically at scale -- thousands of PDFs, automated pipelines, serverless functions -- a desktop tool isn't the answer. You need an API.
Redacting PDFs Programmatically with Forme
Forme's redact endpoint removes text from the content stream. Not a visual overlay -- actual content removal. Metadata is scrubbed automatically on every redaction.
curl -X POST https://api.formepdf.com/v1/redact \
  -H "Authorization: Bearer $FORME_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf": "<base64-encoded-pdf>",
    "presets": ["ssn", "email", "phone"],
    "patterns": [
      { "pattern": "John Smith", "pattern_type": "Literal" },
      { "pattern": "\\d{3}-\\d{2}-\\d{4}", "pattern_type": "Regex" }
    ]
  }' \
  --output redacted.pdf
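If you are calling the endpoint from code without the SDK, the same request could be assembled along these lines. The field names (`pdf`, `presets`, `patterns`, `pattern_type`) and the URL come from the curl example above. The type names, the helper functions, and the assumption that the response body is the redacted PDF itself (suggested by `--output redacted.pdf`) are mine, so check them against the API documentation before relying on them.

```typescript
import { Buffer } from 'node:buffer';

// Shapes inferred from the curl example; names are illustrative.
interface RedactPattern {
  pattern: string;
  pattern_type: 'Literal' | 'Regex';
}

interface RedactRequestBody {
  pdf: string; // base64-encoded PDF
  presets?: string[];
  patterns?: RedactPattern[];
}

function buildRedactBody(
  pdfBytes: Uint8Array,
  presets: string[],
  patterns: RedactPattern[]
): RedactRequestBody {
  return { pdf: Buffer.from(pdfBytes).toString('base64'), presets, patterns };
}

// Assumes Node 18+ (global fetch) and that the response body is the redacted PDF.
async function sendRedactRequest(body: RedactRequestBody, apiKey: string): Promise<Uint8Array> {
  const response = await fetch('https://api.formepdf.com/v1/redact', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) throw new Error(`Redact request failed: ${response.status}`);
  return new Uint8Array(await response.arrayBuffer());
}

// Build (but do not send) a request for a tiny placeholder payload: the bytes of "%PDF".
const exampleBody = buildRedactBody(
  new Uint8Array([0x25, 0x50, 0x44, 0x46]),
  ['ssn', 'email', 'phone'],
  [{ pattern: 'John Smith', pattern_type: 'Literal' }]
);
console.log(JSON.stringify(exampleBody));
```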
Built-in presets for common sensitive data
Forme ships with regex presets for the most common PII:
- ssn -- Social Security Numbers (xxx-xx-xxxx)
- email -- Email addresses
- phone -- US phone numbers
- date-of-birth -- Date formats (MM/DD/YYYY, YYYY-MM-DD)
- credit-card -- 16-digit card numbers
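To give a feel for what a regex preset does, here is a rough TypeScript sketch of comparable patterns applied to plain text. These are not Forme's actual preset definitions. The `EXAMPLE_PRESETS` table and the `findPresetMatches` helper are invented for illustration, and the expressions are deliberately simple (the SSN pattern mirrors the one in the curl example above).

```typescript
// Illustrative patterns only; Forme's real presets may differ.
const EXAMPLE_PRESETS: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
  email: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g,
  'date-of-birth': /\b(?:\d{2}\/\d{2}\/\d{4}|\d{4}-\d{2}-\d{2})\b/g,
};

function findPresetMatches(text: string, presetNames: string[]): string[] {
  const hits: string[] = [];
  for (const name of presetNames) {
    const pattern = EXAMPLE_PRESETS[name];
    if (!pattern) continue; // unknown preset names are skipped in this sketch
    pattern.lastIndex = 0; // global regexes keep state between calls
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      hits.push(match[0]);
    }
  }
  return hits;
}

const sampleRecord = 'SSN 123-45-6789, mail jane@example.com, born 01/02/1990';
const presetHits = findPresetMatches(sampleRecord, ['ssn', 'email', 'date-of-birth']);
console.log(presetHits);
```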
# Redact all common PII in one call
curl -X POST https://api.formepdf.com/v1/redact \
  -H "Authorization: Bearer $FORME_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf": "<base64>",
    "presets": ["ssn", "email", "phone", "date-of-birth"]
  }' \
  --output redacted.pdf
Text-search redaction
You can also redact by literal string or regex pattern:
import { FormeClient } from '@formepdf/sdk';
const client = new FormeClient({ apiKey: process.env.FORME_API_KEY });
// pdfBytes holds the source PDF's bytes, loaded however suits your application
const redacted = await client.redact(pdfBytes, {
  patterns: [
    { pattern: 'John Smith', pattern_type: 'Literal' },
    { pattern: '[A-Z]{2}\\d{6}', pattern_type: 'Regex' },
  ],
  presets: ['ssn', 'email'],
});
Redaction templates for recurring workflows
For recurring workflows such as HIPAA compliance, legal discovery, or financial document processing, you can save named redaction templates and reference them by slug:
# Create the template once in the dashboard
# Then reference it by slug on every request:
curl -X POST https://api.formepdf.com/v1/redact \
  -H "Authorization: Bearer $FORME_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf": "<base64>",
    "template": "hipaa-patient-record"
  }' \
  --output redacted.pdf
Verifying Your PDF Redaction Worked
After redacting a PDF with Forme:
- Open the output in Chrome
- Try to select text in the redacted area
- The text should not be selectable
You can also verify programmatically:
import { findTextRegions } from '@formepdf/core';
// redactedPdf is the output of the redaction call
const regions = await findTextRegions(redactedPdf, [
  { pattern: 'John Smith', pattern_type: 'Literal' },
]);
console.log(regions.length); // Should be 0
If findTextRegions returns an empty array, the text is gone.
Limitations to Know
Forme's text redaction works on text operators in PDF content streams. A few edge cases:
Scanned documents -- If a PDF contains a scanned image with text in it, that image is not modified. Forme redacts text operators, not pixels. For scanned documents you need OCR first.
CJK text -- Chinese, Japanese, and Korean text uses CIDFont encoding, which requires additional glyph mapping. Forme currently redacts only WinAnsi (Latin) encoded text.
Encrypted PDFs -- Cannot be redacted without first decrypting them.
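If you want to catch encrypted files before sending them, a rough pre-check is to look for an `/Encrypt` entry, which an encrypted PDF carries in its trailer dictionary. The helper name `looksEncrypted` is hypothetical and the check is a heuristic: it can report a false positive if that byte sequence happens to appear elsewhere in the file, so a proper PDF library is the safer choice for production use.

```typescript
import { Buffer } from 'node:buffer';

// Heuristic only: looks for an /Encrypt key anywhere in the raw file bytes.
function looksEncrypted(pdfBytes: Uint8Array): boolean {
  // Latin-1 decodes each byte to one character, so binary data cannot break decoding
  const text = Buffer.from(pdfBytes).toString('latin1');
  return /\/Encrypt\b/.test(text);
}

const encryptedTrailer = Buffer.from('trailer\n<< /Size 12 /Root 1 0 R /Encrypt 11 0 R >>', 'latin1');
const plainTrailer = Buffer.from('trailer\n<< /Size 12 /Root 1 0 R >>', 'latin1');
console.log(looksEncrypted(encryptedTrailer), looksEncrypted(plainTrailer));
```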
The Compliance Case
For regulated industries -- healthcare, legal, finance, government -- redaction is not optional. HIPAA requires that PHI is actually removed from documents before sharing. A black box overlay does not satisfy the requirement.
The tools most teams reach for -- markup in Acrobat, text boxes in Word, "redacted" overlays in various editors -- produce visually redacted documents that fail a basic copy-paste test.
True redaction requires content stream removal. That's what Forme does.
Getting Started
# Install the SDK
npm install @formepdf/sdk
# Or call the API directly
curl -X POST https://api.formepdf.com/v1/redact \
  -H "Authorization: Bearer $FORME_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "pdf": "<base64>", "presets": ["ssn", "email"] }' \
  --output redacted.pdf
Sign up at app.formepdf.com. The free plan includes 50 operations per month -- enough to test your redaction workflow before committing.
Self-hosting is available via the formepdf/forme Docker image for teams that need documents to stay on their own infrastructure.