Why Your Redacted PDFs Aren't Actually Redacted (And How to Fix It)

PDF redaction fails more often than it works. A 2022 study found that 65% of PDFs with visual redactions still contained the underlying text in the file. The black box is cosmetic. The data is still there.

You've seen it in the news. A law firm releases court documents with sensitive information "redacted." A journalist opens the PDF, selects the blacked-out text, and pastes it into a text editor. The information is right there.

This isn't a rare edge case. It's the default behavior of most PDF redaction tools.

How PDFs Actually Work

A PDF file is not an image. It's a structured document in which each page has one or more content streams -- sequences of operators that tell a PDF renderer what to draw and where.

A text object (the operators between BT and ET) looks like this:

BT
/F1 12 Tf
100 700 Td
(John Smith) Tj
ET

This tells the renderer: use font F1 at size 12, move to position (100, 700), and draw the text "John Smith."

When most tools "redact" a PDF, they draw a black rectangle over the text. The rectangle covers it visually. But the text operator is still in the content stream. The text is still there. Anyone can select it, copy it, or extract it programmatically.
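To make this concrete, here is a minimal sketch that treats a content stream as a plain string, adds a black rectangle on top of the text, and then pulls the text back out. The helper name extractShownText is hypothetical, and the regex only handles simple literal strings shown with Tj; real PDFs also use TJ arrays, hex strings, and font-specific encodings.

```typescript
// A simplified content stream: the text object from above, followed by a
// filled black rectangle drawn over the same area (the "redaction").
const contentStream = [
  "BT",
  "/F1 12 Tf",
  "100 700 Td",
  "(John Smith) Tj",
  "ET",
  "0 0 0 rg",        // set the fill color to black
  "95 695 80 16 re", // rectangle covering the text
  "f",               // fill the rectangle
].join("\n");

// Hypothetical helper: collect every literal string shown with the Tj operator.
function extractShownText(stream: string): string[] {
  const shown = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = shown.exec(stream)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// The rectangle changes what is drawn, not what is stored.
console.log(extractShownText(contentStream));
```

The rectangle operators come after the text object, so the renderer paints over the name, but an extractor that reads the operators never looks at pixels.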

The Three Ways PDF Redaction Fails

1. Visual overlay only

The most common failure. A black rectangle is drawn over the text. The underlying text operators are untouched.

How to check: Open the PDF in Chrome. Try to select text in the redacted area. If you can select it, the redaction failed.

2. Metadata not scrubbed

Even if the visible text is removed, other parts of the file may still contain sensitive information: document metadata (author name, creation date), comments and annotations, edit history kept from earlier saves, and in some cases the original text content.

How to check: Open the PDF in Acrobat or a PDF inspector. Look at Document Properties → Description. Check for comments and annotations.
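For a quick scripted version of this check, the sketch below scans the raw bytes of a file for a few telltale signs. It is a rough heuristic under stated assumptions, not a full parser: inspectPdf is a hypothetical helper, it only sees metadata stored as uncompressed literal strings, it ignores XMP metadata, and a linearized PDF can legitimately contain more than one %%EOF marker.

```typescript
// Hypothetical helper: a rough scan of raw PDF bytes for leftover metadata.
function inspectPdf(bytes: Uint8Array) {
  // latin1 maps each byte to exactly one character, so nothing is lost.
  const raw = Buffer.from(bytes).toString("latin1");

  // Saving with an incremental update appends a new section ending in %%EOF,
  // so several markers can mean earlier revisions are still in the file.
  const eofMarkers = (raw.match(/%%EOF/g) ?? []).length;

  // Plain-text Info dictionary entries such as "/Author (Jane Doe)".
  const infoEntries: Record<string, string> = {};
  const entry =
    /\/(Author|Title|Subject|Keywords|Creator|Producer)\s*\(((?:\\.|[^\\()])*)\)/g;
  let m: RegExpExecArray | null;
  while ((m = entry.exec(raw)) !== null) {
    infoEntries[m[1]] = m[2];
  }

  // /Annots arrays hold comments, highlights, and other annotations.
  const hasAnnotations = /\/Annots\b/.test(raw);

  return { eofMarkers, infoEntries, hasAnnotations };
}

// Made-up sample bytes that imitate a file saved twice.
const sample = Buffer.from(
  "%PDF-1.7\n1 0 obj\n<< /Author (Jane Doe) /Producer (Example Writer) >>\nendobj\n%%EOF\n" +
    "2 0 obj\n<< /Annots [3 0 R] >>\nendobj\n%%EOF\n",
  "latin1"
);
console.log(inspectPdf(sample));
```

An empty result from a scan like this does not prove the file is clean; it only catches the easy cases.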

3. Incorrect layer handling

PDFs can have multiple content layers (optional content groups in the PDF specification). A redaction that removes text from one layer may leave it intact on another.

What True PDF Redaction Looks Like

True redaction removes the text at the content stream level.

The process:

1. Decompress the content stream (most PDFs compress streams with FlateDecode)
2. Tokenize the stream into operators
3. Find and remove text operators that match the redaction target
4. Recompress the stream
5. Draw a black rectangle over the redacted area
6. Scrub document metadata

After true redaction, there is no text to select, copy, or extract. The content doesn't exist in the file.
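The decompress, remove, and recompress steps can be sketched with Node's built-in zlib module. This is an illustration under simplifying assumptions, not how any particular tool implements it: redactStream is a hypothetical helper, it works on a single Flate-compressed stream rather than a whole PDF, it matches only simple literal strings shown with Tj, and it skips drawing the rectangle and scrubbing metadata.

```typescript
import { deflateSync, inflateSync } from "node:zlib";

// Hypothetical helper: remove Tj operations whose literal string contains
// `target` from one FlateDecode-compressed content stream.
function redactStream(compressed: Buffer, target: string): Buffer {
  // Decompress; latin1 keeps a one-to-one mapping between bytes and characters.
  const source = inflateSync(compressed).toString("latin1");

  // Find "(...) Tj" operations and drop the ones that contain the target.
  const showText = /\(((?:\\.|[^\\()])*)\)\s*Tj/g;
  const cleaned = source.replace(showText, (op: string, text: string) =>
    text.includes(target) ? "" : op
  );

  // Recompress. A real tool must also update the stream's /Length entry
  // and the file's cross-reference table.
  return deflateSync(Buffer.from(cleaned, "latin1"));
}

const original = deflateSync(
  Buffer.from("BT\n/F1 12 Tf\n100 700 Td\n(John Smith) Tj\nET\n", "latin1")
);
const redacted = redactStream(original, "John Smith");

// The text object is still there, but it no longer shows any text.
console.log(inflateSync(redacted).toString("latin1"));
```

The important property is that the name is absent from the bytes that get written back, so there is nothing left for a viewer or extraction tool to recover from that stream.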

Why Adobe Acrobat's Redaction Is Overkill for Developers

Adobe Acrobat does support true redaction -- it's under Tools → Redact. But it requires a paid Acrobat Pro subscription ($19.99/month), a desktop app, and a manual workflow.

For developers who need to redact documents programmatically at scale -- thousands of PDFs, automated pipelines, serverless functions -- a desktop tool isn't the answer. You need an API.

Redacting PDFs Programmatically with Forme

Forme's redact endpoint removes text from the content stream. Not a visual overlay -- actual content removal. Metadata is scrubbed automatically on every redaction.

curl -X POST https://api.formepdf.com/v1/redact \
  -H "Authorization: Bearer $FORME_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf": "<base64-encoded-pdf>",
    "presets": ["ssn", "email", "phone"],
    "patterns": [
      { "pattern": "John Smith", "pattern_type": "Literal" },
      { "pattern": "\\d{3}-\\d{2}-\\d{4}", "pattern_type": "Regex" }
    ]
  }' \
  --output redacted.pdf
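If you build the request body in code rather than by hand, a small helper keeps the base64 encoding and regex escaping straight. The field names below (pdf, presets, patterns, pattern_type) come from the curl example above; buildRedactBody and the RedactPattern type are hypothetical names used only for this sketch, and no request is actually sent.

```typescript
// Shape of one entry in "patterns", as used in the curl example.
interface RedactPattern {
  pattern: string;
  pattern_type: "Literal" | "Regex";
}

// Hypothetical helper: serialize the JSON body for POST /v1/redact.
function buildRedactBody(
  pdfBytes: Uint8Array,
  presets: string[],
  patterns: RedactPattern[]
): string {
  return JSON.stringify({
    pdf: Buffer.from(pdfBytes).toString("base64"),
    presets,
    patterns,
  });
}

// Placeholder bytes stand in for a real PDF file.
const body = buildRedactBody(Buffer.from("%PDF-1.7\n", "latin1"), ["ssn"], [
  { pattern: "\\d{3}-\\d{2}-\\d{4}", pattern_type: "Regex" },
]);
console.log(body);
```

JSON.stringify escapes the backslashes for you, so the regex only needs the usual single level of escaping for a TypeScript string literal.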

Built-in presets for common sensitive data

Forme ships with regex presets for the most common PII:

ssn -- Social Security Numbers (xxx-xx-xxxx)
email -- Email addresses
phone -- US phone numbers
date-of-birth -- Date formats (MM/DD/YYYY, YYYY-MM-DD)
credit-card -- 16-digit card numbers

# Redact all common PII in one call
curl -X POST https://api.formepdf.com/v1/redact \
  -H "Authorization: Bearer $FORME_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf": "<base64>",
    "presets": ["ssn", "email", "phone", "date-of-birth"]
  }' \
  --output redacted.pdf

Text-search redaction

You can also redact by literal string or regex pattern:

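import { readFile } from 'node:fs/promises';
import { FormeClient } from '@formepdf/sdk';

const client = new FormeClient({ apiKey: process.env.FORME_API_KEY });

// Load the PDF to redact (the file name is a placeholder)
const pdfBytes = await readFile('input.pdf');

const redacted = await client.redact(pdfBytes, {
  patterns: [
    // Exact string match
    { pattern: 'John Smith', pattern_type: 'Literal' },
    // Regex: two uppercase letters followed by six digits
    { pattern: '[A-Z]{2}\\d{6}', pattern_type: 'Regex' },
  ],
  presets: ['ssn', 'email'],
});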
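Before sending a regex to the API, it can help to preview what it matches on sample text. The sketch below uses JavaScript's own RegExp engine, which is an assumption: Forme's regex dialect may differ on advanced syntax, although simple patterns like these behave the same in most engines. The sample text and the previewMatches helper are made up for illustration.

```typescript
// Made-up sample text for previewing patterns locally.
const sampleText = "Patient John Smith, SSN 123-45-6789, ID AB123456.";

const candidates: Array<{ name: string; regex: RegExp }> = [
  { name: "ssn-like", regex: /\d{3}-\d{2}-\d{4}/g },
  { name: "two letters + six digits", regex: /[A-Z]{2}\d{6}/g },
];

// Hypothetical helper: list every substring a global regex matches.
function previewMatches(text: string, regex: RegExp): string[] {
  return text.match(regex) ?? [];
}

for (const { name, regex } of candidates) {
  console.log(name, previewMatches(sampleText, regex));
}
```

A pattern that matches too much here will also redact too much in production, so it is worth trying a few near-miss strings as well.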

Redaction templates for recurring workflows

For HIPAA compliance, legal discovery, or financial document processing -- save named redaction templates and reference them by slug:

# Create the template once in the dashboard
# Then reference it by slug on every request:
curl -X POST https://api.formepdf.com/v1/redact \
  -H "Authorization: Bearer $FORME_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "pdf": "<base64>",
    "template": "hipaa-patient-record"
  }' \
  --output redacted.pdf

Verifying Your PDF Redaction Worked

After redacting a PDF with Forme:

1. Open the output in Chrome
2. Try to select text in the redacted area
3. The text should not be selectable

You can also verify programmatically:

import { findTextRegions } from '@formepdf/core';

const regions = await findTextRegions(redactedPdf, [
  { pattern: 'John Smith', pattern_type: 'Literal' }
]);

console.log(regions.length); // Should be 0

If findTextRegions returns an empty array, no matching text remains in the content streams it can read.
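For a second opinion that does not depend on the same library, a rough independent check is to inflate every stream in the file and search for the string directly. The sketch below uses only Node's zlib module; pdfContainsText is a hypothetical helper, and it is deliberately crude: it does not decode hex strings, text split across several operators, non-Latin encodings, or streams that use other filters.

```typescript
import { deflateSync, inflateSync } from "node:zlib";

// Hypothetical helper: true if `needle` appears in the raw file or in any
// stream that inflates successfully.
function pdfContainsText(pdf: Buffer, needle: string): boolean {
  const raw = pdf.toString("latin1");
  if (raw.includes(needle)) return true;

  // Stream data sits between the "stream" and "endstream" keywords.
  const streamBody = /stream\r?\n([\s\S]*?)\r?\nendstream/g;
  let m: RegExpExecArray | null;
  while ((m = streamBody.exec(raw)) !== null) {
    try {
      const inflated = inflateSync(Buffer.from(m[1], "latin1")).toString("latin1");
      if (inflated.includes(needle)) return true;
    } catch {
      // Not a Flate stream (image data, other filters); skip it.
    }
  }
  return false;
}

// Made-up fragment imitating one compressed content stream.
const samplePdf = Buffer.concat([
  Buffer.from("1 0 obj\n<< /Filter /FlateDecode >>\nstream\n", "latin1"),
  deflateSync(Buffer.from("BT (John Smith) Tj ET", "latin1")),
  Buffer.from("\nendstream\nendobj\n", "latin1"),
]);
console.log(pdfContainsText(samplePdf, "John Smith"));
```

A hit from a check like this is conclusive that the text survived; a miss is only a good sign, given the encodings the sketch ignores.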

Limitations to Know

Forme's text redaction works on text operators in PDF content streams. A few edge cases:

Scanned documents -- If a PDF contains a scanned image with text in it, that image is not modified. Forme redacts text operators, not pixels. For scanned documents you need OCR first.

CJK text -- Chinese, Japanese, and Korean text uses CIDFont encoding, which requires additional glyph mapping. Forme currently redacts only WinAnsi (Latin) encoded text, so CJK text is left untouched.

Encrypted PDFs -- Cannot be redacted without first decrypting them.

The Compliance Case

For regulated industries -- healthcare, legal, finance, government -- redaction is not optional. HIPAA requires that PHI is actually removed from documents before sharing. A black box overlay does not satisfy the requirement.

The tools most teams reach for -- markup in Acrobat, text boxes in Word, "redacted" overlays in various editors -- produce visually redacted documents that fail a basic copy-paste test.

True redaction requires content stream removal. That's what Forme does.

Getting Started

# Install the SDK
npm install @formepdf/sdk

# Or call the API directly
curl -X POST https://api.formepdf.com/v1/redact \
  -H "Authorization: Bearer $FORME_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "pdf": "<base64>", "presets": ["ssn", "email"] }' \
  --output redacted.pdf

Sign up at app.formepdf.com. The free plan includes 50 operations per month -- enough to test your redaction workflow before committing.

Self-hosting is available via the formepdf/forme Docker image for teams that need documents to stay on their own infrastructure.