PDFs that carry their own data
Embed structured JSON inside any PDF. Extract it later without OCR. Invoices carry their line items, reports carry their datasets.
A PDF is a visual format. It tells a viewer where to draw text and lines, not what the text means. If you generate an invoice PDF from { customer: 'Acme', total: 503 }, that structured data is gone. The PDF has characters positioned on a canvas that spell out "Acme" and "$503.00", but no machine can reliably get the data back without OCR or heuristic parsing.
Forme now solves this. Every PDF can carry its source data as a hidden attachment.
Embedding
Pass embedData when rendering:
import { renderDocument } from '@formepdf/core';
const invoice = {
number: 'INV-2024-001',
customer: 'Acme Corp',
items: [{ name: 'Widget Pro', qty: 5, price: 49 }],
total: 245,
};
const pdf = await renderDocument(
<Document>
<Page>
<Text>Invoice {invoice.number}</Text>
<Text>Total: ${invoice.total}</Text>
</Page>
</Document>,
{ embedData: invoice }
);
The PDF looks and prints the same. The data is compressed and stored as a forme-data.json file attachment inside the PDF binary, following the PDF 1.7 EmbeddedFile specification.
Extracting
One function call:
import { extractData } from '@formepdf/core';
import { readFileSync } from 'fs';
const data = extractData(new Uint8Array(readFileSync('invoice.pdf')));
// { number: 'INV-2024-001', customer: 'Acme Corp', items: [...], total: 245 }
Returns null for PDFs without embedded data. Works on any Forme-generated PDF that used embedData.
Three patterns
1. Programmatic (opt-in) — When using @formepdf/core directly, you choose what to embed. It can be the full template data, a subset, or a record ID for lookup.
// Full data
await renderDocument(el, { embedData: invoice });
// Just a reference
await renderDocument(el, { embedData: { id: 'inv_abc123' } });
2. Hosted API (automatic) — The Forme hosted API embeds the request body into every rendered PDF automatically. No opt-in needed.
# Render — data embedded automatically
curl -X POST https://api.formepdf.com/v1/render/invoice \
-H 'Authorization: Bearer forme_sk_...' \
-d '{"customer": "Acme", "total": 503}'
# Extract
curl -X POST https://api.formepdf.com/v1/extract \
-H 'Content-Type: application/pdf' \
--data-binary @invoice.pdf
# → {"data": {"customer": "Acme", "total": 503}}
3. Templates — renderTemplate() embeds the data JSON automatically, since the template and data are already separate.
Why this matters
PDFs are the universal document format. They get emailed, archived, printed, and forgotten. When someone needs the data back — to import into accounting software, to reconcile payments, to audit a report — they reach for OCR, regex parsing, or manual re-entry.
Embedded data eliminates that entire class of problem. The PDF is both the human-readable document and the machine-readable data. One file, two audiences.
Invoice processing: Accounting systems extract line items directly. No OCR errors, no layout-dependent parsing.
Form submissions: PDF forms carry structured responses. Downstream systems consume JSON, not pixel coordinates.
Archival: The source data lives alongside its visual representation. Regenerate or audit years later.
Round-tripping: Extract data → modify → re-render. The PDF is never a dead end.
How it works under the hood
The data is JSON-stringified, DEFLATE-compressed, and stored as an EmbeddedFile stream object in the PDF. A FileSpec entry named forme-data.json references it via the document's Names tree. This follows the PDF 1.7 specification — the attachment is invisible in most viewers but can be listed in attachment panels.
extractData() scans the PDF bytes for the forme-data.json FileSpec, resolves the stream object reference, decompresses, and parses. Pure TypeScript with node:zlib — no WASM needed for extraction.
The attachment adds negligible size. A 10KB JSON payload compresses to ~2KB in the PDF stream. For a typical invoice, the embedded data is smaller than a single embedded font glyph table.