How to Anonymize PDF, DOCX, and XLSX Files

Supported Document Formats

anonym.plus processes seven document formats, each with specific size limits and structure preservation characteristics. The app extracts text locally from each format, runs PII detection via the Presidio engine, and produces an anonymized output — all without any network calls.

Format	Max Size	Structure Preserved	Notes
PDF	50 MB	Text layer only	Text is extracted from the PDF text layer. Scanned PDFs require OCR preprocessing. Layout and images pass through unchanged.
DOCX	30 MB	Full formatting	Paragraphs, tables, headers, footers, styles, and fonts are preserved. Only text content is modified.
XLSX	20 MB / 100K rows	Cell structure	Sheet names, formulas, and cell formatting are preserved. PII is detected and replaced within cell text values.
CSV	30 MB	Row/column structure	Delimiter detection is automatic. Headers and data rows are preserved.
JSON	30 MB	Full structure	Object keys, nesting, and arrays are preserved. Only string values containing PII are modified.
XML	30 MB	Full structure	Element hierarchy, attributes, and namespaces are preserved. PII in text nodes and attributes is detected.
TXT	50 MB	Plain text	Line breaks and whitespace are preserved. No formatting to maintain.
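
As a rough illustration, the limits in the table can be expressed as a pre-flight check before a file is dropped into the app. This is a minimal sketch, not anonym.plus code: the names `MAX_SIZE_BYTES` and `check_limits` are hypothetical, and it assumes 1 MB means 1024 × 1024 bytes, which the table does not specify.

```python
from pathlib import Path
from typing import Optional

MB = 1024 * 1024  # assumption: binary megabytes

# Limits copied from the table above; the names here are hypothetical.
MAX_SIZE_BYTES = {
    ".pdf": 50 * MB,
    ".docx": 30 * MB,
    ".xlsx": 20 * MB,
    ".csv": 30 * MB,
    ".json": 30 * MB,
    ".xml": 30 * MB,
    ".txt": 50 * MB,
}
XLSX_MAX_ROWS = 100_000


def check_limits(filename: str, size_bytes: int, rows: Optional[int] = None) -> bool:
    """Return True if the format is supported and the file is within its limits."""
    suffix = Path(filename).suffix.lower()
    limit = MAX_SIZE_BYTES.get(suffix)
    if limit is None or size_bytes > limit:
        return False
    # XLSX has a second limit on the number of rows.
    if suffix == ".xlsx" and rows is not None and rows > XLSX_MAX_ROWS:
        return False
    return True
```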

How File Anonymization Works

Regardless of file format, anonym.plus follows a consistent pipeline for file anonymization:

1. File ingestion. Drop a file onto the dropzone or click to browse. The file is read entirely on your local machine; nothing is uploaded to any server.
2. Text extraction. The app uses format-specific parsers to extract text content. For PDF, this means reading the text layer. For DOCX, it parses the XML structure within the .docx package. For XLSX, it reads cell values across all sheets.
3. PII detection. The extracted text is analyzed by the local Presidio engine combined with spaCy NER models. The engine identifies entities based on your selected detection preset and confidence threshold.
4. Entity review. Detected entities are displayed with color-coded badges. You review each detection, toggling off false positives or adding missed entities manually.
5. Anonymization. You choose an operator (Replace or Encrypt) and click "Anonymize." The engine applies the operator to each enabled entity within the extracted text.
6. Output generation. The anonymized text is written back into the original file format, preserving the document structure. You choose to save as a new file or replace the original.

This pipeline ensures that document formatting, layout, and non-text elements remain intact while all detected PII is processed according to your chosen operator.
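
The detection and review steps can be sketched in a few lines. This is only an illustration under loud assumptions: two regular expressions with made-up fixed scores stand in for the Presidio engine and spaCy models, and the `Entity` class, `detect`, and `enabled_entities` are hypothetical names, not part of anonym.plus or Presidio.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    entity_type: str
    start: int
    end: int
    score: float
    enabled: bool = True  # the reviewer can toggle this off


# Stand-in recognizers with invented confidence scores (NOT Presidio).
RECOGNIZERS = {
    "EMAIL_ADDRESS": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.9),
    "PHONE_NUMBER": (re.compile(r"\+?\d[\d\s-]{7,}\d"), 0.6),
}


def detect(text: str, threshold: float) -> list[Entity]:
    """Return entities whose score meets the confidence threshold, in text order."""
    found = []
    for entity_type, (pattern, score) in RECOGNIZERS.items():
        if score < threshold:
            continue
        for match in pattern.finditer(text):
            found.append(Entity(entity_type, match.start(), match.end(), score))
    return sorted(found, key=lambda e: e.start)


def enabled_entities(entities: list[Entity]) -> list[Entity]:
    """Keep only the detections the reviewer left switched on."""
    return [e for e in entities if e.enabled]


text = "Contact ann@example.com or call +49 30 1234567 today."
entities = detect(text, threshold=0.5)
entities[1].enabled = False  # reviewer marks the phone number as a false positive
to_anonymize = enabled_entities(entities)
```

Raising the threshold trades recall for precision: with these invented scores, a threshold of 0.8 would drop the phone-number recognizer before review even starts.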

Replace Mode: Step by Step

Replace mode permanently substitutes each detected PII entity with a type-based placeholder. This is ideal when you need to share documents externally or create permanently sanitized copies.

1. Drop your file onto the anonym.plus dropzone. The file type is detected automatically.
2. Select a detection preset. For most document workflows, "General PII Detection" or "GDPR Compliance" works well.
3. Click "Start Analysis." The text extraction and PII detection run locally.
4. Review the detected entities in the sidebar. Each entity shows its type (e.g., PERSON, EMAIL_ADDRESS, PHONE_NUMBER), the original value, and a confidence score.
5. Set the operator to "Replace" for each entity type, or set Replace as the global default.
6. Click "Anonymize." Each PII value is replaced with a placeholder like <PERSON> or <EMAIL_ADDRESS>.
7. Choose your output format: same as input, PDF, DOCX, or TXT.
8. Click "Save as New File" to write the anonymized document. The original remains untouched.
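
To make the placeholder substitution concrete, here is a minimal sketch of what a Replace operator does to already-detected spans. The helper names `span_of` and `replace_entities` are hypothetical, and the spans are located by hand here rather than by a detection engine.

```python
def span_of(text: str, value: str, entity_type: str) -> tuple[int, int, str]:
    """Locate the first occurrence of value and return (start, end, entity_type)."""
    start = text.index(value)
    return (start, start + len(value), entity_type)


def replace_entities(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, entity_type) span with a <TYPE> placeholder."""
    # Work right to left so earlier offsets stay valid as the text length changes.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


original = "Ann Lee wrote to ann@example.com."
spans = [
    span_of(original, "Ann Lee", "PERSON"),
    span_of(original, "ann@example.com", "EMAIL_ADDRESS"),
]
sanitized = replace_entities(original, spans)
```

Because the placeholder carries only the entity type, nothing in the output allows the original value to be recovered, which is what makes this mode suitable for external sharing.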

Encrypt Mode: Step by Step

Encrypt mode replaces each PII entity with an AES-256-GCM encrypted token. The original values can be recovered later using the Deanonymize feature with the correct encryption key.

1. Drop your file onto the dropzone.
2. Select a detection preset and click "Start Analysis."
3. Review detected entities.
4. Set the operator to "Encrypt" and select an encryption key from your vault. If you do not have a key, create one in Settings; the key is generated locally and stored in your encrypted vault.
5. Click "Anonymize." Each PII entity is encrypted with AES-256-GCM using a random nonce per entity.
6. Save the encrypted document. Share it safely: recipients cannot read the PII without your encryption key.
7. When you need to restore the original values, use the Deanonymize feature: drop the encrypted file, and the app automatically matches encrypted tokens to your history and loads the correct key.
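
The per-entity encryption can be illustrated with the third-party Python `cryptography` package. This is a sketch under assumptions, not the app's implementation: the token layout (URL-safe Base64 of nonce followed by ciphertext) and the function names are invented for the example, and the actual anonym.plus token format is not documented here.

```python
import base64
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

NONCE_BYTES = 12  # 96-bit nonce, the size commonly used with GCM


def encrypt_value(key: bytes, value: str) -> str:
    """Encrypt one PII value with AES-256-GCM and return a text token."""
    nonce = os.urandom(NONCE_BYTES)  # fresh random nonce for every entity
    ciphertext = AESGCM(key).encrypt(nonce, value.encode("utf-8"), None)
    # Hypothetical token layout: base64(nonce || ciphertext+tag).
    return base64.urlsafe_b64encode(nonce + ciphertext).decode("ascii")


def decrypt_token(key: bytes, token: str) -> str:
    """Recover the original value; raises if the key or token is wrong."""
    raw = base64.urlsafe_b64decode(token)
    nonce, ciphertext = raw[:NONCE_BYTES], raw[NONCE_BYTES:]
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode("utf-8")


key = AESGCM.generate_key(bit_length=256)  # 32-byte key, generated locally
token = encrypt_value(key, "ann@example.com")
restored = decrypt_token(key, token)
```

The random nonce means the same value encrypted twice yields two different tokens, so repeated names or addresses in a document cannot be correlated from the ciphertext alone. GCM also authenticates the data, so decryption with the wrong key fails outright rather than returning garbage.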

Format-Specific Considerations

PDF

PDF anonymization works on the text layer of the document. The app reads text content, positions, and fonts from the PDF, applies anonymization, and writes the modified text back. Images, vector graphics, and other non-text elements are not modified. If your PDF was created from a scanner (image-only PDF), the text layer may be empty — in that case, use the Image Anonymization feature to process individual pages as images with OCR.

For best results with PDFs, ensure the document has a proper text layer (most PDFs created from Word, Excel, or web browsers do). The maximum supported file size is 50 MB.

DOCX

DOCX files are internally XML-based packages. anonym.plus parses the document structure, processes text within paragraphs, tables, headers, and footers, and writes the anonymized content back while preserving all formatting: fonts, styles, colors, bullet points, numbering, and page layout. Embedded images and charts are not modified.

Track changes and comments that contain PII are also processed. The maximum file size is 30 MB.
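
Because a .docx file is a ZIP package of XML parts, its body text can be read with the Python standard library alone. The sketch below shows one way to do that; the function name `docx_paragraphs` is hypothetical and this is not how anonym.plus itself is implemented. It reads only `word/document.xml`, whereas headers and footers live in separate parts of the package.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(data: bytes) -> list[str]:
    """Return the text of each paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join them before detection.
        pieces = [node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Joining the runs matters because a single name can be split across several runs when part of it is formatted differently; detection on individual runs would miss it.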

XLSX

Spreadsheet anonymization processes each cell individually across all sheets. Cell formatting (number formats, colors, borders), formulas, and sheet structure are preserved. PII is detected within cell text values — numeric cells, dates in date-formatted cells, and formula cells are analyzed based on their displayed value.

The limit is 20 MB or 100,000 rows, whichever is reached first. For very large spreadsheets, consider splitting into smaller files or using batch processing.

CSV, JSON, and XML

These structured data formats are parsed natively. CSV delimiter detection is automatic (comma, semicolon, tab, or pipe). JSON objects and arrays maintain their structure — only string values containing PII are modified. XML preserves element hierarchy, attributes, and namespaces. In all three formats, only the data values are anonymized while the structural elements remain intact.
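
The idea of changing values while leaving structure alone can be sketched with the standard library. In this illustration a single email regex stands in for the real detection engine, and `scrub` and `detect_delimiter` are hypothetical names; `csv.Sniffer` is Python's own delimiter guesser, used here only to show the concept.

```python
import csv
import re

# Stand-in detector: one email pattern instead of a full PII engine.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub(value):
    """Walk parsed JSON data, rewriting only string values; keys stay as-is."""
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, str):
        return EMAIL.sub("<EMAIL_ADDRESS>", value)
    return value  # numbers, booleans, and null pass through unchanged


def detect_delimiter(sample: str) -> str:
    """Guess the CSV delimiter among comma, semicolon, tab, and pipe."""
    return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter


record = {"user": {"email": "ann@example.com", "age": 41, "tags": ["vip", "bob@example.org"]}}
cleaned = scrub(record)
```

The same recursive shape applies to XML, where the walk would visit text nodes and attribute values while leaving element names and namespaces untouched.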

TXT

Plain text files are the simplest format to anonymize. The entire file content is treated as text, with line breaks and whitespace preserved. At 50 MB, TXT shares the largest file size limit with PDF. Output is always TXT format.


Known Limitations

File anonymization has format-specific limitations and considerations:

Embedded objects: Images, charts, and embedded objects in PDF/DOCX are not analyzed for text. Extract or anonymize separately.
Metadata preservation: File metadata (author, creation date) is not automatically scrubbed. Use specialized metadata removal tools if needed.
OCR not included: Scanned PDFs or image-based documents require OCR preprocessing before text extraction works reliably.
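
To see what metadata an anonymized DOCX still carries, the core properties part of the package can be inspected with the standard library. This is a hedged sketch for checking a file, not a scrubbing tool: the function name `docx_core_properties` is hypothetical, and it only reads the author, last editor, and creation date from `docProps/core.xml`.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespaces used by the OPC core properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
}


def docx_core_properties(data: bytes) -> dict[str, str]:
    """Return the identifying core properties found in a DOCX package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        if "docProps/core.xml" not in package.namelist():
            return {}
        root = ET.fromstring(package.read("docProps/core.xml"))
    found = {}
    for name, path in FIELDS.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            found[name] = node.text
    return found
```

A non-empty result on an output file is a signal that the author name or timestamps survived anonymization and should be removed with a dedicated metadata tool before sharing.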