How to Anonymize Images with OCR-Based Detection

Detect and redact sensitive text in photos and scans using Tesseract OCR and NER-based PII detection.

By anonym.plus · Published March 2026 · Updated March 2026

How Image Anonymization Works

Image anonymization in anonym.plus combines optical character recognition (OCR) with named entity recognition (NER) to detect and redact personally identifiable information (PII) directly in images. The pipeline processes each image in five stages:

  1. Image loading and EXIF correction. When you load an image, the app reads its EXIF orientation metadata and automatically rotates the image to the correct orientation. Photos taken in portrait mode, or with the camera turned sideways or upside down, are corrected before any text extraction begins.
  2. Tesseract OCR text extraction. The corrected image is passed to Tesseract OCR, which extracts all visible text along with character-level bounding boxes, so each recognized character is mapped to its pixel coordinates in the image. 38 OCR languages are available; select the primary language of the text for the best accuracy.
  3. Presidio NER PII detection. The extracted text is fed to the Presidio NER engine, which identifies PII entities such as person names, email addresses, phone numbers, dates, locations, national IDs, credit card numbers, and more based on your selected detection preset.
  4. Bounding box padding and merging. Each detected PII entity is mapped back to the character-level bounding boxes from the OCR step. Bounding boxes are padded by 4 pixels on each side to ensure full coverage. Adjacent boxes for multi-word entities (such as full names like "John Smith") are merged into a single contiguous region.
  5. Redaction box rendering. Colored rectangles are drawn over each detected PII region, completely covering the original text in the image. You can configure the fill color: black, red, green, blue, or gray. The output is always a PNG image with PII visually redacted.
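
Stage 4 above can be sketched in a few lines of plain Python. This is a minimal illustration, not the anonym.plus implementation: the 4-pixel padding comes from the description above, while the function names, the `MAX_GAP_PX` merge threshold, and the synthetic character boxes are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

PADDING_PX = 4   # padding per side, as stated in the pipeline description
MAX_GAP_PX = 20  # assumed merge threshold; the real value is not documented


def pad_box(box: Box, img_w: int, img_h: int, pad: int = PADDING_PX) -> Box:
    """Grow a box by `pad` pixels per side, clamped to the image bounds."""
    left, top, right, bottom = box
    return (max(0, left - pad), max(0, top - pad),
            min(img_w, right + pad), min(img_h, bottom + pad))


def merge_adjacent(boxes: List[Box], max_gap: int = MAX_GAP_PX) -> List[Box]:
    """Union consecutive boxes that share a text line and sit close together.

    Boxes are expected in reading order, the order OCR output usually has.
    """
    merged: List[Box] = []
    for left, top, right, bottom in boxes:
        if merged:
            m_left, m_top, m_right, m_bottom = merged[-1]
            same_line = top < m_bottom and bottom > m_top  # vertical overlap
            if same_line and left - m_right <= max_gap:
                merged[-1] = (min(m_left, left), min(m_top, top),
                              max(m_right, right), max(m_bottom, bottom))
                continue
        merged.append((left, top, right, bottom))
    return merged


def redaction_regions(char_boxes: List[Optional[Box]], start: int, end: int,
                      img_w: int, img_h: int) -> List[Box]:
    """Map the entity text[start:end] to padded, merged pixel regions."""
    boxes = [b for b in char_boxes[start:end] if b is not None]
    padded = [pad_box(b, img_w, img_h) for b in boxes]
    return merge_adjacent(padded)


# Synthetic example: "Hi John Smith" on one line, 10 px per character.
# Spaces carry no box (None), mimicking whitespace in OCR output.
text = "Hi John Smith"
char_boxes = [None if ch == " " else (100 + 10 * i, 50, 110 + 10 * i, 70)
              for i, ch in enumerate(text)]
start = text.index("John")
regions = redaction_regions(char_boxes, start, start + len("John Smith"), 400, 200)
print(regions)  # one region covering both words
```

Because merging only joins boxes that overlap vertically, an entity that wraps onto a second line yields one region per line instead of a single oversized rectangle.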

The entire pipeline runs locally on your machine, and no images are sent to any server. The original image is never modified; a new redacted copy is created instead.

Supported Formats and Limits

anonym.plus supports four image formats for anonymization, each with specific characteristics:

Format | Extensions | Notes
PNG | .png | Lossless compression. Best for screenshots and digital documents.
JPEG | .jpg, .jpeg | Lossy compression. Common for photos. EXIF orientation auto-corrected.
BMP | .bmp | Uncompressed bitmap. Large file sizes but no quality loss.
TIFF | .tiff, .tif | Common for scanned documents. Supports multi-page (first page processed).

Size limits: Maximum file size is 10 MB. Maximum resolution is 25 megapixels. Images exceeding these limits are rejected with a clear error message. All output is saved as PNG regardless of the input format.
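
The limits above amount to three checks before OCR starts. The sketch below is a hypothetical pre-flight check, not the app's own code: it assumes 10 MB means 10 × 1024 × 1024 bytes and 25 megapixels means 25 million pixels, and the function names and the `_redacted` output suffix are invented for illustration.

```python
import os
from typing import List

MAX_FILE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
MAX_PIXELS = 25_000_000            # assumption: "25 megapixels" read as 25 million pixels
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif"}


def check_image(filename: str, file_bytes: int, width: int, height: int) -> List[str]:
    """Return a list of problems; an empty list means the image is acceptable."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if file_bytes > MAX_FILE_BYTES:
        problems.append(f"file too large: {file_bytes} bytes")
    if width * height > MAX_PIXELS:
        problems.append(f"resolution too high: {width}x{height}")
    return problems


def output_name(filename: str) -> str:
    """Output is always PNG; the '_redacted' suffix is a made-up convention."""
    stem = os.path.splitext(filename)[0]
    return stem + "_redacted.png"


print(check_image("scan.tiff", 5_000_000, 4000, 3000))   # 12 MP, within limits
print(check_image("photo.jpg", 11 * 1024 * 1024, 6000, 5000))
print(output_name("photo.jpg"))
```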

Step-by-Step Walkthrough

Follow these steps to anonymize an image from start to finish:

  1. Open the Image tab. Switch to the Image tab in the anonymization panel. The dropzone accepts PNG, JPG, BMP, and TIFF files up to 10 MB.
  2. Drop an image. Drag and drop your image onto the dropzone or click to browse. Once loaded, configure the fill color (black, red, green, blue, or gray), select a detection preset, and choose the OCR language matching the text in your image.
  3. Click Analyze. Tesseract OCR extracts all visible text from the image with character-level bounding boxes. The Presidio NER engine then detects PII entities within the extracted text and maps them back to pixel coordinates.
  4. Review detected entities. Each detected PII region is highlighted with a colored bounding box on the image preview. Entity type filter badges with checkboxes let you toggle entire categories on or off — for example, disable all DATE_TIME detections if dates are not sensitive in your context.
  5. Click Redact Selected. The app draws colored fill rectangles over all enabled PII regions, permanently covering the original text in the output image. Only checked entity types are redacted.
  6. Compare and save. Use the before/after comparison to verify redaction coverage. Click Save to download the redacted PNG image to your filesystem.
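
The review logic in steps 4 and 5 comes down to filtering detections by entity type before anything is drawn. The following is a minimal sketch under assumptions: the `Detection` class and both function names are hypothetical, the sample detections are invented, and the entity labels follow Presidio's naming (`PERSON`, `EMAIL_ADDRESS`, `DATE_TIME`).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Detection:
    entity_type: str                 # e.g. "PERSON", "DATE_TIME"
    text: str                        # OCR text covered by the entity
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


def entity_type_counts(detections: List[Detection]) -> Dict[str, int]:
    """Counts per entity type, as might be shown on the filter badges."""
    return dict(Counter(d.entity_type for d in detections))


def select_for_redaction(detections: List[Detection],
                         enabled_types: Set[str]) -> List[Detection]:
    """Keep only detections whose entity type is still checked."""
    return [d for d in detections if d.entity_type in enabled_types]


# Invented sample data standing in for the output of the Analyze step.
detections = [
    Detection("PERSON", "John Smith", (126, 46, 234, 74)),
    Detection("DATE_TIME", "12 March 2026", (126, 96, 260, 124)),
    Detection("EMAIL_ADDRESS", "john@example.com", (126, 146, 290, 174)),
]

# The user unchecks DATE_TIME because dates are not sensitive here.
enabled = set(entity_type_counts(detections)) - {"DATE_TIME"}
to_redact = select_for_redaction(detections, enabled)
print([d.entity_type for d in to_redact])
```

Only the boxes in `to_redact` would then be covered with the chosen fill color; unchecked types stay visible in the output image.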

Tips for Best Results

Image anonymization quality depends heavily on OCR accuracy. The workarounds listed under Known Limitations below double as guidelines for maximizing detection reliability.

Known Limitations

Image anonymization has inherent limitations related to OCR technology. Understanding these helps set appropriate expectations:

Limitation | Description | Workaround
Photos of screens | Moiré patterns, glare, and reflections degrade OCR accuracy | Use screenshots or direct digital exports instead
Handwritten text | Tesseract is optimized for printed/typed text only | No reliable workaround; manual redaction needed
Low resolution (<150 DPI) | Insufficient detail for reliable character recognition | Rescan at 300+ DPI or upscale before processing
Rotated/skewed text (>15°) | Tesseract cannot reliably extract angled text | Straighten or deskew the image before uploading
Complex backgrounds | Watermarks, textures, and overlapping elements confuse OCR | Crop to clean text areas; increase contrast
Very small text (<8pt) | Falls below OCR detection threshold | Zoom/crop to enlarge the text region
Multi-column layouts | OCR reading order may become confused across columns | Process each column as a separate cropped image
NER language model | NER uses the English spaCy model; person name detection is strongest for English and Latin-script names | Pattern-based entities (phone numbers, IBANs, emails, credit cards) work across all languages
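
The last row is easier to see with an example: pattern-based entities are matched by their shape, not by the surrounding language. The sketch below is a simplified illustration and not Presidio's recognizers; the two regular expressions and the function names are assumptions, and the Luhn checksum is the standard check used to validate card numbers.

```python
import re
from typing import List, Tuple

# Deliberately simple patterns for illustration; real recognizers are stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pattern_entities(text: str) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) tuples sorted by position."""
    found = [("EMAIL_ADDRESS", m.start(), m.end()) for m in EMAIL_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group()):  # drop digit runs that fail the checksum
            found.append(("CREDIT_CARD", m.start(), m.end()))
    return sorted(found, key=lambda e: e[1])


# German sentence with a well-known test card number.
text = "Kontakt: anna@example.de, Karte 4111 1111 1111 1111."
for entity_type, start, end in find_pattern_entities(text):
    print(entity_type, text[start:end])
```

The surrounding German words play no role in either match, which is why these entity types remain usable even though the NER model itself is English.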

For any image anonymization task, always review the detected entities before redacting. The review step lets you catch false positives from OCR noise and false negatives where PII was missed.
