How Image Anonymization Works
Image anonymization in anonym.plus combines optical character recognition (OCR) with named entity recognition (NER) to detect and redact personally identifiable information directly in images. The pipeline processes each image in five stages:
- Image loading and EXIF correction. When you load an image, the app reads its EXIF orientation metadata and automatically rotates the image upright. This corrects photos taken in portrait mode or with the camera held sideways or upside down before any text extraction begins. It does not straighten skewed or tilted text.
- Tesseract OCR text extraction. The corrected image is passed to Tesseract OCR, which extracts all visible text along with character-level bounding boxes. Each recognized character is mapped to its precise pixel coordinates in the image. Tesseract supports 38 OCR languages, and you select the primary language of the text for optimal accuracy.
- Presidio NER PII detection. The extracted text is fed to the Presidio NER engine, which identifies PII entities such as person names, email addresses, phone numbers, dates, locations, national IDs, credit card numbers, and more based on your selected detection preset.
- Bounding box padding and merging. Each detected PII entity is mapped back to the character-level bounding boxes from the OCR step. Bounding boxes are padded by 4 pixels on each side to ensure full coverage. Adjacent boxes for multi-word entities (such as full names like "John Smith") are merged into a single contiguous region.
- Redaction box rendering. Colored rectangles are drawn over each detected PII region, completely covering the original text in the image. You can configure the fill color: black, red, green, blue, or gray. The output is always a PNG image with PII visually redacted.
The entire pipeline runs locally on your machine. No images are uploaded to any server. The original image is never modified; a new redacted copy is created.
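The mapping, padding, and merging stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the app's actual code: the `Box` type, the function names, the clamping to image bounds, and the single-line-entity simplification are all hypothetical. Only the 4-pixel padding comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional

PADDING_PX = 4  # padding per side, as stated in the pipeline description


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int


def pad_box(box: Box, img_w: int, img_h: int, pad: int = PADDING_PX) -> Box:
    """Grow a box by `pad` pixels per side, clamped to the image bounds (assumed)."""
    return Box(
        max(0, box.left - pad),
        max(0, box.top - pad),
        min(img_w, box.right + pad),
        min(img_h, box.bottom + pad),
    )


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    return Box(min(a.left, b.left), min(a.top, b.top),
               max(a.right, b.right), max(a.bottom, b.bottom))


def entity_region(char_boxes: List[Optional[Box]], start: int, end: int,
                  img_w: int, img_h: int) -> Optional[Box]:
    """Merge the padded boxes of characters text[start:end] into one region.

    char_boxes[i] is the box of character i in the OCR text, or None for
    characters without a glyph (spaces, newlines). Assumes the entity sits
    on a single text line; a wrapped entity would need one region per line.
    """
    region = None
    for box in char_boxes[start:end]:
        if box is None:
            continue
        padded = pad_box(box, img_w, img_h)
        region = padded if region is None else union(region, padded)
    return region


# Hypothetical OCR output: each glyph is 10 px wide and 20 px tall.
text = "Hi John Smith"
char_boxes = [
    None if ch == " " else Box(50 + 10 * i, 100, 60 + 10 * i, 120)
    for i, ch in enumerate(text)
]
start = text.index("John")
end = start + len("John Smith")
region = entity_region(char_boxes, start, end, img_w=800, img_h=600)
print(region)  # Box(left=76, top=96, right=184, bottom=124)
```

Because the space between "John" and "Smith" has no box of its own, taking the union of the padded character boxes is what turns the two words into one contiguous rectangle.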
Supported Formats and Limits
anonym.plus supports four image formats for anonymization, each with specific characteristics:
| Format | Extensions | Notes |
|---|---|---|
| PNG | .png | Lossless compression. Best for screenshots and digital documents. |
| JPEG | .jpg, .jpeg | Lossy compression. Common for photos. EXIF orientation auto-corrected. |
| BMP | .bmp | Uncompressed bitmap. Large file sizes but no quality loss. |
| TIFF | .tiff, .tif | Common for scanned documents. Multi-page files are accepted, but only the first page is processed. |
Size limits: Maximum file size is 10 MB. Maximum resolution is 25 megapixels. Images exceeding these limits are rejected with a clear error message. All output is saved as PNG regardless of the input format.
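If you batch-prepare images, a pre-check along these lines can catch files that would be rejected. It is a rough sketch, not the app's validation logic: the function name and messages are hypothetical, and it assumes "10 MB" means 10 × 1024 × 1024 bytes and "25 megapixels" means 25 million pixels (width × height).

```python
from pathlib import Path
from typing import List

MAX_FILE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
MAX_PIXELS = 25_000_000            # assumption: 25 megapixels = 25 million pixels
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif"}


def check_image(filename: str, size_bytes: int, width: int, height: int) -> List[str]:
    """Return a list of problems; an empty list means the image passes the pre-check."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    if width * height > MAX_PIXELS:
        problems.append(f"resolution too high: {width * height} pixels")
    return problems


# A 12-megapixel TIFF scan of about 2 MB passes.
print(check_image("scan.TIF", 2_000_000, 4000, 3000))  # []
# An 11 MiB, 30-megapixel photo fails on both limits.
print(check_image("photo.jpg", 11 * 1024 * 1024, 6000, 5000))
```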
Step-by-Step Walkthrough
Follow these steps to anonymize an image from start to finish:
- Open the Image tab. Switch to the Image tab in the anonymization panel. The dropzone accepts PNG, JPG, BMP, and TIFF files up to 10 MB.
- Drop an image. Drag and drop your image onto the dropzone or click to browse. Once loaded, configure the fill color (black, red, green, blue, or gray), select a detection preset, and choose the OCR language matching the text in your image.
- Click Analyze. Tesseract OCR extracts all visible text from the image with character-level bounding boxes. The Presidio NER engine then detects PII entities within the extracted text and maps them back to pixel coordinates.
- Review detected entities. Each detected PII region is highlighted with a colored bounding box on the image preview. Entity type filter badges with checkboxes let you toggle entire categories on or off — for example, disable all DATE_TIME detections if dates are not sensitive in your context.
- Click Redact Selected. The app draws colored fill rectangles over all enabled PII regions, permanently covering the original text in the output image. Only checked entity types are redacted.
- Compare and save. Use the before/after comparison to verify redaction coverage. Click Save to download the redacted PNG image to your filesystem.
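The review and redact steps amount to filtering detections by entity type. The sketch below illustrates that logic only; the `Detection` type and function names are hypothetical, and the sample detections are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Detection:
    entity_type: str                # e.g. "PERSON", "DATE_TIME"
    text: str                       # OCR text covered by the detection
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def badge_counts(detections: Iterable[Detection]) -> Dict[str, int]:
    """Number of detections per entity type, as a filter badge might show."""
    return dict(Counter(d.entity_type for d in detections))


def select_for_redaction(detections: Iterable[Detection],
                         enabled_types: Set[str]) -> List[Detection]:
    """Keep only the detections whose entity type is still checked."""
    return [d for d in detections if d.entity_type in enabled_types]


# Made-up detections for illustration.
detections = [
    Detection("PERSON", "John Smith", (76, 96, 184, 124)),
    Detection("DATE_TIME", "12 March 2024", (76, 140, 210, 168)),
    Detection("EMAIL_ADDRESS", "john@example.com", (76, 184, 250, 212)),
]

# Uncheck DATE_TIME, keep everything else enabled.
enabled = set(badge_counts(detections)) - {"DATE_TIME"}
to_redact = select_for_redaction(detections, enabled)
print([d.entity_type for d in to_redact])  # ['PERSON', 'EMAIL_ADDRESS']
```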
Tips for Best Results
Image anonymization quality depends heavily on OCR accuracy. Follow these guidelines to maximize detection reliability:
- Use screenshots, not camera photos. Screenshots of digital content produce far better OCR results than photos of screens, which suffer from moire patterns, glare, and reduced contrast.
- Select the correct OCR language. Mismatched language selection is the single most common cause of poor results. If your image contains German text, select German — not English.
- Use 300+ DPI for scans. Scanned documents should be at least 300 DPI for reliable text extraction. Images below 150 DPI produce significantly degraded results.
- Crop to the text area. Removing large non-text regions (photos, logos, whitespace) speeds up processing and reduces false positives from background noise.
- Ensure good contrast. Dark text on a light background works best. Low contrast between text and background significantly reduces OCR accuracy.
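To check whether a scan meets the DPI guideline, divide its pixel width by the physical width of the page in inches. The helper below is a small sketch of that arithmetic; the function names and the "marginal" label for the 150 to 300 DPI band are assumptions made for the example, while the 300 and 150 thresholds come from the guidelines above.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Dots per inch along one dimension of a scanned page."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def scan_quality(dpi: float) -> str:
    """Classify a scan against the 300 DPI and 150 DPI guidelines."""
    if dpi >= 300:
        return "recommended"
    if dpi >= 150:
        return "marginal"  # assumed label for the in-between band
    return "poor"


# A US Letter page (8.5 in wide) scanned at 2550 px wide is 300 DPI.
dpi = effective_dpi(2550, 8.5)
print(dpi, scan_quality(dpi))  # 300.0 recommended
# The same page at 1020 px wide is only 120 DPI.
print(scan_quality(effective_dpi(1020, 8.5)))  # poor
```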
Known Limitations
Image anonymization has inherent limitations related to OCR technology. Understanding these helps set appropriate expectations:
| Limitation | Description | Workaround |
|---|---|---|
| Photos of screens | Moire patterns, glare, and reflections degrade OCR accuracy | Use screenshots or direct digital exports instead |
| Handwritten text | Tesseract is optimized for printed/typed text only | No reliable workaround; manual redaction needed |
| Low resolution (<150 DPI) | Insufficient detail for reliable character recognition | Rescan at 300+ DPI or upscale before processing |
| Rotated/skewed text (>15°) | Tesseract cannot reliably extract angled text | Straighten or deskew the image before uploading |
| Complex backgrounds | Watermarks, textures, and overlapping elements confuse OCR | Crop to clean text areas; increase contrast |
| Very small text (<8pt) | Falls below OCR detection threshold | Zoom/crop to enlarge the text region |
| Multi-column layouts | OCR reading order may become confused across columns | Process each column as a separate cropped image |
| NER language model | NER uses the English spaCy model; person name detection is strongest for English and Latin-script names | Rely on pattern-based entities (phone numbers, IBANs, emails, credit cards), which work across all languages, and check names manually during review |
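Pattern-based entities are language-independent because they match on the shape of the value rather than on the surrounding words. The sketch below illustrates the idea with a simplified email regex and a card-number regex backed by a Luhn checksum. These patterns and the function names are illustrative assumptions, not the recognizers the app actually uses.

```python
import re
from typing import List, Tuple

# Simplified patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right and sum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pattern_entities(text: str) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) spans, sorted by start offset."""
    found = [("EMAIL_ADDRESS", m.start(), m.end()) for m in EMAIL_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group()):  # drop digit runs that fail the checksum
            found.append(("CREDIT_CARD", m.start(), m.end()))
    return sorted(found, key=lambda e: e[1])


# German text: the patterns still match although no German NER model is involved.
text = "Kontakt: max@example.de, Karte 4111 1111 1111 1111"
for entity_type, start, end in find_pattern_entities(text):
    print(entity_type, text[start:end])
# EMAIL_ADDRESS max@example.de
# CREDIT_CARD 4111 1111 1111 1111
```

The character offsets returned here are the same kind of span that the pipeline maps back to OCR bounding boxes.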
For any image anonymization task, always review the detected entities before redacting. The review step lets you catch false positives from OCR noise and false negatives where PII was missed.