The Challenge
An enterprise AI team is fine-tuning a customer service LLM on 18 months of support ticket data. The dataset contains 240,000 JSON records with customer names, email addresses, account numbers, product serial numbers, and free-text descriptions that include PII. The EU AI Act (Art. 10, effective August 2026) requires data governance practices ensuring that training data for high-risk AI applications is free of unnecessary personal data. Uploading the raw dataset to a cloud anonymization service would itself risk violating the GDPR, so the data must stay within the company's EU data center.
The Solution
The ML engineering team installs anonym.plus on a workstation within the EU data center. They split the 240,000-record dataset into 120 JSONL files of 2,000 records each (about 25 MB per file on average). Using Batch mode with 5 parallel workers, they process all 120 files in approximately 90 minutes. A custom preset covers six entity types (PERSON, EMAIL_ADDRESS, PHONE_NUMBER, IBAN_CODE, IP_ADDRESS, CREDIT_CARD) plus a custom entity for product serial numbers (regex: SN-[A-Z0-9]{10}). The Replace operator substitutes detected values with labels, which makes the anonymization irreversible. The processing history is exported as CSV for the Art. 11 technical documentation.
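The two preparation steps described above, splitting the dataset and defining the serial number pattern, can be sketched in Python before anything is loaded into anonym.plus. This is a minimal sketch under stated assumptions: the function name `split_jsonl`, the output file naming, and the `\b` word boundaries around the pattern are illustrative choices, not part of anonym.plus. Only the regex body and the chunk size of 2,000 records come from the text.

```python
import itertools
import re
from pathlib import Path

# Pattern quoted in the text. The \b word boundaries are an added assumption
# so that a longer token such as SN-AB12CD34EFG is not partially matched.
SERIAL_RE = re.compile(r"\bSN-[A-Z0-9]{10}\b")


def split_jsonl(source: Path, out_dir: Path, records_per_file: int = 2000) -> list[Path]:
    """Write consecutive chunks of `records_per_file` records to numbered files.

    Hypothetical helper: assumes one JSON record per line in `source`.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with source.open(encoding="utf-8") as src:
        # Skip blank lines so they are not counted as records.
        records = (line.rstrip("\n") + "\n" for line in src if line.strip())
        for index in itertools.count(1):
            chunk = list(itertools.islice(records, records_per_file))
            if not chunk:
                break
            target = out_dir / f"{source.stem}_{index:04d}.jsonl"
            with target.open("w", encoding="utf-8") as dst:
                dst.writelines(chunk)
            written.append(target)
    return written
```

With 240,000 records and the default chunk size, this yields 240,000 / 2,000 = 120 files, matching the scenario. Testing `SERIAL_RE` against a few real serial numbers first helps confirm the pattern before it is entered as a custom entity.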
The Results
- 240,000 records anonymized — 6 PII categories + 1 custom entity processed in 90 minutes
- Anonymized dataset exits GDPR scope — no lawful basis required for training, no data subject rights apply
- EU AI Act Art. 10 data governance requirement met — documented in technical file
- Training data never left the EU data center — full data residency maintained
- No DPA required with the training infrastructure provider — anonymized data only
- Processing history CSV provides audit trail for Art. 11 technical documentation
Training Data Formats Supported
- JSON / JSONL — instruction tuning datasets, chat conversations, annotation files (30 MB per file)
- CSV — tabular training data, labeled examples, evaluation sets (30 MB)
- TXT — pretraining corpora, raw document collections (50 MB)
- XLSX — human-annotated datasets, scoring sheets (20 MB / 100K rows)
- PDF / DOCX — document classification corpora, knowledge base documents
For datasets larger than the per-file limits shown in parentheses above, split them into chunks and process the chunks in Batch mode. The Pro plan processes up to 20 files simultaneously.
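A quick pre-flight check can flag files that exceed the stated limits and group the rest into batches of up to 20. The sketch below is illustrative only: the helper names are hypothetical, the limits are copied from the list above, and treating "MB" as decimal megabytes is an assumption. The 100K row limit for XLSX is not checked here, and PDF / DOCX are skipped because no limit is stated for them.

```python
from pathlib import PurePath

MB = 1_000_000  # assumption: the stated limits are decimal megabytes

# Per-file limits from the format list; PDF and DOCX have no stated limit.
LIMITS_MB = {".json": 30, ".jsonl": 30, ".csv": 30, ".txt": 50, ".xlsx": 20}


def exceeds_limit(filename: str, size_bytes: int) -> bool:
    """Return True if the file is larger than the limit for its extension."""
    limit = LIMITS_MB.get(PurePath(filename).suffix.lower())
    return limit is not None and size_bytes > limit * MB


def batches(files: list[str], batch_size: int = 20) -> list[list[str]]:
    """Group files into consecutive batches of at most `batch_size` files."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
```

For the 120 chunk files in the scenario, `batches` produces 6 groups of 20. Any file that `exceeds_limit` reports would need to be split further before processing.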
EU AI Act Art. 10 Documentation
After anonymizing training data, document the following in the AI system's technical file (Art. 11):
- Data governance practice: PII removed from training data using anonym.plus [version], Replace operator, GDPR Compliance preset
- Entity types detected and removed: [list from processing history export]
- Processing date and dataset version: [timestamp from history]
- Residual risk assessment: Replace operator produces true anonymization (GDPR Recital 26); re-identification not possible from output data
- Data residency: Processing performed locally on EU infrastructure; no data transferred outside the data center
Frequently Asked Questions
How do I remove PII from AI training data for GDPR and EU AI Act compliance?
Load training files (JSON, JSONL, CSV, TXT, XLSX) into anonym.plus. Select the GDPR Compliance preset or configure the entity types yourself. Choose the Replace operator for permanent anonymization. Process large datasets in Batch mode. The anonymized output exits GDPR scope and meets EU AI Act Art. 10 data governance requirements.
Does anonym.plus process JSONL format training datasets?
Yes. JSON and JSONL files of up to 30 MB each are supported. anonym.plus parses text fields and replaces detected PII with labels. The structure is preserved, so the JSONL file remains valid for training pipelines after anonymization.
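Teams that want to confirm structure preservation on their own data can compare each anonymized file with its source before training. The following is a minimal sketch, not part of anonym.plus: the function name is hypothetical, and it assumes the original file is valid JSONL with one record per line and that record order is unchanged.

```python
import json
from typing import Iterable


def check_jsonl_structure(original: Iterable[str], anonymized: Iterable[str]) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    orig = [line for line in original if line.strip()]
    anon = [line for line in anonymized if line.strip()]
    if len(orig) != len(anon):
        problems.append(f"record count changed: {len(orig)} -> {len(anon)}")
    for number, (before, after) in enumerate(zip(orig, anon), start=1):
        try:
            after_obj = json.loads(after)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        before_obj = json.loads(before)  # assumes the source line is valid JSON
        if isinstance(before_obj, dict) and isinstance(after_obj, dict):
            # Only top-level keys are compared; values are expected to differ.
            if before_obj.keys() != after_obj.keys():
                problems.append(f"line {number}: top-level keys differ")
        elif type(before_obj) is not type(after_obj):
            problems.append(f"line {number}: top-level type changed")
    return problems
```

Passing two open file handles works as well as two lists of strings, since the function only iterates over lines. The check covers structure only and says nothing about whether all PII was detected.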