The EU AI Act imposes data governance obligations on providers of high-risk AI systems under Article 10. For any training dataset that contains personal data, the fastest path to compliance is anonymization — removing PII before it ever enters the training pipeline. anonym.plus processes training datasets entirely offline, keeping your data inside your infrastructure.
Who Is Affected by EU AI Act Art. 10
Article 10 applies to providers of high-risk AI systems — organizations that develop, train, or deploy AI systems listed in Annex III of the EU AI Act. These include:
- AI systems for biometric identification and categorization
- AI used in critical infrastructure (transport, energy, water)
- Educational and vocational training AI
- AI in employment decisions (hiring, HR management, worker monitoring)
- Essential private and public services (credit scoring, insurance risk assessment)
- Law enforcement AI
- Migration, asylum, and border control AI
- AI in administration of justice
Organizations that fine-tune foundation models (GPT-4, Claude, Llama) on their proprietary datasets for these purposes are also covered.
What Article 10 Requires for Training Data
Article 10 mandates that training, validation, and testing data must:
- Be relevant, sufficiently representative and, to the best extent possible, free of errors and complete for the intended purpose
- Have appropriate statistical properties for the AI's use case
- Take into account biases that could lead to prohibited discrimination
- Be subject to documented data governance practices — covering origin, collection methods, preprocessing, and known limitations
- Not contain special categories of personal data, unless the exceptional conditions of Art. 10(5) apply (bias detection and correction in high-risk AI, under strict safeguards)
The safest default is training data for high-risk AI that contains no personal data at all. If it does contain personal data, organizations must demonstrate a specific lawful basis under GDPR and apply strict technical safeguards.
Anonymization as the Compliance Path
Removing personal data from training datasets before the AI training pipeline begins is the most straightforward route to Art. 10 compliance:
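- Anonymized training data is not personal data (GDPR Recital 26), provided individuals can no longer be identified by means reasonably likely to be used. No GDPR lawful basis is required for training, no data subject rights apply to the dataset, and no data processing agreement (DPA) is needed for processors handling the dataset.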
- Art. 10's default requirement is met — the training data does not contain personal data.
- Data governance documentation is simplified — you document that PII was removed, what entity types were detected, and what tool was used.
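One way to enforce "before the training pipeline begins" is a gate that rejects records still containing obvious direct identifiers. The following is a minimal sketch under stated assumptions, not anonym.plus functionality: it assumes JSONL input with a hypothetical `text` field, and the two regular expressions (email address, IBAN-like string) are illustrative and far from exhaustive. It is a spot check on already anonymized output, not a substitute for a detection tool.

```python
import json
import re
from typing import Dict, Iterable, List

# Illustrative patterns only; real datasets need much broader coverage.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def find_residual_identifiers(text: str) -> Dict[str, int]:
    """Count pattern matches in text, keeping only patterns that matched."""
    hits = {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}
    return {name: count for name, count in hits.items() if count}


def check_jsonl_records(lines: Iterable[str], text_field: str = "text") -> List[int]:
    """Return the 1-based numbers of records that still contain a match."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if find_residual_identifiers(str(record.get(text_field, ""))):
            flagged.append(number)
    return flagged
```

A pipeline step could call `check_jsonl_records` on each anonymized file and stop the run if the returned list is not empty.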
Training Data Formats Supported by anonym.plus
| Format | Typical Use in AI Training | Max Size |
|---|---|---|
| CSV | Tabular datasets, labeled examples | 30 MB |
| JSON / JSONL | Instruction tuning datasets, chat logs, annotations | 30 MB |
| TXT | Pretraining corpora, raw text documents | 50 MB |
| XLSX | Structured training labels, human-annotated data | 20 MB / 100K rows |
| PDF | Document corpora, legal/medical training text | 50 MB |
| DOCX | Annotated text documents, knowledge bases | 30 MB |
For large datasets above these limits, process files in batches using anonym.plus batch mode (Pro plan). All processing is 100% offline — training data never leaves your infrastructure.
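Splitting a large line-oriented file (JSONL or TXT) into parts that each stay under a size limit can be done before batch processing. The sketch below is one possible approach and makes several assumptions: the limit is read as decimal megabytes, records never span multiple lines, and the function names and output naming scheme are hypothetical. CSV files would additionally need the header row repeated in every part, which this sketch does not do.

```python
from pathlib import Path
from typing import Iterable, Iterator, List

MAX_BYTES = 30 * 1000 * 1000  # assumed reading of the 30 MB JSON/JSONL limit


def split_lines_by_size(lines: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group lines into chunks whose UTF-8 size stays at or below max_bytes."""
    chunk: List[str] = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        if line_size > max_bytes:
            raise ValueError("a single record exceeds the size limit")
        if chunk and size + line_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_size
    if chunk:
        yield chunk


def split_jsonl_file(src: Path, out_dir: Path, max_bytes: int = MAX_BYTES) -> List[Path]:
    """Write src as numbered part files in out_dir and return their paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    # newline="" keeps line endings untouched so byte counts stay accurate.
    with src.open("r", encoding="utf-8", newline="") as reader:
        for index, chunk in enumerate(split_lines_by_size(reader, max_bytes)):
            part = out_dir / f"{src.stem}.part{index:04d}{src.suffix}"
            with part.open("w", encoding="utf-8", newline="") as writer:
                writer.writelines(chunk)
            parts.append(part)
    return parts
```

Splitting on line boundaries keeps every record intact, so the parts can be concatenated back in order after processing.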
Which PII to Remove from Training Data
For EU AI Act compliance, prioritize removing:
- Direct identifiers: names, email addresses, phone numbers, national IDs, passport numbers
- Quasi-identifiers: dates of birth, job titles, postal codes, rare combinations of demographic attributes
- Special categories (Art. 9 GDPR): health data, racial/ethnic origin indicators, religious beliefs, political opinions, union membership, sexual orientation
- Financial data: IBANs, credit card numbers, account numbers
- Location data: precise GPS coordinates, home addresses, frequently visited places
anonym.plus detects all of these through 340+ built-in entity types. The GDPR Compliance preset (confidence 0.90) is the recommended starting point for training data preparation.
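Rare combinations of quasi-identifiers are harder to spot than direct identifiers, because each value looks harmless on its own. A common way to surface them in tabular data is a k-anonymity style count: group rows by their quasi-identifier values and flag any combination shared by fewer than k rows. The sketch below illustrates that idea and is not a description of how anonym.plus works. The column names are hypothetical, and the threshold k is a project decision, not a value set by the EU AI Act or GDPR.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping, Sequence, Tuple


def rare_quasi_identifier_groups(
    rows: Iterable[Mapping[str, Any]],
    quasi_identifiers: Sequence[str],
    k: int = 5,
) -> Dict[Tuple[Any, ...], int]:
    """Return quasi-identifier value combinations shared by fewer than k rows."""
    counts = Counter(
        tuple(row.get(column) for column in quasi_identifiers) for row in rows
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical example rows: three share one combination, one is unique.
rows = [
    {"postal_code": "10115", "birth_year": 1980, "job_title": "nurse"},
    {"postal_code": "10115", "birth_year": 1980, "job_title": "nurse"},
    {"postal_code": "10115", "birth_year": 1980, "job_title": "nurse"},
    {"postal_code": "80331", "birth_year": 1952, "job_title": "judge"},
]
rare = rare_quasi_identifier_groups(
    rows, ["postal_code", "birth_year", "job_title"], k=3
)
print(rare)  # {('80331', 1952, 'judge'): 1}
```

Flagged combinations are candidates for generalization (for example, a birth decade instead of a birth year) or removal before training.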
Documenting Compliance for Art. 10
After anonymizing your training datasets, document the following in your AI system's technical documentation (required under Art. 11):
- Data sources and collection methods
- PII removal method: anonym.plus v[x.x], Replace operator, GDPR Compliance preset, confidence threshold 0.90
- Entity types detected and replaced
- Date of processing and dataset version
- Any residual risks identified and mitigations applied
anonym.plus creates a processing history entry for each file, including entity counts, operator used, and timestamp — supporting this documentation requirement.
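The documentation items listed above can also be kept as a machine-readable record stored next to each dataset version. The sketch below shows one possible structure. The class name, field names, entity type labels, counts, date, and dataset version are all hypothetical placeholders, and the record is separate from the processing history that anonym.plus itself creates. The tool version is left as the `v[x.x]` placeholder.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class AnonymizationRecord:
    """One documentation entry per anonymized dataset version."""

    dataset_version: str
    processed_on: str  # ISO 8601 date
    data_sources: List[str]
    collection_methods: List[str]
    tool: str
    operator: str
    preset: str
    confidence_threshold: float
    entity_counts: Dict[str, int]
    residual_risks: List[str] = field(default_factory=list)
    mitigations: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record with stable key order for version control."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# All values below are illustrative placeholders, not real results.
record = AnonymizationRecord(
    dataset_version="support-tickets-v3",
    processed_on="2026-01-15",
    data_sources=["internal support ticket export"],
    collection_methods=["database export"],
    tool="anonym.plus v[x.x]",
    operator="Replace",
    preset="GDPR Compliance",
    confidence_threshold=0.90,
    entity_counts={"PERSON": 0, "EMAIL_ADDRESS": 0},  # fill from processing history
)
print(record.to_json())
```

Keeping the record as JSON with sorted keys makes changes between dataset versions easy to review in a diff.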
Start preparing your training data now.
Frequently Asked Questions
What does EU AI Act Article 10 require for training data?
Art. 10 requires high-risk AI training data to be relevant, representative, properly governed, and — by default — free of personal data. Organizations must document data origin, preprocessing steps, and any biases. Anonymization is the primary compliance mechanism for training data containing personal information.
When does the EU AI Act training data requirement take effect?
August 2, 2026. The EU AI Act entered into force August 1, 2024; high-risk AI system obligations apply 24 months later. Organizations should begin data governance and anonymization preparation well before this deadline.
Does anonym.plus support large training datasets for EU AI Act compliance?
Yes. Use Batch mode (Pro plan) to process up to 20 files in parallel. Supported formats include CSV, JSON, TXT, XLSX, PDF, and DOCX. All processing is 100% offline — training data never leaves your servers. For very large datasets, process in batches by splitting files.