EU AI Act Data Governance (Art. 10)
EU_AI_ACT | free
Validate AI training data quality: label completeness, class balance, train/test leakage, feature null rates, data drift, bias coverage, and provenance tracking per EU AI Act Article 10.
Checks included (10)
Training Data Label Completeness
Validates that training data labels are non-null for all supervised learning records. Under EU AI Act Article 10, high-risk AI systems must be developed with training data that meets quality criteria including completeness. Missing labels in supervised learning datasets compromise model reliability and violate data governance requirements.
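A minimal sketch of this kind of check, in plain Python. The field name `label` and the choice to treat blank strings as missing are assumptions for illustration, not the pack's actual configuration.

```python
from typing import Any, Iterable, Mapping


def missing_label_rows(
    records: Iterable[Mapping[str, Any]], label_field: str = "label"
) -> list[int]:
    """Return positions of records whose label is absent, None, or blank."""
    missing = []
    for i, rec in enumerate(records):
        value = rec.get(label_field)
        # Assumption: a whitespace-only string counts as a missing label.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(i)
    return missing


rows = [
    {"id": 1, "label": "cat"},
    {"id": 2, "label": None},
    {"id": 3, "label": ""},
    {"id": 4},
]
print(missing_label_rows(rows))  # [1, 2, 3]
```

A numeric label of `0` is kept as valid here, which matters for binary classification data.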
Feature Column Null Rate Threshold
Validates that feature columns do not exceed the configured null rate threshold. Excessive missing values in feature columns degrade model training quality and can introduce bias. Under EU AI Act Article 10, training data must be complete in view of the intended purpose of the AI system.
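The null-rate check could be sketched as follows. The default threshold of 0.05 and the decision to count both `None` and float NaN as null are assumptions; the real threshold is whatever the check is configured with.

```python
import math
from typing import Any, Mapping, Sequence


def is_null(value: Any) -> bool:
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_rates(
    records: Sequence[Mapping[str, Any]], columns: Sequence[str]
) -> dict[str, float]:
    """Fraction of records with a missing value, per column."""
    total = len(records)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(is_null(rec.get(col)) for rec in records) / total
        for col in columns
    }


def columns_over_threshold(records, columns, max_null_rate: float = 0.05):
    """Return only the columns whose null rate exceeds the threshold."""
    rates = null_rates(records, columns)
    return {col: rate for col, rate in rates.items() if rate > max_null_rate}


rows = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": float("nan"), "income": 61000.0},
    {"age": 29, "income": 39000.0},
]
print(columns_over_threshold(rows, ["age", "income"], max_null_rate=0.25))
# {'age': 0.5}
```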
Dataset Documentation Completeness
Validates that each dataset has required documentation fields populated: description, source, collection_date, size, and intended_use. Under EU AI Act Article 10, providers of high-risk AI systems must maintain comprehensive documentation of training data including its characteristics, properties, and intended purpose.
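As an illustration, the five required fields can be checked against a dataset's metadata record. The dictionary layout below is a hypothetical stand-in for wherever the documentation is actually stored.

```python
from typing import Any, Mapping

REQUIRED_DOC_FIELDS = (
    "description",
    "source",
    "collection_date",
    "size",
    "intended_use",
)


def missing_doc_fields(metadata: Mapping[str, Any]) -> list[str]:
    """Return required documentation fields that are absent or blank."""
    missing = []
    for field in REQUIRED_DOC_FIELDS:
        value = metadata.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Hypothetical dataset card: 'source' is blank, 'intended_use' is absent.
card = {
    "description": "Loan applications, 2019 to 2023",
    "source": " ",
    "collection_date": "2023-12-31",
    "size": 120000,
}
print(missing_doc_fields(card))  # ['source', 'intended_use']
```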
Data Provenance Tracking Completeness
Validates that each record has provenance fields populated: source_system, ingestion_date, and data_version. Under EU AI Act Article 10, providers must maintain data governance practices that ensure traceability of training data origin and lineage. Provenance tracking is essential for auditing, debugging model behavior, and demonstrating regulatory compliance.
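A per-record version of this check might look like the sketch below. It reports which provenance fields are missing for each failing record and a simple completeness rate; the record layout is assumed.

```python
from typing import Any, Mapping, Sequence

PROVENANCE_FIELDS = ("source_system", "ingestion_date", "data_version")


def provenance_gaps(
    records: Sequence[Mapping[str, Any]],
) -> dict[int, list[str]]:
    """Map record position to the provenance fields that are missing or blank."""
    gaps = {}
    for i, rec in enumerate(records):
        missing = [
            field
            for field in PROVENANCE_FIELDS
            if rec.get(field) is None or str(rec.get(field)).strip() == ""
        ]
        if missing:
            gaps[i] = missing
    return gaps


rows = [
    {"source_system": "crm", "ingestion_date": "2024-03-01", "data_version": "v3"},
    {"source_system": "crm", "ingestion_date": "", "data_version": "v3"},
    {"source_system": None, "data_version": "v3"},
]
gaps = provenance_gaps(rows)
print(gaps)  # {1: ['ingestion_date'], 2: ['source_system', 'ingestion_date']}

# Share of records with all three provenance fields populated.
completeness = 1 - len(gaps) / len(rows)
```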
PII Annotation and Anonymization Flag
Validates that records containing personal data have pii_flag set to true and anonymization_method populated. Under EU AI Act Article 10, training data containing personal data requires appropriate data governance measures including privacy-preserving techniques. Proper PII annotation ensures transparency and supports GDPR compliance alongside AI Act requirements.
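The description does not say how a record is recognised as containing personal data. The sketch below assumes a hypothetical list of PII columns (`PII_COLUMNS`) and flags any record that has a value in one of them but lacks `pii_flag` set to true or a populated `anonymization_method`.

```python
from typing import Any, Mapping, Sequence

# Hypothetical: columns assumed to hold personal data in this dataset.
PII_COLUMNS = ("email", "full_name", "phone")


def pii_violations(
    records: Sequence[Mapping[str, Any]],
) -> dict[int, list[str]]:
    """Map record position to the PII annotation problems found."""
    violations = {}
    for i, rec in enumerate(records):
        has_pii = any(rec.get(col) not in (None, "") for col in PII_COLUMNS)
        if not has_pii:
            continue  # the rule only applies to records with personal data
        problems = []
        if rec.get("pii_flag") is not True:
            problems.append("pii_flag not true")
        if not str(rec.get("anonymization_method") or "").strip():
            problems.append("anonymization_method missing")
        if problems:
            violations[i] = problems
    return violations


rows = [
    {"email": "a@example.com", "pii_flag": True, "anonymization_method": "sha256_hash"},
    {"email": "b@example.com", "pii_flag": False, "anonymization_method": None},
    {"email": None, "pii_flag": False},
    {"phone": "555-0100", "pii_flag": True},
]
print(pii_violations(rows))
# {1: ['pii_flag not true', 'anonymization_method missing'], 3: ['anonymization_method missing']}
```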
Training Data Class Balance Metric
Validates that the imbalance ratio between any two classes in the training data does not exceed the configured threshold. Under EU AI Act Article 10, training datasets must be sufficiently representative and must be examined for possible biases. Severe class imbalance can lead to biased model predictions and underperformance on minority classes.
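One common way to express this is the ratio of the largest class count to the smallest, which bounds the ratio between any two classes. The sketch below uses that definition as an assumption; the default threshold of 10 is illustrative.

```python
from collections import Counter
from typing import Hashable, Iterable


def imbalance_ratio(labels: Iterable[Hashable]) -> float:
    """Largest class count divided by smallest class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


def is_balanced(labels: Iterable[Hashable], max_ratio: float = 10.0) -> bool:
    """True when no pair of classes differs by more than max_ratio."""
    return imbalance_ratio(labels) <= max_ratio


labels = ["ok"] * 90 + ["fraud"] * 10
print(imbalance_ratio(labels))             # 9.0
print(is_balanced(labels, max_ratio=5.0))  # False
```

A class that never appears in the data is invisible to this ratio, so the set of expected classes would need to be checked separately.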
Feature Distribution Data Drift Detection
Validates that feature distributions in production data do not deviate more than the configured number of standard deviations from the training baseline. Data drift indicates that production inputs have shifted from what the model was trained on, potentially degrading AI system performance. EU AI Act Article 10 requires datasets to be relevant to the intended purpose, and drift monitoring helps show that the training data still reflects the inputs the system receives.
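The description leaves open which statistic is compared. One simple reading, assumed here, is to measure how far the production mean of a feature sits from the training mean, in units of the training standard deviation.

```python
import statistics
from typing import Sequence


def drift_z_score(train_values: Sequence[float], prod_values: Sequence[float]) -> float:
    """Distance of the production mean from the training mean, in training sigmas."""
    mu = statistics.fmean(train_values)
    sigma = statistics.stdev(train_values)  # sample standard deviation
    if sigma == 0:
        raise ValueError("training feature is constant; z-score is undefined")
    return abs(statistics.fmean(prod_values) - mu) / sigma


def has_drifted(train_values, prod_values, max_sigmas: float = 3.0) -> bool:
    """True when the production mean is more than max_sigmas away."""
    return drift_z_score(train_values, prod_values) > max_sigmas


train = [10, 12, 11, 13, 9, 11, 10, 12]  # mean 11.0
prod = [15, 16, 14, 17]                  # mean 15.5
print(round(drift_z_score(train, prod), 2))  # 3.44
print(has_drifted(train, prod))              # True
```

A mean shift is only one form of drift. Changes in variance or shape would need a distribution-level test, which this sketch does not attempt.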
Outlier Rate Threshold Check
Validates that the rate of statistical outliers (values exceeding 3 standard deviations from the mean) in a feature column stays below the configured threshold. Excessive outliers in training data can skew model learning and produce unreliable AI systems. Under EU AI Act Article 10, training data must be free of errors to the best extent possible.
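The 3-standard-deviation rule from the description can be sketched directly. The use of the population standard deviation and the handling of constant or very short columns are assumptions of this example.

```python
import statistics
from typing import Iterable


def outlier_rate(values: Iterable[float], z_cutoff: float = 3.0) -> float:
    """Fraction of values more than z_cutoff standard deviations from the mean."""
    data = list(values)
    if len(data) < 2:
        return 0.0
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # population standard deviation
    if sigma == 0:
        return 0.0  # constant column: nothing can be an outlier
    outliers = sum(1 for v in data if abs(v - mu) > z_cutoff * sigma)
    return outliers / len(data)


values = [10.0] * 19 + [100.0]
rate = outlier_rate(values)
print(rate)          # 0.05
print(rate <= 0.01)  # False: exceeds a hypothetical 1% threshold
```

Because each outlier also inflates the standard deviation, this rule cannot flag anything in very small samples, so it is best applied to columns with a reasonable number of rows.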
Protected Attribute Representation Coverage
Validates that protected attributes (such as gender, age_group, ethnicity) are represented with a minimum coverage percentage per group. Under EU AI Act Article 10, training data must be examined for possible biases that are likely to affect the health, safety, or fundamental rights of persons. Insufficient representation of protected groups can lead to discriminatory AI outcomes.
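A rough sketch of a coverage check for one protected attribute follows. The 5% minimum share, the `expected_groups` parameter, and the choice to ignore records where the attribute is missing are all assumptions for illustration.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Sequence


def underrepresented_groups(
    records: Sequence[Mapping[str, Any]],
    attribute: str,
    min_share: float = 0.05,
    expected_groups: Iterable[str] = (),
) -> dict[str, float]:
    """Return groups whose share of records falls below min_share."""
    # Assumption: records with no value for the attribute are left out.
    values = [rec[attribute] for rec in records if rec.get(attribute) is not None]
    total = len(values)
    counts = Counter(values)
    groups = set(counts) | set(expected_groups)
    shares = {g: (counts.get(g, 0) / total if total else 0.0) for g in groups}
    return {g: share for g, share in shares.items() if share < min_share}


rows = (
    [{"gender": "female"}] * 60
    + [{"gender": "male"}] * 38
    + [{"gender": "nonbinary"}] * 2
)
print(underrepresented_groups(rows, "gender"))  # {'nonbinary': 0.02}
```

Passing `expected_groups` lets the check report a group that is absent from the data entirely, which a count of observed values alone would miss.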
Train-Test Data Leakage Detection
Validates that no record IDs appear in both the training and test datasets. Data leakage between training and test sets leads to artificially inflated model performance metrics and unreliable AI systems. Under EU AI Act Article 10, datasets must support proper evaluation of AI system performance.
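In its simplest form this is a set intersection over record IDs, as in the sketch below. The function name is illustrative.

```python
from typing import Hashable, Iterable


def leaked_ids(
    train_ids: Iterable[Hashable], test_ids: Iterable[Hashable]
) -> set:
    """Return record IDs that appear in both the training and test sets."""
    return set(train_ids) & set(test_ids)


train_ids = [1, 2, 3, 4]
test_ids = [4, 5, 6]
overlap = leaked_ids(train_ids, test_ids)
print(overlap)       # {4}
print(not overlap)   # False: the check fails because ID 4 is in both sets
```

An ID comparison only catches records that share an identifier. Duplicate rows stored under different IDs would pass it, so a content-based comparison may be needed as a complement.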