Data Management
Purpose
This document establishes Rygen Technologies’ AI Data Management Process in accordance with ISO/IEC 42001:2023, Annex B.7 (Data for AI systems). It ensures that data used in AI systems is properly acquired, validated, documented, and maintained throughout the AI system lifecycle.
Scope
This process applies to all data used within the scope of the AI Management System (AIMS):
- Training data for custom ML models
- Validation and test data
- Production/operational data
- Data transmitted to/from third-party AI APIs
- Internal data used with internal AI tools
Applicability by System Type
This procedure applies to all AI systems within the AIMS scope, but deliverable requirements differ based on system architecture:
Custom Model Systems
Systems that train, fine-tune, or retrain models on proprietary data must produce all deliverables defined in this procedure as standalone documents: Data Requirements Specification, Data Quality Report, data preparation documentation, and data provenance records.
API-Based Systems
Systems that consume third-party AI APIs (e.g., OpenAI, Google Gemini, Anthropic) without training custom models have different data management characteristics. These systems process operational data in real time rather than curating training datasets. For API-based systems:
- Data management documentation is integrated into the System Design Specification rather than produced as standalone deliverables. The design spec must include a dedicated Data Management section covering: data requirements and sources, data quality assessment, data preparation and transformation, data provenance and audit trail, and data governance and security.
- Data Quality Reports as defined in Section 5 are not required. Instead, the design spec’s data quality assessment must evaluate fitness of input data for the system’s intended use, using quality dimensions appropriate to the data type (e.g., for knowledge bases: completeness of topic coverage, accuracy against current product features, currency of content).
- Data provenance must document data flows through the system: what data enters, how it is transformed, what data is sent to third-party APIs, and what data is stored and where.
- Data preparation must document any transformation pipeline (e.g., chunking, embedding, prompt construction) in sufficient detail for reproducibility. Pipeline code resides in application source repositories; the design spec must reference the repository location.
See GOV-056 (Shipment Tracking Copilot System Design) Section 5 for a reference implementation of API-based system data management documentation.
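As an illustration of documenting a transformation pipeline in enough detail for reproducibility, the sketch below chunks a text deterministically and emits a manifest that could be recorded with the design spec. The names `ChunkingConfig`, `chunk_text`, and `pipeline_manifest`, and the chunk sizes, are hypothetical and are not taken from GOV-056. Embedding and prompt construction are left out.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    """Hypothetical chunking parameters; real values belong in the design spec."""
    chunk_size: int = 200  # characters per chunk
    overlap: int = 20      # characters shared by neighbouring chunks


def chunk_text(text: str, cfg: ChunkingConfig) -> list[str]:
    """Split text into fixed-size character chunks with a fixed overlap."""
    if cfg.chunk_size <= 0 or not 0 <= cfg.overlap < cfg.chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = cfg.chunk_size - cfg.overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + cfg.chunk_size])
        if start + cfg.chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def _sha256(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def pipeline_manifest(source_id: str, text: str, cfg: ChunkingConfig) -> dict:
    """Describe one pipeline run so it can be reproduced and audited later."""
    chunks = chunk_text(text, cfg)
    return {
        "source_id": source_id,
        "config": asdict(cfg),
        "input_sha256": _sha256(text),
        "chunk_count": len(chunks),
        "chunk_sha256": [_sha256(c) for c in chunks],
    }
```

Because the configuration and content hashes are part of the manifest, two runs over the same input with the same settings produce identical manifests, which gives a simple check that a pipeline change altered its output.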
Data Management Lifecycle
Data Strategy and Planning
Objective: Define data requirements before AI system development
Activities:
- Identify data categories needed for the AI system
- Determine data volume and velocity requirements
- Assess available data sources (internal, purchased, shared, synthetic)
- Define data quality criteria specific to the use case
Deliverables:
- Data Requirements Specification
Data Acquisition
Objective: Obtain data from identified sources with proper documentation
Activities:
Source Documentation:
- Record data source characteristics (static, streaming, batch, real-time)
- Document data collection methods and timeframes
Rights and Compliance:
- Verify data usage rights and licenses
- Ensure compliance with privacy regulations
Technical Acquisition:
- Implement secure data transfer mechanisms
- Validate data format and schema
- Perform initial data profiling
Deliverables:
- Initial data profile report
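The schema validation and initial profiling activities could look like the following minimal sketch, which uses only the Python standard library. The schema, field names, and function names are hypothetical examples and are not prescribed by this procedure.

```python
from typing import Any

# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"shipment_id": str, "weight_kg": float, "status": str}


def validate_schema(records: list[dict[str, Any]], schema: dict[str, type]) -> list[str]:
    """Return human-readable errors for missing fields and wrong types."""
    errors = []
    for idx, rec in enumerate(records):
        for name in sorted(schema.keys() - rec.keys()):
            errors.append(f"record {idx}: missing field '{name}'")
        for name, expected in schema.items():
            value = rec.get(name)
            # None is treated as a missing value, not as a type error.
            if value is not None and not isinstance(value, expected):
                errors.append(
                    f"record {idx}: '{name}' expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
    return errors


def profile(records: list[dict[str, Any]], schema: dict[str, type]) -> dict:
    """Count missing and distinct values per field for an initial profile."""
    total = len(records)
    fields = {}
    for name in schema:
        present = [r[name] for r in records if r.get(name) is not None]
        fields[name] = {"missing": total - len(present), "distinct": len(set(present))}
    return {"total_records": total, "fields": fields}
```

The output of `profile` is the kind of summary an initial data profile report could start from; a real report would typically add value ranges and distributions.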
Data Quality Assessment
Objective: Evaluate data fitness for intended AI use
Process: Apply the Data Quality Assessment (Section 5) to evaluate:
- Completeness
- Accuracy
- Consistency
- Timeliness
- Relevance
- Representativeness
Activities:
- Run automated quality checks
- Generate statistical profiles
- Identify quality issues and gaps
- Document quality assessment results
Deliverables:
- Data Quality Report
Data Preparation
Objective: Transform data for AI system consumption
Standard Preparation Methods:
Statistical Exploration:
- Distribution analysis (mean, median, standard deviation)
- Range and outlier detection
- Correlation analysis
Cleaning:
- Correct invalid entries
- Handle missing values (document method used)
- Remove duplicates
- Standardize formats
Transformation:
- Imputation (document methods and rationale)
- Normalization/scaling
- Feature encoding
- Label encoding for categorical variables
Critical Requirements:
- Document ALL preparation steps for reproducibility
- Maintain original data separate from prepared data
- Version control transformation scripts
Deliverables:
- Data preparation documentation (standalone for custom model systems; integrated into the System Design Specification for API-based systems)
- Prepared dataset with metadata
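One way to meet the critical requirements above is to have the preparation code itself produce the record of what it did. The sketch below is a simplified illustration, not a prescribed pipeline: it works on a copy of the data, then applies duplicate removal, median imputation, and min-max scaling, logging each step. The function name `prepare` and the choice of these three methods are assumptions made for the example.

```python
import copy
import statistics


def prepare(rows: list[dict], numeric_field: str) -> tuple[list[dict], list[dict]]:
    """Return (prepared_rows, step_log); the input rows are left untouched."""
    prepared = copy.deepcopy(rows)  # keep original data separate from prepared data
    log = []

    # Cleaning: drop exact duplicate rows while preserving order.
    seen = set()
    deduped = []
    for row in prepared:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(row)
    log.append({"step": "remove_duplicates", "removed": len(prepared) - len(deduped)})

    # Transformation: impute missing values with the median of observed values.
    observed = [r[numeric_field] for r in deduped if r.get(numeric_field) is not None]
    median = statistics.median(observed)  # raises StatisticsError if nothing is observed
    imputed = 0
    for row in deduped:
        if row.get(numeric_field) is None:
            row[numeric_field] = median
            imputed += 1
    log.append({"step": "impute_median", "field": numeric_field,
                "value": median, "imputed": imputed})

    # Transformation: min-max scale the field to the range [0, 1].
    low, high = min(observed), max(observed)
    span = high - low
    for row in deduped:
        row[numeric_field] = (row[numeric_field] - low) / span if span else 0.0
    log.append({"step": "min_max_scale", "field": numeric_field, "min": low, "max": high})

    return deduped, log
```

The returned log records the method and parameters of every step, so it can be stored as metadata with the prepared dataset and used to repeat the preparation on new data.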
Data Provenance Management
Objective: Maintain lineage of data throughout lifecycle
For API-Based Systems:
- Document data flows and API usage
- Record data transformations and processing steps
For Custom Models:
- Creation: Original source, collection method, timestamp
- Updates: All modifications with who/what/when/why
- Transformations: Complete record of preparation steps
- Usage: Which AI systems use this data
Implementation:
- Maintain provenance in version-controlled documentation
- Include provenance metadata with datasets
Deliverables:
- Data provenance records
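A provenance record for a custom-model dataset could be kept as a small structured file under version control. The sketch below shows one possible shape covering creation, updates (who/what/when/why), transformations, and usage. The class name, field names, and JSON layout are illustrative assumptions, not a required format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record for one dataset."""
    dataset_name: str
    source: str                 # creation: original source
    collection_method: str      # creation: how the data was collected
    created_at: str             # creation: ISO 8601 timestamp
    content_sha256: str         # fingerprint of the dataset contents
    updates: list[dict] = field(default_factory=list)
    transformations: list[str] = field(default_factory=list)
    used_by: list[str] = field(default_factory=list)

    def record_update(self, who: str, what: str, why: str,
                      when: Optional[str] = None) -> None:
        """Append a who/what/when/why entry; 'when' defaults to the current UTC time."""
        timestamp = when or datetime.now(timezone.utc).isoformat()
        self.updates.append({"who": who, "what": what, "when": timestamp, "why": why})

    def to_json(self) -> str:
        """Serialize with stable key order so diffs in version control stay small."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest used to tie the record to exact dataset contents."""
    return hashlib.sha256(data).hexdigest()
```

Storing the content hash alongside the history makes it possible to confirm later that a dataset on disk is the one the record describes.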
Data Governance and Security
Objective: Ensure appropriate controls throughout data lifecycle
Security Controls:
- Encryption at rest and in transit
- Access control and authentication
- Audit logging of data access
- Secure disposal procedures
Governance Controls:
- Regular review of data quality
- Monitoring for data drift
- Compliance verification
- Stakeholder access management
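This procedure does not prescribe a drift metric. As one possible approach to monitoring for data drift on a numeric feature, the sketch below computes the Population Stability Index (PSI) between a reference sample and a recent sample. The function name, the number of bins, and the smoothing constant are assumptions made for illustration, and any alert threshold would need to be set per system.

```python
import math


def population_stability_index(expected: list[float], actual: list[float],
                               bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference sample (expected) and a recent sample (actual).

    Bins are equal-width over the range of the reference sample; values in
    the recent sample that fall outside that range go into the edge bins.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    low, high = min(expected), max(expected)
    if low == high:
        raise ValueError("reference sample has no spread; cannot build bins")
    width = (high - low) / bins

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - low) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of zero means the binned distributions match; larger values indicate a larger shift. Each term of the sum is non-negative, so the index never falls below zero.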
Integration with AI Development Process
Phase 1 – Problem Definition:
- Define data requirements
- Assess data availability
- Include data risks in risk assessment
Phase 2 – Data Collection and Preparation:
- Execute this Data Management Process
- Generate required documentation
- Validate data quality
Phase 3 – Training and Development:
- Use prepared data with full provenance
- Document data splits (train/validation/test)
- Monitor data usage
Phase 6 – Monitoring:
- Track data drift in production
- Monitor ongoing data quality
- Update data as needed
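For the Phase 3 requirement to document data splits, a seeded split that returns a manifest is one simple option. In the sketch below, the function name, the default 80/10/10 fractions, and the seed value are hypothetical choices and are not mandated by this process.

```python
import random


def split_dataset(record_ids: list[str], seed: int = 42,
                  fractions: tuple[float, float, float] = (0.8, 0.1, 0.1)):
    """Split record IDs into train/validation/test and describe the split."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = sorted(record_ids)          # sort first so input order does not matter
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, repeatable across runs
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    splits = {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder, so no record is dropped
    }
    manifest = {
        "seed": seed,
        "fractions": list(fractions),
        "counts": {name: len(ids) for name, ids in splits.items()},
    }
    return splits, manifest
```

Recording the manifest (seed, fractions, and counts) with the provenance records lets a reviewer regenerate exactly the same split from the same set of record IDs.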
Data Quality Assessment
Quality Dimensions
| Dimension | Excellent (5) | Good (4) | Acceptable (3) | Poor (2) | Unacceptable (1) |
|---|---|---|---|---|---|
| Completeness | <1% missing values, all required fields present | 1% to <5% missing, minor gaps | 5% to <10% missing, manageable gaps | 10-25% missing, significant gaps | >25% missing, critical gaps |
| Accuracy | Verified against ground truth, <1% errors | High confidence, <5% errors | Reasonable accuracy, <10% errors | Questionable accuracy, 10-20% errors | Poor accuracy, >20% errors |
| Consistency | No contradictions, uniform formats | Minor format variations | Some inconsistencies, manageable | Notable inconsistencies | Severe inconsistencies |
| Timeliness | Real-time or <1 day old | <1 week old | <1 month old | <6 months old | 6 months old or older |
| Relevance | Perfectly matched to use case | Highly relevant, minor gaps | Generally relevant | Partially relevant | Minimally relevant |
| Representativeness | Fully represents target domain | Good coverage, minor gaps | Adequate coverage | Limited coverage | Poor representation |
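The quantitative rows of this table can be turned into automated checks. The sketch below maps a missing-value percentage and a data age to the Completeness and Timeliness scores. Boundary values (for example exactly 5% missing) are placed in the lower-scoring band, and a month is approximated as 30 days; the function names and that approximation are assumptions of the example.

```python
def completeness_score(missing_pct: float) -> int:
    """Map the percentage of missing values (0-100) to a 1-5 score."""
    if not 0 <= missing_pct <= 100:
        raise ValueError("missing_pct must be between 0 and 100")
    if missing_pct < 1:
        return 5
    if missing_pct < 5:
        return 4
    if missing_pct < 10:
        return 3
    if missing_pct <= 25:
        return 2
    return 1


def timeliness_score(age_days: float) -> int:
    """Map data age in days to a 1-5 score (assumes 1 month = 30 days)."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days < 1:
        return 5
    if age_days < 7:
        return 4
    if age_days < 30:
        return 3
    if age_days < 180:
        return 2
    return 1
```

The qualitative dimensions (Consistency, Relevance, Representativeness) still need an assessor's judgment, so a function like this covers only part of the assessment.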
Fitness for Use Assessment
- Production Ready: Every dimension scores 4 or 5; suitable for production use
- Acceptable with Conditions: Every dimension scores 3 or higher and at least one scores 3; usable with documented limitations
- Requires Improvement: The lowest dimension score is 2; needs remediation before production use
- Not Suitable: Any dimension scores 1; not suitable for AI use
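Because each rating is determined by the lowest dimension score, the rules above reduce to a short function. This is a minimal sketch; the names `DIMENSIONS` and `fitness_rating` are hypothetical.

```python
DIMENSIONS = (
    "Completeness", "Accuracy", "Consistency",
    "Timeliness", "Relevance", "Representativeness",
)


def fitness_rating(scores: dict[str, int]) -> str:
    """Derive the fitness-for-use rating from the six dimension scores."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    values = [scores[d] for d in DIMENSIONS]
    if any(v not in (1, 2, 3, 4, 5) for v in values):
        raise ValueError("scores must be whole numbers from 1 to 5")
    lowest = min(values)  # the weakest dimension decides the rating
    if lowest == 1:
        return "Not Suitable"
    if lowest == 2:
        return "Requires Improvement"
    if lowest == 3:
        return "Acceptable with Conditions"
    return "Production Ready"
```

For example, a dataset scoring 5 on every dimension except a 3 for Timeliness would be rated "Acceptable with Conditions", and the limitation would be documented in the report.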
Data Quality Report Template
# Data Quality Report
## Dataset Information
- **Dataset Name**: [Name]
- **Source**: [Source description]
- **Collection Period**: [Start date - End date]
- **Assessment Date**: [Date]
- **Assessor**: [Name/Role]
## Quality Assessment Summary
### Dimension Scores
| Dimension | Score | Issues Identified | Remediation Actions |
| ------------------ | ----- | ----------------- | ------------------- |
| Completeness | X/5 | [Issues] | [Actions] |
| Accuracy | X/5 | [Issues] | [Actions] |
| Consistency | X/5 | [Issues] | [Actions] |
| Timeliness | X/5 | [Issues] | [Actions] |
| Relevance | X/5 | [Issues] | [Actions] |
| Representativeness | X/5 | [Issues] | [Actions] |
## Statistical Profile
- **Total Records**: [Number]
- **Features/Columns**: [Number]
- **Data Types**: [List types and counts]
- **Missing Value Summary**: [Table or description]
## Key Findings
1. [Finding 1]
2. [Finding 2]
3. [Finding 3]
## Recommendations
- [Recommendation 1]
- [Recommendation 2]
## Fitness for Use Assessment
**Intended Use Case**: [Description]
**Fitness Rating**: [Production Ready/Acceptable with Conditions/Requires Improvement/Not Suitable]
**Conditions/Limitations**: [If applicable]
## Approval
- **Reviewed By**: [Name/Role]
- **Approved By**: [Principal AI Engineer]
- **Date**: [Date]
Roles and Responsibilities
| Role | Data Management Responsibilities |
|---|---|
| Principal AI Engineer | • Approve data quality criteria • Review quality reports • Approve production data use |
| Data Engineers | • Implement data pipelines • Execute quality checks • Maintain provenance records |
| AI/ML Engineers | • Define data requirements • Perform data preparation • Document transformations |
| DevOps Team | • Ensure secure data storage • Implement access controls • Monitor data pipelines |
Monitoring and Continuous Improvement
Regular Reviews
- Monthly: Data quality metrics review
- Quarterly: Process effectiveness assessment
- Annually: Comprehensive process review
Key Metrics
- Data quality assessment completion rate
- Data preparation time
- Number of data-related incidents
- Provenance record completeness
Related Documents
- AI-013 AI System Development Lifecycle
- AI-008 AI Risk Management Framework
- AI-009 AI System Impact Assessment Process
- GOV-056 Shipment Tracking Copilot System Design
Revision History
| Version | Date | Author | Summary of Change |
|---|---|---|---|
| 1.0 | 2025-06-05 | Field Bradley | Initial draft. |
| 1.1 | 2025-09-02 | Field Bradley | Migrated to Markdown and GitLab |
| 1.2 | 2026-02-17 | Field Bradley | Added applicability by system type section clarifying deliverables for API-based vs custom model systems; updated related documents; fixed review date to quarterly schedule |