Data Management

Purpose

This document establishes Rygen Technologies’ AI Data Management Process in accordance with ISO/IEC 42001:2023 Annex B.7. It ensures that data used in AI systems is properly acquired, validated, documented, and maintained throughout the AI system lifecycle.

Scope

This process applies to all data used within the AIMS scope:

  • Training data for custom ML models
  • Validation and test data
  • Production/operational data
  • Data transmitted to/from third-party AI APIs
  • Internal data used with internal AI tools

Applicability by System Type

This procedure applies to all AI systems within the AIMS scope, but deliverable requirements differ based on system architecture:

Custom Model Systems

Systems that train, fine-tune, or retrain models on proprietary data must produce all deliverables defined in this procedure as standalone documents: Data Requirements Specification, Data Quality Report, data preparation documentation, and data provenance records.

API-Based Systems

Systems that consume third-party AI APIs (e.g., OpenAI, Google Gemini, Anthropic) without training custom models have different data management characteristics: they process operational data in real time rather than curating training datasets. For API-based systems:

  • Data management documentation is integrated into the System Design Specification rather than produced as standalone deliverables. The design spec must include a dedicated Data Management section covering: data requirements and sources, data quality assessment, data preparation and transformation, data provenance and audit trail, and data governance and security.
  • Data Quality Reports as defined in Section 5 are not required. Instead, the design spec’s data quality assessment must evaluate fitness of input data for the system’s intended use, using quality dimensions appropriate to the data type (e.g., for knowledge bases: completeness of topic coverage, accuracy against current product features, currency of content).
  • Data provenance must document data flows through the system: what data enters, how it is transformed, what data is sent to third-party APIs, and what data is stored and where.
  • Data preparation must document any transformation pipeline (e.g., chunking, embedding, prompt construction) in sufficient detail for reproducibility. Pipeline code resides in application source repositories; the design spec must reference the repository location.
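As an illustration of documenting a transformation pipeline in enough detail to reproduce it, the following minimal sketch splits text into fixed-size overlapping chunks and derives a fingerprint from the pipeline parameters that a design spec could cite. The function names, chunk size, and overlap are hypothetical and are not taken from any Rygen system.

```python
import hashlib
import json


def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def pipeline_fingerprint(params: dict) -> str:
    """Hash the parameters in a canonical form so an exact configuration can be referenced."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


params = {"chunk_size": 10, "overlap": 2}  # hypothetical values
chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", **params)
fingerprint = pipeline_fingerprint(params)
```

Recording the fingerprint alongside the repository reference would let a reviewer confirm that a rerun used the same parameters.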

See GOV-056 (Shipment Tracking Copilot System Design) Section 5 for a reference implementation of API-based system data management documentation.

Data Management Lifecycle

Data Strategy and Planning

Objective: Define data requirements before AI system development

Activities:

  • Identify data categories needed for the AI system
  • Determine data volume and velocity requirements
  • Assess available data sources (internal, purchased, shared, synthetic)
  • Define data quality criteria specific to the use case

Deliverables:

  • Data Requirements Specification

Data Acquisition

Objective: Obtain data from identified sources with proper documentation

Activities:

Source Documentation:

  • Record data source characteristics (static, streaming, batch, real-time)
  • Document data collection methods and timeframes

Rights and Compliance:

  • Verify data usage rights and licenses
  • Ensure compliance with privacy regulations

Technical Acquisition:

  • Implement secure data transfer mechanisms
  • Validate data format and schema
  • Perform initial data profiling
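The format and schema validation step could look roughly like the sketch below, which checks each record against an expected set of fields and types. The schema, field names, and sample records are hypothetical; a real system would take its schema from the Data Requirements Specification.

```python
from datetime import date

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"shipment_id": str, "weight_kg": float, "ship_date": date}


def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of schema violations for one record (empty if the record is valid)."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


records = [
    {"shipment_id": "S-1", "weight_kg": 12.5, "ship_date": date(2025, 1, 3)},
    {"shipment_id": "S-2", "weight_kg": "heavy"},
]
report = {r["shipment_id"]: validate_record(r, SCHEMA) for r in records}
```

Collecting violations rather than raising on the first one keeps the output usable as input to the initial data profile report.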

Deliverables:

  • Initial data profile report

Data Quality Assessment

Objective: Evaluate data fitness for intended AI use

Process: Apply the Data Quality Assessment (Section 5) to evaluate:

  • Completeness
  • Accuracy
  • Consistency
  • Timeliness
  • Relevance
  • Representativeness

Activities:

  • Run automated quality checks
  • Generate statistical profiles
  • Identify quality issues and gaps
  • Document quality assessment results
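A minimal sketch of two automated quality checks, assuming records are held as dictionaries: the share of records missing each required field, and the number of repeated keys. The field names and sample data are hypothetical.

```python
def missing_rate(records: list[dict], required: list[str]) -> dict[str, float]:
    """Fraction of records in which each required field is absent or None."""
    if not records:
        raise ValueError("no records to assess")
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) is None) / total
        for field in required
    }


def duplicate_count(records: list[dict], key: str) -> int:
    """Number of records whose key value was already seen earlier in the list."""
    seen, dupes = set(), 0
    for r in records:
        value = r.get(key)
        if value in seen:
            dupes += 1
        else:
            seen.add(value)
    return dupes


records = [
    {"id": "A1", "weight_kg": 3.2},
    {"id": "A2", "weight_kg": None},
    {"id": "A2", "weight_kg": 4.1},
    {"id": "A3"},
]
rates = missing_rate(records, ["id", "weight_kg"])
dupes = duplicate_count(records, "id")
```

The missing rate, expressed as a percentage, is the kind of figure that feeds the Completeness dimension.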

Deliverables:

  • Data Quality Report

Data Preparation

Objective: Transform data for AI system consumption

Standard Preparation Methods:

Statistical Exploration:

  • Distribution analysis (mean, median, standard deviation)
  • Range and outlier detection
  • Correlation analysis

Cleaning:

  • Correct invalid entries
  • Handle missing values (document method used)
  • Remove duplicates
  • Standardize formats

Transformation:

  • Imputation (document methods and rationale)
  • Normalization/scaling
  • Feature encoding
  • Label encoding for categorical variables

Critical Requirements:

  • Document ALL preparation steps for reproducibility
  • Maintain original data separate from prepared data
  • Version control transformation scripts
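One way to satisfy these requirements together is to have each transformation work on a copy of the data and append an entry to a step log. The sketch below does this for median imputation followed by min-max scaling; the choice of methods, the column name, and the log fields are assumptions made for illustration.

```python
import copy
import statistics


def prepare(rows: list[dict], column: str) -> tuple[list[dict], list[dict]]:
    """Median-impute, then min-max scale one column, logging each step."""
    prepared = copy.deepcopy(rows)  # the original rows are never modified
    log = []

    observed = [r[column] for r in prepared if r[column] is not None]
    median = statistics.median(observed)
    filled = 0
    for r in prepared:
        if r[column] is None:
            r[column] = median
            filled += 1
    log.append({"step": "impute", "column": column, "method": "median",
                "value": median, "rows_affected": filled})

    lo = min(r[column] for r in prepared)
    hi = max(r[column] for r in prepared)
    span = hi - lo
    for r in prepared:
        r[column] = (r[column] - lo) / span if span else 0.0
    log.append({"step": "normalize", "column": column, "method": "min-max",
                "min": lo, "max": hi})
    return prepared, log


raw = [{"w": 2.0}, {"w": None}, {"w": 4.0}, {"w": 10.0}]
prepared, steps = prepare(raw, "w")
```

The step log records the imputed value and the scaling bounds, which is what someone would need to reapply the same transformation to new data.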

Deliverables:

  • Data preparation documentation (standalone for custom model systems; integrated into the system design document for API-based systems)
  • Prepared dataset with metadata

Data Provenance Management

Objective: Maintain lineage of data throughout lifecycle

For API-based Systems:

  • Document data flows and API usage
  • Record data transformations and processing steps

For Custom Models:

  • Creation: Original source, collection method, timestamp
  • Updates: All modifications with who/what/when/why
  • Transformations: Complete record of preparation steps
  • Usage: Which AI systems use this data

Implementation:

  • Maintain provenance in version-controlled documentation
  • Include provenance metadata with datasets
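A provenance record covering the four custom-model elements (creation, updates, transformations, usage) might be kept as a JSON sidecar file next to the dataset. The sketch below is one possible shape; the field names, the file name, and the use of SHA-256 content hashes are assumptions, not a format this procedure prescribes.

```python
import hashlib
import json
from datetime import datetime, timezone


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def new_provenance(data: bytes, source: str, collection_method: str) -> dict:
    """Start a provenance record at dataset creation."""
    return {
        "creation": {"source": source, "collection_method": collection_method,
                     "timestamp": _now(), "sha256": sha256_hex(data)},
        "updates": [],
        "transformations": [],
        "used_by": [],
    }


def record_update(prov: dict, data: bytes, who: str, what: str, why: str) -> None:
    """Append a who/what/when/why entry with the hash of the modified data."""
    prov["updates"].append({"who": who, "what": what, "when": _now(),
                            "why": why, "sha256": sha256_hex(data)})


raw = b"id,weight_kg\nA1,3.2\n"
prov = new_provenance(raw, source="hypothetical-export.csv",
                      collection_method="batch export")
cleaned = raw.replace(b"3.2", b"3.20")
record_update(prov, cleaned, who="data-engineer",
              what="standardized decimals", why="format consistency")
prov["used_by"].append("example-model")
sidecar = json.dumps(prov, indent=2)  # text that could be committed with the dataset
```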

Deliverables:

  • Data provenance records

Data Governance and Security

Objective: Ensure appropriate controls throughout data lifecycle

Security Controls:

  • Encryption at rest and in transit
  • Access control and authentication
  • Audit logging of data access
  • Secure disposal procedures

Governance Controls:

  • Regular review of data quality
  • Monitoring for data drift
  • Compliance verification
  • Stakeholder access management
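Monitoring for data drift needs some numeric measure of how far current data has moved from a baseline. One widely used option is the Population Stability Index (PSI), sketched below for a single numeric feature. The choice of PSI, the bin count, and the floor applied to empty bins are assumptions; this procedure does not prescribe a drift metric or alert threshold.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` against the `expected` baseline."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [float(v) for v in range(100)]
shifted = [v + 30 for v in baseline]
no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

Identical distributions give a PSI of zero, and the value grows as the distributions diverge. Any alerting threshold would need to be set per system.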

Integration with AI Development Process

Phase 1 – Problem Definition:

  • Define data requirements
  • Assess data availability
  • Include data risks in risk assessment

Phase 2 – Data Collection and Preparation:

  • Execute this Data Management Process
  • Generate required documentation
  • Validate data quality

Phase 3 – Training and Development:

  • Use prepared data with full provenance
  • Document data splits (train/validation/test)
  • Monitor data usage
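Documenting data splits is simpler when the split itself is reproducible. The sketch below assigns each record to train, validation, or test by hashing its identifier, so the assignment does not depend on record order or a random seed. The 80/10/10 proportions and identifier format are hypothetical.

```python
import hashlib
from collections import Counter


def assign_split(record_id: str, train: float = 0.8, validation: float = 0.1) -> str:
    """Deterministically map a record ID to 'train', 'validation', or 'test'."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


ids = [f"rec-{i}" for i in range(1000)]  # hypothetical identifiers
counts = Counter(assign_split(i) for i in ids)
```

Because a record's split depends only on its ID, adding new records later never moves existing records between splits.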

Phase 6 – Monitoring:

  • Track data drift in production
  • Monitor ongoing data quality
  • Update data as needed

Data Quality Assessment

Quality Dimensions

| Dimension | Excellent (5) | Good (4) | Acceptable (3) | Poor (2) | Unacceptable (1) |
| ------------------ | ------------- | -------- | -------------- | -------- | ---------------- |
| Completeness | <1% missing values, all required fields present | 1-5% missing, minor gaps | 5-10% missing, manageable gaps | 10-25% missing, significant gaps | >25% missing, critical gaps |
| Accuracy | Verified against ground truth, <1% errors | High confidence, <5% errors | Reasonable accuracy, <10% errors | Questionable accuracy, 10-20% errors | Poor accuracy, >20% errors |
| Consistency | No contradictions, uniform formats | Minor format variations | Some inconsistencies, manageable | Notable inconsistencies | Severe inconsistencies |
| Timeliness | Real-time or <1 day old | <1 week old | <1 month old | <6 months old | >6 months old |
| Relevance | Perfectly matched to use case | Highly relevant, minor gaps | Generally relevant | Partially relevant | Minimally relevant |
| Representativeness | Fully represents target domain | Good coverage, minor gaps | Adequate coverage | Limited coverage | Poor representation |
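The quantitative rows of the Quality Dimensions table (Completeness, Accuracy, Timeliness) can be turned into scoring functions, as in the sketch below. The table leaves some boundary values ambiguous, so the code makes assumptions that are flagged in comments: shared Completeness boundaries (5%, 10%) fall in the better band, a month is 30 days, and data exactly six months (180 days) old scores 1.

```python
from datetime import timedelta


def completeness_score(missing_pct: float) -> int:
    """Score from the percentage of missing values (assumption: 5% -> 4, 10% -> 3)."""
    if missing_pct < 1:
        return 5
    if missing_pct <= 5:
        return 4
    if missing_pct <= 10:
        return 3
    if missing_pct <= 25:
        return 2
    return 1


def accuracy_score(error_pct: float) -> int:
    """Score from the percentage of erroneous values."""
    if error_pct < 1:
        return 5
    if error_pct < 5:
        return 4
    if error_pct < 10:
        return 3
    if error_pct <= 20:
        return 2
    return 1


def timeliness_score(age: timedelta) -> int:
    """Score from data age (assumption: 30-day months, exactly 180 days scores 1)."""
    if age < timedelta(days=1):
        return 5
    if age < timedelta(weeks=1):
        return 4
    if age < timedelta(days=30):
        return 3
    if age < timedelta(days=180):
        return 2
    return 1
```

The qualitative dimensions (Consistency, Relevance, Representativeness) still require assessor judgment and are not covered here.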

Fitness for Use Assessment

  • Production Ready: All dimensions score 4-5; suitable for production use
  • Acceptable with Conditions: No dimension scores below 3 and at least one scores 3; usable with documented limitations
  • Requires Improvement: Lowest dimension score is 2; needs remediation before production use
  • Not Suitable: Any dimension scores 1; not suitable for AI use
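The rating rules above reduce to a check on the lowest dimension score, as the following sketch shows. It assumes that the most severe applicable rating wins, so a dataset with both a 1 and a 2 is rated Not Suitable. The function name and sample scores are hypothetical.

```python
def fitness_rating(scores: dict[str, int]) -> str:
    """Map the six dimension scores (1-5) to a fitness-for-use rating."""
    if not scores:
        raise ValueError("no scores provided")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("scores must be between 1 and 5")
    lowest = min(scores.values())
    if lowest == 1:
        return "Not Suitable"
    if lowest == 2:
        return "Requires Improvement"
    if lowest == 3:
        return "Acceptable with Conditions"
    return "Production Ready"


example = {
    "Completeness": 5, "Accuracy": 4, "Consistency": 4,
    "Timeliness": 3, "Relevance": 5, "Representativeness": 4,
}
rating = fitness_rating(example)
```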

Data Quality Report Template

# Data Quality Report

## Dataset Information

- **Dataset Name**: [Name]
- **Source**: [Source description]
- **Collection Period**: [Start date - End date]
- **Assessment Date**: [Date]
- **Assessor**: [Name/Role]

## Quality Assessment Summary

### Dimension Scores

| Dimension | Score | Issues Identified | Remediation Actions |
| ------------------ | ----- | ----------------- | ------------------- |
| Completeness | X/5 | [Issues] | [Actions] |
| Accuracy | X/5 | [Issues] | [Actions] |
| Consistency | X/5 | [Issues] | [Actions] |
| Timeliness | X/5 | [Issues] | [Actions] |
| Relevance | X/5 | [Issues] | [Actions] |
| Representativeness | X/5 | [Issues] | [Actions] |

## Statistical Profile

- **Total Records**: [Number]
- **Features/Columns**: [Number]
- **Data Types**: [List types and counts]
- **Missing Value Summary**: [Table or description]

## Key Findings

1. [Finding 1]
2. [Finding 2]
3. [Finding 3]

## Recommendations

- [Recommendation 1]
- [Recommendation 2]

## Fitness for Use Assessment

**Intended Use Case**: [Description]
**Fitness Rating**: [Production Ready/Acceptable with Conditions/Requires Improvement/Not Suitable]
**Conditions/Limitations**: [If applicable]

## Approval

- **Reviewed By**: [Name/Role]
- **Approved By**: [Principal AI Engineer]
- **Date**: [Date]

Roles and Responsibilities

| Role | Data Management Responsibilities |
| --------------------- | -------------------------------- |
| Principal AI Engineer | Approve data quality criteria; review quality reports; approve production data use |
| Data Engineers | Implement data pipelines; execute quality checks; maintain provenance records |
| AI/ML Engineers | Define data requirements; perform data preparation; document transformations |
| DevOps Team | Ensure secure data storage; implement access controls; monitor data pipelines |

Monitoring and Continuous Improvement

Regular Reviews

  • Monthly: Data quality metrics review
  • Quarterly: Process effectiveness assessment
  • Annually: Comprehensive process review

Key Metrics

  • Data quality assessment completion
  • Data preparation time
  • Data-related incidents
  • Provenance completeness

Related Documents

  • AI-013 AI System Development Lifecycle
  • AI-008 AI Risk Management Framework
  • AI-009 AI System Impact Assessment Process

Revision History

| Version | Date | Author | Summary of Change |
| ------- | ---------- | ------------- | ----------------- |
| 1.0 | 2025-06-05 | Field Bradley | Initial draft. |
| 1.1 | 2025-09-02 | Field Bradley | Migrated to Markdown and GitLab. |
| 1.2 | 2026-02-17 | Field Bradley | Added applicability by system type section clarifying deliverables for API-based vs custom model systems; updated related documents; fixed review date to quarterly schedule. |