Incident Response

Purpose

This procedure establishes the framework for responding to incidents involving AI systems within Rygen’s AI Management System (AIMS), in accordance with ISO/IEC 42001:2023 Clause 8.1 and Annex A.8.4. It ensures that AI incidents are properly identified, managed, and communicated, and that they are used to improve the AIMS.

Scope

This procedure applies to all incidents involving AI systems within the AIMS scope:

  • X1 Platform AI features
  • Corsair TMS AI capabilities
  • Internal AI tools and third-party AI services
  • AI development and deployment processes

AI Incident Types

An AI incident is any unplanned event or situation that affects or has the potential to affect:

  • System Availability: AI system outages, failures, or unavailability
  • Performance: Degradation below defined thresholds, accuracy issues
  • Trustworthiness: Bias, fairness issues, explainability failures, lack of human oversight
  • Data: Data quality issues, data integrity problems, unauthorized access
  • Security/Privacy: Breaches involving AI systems or training data
  • Compliance: Violations of regulations, policies, or contractual obligations
  • Safety: Potential harm to individuals or operations

Severity Classification

AI incidents are classified using the following severity levels, aligned with Rygen’s operational incident severity framework:

| Severity | AI Impact Examples | Response Time | Escalation |
|---|---|---|---|
| SEV1 (Critical) | Complete AI system outage affecting all users; major data breach or loss; critical safety issue; severe bias causing harm; regulatory violation with immediate impact | Immediate response required | CTO, Principal AI Engineer, relevant team leads |
| SEV2 (High) | Partial AI system outage or degradation affecting multiple users; performance below thresholds; data quality issues impacting decisions; moderate bias or fairness concerns; compliance concerns requiring investigation | Response within 1 hour | Principal AI Engineer, relevant team leads |
| SEV3 (Medium) | Minor user-visible issues; intermittent problems; non-critical performance degradation; documentation or process gaps | Response within 4 hours | Team lead, Principal AI Engineer notified |

Note: Issues below Medium severity (e.g., minor bugs, documentation gaps, minor performance variations within acceptable ranges) are tracked as standard Jira issues and do not trigger the incident response process. If uncertain about severity classification, consult with the relevant team lead or Principal AI Engineer.
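
The severity levels above can be represented as a small lookup structure, for example to drive paging or ticket automation. The following is a minimal sketch, not Rygen's actual tooling: the names `Severity`, `SeverityPolicy`, `POLICIES`, and `policy_for` are hypothetical, and "immediate" is modelled as a zero-length response window by assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "High"
    SEV3 = "Medium"


@dataclass(frozen=True)
class SeverityPolicy:
    # Assumption: timedelta(0) stands in for "immediate response required".
    response_within: timedelta
    escalate_to: tuple


# Values copied from the severity table; the structure itself is illustrative.
POLICIES = {
    Severity.SEV1: SeverityPolicy(
        timedelta(0),
        ("CTO", "Principal AI Engineer", "Relevant team leads"),
    ),
    Severity.SEV2: SeverityPolicy(
        timedelta(hours=1),
        ("Principal AI Engineer", "Relevant team leads"),
    ),
    Severity.SEV3: SeverityPolicy(
        timedelta(hours=4),
        ("Team lead", "Principal AI Engineer (notified)"),
    ),
}


def policy_for(severity: Optional[Severity]) -> Optional[SeverityPolicy]:
    """Return the policy for a severity, or None for issues below SEV3.

    Issues below Medium severity are tracked as standard Jira issues and
    do not trigger the incident response process.
    """
    if severity is None:
        return None
    return POLICIES[severity]
```

Passing `None` for anything below SEV3 keeps the "no incident process" case explicit instead of inventing a fourth severity level that the procedure does not define.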

Incident Response Process

Detection and Initial Response

Who: Any team member, automated monitoring system, or stakeholder submitting a report

Actions:

  1. Immediately notify the relevant system lead and Principal AI Engineer
  2. Assess initial severity and potential AI-specific impacts (trustworthiness, bias, safety)
  3. Initiate operational incident response per Rygen’s Engineering Incident Response Procedure
  4. For SEV1/SEV2: Ensure InfoSec Officer is engaged if security, privacy, or data concerns exist

Operational Execution

The operational response follows Rygen’s Engineering Incident Response Procedure, which includes:

  • Role assignments (Incident Commander, Technical Lead, Communications Lead)
  • Dedicated Slack incident channel
  • Jira incident ticket creation (DEVOPS project, Incident type)
  • Investigation and remediation
  • Stakeholder communication

AI-Specific Operational Considerations:

  • Containment: For AI systems, consider disabling affected features or reverting to non-AI fallbacks if available
  • Evidence Preservation: Capture model versions, configuration states, input/output samples, and relevant logs before making changes
  • Bias/Fairness Issues: Involve diverse stakeholders in assessment; consider immediate mitigation even if root cause is unclear
  • Data Issues: Assess scope of affected data and downstream impacts on decisions or operations
  • Third-Party AI Services: Engage vendor support; document service provider response and resolution
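
The evidence preservation step can be sketched as a single manifest captured before any change is made. This is an illustration under stated assumptions, not part of the Engineering Incident Response Procedure: the function names, the field names, and the choice of a SHA-256 digest over canonical JSON are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_evidence_manifest(incident_id, model_version, config, io_samples, log_paths):
    """Bundle incident evidence and attach a digest for later integrity checks.

    All field names are illustrative; adapt them to the real ticket schema.
    """
    payload = {
        "incident_id": incident_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "config": config,          # configuration state at capture time
        "io_samples": io_samples,  # representative input/output pairs
        "log_paths": log_paths,    # where the relevant logs were copied
    }
    # sort_keys gives a stable byte representation, so the digest is reproducible.
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return {"payload": payload, "sha256": hashlib.sha256(canonical).hexdigest()}


def verify_evidence_manifest(manifest):
    """Return True if the payload still matches the digest recorded at capture."""
    canonical = json.dumps(manifest["payload"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest() == manifest["sha256"]
```

Attaching the resulting manifest to the Jira incident ticket would let the postmortem confirm that the captured state was not altered during remediation.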

Communication Requirements

Communication follows the timelines and stakeholder groups defined in AI-007 Communication Policy:

SEV1 (Critical):

  • CEO/Board: Immediate notification via email for urgent items
  • CTO: Immediate notification via Slack
  • Affected Clients: Within 4 hours via support tickets and email
  • End Users: Within 60 minutes via Beamer (in-app notifications)

SEV2 (High):

  • CTO: Immediate notification via Slack
  • Affected Clients: Within 24 hours via email and support tickets
  • End Users: Within 60 minutes via Beamer if user-facing impact

SEV3 (Medium):

  • CTO: By next business day via Slack, or in the monthly report, as appropriate
  • Clients: As needed based on impact
  • End Users: As needed via help documentation or support portal
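
The communication matrix above can be turned into concrete notification deadlines. The sketch below assumes the clock starts when the incident is declared, which this procedure does not state explicitly; `COMMS` and `notification_deadlines` are hypothetical names, and entries without a fixed window ("as needed", "next business day") are represented as `None`.

```python
from datetime import datetime, timedelta

# (audience, channel, window). A window of None means no fixed clock applies.
COMMS = {
    "SEV1": [
        ("CEO/Board", "email", timedelta(0)),
        ("CTO", "Slack", timedelta(0)),
        ("Affected Clients", "support tickets and email", timedelta(hours=4)),
        ("End Users", "Beamer", timedelta(minutes=60)),
    ],
    "SEV2": [
        ("CTO", "Slack", timedelta(0)),
        ("Affected Clients", "email and support tickets", timedelta(hours=24)),
        ("End Users", "Beamer (if user-facing impact)", timedelta(minutes=60)),
    ],
    "SEV3": [
        ("CTO", "monthly reports or Slack", None),
        ("Clients", "as needed based on impact", None),
        ("End Users", "help documentation or support portal", None),
    ],
}


def notification_deadlines(severity, declared_at):
    """Return (audience, channel, deadline) tuples for one incident.

    Assumption: windows are measured from the time the incident is declared.
    """
    return [
        (audience, channel, declared_at + window if window is not None else None)
        for audience, channel, window in COMMS[severity]
    ]
```

For a SEV1 declared at 09:00, this yields a 10:00 deadline for end users and a 13:00 deadline for affected clients, with the CEO/Board and CTO due immediately.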

Post-Incident Activities

Postmortem (Required for SEV1/SEV2, Optional for SEV3):

Within 5 business days, complete a postmortem using Rygen’s standard template…
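
As an illustration of the 5-business-day window, the due date can be computed by skipping weekends. This is a minimal sketch: `add_business_days` is a hypothetical helper, it ignores public holidays, and the start date it counts from (for example, the resolution date) is an assumption, since the procedure does not name one.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date that falls `days` business days after `start`.

    Only Saturdays and Sundays are skipped; public holidays are not handled.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


# Example: counting from Tuesday 2025-01-14, the postmortem is due Tuesday 2025-01-21.
postmortem_due = add_business_days(date(2025, 1, 14), 5)
```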

AIMS Integration:

Within the postmortem or follow-up actions:

  1. Nonconformity Assessment: Determine if incident represents a nonconformity per AI-014
    • Did we fail to follow an AIMS procedure?
    • Did we miss a required control?
    • Did we violate a policy or objective?
    • If yes: Create NCR per AI-014; NCR process handles root cause analysis and corrective actions
    • If no: Proceed with items 2-5 below
  2. Risk Register Update: Document new risks identified or update existing risk assessments
  3. Corrective Actions: Create Jira tasks for technical remediation (if not handled by NCR)
  4. Lessons Learned: Update relevant AIMS documentation
  5. Management Review Input: Include incident summary in next quarterly review

Follow-Up Tracking:

  • All remediation tasks tracked in Jira with assigned owners and due dates
  • Principal AI Engineer reviews completion and effectiveness
  • Incident formally closed only after corrective actions are verified effective

Roles and Responsibilities

Principal AI Engineer

  • Overall accountability for AI incident response effectiveness
  • Ensures AIMS integration activities are completed
  • Reviews and approves significant corrective actions
  • Reports incidents to CTO and AI Governance Committee
  • Updates risk register and drives process improvements

Incident Commander (Operational Role)

  • Leads incident coordination and resolution
  • Ensures AI-specific considerations are addressed
  • Coordinates with Principal AI Engineer on AIMS integration

Technical Lead (Operational Role)

  • Investigates technical root cause
  • Implements fixes and validates resolution
  • Provides technical input for postmortem

Communications Lead (Operational Role)

  • Manages stakeholder communication per AI-007 requirements
  • Coordinates with Principal AI Engineer on AI-specific messaging

InfoSec Officer

  • Assesses security, privacy, and compliance impact
  • Determines if incident requires regulatory notification
  • Ensures security controls are updated as needed

Team Leads

  • Support incident response in their domain
  • Implement corrective actions
  • Ensure team learning from incidents

Metrics and Monitoring

The following metrics are tracked and reported in management reviews:

  • Number of AI incidents by severity and type
  • Mean time to detect (MTTD) and mean time to resolve (MTTR)
  • Percentage of incidents with completed postmortems
  • Percentage of corrective actions completed on time
  • Repeat incidents (same root cause)
  • Stakeholder satisfaction with incident communication
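
Several of these metrics can be derived directly from incident timestamps. The sketch below is one possible calculation, not a prescribed method: the `IncidentRecord` fields are hypothetical, MTTD is taken as detection time minus start time, MTTR as resolution time minus detection time (definitions vary between organizations), and postmortem completion is measured only over SEV1/SEV2 incidents, where postmortems are required.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    severity: str            # "SEV1", "SEV2", or "SEV3"
    started_at: datetime     # when the problem began
    detected_at: datetime    # when it was noticed
    resolved_at: datetime    # when it was resolved
    postmortem_done: bool


def mean_delta(deltas):
    """Average a sequence of timedeltas; None when there is nothing to average."""
    deltas = list(deltas)
    return sum(deltas, timedelta(0)) / len(deltas) if deltas else None


def summarize(incidents):
    """Compute MTTD, MTTR, and postmortem completion for a reporting period."""
    required = [i for i in incidents if i.severity in ("SEV1", "SEV2")]
    return {
        "mttd": mean_delta(i.detected_at - i.started_at for i in incidents),
        "mttr": mean_delta(i.resolved_at - i.detected_at for i in incidents),
        "postmortem_pct": (
            100 * sum(i.postmortem_done for i in required) / len(required)
            if required else None
        ),
    }
```

Returning `None` for an empty period avoids reporting a misleading zero in the management review.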

Integration with Other AIMS Processes

  • Risk Management (AI-008): Incidents inform risk identification and assessment
  • Nonconformity (AI-014): Incidents may trigger NCRs and corrective action process
  • Communication Policy (AI-007): Defines stakeholder communication requirements
  • Management Review (AI-012): Incident trends and lessons learned reported quarterly
  • Monitoring (AI-017): Incident data contributes to AIMS performance evaluation

Operational Implementation

This procedure is implemented operationally through Rygen’s Engineering Incident Response Procedure, which provides detailed operational guidance on:

  • Incident detection and initial response
  • Role assignments and coordination
  • Communication channels and tools
  • Postmortem documentation
  • Follow-up tracking

Review and Improvement

This procedure is reviewed:

  • Annually as part of scheduled review
  • After any SEV1 incident
  • When Rygen’s Engineering Incident Response Procedure is updated
  • When audit findings or lessons learned suggest improvements

Revision History

| Version | Date | Author | Summary of Change |
|---|---|---|---|
| 1.0 | 2025-01-14 | Field Bradley | Initial version integrating AIMS requirements with operational incident response |