Incident Response Automation
Auto-triage alerts, analyze root cause with AI, and remediate known issues
Advanced · IT Operations
Components Used
http-request, json_transform, generator, router, trycatch, logger
The Problem
Alert fires at 3am. The on-call engineer:
- Wakes up, checks phone (2 min)
- Opens laptop, VPNs in (5 min)
- Reads alert, tries to understand context (5 min)
- SSHs to server, checks logs (10 min)
- Realizes it’s a memory leak they’ve seen before (5 min)
- Restarts the service (2 min)
- Verifies it’s healthy (5 min)
- Goes back to sleep, forgets to document (0 min)
Total: 34 minutes. For a known issue with a documented fix.
Pain Points We’re Solving
- Context gathering - Manually correlating logs, metrics, deploys
- Repeated incidents - Same issue, same fix, nobody automated it
- Alert noise - Woken up for issues that auto-resolve
- Missing runbooks - The fix is in someone’s head, not documented
Thinking Process
What if the pipeline could handle the first 25 minutes automatically?
flowchart TB
subgraph Auto["Automated (2 min)"]
A1["Receive alert"]
A2["Fetch context (logs, metrics, deploys)"]
A3["AI root cause analysis"]
A4["Match to known issues"]
A5["Execute remediation"]
end
subgraph Human["Human (if needed)"]
H1["Review AI analysis"]
H2["Decide on action"]
H3["Execute fix"]
end
Auto --> |"known issue"| Success
Auto --> |"unknown issue"| Human
Human --> Success
Key Insight: Encode Runbooks as Pipelines
Every runbook reduces to “if you see symptoms X, take actions Y.” That’s a condition plus actions, which is exactly what a FlowMason pipeline expresses, as sketched below.
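For example, the runbook line “if the service hit an OOM, restart it” maps to a router route pointing at an action stage; this is the same route that appears in full in Stage 6:
{
  "name": "oom-restart",
  "condition": "{{stages.analyze-incident.output.issue_type == 'OOM' and stages.analyze-incident.output.confidence == 'high'}}",
  "stages": ["auto-restart-service"]
}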
Solution Architecture
flowchart TB
subgraph Alert["Alert Received"]
I1["PagerDuty webhook"]
I2["Alert details"]
end
subgraph Context["Gather Context"]
C1["Fetch logs (last 30 min)"]
C2["Fetch metrics"]
C3["Recent deploys"]
C4["Service dependencies"]
end
subgraph Analysis["AI Analysis"]
A1["Root cause identification"]
A2["Severity assessment"]
A3["Match known patterns"]
end
subgraph Action["Take Action"]
R1["Known issue → Auto-remediate"]
R2["Unknown → Escalate to human"]
end
subgraph Notify["Update & Notify"]
N1["Update status page"]
N2["Notify team"]
N3["Create postmortem draft"]
end
Alert --> Context
Context --> Analysis
Analysis --> Action
Action --> Notify
Pipeline Stages
Stage 1: Parse Incoming Alert
{
"id": "receive-alert",
"component": "json_transform",
"config": {
"data": "{{input}}",
"expression": "{alert_id: id, service: payload.source, severity: payload.severity, summary: payload.summary, triggered_at: payload.triggered_at, custom_details: payload.custom_details}"
}
}
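Given the sample input shown at the end of this guide, this stage's output would look like:
{
  "alert_id": "incident-12345",
  "service": "payment-service",
  "severity": "critical",
  "summary": "payment-service: High memory usage (>90%)",
  "triggered_at": "2024-01-15T03:42:00Z",
  "custom_details": {
    "memory_percent": 94,
    "pod": "payment-service-abc123",
    "namespace": "production"
  }
}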
Stages 2-4: Fetch Context (Parallel)
All three requests depend only on receive-alert, so they run concurrently.
{
"id": "fetch-logs",
"component": "http-request",
"depends_on": ["receive-alert"],
"config": {
"url": "{{input.logging_api}}/query",
"method": "POST",
"headers": {
"Authorization": "Bearer {{secrets.LOGGING_TOKEN}}"
},
"body": {
"query": "service:{{stages.receive-alert.output.service}} level:error",
"from": "-30m",
"limit": 100
},
"timeout": 15000
}
}
{
"id": "fetch-metrics",
"component": "http-request",
"depends_on": ["receive-alert"],
"config": {
"url": "{{input.metrics_api}}/query_range",
"method": "POST",
"body": {
"query": "up{service=\"{{stages.receive-alert.output.service}}\"}",
"start": "{{now() - 1800}}",
"end": "{{now()}}",
"step": "60"
},
"timeout": 10000
}
}
{
"id": "fetch-recent-deploys",
"component": "http-request",
"depends_on": ["receive-alert"],
"config": {
"url": "{{input.deploy_api}}/deployments",
"method": "GET",
"query_params": {
"service": "{{stages.receive-alert.output.service}}",
"since": "-24h",
"limit": 5
},
"timeout": 10000
}
}
Stage 5: AI Root Cause Analysis
{
"id": "analyze-incident",
"component": "generator",
"depends_on": ["fetch-logs", "fetch-metrics", "fetch-recent-deploys"],
"config": {
"model": "gpt-4",
"temperature": 0.2,
"system_prompt": "You are an expert SRE performing incident triage. Analyze the provided data and determine:\n1. Root cause (be specific)\n2. Severity (critical/high/medium/low)\n3. Is this a known issue pattern? (OOM, connection pool, rate limit, etc.)\n4. Recommended immediate action\n5. Confidence level (high/medium/low)\n\nFormat as JSON.",
"prompt": "Incident triage for {{stages.receive-alert.output.service}}:\n\nAlert: {{stages.receive-alert.output.summary}}\nSeverity: {{stages.receive-alert.output.severity}}\n\n## Recent Logs (errors):\n{{stages.fetch-logs.output.body.logs | tojson}}\n\n## Metrics (last 30 min):\n{{stages.fetch-metrics.output.body | tojson}}\n\n## Recent Deployments:\n{{stages.fetch-recent-deploys.output.body | tojson}}\n\nProvide your analysis:"
}
}
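The router and notification stages downstream reference fields like issue_type, root_cause, recommended_action, and confidence, so the model is expected to return JSON roughly in this shape (how the model's text is parsed into structured output depends on your generator settings; the values here are illustrative and mirror the Expected Output below):
{
  "root_cause": "Memory leak in payment processing loop - OOM pattern detected",
  "severity": "critical",
  "issue_type": "OOM",
  "recommended_action": "Restart the service to reclaim memory",
  "confidence": "high"
}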
Stage 6: Route by Issue Type
{
"id": "determine-action",
"component": "router",
"depends_on": ["analyze-incident"],
"config": {
"routes": [
{
"name": "oom-restart",
"condition": "{{stages.analyze-incident.output.issue_type == 'OOM' and stages.analyze-incident.output.confidence == 'high'}}",
"stages": ["auto-restart-service"]
},
{
"name": "connection-pool",
"condition": "{{stages.analyze-incident.output.issue_type == 'connection_pool_exhausted'}}",
"stages": ["auto-scale-connections"]
},
{
"name": "recent-deploy",
"condition": "{{stages.analyze-incident.output.issue_type == 'bad_deploy' and stages.fetch-recent-deploys.output.body[0].age_minutes < 60}}",
"stages": ["auto-rollback"]
},
{
"name": "unknown",
"condition": "{{true}}",
"stages": ["escalate-to-human"]
}
]
}
}
Stage 7: Auto-Remediation with Safety
{
"id": "auto-remediate",
"component": "trycatch",
"depends_on": ["determine-action"],
"config": {
"try": ["execute-remediation", "verify-health"],
"catch": ["remediation-failed-escalate"],
"finally": ["log-remediation-attempt"]
}
}
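Here execute-remediation stands in for whichever remediation stage the router selected, and remediation-failed-escalate can follow the same pattern as escalate-to-human in Stage 10. The log-remediation-attempt stage referenced in the finally block isn't shown elsewhere; a minimal sketch using the logger component, with assumed config field names, might look like:
{
  "id": "log-remediation-attempt",
  "component": "logger",
  "config": {
    "level": "info",
    "message": "Remediation attempted for {{stages.receive-alert.output.service}}: {{stages.analyze-incident.output.issue_type}} ({{stages.analyze-incident.output.confidence}} confidence)"
  }
}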
Stage 8: Execute Restart
This restart works by scaling the deployment to zero, then restoring the original replica count in the follow-up stage.
{
"id": "auto-restart-service",
"component": "http-request",
"depends_on": ["determine-action"],
"config": {
"url": "{{input.k8s_api}}/apis/apps/v1/namespaces/production/deployments/{{stages.receive-alert.output.service}}/scale",
"method": "PATCH",
"headers": {
"Authorization": "Bearer {{secrets.K8S_TOKEN}}",
"Content-Type": "application/strategic-merge-patch+json"
},
"body": {
"spec": {
"replicas": 0
}
},
"timeout": 30000
}
}
{
  "id": "restore-replicas",
  "component": "http-request",
  "depends_on": ["auto-restart-service"],
  "config": {
    "url": "{{input.k8s_api}}/apis/apps/v1/namespaces/production/deployments/{{stages.receive-alert.output.service}}/scale",
    "method": "PATCH",
    "headers": {
      "Authorization": "Bearer {{secrets.K8S_TOKEN}}",
      "Content-Type": "application/strategic-merge-patch+json"
    },
    "body": {
      "spec": {
        "replicas": "{{stages.receive-alert.output.custom_details.original_replicas | default(3)}}"
      }
    }
  }
}
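Scaling to zero causes a brief full outage while new pods come up. If a rolling restart is acceptable, an alternative is to patch the pod template annotation on the deployment itself, the same mechanism kubectl rollout restart uses; a sketch under that assumption:
{
  "id": "rolling-restart-service",
  "component": "http-request",
  "depends_on": ["determine-action"],
  "config": {
    "url": "{{input.k8s_api}}/apis/apps/v1/namespaces/production/deployments/{{stages.receive-alert.output.service}}",
    "method": "PATCH",
    "headers": {
      "Authorization": "Bearer {{secrets.K8S_TOKEN}}",
      "Content-Type": "application/strategic-merge-patch+json"
    },
    "body": {
      "spec": {
        "template": {
          "metadata": {
            "annotations": {
              "kubectl.kubernetes.io/restartedAt": "{{now()}}"
            }
          }
        }
      }
    },
    "timeout": 30000
  }
}
This keeps some replicas serving traffic while pods are replaced one by one.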
Stage 9: Verify Fix
{
"id": "verify-health",
"component": "http-request",
"depends_on": ["restore-replicas"],
"config": {
"url": "https://{{stages.receive-alert.output.service}}.internal/health",
"method": "GET",
"timeout": 10000,
"retry": {
"max_attempts": 6,
"delay_seconds": 10,
"condition": "{{output.status_code != 200}}"
}
}
}
Stage 10: Escalate to Human
When auto-fix isn’t possible:
{
"id": "escalate-to-human",
"component": "http-request",
"depends_on": ["determine-action"],
"config": {
"url": "{{secrets.PAGERDUTY_EVENTS_URL}}",
"method": "POST",
"body": {
"routing_key": "{{secrets.PAGERDUTY_KEY}}",
"event_action": "trigger",
"payload": {
"summary": "[NEEDS HUMAN] {{stages.receive-alert.output.summary}}",
"severity": "{{stages.receive-alert.output.severity}}",
"custom_details": {
"ai_analysis": "{{stages.analyze-incident.output.root_cause}}",
"recommended_action": "{{stages.analyze-incident.output.recommended_action}}",
"confidence": "{{stages.analyze-incident.output.confidence}}",
"why_not_auto": "{{stages.analyze-incident.output.issue_type}} not in auto-remediation playbook",
"logs_summary": "{{stages.fetch-logs.output.body.logs | length}} error logs in last 30 min",
"recent_deploy": "{{stages.fetch-recent-deploys.output.body[0].version}} ({{stages.fetch-recent-deploys.output.body[0].age_minutes}} min ago)"
}
}
}
}
}
Stage 11: Update Status Page
{
"id": "update-status",
"component": "http-request",
"depends_on": ["verify-health"],
"config": {
"url": "{{input.statuspage_api}}/incidents",
"method": "POST",
"headers": {
"Authorization": "OAuth {{secrets.STATUSPAGE_TOKEN}}"
},
"body": {
"incident": {
"name": "{{stages.receive-alert.output.service}} - {{stages.analyze-incident.output.issue_type}}",
"status": "resolved",
"body": "Auto-remediated by FlowMason. Root cause: {{stages.analyze-incident.output.root_cause}}. Action taken: {{stages.analyze-incident.output.recommended_action}}.",
"component_ids": ["{{input.statuspage_component_id}}"]
}
}
}
}
Stage 12: Notify Team
{
"id": "notify-team",
"component": "http-request",
"depends_on": ["verify-health"],
"config": {
"url": "{{secrets.SLACK_WEBHOOK}}",
"method": "POST",
"body": {
"blocks": [
{
"type": "header",
"text": {
"type": "plain_text",
"text": "Incident Auto-Resolved"
}
},
{
"type": "section",
"fields": [
{ "type": "mrkdwn", "text": "*Service:*\n{{stages.receive-alert.output.service}}" },
{ "type": "mrkdwn", "text": "*Issue:*\n{{stages.analyze-incident.output.issue_type}}" },
{ "type": "mrkdwn", "text": "*Root Cause:*\n{{stages.analyze-incident.output.root_cause}}" },
{ "type": "mrkdwn", "text": "*Action Taken:*\nService restarted" },
{ "type": "mrkdwn", "text": "*Resolution Time:*\n{{execution.duration_ms / 1000 | round(1)}}s" }
]
}
]
}
}
}
Stage 13: Generate Postmortem Draft
{
"id": "create-postmortem",
"component": "generator",
"depends_on": ["notify-team"],
"config": {
"model": "gpt-4",
"temperature": 0.3,
"system_prompt": "Generate a concise incident postmortem draft with: Summary, Timeline, Root Cause, Resolution, Action Items. Be factual and specific.",
"prompt": "Create postmortem for:\n\nService: {{stages.receive-alert.output.service}}\nIncident: {{stages.receive-alert.output.summary}}\nRoot Cause: {{stages.analyze-incident.output.root_cause}}\nTriggered: {{stages.receive-alert.output.triggered_at}}\nResolved: {{now()}}\nAction Taken: {{stages.analyze-incident.output.recommended_action}}\nAuto-resolved: Yes\n\nRecent deploys: {{stages.fetch-recent-deploys.output.body | tojson}}"
}
}
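The Expected Output later in this guide says the draft is saved to Confluence, but no stage for that is shown above. A sketch of such a stage, assuming the Confluence Cloud REST API, hypothetical input fields confluence_api and confluence_space_key, and that the generator exposes its text as output.text:
{
  "id": "save-postmortem",
  "component": "http-request",
  "depends_on": ["create-postmortem"],
  "config": {
    "url": "{{input.confluence_api}}/rest/api/content",
    "method": "POST",
    "headers": {
      "Authorization": "Bearer {{secrets.CONFLUENCE_TOKEN}}",
      "Content-Type": "application/json"
    },
    "body": {
      "type": "page",
      "title": "Postmortem: {{stages.receive-alert.output.service}} - {{stages.receive-alert.output.triggered_at}}",
      "space": { "key": "{{input.confluence_space_key}}" },
      "body": {
        "storage": {
          "value": "{{stages.create-postmortem.output.text}}",
          "representation": "storage"
        }
      }
    },
    "timeout": 15000
  }
}
The storage representation expects Confluence storage format, so the generated draft may need conversion before posting.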
Execution Timeline
gantt
title Incident Response Timeline (Auto-Resolved)
dateFormat X
axisFormat %L
section Alert
receive-alert :0, 100
section Context (parallel)
fetch-logs :100, 5000
fetch-metrics :100, 3000
fetch-deploys :100, 2000
section Analysis
analyze-incident :5000, 8000
section Action
determine-action :8000, 8100
auto-restart :8100, 15000
verify-health :15000, 75000
section Notify
update-status :75000, 76000
notify-team :75000, 76000
create-postmortem :76000, 79000
Total: ~79 seconds vs 34 minutes manual.
Sample Input
{
"id": "incident-12345",
"payload": {
"source": "payment-service",
"severity": "critical",
"summary": "payment-service: High memory usage (>90%)",
"triggered_at": "2024-01-15T03:42:00Z",
"custom_details": {
"memory_percent": 94,
"pod": "payment-service-abc123",
"namespace": "production"
}
},
"logging_api": "https://logs.internal.company.com",
"metrics_api": "https://prometheus.internal.company.com",
"deploy_api": "https://deploy.internal.company.com",
"k8s_api": "https://kubernetes.internal.company.com"
}
Expected Output
{
"incident_id": "incident-12345",
"service": "payment-service",
"resolution": {
"type": "auto-remediated",
"action": "service_restart",
"duration_seconds": 79,
"success": true
},
"analysis": {
"root_cause": "Memory leak in payment processing loop - OOM pattern detected",
"issue_type": "OOM",
"confidence": "high",
"supporting_evidence": [
"Memory grew from 45% to 94% over 2 hours",
"No recent deploys (last deploy 3 days ago)",
"Pattern matches previous OOM incidents"
]
},
"notifications": {
"status_page": "updated",
"slack": "sent",
"pagerduty": "resolved"
},
"postmortem_draft": "Generated and saved to Confluence"
}
Key Learnings
1. Runbook-to-Pipeline Pattern
| Runbook Step | Pipeline Stage |
|---|---|
| “Check memory usage” | fetch-metrics |
| “Look at recent logs” | fetch-logs |
| “Check recent deploys” | fetch-recent-deploys |
| “If OOM, restart service” | router → auto-restart |
| “Verify fix worked” | verify-health (retry) |
| “Update status page” | update-status |
2. Safety Guardrails
- Confidence threshold: Only auto-remediate with high confidence
- Known issues only: Unknown patterns escalate to humans
- Verify before closing: Health check confirms fix worked
- Always notify: Team knows what happened, even if auto-fixed
3. AI Analysis Value
The AI doesn’t just say “restart it”; it provides:
- Root cause hypothesis
- Supporting evidence from logs/metrics
- Confidence level for decision making
- Context for the human if escalation needed
Try It Yourself
# Test with a simulated alert
fm run pipelines/devops-incident-response.pipeline.json \
--input inputs/sample-alert.json