Operational runbook

Incident Response

Entry point for any FlowMason incident. Pick the matching playbook below; each follows discovery → contain → diagnose → recover → post-mortem.

Severity matrix

Sev	Definition	Response time
SEV-1	All Org Chat / pipelines down OR data exfiltration suspected	< 15 min
SEV-2	Major surface degraded (one provider down, sustained errors > 10%)	< 1 h
SEV-3	Specific workflow broken or perf regression	< 1 day
SEV-4	Cosmetic, telemetry-only, single-user issue	next business day

Universal first 5 minutes

Confirm scope. Telemetry dashboard → error rate trend.
Snapshot state. Capture for post-mortem:

# SOQL has no hour-granularity date literal: replace <since-utc> with an
# ISO-8601 UTC timestamp one hour back, e.g. 2025-01-01T12:00:00Z.
sf data query --target-org <org> \
  --query "SELECT Status__c, COUNT(Id) c FROM PipelineExecution__c \
           WHERE LastModifiedDate >= <since-utc> GROUP BY Status__c"
sf data query --target-org <org> \
  --query "SELECT Action__c, COUNT(Id) c FROM FM_Run_Audit__c \
           WHERE CreatedDate >= <since-utc> GROUP BY Action__c"

Check vendor status if LLM-related. Anthropic / OpenAI / Bedrock / Azure status pages.
Decide kill-switch posture. See § Kill switches.

Playbook 1. Provider outage

Symptoms. All LLM stages return error; specific provider 5xx / timeout in stage logs; __meta.providerAttempts climbing.

Diagnose

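// SOQL has no hour-granularity date literal, so bind an explicit cutoff.
Datetime since = Datetime.now().addHours(-1);
List<Pipeline_Stage_Log__c> recent = [
  SELECT Stage_Type__c, Provider_Response__c, Error_Message__c
  FROM Pipeline_Stage_Log__c
  WHERE Status__c = 'Failed'
    AND Started_At__c >= :since
    AND Stage_Type__c LIKE 'llm_%'
  ORDER BY Started_At__c DESC
  LIMIT 50
];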

Contain

Set the provider fallback chain in FM_Config.defaultProvider, or per pipeline in providerFallback.
Disable the dead provider's Named Credential (NC) so the router skips it immediately.

Recover

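// Resume executions that failed in the last hour. SOQL has no
// hour-granularity date literal, so bind an explicit cutoff.
// If resume() enqueues a Queueable per call, run this in batches:
// a synchronous transaction can enqueue at most 50 Queueable jobs.
Datetime since = Datetime.now().addHours(-1);
for (PipelineExecution__c r : [
  SELECT ExecutionId__c FROM PipelineExecution__c
  WHERE Status__c = 'Failed' AND LastModifiedDate >= :since
]) {
  PipelineQueueable.resume(r.ExecutionId__c);
}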

Playbook 2. Rate-limit storm

Symptoms. Vendor returns 429 repeatedly; per-user orgChatMaxTurnsPerMinutePerUser rejections climbing.

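// Aggregate queries return AggregateResult rows, not SObject rows.
// SOQL has no hour-granularity date literal, so bind an explicit cutoff.
Datetime since = Datetime.now().addHours(-1);
List<AggregateResult> rl = [
  SELECT Actor_Id__c, Error_Code__c, COUNT(Id) c
  FROM FM_Run_Audit__c
  WHERE Error_Code__c LIKE '%RATE_LIMIT%'
    AND CreatedDate >= :since
  GROUP BY Actor_Id__c, Error_Code__c
  ORDER BY COUNT(Id) DESC
];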

One specific user. Investigate; possible compromised account.
Org-wide. Vendor caps too low. Increase quota or switch provider.
Specific pipeline looping. Set pipelineMaxIterations lower to fail fast.
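
The first branch of this triage can be checked mechanically. The sketch below is anonymous Apex under stated assumptions: it relies only on the FM_Run_Audit__c fields already used in this playbook, and the 80% cutoff (singleUserShare) is a hypothetical threshold, not a FlowMason setting. It separates the single-user case from everything else; telling an org-wide vendor cap from a looping pipeline still needs a look at the stage logs.

```apex
// Hypothetical threshold: call it a single-user storm when one actor
// accounts for at least 80% of rate-limit rejections in the window.
Decimal singleUserShare = 0.8;
Datetime since = Datetime.now().addHours(-1);

// Count rate-limit rejections per actor over the last hour.
Map<String, Integer> perActor = new Map<String, Integer>();
Integer total = 0;
for (AggregateResult ar : [
  SELECT Actor_Id__c actor, COUNT(Id) c
  FROM FM_Run_Audit__c
  WHERE Error_Code__c LIKE '%RATE_LIMIT%'
    AND CreatedDate >= :since
  GROUP BY Actor_Id__c
]) {
  Integer n = (Integer) ar.get('c');
  perActor.put(String.valueOf(ar.get('actor')), n);
  total += n;
}

// Find the actor with the most rejections.
String topActor;
Integer topCount = 0;
for (String actor : perActor.keySet()) {
  if (perActor.get(actor) > topCount) {
    topActor = actor;
    topCount = perActor.get(actor);
  }
}

if (total == 0) {
  System.debug('No rate-limit rejections in the window.');
} else if (topCount >= total * singleUserShare) {
  System.debug('Likely single user: ' + topActor + ' (' + topCount + '/' + total + ')');
} else {
  System.debug('Spread across ' + perActor.size() + ' actors: likely org-wide or a looping pipeline.');
}
```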

Playbook 3. Cache wedge / stale state

Symptoms. New MDT values not taking effect; tool-calling stuck returning old results; local.FMLLMCache unresponsive.

Cache.OrgPartition p = Cache.Org.getPartition('local.FMLLMCache');
System.debug(p.getKeys());

// Force config refresh
FMConfig.refresh();

// Heavy hammer: clear all FMLLMCache keys (loses tool-calling thread state)
for (String k : p.getKeys()) {
  p.remove(k);
}

Threads re-bootstrap on the next turn. Users still see a continuous conversation because the LWC preserves the transcript.
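
When only part of the cache is suspect, a narrower clear avoids dropping every tool-calling thread. This is a minimal sketch, assuming the stale entries share a key prefix. The prefix 'cfg' is hypothetical, so inspect the output of p.getKeys() for the real key naming before running it.

```apex
Cache.OrgPartition p = Cache.Org.getPartition('local.FMLLMCache');

// Hypothetical prefix; replace with the one seen in p.getKeys().
String prefix = 'cfg';

Integer removed = 0;
for (String k : p.getKeys()) {
  // remove() returns true only when the key was actually deleted.
  if (k.startsWith(prefix) && p.remove(k)) {
    removed++;
  }
}
System.debug('Removed ' + removed + ' keys; ' + p.getKeys().size() + ' remain.');
```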

Playbook 4. Suspected prompt injection / data exfil

Symptoms. Unusual SOQL patterns in FM_Run_Audit__c; user report of unexpected data; spike in org_chat_turn_error events.

Discovery

// SOQL has no hour-granularity date literal, so bind an explicit cutoff.
Datetime since = Datetime.now().addHours(-6);
List<FM_Run_Audit__c> suspect = [
  SELECT Actor_Id__c, Detail__c, CreatedDate
  FROM FM_Run_Audit__c
  WHERE Action__c LIKE 'org_chat%'
    AND CreatedDate >= :since
    AND Detail__c LIKE '%LIMIT 9999%'
  ORDER BY CreatedDate DESC
  LIMIT 200
];

Indicators: LIMIT near platform max; SOQL referencing previously-unused SObjects from one user; repeated validator-rejection spikes from one user; FM_Org_Chat_Dml_Audit__c rows mass-targeting one SObject.
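
Two of these indicators (activity concentrated in one user, and DML mass-targeting one SObject) lend themselves to aggregate queries. The sketch below uses only objects and fields named in this playbook, and assumes Record_Count__c is a numeric field so that SUM() applies. The six-hour window and the top-10 cutoff are arbitrary choices for illustration.

```apex
Datetime since = Datetime.now().addHours(-6);

// Top actors by Org Chat audit volume: one user far ahead of the rest is a signal.
for (AggregateResult ar : [
  SELECT Actor_Id__c actor, COUNT(Id) c
  FROM FM_Run_Audit__c
  WHERE Action__c LIKE 'org_chat%'
    AND CreatedDate >= :since
  GROUP BY Actor_Id__c
  ORDER BY COUNT(Id) DESC
  LIMIT 10
]) {
  System.debug(String.valueOf(ar.get('actor')) + ': ' + ar.get('c') + ' org_chat audit rows');
}

// Records touched per actor and SObject: a single large pair suggests mass-targeting.
for (AggregateResult ar : [
  SELECT Actor_Id__c actor, Sobject_Type__c sobjectType, SUM(Record_Count__c) total
  FROM FM_Org_Chat_Dml_Audit__c
  WHERE CreatedDate >= :since
  GROUP BY Actor_Id__c, Sobject_Type__c
  ORDER BY SUM(Record_Count__c) DESC
  LIMIT 10
]) {
  System.debug(String.valueOf(ar.get('actor')) + ' -> ' + ar.get('sobjectType')
    + ': ' + ar.get('total') + ' records');
}
```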

Contain (immediate)

List<PermissionSetAssignment> psa = [
  SELECT Id FROM PermissionSetAssignment
  WHERE AssigneeId = :suspectUserId
    AND PermissionSet.Name IN ('FlowMason_Org_Chat_User', 'FlowMason_Org_Chat_Dml_User')
];
delete psa;

Org-wide kill if scope unclear: FM_Config.orgChatSurfacesEnabled = none.

Diagnose

Confirm the validator wasn't bypassed (it shouldn't have been: FmOrgChatController.runTurn blocks DML keywords and multi-statement input via FMSoqlValidator), then review the suspect user's DML audit rows:

List<FM_Org_Chat_Dml_Audit__c> dml = [
  SELECT Operation__c, Sobject_Type__c, Record_Count__c, Diff_Json__c, Filter_Used__c
  FROM FM_Org_Chat_Dml_Audit__c
  WHERE Actor_Id__c = :suspectUserId
    AND CreatedDate = LAST_N_DAYS:7
];

Recover

Validator rejected → user blocked, no data egress. Retain logs; reinstate after review.
DML fired but legitimate → user-confirmed via two-step modal; audit row is the proof. Decide based on business context.
DML fired AND not legitimate → security incident. Rotate Named Credential keys, run forensic SOQL on FM_Org_Chat_Dml_Audit__c to bound damage, restore from backup if needed.

Playbook 5. Inventory harvester failure

See Inventory Harvester runbook § F1-F5.

Playbook 6. Circuit-breaker buffer fills

Symptoms. FM_Circuit_Queue__c row count climbing; FM_Circuit_Dead_Letter__e events firing.

Quick contain: temporarily raise CIRCUIT_DRAIN_BATCH_SIZE and CIRCUIT_DRAIN_MAX_ATTEMPTS, then investigate the failing downstream system.
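
To tell whether the buffer is still filling after the change, sample the queue depth and the age of its oldest row. This is a minimal sketch that relies only on the FM_Circuit_Queue__c object named above and standard fields. Run it twice a few minutes apart and compare.

```apex
// One aggregate row: current depth plus the creation time of the oldest queued entry.
AggregateResult ar = [
  SELECT COUNT(Id) c, MIN(CreatedDate) oldest
  FROM FM_Circuit_Queue__c
];
System.debug('FM_Circuit_Queue__c depth: ' + ar.get('c')
  + ', oldest row created: ' + ar.get('oldest'));
```

A falling depth with a recent "oldest" timestamp suggests the drain is catching up; a rising depth with an old timestamp points back at the downstream failure.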

Playbook 7. Telemetry pipeline broken

Symptoms. FM_Run_Audit__c row count flat despite chat / pipeline activity.

Setup → Apex Triggers → FlowMasonRunSubscriber. Is Active = true?
Check trigger debug log for handler exceptions.
Test publish:

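// Publish a test event and surface any publish-time errors.
Database.SaveResult sr = EventBus.publish(new FlowMasonRun__e(
  Action__c = 'incident_test',
  Actor_Id__c = UserInfo.getUserId(),
  Pipeline_Id__c = 'test',
  Detail__c = ''
));
if (!sr.isSuccess()) {
  for (Database.Error e : sr.getErrors()) {
    System.debug('Publish error: ' + e.getStatusCode() + ' - ' + e.getMessage());
  }
}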
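
Once the test event is published, a follow-up query can confirm that the subscriber wrote it through. This sketch assumes FlowMasonRunSubscriber copies Action__c onto the FM_Run_Audit__c row it creates, which is inferred from the symptom above rather than confirmed. Run it as a separate anonymous Apex execution a few seconds after the publish, since platform event triggers run asynchronously.

```apex
// Count audit rows produced by today's incident_test events.
Integer landed = [
  SELECT COUNT()
  FROM FM_Run_Audit__c
  WHERE Action__c = 'incident_test'
    AND CreatedDate = TODAY
];
System.debug(landed > 0
  ? 'Subscriber is writing audit rows (' + landed + ' found).'
  : 'No audit row yet: subscriber inactive, failing, or still processing.');
```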

Kill switches (priority order)

Switch	Effect
`orgChatSurfacesEnabled = none`	Org Chat completely off
`orgChatDmlEnabled = false`	DML refused; reads still work
`orgChatToolCallingEnabled = false`	Single-shot path; less power, less risk
Deactivate Named Credential	Specific provider out
Revoke `FlowMason_Org_Chat_User` permset	Per-user lockout
Abort cron `FM Org Inventory Nightly`	Stops nightly Tooling-API egress

All take effect on next FMConfig.refresh() (runs automatically per request).
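
The last switch in the table can be applied from anonymous Apex. This is a minimal sketch using the standard CronTrigger object and System.abortJob, with the job name taken from the table. Aborting removes the schedule entirely, so the job has to be scheduled again once the incident is closed.

```apex
// Find and abort every scheduled instance of the nightly inventory job.
for (CronTrigger ct : [
  SELECT Id, State, NextFireTime
  FROM CronTrigger
  WHERE CronJobDetail.Name = 'FM Org Inventory Nightly'
]) {
  System.debug('Aborting ' + ct.Id + ' (state ' + ct.State
    + ', next fire ' + ct.NextFireTime + ')');
  System.abortJob(ct.Id);
}
```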

Post-mortem template

After any SEV-1 / SEV-2:

Timeline. Minute-by-minute, from first signal to all-clear.
Root cause. One sentence.
Detection. How was it caught? How fast? Could it be faster?
Mitigation. What stopped the bleeding?
Recovery. When did normal service resume?
Customer impact. Users affected, requests failed, data compromised (if any).
Action items. Fix, test, doc, runbook, alert. Each owned + dated.
Lessons learned. What's the one thing we'd change?
