Incident Response
Entry point for any FlowMason incident. Pick the matching playbook below; each follows discovery → contain → diagnose → recover → post-mortem.
Severity matrix
| Sev | Definition | Response time |
|---|---|---|
| SEV-1 | All Org Chat / pipelines down OR data exfiltration suspected | < 15 min |
| SEV-2 | Major surface degraded (one provider down, sustained errors > 10%) | < 1 h |
| SEV-3 | Specific workflow broken or perf regression | < 1 day |
| SEV-4 | Cosmetic, telemetry-only, single-user issue | next business day |
Universal first 5 minutes
- Confirm scope. Telemetry dashboard → error rate trend.
- Snapshot state. Capture for post-mortem. Replace <since> with a UTC timestamp one hour back (for example 2025-01-31T13:00:00Z); SOQL has no hour-granularity date literal:
sf data query --target-org <org> \
  --query "SELECT Status__c, COUNT(Id) c FROM PipelineExecution__c \
  WHERE LastModifiedDate >= <since> GROUP BY Status__c"
sf data query --target-org <org> \
  --query "SELECT Action__c, COUNT(Id) c FROM FM_Run_Audit__c \
  WHERE CreatedDate >= <since> GROUP BY Action__c"
- Check vendor status if LLM-related. Anthropic / OpenAI / Bedrock / Azure status pages.
- Decide kill-switch posture. See § Kill switches.
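To help place the incident on the severity matrix, the snapshot counts can be turned into a rough failure rate. The following is a minimal sketch for Execute Anonymous, assuming that `Status__c = 'Failed'` marks a failed execution (as the Recover query in Playbook 1 does) and that pipeline failure rate is an acceptable stand-in for the "sustained errors > 10%" criterion; neither assumption is confirmed by this runbook.

```apex
// Sketch: failure rate of pipeline executions touched in the last hour.
Datetime since = Datetime.now().addHours(-1);
Integer total = 0;
Integer failed = 0;
for (AggregateResult ar : [
    SELECT Status__c, COUNT(Id) c
    FROM PipelineExecution__c
    WHERE LastModifiedDate >= :since
    GROUP BY Status__c
]) {
    Integer c = (Integer) ar.get('c');
    total += c;
    // ASSUMPTION: 'Failed' is the status value that marks a failed execution.
    if (String.valueOf(ar.get('Status__c')) == 'Failed') {
        failed += c;
    }
}
Decimal ratePct = 0;
if (total > 0) {
    Decimal f = failed;
    ratePct = (f / total) * 100;
}
System.debug('Executions: ' + total + ', failed: ' + failed
    + ', failure rate: ' + ratePct.setScale(1) + '%');
// 10% mirrors the SEV-2 row of the severity matrix.
System.debug(ratePct > 10 ? 'At or above SEV-2 territory' : 'Below the SEV-2 error threshold');
```

A single one-hour window cannot show whether the errors are sustained, so it is worth re-running the snippet a few minutes apart before settling on a severity.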
Playbook 1. Provider outage
Symptoms. All LLM stages return error; specific provider 5xx / timeout in stage logs; __meta.providerAttempts climbing.
Diagnose
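// Look back one hour. SOQL has no hour-granularity date literal, so bind a Datetime.
Datetime since = Datetime.now().addHours(-1);
List<Pipeline_Stage_Log__c> recent = [
    SELECT Stage_Type__c, Provider_Response__c, Error_Message__c
    FROM Pipeline_Stage_Log__c
    WHERE Status__c = 'Failed'
    AND Started_At__c >= :since
    AND Stage_Type__c LIKE 'llm_%'
    LIMIT 50
];
Contain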
- Set provider fallback chain in FM_Config.defaultProvider or per-pipeline providerFallback.
- Disable the dead provider's Named Credential (NC) so the router skips it immediately.
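Before changing the fallback chain, it can help to confirm that the failures really cluster on one provider. The sketch below tallies the last hour's failed LLM stage logs by stage type and error message. It uses only the object and fields from the Diagnose query; the 80-character truncation and the 200-row cap are illustrative choices, not FlowMason settings.

```apex
// Sketch: group recent failed LLM stage logs so a dominant error stands out.
Datetime since = Datetime.now().addHours(-1);
Map<String, Integer> tally = new Map<String, Integer>();
for (Pipeline_Stage_Log__c log : [
    SELECT Stage_Type__c, Error_Message__c
    FROM Pipeline_Stage_Log__c
    WHERE Status__c = 'Failed'
      AND Started_At__c >= :since
      AND Stage_Type__c LIKE 'llm_%'
    LIMIT 200
]) {
    // Truncate long messages so near-identical errors share one key.
    String msg = String.isBlank(log.Error_Message__c)
        ? '(no message)'
        : log.Error_Message__c.left(80);
    String key = log.Stage_Type__c + ' | ' + msg;
    Integer n = tally.get(key);
    tally.put(key, n == null ? 1 : n + 1);
}
for (String key : tally.keySet()) {
    System.debug(tally.get(key) + ' x ' + key);
}
```

If one key accounts for nearly all rows, that points to a single failing provider; a wide spread of messages suggests the problem lies elsewhere.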
Recover
// Resume every execution that failed in the last hour.
Datetime since = Datetime.now().addHours(-1);
for (PipelineExecution__c r : [
    SELECT ExecutionId__c FROM PipelineExecution__c
    WHERE Status__c = 'Failed' AND LastModifiedDate >= :since
]) {
    PipelineQueueable.resume(r.ExecutionId__c);
}
Playbook 2. Rate-limit storm
Symptoms. Vendor returns 429 repeatedly; per-user orgChatMaxTurnsPerMinutePerUser rejections climbing.
Datetime since = Datetime.now().addHours(-1);
// GROUP BY queries return AggregateResult rows, not FM_Run_Audit__c records.
List<AggregateResult> rl = [
    SELECT Actor_Id__c, Error_Code__c, COUNT(Id) c
    FROM FM_Run_Audit__c
    WHERE Error_Code__c LIKE '%RATE_LIMIT%'
    AND CreatedDate >= :since
    GROUP BY Actor_Id__c, Error_Code__c
];
- One specific user. Investigate; possible compromised account.
- Org-wide. Vendor caps too low. Increase quota or switch provider.
- Specific pipeline looping. Set pipelineMaxIterations lower to fail fast.
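The first two branches above can be told apart by checking how concentrated the rejections are. The following sketch computes the share of rate-limit rows that belongs to the busiest actor. The 80% cut-off is a hypothetical value chosen for illustration; it is not a FlowMason setting.

```apex
// Sketch: is the rate-limit storm driven by one actor or spread org-wide?
Datetime since = Datetime.now().addHours(-1);
Integer total = 0;
Integer topCount = 0;
String topActor;
for (AggregateResult ar : [
    SELECT Actor_Id__c, COUNT(Id) c
    FROM FM_Run_Audit__c
    WHERE Error_Code__c LIKE '%RATE_LIMIT%'
      AND CreatedDate >= :since
    GROUP BY Actor_Id__c
]) {
    Integer c = (Integer) ar.get('c');
    total += c;
    if (c > topCount) {
        topCount = c;
        topActor = String.valueOf(ar.get('Actor_Id__c'));
    }
}
Decimal share = 0;
if (total > 0) {
    Decimal top = topCount;
    share = top / total;
}
// ASSUMPTION: 0.8 is an illustrative threshold, tune it to the org's traffic.
if (total == 0) {
    System.debug('No rate-limit rows in the last hour.');
} else if (share >= 0.8) {
    System.debug('Concentrated on one actor: ' + topActor + ' (' + topCount + ' of ' + total + ')');
} else {
    System.debug('Spread across actors; top actor holds ' + topCount + ' of ' + total);
}
```

A looping pipeline can also show up as a single actor here if it runs under one integration user, so check which pipelines that actor triggered before treating the account as compromised.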
Playbook 3. Cache wedge / stale state
Symptoms. New MDT values not taking effect; tool-calling stuck returning old results; local.FMLLMCache unresponsive.
Cache.OrgPartition p = Cache.Org.getPartition('local.FMLLMCache');
System.debug(p.getKeys());
// Force config refresh
FMConfig.refresh();
// Heavy hammer: clear all FMLLMCache keys (loses tool-calling thread state)
for (String k : p.getKeys()) {
    p.remove(k);
}
Threads re-bootstrap on the next turn. Users still see continuity because the LWC preserves the transcript.
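Before reaching for the heavy hammer, a round-trip probe can show whether the partition is responding at all. This is a minimal sketch that uses the standard Platform Cache API; the key name `fmIncidentProbe` is hypothetical and chosen only to avoid colliding with real FlowMason keys.

```apex
// Sketch: write, read back, and remove a throwaway key in the partition.
String probeKey = 'fmIncidentProbe'; // hypothetical key; cache keys must be alphanumeric
String stamp = String.valueOf(Datetime.now().getTime());
try {
    Cache.OrgPartition p = Cache.Org.getPartition('local.FMLLMCache');
    p.put(probeKey, stamp, 300); // 300 seconds is the minimum TTL for org cache
    Object readBack = p.get(probeKey);
    System.debug('Round trip ok: ' + stamp.equals((String) readBack));
    System.debug('Partition capacity used (%): ' + p.getCapacity());
    p.remove(probeKey);
} catch (Exception e) {
    System.debug(LoggingLevel.ERROR, 'Partition probe failed: ' + e.getMessage());
}
```

A failed round trip with no exception usually means the value was evicted or the partition has no allocated capacity, which is a different problem from stale keys and will not be fixed by clearing the cache.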
Playbook 4. Suspected prompt injection / data exfil
Symptoms. Unusual SOQL patterns in FM_Run_Audit__c; user report of unexpected data; spike in org_chat_turn_error events.
Discovery
// Look back six hours. SOQL has no hour-granularity date literal, so bind a Datetime.
Datetime since = Datetime.now().addHours(-6);
List<FM_Run_Audit__c> suspect = [
    SELECT Actor_Id__c, Detail__c, CreatedDate
    FROM FM_Run_Audit__c
    WHERE Action__c LIKE 'org_chat%'
    AND CreatedDate >= :since
    AND Detail__c LIKE '%LIMIT 9999%'
    ORDER BY CreatedDate DESC
    LIMIT 200
];
Indicators: LIMIT near platform max; SOQL referencing previously-unused SObjects from one user; repeated validator-rejection spikes from one user; FM_Org_Chat_Dml_Audit__c rows mass-targeting one SObject.
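The literal `LIMIT 9999` filter only catches one value. A broader pass can extract every LIMIT from the audited text and count large ones per actor. The sketch below assumes, as the Discovery query does, that `Detail__c` carries the SOQL text; the 5000 review threshold is a hypothetical value chosen for illustration.

```apex
// Sketch: count audited org-chat queries with a large LIMIT, per actor.
Integer reviewThreshold = 5000; // ASSUMPTION: illustrative cut-off, tune per org
Pattern limitPattern = Pattern.compile('(?i)\\bLIMIT\\s+(\\d+)');
Datetime since = Datetime.now().addHours(-6);
Map<String, Integer> hitsByActor = new Map<String, Integer>();
for (FM_Run_Audit__c row : [
    SELECT Actor_Id__c, Detail__c
    FROM FM_Run_Audit__c
    WHERE Action__c LIKE 'org_chat%'
      AND CreatedDate >= :since
    ORDER BY CreatedDate DESC
    LIMIT 200
]) {
    if (String.isBlank(row.Detail__c)) {
        continue;
    }
    Matcher m = limitPattern.matcher(row.Detail__c);
    Boolean large = false;
    while (m.find() && !large) {
        String digits = m.group(1);
        // More than 9 digits cannot fit an Integer and is large by any measure.
        large = digits.length() > 9 || Integer.valueOf(digits) >= reviewThreshold;
    }
    if (large) {
        String actor = String.valueOf(row.Actor_Id__c);
        Integer n = hitsByActor.get(actor);
        hitsByActor.put(actor, n == null ? 1 : n + 1);
    }
}
for (String actor : hitsByActor.keySet()) {
    System.debug(actor + ': ' + hitsByActor.get(actor) + ' large-LIMIT queries');
}
```

An actor with many hits is a lead to follow up on and proves nothing by itself; compare the count against that user's normal usage before containing.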
Contain (immediate)
// Revoke both Org Chat permission sets from the suspect user.
List<PermissionSetAssignment> psa = [
    SELECT Id FROM PermissionSetAssignment
    WHERE AssigneeId = :suspectUserId
    AND PermissionSet.Name IN ('FlowMason_Org_Chat_User', 'FlowMason_Org_Chat_Dml_User')
];
delete psa;
Org-wide kill if scope unclear: FM_Config.orgChatSurfacesEnabled = none.
Diagnose
Confirm validator wasn't bypassed (it shouldn't have been; FmOrgChatController.runTurn blocks DML keywords + multi-statement via FMSoqlValidator):
List<FM_Org_Chat_Dml_Audit__c> dml = [
    SELECT Operation__c, Sobject_Type__c, Record_Count__c, Diff_Json__c, Filter_Used__c
    FROM FM_Org_Chat_Dml_Audit__c
    WHERE Actor_Id__c = :suspectUserId
    AND CreatedDate = LAST_N_DAYS:7
];
Recover
- Validator rejected → user blocked, no data egress. Retain logs; reinstate after review.
- DML fired but legitimate → user-confirmed via two-step modal; audit row is the proof. Decide based on business context.
- DML fired AND not legitimate → security incident. Rotate Named Credential keys, run forensic SOQL on FM_Org_Chat_Dml_Audit__c to bound damage, restore from backup if needed.
Playbook 5. Inventory harvester failure
See Inventory Harvester runbook § F1-F5.
Playbook 6. Circuit-breaker buffer fills
Symptoms. FM_Circuit_Queue__c row count climbing; FM_Circuit_Dead_Letter__e events firing.
Quick contain: temporarily raise CIRCUIT_DRAIN_BATCH_SIZE and CIRCUIT_DRAIN_MAX_ATTEMPTS, then investigate the failing downstream system.
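To judge whether the contain step is working, the backlog can be measured before and after the change. This minimal sketch relies only on the standard `CreatedDate` field of `FM_Circuit_Queue__c`, since the object's custom fields are not described in this runbook.

```apex
// Sketch: backlog depth and age of the oldest queued row.
Integer depth = [SELECT COUNT() FROM FM_Circuit_Queue__c];
// An aggregate query without GROUP BY always returns exactly one row.
AggregateResult ar = [SELECT MIN(CreatedDate) oldest FROM FM_Circuit_Queue__c];
Datetime oldest = (Datetime) ar.get('oldest');
if (oldest == null) {
    System.debug('Queue is empty.');
} else {
    Long ageMinutes = (Datetime.now().getTime() - oldest.getTime()) / 60000;
    System.debug('Queue depth: ' + depth + ', oldest row is ' + ageMinutes + ' min old');
}
```

A falling depth with a falling oldest-row age suggests the drain is catching up. A stable depth with a growing age suggests the same rows are being retried without success.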
Playbook 7. Telemetry pipeline broken
Symptoms. FM_Run_Audit__c row count flat despite chat / pipeline activity.
- Setup → Apex Triggers → FlowMasonRunSubscriber: Active = true?
- Check trigger debug log for handler exceptions.
- Test publish:
EventBus.publish(new FlowMasonRun__e(
    Action__c = 'incident_test',
    Actor_Id__c = UserInfo.getUserId(),
    Pipeline_Id__c = 'test',
    Detail__c = ''
));
Kill switches (priority order)
| Switch | Effect |
|---|---|
| orgChatSurfacesEnabled = none | Org Chat completely off |
| orgChatDmlEnabled = false | DML refused; reads still work |
| orgChatToolCallingEnabled = false | Single-shot path; less power, less risk |
| Deactivate Named Credential | Specific provider out |
| Revoke FlowMason_Org_Chat_User permset | Per-user lockout |
| Abort cron FM Org Inventory Nightly | Stops nightly Tooling-API egress |
All take effect on next FMConfig.refresh() (runs automatically per request).
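The last switch in the table can be applied from Execute Anonymous. The sketch below uses the standard `CronTrigger` object and `System.abortJob`, and assumes the scheduled job is registered under exactly the name shown in the table.

```apex
// Sketch: abort the nightly inventory job by its scheduled-job name.
List<CronTrigger> jobs = [
    SELECT Id, State, NextFireTime
    FROM CronTrigger
    WHERE CronJobDetail.Name = 'FM Org Inventory Nightly'
];
if (jobs.isEmpty()) {
    System.debug('No scheduled job found under that name.');
}
for (CronTrigger ct : jobs) {
    System.debug('Aborting ' + ct.Id + ' (state ' + ct.State
        + ', next fire ' + ct.NextFireTime + ')');
    System.abortJob(ct.Id);
}
```

Aborting removes the schedule entirely, so the job has to be scheduled again once the incident is closed; record the cron expression first if it is not documented elsewhere.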
Post-mortem template
After any SEV-1 / SEV-2:
- Timeline. Minute-by-minute, from first signal to all-clear.
- Root cause. One sentence.
- Detection. How was it caught? How fast? Could it be faster?
- Mitigation. What stopped the bleeding?
- Recovery. When did normal service resume?
- Customer impact. Users affected, requests failed, data compromised (if any).
- Action items. Fix, test, doc, runbook, alert. Each owned + dated.
- Lessons learned. What's the one thing we'd change?