Full Walkthrough

What if you could rehearse your worst incident 50 times?

Follow along as we simulate a payment gateway outage at “ShopFlow” — a mid-size ecommerce platform — and watch in real time as the simulator reveals exactly where their incident response breaks down.

Every team has a process. Few teams know where it actually fails under pressure.

1. Model your org: teams, tools, policies
2. Define an incident: narrative, actors, chaos
3. Get answers: TTN, delays, fixes

Each layer is composable. Change one variable, re-run, measure impact.

Step 1: Model Your Organization

The simulator needs to understand how your team actually works — who's on call, what tools they use, which policies slow things down, and how your services depend on each other. Two ways to start:

Try the sample org (fastest)

One click sets up “ShopFlow” — a realistic mid-size ecommerce platform with 4 teams, 6 services, PagerDuty + Slack, and a ready-to-run scenario. Get to your first simulation in under a minute.

4 teams · 6 services · 8 roles · pre-built scenario

Model your own org

Configure your actual teams, on-call rotations, comm tools, policies, service graph, and detection signals. The simulator uses your real structure so the results map directly to your process.

Teams & rotations · Comm tools · Policies · Service graph

Here's what the ShopFlow sample org looks like inside the simulator:

ShopFlow — Sample Organization Model

Teams
• Payments: PST, mature
• Platform: PST, partial
• Frontend: PST+IST
• SRE: 24/7

Comm Tools
• PagerDuty: ack 3 min
• Slack: noise 0.3
• Statuspage: VP approval
• Email: legal review

Policies & Services
• SEV Declaration: on-call or IC
• Customer Notif: VP must approve
• Service Graph: 4 services
• Signals: 3 sources

At a glance: 4 teams · 4 comm tools · 6 services · 3 signal sources · 10 chaos cards · 12 situation states
Why this matters: Most incident response tools model runbooks. This models everything else — the human at 2am, the Slack message lost in noise, the VP who takes 8 minutes to approve a statuspage post. That's where the real delays hide.
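
Curious what that model looks like as data? Here is a minimal sketch of how an org like ShopFlow might be encoded, assuming a simple dataclass schema. Every class, field, and default below is hypothetical, illustrating the shape of the model rather than the simulator's actual format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Team:
    name: str
    timezone: str                    # where the on-call rotation lives
    coverage: Optional[str] = None   # e.g. "mature", "partial", "24/7"

@dataclass
class CommTool:
    name: str
    ack_delay_min: float = 0.0           # typical time for a page to be seen
    noise: float = 0.0                   # 0..1 chance a message gets lost
    approval_gate: Optional[str] = None  # e.g. "VP", "legal"

# A hypothetical encoding of the ShopFlow sample org shown above.
shopflow = {
    "teams": [
        Team("Payments", "PST", "mature"),
        Team("Platform", "PST", "partial"),
        Team("Frontend", "PST+IST"),
        Team("SRE", "global", "24/7"),
    ],
    "tools": [
        CommTool("PagerDuty", ack_delay_min=3.0),
        CommTool("Slack", noise=0.3),
        CommTool("Statuspage", approval_gate="VP"),
        CommTool("Email", approval_gate="legal"),
    ],
    "policies": {
        "sev_declaration": "on-call or IC",
        "customer_notification": "VP must approve",
    },
}
```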

Step 2: Define the Scenario

A partial payment gateway outage. PayCore's EU endpoint returns 503s, causing 60% of EU checkouts to fail. The US endpoint is fine, making initial diagnosis confusing.

Generate scenarios automatically (AI-powered)

Don't know where to start? Click “Generate with AI” and the system analyzes your org's topology — your teams, services, dependencies, and on-call schedules — to create a realistic incident scenario tailored to your architecture.

AI-generated scenarios include actors, chaos cards, and failure propagation paths that match your actual service graph.

Generate Scenario with AI
1. Load organization config: 4 teams, 6 services
2. Analyze service topology: 3 critical paths found
3. Select scenario parameters: SEV1, payments path
4. Generate with AI: Claude is writing...
5. Validate scenario: done
Failure Propagation
• PayCore EU: 503 errors (t+0 min)
• payments-api: 60% EU failures (~t+0 min)
• checkout: orders failing (t+2 min)
• Customers: EU merchants impacted
• fraud-service: degraded (soft dependency)
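
Structurally, the propagation path is a small weighted graph: each edge carries a delay and a dependency type, and walking the graph yields when each downstream service is first affected. A sketch under that assumption (node names match the diagram above; the encoding and the walk are illustrative, not the simulator's internals):

```python
from collections import defaultdict

# (upstream, downstream, delay in minutes, dependency type), per the path above
propagation = [
    ("PayCore EU", "payments-api", 0.0, "hard"),
    ("payments-api", "checkout", 2.0, "hard"),
    ("checkout", "customers", 0.0, "hard"),
    ("payments-api", "fraud-service", 0.0, "soft"),  # degrades rather than fails
]

def impact_times(edges, root):
    """Earliest time each service is affected, relative to the root failure."""
    children = defaultdict(list)
    for up, down, delay, _kind in edges:
        children[up].append((down, delay))
    times, frontier = {root: 0.0}, [root]
    while frontier:
        node = frontier.pop()
        for down, delay in children[node]:
            t = times[node] + delay
            if down not in times or t < times[down]:
                times[down] = t
                frontier.append(down)
    return times

print(impact_times(propagation, "PayCore EU"))
# {'PayCore EU': 0.0, 'payments-api': 0.0, 'checkout': 2.0,
#  'fraud-service': 0.0, 'customers': 2.0}
```
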
Confounding factors:
• Blended error rate is only 35%, below the auto-SEV1 threshold of 50%
• US endpoint is healthy, masking the EU-specific failure
• On-call is in PST, where it's 2am
• PayCore says 'all operational': its status page lags reality

Who responds — and when

• Detection: threshold alert, anomaly detector, support tickets
• Triage: Payments On-Call, Platform On-Call
• Escalation: Incident Commander, VP Engineering
• Notification: Comms Lead, Account Manager
Each actor is played by Claude with full situational context.

Chaos cards enabled:
• PayCore status page inaccurate (30%)
• Blended metrics mask severity (60%)
• On-call phone on silent (15%)
• EU metrics not on dashboard (40%)
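
Mechanically, a chaos card is a weighted coin flipped once per iteration. A sketch using the probabilities above (the delay each card adds is invented for illustration, except the 3-minute phone-on-silent delay, which appears in the event log below):

```python
import random

# (name, probability, extra delay in minutes if the card fires)
CHAOS_CARDS = [
    ("PayCore status page inaccurate", 0.30, 5.0),
    ("Blended metrics mask severity", 0.60, 4.0),
    ("On-call phone on silent", 0.15, 3.0),
    ("EU metrics not on dashboard", 0.40, 3.5),
]

def roll_chaos(rng):
    """Decide which chaos cards fire for one iteration of the simulation."""
    return [(name, delay) for name, prob, delay in CHAOS_CARDS
            if rng.random() < prob]

rng = random.Random(23)  # one seed per iteration keeps runs reproducible
for name, delay in roll_chaos(rng):
    print(f"chaos fired: {name} (+{delay} min)")
```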

Step 3: Watch It Unfold in Real Time

Hit “Run” and watch your incident play out live. Every actor makes decisions based on their context — what they can see, how alert they are, what tools they have, what policies bind them.

50 iterations. Each one different. Because real incidents are stochastic.

ShopFlow — Payment Gateway Outage (streaming)
Iteration 23 of 50 (46%) · Phase: Escalation · Sim time: 18.4 min · Events: 47
Event Timeline (horizontal swim lanes, 0-45 min; click any event for details)
• Signals: error rate, latency, tickets
• Payments OC: paged → wakes up → investigating → escalates
• Platform OC: paged → sees symptoms
• IC: joins → SEV1 → bridge
• VP Eng: Slack msg → approves
• Comms Lead: drafts statuspage → posted
• Account Mgr: calling partners → done

Event Log (47 events)
14:23:01 @ 2.0m · Signal: error_rate_5xx fires
EU endpoint 503 rate exceeds 30% threshold for 2 consecutive intervals

14:23:02 @ 2.1m · Payments OC: paged
PagerDuty high-urgency page sent. Actor is asleep (2:02am PST)

14:23:08 · AI · LLM call: Payments OC
Generating wake-up decision...

14:24:01 @ 6.3m · Payments OC: wakes up
Groggy; took 4.2 min to acknowledge the page and open a laptop

14:24:15 @ 9.0m · CHAOS · Chaos: phone on silent
Injector fired (15% probability). Platform OC page delayed by 3 min

14:25:30 @ 10.5m · Payments OC: investigating
Checking Datadog; sees elevated 5xx, but the blended rate looks OK...

14:26:45 @ 18.4m · IC: declares SEV1
Commander recognizes the EU-specific failure pattern after reviewing regional metrics

14:27:00 · Phase: Escalation complete
Duration: 5.2 min

Every event is AI-generated. Claude considers the actor's situation — asleep at 2am, groggy, looking at blended metrics — and produces realistic narratives with calibrated timing. The event log captures every decision and its context in real time.
Chaos happens mid-incident. Injectors fire stochastically — a phone on silent, a misleading metric, a key person on PTO. Each iteration faces different chaos, producing a distribution of outcomes instead of a single guess.
Watch it like a movie of your process. The swim lanes show the critical path in real time — who's blocked, who's waiting, where the bottleneck is right now. The event log tells you why. After 50 iterations, you have a statistical model of your incident response, not a single anecdote.
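
To see why 50 iterations yield a distribution rather than a single answer, here is a deliberately tiny stand-in for the run loop: a few fixed phases with noisy durations, plus chaos rolls. Every constant and distribution below is invented; the real engine replaces the random draws with per-actor LLM decisions:

```python
import random

def simulate_once(seed):
    """One toy iteration: minutes until customers are notified (TTN)."""
    rng = random.Random(seed)
    t = 2.0                            # signal fires ~2 min after impact
    t += rng.lognormvariate(1.2, 0.5)  # page, groggy wake-up, ack
    if rng.random() < 0.60:            # chaos: blended metrics mask severity
        t += rng.uniform(3.0, 6.0)     # extra time staring at dashboards
    if rng.random() < 0.15:            # chaos: phone on silent
        t += 3.0
    t += rng.uniform(4.0, 9.0)         # investigation, SEV1, bridge assembly
    t += rng.uniform(5.0, 12.0)        # VP approval for customer comms
    return t

customer_ttn = [simulate_once(seed) for seed in range(50)]
```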

Step 4: Analyze the Results

After 50 iterations, the engine aggregates the data. Here's what ShopFlow's numbers reveal — and more importantly, why:

TTN Radar (p90): minutes until each stakeholder group is notified
• Internal Engineering: 17 min
• Incident Commander: 22 min
• Leadership: 28 min
• Customers: 35 min
• Partners: 42 min

Key Metrics (aggregated across 50 iterations)
• TTN p50 (worst target): 28.3 min
• TTN p90 (worst target): 42.1 min
• TTN p95 (worst target): 48.7 min
• Time to IC on bridge: 22.4 min
• Time to bridge assembled: 25.1 min
• Severity flip-flops: 1.4 avg
• Misroutes: 0.8 avg
• SLA breaches (>30 min): 68%
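
Given per-iteration TTN samples, for instance the customer_ttn list from the toy loop above, the headline numbers are one-line aggregations:

```python
import statistics

def percentile(samples, q):
    """q-th percentile via statistics.quantiles with 100 cut points."""
    return statistics.quantiles(samples, n=100)[q - 1]

p50, p90, p95 = (percentile(customer_ttn, q) for q in (50, 90, 95))
breach_rate = sum(t > 30.0 for t in customer_ttn) / len(customer_ttn)
print(f"p50={p50:.1f}m  p90={p90:.1f}m  p95={p95:.1f}m  breaches={breach_rate:.0%}")
```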

Where are the minutes going?

Every delay on the critical path is tagged with a cause

• Policy Approval: 8.2 min (24%). VP must approve customer comms
• Human Decision: 6.1 min (18%). On-call at 2am, groggy wake-up
• External Dependency: 5.4 min (16%). Waiting for PayCore status page
• Tool Latency: 4.3 min (13%). PagerDuty ack + Slack delivery
• Coordination: 3.8 min (11%). Multi-team bridge overhead
• Ownership Ambiguity: 2.9 min (9%). PayCore gateway misrouted
• Chaos Events: 3.1 min (9%). EU metrics masked severity
68% of iterations breach the 30-minute SLA. The #1 delay cause isn't technology — it's a policy: VP approval for customer comms adds 8.2 minutes on average. You can't find this in a runbook audit.
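
The breakdown above falls out of a simple aggregation: every delay on the critical path carries a cause tag, and the report sums minutes per tag. A sketch (the event shape is hypothetical; the sample values echo the chart):

```python
from collections import Counter

# (minutes, cause) pairs harvested from one iteration's critical path
critical_path_delays = [
    (8.2, "policy approval"),      # VP sign-off on customer comms
    (6.1, "human decision"),       # groggy 2am wake-up
    (5.4, "external dependency"),  # waiting on PayCore's status page
    (4.3, "tool latency"),         # PagerDuty ack + Slack delivery
]

totals = Counter()
for minutes, cause in critical_path_delays:
    totals[cause] += minutes

grand_total = sum(totals.values())
for cause, minutes in totals.most_common():
    print(f"{cause}: {minutes:.1f} min ({minutes / grand_total:.0%})")
```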

Step 5: Get Ranked Interventions

The engine maps delay causes to concrete fixes, estimates impact from the simulation data, and ranks by expected TTN reduction. No opinions — just numbers.

Projected TTN p90 Reduction (cumulative)
• Current baseline: 42.1 min
• #1 Pre-approve statuspage: 35.3 min (-6.8)
• #2 EU-specific alerting: 31.1 min (-4.2)
• #3 Follow-the-sun on-call: 27.6 min (-3.5)
• #4 Auto-create bridge: 24.8 min (-2.8)
• #5 Clarify ownership: 22.7 min (-2.1)
• Projected after all 5: 22.7 min
#1 Pre-approve statuspage template for payment incidents
Policy · -6.8 min p90 · High
76% of iterations: VP approval added 5-12 min. Pre-approved templates eliminate this.

#2 Add EU-specific error rate dashboard + dedicated alert
Monitoring · -4.2 min p90 · High
40% of iterations: blended metrics masked EU severity. EU-only alerting fires 3-5 min earlier.

#3 Add secondary on-call to Payments (follow-the-sun)
Staffing · -3.5 min p90 · Medium
62% of iterations: the 2am wake-up added 4-8 min. An EU-timezone secondary cuts night latency.

One-click testing: Each intervention includes a run override config. Click “Test This” to re-run the same scenario with that change applied and compare results side-by-side.
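
Conceptually, a run override is just a patch applied to the org model before replaying the same seeds. A hypothetical example for intervention #1, reusing the shopflow dict from step 1's sketch (the keys and merge logic are illustrative):

```python
override = {
    "policies": {
        # Intervention #1: a pre-approved template removes the VP approval wait
        "customer_notification": "pre-approved template",
    },
}

def apply_override(org, patch):
    """Return a copy of the org config with the patch merged one level deep."""
    merged = dict(org)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(org.get(key), dict):
            merged[key] = {**org[key], **value}
        else:
            merged[key] = value
    return merged

patched = apply_override(shopflow, override)
# Re-run the same 50 seeds against `patched` and compare the two distributions.
```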

Step 6: Compare and Iterate

ShopFlow tests intervention #1 (pre-approved statuspage templates), re-runs the simulation, and compares the results side by side:

Before → After (TTN p90 by stakeholder)
• Internal Engineering: 17 min → 16.8 min
• Incident Commander: 22.4 min → 22.1 min
• Internal Leadership: 28.1 min → 22.5 min (-5.6 min)
• Customers (statuspage): 35 min → 27.8 min (-7.2 min)
• Enterprise Partners: 42.1 min → 34.5 min (-7.6 min)

Headline deltas:
• Customer TTN p90: 35.0 → 27.8 min (-7.2 min)
• SLA breach rate: 68% → 31% (-37.0 points)
• Policy delay: 8.2 → 1.1 min (-7.1 min)
A single policy change — pre-approving statuspage templates — cuts customer TTN p90 by 7 minutes and drops the SLA breach rate from 68% to 31%. ShopFlow takes this to the VP with concrete simulation data, not a gut feeling.
Role-Based Dashboards

Built for every role in the room

Different people need different views during and after an incident. Six tailored dashboards surface the right information at the right time.

On-Call Engineer
Metrics: page-to-ack time · investigation duration · escalation rate
Insight: your on-call acks in 4.2 min on average, but night pages take 3x longer. Consider follow-the-sun.

Incident Commander
Metrics: time to bridge · SEV accuracy · coordination overhead
Insight: bridge assembly takes 25 min on average. Auto-creating a Slack bridge at SEV1 saves 4.8 min.

Comms Lead
Metrics: time to first update · approval bottlenecks · draft iterations
Insight: 68% of comms delay is the VP approval wait. Pre-approved templates cut this to under 2 min.

Account Manager
Metrics: time to customer notify · partner call completion · coverage gaps
Insight: EU partners learn about outages 42 minutes in. That’s after they’ve already noticed.

VP / Leadership
Metrics: SLA breach rate · TTN distribution · investment priorities
Insight: one policy change (pre-approved templates) drops SLA breaches from 68% to 31%.

Simulation Owner
Metrics: iteration convergence · plausibility scores · LLM usage & cost
Insight: 50 iterations show a stable p90 by iteration 35. Plausibility scores average 87%.
Show your team what they need to see. The on-call engineer sees page response patterns. The VP sees breach rates and ROI. The comms lead sees where notification bottlenecks live. Same simulation data, six different stories.

Why this works

Runbook audits check if a document exists. Tabletop exercises capture one opinion. Simulation reveals what actually happens when the Slack message lands in a noisy channel at 2am.

Models the real world

  • Actors have context: asleep, fatigued, in a meeting
  • Tools have mechanics: latency, noise, outages
  • Policies have cost: approval chains add minutes
  • Chaos happens: injected disruptions fire stochastically

Quantifies the invisible

  • TTN is a vector, not a single number
  • Every delay is tagged with a cause category
  • Critical path shows which delays actually matter
  • p50/p90/p95 across N iterations, not one guess

Turns data into action

  • Ranked interventions with expected TTN reduction
  • One-click re-runs to test each fix
  • Side-by-side comparisons prove impact
  • Role-based dashboards tell each stakeholder their story

Stop guessing where your process breaks. Start measuring it.

Configure your org, define a scenario, and get your first TTN analysis in minutes. No agents to install, no infrastructure to manage.