What if you could rehearse your worst incident 50 times?
Follow along as we simulate a payment gateway outage at “ShopFlow” — a mid-size ecommerce platform — and watch in real time as the simulator reveals exactly where their incident response breaks down.
Every team has a process. Few teams know where it actually fails under pressure.
Each layer is composable. Change one variable, re-run, measure impact.
Model Your Organization
The simulator needs to understand how your team actually works — who's on call, what tools they use, which policies slow things down, and how your services depend on each other. Two ways to start:
Try the sample org
One click sets up “ShopFlow” — a realistic mid-size ecommerce platform with 4 teams, 6 services, PagerDuty + Slack, and a ready-to-run scenario. Get to your first simulation in under a minute.
Model your own org
Configure your actual teams, on-call rotations, comm tools, policies, service graph, and detection signals. The simulator uses your real structure so the results map directly to your process.
Here's what the ShopFlow sample org looks like inside the simulator:
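In data terms, a minimal sketch of that structure, written as plain Python (every field name and person here is illustrative, not the simulator's actual schema):

```python
# Hypothetical org model for ShopFlow: 4 teams, 6 services, PagerDuty + Slack.
# All keys and names are illustrative, not the simulator's real schema.
shopflow = {
    "teams": {
        "Payments": {"oncall": ["aisha", "ben"], "rotation": "weekly"},
        "Platform": {"oncall": ["chen", "dana"], "rotation": "weekly"},
        "Frontend": {"oncall": ["eli", "fen"],   "rotation": "weekly"},
        "SRE":      {"oncall": ["gus", "hana"],  "rotation": "weekly"},
    },
    "services": {  # depends_on edges form the service graph
        "checkout":        {"owner": "Frontend", "depends_on": ["payments-api", "catalog"]},
        "payments-api":    {"owner": "Payments", "depends_on": ["paycore-gateway"]},
        "paycore-gateway": {"owner": "Payments", "depends_on": []},
        "catalog":         {"owner": "Platform", "depends_on": ["search"]},
        "search":          {"owner": "Platform", "depends_on": []},
        "statuspage":      {"owner": "SRE",      "depends_on": []},
    },
    "tools":    {"paging": "PagerDuty", "chat": "Slack"},
    "policies": [{"name": "statuspage approval", "approver": "VP Eng"}],
}
```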
Define the Scenario
A partial payment gateway outage. PayCore's EU endpoint returns 503s, causing 60% of EU checkouts to fail. The US endpoint is fine, making initial diagnosis confusing.
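As data, the scenario fits in a handful of fields. A sketch (keys are hypothetical; the detection values match the timeline shown later):

```python
# Hypothetical scenario definition; field names are illustrative.
scenario = {
    "name": "paycore-eu-partial-outage",
    "fault": {
        "service": "paycore-gateway",
        "region":  "eu",
        "effect":  "http_503",
        "checkout_failure_rate": 0.60,  # EU checkouts only; US is unaffected
    },
    "detection": {
        "signal": "error_rate_5xx",
        "threshold": 0.30,           # 503 rate must exceed 30%...
        "consecutive_intervals": 2,  # ...for 2 intervals before firing
    },
    "start": "02:00 PST",  # a night page: the primary on-call is asleep
}
```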
Generate scenarios automatically
Don't know where to start? Click “Generate with AI” and the system analyzes your org's topology — your teams, services, dependencies, and on-call schedules — to create a realistic incident scenario tailored to your architecture.
AI-generated scenarios include actors, chaos cards, and failure propagation paths that match your actual service graph.
Who responds — and when
Chaos cards enabled
Watch It Unfold in Real Time
Hit “Run” and watch your incident play out live. Every actor makes decisions based on their context — what they can see, how alert they are, what tools they have, what policies bind them.
50 iterations. Each one different. Because real incidents are stochastic.
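Conceptually, the run loop is a small Monte Carlo simulation: sample each actor's behavior, roll the chaos cards, accumulate minutes, repeat. A minimal sketch, with Gaussian draws standing in for the per-actor LLM decisions the real engine makes (all numbers and helper names are assumptions):

```python
import random
import statistics

def simulate_once(scenario, rng):
    """One stochastic playthrough; returns minutes from first signal to SEV1."""
    t = scenario["consecutive_intervals"] * 1.0  # detection, assuming 1-min intervals
    # Page the on-call. A sleeping actor acknowledges slowly; the real engine
    # asks an LLM for this decision, a random draw stands in here.
    t += max(rng.gauss(4.0, 1.5) if scenario["night_page"] else rng.gauss(1.0, 0.5), 0.5)
    # Chaos cards fire stochastically and add delay when they do.
    for card in scenario["chaos_cards"]:
        if rng.random() < card["probability"]:
            t += card["delay_minutes"]
    # Investigation: blended metrics mask the EU-only failure in some runs.
    t += max(rng.gauss(8.0, 3.0) if rng.random() < 0.4 else rng.gauss(4.0, 1.5), 1.0)
    return t

rng = random.Random(7)
scenario = {"consecutive_intervals": 2, "night_page": True,
            "chaos_cards": [{"probability": 0.15, "delay_minutes": 3.0}]}
runs = [simulate_once(scenario, rng) for _ in range(50)]
print(f"p50={statistics.median(runs):.1f} min, "
      f"p90={statistics.quantiles(runs, n=10)[-1]:.1f} min")
```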
Signal: error_rate_5xx fires
EU endpoint 503 rate exceeds 30% threshold for 2 consecutive intervals
Payments OC: paged
PagerDuty high-urgency page sent. Actor is asleep (2:02am PST)
LLM call: Payments OC
Generating wake-up decision...
Payments OC: wakes up
Groggy, took 4.2 min to acknowledge page and open laptop
Chaos: Phone on silent
Injector fired (15% prob). Platform OC page delayed by 3 min
Payments OC: investigating
Checking Datadog, sees elevated 5xx but blended rate looks OK...
IC: declares SEV1
Commander recognizes EU-specific failure pattern after reviewing regional metrics
Phase: Escalation complete
Duration: 5.2 min
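The "Phone on silent" event above is a chaos card firing. A card is just a disruption with a probability and an effect, rolled independently on every iteration; a sketch (class and field names are hypothetical):

```python
import random
from dataclasses import dataclass

@dataclass
class ChaosCard:
    """A stochastic disruption: fires with some probability per iteration."""
    name: str
    probability: float    # chance this card fires in a given run
    delay_minutes: float  # minutes added to the affected step
    target: str           # which actor or step it affects

    def maybe_fire(self, rng: random.Random) -> float:
        """Return the delay this card contributes (0 if it does not fire)."""
        return self.delay_minutes if rng.random() < self.probability else 0.0

phone_on_silent = ChaosCard("Phone on silent", 0.15, 3.0, "Platform OC page")
rng = random.Random(42)
delays = [phone_on_silent.maybe_fire(rng) for _ in range(50)]
print(f"fired in {sum(d > 0 for d in delays)}/50 iterations")
```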
Analyze the Results
After 50 iterations, the engine aggregates the data. Here's what ShopFlow's numbers reveal — and more importantly, why:
TTN Radar (p90)
Minutes until each stakeholder group is notified (TTN = time to notify)
Key Metrics
Aggregated across 50 iterations
Where are the minutes going?
Every delay on the critical path is tagged with a cause
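Mechanically, the aggregation is straightforward: collect each iteration's per-group TTN vector, take percentiles per group, and tally the tagged delay causes. A sketch with illustrative record shapes and numbers:

```python
import statistics
from collections import defaultdict

# Illustrative record shapes; the real engine's per-iteration data will differ.
iterations = [
    {"ttn": {"engineering": 6.1, "leadership": 14.0, "customers": 22.5},
     "delays": [("approval_chain", 8.0), ("night_wakeup", 4.2)]},
    {"ttn": {"engineering": 5.2, "leadership": 11.3, "customers": 18.9},
     "delays": [("masked_metrics", 5.0)]},
    # ...48 more records in a real 50-iteration run
]

def p90(xs):
    return statistics.quantiles(xs, n=10)[-1]

# TTN is a vector: one summary per stakeholder group, not one number.
for group in iterations[0]["ttn"]:
    vals = [it["ttn"][group] for it in iterations]
    print(f"{group}: p50={statistics.median(vals):.1f} p90={p90(vals):.1f} min")

# Where are the minutes going? Tally every tagged delay cause across runs.
minutes_by_cause = defaultdict(float)
for it in iterations:
    for cause, mins in it["delays"]:
        minutes_by_cause[cause] += mins
print(sorted(minutes_by_cause.items(), key=lambda kv: -kv[1]))
```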
Get Ranked Interventions
The engine maps delay causes to concrete fixes, estimates impact from the simulation data, and ranks by expected TTN reduction. No opinions — just numbers. (A sketch of the ranking step follows the examples below.)
Pre-approve statuspage template for payment incidents
76% of iterations: VP approval added 5-12 min. Pre-approved templates eliminate this.
Add EU-specific error rate dashboard + dedicated alert
40% of iterations: blended metrics masked EU severity. EU-only alerting fires 3-5 min earlier.
Add secondary on-call to Payments (follow-the-sun)
62% of iterations: 2am wake-up added 4-8 min. EU-timezone secondary cuts night latency.
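A sketch of that ranking step, under the assumption that each delay cause maps to a known fix and expected savings is (fraction of iterations affected) times (average minutes added). The fix catalog and all numbers are illustrative:

```python
# Hypothetical fix catalog: delay cause -> candidate intervention.
fix_catalog = {
    "approval_chain": "Pre-approve statuspage template for payment incidents",
    "masked_metrics": "Add EU-specific error rate dashboard + dedicated alert",
    "night_wakeup":   "Add secondary on-call to Payments (follow-the-sun)",
}

# From the simulation data (illustrative numbers):
# (cause, fraction of iterations affected, avg minutes added when present)
observed = [
    ("approval_chain", 0.76, 8.5),
    ("night_wakeup",   0.62, 6.0),
    ("masked_metrics", 0.40, 4.0),
]

# Expected TTN reduction = fraction * average minutes; rank descending.
ranked = sorted(((fix_catalog[c], frac * mins) for c, frac, mins in observed),
                key=lambda pair: -pair[1])
for fix, saving in ranked:
    print(f"~{saving:.1f} min expected TTN reduction: {fix}")
```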
Compare and Iterate
ShopFlow applies intervention #1 (pre-approved statuspage templates), re-runs the simulation, and compares the results side by side:
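Under the hood, that comparison is just two percentile summaries diffed against each other. A sketch with placeholder TTN samples (minutes, purely illustrative):

```python
import statistics

def summarize(runs):
    """p50/p90 of one batch of iteration TTNs (minutes)."""
    return {"p50": statistics.median(runs),
            "p90": statistics.quantiles(runs, n=10)[-1]}

baseline     = [22.5, 19.1, 31.0, 24.4, 18.2, 27.9, 21.3, 25.6, 20.0, 29.4]
intervention = [14.2, 12.8, 19.5, 15.1, 11.9, 17.3, 13.4, 16.0, 12.2, 18.8]

before, after = summarize(baseline), summarize(intervention)
for k in ("p50", "p90"):
    print(f"{k}: {before[k]:.1f} -> {after[k]:.1f} min "
          f"(saved {before[k] - after[k]:.1f})")
```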
Built for every role in the room
Different people need different views during and after an incident. Six tailored dashboards surface the right information at the right time.
On-Call Engineer
Incident Commander
Comms Lead
Account Manager
VP / Leadership
Simulation Owner
Why this works
Runbook audits check if a document exists. Tabletop exercises capture one opinion. Simulation reveals what actually happens when the Slack message lands in a noisy channel at 2am.
Models the real world
- Actors have context: asleep, fatigued, in a meeting
- Tools have mechanics: latency, noise, outages
- Policies have cost: approval chains add minutes
- Chaos happens: injected disruptions fire stochastically (see the sketch below)
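One plausible way to represent the first three mechanics in code, using plain dataclasses (types and fields are illustrative only; chaos cards were sketched earlier):

```python
from dataclasses import dataclass

@dataclass
class ActorContext:          # actors have context
    asleep: bool = False
    fatigued: bool = False
    in_meeting: bool = False

@dataclass
class ToolMechanics:         # tools have mechanics
    latency_s: float = 0.0   # delivery delay
    miss_rate: float = 0.0   # chance a message is lost in the noise
    down: bool = False       # the tool itself can be out

@dataclass
class Policy:                # policies have cost
    name: str
    approval_minutes: float  # approval chains add minutes
```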
Quantifies the invisible
- TTN is a vector, not a single number
- Every delay is tagged with a cause category
- Critical path shows which delays actually matter (toy example after this list)
- p50/p90/p95 across N iterations, not one guess
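And a toy illustration of the critical-path idea: only delays that gate the notification chain count toward TTN; parallel work that finishes earlier does not (event data is made up):

```python
# Toy critical-path attribution. Each event is
# (name, start_min, end_min, gates_notification).
events = [
    ("page ack (groggy)",   0.0,  4.2, True),
    ("chaos: phone silent", 4.2,  7.2, True),
    ("parallel triage",     4.2,  6.0, False),  # finishes early, off the path
    ("investigation",       7.2, 12.4, True),
]
critical = [(name, round(end - start, 1))
            for name, start, end, gating in events if gating]
print(critical)  # only these delays move the TTN needle
```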
Turns data into action
- Ranked interventions with expected TTN reduction
- One-click re-runs to test each fix
- Side-by-side comparisons prove impact
- Role-based dashboards tell each stakeholder their story
Stop guessing where your process breaks.
Start measuring it.
Configure your org, define a scenario, and get your first TTN analysis in minutes. No agents to install, no infrastructure to manage.