Walkthrough

How Sharely.ai Response works

Follow along as we simulate a payment gateway outage at “ShopFlow” — a mid-size ecommerce platform — and see how the simulator reveals exactly where their incident response breaks down.

1. Model your org: teams, tools, policies
2. Define an incident: narrative, actors, chaos
3. Get answers: TTN, delays, fixes

Each layer is composable. Change one variable, re-run, measure impact.

Step 1: Configure the Org Profile

ShopFlow has 4 teams, PagerDuty + Slack for comms, and a policy requiring VP approval before customer notifications. We model all of it as the baseline.

Organization Model

Teams:
  • Payments: PST, mature runbooks
  • Platform: PST, partial runbooks
  • Frontend: PST + IST
  • SRE: 24/7 coverage

Comm tools:
  • PagerDuty: 3 min ack
  • Slack: noise 0.3
  • Statuspage: VP approval required
  • Email: legal review required

Policies & services:
  • SEV declaration: on-call or IC
  • Customer notification: VP must approve
  • Service graph: 4 services
  • Signals: 3 sources

Baseline totals: 4 teams · 4 comm tools · 4 services · 3 signal sources · 4 chaos cards · 12 situation states
Why this matters: Most incident response tools model runbooks. This models everything else — the human at 2am, the Slack message lost in noise, the VP who takes 8 minutes to approve a statuspage post. That's where the real delays hide.
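To make the baseline concrete, here is a minimal sketch of how an org profile like ShopFlow's could be expressed as data. The shape and field names (tz, noise, gate, and so on) are illustrative assumptions, not Sharely.ai's actual schema:

```python
# Hypothetical sketch of ShopFlow's baseline as configuration data.
org_profile = {
    "teams": {
        "payments": {"tz": "PST",     "runbooks": "mature"},
        "platform": {"tz": "PST",     "runbooks": "partial"},
        "frontend": {"tz": "PST+IST", "runbooks": None},
        "sre":      {"tz": "24/7",    "runbooks": None},
    },
    "comm_tools": {
        "pagerduty":  {"ack_delay_min": 3},
        "slack":      {"noise": 0.3},   # chance a message is lost in noise
        "statuspage": {"gate": "vp_approval"},
        "email":      {"gate": "legal_review"},
    },
    "policies": {
        "sev_declaration":       ["on_call", "incident_commander"],
        "customer_notification": "vp_must_approve",
    },
}
```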
Step 2: Define the Scenario

A partial payment gateway outage. PayCore's EU endpoint returns 503s, causing 60% of EU checkouts to fail. The US endpoint is fine, making initial diagnosis confusing.

Failure Propagation

PayCore EU (503 errors, t=0) → payments-api (60% EU failures, ~0 min) → checkout (orders failing, +2 min) → Customers (EU merchants impacted). fraud-service degrades via a soft dependency.

Why diagnosis starts out confusing:
  • Blended error rate is only 35%, below the auto-SEV1 threshold of 50%
  • The US endpoint is healthy, masking the EU-specific failure
  • On-call is in PST, so it's 2am for them
  • PayCore says 'all operational': its status page lags reality

Who responds — and when

  • Detection: threshold alert, anomaly detector, support tickets
  • Triage: Payments On-Call, Platform On-Call
  • Escalation: Incident Commander, VP Engineering
  • Notification: Comms Lead, Account Manager

Each actor is played by Claude Haiku with full situational context.

Chaos cards enabled:

  • 30%: PayCore status page inaccurate
  • 60%: Blended metrics mask severity
  • 15%: On-call phone on silent
  • 40%: EU metrics not on dashboard
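The scenario and its chaos cards might be captured in a similar hypothetical structure. The mechanic worth showing is the stochastic part: each card fires independently, with its own probability, on every iteration:

```python
import random

# Hypothetical scenario definition mirroring the outage above; the shape
# is illustrative, not the product's actual format.
scenario = {
    "trigger":     {"service": "paycore_eu", "failure": "503s", "at_min": 0},
    "propagation": [
        {"service": "payments-api", "effect": "60% EU failures", "delay_min": 0},
        {"service": "checkout",     "effect": "orders failing",  "delay_min": 2},
    ],
    "soft_deps":   ["fraud-service"],
    "chaos_cards": [  # (probability, disruption)
        (0.30, "paycore_status_page_inaccurate"),
        (0.60, "blended_metrics_mask_severity"),
        (0.15, "oncall_phone_on_silent"),
        (0.40, "eu_metrics_not_on_dashboard"),
    ],
}

def draw_chaos(cards, rng=random):
    """Return the disruptions active in this iteration."""
    return [name for p, name in cards if rng.random() < p]
```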
Step 3: Run the Simulation

50 iterations. Each one uses Claude Haiku to role-play every actor with their full situational context. Every iteration produces a different outcome — because real incidents are stochastic.

Live Swim Lane — Iteration 23 of 50 (streaming)

[Swim-lane chart; time axis 0-45 min. Event states: paged, working, investigating, critical, done.]
  • Signals: error_rate fires → latency anomaly → support tickets
  • Payments OC: paged (asleep) → wakes up → investigating → escalates
  • Platform OC: paged → sees symptoms
  • IC: joins → declares SEV1 → starts bridge
  • VP Eng: Slack msg → approves
  • Comms Lead: drafts statuspage → posted
  • Account Mgr: calling EU partners → done
Each colored block is an AI-generated event. Claude Haiku considers the actor's situation (asleep at 2am), the signals they can see, their team's runbook quality, and the org's policies to decide what happens and how long it takes. No two iterations are the same.
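For intuition, here is a rough sketch of what one iteration's event loop could look like. The llm_decide() callable stands in for the Claude Haiku role-play call and is entirely hypothetical, as is the queue-based scheduling:

```python
def run_iteration(org, scenario, llm_decide, max_min=120):
    """One simulated incident. llm_decide() is a hypothetical wrapper
    around the Claude Haiku call: given an actor and its situational
    context, it returns (action, minutes_taken, actors_to_involve)."""
    clock, events = 0.0, []
    chaos = draw_chaos(scenario["chaos_cards"])  # from the sketch above
    queue = ["payments_oncall"]                  # first responder paged
    while queue and clock < max_min:
        actor = queue.pop(0)
        action, minutes, next_actors = llm_decide(
            actor=actor,
            context={"clock": clock, "chaos": chaos,
                     "org": org, "scenario": scenario},
        )
        clock += minutes
        events.append((round(clock, 1), actor, action))
        queue.extend(next_actors)                # e.g. escalate to the IC
    return events

# 50 stochastic iterations:
# runs = [run_iteration(org_profile, scenario, llm_decide) for _ in range(50)]
```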
Step 4: Analyze the Results

After 50 iterations, the engine aggregates the data. Here's what ShopFlow's numbers reveal:

TTN Radar (p90)

Minutes until each stakeholder group is notified:

  • Internal Engineering: 17 min
  • Incident Commander: 22 min
  • Leadership: 28 min
  • Customers: 35 min
  • Partners: 42 min

Key Metrics

Aggregated across 50 iterations

  • TTN p50 (worst target): 28.3 min
  • TTN p90 (worst target): 42.1 min
  • TTN p95 (worst target): 48.7 min
  • Time to IC on bridge: 22.4 min
  • Time to bridge assembled: 25.1 min
  • Severity flip-flops: 1.4 avg
  • Misroutes: 0.8 avg
  • SLA breaches (>30 min): 68%
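Aggregating those percentiles is standard order statistics. A sketch, assuming each run records notification times per stakeholder group (the demo numbers are made up):

```python
import statistics

def ttn_percentiles(runs, target):
    """Distribution of time-to-notify (minutes) for one stakeholder group."""
    times = sorted(run[target] for run in runs)
    q = statistics.quantiles(times, n=100)  # q[k-1] is the k-th percentile
    return {"p50": q[49], "p90": q[89], "p95": q[94]}

runs = [{"customers": 20 + 0.6 * i} for i in range(50)]  # placeholder demo data
print(ttn_percentiles(runs, "customers"))
```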

Where are the minutes going?

Every delay on the critical path is tagged with a cause:

  • Policy Approval (8.2 min · 24%): VP must approve customer comms
  • Human Decision (6.1 min · 18%): on-call at 2am, groggy wake-up
  • External Dependency (5.4 min · 16%): waiting for PayCore status page
  • Tool Latency (4.3 min · 13%): PagerDuty ack + Slack delivery
  • Coordination (3.8 min · 11%): multi-team bridge overhead
  • Ownership Ambiguity (2.9 min · 9%): PayCore gateway misrouted
  • Chaos Events (3.1 min · 9%): EU metrics masked severity
68% of iterations breach the 30-minute SLA. The #1 delay cause isn't technology — it's a policy: VP approval for customer comms adds 8.2 minutes on average. You can't find this in a runbook audit.
Step 5: Get Ranked Interventions

The engine maps delay causes to concrete fixes, estimates impact from the simulation data, and ranks by expected TTN reduction.

Projected TTN p90 Reduction (cumulative)

  • Current baseline: 42.1m
  • #1 Pre-approve statuspage: 35.3m (-6.8)
  • #2 EU-specific alerting: 31.1m (-4.2)
  • #3 Follow-the-sun on-call: 27.6m (-3.5)
  • #4 Auto-create bridge: 24.8m (-2.8)
  • #5 Clarify ownership: 22.7m (-2.1)
  • Projected after all 5: 22.7m
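The cumulative projection is plain subtraction from the baseline; a few lines reproduce the waterfall:

```python
p90 = 42.1  # current baseline, minutes
waterfall = [
    ("#1 Pre-approve statuspage", 6.8),
    ("#2 EU-specific alerting",   4.2),
    ("#3 Follow-the-sun on-call", 3.5),
    ("#4 Auto-create bridge",     2.8),
    ("#5 Clarify ownership",      2.1),
]
for name, reduction in waterfall:
    p90 -= reduction
    print(f"{name:<28} {p90:4.1f}m (-{reduction})")
# Ends at 22.7m, the projected p90 after all five fixes.
```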
#1 Pre-approve statuspage template for payment incidents
Policy · -6.8 min p90 · High
76% of iterations: VP approval added 5-12 min. Pre-approved templates eliminate this delay.

#2 Add EU-specific error rate dashboard + dedicated alert
Monitoring · -4.2 min p90 · High
40% of iterations: blended metrics masked EU severity. EU-only alerting fires 3-5 min earlier.

#3 Add secondary on-call to Payments (follow-the-sun)
Staffing · -3.5 min p90 · Medium
62% of iterations: the 2am wake-up added 4-8 min. An EU-timezone secondary cuts night latency.
One-click testing: Each intervention includes a run override config. Click “Test This” to re-run the same scenario with that change applied and compare results side-by-side.
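A run override could be as small as a single changed policy. A hypothetical example of intervention #1, reusing the sketch structures from earlier steps:

```python
import copy

# Hypothetical override: the statuspage template is pre-approved, so
# customer notification no longer waits on the VP.
variant = copy.deepcopy(org_profile)
variant["policies"]["customer_notification"] = "pre_approved_template"
variant["comm_tools"]["statuspage"]["gate"] = None

# Re-run the same scenario against the modified org:
# runs_b = [run_iteration(variant, scenario, llm_decide) for _ in range(50)]
```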
Step 6: Compare and Iterate

ShopFlow tests intervention #1 (pre-approved statuspage templates), re-runs the simulation, and compares the results side by side:

TTN by stakeholder, before → after:
  • Internal Engineering: 17 min → 16.8 min
  • Incident Commander: 22.4 min → 22.1 min
  • Internal Leadership: 28.1 min → 22.5 min (-5.6 min)
  • Customers (statuspage): 35 min → 27.8 min (-7.2 min)
  • Enterprise Partners: 42.1 min → 34.5 min (-7.6 min)

Headline metrics:
  • Customer TTN p90: 35.0 → 27.8 min (-7.2 min)
  • SLA breach rate: 68% → 31% (-37.0 points)
  • Policy delay: 8.2 → 1.1 min (-7.1 min)
A single policy change — pre-approving statuspage templates — cuts customer TTN p90 by 7 minutes and drops the SLA breach rate from 68% to 31%. ShopFlow takes this to the VP with concrete simulation data, not a gut feeling.
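The deltas themselves are trivial to compute. A sketch over the headline metrics, using the numbers above:

```python
before = {"customer_ttn_p90": 35.0, "sla_breach_pct": 68.0, "policy_delay_min": 8.2}
after  = {"customer_ttn_p90": 27.8, "sla_breach_pct": 31.0, "policy_delay_min": 1.1}

for metric, b in before.items():
    a = after[metric]
    print(f"{metric:<18} {b:>5} -> {a:<5} ({a - b:+.1f})")
```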

Why this works

Incident response is a complex system. Runbook audits and tabletop exercises miss the emergent behavior. Simulation captures it.

Models the real world

  • Actors have context: asleep, fatigued, in a meeting
  • Tools have mechanics: latency, noise, outages
  • Policies have cost: approval chains add minutes
  • Chaos happens: injected disruptions fire stochastically

Quantifies the invisible

  • TTN is a vector, not a single number
  • Every delay is tagged with a cause category
  • Critical path shows which delays actually matter
  • p50/p90/p95/p99 across N iterations, not one guess

Turns data into action

  • Ranked interventions with expected TTN reduction
  • One-click re-runs to test each fix
  • Side-by-side comparisons prove impact
  • Evidence-based proposals, not opinion

Ready to find your bottlenecks?

Configure your org, define a scenario, and get your first TTN analysis in minutes.