Full Walkthrough

What if you could rehearse your worst incident 50 times?

Follow along as we simulate a payment gateway outage at “ShopFlow” — a mid-size ecommerce platform — and watch in real time as the simulator reveals exactly where their incident response breaks down.

Every team has a process. Few teams know where it actually fails under pressure.

1. Model your org: teams, tools, policies
2. Define an incident: narrative, actors, chaos
3. Get answers: TTN, delays, fixes

Each layer is composable. Change one variable, re-run, measure impact.

Step 1: Model Your Organization

The simulator needs to understand how your team actually works — who's on call, what tools they use, which policies slow things down, and how your services depend on each other. Two ways to start:

Try the sample org (fastest)

One click sets up “ShopFlow” — a realistic mid-size ecommerce platform with 4 teams, 6 services, PagerDuty + Slack, and a ready-to-run scenario. Get to your first simulation in under a minute.

4 teams · 6 services · 8 roles · pre-built scenario

Model your own org

Configure your actual teams, on-call rotations, comm tools, policies, service graph, and detection signals. The simulator uses your real structure so the results map directly to your process.

Teams & rotations · Comm tools · Policies · Service graph

Here's what the ShopFlow sample org looks like inside the simulator:

ShopFlow — Sample Organization Model

Teams
• Payments: PST, mature
• Platform: PST, partial
• Frontend: PST+IST
• SRE: 24/7

Comm Tools
• PagerDuty: ack 3 min
• Slack: noise 0.3
• Statuspage: VP approval
• Email: legal review

Policies & Services
• SEV Declaration: on-call or IC
• Customer Notif: VP must approve
• Service Graph: 4 services
• Signals: 3 sources

At a glance: 4 teams · 4 comm tools · 6 services · 3 signal sources · 10 chaos cards · 12 situation states
Why this matters: Most incident response tools model runbooks. This models everything else — the human at 2am, the Slack message lost in noise, the VP who takes 8 minutes to approve a statuspage post. That's where the real delays hide.
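
Curious what that model looks like as data? Here is a minimal sketch of how an org like ShopFlow might be encoded, assuming a simple dataclass schema. Every class, field, and default below is hypothetical, illustrating the shape of the model rather than the simulator's actual format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Team:
    name: str
    timezone: str                    # where the on-call rotation lives
    coverage: Optional[str] = None   # e.g. "mature", "partial", "24/7"

@dataclass
class CommTool:
    name: str
    ack_delay_min: float = 0.0           # typical time for a page to be seen
    noise: float = 0.0                   # 0..1 chance a message gets lost
    approval_gate: Optional[str] = None  # e.g. "VP", "legal"

# A hypothetical encoding of the ShopFlow sample org shown above.
shopflow = {
    "teams": [
        Team("Payments", "PST", "mature"),
        Team("Platform", "PST", "partial"),
        Team("Frontend", "PST+IST"),
        Team("SRE", "global", "24/7"),
    ],
    "tools": [
        CommTool("PagerDuty", ack_delay_min=3.0),
        CommTool("Slack", noise=0.3),
        CommTool("Statuspage", approval_gate="VP"),
        CommTool("Email", approval_gate="legal"),
    ],
    "policies": {
        "sev_declaration": "on-call or IC",
        "customer_notification": "VP must approve",
    },
}
```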

Step 2: Define the Scenario

A partial payment gateway outage. PayCore's EU endpoint returns 503s, causing 60% of EU checkouts to fail. The US endpoint is fine, making initial diagnosis confusing.

Generate scenarios automatically (AI-powered)

Don't know where to start? Click “Generate with AI” and the system analyzes your org's topology — your teams, services, dependencies, and on-call schedules — to create a realistic incident scenario tailored to your architecture.

AI-generated scenarios include actors, chaos cards, and failure propagation paths that match your actual service graph.

Generate Scenario with AI
1. Load organization config: 4 teams, 6 services
2. Analyze service topology: 3 critical paths found
3. Select scenario parameters: SEV1, payments path
4. Generate with AI: Claude is writing...
5. Validate scenario: done
Failure Propagation
• PayCore EU: 503 errors (t+0 min)
• payments-api: 60% EU failures (~t+0 min)
• checkout: orders failing (t+2 min)
• Customers: EU merchants impacted
• fraud-service: degraded (soft dependency)
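
Structurally, the propagation path is a small weighted graph: each edge carries a delay and a dependency type, and walking the graph yields when each downstream service is first affected. A sketch under that assumption (node names match the diagram above; the encoding and the walk are illustrative, not the simulator's internals):

```python
from collections import defaultdict

# (upstream, downstream, delay in minutes, dependency type), per the path above
propagation = [
    ("PayCore EU", "payments-api", 0.0, "hard"),
    ("payments-api", "checkout", 2.0, "hard"),
    ("checkout", "customers", 0.0, "hard"),
    ("payments-api", "fraud-service", 0.0, "soft"),  # degrades rather than fails
]

def impact_times(edges, root):
    """Earliest time each service is affected, relative to the root failure."""
    children = defaultdict(list)
    for up, down, delay, _kind in edges:
        children[up].append((down, delay))
    times, frontier = {root: 0.0}, [root]
    while frontier:
        node = frontier.pop()
        for down, delay in children[node]:
            t = times[node] + delay
            if down not in times or t < times[down]:
                times[down] = t
                frontier.append(down)
    return times

print(impact_times(propagation, "PayCore EU"))
# {'PayCore EU': 0.0, 'payments-api': 0.0, 'checkout': 2.0,
#  'fraud-service': 0.0, 'customers': 2.0}
```
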
Confounding factors:
• Blended error rate is only 35%, below the auto-SEV1 threshold of 50%
• US endpoint is healthy, masking the EU-specific failure
• On-call is in PST, where it's 2am
• PayCore says 'all operational': its status page lags reality

Who responds — and when

• Detection: threshold alert, anomaly detector, support tickets
• Triage: Payments On-Call, Platform On-Call
• Escalation: Incident Commander, VP Engineering
• Notification: Comms Lead, Account Manager
Each actor is played by Claude with full situational context.

Chaos cards enabled:
• PayCore status page inaccurate (30%)
• Blended metrics mask severity (60%)
• On-call phone on silent (15%)
• EU metrics not on dashboard (40%)
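
Mechanically, a chaos card is a weighted coin flipped once per iteration. A sketch using the probabilities above (the delay each card adds is invented for illustration, except the 3-minute phone-on-silent delay, which appears in the event log below):

```python
import random

# (name, probability, extra delay in minutes if the card fires)
CHAOS_CARDS = [
    ("PayCore status page inaccurate", 0.30, 5.0),
    ("Blended metrics mask severity", 0.60, 4.0),
    ("On-call phone on silent", 0.15, 3.0),
    ("EU metrics not on dashboard", 0.40, 3.5),
]

def roll_chaos(rng):
    """Decide which chaos cards fire for one iteration of the simulation."""
    return [(name, delay) for name, prob, delay in CHAOS_CARDS
            if rng.random() < prob]

rng = random.Random(23)  # one seed per iteration keeps runs reproducible
for name, delay in roll_chaos(rng):
    print(f"chaos fired: {name} (+{delay} min)")
```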

Step 3: Watch It Unfold in Real Time

Hit “Run” and watch your incident play out live. Every actor makes decisions based on their context — what they can see, how alert they are, what tools they have, what policies bind them.

50 iterations. Each one different. Because real incidents are stochastic.

ShopFlow — Payment Gateway Outage (streaming)
Iteration 23 of 50 (46%) · Phase: Escalation · Sim time: 18.4 min · Events: 47
Event Timeline (horizontal swim lanes, 0-45 min; click any event for details)
• Signals: error rate, latency, tickets
• Payments OC: paged → wakes up → investigating → escalates
• Platform OC: paged → sees symptoms
• IC: joins → SEV1 → bridge
• VP Eng: Slack msg → approves
• Comms Lead: drafts statuspage → posted
• Account Mgr: calling partners → done

Event Log (47 events)
14:23:01 @ 2.0m · Signal: error_rate_5xx fires
EU endpoint 503 rate exceeds 30% threshold for 2 consecutive intervals

14:23:02 @ 2.1m · Payments OC: paged
PagerDuty high-urgency page sent. Actor is asleep (2:02am PST)

14:23:08 · AI · LLM call: Payments OC
Generating wake-up decision...

14:24:01 @ 6.3m · Payments OC: wakes up
Groggy; took 4.2 min to acknowledge the page and open a laptop

14:24:15 @ 9.0m · CHAOS · Chaos: phone on silent
Injector fired (15% probability). Platform OC page delayed by 3 min

14:25:30 @ 10.5m · Payments OC: investigating
Checking Datadog; sees elevated 5xx, but the blended rate looks OK...

14:26:45 @ 18.4m · IC: declares SEV1
Commander recognizes the EU-specific failure pattern after reviewing regional metrics

14:27:00 · Phase: Escalation complete
Duration: 5.2 min

Every event is AI-generated. Claude considers the actor's situation — asleep at 2am, groggy, looking at blended metrics — and produces realistic narratives with calibrated timing. The event log captures every decision and its context in real time.
Chaos happens mid-incident. Injectors fire stochastically — a phone on silent, a misleading metric, a key person on PTO. Each iteration faces different chaos, producing a distribution of outcomes instead of a single guess.
Watch it like a movie of your process. The swim lanes show the critical path in real time — who's blocked, who's waiting, where the bottleneck is right now. The event log tells you why. After 50 iterations, you have a statistical model of your incident response, not a single anecdote.
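
To see why 50 iterations yield a distribution rather than a single answer, here is a deliberately tiny stand-in for the run loop: a few fixed phases with noisy durations, plus chaos rolls. Every constant and distribution below is invented; the real engine replaces the random draws with per-actor LLM decisions:

```python
import random

def simulate_once(seed):
    """One toy iteration: minutes until customers are notified (TTN)."""
    rng = random.Random(seed)
    t = 2.0                            # signal fires ~2 min after impact
    t += rng.lognormvariate(1.2, 0.5)  # page, groggy wake-up, ack
    if rng.random() < 0.60:            # chaos: blended metrics mask severity
        t += rng.uniform(3.0, 6.0)     # extra time staring at dashboards
    if rng.random() < 0.15:            # chaos: phone on silent
        t += 3.0
    t += rng.uniform(4.0, 9.0)         # investigation, SEV1, bridge assembly
    t += rng.uniform(5.0, 12.0)        # VP approval for customer comms
    return t

customer_ttn = [simulate_once(seed) for seed in range(50)]
```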

Step 4: Analyze the Results

After 50 iterations, the engine aggregates the data. Here's what ShopFlow's numbers reveal — and more importantly, why:

TTN Radar (p90): minutes until each stakeholder group is notified
• Internal Engineering: 17 min
• Incident Commander: 22 min
• Leadership: 28 min
• Customers: 35 min
• Partners: 42 min

Key Metrics (aggregated across 50 iterations)
• TTN p50 (worst target): 28.3 min
• TTN p90 (worst target): 42.1 min
• TTN p95 (worst target): 48.7 min
• Time to IC on bridge: 22.4 min
• Time to bridge assembled: 25.1 min
• Severity flip-flops: 1.4 avg
• Misroutes: 0.8 avg
• SLA breaches (>30 min): 68%
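
Given per-iteration TTN samples, for instance the customer_ttn list from the toy loop above, the headline numbers are one-line aggregations:

```python
import statistics

def percentile(samples, q):
    """q-th percentile via statistics.quantiles with 100 cut points."""
    return statistics.quantiles(samples, n=100)[q - 1]

p50, p90, p95 = (percentile(customer_ttn, q) for q in (50, 90, 95))
breach_rate = sum(t > 30.0 for t in customer_ttn) / len(customer_ttn)
print(f"p50={p50:.1f}m  p90={p90:.1f}m  p95={p95:.1f}m  breaches={breach_rate:.0%}")
```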

Where are the minutes going?

Every delay on the critical path is tagged with a cause

• Policy Approval: 8.2 min (24%). VP must approve customer comms
• Human Decision: 6.1 min (18%). On-call at 2am, groggy wake-up
• External Dependency: 5.4 min (16%). Waiting for PayCore status page
• Tool Latency: 4.3 min (13%). PagerDuty ack + Slack delivery
• Coordination: 3.8 min (11%). Multi-team bridge overhead
• Ownership Ambiguity: 2.9 min (9%). PayCore gateway misrouted
• Chaos Events: 3.1 min (9%). EU metrics masked severity
68% of iterations breach the 30-minute SLA. The #1 delay cause isn't technology — it's a policy: VP approval for customer comms adds 8.2 minutes on average. You can't find this in a runbook audit.
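
The breakdown above falls out of a simple aggregation: every delay on the critical path carries a cause tag, and the report sums minutes per tag. A sketch (the event shape is hypothetical; the sample values echo the chart):

```python
from collections import Counter

# (minutes, cause) pairs harvested from one iteration's critical path
critical_path_delays = [
    (8.2, "policy approval"),      # VP sign-off on customer comms
    (6.1, "human decision"),       # groggy 2am wake-up
    (5.4, "external dependency"),  # waiting on PayCore's status page
    (4.3, "tool latency"),         # PagerDuty ack + Slack delivery
]

totals = Counter()
for minutes, cause in critical_path_delays:
    totals[cause] += minutes

grand_total = sum(totals.values())
for cause, minutes in totals.most_common():
    print(f"{cause}: {minutes:.1f} min ({minutes / grand_total:.0%})")
```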

Step 5: Get Ranked Interventions

The engine maps delay causes to concrete fixes, estimates impact from the simulation data, and ranks by expected TTN reduction. No opinions — just numbers.

Projected TTN p90 Reduction (cumulative)
• Current baseline: 42.1 min
• #1 Pre-approve statuspage: 35.3 min (-6.8)
• #2 EU-specific alerting: 31.1 min (-4.2)
• #3 Follow-the-sun on-call: 27.6 min (-3.5)
• #4 Auto-create bridge: 24.8 min (-2.8)
• #5 Clarify ownership: 22.7 min (-2.1)
• Projected after all 5: 22.7 min
#1 Pre-approve statuspage template for payment incidents
Policy · -6.8 min p90 · High
76% of iterations: VP approval added 5-12 min. Pre-approved templates eliminate this.

#2 Add EU-specific error rate dashboard + dedicated alert
Monitoring · -4.2 min p90 · High
40% of iterations: blended metrics masked EU severity. EU-only alerting fires 3-5 min earlier.

#3 Add secondary on-call to Payments (follow-the-sun)
Staffing · -3.5 min p90 · Medium
62% of iterations: the 2am wake-up added 4-8 min. An EU-timezone secondary cuts night latency.

One-click testing: Each intervention includes a run override config. Click “Test This” to re-run the same scenario with that change applied and compare results side-by-side.
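
Conceptually, a run override is just a patch applied to the org model before replaying the same seeds. A hypothetical example for intervention #1, reusing the shopflow dict from step 1's sketch (the keys and merge logic are illustrative):

```python
override = {
    "policies": {
        # Intervention #1: a pre-approved template removes the VP approval wait
        "customer_notification": "pre-approved template",
    },
}

def apply_override(org, patch):
    """Return a copy of the org config with the patch merged one level deep."""
    merged = dict(org)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(org.get(key), dict):
            merged[key] = {**org[key], **value}
        else:
            merged[key] = value
    return merged

patched = apply_override(shopflow, override)
# Re-run the same 50 seeds against `patched` and compare the two distributions.
```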

Step 6: Compare and Iterate

ShopFlow tests intervention #1 (pre-approved statuspage templates), re-runs the simulation, and compares the results side by side:

Before → After (TTN p90 by stakeholder)
• Internal Engineering: 17 min → 16.8 min
• Incident Commander: 22.4 min → 22.1 min
• Internal Leadership: 28.1 min → 22.5 min (-5.6 min)
• Customers (statuspage): 35 min → 27.8 min (-7.2 min)
• Enterprise Partners: 42.1 min → 34.5 min (-7.6 min)

Headline deltas:
• Customer TTN p90: 35.0 → 27.8 min (-7.2 min)
• SLA breach rate: 68% → 31% (-37.0 points)
• Policy delay: 8.2 → 1.1 min (-7.1 min)
A single policy change — pre-approving statuspage templates — cuts customer TTN p90 by 7 minutes and drops the SLA breach rate from 68% to 31%. ShopFlow takes this to the VP with concrete simulation data, not a gut feeling.
Role-Based Dashboards

Built for every role in the room

Different people need different views during and after an incident. Six tailored dashboards surface the right information at the right time.

On-Call Engineer
Metrics: page-to-ack time · investigation duration · escalation rate
Insight: your on-call acks in 4.2 min on average, but night pages take 3x longer. Consider follow-the-sun.

Incident Commander
Metrics: time to bridge · SEV accuracy · coordination overhead
Insight: bridge assembly takes 25 min on average. Auto-creating a Slack bridge at SEV1 saves 4.8 min.

Comms Lead
Metrics: time to first update · approval bottlenecks · draft iterations
Insight: 68% of comms delay is the VP approval wait. Pre-approved templates cut this to under 2 min.

Account Manager
Metrics: time to customer notify · partner call completion · coverage gaps
Insight: EU partners learn about outages 42 minutes in. That’s after they’ve already noticed.

VP / Leadership
Metrics: SLA breach rate · TTN distribution · investment priorities
Insight: one policy change (pre-approved templates) drops SLA breaches from 68% to 31%.

Simulation Owner
Metrics: iteration convergence · plausibility scores · LLM usage & cost
Insight: 50 iterations show a stable p90 by iteration 35. Plausibility scores average 87%.
Show your team what they need to see. The on-call engineer sees page response patterns. The VP sees breach rates and ROI. The comms lead sees where notification bottlenecks live. Same simulation data, six different stories.

Why this works

Runbook audits check if a document exists. Tabletop exercises capture one opinion. Simulation reveals what actually happens when the Slack message lands in a noisy channel at 2am.

Models the real world

  • Actors have context: asleep, fatigued, in a meeting
  • Tools have mechanics: latency, noise, outages
  • Policies have cost: approval chains add minutes
  • Chaos happens: injected disruptions fire stochastically

Quantifies the invisible

  • TTN is a vector, not a single number
  • Every delay is tagged with a cause category
  • Critical path shows which delays actually matter
  • p50/p90/p95 across N iterations, not one guess

Turns data into action

  • Ranked interventions with expected TTN reduction
  • One-click re-runs to test each fix
  • Side-by-side comparisons prove impact
  • Role-based dashboards tell each stakeholder their story

Stop guessing where your process breaks. Start measuring it.

Configure your org, define a scenario, and get your first TTN analysis in minutes. No agents to install, no infrastructure to manage.