What is Sharely.ai Response?

Sharely.ai Response is an incident response simulator that models the full complexity of how your organization handles an incident. AI actors (Claude Haiku) role-play human responders, communication tool mechanics are simulated, your org's policies are enforced, and Time to Notify (TTN) is measured across every stakeholder group.

Org Profile (baseline config) + Scenario (incident + actors) + Run Overrides (experiment knobs) = Results (TTN + interventions)

• Composable: swap any layer
• Explainable: every delay tagged
• Actionable: ranked fixes
TTN is not a single number. It's a vector: internal engineering might learn in 8 minutes, but enterprise partners might not hear until minute 45. This tool measures the full vector.

Core Concepts

TTN (Time to Notify)

Elapsed time from incident start to when each stakeholder group is notified. Measured as a vector across targets.

Org Profile

Your org's baseline: teams, roles, comm tools, policies, service dependencies, signal sources, resource constraints.

Scenario

A specific incident: narrative, severity, affected services, actors, active signals, and chaos cards.

Run

N iterations of a scenario against an org profile. Each iteration produces a different outcome due to stochastic elements.

Actor

A participant in the response. Has a role, team, comm tool, situation state, and phase. Claude Haiku role-plays each one.

Signal Source

A detection mechanism (threshold alert, anomaly detection, SLO burn rate, etc.) that fires at incident start.

Event Injector

A "chaos card" that fires during simulation with a configured probability, adding delays or forcing escalations.

Iteration

A single pass through detection, triage, escalation, and notification phases. Produces a TTN vector and delay attribution.

Quick Start Guide

Org Profile

Your organization's baseline configuration. Everything here applies to all scenarios unless overridden. Think of it as “how our org works on any given day.”

Access from the sidebar under Org Profile. The overview page shows all sections and current resource constraints.

Teams & Roles

Teams

Each team has a name, timezone(s), on-call rotation style, and a maturity profile that affects simulation behavior:

• Runbook coverage: none / partial / comprehensive
• Runbook quality: outdated / decent / excellent
• Training level: untrained / basic / drilled

Roles

Job functions with phase assignments and concurrency limits:

• On-Call Engineer: triage
• Incident Commander: escalation
• Comms Lead: notification
• Engineering Leader: escalation
• Account Manager: notification

Communication Tools

Comm tools model real delivery mechanics, not just “notification sent.”

Delivery Pipeline (per message)

Send (message dispatched) → Delivery Latency (normal/lognormal distribution) → Noise Filter (missed? 0-1 probability) → Ack Wait (timeout → escalate) → Received (actor sees it)
PagerDuty
  • Ack timeout + escalation policy
  • Multi-step escalation chains
Slack
  • Channel noise probability
  • Rate limiting at high msg volumes
Statuspage
  • Draft time + approval chain
  • Template availability
Email
  • Legal review for SEV1
  • Outage correlation tags
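
A minimal sketch of one message moving through this pipeline, assuming a lognormal latency model and a single escalation step. The parameter names, defaults, and return shape are illustrative, not the product's real schema:

```python
import math
import random

def simulate_message(latency_median_s: float = 30.0,
                     latency_sigma: float = 0.8,
                     miss_prob: float = 0.10,
                     ack_timeout_s: float = 300.0) -> dict:
    """One pass through send -> latency -> noise filter -> ack wait."""
    # Delivery latency sampled from a lognormal distribution.
    latency = random.lognormvariate(math.log(latency_median_s), latency_sigma)

    # Noise filter: with some probability the actor never notices the message.
    if random.random() < miss_prob:
        # No ack within the timeout, so the tool escalates (e.g., next on-call).
        return {"received": False, "escalated_after_s": ack_timeout_s}

    return {"received": True, "seen_after_s": latency}
```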

Policies

Org policies govern who can do what during an incident. Each policy is a configurable knob that the simulator enforces:

Severity Declaration

Who can declare, strictness level, auto-declare conditions

Notification Thresholds

Which SEV levels trigger leadership, customer, partner notifications

Approvals

Who approves customer comms, statuspage, partner notifications, legal

Paging Rules

Can the IC page any team directly? Is manager approval required? Any other restrictions?

Policy delays are one of the most common TTN contributors. The simulator tracks “policy_approval” as a delay cause so you can see exactly how many minutes your approval chains add.
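
To make the knobs concrete, a policy block might look like the sketch below. Every key and value here is a hypothetical illustration of the four policy areas above, not the actual configuration schema:

```python
# Hypothetical policy configuration; all keys and values are illustrative.
POLICIES = {
    "severity_declaration": {
        "who_can_declare": ["on_call_engineer", "incident_commander"],
        "strictness": "normal",                  # e.g. lenient | normal | strict
        "auto_declare": {"error_rate_pct": 5.0},
    },
    "notification_thresholds": {
        "leadership": "SEV2",                    # SEV2 and above notify leadership
        "customers": "SEV1",
        "partners": "SEV1",
    },
    "approvals": {                               # each approver adds policy_approval delay
        "statuspage": ["comms_lead", "vp_engineering"],
        "customer_comms": ["comms_lead"],
    },
    "paging_rules": {
        "ic_can_page_any_team": True,
        "requires_manager_approval": False,
    },
}
```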

Services & Dependencies

The service dependency graph models how failures propagate through your architecture.

Example Dependency Graph

checkout → payments-api (hard, 0 min)
checkout → fraud-service (soft, 1 min)
payments-api → gateway (hard, 2 min)

• Ownership ambiguity: 0 (clear) → 1 (unknown)
• Misrouting cost: sampled from a distribution
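
Failure propagation over this graph can be sketched as a breadth-first walk from the failed service to everything that depends on it. The graph encoding and function below are illustrative assumptions:

```python
from collections import deque

# Edges mirror the example graph above: service -> [(dependency, coupling, delay_min)].
DEPS = {
    "checkout": [("payments-api", "hard", 0), ("fraud-service", "soft", 1)],
    "payments-api": [("gateway", "hard", 2)],
}

def impacted_services(failed: str, deps: dict = DEPS) -> dict:
    """Return every service affected by the failure, keyed to the minute the
    impact arrives. (A fuller model would treat soft edges as degradation only.)"""
    # Invert the graph: dependency -> the services that depend on it.
    dependents: dict = {}
    for svc, edges in deps.items():
        for dep, _coupling, delay in edges:
            dependents.setdefault(dep, []).append((svc, delay))

    impact = {failed: 0}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for svc, delay in dependents.get(current, []):
            arrival = impact[current] + delay
            if svc not in impact or arrival < impact[svc]:
                impact[svc] = arrival
                queue.append(svc)
    return impact

# impacted_services("gateway") -> {"gateway": 0, "payments-api": 2, "checkout": 2}
```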

Signal Sources

Signal sources define how incidents are detected. The simulator races all active signals — the first confident signal determines detection time.

Signal Race (first confident signal wins)

• threshold: 1-3 min, 90% confidence (winner in this example)
• anomaly: 3-7 min, 60% confidence
• burn_rate: 5-10 min, 75% confidence
• synthetic: 2-5 min, 85% confidence
• support_ticket: 15-45 min, 30% confidence
• threshold: Error rate > X%
• anomaly: Statistical detection
• burn_rate: SLO error budget
• synthetic: Canary monitoring
• correlation: Multi-service correlation
• support_ticket: Customer reports
• social: Social / partner reports
• dashboard: Human observation
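
One way to read the race, under the assumption that each signal's confidence acts as the probability it fires at all and its latency is uniform over the listed range:

```python
import random

# Illustrative signal table: (min_latency_min, max_latency_min, confidence).
SIGNALS = {
    "threshold": (1, 3, 0.90),
    "anomaly": (3, 7, 0.60),
    "burn_rate": (5, 10, 0.75),
    "synthetic": (2, 5, 0.85),
    "support_ticket": (15, 45, 0.30),
}

def race_signals(signals: dict = SIGNALS) -> tuple:
    """Sample a firing time for every signal that fires confidently;
    the earliest one sets the detection time for this iteration."""
    candidates = []
    for name, (lo, hi, confidence) in signals.items():
        if random.random() < confidence:       # did this signal fire confidently?
            candidates.append((random.uniform(lo, hi), name))
    if not candidates:
        return (float("inf"), None)            # nothing fired: still undetected
    return min(candidates)                     # (detection_minute, winning_signal)
```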

Situations

Situations model the real-world context of each actor as a state machine, not a static sample.

State Machine Example

asleep → (2-8 min) → groggy → (5 min) → alert
cloud_outage triggers vpn_down

• Availability: asleep / in_meeting / traveling / on_break
• Cognitive: focused / context_switching / fatigued / groggy
• Tooling: vpn_down / laptop_not_available / phone_dead
• Coordination: on_another_incident / handoff_in_progress

States are weighted by time of day (asleep is more likely at 2am) and can be triggered by events (a cloud_outage triggers vpn_down).
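
A toy version of the availability chain above, with a made-up time-of-day weighting:

```python
import random

# Illustrative transition table: state -> (next_state, (min_delay, max_delay) minutes).
TRANSITIONS = {
    "asleep": ("groggy", (2, 8)),
    "groggy": ("alert", (5, 5)),
}

def time_to_alert(state: str = "asleep") -> float:
    """Walk the chain until the actor reaches 'alert', summing transition delays."""
    minutes = 0.0
    while state != "alert":
        state, (lo, hi) = TRANSITIONS[state]
        minutes += random.uniform(lo, hi)
    return minutes

def initial_state(hour_of_day: int) -> str:
    """Time-of-day weighting: asleep is far likelier at 2am than at 2pm.
    The weights here are invented for illustration."""
    weights = {"asleep": 0.85, "alert": 0.15} if hour_of_day < 6 else {"asleep": 0.02, "alert": 0.98}
    return random.choices(list(weights), weights=list(weights.values()))[0]
```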

Event Injectors

“Chaos cards” — controlled disruptions that fire during simulation based on probability and trigger conditions.

• Monitoring dashboard outage (5%): +8 min to investigation
• Slack degraded (12%, tag: cloud_outage): +5 min to all Slack comms
• Customer exec escalation (6%): IC distracted
• Shift handoff mid-incident (7%): +10 min context transfer

Each injector has probability (0-1), trigger conditions (phase, severity, tags), and an effect (delay, escalation, severity change, tool outage).
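
Evaluating injectors each iteration could look like the following sketch; the field names and trigger model are assumptions:

```python
import random

# Hypothetical injector records; field names are illustrative.
INJECTORS = [
    {"name": "monitoring_dashboard_outage", "probability": 0.05,
     "trigger": {"phase": "triage"}, "effect": {"delay_min": 8}},
    {"name": "shift_handoff", "probability": 0.07,
     "trigger": {"phase": "any"}, "effect": {"delay_min": 10}},
]

def evaluate_injectors(phase: str, injectors: list = INJECTORS) -> list:
    """Each iteration, roll every injector whose trigger matches the current phase."""
    fired = []
    for inj in injectors:
        trigger_phase = inj["trigger"]["phase"]
        if trigger_phase in (phase, "any") and random.random() < inj["probability"]:
            fired.append(inj)
    return fired
```
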

Calibration

Two systems ensure realistic outputs:

Plausibility Bounds
  • "asleep" must be ≥ 2 min response
  • "at_desk" must be ≤ 15 min response
  • LLM responses clamped to bounds
Monotonicity Rules
  • VP cannot be notified before IC paged
  • Statuspage cannot post before SEV declared
  • Violations reduce plausibility score

Creating Scenarios

A scenario defines a specific incident to simulate:

• Name & description: Brief summary
• Narrative: Detailed context for the Claude Haiku actors
• Severity: SEV1 – SEV4
• Tags: Injector correlation + categorization
• Actors: Participants with role, team, and phase
• Signals: Active signal sources for this incident
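
Put together, a scenario definition could look like this sketch. The field names follow the list above, but the exact schema is an assumption:

```python
# Hypothetical scenario definition; values are invented for illustration.
scenario = {
    "name": "Payments gateway brownout",
    "description": "Elevated 5xx on checkout during peak traffic",
    "narrative": "At 02:10 UTC the payments gateway begins timing out...",
    "severity": "SEV2",
    "tags": ["cloud_outage", "payments"],
    "actors": [
        {"role": "on_call_engineer", "team": "payments", "phase": "triage",
         "notification_target": "internal_eng"},
        {"role": "comms_lead", "team": "comms", "phase": "notification",
         "notification_target": "customers"},
    ],
    "signals": ["threshold", "burn_rate", "support_ticket"],
}
```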

Actors

Each actor is assigned to a phase which determines when they become active:

• detection: Signal sources fire
• triage: On-call investigates
• escalation: IC + leadership engaged
• notification: Customers + partners notified

The notification target field determines which TTN vector component this actor contributes to: internal_eng, internal_leadership, customers, partners, regulatory.

Running Simulations

Click “Run Simulation” from any scenario detail page. Results stream via SSE to a live swim lane view.

Per iteration:

1. Race signals → detection time
2. Propagate failures → dependency graph
3. Run actors → LLM + plausibility
4. Apply comm tools → latency + escalation
5. Compute results → TTN + delays
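
Glued together, one iteration is roughly the skeleton below. It reuses the earlier sketches for steps 1-2 and stubs the actor and comm-tool steps with a fixed delay; none of this is the engine's real API:

```python
def run_iteration(failed_service: str = "gateway") -> dict:
    """Skeleton of one iteration, mirroring the five steps above."""
    detection_min, signal = race_signals()                # 1. race signals
    impact = impacted_services(failed_service)            # 2. propagate failures
    notify_min = detection_min + 12.0                     # 3-4. actors + comm tools (stub)
    return {                                              # 5. single-target result
        "winning_signal": signal,
        "detection_min": detection_min,
        "internal_eng_ttn_min": notify_min,
        "impacted": impact,
    }
```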

Run Overrides

Experiment knobs. Modify any configuration for a single run without changing the org profile or scenario. Enables apples-to-apples comparisons.

• What if we add auto-correlation paging?
• What if Slack is degraded today?
• What if we add a secondary on-call?
• What if we loosen statuspage approval?
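
An override payload might look like this sketch, combining three of the questions above (all keys are illustrative assumptions):

```python
# Hypothetical run-override payload; keys are illustrative.
overrides = {
    "comm_tools": {"slack": {"degraded": True, "extra_latency_min": 5}},  # Slack degraded today
    "policies": {"approvals": {"statuspage": []}},        # loosen: no statuspage approvers
    "teams": {"payments": {"secondary_on_call": True}},   # add a secondary on-call
}
# Re-running the same scenario + org profile with this dict gives an
# apples-to-apples comparison against the baseline run.
```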

TTN Vector

The primary output. Notification time per target, with distributions (p50, p90, p95, p99) across N iterations.

Example TTN Vector (p90)

• internal_engineering: 17.2 min
• incident_commander: 22.4 min
• internal_leadership: 28.1 min
• customers_statuspage: 35 min
• enterprise_partners: 42.1 min
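
Collapsing N iterations into per-target percentiles takes only the standard library; the input shape here is an assumption:

```python
import statistics

def ttn_percentile(samples_by_target: dict, q: float = 0.90) -> dict:
    """Reduce per-iteration samples to one percentile per target, i.e. the
    'TTN vector at p90'. Input shape: {target: [minutes per iteration]}."""
    return {
        target: statistics.quantiles(minutes, n=100)[int(q * 100) - 1]
        for target, minutes in samples_by_target.items()
    }

# ttn_percentile({"internal_engineering": [8.0, 12.5, 17.2, 25.0, 9.1]})
```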

Secondary Metrics

Time to IC
Minutes until IC is on the bridge
Time to Bridge
Minutes until all responders assembled
Time to Correct Severity
Includes severity flip-flops
Misroutes Count
Wrong team paged before right team
Pages Sent Total
Alert fatigue proxy
Notification Quality
1 (too early / vague) to 5 (timely / accurate)
SLA Breached
Whether TTN exceeded the threshold
Severity Flip-flops
How many times severity was changed

Delay Attribution

Every delay on the critical path is tagged with a cause category:

Cause categories: policy, human, external, tool, coordination.

• policy_approval: Waiting for approval
• tool_latency: Comm tool delivery time
• tool_outage: Comm tool is down
• staffing_gap: No one available
• ownership_ambiguity: Wrong team paged due to unclear ownership
• misroute: Routed to the wrong team
• runbook_gap: No or outdated runbook
• human_decision: Actor deliberation time
• external_dependency: Waiting on a vendor
• coordination_overhead: Multi-team overhead
• chaos_event: Injector-caused delay
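
Summing tagged delays along the critical path then gives the attribution breakdown; the per-step record shape is an assumption:

```python
from collections import Counter

def delay_by_cause(critical_path: list) -> Counter:
    """Total delay minutes per cause along the critical path. Each step is
    assumed to be {"label": ..., "delay_min": ..., "cause": ...}."""
    totals = Counter()
    for step in critical_path:
        totals[step["cause"]] += step["delay_min"]
    return totals

# delay_by_cause(path).most_common(3) -> e.g. [("policy_approval", 8), ...]
```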

Critical Path

The longest chain of dependent events from incident start to each notification target.

Example Critical Path → Customer Notification

Alert fires (2m) → OC wakes up (+6m) → Investigates (+7m) → IC joins (+3m) → VP approves (+8m) → Statuspage (+9m) = 35 min total

The largest delays on the path are the bottlenecks, and they are highlighted in the UI. Only delays on the critical path directly extend TTN.
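
Since only critical-path delays extend TTN, finding that path amounts to a longest-path computation over the acyclic event graph, sketched here with memoized recursion (the edge encoding is an assumption):

```python
import functools

def critical_path_minutes(edges: dict, start: str, target: str) -> float:
    """Longest chain of dependent delays from start to target in a DAG of events.
    edges: {event: [(next_event, delay_min), ...]} -- an illustrative shape."""

    @functools.lru_cache(maxsize=None)
    def longest_from(event: str) -> float:
        if event == target:
            return 0.0
        options = [delay + longest_from(nxt) for nxt, delay in edges.get(event, [])]
        return max(options, default=float("-inf"))   # -inf: target unreachable

    return longest_from(start)

# critical_path_minutes({"alert": [("oc_awake", 6)], "oc_awake": [("statuspage", 27)]},
#                       "alert", "statuspage") -> 33.0
```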

Interventions

After a run completes, the engine generates ranked intervention recommendations:

Example top-ranked intervention:

1. Pre-approve statuspage templates
   Category: policy. Expected impact: −6.8 min p90. Confidence: high.
   Evidence from simulation data + a one-click re-run override.

Each intervention includes: the specific change, category (tooling, policy, staffing, process, monitoring), expected TTN p90 reduction, confidence level, evidence, and a run override config for one-click testing.
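
For illustration, one intervention record might look like the following; the field names mirror the description above but the shape is hypothetical:

```python
# Hypothetical intervention record; field names follow the description above.
intervention = {
    "change": "Pre-approve statuspage templates",
    "category": "policy",
    "expected_ttn_p90_reduction_min": 6.8,
    "confidence": "high",
    "evidence": ["policy_approval delays on the customer-notification critical path"],
    "run_override": {"policies": {"approvals": {"statuspage": []}}},  # one-click re-run
}
```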

Comparisons

View runs side-by-side with overlaid metrics, radar charts, and grouped delay attribution.

1. Run the baseline
2. Test an intervention
3. Add to comparison
4. See the exact impact

Simulation Engine

The engine orchestrates each iteration through four phases:

1. Detection: Signal racing, failure propagation
2. Triage: On-call investigation, severity assessment
3. Escalation: IC engagement, bridge creation, paging
4. Notification: Customer comms, partner outreach, statuspage

The engine also manages situation state machines, resource constraints, dependency graph propagation, and event injector evaluation. All events stream via SSE.
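
Consuming that stream needs nothing beyond standard SSE framing. A minimal client sketch, where the endpoint path and per-event payload are assumptions:

```python
import requests  # third-party: pip install requests

def stream_run_events(run_url: str):
    """Yield the data payload of each SSE event from a running simulation.
    Only the 'data: ...' framing is standard; the URL and payload are assumed."""
    with requests.get(run_url, stream=True) as resp:
        for raw in resp.iter_lines(decode_unicode=True):
            if raw and raw.startswith("data: "):
                yield raw[len("data: "):]   # presumably one JSON event per line

# for event in stream_run_events("https://example.invalid/runs/123/events"): ...
```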

LLM Actors

Each actor step calls Claude Haiku with a structured prompt:

Prompt (input):

• Scenario narrative + incident state
• Actor role, team maturity, runbooks
• Current situation (asleep, etc.)
• Active signals + confidence
• Comm tool behaviors
• Org policies + approval chains
• Event log from prior steps
• Plausibility bounds

Response (output):

• response_time (validated)
• actions taken
• complications encountered
• decision + reasoning
• downstream_effects
• narrative summary
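
A single actor step might therefore look like the sketch below. The Messages API call itself is the real Anthropic SDK surface, but the model id, prompt assembly, and JSON response contract are assumptions:

```python
import json
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def step_actor(prompt_sections: dict) -> dict:
    """One actor step: send the structured prompt, parse the structured reply."""
    message = client.messages.create(
        model="claude-3-5-haiku-latest",   # assumed model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            # narrative, role, situation, signals, policies, event log, bounds
            "content": json.dumps(prompt_sections),
        }],
    )
    # The simulator presumably expects JSON back and validates response_time
    # against the plausibility bounds before accepting it.
    return json.loads(message.content[0].text)
```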

Plausibility Guards

Two guard systems ensure realistic outputs:

Plausibility Bounds

Response times clamped to min/max per situation:

• asleep: 2-15 min
• at_desk: 15 sec - 5 min
• in_meeting: 1-10 min

Monotonicity Rules

Event ordering validated per iteration:

• VP notified before the IC is paged (violation)
• Statuspage posted before severity is declared (violation)

Violations reduce the iteration's plausibility score.
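
Both guards reduce to a few lines. The bounds table mirrors the list above (converted to seconds), while the event-name keys in the monotonicity check are assumptions:

```python
# Illustrative bounds table (seconds), matching the list above.
BOUNDS = {"asleep": (120, 900), "at_desk": (15, 300), "in_meeting": (60, 600)}

def clamp_response_time(situation: str, llm_seconds: float) -> float:
    """Clamp the LLM-proposed response time into the situation's plausible range."""
    lo, hi = BOUNDS[situation]
    return max(lo, min(hi, llm_seconds))

def monotonicity_penalty(event_times: dict) -> int:
    """Count ordering violations; each one reduces the iteration's plausibility
    score. event_times maps assumed event names to minutes since incident start."""
    inf = float("inf")
    violations = 0
    if event_times.get("vp_notified", inf) < event_times.get("ic_paged", inf):
        violations += 1
    if event_times.get("statuspage_posted", inf) < event_times.get("sev_declared", inf):
        violations += 1
    return violations
```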