What is Sharely.ai Response?

Sharely.ai Response is an incident response simulator that models the full complexity of how your organization handles an incident. AI actors (Claude Haiku) role-play human responders, communication tool mechanics are simulated, your org's policies are enforced, and Time to Notify (TTN) is measured across every stakeholder group.

Org Profile (baseline config) + Scenario (incident + actors) + Run Overrides (experiment knobs) = Results (TTN + interventions)

• Composable: swap any layer
• Explainable: every delay tagged
• Actionable: ranked fixes
TTN is not a single number. It's a vector: internal engineering might learn in 8 minutes, but enterprise partners might not hear until minute 45. This tool measures the full vector.

Core Concepts

TTN (Time to Notify)

Elapsed time from incident start to when each stakeholder group is notified. Measured as a vector across targets.

Org Profile

Your org's baseline: teams, roles, comm tools, policies, service dependencies, signal sources, resource constraints.

Scenario

A specific incident: narrative, severity, affected services, actors, active signals, and chaos cards.

Run

N iterations of a scenario against an org profile. Each iteration produces a different outcome due to stochastic elements.

Actor

A participant in the response. Has a role, team, comm tool, situation state, and phase. Claude Haiku role-plays each one.

Signal Source

A detection mechanism (threshold alert, anomaly detection, SLO burn rate, etc.) that fires at incident start.

Event Injector

A "chaos card" that fires during simulation with a configured probability, adding delays or forcing escalations.

Iteration

A single pass through detection, triage, escalation, and notification phases. Produces a TTN vector and delay attribution.

Quick Start Guide

Org Profile

Your organization's baseline configuration. Everything here applies to all scenarios unless overridden. Think of it as “how our org works on any given day.”

Access from the sidebar under Org Profile. The overview page shows all sections and current resource constraints.

Teams & Roles

Teams

Each team has a name, timezone(s), on-call rotation style, and a maturity profile that affects simulation behavior:

• Runbook coverage: none / partial / comprehensive
• Runbook quality: outdated / decent / excellent
• Training level: untrained / basic / drilled

Roles

Job functions with phase assignments and concurrency limits:

• On-Call Engineer: triage
• Incident Commander: escalation
• Comms Lead: notification
• Engineering Leader: escalation
• Account Manager: notification

Communication Tools

Comm tools model real delivery mechanics, not just “notification sent.”

Delivery Pipeline (per message)

Send (message dispatched) → Delivery Latency (normal/lognormal distribution) → Noise Filter (missed? 0-1 probability) → Ack Wait (timeout → escalate) → Received (actor sees it)
PagerDuty
  • Ack timeout + escalation policy
  • Multi-step escalation chains
Slack
  • Channel noise probability
  • Rate limiting at high msg volumes
Statuspage
  • Draft time + approval chain
  • Template availability
Email
  • Legal review for SEV1
  • Outage correlation tags
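
A minimal sketch of one message moving through this pipeline, assuming a lognormal latency model and a single escalation step. The parameter names, defaults, and return shape are illustrative, not the product's real schema:

```python
import math
import random

def simulate_message(latency_median_s: float = 30.0,
                     latency_sigma: float = 0.8,
                     miss_prob: float = 0.10,
                     ack_timeout_s: float = 300.0) -> dict:
    """One pass through send -> latency -> noise filter -> ack wait."""
    # Delivery latency sampled from a lognormal distribution.
    latency = random.lognormvariate(math.log(latency_median_s), latency_sigma)

    # Noise filter: with some probability the actor never notices the message.
    if random.random() < miss_prob:
        # No ack within the timeout, so the tool escalates (e.g., next on-call).
        return {"received": False, "escalated_after_s": ack_timeout_s}

    return {"received": True, "seen_after_s": latency}
```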

Policies

Org policies govern who can do what during an incident. Each policy is a configurable knob that the simulator enforces:

Severity Declaration

Who can declare, strictness level, auto-declare conditions

Notification Thresholds

Which SEV levels trigger leadership, customer, partner notifications

Approvals

Who approves customer comms, statuspage, partner notifications, legal

Paging Rules

Can the IC page any team directly? Is manager approval required? Any other restrictions?

Policy delays are one of the most common TTN contributors. The simulator tracks “policy_approval” as a delay cause so you can see exactly how many minutes your approval chains add.
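
To make the knobs concrete, a policy block might look like the sketch below. Every key and value here is a hypothetical illustration of the four policy areas above, not the actual configuration schema:

```python
# Hypothetical policy configuration; all keys and values are illustrative.
POLICIES = {
    "severity_declaration": {
        "who_can_declare": ["on_call_engineer", "incident_commander"],
        "strictness": "normal",                  # e.g. lenient | normal | strict
        "auto_declare": {"error_rate_pct": 5.0},
    },
    "notification_thresholds": {
        "leadership": "SEV2",                    # SEV2 and above notify leadership
        "customers": "SEV1",
        "partners": "SEV1",
    },
    "approvals": {                               # each approver adds policy_approval delay
        "statuspage": ["comms_lead", "vp_engineering"],
        "customer_comms": ["comms_lead"],
    },
    "paging_rules": {
        "ic_can_page_any_team": True,
        "requires_manager_approval": False,
    },
}
```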

Services & Dependencies

The service dependency graph models how failures propagate through your architecture.

Example Dependency Graph

checkout → payments-api (hard, 0 min)
checkout → fraud-service (soft, 1 min)
payments-api → gateway (hard, 2 min)

• Ownership ambiguity: 0 (clear) → 1 (unknown)
• Misrouting cost: sampled from a distribution
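
Failure propagation over this graph can be sketched as a breadth-first walk from the failed service to everything that depends on it. The graph encoding and function below are illustrative assumptions:

```python
from collections import deque

# Edges mirror the example graph above: service -> [(dependency, coupling, delay_min)].
DEPS = {
    "checkout": [("payments-api", "hard", 0), ("fraud-service", "soft", 1)],
    "payments-api": [("gateway", "hard", 2)],
}

def impacted_services(failed: str, deps: dict = DEPS) -> dict:
    """Return every service affected by the failure, keyed to the minute the
    impact arrives. (A fuller model would treat soft edges as degradation only.)"""
    # Invert the graph: dependency -> the services that depend on it.
    dependents: dict = {}
    for svc, edges in deps.items():
        for dep, _coupling, delay in edges:
            dependents.setdefault(dep, []).append((svc, delay))

    impact = {failed: 0}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for svc, delay in dependents.get(current, []):
            arrival = impact[current] + delay
            if svc not in impact or arrival < impact[svc]:
                impact[svc] = arrival
                queue.append(svc)
    return impact

# impacted_services("gateway") -> {"gateway": 0, "payments-api": 2, "checkout": 2}
```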

Signal Sources

Signal sources define how incidents are detected. The simulator races all active signals — the first confident signal determines detection time.

Signal Race (first confident signal wins)

• threshold: 1-3 min, 90% confidence (winner in this example)
• anomaly: 3-7 min, 60% confidence
• burn_rate: 5-10 min, 75% confidence
• synthetic: 2-5 min, 85% confidence
• support_ticket: 15-45 min, 30% confidence
• threshold: Error rate > X%
• anomaly: Statistical detection
• burn_rate: SLO error budget
• synthetic: Canary monitoring
• correlation: Multi-service correlation
• support_ticket: Customer reports
• social: Social / partner reports
• dashboard: Human observation
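
One way to read the race, under the assumption that each signal's confidence acts as the probability it fires at all and its latency is uniform over the listed range:

```python
import random

# Illustrative signal table: (min_latency_min, max_latency_min, confidence).
SIGNALS = {
    "threshold": (1, 3, 0.90),
    "anomaly": (3, 7, 0.60),
    "burn_rate": (5, 10, 0.75),
    "synthetic": (2, 5, 0.85),
    "support_ticket": (15, 45, 0.30),
}

def race_signals(signals: dict = SIGNALS) -> tuple:
    """Sample a firing time for every signal that fires confidently;
    the earliest one sets the detection time for this iteration."""
    candidates = []
    for name, (lo, hi, confidence) in signals.items():
        if random.random() < confidence:       # did this signal fire confidently?
            candidates.append((random.uniform(lo, hi), name))
    if not candidates:
        return (float("inf"), None)            # nothing fired: still undetected
    return min(candidates)                     # (detection_minute, winning_signal)
```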

Situations

Situations model the real-world context of each actor as a state machine, not a static sample.

State Machine Example

asleep → (2-8 min) → groggy → (5 min) → alert
cloud_outage triggers vpn_down

• Availability: asleep / in_meeting / traveling / on_break
• Cognitive: focused / context_switching / fatigued / groggy
• Tooling: vpn_down / laptop_not_available / phone_dead
• Coordination: on_another_incident / handoff_in_progress

States are weighted by time of day (asleep is more likely at 2am) and can be triggered by events (a cloud_outage triggers vpn_down).
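
A toy version of the availability chain above, with a made-up time-of-day weighting:

```python
import random

# Illustrative transition table: state -> (next_state, (min_delay, max_delay) minutes).
TRANSITIONS = {
    "asleep": ("groggy", (2, 8)),
    "groggy": ("alert", (5, 5)),
}

def time_to_alert(state: str = "asleep") -> float:
    """Walk the chain until the actor reaches 'alert', summing transition delays."""
    minutes = 0.0
    while state != "alert":
        state, (lo, hi) = TRANSITIONS[state]
        minutes += random.uniform(lo, hi)
    return minutes

def initial_state(hour_of_day: int) -> str:
    """Time-of-day weighting: asleep is far likelier at 2am than at 2pm.
    The weights here are invented for illustration."""
    weights = {"asleep": 0.85, "alert": 0.15} if hour_of_day < 6 else {"asleep": 0.02, "alert": 0.98}
    return random.choices(list(weights), weights=list(weights.values()))[0]
```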

Event Injectors

“Chaos cards” — controlled disruptions that fire during simulation based on probability and trigger conditions.

• Monitoring dashboard outage (5%): +8 min to investigation
• Slack degraded (12%, tag: cloud_outage): +5 min to all Slack comms
• Customer exec escalation (6%): IC distracted
• Shift handoff mid-incident (7%): +10 min context transfer

Each injector has probability (0-1), trigger conditions (phase, severity, tags), and an effect (delay, escalation, severity change, tool outage).
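
Evaluating injectors each iteration could look like the following sketch; the field names and trigger model are assumptions:

```python
import random

# Hypothetical injector records; field names are illustrative.
INJECTORS = [
    {"name": "monitoring_dashboard_outage", "probability": 0.05,
     "trigger": {"phase": "triage"}, "effect": {"delay_min": 8}},
    {"name": "shift_handoff", "probability": 0.07,
     "trigger": {"phase": "any"}, "effect": {"delay_min": 10}},
]

def evaluate_injectors(phase: str, injectors: list = INJECTORS) -> list:
    """Each iteration, roll every injector whose trigger matches the current phase."""
    fired = []
    for inj in injectors:
        trigger_phase = inj["trigger"]["phase"]
        if trigger_phase in (phase, "any") and random.random() < inj["probability"]:
            fired.append(inj)
    return fired
```
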

Calibration

Two systems ensure realistic outputs:

Plausibility Bounds
  • "asleep" must be ≥ 2 min response
  • "at_desk" must be ≤ 15 min response
  • LLM responses clamped to bounds
Monotonicity Rules
  • VP cannot be notified before IC paged
  • Statuspage cannot post before SEV declared
  • Violations reduce plausibility score

Creating Scenarios

A scenario defines a specific incident to simulate:

• Name & description: Brief summary
• Narrative: Detailed context for the Claude Haiku actors
• Severity: SEV1 – SEV4
• Tags: Injector correlation + categorization
• Actors: Participants with role, team, and phase
• Signals: Active signal sources for this incident
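
Put together, a scenario definition could look like this sketch. The field names follow the list above, but the exact schema is an assumption:

```python
# Hypothetical scenario definition; values are invented for illustration.
scenario = {
    "name": "Payments gateway brownout",
    "description": "Elevated 5xx on checkout during peak traffic",
    "narrative": "At 02:10 UTC the payments gateway begins timing out...",
    "severity": "SEV2",
    "tags": ["cloud_outage", "payments"],
    "actors": [
        {"role": "on_call_engineer", "team": "payments", "phase": "triage",
         "notification_target": "internal_eng"},
        {"role": "comms_lead", "team": "comms", "phase": "notification",
         "notification_target": "customers"},
    ],
    "signals": ["threshold", "burn_rate", "support_ticket"],
}
```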

Actors

Each actor is assigned to a phase which determines when they become active:

• detection: Signal sources fire
• triage: On-call investigates
• escalation: IC + leadership engaged
• notification: Customers + partners notified

The notification target field determines which TTN vector component this actor contributes to: internal_eng, internal_leadership, customers, partners, regulatory.

Running Simulations

Click “Run Simulation” from any scenario detail page. Results stream via SSE to a live swim lane view.

Per iteration:

1. Race signals → detection time
2. Propagate failures → dependency graph
3. Run actors → LLM + plausibility
4. Apply comm tools → latency + escalation
5. Compute results → TTN + delays
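
Glued together, one iteration is roughly the skeleton below. It reuses the earlier sketches for steps 1-2 and stubs the actor and comm-tool steps with a fixed delay; none of this is the engine's real API:

```python
def run_iteration(failed_service: str = "gateway") -> dict:
    """Skeleton of one iteration, mirroring the five steps above."""
    detection_min, signal = race_signals()                # 1. race signals
    impact = impacted_services(failed_service)            # 2. propagate failures
    notify_min = detection_min + 12.0                     # 3-4. actors + comm tools (stub)
    return {                                              # 5. single-target result
        "winning_signal": signal,
        "detection_min": detection_min,
        "internal_eng_ttn_min": notify_min,
        "impacted": impact,
    }
```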

Run Overrides

Experiment knobs. Modify any configuration for a single run without changing the org profile or scenario. Enables apples-to-apples comparisons.

• What if we add auto-correlation paging?
• What if Slack is degraded today?
• What if we add a secondary on-call?
• What if we loosen statuspage approval?
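
An override payload might look like this sketch, combining three of the questions above (all keys are illustrative assumptions):

```python
# Hypothetical run-override payload; keys are illustrative.
overrides = {
    "comm_tools": {"slack": {"degraded": True, "extra_latency_min": 5}},  # Slack degraded today
    "policies": {"approvals": {"statuspage": []}},        # loosen: no statuspage approvers
    "teams": {"payments": {"secondary_on_call": True}},   # add a secondary on-call
}
# Re-running the same scenario + org profile with this dict gives an
# apples-to-apples comparison against the baseline run.
```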

TTN Vector

The primary output. Notification time per target, with distributions (p50, p90, p95, p99) across N iterations.

Example TTN Vector (p90)

• internal_engineering: 17.2 min
• incident_commander: 22.4 min
• internal_leadership: 28.1 min
• customers_statuspage: 35 min
• enterprise_partners: 42.1 min
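
Collapsing N iterations into per-target percentiles takes only the standard library; the input shape here is an assumption:

```python
import statistics

def ttn_percentile(samples_by_target: dict, q: float = 0.90) -> dict:
    """Reduce per-iteration samples to one percentile per target, i.e. the
    'TTN vector at p90'. Input shape: {target: [minutes per iteration]}."""
    return {
        target: statistics.quantiles(minutes, n=100)[int(q * 100) - 1]
        for target, minutes in samples_by_target.items()
    }

# ttn_percentile({"internal_engineering": [8.0, 12.5, 17.2, 25.0, 9.1]})
```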

Secondary Metrics

Time to IC
Minutes until IC is on the bridge
Time to Bridge
Minutes until all responders assembled
Time to Correct Severity
Includes severity flip-flops
Misroutes Count
Wrong team paged before right team
Pages Sent Total
Alert fatigue proxy
Notification Quality
1 (too early / vague) to 5 (timely / accurate)
SLA Breached
Whether TTN exceeded the threshold
Severity Flip-flops
How many times severity was changed

Delay Attribution

Every delay on the critical path is tagged with a cause category:

Cause categories: policy, human, external, tool, coordination.

• policy_approval: Waiting for approval
• tool_latency: Comm tool delivery time
• tool_outage: Comm tool is down
• staffing_gap: No one available
• ownership_ambiguity: Wrong team paged due to unclear ownership
• misroute: Routed to the wrong team
• runbook_gap: No or outdated runbook
• human_decision: Actor deliberation time
• external_dependency: Waiting on a vendor
• coordination_overhead: Multi-team overhead
• chaos_event: Injector-caused delay
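
Summing tagged delays along the critical path then gives the attribution breakdown; the per-step record shape is an assumption:

```python
from collections import Counter

def delay_by_cause(critical_path: list) -> Counter:
    """Total delay minutes per cause along the critical path. Each step is
    assumed to be {"label": ..., "delay_min": ..., "cause": ...}."""
    totals = Counter()
    for step in critical_path:
        totals[step["cause"]] += step["delay_min"]
    return totals

# delay_by_cause(path).most_common(3) -> e.g. [("policy_approval", 8), ...]
```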

Critical Path

The longest chain of dependent events from incident start to each notification target.

Example Critical Path → Customer Notification

Alert fires (2m) → OC wakes up (+6m) → Investigates (+7m) → IC joins (+3m) → VP approves (+8m) → Statuspage (+9m) = 35 min total

The largest delays on the path are the bottlenecks, and they are highlighted in the UI. Only delays on the critical path directly extend TTN.
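
Since only critical-path delays extend TTN, finding that path amounts to a longest-path computation over the acyclic event graph, sketched here with memoized recursion (the edge encoding is an assumption):

```python
import functools

def critical_path_minutes(edges: dict, start: str, target: str) -> float:
    """Longest chain of dependent delays from start to target in a DAG of events.
    edges: {event: [(next_event, delay_min), ...]} -- an illustrative shape."""

    @functools.lru_cache(maxsize=None)
    def longest_from(event: str) -> float:
        if event == target:
            return 0.0
        options = [delay + longest_from(nxt) for nxt, delay in edges.get(event, [])]
        return max(options, default=float("-inf"))   # -inf: target unreachable

    return longest_from(start)

# critical_path_minutes({"alert": [("oc_awake", 6)], "oc_awake": [("statuspage", 27)]},
#                       "alert", "statuspage") -> 33.0
```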

Interventions

After a run completes, the engine generates ranked intervention recommendations:

Example top-ranked intervention:

1. Pre-approve statuspage templates
   Category: policy. Expected impact: −6.8 min p90. Confidence: high.
   Evidence from simulation data + a one-click re-run override.

Each intervention includes: the specific change, category (tooling, policy, staffing, process, monitoring), expected TTN p90 reduction, confidence level, evidence, and a run override config for one-click testing.
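
For illustration, one intervention record might look like the following; the field names mirror the description above but the shape is hypothetical:

```python
# Hypothetical intervention record; field names follow the description above.
intervention = {
    "change": "Pre-approve statuspage templates",
    "category": "policy",
    "expected_ttn_p90_reduction_min": 6.8,
    "confidence": "high",
    "evidence": ["policy_approval delays on the customer-notification critical path"],
    "run_override": {"policies": {"approvals": {"statuspage": []}}},  # one-click re-run
}
```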

Comparisons

View runs side-by-side with overlaid metrics, radar charts, and grouped delay attribution.

1. Run the baseline
2. Test an intervention
3. Add to comparison
4. See the exact impact

Simulation Engine

The engine orchestrates each iteration through four phases:

1. Detection: Signal racing, failure propagation
2. Triage: On-call investigation, severity assessment
3. Escalation: IC engagement, bridge creation, paging
4. Notification: Customer comms, partner outreach, statuspage

The engine also manages situation state machines, resource constraints, dependency graph propagation, and event injector evaluation. All events stream via SSE.
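
Consuming that stream needs nothing beyond standard SSE framing. A minimal client sketch, where the endpoint path and per-event payload are assumptions:

```python
import requests  # third-party: pip install requests

def stream_run_events(run_url: str):
    """Yield the data payload of each SSE event from a running simulation.
    Only the 'data: ...' framing is standard; the URL and payload are assumed."""
    with requests.get(run_url, stream=True) as resp:
        for raw in resp.iter_lines(decode_unicode=True):
            if raw and raw.startswith("data: "):
                yield raw[len("data: "):]   # presumably one JSON event per line

# for event in stream_run_events("https://example.invalid/runs/123/events"): ...
```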

LLM Actors

Each actor step calls Claude Haiku with a structured prompt:

Prompt (input):

• Scenario narrative + incident state
• Actor role, team maturity, runbooks
• Current situation (asleep, etc.)
• Active signals + confidence
• Comm tool behaviors
• Org policies + approval chains
• Event log from prior steps
• Plausibility bounds

Response (output):

• response_time (validated)
• actions taken
• complications encountered
• decision + reasoning
• downstream_effects
• narrative summary
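
A single actor step might therefore look like the sketch below. The Messages API call itself is the real Anthropic SDK surface, but the model id, prompt assembly, and JSON response contract are assumptions:

```python
import json
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def step_actor(prompt_sections: dict) -> dict:
    """One actor step: send the structured prompt, parse the structured reply."""
    message = client.messages.create(
        model="claude-3-5-haiku-latest",   # assumed model id
        max_tokens=1024,
        messages=[{
            "role": "user",
            # narrative, role, situation, signals, policies, event log, bounds
            "content": json.dumps(prompt_sections),
        }],
    )
    # The simulator presumably expects JSON back and validates response_time
    # against the plausibility bounds before accepting it.
    return json.loads(message.content[0].text)
```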

Plausibility Guards

Two guard systems ensure realistic outputs:

Plausibility Bounds

Response times clamped to min/max per situation:

• asleep: 2-15 min
• at_desk: 15 sec - 5 min
• in_meeting: 1-10 min

Monotonicity Rules

Event ordering validated per iteration:

• VP notified before the IC is paged (violation)
• Statuspage posted before severity is declared (violation)

Violations reduce the iteration's plausibility score.
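
Both guards reduce to a few lines. The bounds table mirrors the list above (converted to seconds), while the event-name keys in the monotonicity check are assumptions:

```python
# Illustrative bounds table (seconds), matching the list above.
BOUNDS = {"asleep": (120, 900), "at_desk": (15, 300), "in_meeting": (60, 600)}

def clamp_response_time(situation: str, llm_seconds: float) -> float:
    """Clamp the LLM-proposed response time into the situation's plausible range."""
    lo, hi = BOUNDS[situation]
    return max(lo, min(hi, llm_seconds))

def monotonicity_penalty(event_times: dict) -> int:
    """Count ordering violations; each one reduces the iteration's plausibility
    score. event_times maps assumed event names to minutes since incident start."""
    inf = float("inf")
    violations = 0
    if event_times.get("vp_notified", inf) < event_times.get("ic_paged", inf):
        violations += 1
    if event_times.get("statuspage_posted", inf) < event_times.get("sev_declared", inf):
        violations += 1
    return violations
```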