# Upstream provider outage: bulk customer comms in one workflow

> Multiple monitoring triggers light up. Confirm the outage is upstream, open a carrier ticket, update the IVR, bulk-SMS subscribers in the affected region, run a status page update, and keep a Slack war room alive.

At 18:47 Zabbix raises 43 subscriber-facing triggers and two core BGP-session-down alerts. The ISP's NOC needs to prove in three minutes that the problem is upstream, open a trouble ticket with the carrier, warn subscribers before the call volume spikes, and run a structured war room until the session is back.

## Systems involved

| System                        | Role                                           |
| ----------------------------- | ---------------------------------------------- |
| Zabbix / LibreNMS             | Flood of subscriber and core triggers.         |
| BGP looking glass             | Confirm the peer outage externally.            |
| SSH to core routers           | Local state, `show bgp summary`, neighbor log. |
| Carrier trouble-ticket portal | Open a formal ticket with the upstream NOC.    |
| Twilio / Bulk SMS             | Subscriber outbound SMS.                       |
| IVR provider (Asterisk / 3CX) | Update the on-hold message.                    |
| Atlassian Statuspage          | Public status page.                            |
| Slack `#noc-war-room`         | Live operational channel.                      |
| Splynx / Sonar BSS            | Subscriber affected-region lookup.             |

## Walkthrough

<Steps>
  <Step title="Triage the alarm flood">
    Copilot groups the Zabbix triggers by root cause — 41 of 43 are downstream symptoms of the two core BGP neighbor drops. The two that aren't are unrelated customer-side issues.
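
The grouping can be pictured with a minimal sketch. It assumes each trigger already carries the BGP peer its host depends on; the `Trigger` shape and the `upstream_peer` field are hypothetical, not a Zabbix or Copilot schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trigger:
    trigger_id: str
    host: str
    upstream_peer: Optional[str]  # hypothetical: peer this host's traffic depends on


def group_by_root_cause(triggers, down_peers):
    """Bucket trigger IDs under the downed peer they depend on, else 'unrelated'."""
    groups = defaultdict(list)
    for t in triggers:
        key = t.upstream_peer if t.upstream_peer in down_peers else "unrelated"
        groups[key].append(t.trigger_id)
    return dict(groups)
```

Anything that lands in `unrelated` still needs its own handling; it just stays out of the outage comms.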
  </Step>

  <Step title="Confirm upstream">
    Copilot runs the SSH procedure: `show bgp summary` on both core routers, `show log | include bgp` on each. Both show the neighbor reset reason as received-from-neighbor, matching a carrier-side event. The looking glass connector confirms the carrier's own prefix advertisements are withdrawn.
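
The decision itself reduces to a small rule: every core router and the looking glass have to agree. The sketch below assumes the command output has already been parsed into booleans; that parsing is vendor-specific and left out, and `RouterEvidence` is a hypothetical shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterEvidence:
    router: str
    neighbor_established: bool  # derived from `show bgp summary`
    reset_from_neighbor: bool   # derived from the BGP log lines


def is_upstream_outage(evidence, carrier_prefixes_withdrawn):
    """True only when all local evidence and the looking glass point upstream."""
    if not evidence:
        return False  # no local evidence, no conclusion
    local_agrees = all(
        (not e.neighbor_established) and e.reset_from_neighbor for e in evidence
    )
    return local_agrees and carrier_prefixes_withdrawn
```

Requiring agreement from both sources keeps a local misconfiguration from being reported to subscribers as a carrier fault.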
  </Step>

  <Step title="Open the carrier ticket">
    Through the carrier's trouble-ticket connector (or email if there's no API), Copilot drafts a ticket with peer IPs, your ASN, the carrier ASN, timestamps in UTC, the two local log snippets, and a callback phone number. Copilot then opens the ticket and captures the ticket number.
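
As an illustration of the ticket contents, here is a sketch that renders those fields into a plain-text body with the timestamp normalized to UTC. The `CarrierTicket` class and its layout are assumptions for illustration, not the connector's actual format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CarrierTicket:
    peer_ips: list
    local_asn: int
    carrier_asn: int
    detected_at: datetime  # must be timezone-aware
    log_snippets: list
    callback_phone: str

    def render(self) -> str:
        # Convert to UTC so the carrier can line the time up with its own logs.
        ts = self.detected_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
        lines = [
            f"BGP session down: AS{self.local_asn} <-> AS{self.carrier_asn}",
            f"Peer IPs: {', '.join(self.peer_ips)}",
            f"First detected: {ts}",
            f"Callback: {self.callback_phone}",
            "Local log excerpts:",
        ]
        lines += [f"  {snippet}" for snippet in self.log_snippets]
        return "\n".join(lines)
```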
  </Step>

  <Step title="Subscriber impact lookup">
    Copilot queries Splynx for subscribers whose last-mile service depends on the affected upstream paths. The lookup returns 4,812 subscribers across three regions. Copilot groups them by region and prepares the outbound SMS list.
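
A rough sketch of the grouping step, assuming the lookup returns one record per service with `region`, `upstream_path`, and `phone` keys. Those keys are hypothetical and do not reflect the Splynx schema.

```python
from collections import defaultdict


def sms_lists_by_region(subscribers, affected_paths):
    """Return {region: [phone, ...]} for subscribers on an affected upstream path."""
    lists = defaultdict(dict)
    for sub in subscribers:
        if sub["upstream_path"] in affected_paths and sub.get("phone"):
            # Dict keys deduplicate numbers while keeping first-seen order.
            lists[sub["region"]][sub["phone"]] = None
    return {region: list(phones) for region, phones in lists.items()}
```

Deduplicating by phone number matters when one household holds several services: the subscriber should get one SMS, not one per service.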
  </Step>

  <Step title="Bulk SMS">
    The Twilio connector sends a terse SMS: cause (upstream carrier), scope (region), ETA (updating), status page URL. The approval prompt shows the SMS count and approximate cost before sending.
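
The numbers in the approval prompt can be estimated before anything is sent. The sketch below assumes the body uses only GSM-7 basic characters (160 characters for a single message, 153 per segment once it is split) and takes the per-segment price as an input; both the function names and the price are placeholders, not Twilio values.

```python
import math

GSM7_SINGLE = 160  # characters that fit in a single-segment GSM-7 message
GSM7_CONCAT = 153  # characters per segment once a message is split


def segments(body: str) -> int:
    """Segment count, assuming GSM-7 basic characters only."""
    if len(body) <= GSM7_SINGLE:
        return 1
    return math.ceil(len(body) / GSM7_CONCAT)


def approval_summary(body, recipient_count, price_per_segment):
    """Figures to show an approver before a bulk send."""
    segs = segments(body)
    total = segs * recipient_count
    return {
        "recipients": recipient_count,
        "segments_per_message": segs,
        "total_segments": total,
        "estimated_cost": round(total * price_per_segment, 2),
    }
```

Keeping the body within one segment is the cheapest lever here: a message that spills into a second segment doubles the cost across every recipient.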
  </Step>

  <Step title="IVR update">
    Copilot connects over SSH to the Asterisk host and swaps the on-hold message in the dialplan for the outage notice. Callers landing in the support queue now hear the same message the SMS carries.
  </Step>

  <Step title="Publish status">
    Push an Identified incident to Atlassian Statuspage with the affected components, the upstream cause, and a note that a ticket is open with the carrier. For security, the ticket number itself stays off the public page.
  </Step>

  <Step title="Open the war room">
    Copilot opens a Slack thread in `#noc-war-room` with the timeline, the carrier ticket, the affected subscriber count, and the last update time. Updates to the thread auto-propagate to Statuspage and the on-hold message.
  </Step>

  <Step title="Watch for resolution">
    Copilot polls the BGP sessions every 60 seconds. When both peers come back up and prefixes re-populate, the all-clear fires: the SMS goes out, the IVR reverts, Statuspage closes, and the Slack thread gets the resolution summary and a commitment for the post-mortem.
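
The polling logic can be sketched as a loop that fires the all-clear exactly once. `check_peers` and `on_all_clear` are hypothetical callables supplied by the caller; `check_peers` is assumed to return `{peer: (established, prefix_count)}`.

```python
import time


def wait_for_all_clear(check_peers, on_all_clear, interval_s=60,
                       sleep=time.sleep, max_polls=None):
    """Poll until every peer is up with prefixes; return the poll count, or None."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        states = check_peers()
        # A session that is up but carries no prefixes is not yet a recovery.
        if states and all(up and prefixes > 0 for up, prefixes in states.values()):
            on_all_clear(states)
            return polls
        sleep(interval_s)
    return None
```

Checking the prefix count as well as the session state avoids announcing recovery while routes are still re-populating.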
  </Step>
</Steps>

## Where Studio earns its keep

* The 43-alarm flood becomes one root cause in two minutes, not thirty minutes of triage.
* Subscriber SMS, IVR, and status page update in parallel instead of waiting for whoever remembers each one.
* The carrier ticket references the same timestamps the local logs have, which speeds the carrier's side of the investigation.
* The all-clear closes every external comms channel the same way it opened them — no stale status page messages at 04:00.

## Related

<CardGroup cols={2}>
  <Card title="AI Copilot" icon="sparkles" href="../../ai-copilot" arrow="true" cta="Orchestrate">
    Planning mode for the bulk SMS approval — the cost and scope need to be visible.
  </Card>

  <Card title="Procedures" icon="workflow" href="../../procedures" arrow="true" cta="Save it">
    `Upstream outage response` with region as an argument.
  </Card>
</CardGroup>
