At 18:47 Zabbix raises 43 subscriber-facing triggers and two core BGP-session-down alerts. The ISP’s NOC needs to prove in three minutes that the problem is upstream, open a trouble ticket with the carrier, warn subscribers before the call volume spikes, and run a structured war room until the session is back.

Systems involved

Zabbix / LibreNMS: Flood of subscriber and core triggers.
BGP looking glass: Confirm the peer outage externally.
SSH to core routers: Local state, show bgp summary, neighbor log.
Carrier trouble-ticket portal: Open a formal ticket with the upstream NOC.
Twilio / Bulk SMS: Subscriber outbound SMS.
IVR provider (Asterisk / 3CX): Update the on-hold message.
Atlassian Statuspage: Public status page.
Slack #noc-war-room: Live operational channel.
Splynx / Sonar BSS: Subscriber affected-region lookup.

Walkthrough

1. Triage the alarm flood

Copilot groups the Zabbix triggers by root cause — 41 of 43 are downstream symptoms of the two core BGP neighbor drops. The two that aren’t are unrelated customer-side issues.
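The grouping logic amounts to walking each trigger's dependency link back to a root cause. A minimal sketch, assuming trigger records carry an `id`, a `name`, and a `depends_on` parent id (field names are illustrative; real Zabbix trigger dependencies are richer, and matching roots by name prefix is a stand-in for that):

```python
from collections import defaultdict

def triage(triggers):
    """Split a trigger flood into root causes, downstream symptoms,
    and unrelated noise. Root detection by name prefix is a stand-in
    for real Zabbix trigger-dependency data."""
    root_ids = {t["id"] for t in triggers if t["name"].startswith("BGP")}
    groups = defaultdict(list)
    for t in triggers:
        if t["id"] in root_ids:
            groups["root_cause"].append(t)
        elif t.get("depends_on") in root_ids:
            groups["symptom"].append(t)
        else:
            groups["unrelated"].append(t)
    return groups
```

On the flood above this yields two root causes, 41 symptoms, and two unrelated triggers, which is what the operator needs to see first.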
2. Confirm upstream

Copilot runs the SSH procedure: show bgp summary on both core routers, show log | include bgp on each. Both show the neighbor reset reason as received-from-neighbor, matching a carrier-side event. The looking glass connector confirms the carrier’s own prefix advertisements are withdrawn.
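Deciding which peers are down from the captured output is a small parsing step. A sketch against Cisco-style `show bgp summary` output, where an Established neighbor shows a received-prefix count in the last column and a down neighbor shows a state word (exact layout varies by platform and software version):

```python
def down_peers(show_bgp_summary: str):
    """Return (neighbor, state) for peers that are not Established,
    i.e. whose State/PfxRcd column is a state word, not a prefix count."""
    peers = []
    for line in show_bgp_summary.splitlines():
        cols = line.split()
        # Neighbor rows start with an IPv4 address and have >= 9 columns.
        if len(cols) >= 9 and cols[0].count(".") == 3:
            state = cols[-1]
            if not state.isdigit():  # e.g. Idle, Active, Connect
                peers.append((cols[0], state))
    return peers
```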
3. Open the carrier ticket

Through the carrier’s trouble-ticket connector (or email if there’s no API), Copilot drafts a ticket with: peer IPs, your ASN, the carrier ASN, timestamps in UTC, the two local log snippets, and a callback phone number. The ticket is opened and its number captured.
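Assembling the draft is mechanical once the evidence is collected. A sketch of the payload, with illustrative field names since every carrier portal defines its own schema:

```python
from datetime import datetime, timezone

def carrier_ticket(peer_ips, our_asn, carrier_asn, log_snippets, callback):
    """Gather the fields the carrier NOC needs: peers, both ASNs,
    UTC timestamps, local log evidence, and a callback number."""
    return {
        "subject": f"BGP sessions to AS{carrier_asn} down "
                   f"(peers {', '.join(peer_ips)})",
        "our_asn": our_asn,
        "carrier_asn": carrier_asn,
        "event_time_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "evidence": log_snippets,   # the two local log snippets, verbatim
        "callback": callback,
    }
```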
4. Subscriber impact lookup

Copilot queries Splynx for subscribers whose last-mile depends on the affected upstream paths: 4,812 subscribers across three regions. It groups them by region and prepares the outbound SMS list.
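The grouping step is a simple filter-and-count over the BSS records. A sketch, assuming each record carries hypothetical `region`, `upstream`, and `phone` fields as a Splynx/Sonar query might return them:

```python
from collections import Counter

def impact_by_region(subscribers, affected_upstreams):
    """Filter subscribers to those behind an affected upstream, then
    return per-region counts plus the flat SMS recipient list."""
    affected = [s for s in subscribers if s["upstream"] in affected_upstreams]
    counts = Counter(s["region"] for s in affected)
    recipients = [s["phone"] for s in affected]
    return counts, recipients
```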
5. Bulk SMS

The Twilio connector sends a terse SMS: cause (upstream carrier), scope (region), ETA (updating), status page URL. The approval prompt shows the SMS count and approximate cost before send.
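Two small helpers capture what the prompt shows: the message itself kept to one 160-character GSM segment, and the count-and-cost summary. The per-SMS rate is a placeholder; real Twilio pricing varies by destination:

```python
def outage_sms(region, status_url):
    """Terse outage SMS: cause, scope, where updates will appear.
    Kept under 160 chars so each recipient costs one segment."""
    msg = (f"Internet outage in {region}: upstream carrier fault. "
           f"We are on it; updates at {status_url}")
    assert len(msg) <= 160
    return msg

def approval_summary(recipients, per_sms_cost=0.05):
    """What the approval prompt surfaces before send.
    per_sms_cost is a placeholder rate, not real Twilio pricing."""
    return {"count": len(recipients),
            "approx_cost": round(len(recipients) * per_sms_cost, 2)}
```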
6. IVR update

Copilot SSHes into the Asterisk host and swaps the on-hold message in the dialplan to the outage notice. Calls landing in the support queue now hear the same message the SMS carries.
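One concrete shape for the swap, assuming the queue announcement is read from a dialplan global variable (a common pattern, not an Asterisk default); the variable name and sound-file name here are illustrative, and the commands would be run over SSH on the PBX host:

```python
def ivr_update_commands(outage_sound="outage-upstream"):
    """Asterisk CLI calls to point the hypothetical HOLD_MSG global
    at the outage announcement, then list globals to verify."""
    return [
        f'asterisk -rx "dialplan set global HOLD_MSG {outage_sound}"',
        'asterisk -rx "dialplan show globals"',  # verify the change took
    ]
```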
7. Publish status

Copilot pushes an incident in the Identified state to the Atlassian Statuspage with the affected components, the upstream cause, and an internal carrier-ticket reference (not the raw ticket number, for security).
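The request body might look like the following, shaped after Statuspage's create-incident endpoint (POST /v1/pages/{page_id}/incidents; verify field names against the current API docs). The component IDs and the ticket reference are placeholders:

```python
def statuspage_incident(component_ids, carrier_ref):
    """Body for a Statuspage create-incident call: Identified status,
    affected components marked major_outage, internal ticket ref only."""
    return {
        "incident": {
            "name": "Upstream carrier outage",
            "status": "identified",
            "body": f"Upstream BGP sessions are down; the carrier is "
                    f"engaged (internal ref {carrier_ref}).",
            "component_ids": component_ids,
            "components": {cid: "major_outage" for cid in component_ids},
        }
    }
```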
8. Open the war room

Copilot opens a Slack thread in #noc-war-room with the timeline, the carrier ticket, the affected subscriber count, and the last update time. Updates to the thread auto-propagate to Statuspage and the on-hold message.
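Because the thread is the source of truth the other channels are updated from, it helps to keep the kickoff message machine-buildable. A formatting sketch (the emoji and layout are illustrative):

```python
def war_room_kickoff(timeline, ticket_ref, affected_count, last_update):
    """Opening message for the #noc-war-room thread: timeline,
    carrier ticket, impact, and freshness in one place."""
    lines = [":rotating_light: Upstream carrier outage"]
    lines += [f"• {ts} {event}" for ts, event in timeline]
    lines.append(f"Carrier ticket: {ticket_ref} | "
                 f"Affected: {affected_count} subscribers")
    lines.append(f"Last update: {last_update}")
    return "\n".join(lines)
```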
9. Watch for resolution

Copilot polls the BGP sessions every 60 seconds. When both peers come back up and prefixes re-populate, the all-clear fires — SMS goes out, IVR reverts, Statuspage closes, Slack thread gets the resolution summary and a commitment for the post-mortem.
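The watch loop is a small state machine: poll, check that every peer is both Established and re-advertising prefixes, fire the all-clear exactly once. A sketch with the probe injected so the loop stays transport-agnostic and testable (`probe` returning per-peer tuples is an assumption of this sketch):

```python
import time

def watch_for_resolution(probe, on_all_clear, interval=60, max_polls=None):
    """Poll until every peer is up with prefixes re-populated, then
    fire on_all_clear once. probe() returns a list of
    (peer, established, prefix_count) tuples."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        peers = probe()
        if peers and all(up and pfx > 0 for _, up, pfx in peers):
            on_all_clear()   # SMS, IVR revert, Statuspage close, Slack summary
            return True
        time.sleep(interval)
    return False
```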

Where Studio earns its keep

  • The 43-alarm flood becomes one root cause in two minutes, not thirty minutes of triage.
  • Subscriber SMS, IVR, and status page update in parallel instead of waiting for whoever remembers each one.
  • The carrier ticket references the same timestamps the local logs have, which speeds the carrier’s side of the investigation.
  • The all-clear closes every external comms channel the same way it opened them — no stale status page messages at 04:00.

AI Copilot

Planning mode for the bulk SMS approval — the cost and scope need to be visible.

Procedures

Upstream outage response with region as an argument.