Systems involved
| System | Role |
|---|---|
| Zabbix / LibreNMS | Flood of subscriber and core triggers. |
| BGP looking glass | Confirm the peer outage externally. |
| SSH to core routers | Local state, show bgp summary, neighbor log. |
| Carrier trouble-ticket portal | Open a formal ticket with the upstream NOC. |
| Twilio / Bulk SMS | Subscriber outbound SMS. |
| IVR provider (Asterisk / 3CX) | Update the on-hold message. |
| Atlassian Statuspage | Public status page. |
| Slack #noc-war-room | Live operational channel. |
| Splynx / Sonar BSS | Subscriber affected-region lookup. |
Walkthrough
Triage the alarm flood
Copilot groups the Zabbix triggers by root cause — 41 of 43 are downstream symptoms of the two core BGP neighbor drops. The two that aren’t are unrelated customer-side issues.
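The grouping step above can be sketched roughly as follows. This is a minimal illustration, not Copilot's actual logic: the trigger records, the `depends_on` field, and the session names are hypothetical placeholders, and a real deployment would derive dependencies from the monitoring topology.

```python
from collections import defaultdict

def group_by_root_cause(triggers, root_causes):
    """Group trigger IDs under the failed session their host depends on.

    triggers: list of dicts with hypothetical keys "id" and "depends_on".
    root_causes: set of identifiers for the dropped BGP sessions.
    Triggers that depend on none of the root causes land in "unrelated".
    """
    groups = defaultdict(list)
    for trigger in triggers:
        cause = trigger.get("depends_on")
        key = cause if cause in root_causes else "unrelated"
        groups[key].append(trigger["id"])
    return dict(groups)

# Hypothetical sample data: two core sessions down, one unrelated alarm.
triggers = [
    {"id": "t1", "depends_on": "core1-peer"},
    {"id": "t2", "depends_on": "core2-peer"},
    {"id": "t3", "depends_on": "core1-peer"},
    {"id": "t4", "depends_on": None},
]
grouped = group_by_root_cause(triggers, {"core1-peer", "core2-peer"})
```

Anything left in the `unrelated` bucket still needs its own handling, as with the two customer-side issues here.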
Confirm upstream
Copilot runs the SSH procedure: show bgp summary on both core routers, then show log | include bgp on each. Both show the neighbor reset reason as received-from-neighbor, matching a carrier-side event. The looking glass connector confirms the carrier’s own prefix advertisements are withdrawn.
Open the carrier ticket
Through the carrier’s trouble-ticket connector (or email if there’s no API), Copilot drafts a ticket with the peer IPs, your ASN, the carrier ASN, timestamps in UTC, the two local log snippets, and a callback phone number. The ticket is opened and its number captured.
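As a rough sketch of what such a draft might contain, the snippet below assembles the listed fields into a plain-text ticket body and normalizes the event time to UTC. The class name, field names, and layout are assumptions for illustration. The IPs, ASNs, and phone number are documentation-range placeholders, and the log snippets are left as placeholders rather than invented output.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

@dataclass
class CarrierTicketDraft:  # hypothetical structure, not a connector API
    peer_ips: List[str]
    local_asn: int
    carrier_asn: int
    event_time: datetime   # must be timezone-aware
    log_snippets: List[str]
    callback_phone: str

    def render(self) -> str:
        """Return the ticket body with the event time converted to UTC."""
        if self.event_time.tzinfo is None:
            raise ValueError("event_time must be timezone-aware")
        utc = self.event_time.astimezone(timezone.utc)
        stamp = utc.strftime("%Y-%m-%d %H:%M:%S")
        lines = [
            "Peer IPs: " + ", ".join(self.peer_ips),
            f"Our ASN: AS{self.local_asn}",
            f"Carrier ASN: AS{self.carrier_asn}",
            f"Event time: {stamp} UTC",
            f"Callback: {self.callback_phone}",
            "Local log snippets:",
        ]
        lines += [f"  {snippet}" for snippet in self.log_snippets]
        return "\n".join(lines)

draft = CarrierTicketDraft(
    peer_ips=["192.0.2.1", "198.51.100.1"],        # placeholder addresses
    local_asn=64496,                                # placeholder ASN
    carrier_asn=64497,                              # placeholder ASN
    event_time=datetime(2024, 1, 1, 5, 12, tzinfo=timezone(timedelta(hours=2))),
    log_snippets=["<log snippet from core router 1>",
                  "<log snippet from core router 2>"],
    callback_phone="+1-555-0100",                   # placeholder number
)
ticket_body = draft.render()
```

Converting to UTC before the ticket is sent keeps its timestamps comparable with the local router logs, regardless of the NOC's time zone.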
Subscriber impact lookup
Query Splynx for subscribers whose last-mile service depends on the affected upstream paths. The query returns 4,812 subscribers across three regions. Copilot groups them by region and prepares the outbound SMS list.
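A minimal sketch of the per-region list preparation, assuming the lookup has already returned subscriber records. The record shape (`region`, `phone`) is hypothetical and does not reflect the Splynx API. Numbers are deduplicated so a subscriber with several services gets one SMS.

```python
from collections import defaultdict

def sms_lists_by_region(subscribers, affected_regions):
    """Return {region: sorted unique phone numbers} for affected regions.

    subscribers: iterable of dicts with hypothetical keys "region" and "phone".
    Records without a phone number are skipped.
    """
    lists = defaultdict(set)
    for sub in subscribers:
        if sub["region"] in affected_regions and sub.get("phone"):
            lists[sub["region"]].add(sub["phone"])
    return {region: sorted(phones) for region, phones in lists.items()}

# Hypothetical sample records with placeholder numbers.
subscribers = [
    {"region": "north", "phone": "+15550101"},
    {"region": "north", "phone": "+15550101"},  # duplicate service
    {"region": "south", "phone": "+15550102"},
    {"region": "east", "phone": None},          # no phone on file
    {"region": "west", "phone": "+15550103"},   # not affected
]
region_lists = sms_lists_by_region(subscribers, {"north", "south", "east"})
```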
Bulk SMS
Twilio connector sends a terse SMS: cause (upstream carrier), scope (region), ETA (updating), status page URL. Approval prompt shows the SMS count and approximate cost before send.
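The count and cost shown in the approval prompt could be estimated along these lines. This sketch assumes the message uses only GSM-7 basic characters (160 characters for a single-part message, 153 per part once concatenated). The per-segment price and the status URL are placeholders, not actual Twilio rates or a real page.

```python
import math

GSM7_SINGLE = 160  # max characters in a single-part GSM-7 message
GSM7_MULTI = 153   # characters per part once a message is concatenated

def sms_segments(text: str) -> int:
    """Segment count, assuming every character is in the GSM-7 basic set."""
    if len(text) <= GSM7_SINGLE:
        return 1
    return math.ceil(len(text) / GSM7_MULTI)

def approval_summary(text, recipient_count, price_per_segment):
    """Figures to show the operator before the bulk send is approved."""
    segments = sms_segments(text)
    total_segments = segments * recipient_count
    return {
        "recipients": recipient_count,
        "segments_per_message": segments,
        "total_segments": total_segments,
        "approx_cost": round(total_segments * price_per_segment, 2),
    }

message = (
    "Service outage in your region due to an upstream carrier fault. "
    "ETA: updating. Status: https://status.example.com"
)
# 0.01 per segment is a made-up placeholder price.
summary = approval_summary(message, 4812, 0.01)
```

Keeping the message under 160 characters keeps it to one segment per recipient, which is one reason to keep the SMS terse.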
IVR update
SSH into the Asterisk host and swap the on-hold message in the dialplan to the outage notice. Callers landing in the support queue now hear the same message as the SMS.
Publish status
Push an Identified incident to the Atlassian Statuspage with the affected components, the upstream cause, and the carrier ticket reference (not the ticket number — for security).
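For illustration, the incident payload might be assembled as below before the connector sends it. The field names follow the general shape of Statuspage's incident API as best understood here, so check them against the current API reference. The component IDs and wording are placeholders, and no network call is made.

```python
import json

def build_identified_incident(name, body, component_ids):
    """Build a request body for an incident in the "identified" state.

    Field names are an assumption about the Statuspage incident schema;
    verify them before relying on this.
    """
    return {
        "incident": {
            "name": name,
            "status": "identified",
            "body": body,
            "component_ids": list(component_ids),
        }
    }

payload = build_identified_incident(
    name="Upstream carrier outage",
    body="An upstream carrier fault has been identified. "
         "A ticket is open with the carrier.",
    component_ids=["component-id-1", "component-id-2"],  # placeholder IDs
)
request_body = json.dumps(payload)
```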
Open the war room
Copilot opens a Slack thread in #noc-war-room with the timeline, the carrier ticket, the affected subscriber count, and the last update time. Updates to the thread auto-propagate to Statuspage and the on-hold message.
Where Studio earns its keep
- The 43-alarm flood becomes one root cause in two minutes, not thirty minutes of triage.
- Subscriber SMS, IVR, and status page update in parallel instead of waiting for whoever remembers each one.
- The carrier ticket references the same timestamps the local logs have, which speeds the carrier’s side of the investigation.
- The all-clear closes every external comms channel the same way it opened them — no stale status page messages at 04:00.
Related
AI Copilot
Planning mode for the bulk SMS approval — the cost and scope need to be visible.
Procedures
Upstream outage response with region as an argument.