At 18:47 Zabbix raises 43 subscriber-facing triggers and two core BGP-session-down alerts. The ISP’s NOC needs to prove in three minutes that the problem is upstream, open a trouble ticket with the carrier, warn subscribers before the call volume spikes, and run a structured war room until the session is back.

Systems involved

Zabbix / LibreNMS: Flood of subscriber and core triggers.
BGP looking glass: Confirm the peer outage externally.
SSH to core routers: Local state, show bgp summary, neighbor log.
Carrier trouble-ticket portal: Open a formal ticket with the upstream NOC.
Twilio / Bulk SMS: Subscriber outbound SMS.
IVR provider (Asterisk / 3CX): Update the on-hold message.
Atlassian Statuspage: Public status page.
Slack #noc-war-room: Live operational channel.
Splynx / Sonar BSS: Subscriber affected-region lookup.

Walkthrough

1. Triage the alarm flood

Copilot groups the Zabbix triggers by root cause — 41 of 43 are downstream symptoms of the two core BGP neighbor drops. The two that aren’t are unrelated customer-side issues.
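The grouping logic amounts to walking each trigger's dependency link back to a root cause. A minimal sketch, assuming trigger records carry an `id`, a `name`, and a `depends_on` parent id (field names are illustrative; real Zabbix trigger dependencies are richer, and matching roots by name prefix is a stand-in for that):

```python
from collections import defaultdict

def triage(triggers):
    """Split a trigger flood into root causes, downstream symptoms,
    and unrelated noise. Root detection by name prefix is a stand-in
    for real Zabbix trigger-dependency data."""
    root_ids = {t["id"] for t in triggers if t["name"].startswith("BGP")}
    groups = defaultdict(list)
    for t in triggers:
        if t["id"] in root_ids:
            groups["root_cause"].append(t)
        elif t.get("depends_on") in root_ids:
            groups["symptom"].append(t)
        else:
            groups["unrelated"].append(t)
    return groups
```

On the flood above this yields two root causes, 41 symptoms, and two unrelated triggers, which is what the operator needs to see first.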
2. Confirm upstream

Copilot runs the SSH procedure: show bgp summary on both core routers, show log | include bgp on each. Both show the neighbor reset reason as received-from-neighbor, matching a carrier-side event. The looking glass connector confirms the carrier’s own prefix advertisements are withdrawn.
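Deciding which peers are down from the captured output is a small parsing step. A sketch against Cisco-style `show bgp summary` output, where an Established neighbor shows a received-prefix count in the last column and a down neighbor shows a state word (exact layout varies by platform and software version):

```python
def down_peers(show_bgp_summary: str):
    """Return (neighbor, state) for peers that are not Established,
    i.e. whose State/PfxRcd column is a state word, not a prefix count."""
    peers = []
    for line in show_bgp_summary.splitlines():
        cols = line.split()
        # Neighbor rows start with an IPv4 address and have >= 9 columns.
        if len(cols) >= 9 and cols[0].count(".") == 3:
            state = cols[-1]
            if not state.isdigit():  # e.g. Idle, Active, Connect
                peers.append((cols[0], state))
    return peers
```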
3. Open the carrier ticket

Through the carrier’s trouble-ticket connector (or email if there’s no API), Copilot drafts a ticket with: peer IPs, your ASN, the carrier ASN, timestamps in UTC, the two local log snippets, and a callback phone number. The ticket is opened and its number captured.
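Assembling the draft is mechanical once the evidence is collected. A sketch of the payload, with illustrative field names since every carrier portal defines its own schema:

```python
from datetime import datetime, timezone

def carrier_ticket(peer_ips, our_asn, carrier_asn, log_snippets, callback):
    """Gather the fields the carrier NOC needs: peers, both ASNs,
    UTC timestamps, local log evidence, and a callback number."""
    return {
        "subject": f"BGP sessions to AS{carrier_asn} down "
                   f"(peers {', '.join(peer_ips)})",
        "our_asn": our_asn,
        "carrier_asn": carrier_asn,
        "event_time_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "evidence": log_snippets,   # the two local log snippets, verbatim
        "callback": callback,
    }
```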
4. Subscriber impact lookup

Copilot queries Splynx for subscribers whose last-mile depends on the affected upstream paths: 4,812 subscribers across three regions. It groups them by region and prepares the outbound SMS list.
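The grouping step is a simple filter-and-count over the BSS records. A sketch, assuming each record carries hypothetical `region`, `upstream`, and `phone` fields as a Splynx/Sonar query might return them:

```python
from collections import Counter

def impact_by_region(subscribers, affected_upstreams):
    """Filter subscribers to those behind an affected upstream, then
    return per-region counts plus the flat SMS recipient list."""
    affected = [s for s in subscribers if s["upstream"] in affected_upstreams]
    counts = Counter(s["region"] for s in affected)
    recipients = [s["phone"] for s in affected]
    return counts, recipients
```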
5. Bulk SMS

The Twilio connector sends a terse SMS: cause (upstream carrier), scope (region), ETA (updating), status page URL. The approval prompt shows the SMS count and approximate cost before send.
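Two small helpers capture what the prompt shows: the message itself kept to one 160-character GSM segment, and the count-and-cost summary. The per-SMS rate is a placeholder; real Twilio pricing varies by destination:

```python
def outage_sms(region, status_url):
    """Terse outage SMS: cause, scope, where updates will appear.
    Kept under 160 chars so each recipient costs one segment."""
    msg = (f"Internet outage in {region}: upstream carrier fault. "
           f"We are on it; updates at {status_url}")
    assert len(msg) <= 160
    return msg

def approval_summary(recipients, per_sms_cost=0.05):
    """What the approval prompt surfaces before send.
    per_sms_cost is a placeholder rate, not real Twilio pricing."""
    return {"count": len(recipients),
            "approx_cost": round(len(recipients) * per_sms_cost, 2)}
```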
6. IVR update

Copilot SSHes into the Asterisk host and swaps the on-hold message in the dialplan to the outage notice. Calls landing in the support queue now hear the same message the SMS carries.
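One concrete shape for the swap, assuming the queue announcement is read from a dialplan global variable (a common pattern, not an Asterisk default); the variable name and sound-file name here are illustrative, and the commands would be run over SSH on the PBX host:

```python
def ivr_update_commands(outage_sound="outage-upstream"):
    """Asterisk CLI calls to point the hypothetical HOLD_MSG global
    at the outage announcement, then list globals to verify."""
    return [
        f'asterisk -rx "dialplan set global HOLD_MSG {outage_sound}"',
        'asterisk -rx "dialplan show globals"',  # verify the change took
    ]
```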
7. Publish status

Copilot pushes an incident in the Identified state to the Atlassian Statuspage with the affected components, the upstream cause, and an internal carrier-ticket reference (not the raw ticket number, for security).
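The request body might look like the following, shaped after Statuspage's create-incident endpoint (POST /v1/pages/{page_id}/incidents; verify field names against the current API docs). The component IDs and the ticket reference are placeholders:

```python
def statuspage_incident(component_ids, carrier_ref):
    """Body for a Statuspage create-incident call: Identified status,
    affected components marked major_outage, internal ticket ref only."""
    return {
        "incident": {
            "name": "Upstream carrier outage",
            "status": "identified",
            "body": f"Upstream BGP sessions are down; the carrier is "
                    f"engaged (internal ref {carrier_ref}).",
            "component_ids": component_ids,
            "components": {cid: "major_outage" for cid in component_ids},
        }
    }
```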
8. Open the war room

Copilot opens a Slack thread in #noc-war-room with the timeline, the carrier ticket, the affected subscriber count, and the last update time. Updates to the thread auto-propagate to Statuspage and the on-hold message.
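Because the thread is the source of truth the other channels are updated from, it helps to keep the kickoff message machine-buildable. A formatting sketch (the emoji and layout are illustrative):

```python
def war_room_kickoff(timeline, ticket_ref, affected_count, last_update):
    """Opening message for the #noc-war-room thread: timeline,
    carrier ticket, impact, and freshness in one place."""
    lines = [":rotating_light: Upstream carrier outage"]
    lines += [f"• {ts} {event}" for ts, event in timeline]
    lines.append(f"Carrier ticket: {ticket_ref} | "
                 f"Affected: {affected_count} subscribers")
    lines.append(f"Last update: {last_update}")
    return "\n".join(lines)
```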
9. Watch for resolution

Copilot polls the BGP sessions every 60 seconds. When both peers come back up and prefixes re-populate, the all-clear fires — SMS goes out, IVR reverts, Statuspage closes, Slack thread gets the resolution summary and a commitment for the post-mortem.
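The watch loop is a small state machine: poll, check that every peer is both Established and re-advertising prefixes, fire the all-clear exactly once. A sketch with the probe injected so the loop stays transport-agnostic and testable (`probe` returning per-peer tuples is an assumption of this sketch):

```python
import time

def watch_for_resolution(probe, on_all_clear, interval=60, max_polls=None):
    """Poll until every peer is up with prefixes re-populated, then
    fire on_all_clear once. probe() returns a list of
    (peer, established, prefix_count) tuples."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        peers = probe()
        if peers and all(up and pfx > 0 for _, up, pfx in peers):
            on_all_clear()   # SMS, IVR revert, Statuspage close, Slack summary
            return True
        time.sleep(interval)
    return False
```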

Where Studio earns its keep

  • The 43-alarm flood becomes one root cause in two minutes, not thirty minutes of triage.
  • Subscriber SMS, IVR, and status page update in parallel instead of waiting for whoever remembers each one.
  • The carrier ticket references the same timestamps the local logs have, which speeds the carrier’s side of the investigation.
  • The all-clear closes every external comms channel the same way it opened them — no stale status page messages at 04:00.

AI Copilot

Planning mode for the bulk SMS approval — the cost and scope need to be visible.

Procedures

Upstream outage response with region as an argument.