# Office 365 outage triage and bulk customer comms

> A PRTG sensor flips red across multiple managed customers. Confirm the upstream nature of the outage, broadcast it once, and update every affected ticket without typing the same sentence forty times.

PRTG flags the "M365 Auth Latency" sensor red on three different customer probes within five minutes. The on-call MSP engineer needs to know whether the fault lies in the customers' networks or at Microsoft and, if it is Microsoft, to get one consistent message in front of every customer before the phones start ringing.

## Systems involved

| System                       | Role                                                                          |
| ---------------------------- | ----------------------------------------------------------------------------- |
| PRTG                         | Source alarm and sensor history.                                              |
| Studio diagnostics           | Ping, traceroute, DNS, and HTTPS path checks against `outlook.office365.com`. |
| Microsoft 365 Service Health | Confirm whether Microsoft has acknowledged an incident.                       |
| Halo PSA / ConnectWise       | Bulk-update affected customer tickets.                                        |
| Microsoft Teams              | Internal `#noc` channel and customer-shared channels.                         |
| StatusPage.io                | Public status page update.                                                    |
| Gmail / Outlook              | Customer comms with technical contacts.                                       |
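
The DNS and HTTPS checks listed for Studio diagnostics can be pictured with a small standard-library sketch. This is not how Studio implements its sweep: `ProbeResult`, `classify`, `probe`, and the 2000 ms "slow" threshold are all hypothetical names and values chosen for illustration.

```python
import socket
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    host: str
    resolved: bool                # did DNS return at least one address?
    status: Optional[int]         # HTTP status code, or None if no response
    latency_ms: Optional[float]   # time until the response arrived, or None


def classify(result: ProbeResult, slow_ms: float = 2000.0) -> str:
    """Label one probe result. The default threshold is an assumed value."""
    if not result.resolved:
        return "dns-failure"
    if result.status is None:
        return "unreachable"
    if result.status >= 500:
        return "server-error"
    if result.latency_ms is not None and result.latency_ms > slow_ms:
        return "slow"
    return "ok"


def probe(host: str, timeout: float = 5.0) -> ProbeResult:
    """Resolve `host`, then time a single HTTPS GET against it."""
    try:
        socket.getaddrinfo(host, 443)
    except socket.gaierror:
        return ProbeResult(host, False, None, None)

    start = time.monotonic()
    try:
        with urllib.request.urlopen(f"https://{host}/", timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code         # a 4xx/5xx reply still proves the path works
    except (urllib.error.URLError, OSError):
        return ProbeResult(host, True, None, None)
    return ProbeResult(host, True, status, (time.monotonic() - start) * 1000)
```

Calling `classify(probe("outlook.office365.com"))` from each customer site would give one label per site. A clean DNS lookup paired with `server-error` or `slow` on every site points away from the customer networks.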

## Walkthrough

<Steps>
  <Step title="Acknowledge the PRTG alarm">
    Copilot pulls the three sensors and their history. The sensors started failing within 90 seconds of each other on three different customer probes, which makes a customer-side coincidence unlikely.
  </Step>

  <Step title="Rule out customer-network paths">
    Copilot runs a parallel diagnostic sweep: a ping and an HTTPS probe against `outlook.office365.com`, `login.microsoftonline.com`, and `graph.microsoft.com` from each customer probe via SSH. All three customers have a clean Internet path; the Microsoft endpoints respond slowly or return 5xx errors.
  </Step>

  <Step title="Check Microsoft Service Health">
    Copilot calls the Microsoft 365 Service Health connector. There is an acknowledged incident `EX{number}` for Exchange Online authentication with global scope. That settles the diagnosis: the fault is upstream.
  </Step>

  <Step title="Compose the customer message once">
    Copilot drafts a short customer-facing message: cause (Microsoft incident), scope (Exchange auth), what's affected (Outlook, OWA), what isn't (Teams chat, SharePoint), workaround (existing sessions still work), the Microsoft incident ID, and the next update time.
  </Step>

  <Step title="Bulk-update tickets in the PSA">
    The PSA connector lists every open ticket from the last 60 minutes that mentions Outlook, M365, or "email is slow." Copilot stages a bulk update with the message, links the Microsoft incident, and pauses for approval. You scan the list, untick two unrelated tickets, and approve.
  </Step>

  <Step title="Post in customer-shared Teams channels">
    For customers with a shared Teams channel, Copilot posts the same message tagged to the right contacts. The message stays pinned at the top of each channel for visibility.
  </Step>

  <Step title="Update the public status page">
    The StatusPage.io connector publishes a Monitoring incident pointing at the Microsoft outage and links the upstream Microsoft advisory.
  </Step>

  <Step title="Set a follow-up timer">
    Copilot adds a 30-minute follow-up reminder. When the timer fires, it re-checks Service Health and the PRTG sensors, then updates the same channels with progress or an all-clear.
  </Step>
</Steps>
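
The reasoning in the first three steps can be reduced to a small decision function. The sketch below is illustrative only: `SensorFailure`, `is_correlated`, `triage`, and the verdict strings are hypothetical, and the 90-second window and three-customer minimum are taken from this scenario rather than from any product default.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class SensorFailure:
    customer: str
    failed_at: datetime


def is_correlated(failures: List[SensorFailure],
                  window: timedelta = timedelta(seconds=90),
                  min_customers: int = 3) -> bool:
    """True when `min_customers` distinct customers failed within one `window`."""
    ordered = sorted(failures, key=lambda f: f.failed_at)
    for i, first in enumerate(ordered):
        # Customers whose failure falls inside the window opened by `first`.
        customers = {f.customer for f in ordered[i:]
                     if f.failed_at - first.failed_at <= window}
        if len(customers) >= min_customers:
            return True
    return False


def triage(failures: List[SensorFailure],
           internet_path_ok: Dict[str, bool],
           vendor_incident_open: bool) -> str:
    """Combine alarm timing, path checks, and vendor status into one verdict."""
    if not is_correlated(failures):
        return "investigate-per-customer"
    if not all(internet_path_ok.values()):
        return "check-customer-network"
    if vendor_incident_open:
        return "upstream-confirmed"
    return "upstream-suspected"
```

In the scenario above, the three failures land inside one window, every customer path is clean, and Microsoft has acknowledged an incident, so this sketch would return `upstream-confirmed`. Keeping `upstream-suspected` as a separate verdict covers the common case where the vendor has not yet posted an advisory.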

## Where Studio earns its keep

* One diagnostic run touches every customer probe at once — no SSH-jumping between consoles to confirm a global pattern.
* The same message reaches the PSA, Teams, and the status page with one approval, instead of forty manual posts.
* The follow-up loop is automatic: the 30-minute check happens whether you remember it or not.
* The all-clear closes every ticket and posts a final status without you composing it three times.
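
The "one message, one approval" pattern behind these points might look like the following sketch. None of it is a real connector API: `Ticket`, `matching_tickets`, `broadcast`, and the callables passed in are hypothetical stand-ins for whatever the PSA, Teams, and status page integrations expose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set

# Phrases taken from the walkthrough; matching is a plain substring test.
KEYWORDS = ("outlook", "m365", "email is slow")


@dataclass
class Ticket:
    id: int
    summary: str


def matching_tickets(tickets: Iterable[Ticket],
                     keywords: Iterable[str] = KEYWORDS) -> List[Ticket]:
    """Return tickets whose summary mentions any keyword, case-insensitively."""
    words = tuple(keywords)
    return [t for t in tickets if any(k in t.summary.lower() for k in words)]


def broadcast(message: str,
              tickets: Iterable[Ticket],
              excluded_ids: Set[int],
              approved: bool,
              update_ticket: Callable[[int, str], None],
              channels: Dict[str, Callable[[str], None]]) -> List[str]:
    """Send the same message everywhere, but only after explicit approval."""
    if not approved:
        return []                      # nothing leaves without a human sign-off
    sent: List[str] = []
    for ticket in tickets:
        if ticket.id in excluded_ids:  # tickets the reviewer unticked
            continue
        update_ticket(ticket.id, message)
        sent.append(f"ticket:{ticket.id}")
    for name, post in channels.items():
        post(message)
        sent.append(f"channel:{name}")
    return sent
```

Returning the list of destinations gives the follow-up check a record of where the all-clear needs to go. Substring matching will produce false positives, which is why the approval step with manual exclusions sits in front of the send.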

## Related

<CardGroup cols={2}>
  <Card title="AI Copilot" icon="sparkles" href="../../ai-copilot" arrow="true" cta="See modes">
    Use Planning when the bulk update needs a careful review before it goes out.
  </Card>

  <Card title="Connectors and MCP" icon="plug" href="../../connectors-and-mcp" arrow="true" cta="Wire connectors">
    How PRTG, the PSA, Microsoft Service Health, Teams, and StatusPage.io are reachable.
  </Card>
</CardGroup>
