First Checks
Before feature-specific troubleshooting, check:
- The site exists in the expected workspace.
- The site status and last heartbeat are current.
- The management VPN is connected or recently connected.
- The control plane policy still allows the required management services.
- The fault log shows the same symptom you are investigating.
- Recent scripts, workflows, or policy changes did not coincide with the issue.
Site Is Offline
- Check the site’s last heartbeat time.
- Confirm the local router has outbound internet access.
- Confirm outbound access to SDX endpoints is not blocked by an upstream firewall.
- Review recent WAN faults or ISP issues.
- If the router is reachable locally, check whether the SDX management interface still exists.
- Use Recreate Management VPN only when you have reason to believe the tunnel configuration is missing or corrupted.
The platform marks site health from router check-ins. A site can have working local LAN traffic and still appear offline if the management path is blocked.
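The check-in model can be illustrated with a simple staleness test. This is a minimal sketch, not the platform's actual logic: the function name `is_heartbeat_stale` and the five-minute cutoff are assumptions made for the example, since this guide does not state the real check-in interval.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cutoff: the real check-in interval and offline
# threshold are not stated in this guide.
MAX_HEARTBEAT_AGE = timedelta(minutes=5)


def is_heartbeat_stale(last_heartbeat, now=None, max_age=MAX_HEARTBEAT_AGE):
    """Return True when the last check-in is older than max_age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_heartbeat > max_age


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(is_heartbeat_stale(now - timedelta(minutes=2), now))   # False
    print(is_heartbeat_stale(now - timedelta(minutes=30), now))  # True
```

A stale heartbeat only shows that check-ins stopped arriving. It says nothing about whether local LAN traffic still works, which is why the management path is the first thing to verify.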
Management VPN Is Missing Or Broken
- Open the site controls.
- Run Recreate Management VPN.
- If management firewall rules are also suspect, run Recreate Management Filter.
- Watch the orchestration or site job output.
- Confirm the site returns online and that management tasks work again.
If an action fails with the error "failure: not allowed by device-mode", enable advanced mode on the device before retrying the relevant setup action. The portal surfaces the RouterOS command when this condition applies.
Device Job Or Script Failed
- Open the scheduled script, workflow run, or site orchestration log.
- Find the target site outcome rather than relying only on the parent job status.
- Check whether the script depends on a RouterOS feature not present on that device.
- Confirm the altostrat-api user and control plane policy still permit required actions.
- Retry on one test site before relaunching a broad rollout.
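The advice to look at per-site outcomes instead of the parent job status can be sketched as follows. The record shape (a `status` field on the run and a `targets` list with `site` and `status` keys) is hypothetical and chosen only for illustration; it is not the product's export format.

```python
def sites_needing_attention(run):
    """List (site, status) pairs for targets that did not succeed.

    `run` is a hypothetical dict:
    {"status": str, "targets": [{"site": str, "status": str}, ...]}
    """
    return [
        (target["site"], target["status"])
        for target in run["targets"]
        if target["status"] != "success"
    ]


example_run = {
    "status": "completed",  # the parent status alone looks healthy
    "targets": [
        {"site": "branch-01", "status": "success"},
        {"site": "branch-02", "status": "failed"},
        {"site": "branch-03", "status": "skipped"},
    ],
}

print(sites_needing_attention(example_run))
# [('branch-02', 'failed'), ('branch-03', 'skipped')]
```

In this example the parent job reads as completed even though two of the three target sites need follow-up.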
WAN Failover Is Not Switching
- Confirm each WAN tunnel has the expected interface and gateway.
- Confirm priorities are saved in the intended order.
- Check live WAN health for packet loss, jitter, latency, and tunnel status.
- Review WAN faults for offline and recovery events.
- Test during a maintenance window by changing priority order or disconnecting the primary link.
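When comparing links, it can help to reduce raw probe results to the same three numbers the health view reports. The sketch below is an illustration under stated assumptions: `summarize_probes` is a hypothetical helper, a lost probe is represented as `None`, and jitter is taken as the mean absolute difference between consecutive successful probes, which is one common definition and not necessarily the one the platform uses.

```python
from statistics import mean


def summarize_probes(latencies_ms):
    """Summarize probe latencies in milliseconds; None marks a lost probe.

    Jitter here is the mean absolute difference between consecutive
    successful probes (an assumed definition for this example).
    """
    received = [x for x in latencies_ms if x is not None]
    sent = len(latencies_ms)
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0
    latency = mean(received) if received else None
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter = mean(diffs) if diffs else 0.0
    return {"loss_pct": loss_pct, "latency_ms": latency, "jitter_ms": jitter}


samples = [20.0, 22.0, None, 30.0, 28.0]
print(summarize_probes(samples))
# {'loss_pct': 20.0, 'latency_ms': 25.0, 'jitter_ms': 4.0}
```

A link with acceptable average latency can still show high loss or jitter, so compare all three values before concluding that failover should or should not have triggered.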
Captive Portal Users Cannot Log In
- Confirm the portal instance is attached to the correct site and subnet.
- Confirm the user’s device is actually on that subnet.
- For OAuth2, confirm the auth integration is valid and the provider can be reached before login.
- For coupons, confirm the code is valid, unexpired, and not already redeemed.
- Check the session TTL and whether the user had a previous active session.
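The three coupon conditions (valid, unexpired, not already redeemed) can be expressed as an ordered check. This is a sketch only: the `Coupon` record and its field names are hypothetical and are not taken from the product's data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Coupon:
    # Hypothetical record; field names are assumptions for this example.
    code: str
    expires_at: datetime
    redeemed_at: Optional[datetime] = None


def coupon_problem(coupon, now):
    """Return the reason a coupon cannot be used, or None if it is usable."""
    if coupon is None:
        return "code not found"
    if coupon.redeemed_at is not None:
        return "already redeemed"
    if now >= coupon.expires_at:
        return "expired"
    return None


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
guest = Coupon(code="GUEST-1", expires_at=datetime(2024, 5, 1, tzinfo=timezone.utc))
print(coupon_problem(guest, now))  # expired
```

Returning the first failing reason makes it easier to tell the user whether they need a new code or simply mistyped the existing one.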
Workflow Did Not Run
- Confirm the workflow is active.
- Confirm the trigger matches the expected event type.
- Check whether the workflow authorization is still valid.
- Check vault secrets used by the failing nodes.
- Open the workflow run and inspect node-level logs.
- For workflow chaining, check for dependency validation errors or inactive target workflows.
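For chained workflows, the check for inactive or missing targets amounts to walking the chain from the workflow that failed to run. The sketch below assumes a hypothetical mapping of workflow ID to an `active` flag and a list of workflows it `calls`; the real workflow definitions and validation rules are not described in this guide.

```python
def chain_problems(workflows, start_id):
    """Walk a workflow chain and report missing or inactive targets.

    `workflows` is a hypothetical mapping:
    workflow_id -> {"active": bool, "calls": [workflow_id, ...]}
    """
    problems = []
    seen = set()
    stack = [start_id]
    while stack:
        wf_id = stack.pop()
        if wf_id in seen:
            continue  # already checked; this also prevents endless loops
        seen.add(wf_id)
        wf = workflows.get(wf_id)
        if wf is None:
            problems.append((wf_id, "missing"))
            continue
        if not wf["active"]:
            problems.append((wf_id, "inactive"))
        stack.extend(wf["calls"])
    return problems


workflows = {
    "notify": {"active": True, "calls": ["open-ticket", "page-oncall"]},
    "open-ticket": {"active": False, "calls": []},
}
print(sorted(chain_problems(workflows, "notify")))
# [('open-ticket', 'inactive'), ('page-oncall', 'missing')]
```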
When To Escalate
Escalate with these details ready:
- Workspace and site name.
- Time range of the failure.
- Relevant fault IDs or workflow run IDs.
- The affected service, such as management VPN, WAN failover, captive portal, or workflows.
- The last known successful change or run.
- Whether local router access is available.