Introduction

Microsoft recently introduced generic service health notifications for WebSocket connection issues affecting services like Copilot. One of our customers received one, which gave us the opportunity to investigate further. At first glance, the message appears very generic, without specific details such as locations or affected networks. It also seems that the Incident/Advisory Number is unique to each customer and does not correlate with the numbers other customers receive.

Example of the message the admin sees:

As mentioned, the message is quite generic, lacking specific references to locations, networks, or groups of affected individuals. For further investigation, Microsoft advises accessing the Microsoft Admin Center for additional details.

Identifying websocket issues in the Microsoft Admin Center

Looking at the Microsoft Admin Center, you can access the Network Connectivity page for overall information on connectivity topics including ISP information and network traffic. (reference: https://techcommunity.microsoft.com/discussions/deploymentnetworking/optimizing-customer-network-connectivity-for-microsoft-365-copilot/4374772)

Screenprint of Microsoft Network Connectivity menu

A new tab called Connection blockers shows a time series of connection and websocket error events across your organization for up to the last 90 days. However, in a large organization where connectivity issues occur only at certain locations, this graph can be deceiving. The users in those locations may make up too small a share of the overall user base to have a visible impact on the chart, resulting in a nearly flat line, as shown below. Looking at that graph alone, an admin cannot observe any issues.

Network Connectivity line graph showing a flat line at 0
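To illustrate why the organization-wide line can stay nearly flat, here is a minimal sketch of the arithmetic. The location names and sample counts are made up for illustration and are not taken from the customer case.

```python
# Hypothetical sample counts per location:
# (connection attempts, websocket failures). Not real customer data.
samples = {
    "Headquarters": (50_000, 50),
    "Branch office": (400, 200),
}


def failure_rate(attempts: int, failures: int) -> float:
    """Return the share of failed attempts as a percentage."""
    return 100.0 * failures / attempts if attempts else 0.0


total_attempts = sum(attempts for attempts, _ in samples.values())
total_failures = sum(failures for _, failures in samples.values())

org_rate = failure_rate(total_attempts, total_failures)
branch_rate = failure_rate(*samples["Branch office"])

print(f"Organization-wide: {org_rate:.2f}%")   # 0.50%
print(f"Branch office:     {branch_rate:.2f}%")  # 50.00%
```

In this made-up case, half of all connections at the branch office fail, yet the organization-wide figure stays around half a percent.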

This doesn’t mean, though, that you can ignore the critical websocket connectivity failures listed in the alert.

Going back to the Overview tab, there is also a ‘Take action’ section that lists insights requiring action. This feature was recently introduced, and “Websocket connection to critical Microsoft 365 domains is failing” is one of the insights that can be listed here.

By clicking on the insight, you get an overview of affected office locations:

Screenshot of menu showing affected sites when clicking on an insight

From there, you can drill down into each location to see a location summary (tab), additional details, the sample count, and a timeline of websocket connection failures.

Line graph showing impact of the insight

Relation between insights and service health notifications

During investigations with one of our customers, however, we found that the customer had received the service health notifications about websocket connectivity errors but did not see the insight on the ‘Microsoft 365 Network Connectivity’ – Take Action list.

Without that insight, you cannot quickly identify which location exceeds the Microsoft thresholds that trigger the service health notification, so you need to review each location individually.
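If you note down the websocket failure percentage shown for each location while reviewing them, a few lines of Python can rank the suspects. This is only a sketch: the location names and percentages are invented, and the 5% threshold is an assumption for illustration, since the actual thresholds Microsoft uses are not shown in the admin center.

```python
# Hypothetical failure percentages noted per location from the charts.
observed = {"Amsterdam": 0.4, "Berlin": 37.5, "Chicago": 1.2, "Dublin": 6.1}

# Assumed threshold for illustration only; not Microsoft's actual value.
THRESHOLD_PERCENT = 5.0


def suspects(rates: dict, threshold: float) -> list:
    """Return (location, rate) pairs above the threshold, worst first."""
    hits = [(name, rate) for name, rate in rates.items() if rate > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


for name, rate in suspects(observed, THRESHOLD_PERCENT):
    print(f"{name}: {rate:.1f}% websocket failures")
```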

For this customer, the location was eventually identified by simply checking a long list of suspect locations. After selecting the location where the issues occurred from the drop-down, the chart provided a clearer picture, showing a high percentage of websocket errors.

Graph showing network connectivity with websocket failures for a specific location

For this customer, something clearly happened at the beginning of April.

Knowing the name of the affected location, you can then click on the Locations tab, select the location, and open the Details tab to see the information that would otherwise have been provided by the drill-down from the insight, had it been listed:

Screenprint of failure rating

The location information shows which public IPs and ingress points are involved and gives some insight into other connections. This can be a starting point for identifying whether settings are in place that, for instance, block websocket connections to the *.cloud.microsoft domain.

The journey ends here because no further insights are provided. It would have been useful to see a bit more information, such as the affected users or the networks the users are connecting from. It is also unclear why the insight wasn’t listed on the Microsoft 365 “Network Connectivity – Take action” list for this particular customer and instance. Not having that insight on the Take Action list certainly makes it hard to identify the specific location the service health notifications relate to.

Speculating about possible reasons why no insight was listed in the Take Action list: it could be that the number of websocket failures no longer exceeded the threshold for being listed as an insight by the time the admin looked at it after receiving the service health notification. Another possibility is that the thresholds for sending a service health notification and for being listed as an insight are not the same. Either way, it would mean that retrospective research into the issue mentioned in the service health notification is nearly impossible unless you look right at the moment the notification is received, or if the threshold for being listed as an insight is never met.

How to resolve websocket problems in general

Coming back to the issue itself, the following items are some general guidelines to address issues with websocket connectivity.
 
First, ensure your network permits outbound connections to *.cloud.microsoft, *.office.com, *.static.microsoft, and *.usercontent.microsoft. These are specified in the official Microsoft IP address ranges/URLs documentation.
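One way to apply this in practice is to compare the hostnames your proxy or firewall reports as blocked against these wildcard domains. The sketch below assumes you have already extracted hostnames from your own deny log; the hostnames in the example list are illustrative, the function name is hypothetical, and the patterns are written with a leading "*." for each domain.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Wildcard domains from the guideline above.
REQUIRED_PATTERNS = [
    "*.cloud.microsoft",
    "*.office.com",
    "*.static.microsoft",
    "*.usercontent.microsoft",
]


def required_pattern_for(hostname: str) -> Optional[str]:
    """Return the required wildcard a hostname falls under, or None."""
    host = hostname.strip().lower().rstrip(".")
    for pattern in REQUIRED_PATTERNS:
        if fnmatchcase(host, pattern):
            return pattern
    return None


# Illustrative hostnames, as they might appear in a proxy deny log.
blocked = ["m365.cloud.microsoft", "example.com", "www.office.com"]
for host in blocked:
    pattern = required_pattern_for(host)
    if pattern:
        print(f"{host} is blocked but falls under required {pattern}")
```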
 
Second, ensure that your firewall or proxy does not block or interfere with WebSocket traffic. SSL/TLS inspection can also break the WebSocket handshake. If you use a proxy, check its configuration to confirm that it does not modify any headers, such as the Upgrade and Connection headers the handshake relies on.
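To make the header point more concrete: a WebSocket handshake only succeeds if the server's response arrives with status 101 and the Upgrade, Connection, and Sec-WebSocket-Accept headers intact, as defined in RFC 6455. The sketch below checks a captured response against those rules. It assumes you have captured the status code and response headers yourself (for example from a browser's developer tools or a network trace), and the function names are our own.

```python
import base64
import hashlib

# Fixed GUID that RFC 6455 defines for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(client_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a Sec-WebSocket-Key."""
    digest = hashlib.sha1((client_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def handshake_problems(status: int, headers: dict, client_key: str) -> list:
    """List the ways a captured handshake response deviates from RFC 6455."""
    h = {name.lower(): value.strip() for name, value in headers.items()}
    problems = []
    if status != 101:
        problems.append(f"expected status 101, got {status}")
    if h.get("upgrade", "").lower() != "websocket":
        problems.append("Upgrade header missing or not 'websocket'")
    tokens = [t.strip().lower() for t in h.get("connection", "").split(",")]
    if "upgrade" not in tokens:
        problems.append("Connection header missing the 'Upgrade' token")
    if h.get("sec-websocket-accept") != expected_accept(client_key):
        problems.append("Sec-WebSocket-Accept missing or incorrect")
    return problems


# Sample key and response headers taken from the example in RFC 6455.
key = "dGhlIHNhbXBsZSBub25jZQ=="
response_headers = {
    "Upgrade": "websocket",
    "Connection": "Upgrade",
    "Sec-WebSocket-Accept": "s3pPLMBiTxaQ9kYGzzhZRbK+xOo=",
}
print(handshake_problems(101, response_headers, key))  # []
```

An empty list means the response looks like a valid handshake. Any entries point at the header or status code that something on the path, possibly a proxy, changed or removed.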

TrueDEM insights?

Currently, TrueDEM does not yet provide websocket monitoring. However, given the growing demand for real-time services like Copilot, Teams, and other M365 apps, and our experience with the lack of post-mortem options, we are exploring ways to assist our customers with such needs.