WebRTC connections cause overload on servers - 2

Hello,

We opened an issue like this a couple of months ago, and we had no problems until recently, but unfortunately our client started experiencing the same problem again. I am referring to the closed issue: WebRTC connections cause overload on servers

This time we have logs and I am attaching them to this reopened issue.

If you need any other logs, we will be happy to provide them. Thank you again in advance.

Logs (2).zip (80.3 KB)

For our understanding: does the issue appear at random, or do you have a way to reproduce it consistently (or at least often)?
Looking at the logs, we were a bit lost. Do you have a note of when the issue was seen, so that we can zoom in on the right time frame in the logs?
Is the XProtect API Gateway installed on the Management Server or on its own dedicated server?

  1. The issue appears when multiple requests are made consecutively.

  2. The logs are taken right after the issue appeared.

  3. The API Gateway is installed on the Management Server.

    Thank you for your help.

We are not able to determine the problem from the current information. Please enable debug logging, reproduce the issue, and share the new log. Hopefully this will give us a clue.

To enable debug logging, modify the appsettings like this:

"Logging": {
  "LogLevel": {
    "Default": "Debug"
    …
  }
},

We enabled debug logging and hit the issue again. Please see the attached logs:

gateway.log (1.1 KB)

gateway-0002.log (976.5 KB)

gateway-0001.log (829.6 KB)

Overall, we don’t see any errors in the logs, so there is nothing to indicate that the API Gateway itself is doing something out of the ordinary.

However, we do observe some odd behavior. It appears that each camera is connected to approximately four times within the span of one second. Since there is no other communication occurring at this time, it is likely that the same client is repeatedly connecting to the same camera.

2026-03-31 12:53:34.644+03:00 [ 293] DEBUG - Session initiated for deviceId: ae7c301c-ede0-4279-b2f6-833a27e1be64 sessionId: da97b4e6-b036-4d16-8187-27650e958587 streamId: (null)
2026-03-31 12:53:34.712+03:00 [ 302] DEBUG - Session initiated for deviceId: ae7c301c-ede0-4279-b2f6-833a27e1be64 sessionId: 735f6ec1-9825-4faf-9034-5079f78a94ef streamId: (null)
2026-03-31 12:53:35.349+03:00 [ 198] DEBUG - Session initiated for deviceId: ae7c301c-ede0-4279-b2f6-833a27e1be64 sessionId: 5877fd43-de2b-4768-8914-f8a1398e35d7 streamId: (null)
2026-03-31 12:53:35.380+03:00 [ 104] DEBUG - Session initiated for deviceId: ae7c301c-ede0-4279-b2f6-833a27e1be64 sessionId: 9475b40b-2fc2-4d4f-a041-f1a3538e47e6 streamId: (null)
2026-03-31 12:53:46.946+03:00 [ 293] DEBUG - Closing session: da97b4e6-b036-4d16-8187-27650e958587
2026-03-31 12:53:46.962+03:00 [ 307] DEBUG - Closing session: 735f6ec1-9825-4faf-9034-5079f78a94ef
2026-03-31 12:53:47.743+03:00 [ 198] DEBUG - Closing session: 5877fd43-de2b-4768-8914-f8a1398e35d7
2026-03-31 12:53:47.728+03:00 [ 275] DEBUG - Closing session: 9475b40b-2fc2-4d4f-a041-f1a3538e47e6

The above is taken from gateway-0002.log, and many connections follow this same pattern. In general, it appears that connections last for around 20 seconds, which is quite short.
The server itself should have no issues handling this behavior, but the connection pattern still seems unusual.
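
To check how widespread this pattern is across a whole log file, the "Session initiated" and "Closing session" lines can be paired up by sessionId. The following is a minimal sketch, assuming the DEBUG line layout quoted above; the function and type names (`collectSessions`, `summarize`, `SessionInfo`) are hypothetical helpers and not part of any Milestone tooling.

```typescript
// Sketch only: the regular expressions assume the DEBUG line layout quoted above.
interface SessionInfo {
  deviceId: string;
  openedAt: number;              // epoch milliseconds
  closedAt: number | undefined;  // undefined until a "Closing session" line is seen
}

const OPEN_RE = /^(\S+ \S+) .*Session initiated for deviceId: (\S+) sessionId: (\S+)/;
const CLOSE_RE = /^(\S+ \S+) .*Closing session: (\S+)/;

// "2026-03-31 12:53:34.644+03:00" becomes an ISO 8601 string once the
// space between date and time is replaced with "T".
function parseTimestamp(text: string): number {
  return Date.parse(text.replace(" ", "T"));
}

// Build a sessionId -> SessionInfo map from raw log lines.
function collectSessions(lines: string[]): Map<string, SessionInfo> {
  const sessions = new Map<string, SessionInfo>();
  for (const line of lines) {
    const opened = OPEN_RE.exec(line);
    if (opened) {
      const [, stamp = "", deviceId = "", sessionId = ""] = opened;
      sessions.set(sessionId, { deviceId, openedAt: parseTimestamp(stamp), closedAt: undefined });
      continue;
    }
    const closed = CLOSE_RE.exec(line);
    if (closed) {
      const [, stamp = "", sessionId = ""] = closed;
      const session = sessions.get(sessionId);
      if (session) session.closedAt = parseTimestamp(stamp);
    }
  }
  return sessions;
}

// Count opened sessions per device and list the lifetime of every closed session.
function summarize(sessions: Map<string, SessionInfo>): {
  opensPerDevice: Map<string, number>;
  lifetimesMs: number[];
} {
  const opensPerDevice = new Map<string, number>();
  const lifetimesMs: number[] = [];
  sessions.forEach((s) => {
    opensPerDevice.set(s.deviceId, (opensPerDevice.get(s.deviceId) ?? 0) + 1);
    if (s.closedAt !== undefined) lifetimesMs.push(s.closedAt - s.openedAt);
  });
  return { opensPerDevice, lifetimesMs };
}
```

Feeding the lines of a gateway log into `collectSessions` and then `summarize` gives the number of sessions opened per device and the lifetime of each closed session, which makes it easier to compare a run that overloads the server with one that does not.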

Is the behavior we see in the logs caused by the CPU usage growing to 100%, after which the client starts to reconnect every 20 seconds?

Hello Mr. Andresen,

Thank you for looking at our problem. I would like to add some logs from a period when our CPU is not at 100%. At 2026-04-17 14:46:11.622+03:00 we initialized a couple of connections. We speculate that the sessions are released too slowly, but we would appreciate your opinion. I have also attached the code that we use to initialize a session. It is almost verbatim from the example Milestone provides on GitHub, with a few changes. Mainly, we do not use a STUN server, as we are on the same network (we do not pass one, so the code inside the “if” does not execute).

streaming.ts.zip (3.0 KB)

gateway 1.log (23.2 KB)

Thank you for your input in advance.

At the moment, we can’t tell what is going wrong—or even confirm that something is wrong—based on the logs alone.

To better understand the situation, could you briefly describe what is happening during your test? (The test the logs cover.)
For example, what kinds of actions are performed and how the system is used while the test is running (opening X cameras, closing them after Y seconds, or a similar description).

Also, when the CPU usage increases and eventually reaches 100%, which service or process is consuming the CPU?

The test consists of opening the Alarm Matrix page in a browser. The page receives alarms over a SignalR push channel and, for each “active” alarm, opens two simultaneous WebRTC sessions against the Milestone XProtect server identified by the camera’s nodeUrl:

  • one live session,

  • one playback session at the alarm’s activityDate.

Authentication uses NTLM against ${nodeUrl}/IDP/connect/token (grant_type=windows_credentials, client GrantValidatorClient); the token is cached for ~55 min. The WebRTC handshake follows the standard flow: POST /API/REST/v1/WebRTC/Session, then PATCH /Session/{id} (answer SDP).
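
For readers following along, the token caching described above could look roughly like the sketch below. It is an illustration under assumptions, not the code from the attached streaming.ts: the class name `CachedToken` and the injected `fetchToken` callback are hypothetical, and the actual request to the token endpoint is deliberately left to the caller so that no request shape is assumed here.

```typescript
// Hypothetical token cache; names and structure are illustrative only.
type TokenFetcher = () => Promise<string>;

class CachedToken {
  private token: string | undefined;
  private fetchedAt = 0;
  private pending: Promise<string> | undefined;
  private readonly fetchToken: TokenFetcher;
  private readonly maxAgeMs: number;
  private readonly now: () => number;

  constructor(
    fetchToken: TokenFetcher,
    maxAgeMs: number = 55 * 60 * 1000,      // ~55 min, as described in the post
    now: () => number = () => Date.now(),   // injectable clock for testing
  ) {
    this.fetchToken = fetchToken;
    this.maxAgeMs = maxAgeMs;
    this.now = now;
  }

  get(): Promise<string> {
    // Serve the cached token while it is younger than maxAgeMs.
    if (this.token !== undefined && this.now() - this.fetchedAt < this.maxAgeMs) {
      return Promise.resolve(this.token);
    }
    // Share one in-flight request so that sessions opened in parallel
    // do not each trigger their own token request.
    if (this.pending === undefined) {
      this.pending = this.fetchToken().then(
        (token) => {
          this.token = token;
          this.fetchedAt = this.now();
          this.pending = undefined;
          return token;
        },
        (err) => {
          this.pending = undefined;
          throw err;
        },
      );
    }
    return this.pending;
  }
}
```

Sharing the in-flight request matters in this scenario because the live and playback sessions for one alarm are opened at the same time and would otherwise both ask for a token when the cache is empty or expired.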

We do not use STUN or TURN. The RTCPeerConnection is created with iceServers: [] and the ICE-candidate GET/POST /IceCandidates/{sessionId} endpoints are not called. We rely entirely on host candidates over the local network.

Every time the active alarm changes (driven by alarms arriving from SignalR or by the operator selecting one), the previous two RTCPeerConnections are closed with pc.close() and two new ones are created against the same or another Milestone host. During the test, alarms cycle frequently, so the page opens and closes camera pairs on the order of every few seconds. After running this for a while, CPU on the Milestone host climbs toward 100% and network usage grows, filling the available bandwidth.
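
The open/close cycle described above can be sketched as follows. This is not the code from streaming.ts and makes no claim about the cause of the overload; it only illustrates one thing that may be worth ruling out when alarms cycle quickly, namely a session whose handshake finishes after a newer alarm has already taken over. All names (`AlarmSessionPair`, `SessionOpener`, `Closable`) are hypothetical, and the opener is injected so that the actual RTCPeerConnection and REST calls stay outside the sketch.

```typescript
// Hypothetical model of the live + playback pair; names are illustrative only.
interface Closable {
  close(): void;
}
type SessionKind = "live" | "playback";
type SessionOpener = (alarmId: string, kind: SessionKind) => Promise<Closable>;

const KINDS: SessionKind[] = ["live", "playback"];

class AlarmSessionPair {
  private current: Closable[] = [];
  private generation = 0;
  private readonly open: SessionOpener;

  constructor(open: SessionOpener) {
    this.open = open;
  }

  // Switch to a new alarm. Resolves to true if its sessions became the
  // active pair, false if they were superseded or could not be opened.
  async show(alarmId: string): Promise<boolean> {
    const generation = ++this.generation;
    this.closeCurrent();

    // A failed open becomes undefined so the other session can still be cleaned up.
    const settled = await Promise.all(
      KINDS.map((kind) => this.open(alarmId, kind).catch(() => undefined)),
    );
    const opened = settled.filter((s): s is Closable => s !== undefined);

    // If a newer alarm arrived while the handshake was in flight, or one of
    // the two sessions failed, close whatever was opened instead of keeping it.
    if (generation !== this.generation || opened.length < KINDS.length) {
      for (const session of opened) session.close();
      return false;
    }
    this.current = opened;
    return true;
  }

  private closeCurrent(): void {
    for (const session of this.current) session.close();
    this.current = [];
  }
}
```

With this shape, every session that gets opened is either part of the active pair or explicitly closed, regardless of how quickly `show` is called again.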

This is a good explanation and it leads us to a better understanding.

Have you looked into which service consumes most of the CPU when usage reaches 100%? If you haven’t already, please make that observation and let us know what you find.

Could you please redo the reproduction and then send the logs, this time not only from the API Gateway but also from the XProtect server in general? You can use the Diagnostics tool to gather the logs.