Disconnection of RTSP stream from AI Bridge

Hi,

I am currently working with AI Bridge v1.1 and am having disconnection problems with the RTSP stream.

My project runs in a Docker container, as does AI Bridge.

I managed to get a video stream over RTSP from our camera, but after a certain amount of time the stream times out, and it is impossible to reconnect to it without restarting the container in which AI Bridge runs.

Once this container is restarted, everything works like a charm again … until the next disconnection (which can take place hours later).

In the aibridge-streaming container, I see the following error:

```
2022-06-06T12:06:16Z [ 226080c1-0079-478d-b6b8-46452ccabce2/28dc44c3-079e-4c94-8ec9-60363451eb40 ] RTSP SETUP request using TCP handled with error: waiting for response timed out
2022-06-06T12:06:16Z [ Unknown ] RTSP session closed (not in use)
```

In the aibridge-connector container, I see this other message (I am not sure it is related, but I include it just in case):

```
2022-06-06T12:20:48Z Error receiving command from destination (address: ‘ip:port’) rpc error: code = ResourceExhausted desc = grpc: received message larger than max (2378421379 vs. 67108864)
```

The problem also occurs if I read the stream through VLC, which gives the following debug output:

```
main error: ES_OUT_SET_(GROUP_)PCR is called too late (pts_delay increased to 3957 ms)
main debug: ES_OUT_RESET_PCR called
main debug: Buffering 0%
live555 warning: no data received in 10s, eof ?
main debug: EOF reached
main debug: Stream buffering done (0 ms in 0 ms)
```

Also, our Milestone server is working correctly: no overload, enough RAM, and so on.

Unfortunately, I was not able to find a related issue in the forum. I am currently trying to upgrade AI Bridge to v1.2, hoping it may fix the problem.

Do you have any idea what could be causing the problem?

Regards,

I will try to see if I can replicate the issue in my setup. Do you see the problem with just one camera? Also let me know if it is the same with v1.2. Even if it is, the log messages might reveal more details. That is one of the changes with v1.2.

Hi John,

I still have the problem with v1.2. I checked this morning and we had more than one camera, so we will try with only one.

Moreover, I noticed that the crash happened at the same time on two different days (17h51 UTC), so we are investigating our server installation.

As for the logs, I checked those of our program: it starts losing the RTSP stream at 17h23 UTC, then tries to reconnect (I'm not 100% sure, but it seems to work for a few frames before disconnecting again) until 17h51 UTC, when I get a 5XX server error without any more context, and this continues until I restart the AI Bridge containers.

In the aibridge-connector container, we still have the same gRPC error as in my first message, but only until 16h25 UTC, so it does not seem to be linked to our problem.

In the aibridge-streaming container, we again have the same TCP error, but it has been appearing from the time I started the container until now, so I do not think it is linked either.

I checked the logs of all the other AI Bridge containers and found nothing relevant.

I have been running several tests and I have also observed that after several hours I get an unexpected disconnect. The logs do not reveal what / why it failed. I can re-connect without problems (not restarting containers), but the sudden disconnect is still something I want to investigate further. I think the problem you see could be related to this. I will continue to test it and add more log messages. Since my test has to run for 5-9 hours before the issue appears, it might take a bit of time before I have a conclusion.

Hi John,

An update on the situation: we cleaned up our Milestone server, cleared the video recordings, removed all cameras and kept only one.

I am running the stream at the moment. I have not had a crash yet, but I get an error from ffmpeg every minute:

```
Error Immediate exit requested
```

Also, the stream through VLC is not always stable, with freezes of a few seconds.

I also talked with a colleague, and he had the same crashes as I did (but some days earlier), at the same time of day (still 17h51 UTC). Do you know if something happens at a fixed time of day in AI Bridge?

We request status information from the recording server once every hour. However, this schedule runs relative to when AI Bridge starts up, so if you consistently see the problem at 17h51, I am not sure it is related.
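
One way to check this on a given setup is to compare the failure time with the container start time. The following is a minimal Python sketch, not AI Bridge code: the function name `offset_from_hourly_tick` is hypothetical, and the start time in the example is an assumed value chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def offset_from_hourly_tick(start: datetime, event: datetime) -> timedelta:
    """Return the distance between `event` and the nearest whole-hour tick counted from `start`."""
    period = timedelta(hours=1)
    elapsed = (event - start) % period  # time since the last hourly tick
    return min(elapsed, period - elapsed)


# Illustrative values only: the start time is an assumption, not taken from real logs.
start = datetime(2022, 6, 6, 9, 12, 0, tzinfo=timezone.utc)
event = datetime(2022, 6, 6, 17, 51, 0, tzinfo=timezone.utc)
print(offset_from_hourly_tick(start, event))  # 0:21:00
```

If the offset stays close to zero across several restarts with different start times, the hourly request becomes a plausible suspect; a large or varying offset points elsewhere.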

I might have found what is causing the issue where you have to restart the containers. It turns out that some of the HTTP requests that AI Bridge sends to XProtect are configured with no timeout. In case of a network error (just a glitch), this can cause them to hang forever instead of failing as I would have expected. To reproduce the problem, I had 10 clients connect and disconnect every second. While doing so, I forced a network disconnect (unplugged the cable for a short while), and in this way I could sometimes get it to hang. In the debugger, it was clear that a lot of I/O threads were blocked forever, which also prevented new connections from being made. With the timeout introduced, all threads are cleaned up properly after a network error and reconnect, and I no longer see the hang.
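
To illustrate the general failure mode (this is not the actual AI Bridge code), here is a minimal Python sketch using only the standard library. The helper names `fetch_status` and `open_silent_server` are hypothetical; the silent server stands in for a peer that stops answering after a network glitch.

```python
import http.client
import socket


def fetch_status(host: str, port: int, path: str = "/", timeout_s: float = 5.0) -> int:
    """Send a GET request and return the HTTP status code.

    The timeout applies to connecting and to each blocking socket operation,
    so a silent peer makes the call fail instead of blocking the thread forever.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()  # always release the socket, even after a timeout


def open_silent_server() -> socket.socket:
    """Listen on a free local port without ever accepting or replying.

    The kernel completes the TCP handshake, but no response is ever sent,
    which imitates a peer that went quiet in the middle of a request.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    return srv
```

Calling `fetch_status("127.0.0.1", port, timeout_s=0.5)` against the silent server raises `socket.timeout` after about half a second. With `timeout_s=None` the same call would block indefinitely, which matches the blocked threads described above.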

I will test it some more and expect to make the fix available during next week as part of v1.3 of AI Bridge.

AI Bridge v1.3 is now available on NGC.

Hi John,

An update on the situation: I did some testing with the new version of AI Bridge over the last few days.

AI Bridge v1.3 fixed half of the problem. I still get the disconnections from the camera, but I no longer get a 5XX error from the Milestone server and I don't need to restart all the AI Bridge services. Instead, I just restart our program and everything is back to normal.

I am currently investigating our code, to be sure we correctly reinitialize the whole decoding context after a timeout error.

I have not yet tested with VLC instead of our program (I thought about it too late).

Still, it is strange that we get a timeout from our camera during the night (the exact time seems to vary, but I cannot draw a conclusion yet).

Also, with v1.3, I do not have any new debug messages. As a reminder, these were the messages:
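
The pattern being checked can be sketched as follows. This is a simplified Python illustration rather than our actual libav-based code: `open_stream`, `StreamTimeout` and `frames_with_reconnect` are hypothetical names, and the exponential backoff is an assumption added for the example.

```python
import time
from typing import Callable, Iterator


class StreamTimeout(Exception):
    """Raised by the (hypothetical) stream reader when no data arrives in time."""


def frames_with_reconnect(
    open_stream: Callable[[], Iterator[bytes]],
    max_retries: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield frames, reopening the stream from scratch after each timeout."""
    failures = 0
    while True:
        try:
            # A brand-new stream (and decoding context) per attempt:
            # nothing is reused from the session that timed out.
            for frame in open_stream():
                failures = 0  # a received frame resets the retry counter
                yield frame
            return  # the stream ended cleanly
        except StreamTimeout:
            failures += 1
            if failures > max_retries:
                raise  # give up after too many consecutive failures
            sleep(base_delay_s * 2 ** (failures - 1))  # exponential backoff
```

The key point is that `open_stream` is called again on every attempt, so any state left over from the failed session is discarded rather than reused.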

From streaming service:

23:47:21 RTSP session closed (tear down by [ip of the program which run our program])

From connector service:

23:49:07 rpc error: msg larger than max (from [ip inside the docker network])

And from our program:

a timeout message, since we received an EOF error from the decoder (we use libav from ffmpeg).

After more testing, I suspect the remaining problem is on the AI Bridge side, unfortunately.

I managed to simulate the camera timeout that triggers the reconnection process of our program. When our program is connected to a Milestone stream served via AI Bridge, the reconnection fails and I get timeout after timeout.

But when our program is connected to another stream (created with rtsp-simple-server: https://github.com/aler9/rtsp-simple-server), the reconnection succeeds.

If you cannot reproduce it, I can take the time to create a minimal example of the problem.

I will set up more tests and see if I can reproduce it. Do you run AI Bridge with docker-compose or Kubernetes?

Hi John!

We finally found the bug, and it was on our side this time.

I will be running tests over the next few days and will let you know here if anything happens.