What is the best way to bring two-way audio?

Hi Pablo and Lautaro,

What Bo already proposed probably makes sense.

Mobile Server (MoS) was created exactly for the purpose of converting audio and video into a Web/Mobile-friendly form.

Using the raw media directly is much harder.

You are right that MoS is harder to scale, although not impossible.

If you give it a try, you could use MoS Direct Streaming for Live Video. It packs raw H.264 or H.265 data into MP4 chunks, which are then playable either directly in the browser or in native players (loading them one after another in the player). And since this is not transcoding, the throughput is significantly higher - let's say 180 Full HD streams per i7 (Gen 4 to 8) based machine.

For audio, MoS can transcode to PCM or MP3. These two can usually be loaded directly into players (both Web and native ones). Audio transcoding is not as CPU-intensive as video transcoding. You also have the option to use MoS for audio only. Unfortunately I cannot give you any performance numbers for MoS audio transcoding.

As for protocol integration and using raw audio streams directly - that could become harder, especially if you have no experience with it.

You will get the audio exactly as it is received from the camera. First you will need to decode it to PCM (in most cases it is not PCM, but G.711, AAC, etc.). After that you may want to adjust some of the audio parameters (sample rate, bits per sample, number of channels) and then encode it to whatever you think is playable in your player (like OGG in your example).
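To illustrate the last step of that pipeline (packaging already-decoded PCM so a player will accept it), here is a minimal sketch using only Python's standard `wave` module. The sample values are made up purely for the example; the decode step itself is assumed to have happened already.

```python
import io
import struct
import wave

# Already-decoded 16-bit signed PCM samples (illustrative values only).
samples = [0, 1000, -1000, 32767, -32768]
pcm_bytes = struct.pack("<%dh" % len(samples), *samples)

# Wrap the raw PCM in a WAV container so a standard player can load it.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit PCM
    w.setframerate(8000)   # G.711 sources are 8 kHz
    w.writeframes(pcm_bytes)

wav_bytes = buf.getvalue()  # ready to save, or base64-encode for playback
```

The same idea applies whatever language you use: the container metadata (channels, sample width, rate) must match the PCM you actually produced.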

Most probably you will need third party libraries for that.

For decoding to PCM, one of the Milestone Toolkits (C++, Windows) could be used.

Third party libraries that come to my mind (C++, cross platform):

  • Live555
  • ffmpeg

You can use them directly or extract some code from them.

For Windows:

  • NAudio (.NET)
  • DirectShow
  • Windows Media Foundation
  • DirectSound (presentation only)

In short - not a trivial task, with several different possibilities.

Hi Petar, thanks for your time.

Currently we have the audio in G.711 format. Our big question is: are the Milestone Toolkits the only option to decode it (convert it to PCM, for example)?

Regards.

Hi Lautaro,

Not at all.

G.711 is a pretty simple codec.

Almost every audio library will be able to decode it (to PCM).

Actually, it is so simple that you could implement it yourself in a reasonable time.

Just look at

https://en.wikipedia.org/wiki/G.711

or

https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-G.711-198811-I!!PDF-E&type=items
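For reference, a direct implementation of the u-law expansion described in those documents could look like this. This is a sketch in Python; it assumes u-law (A-law uses different constants), and the bit manipulation follows the classic G.711 decode: undo the bit inversion, rebuild the magnitude from the 3-bit exponent and 4-bit mantissa, then apply the sign.

```python
def ulaw_to_pcm16(byte):
    """Decode one G.711 u-law byte to a 16-bit signed PCM sample."""
    u = ~byte & 0xFF                     # u-law bytes are transmitted bit-inverted
    # Rebuild the magnitude: mantissa in bits 0-3, exponent in bits 4-6,
    # with the 0x84 bias added before and removed after the shift.
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
    # Bit 7 is the sign.
    return (0x84 - magnitude) if (u & 0x80) else (magnitude - 0x84)
```

For example, `ulaw_to_pcm16(0xFF)` gives 0 (digital silence), and the extreme codes 0x80/0x00 map to +32124/-32124.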

Hi Petar, thanks for your time, it’s a great help!

I implemented a library and I was able to hear the audio, except that the packets arrive faster than I can play them back; that is, before I finish playing one, the next one has already arrived and they overlap.

I currently play it back like this:

Well it depends.

First you have to make sure there is no mismatch between the sample rates of the received and the played audio.

After that you could try different strategies:

  • Push the data into a queue and dequeue (resp. present) each chunk once the previous one has finished. If the behavior you see is caused only by network jitter, the queue size should not grow in the long term.
  • If the queue size grows constantly, you could try to reset it periodically - for example, when it holds more than 2 seconds of audio. This way you will lose some data periodically, but you will have no long-term delay. And if the primary use of this audio is speech, that may be acceptable.
  • A more sophisticated approach is to measure the rate of data accumulation - say X samples per second - and then implement logic that drops that many samples spread over smaller intervals. In the previous example: (X / 10) samples every (1 / 10) second (resp. 100 ms).
    • The easiest method is to throw those samples away. This will cause some slight popping (crackling) of the audio.
    • The higher-quality method is to “blend” those samples into the following ones with an audio mixing effect like a cross-fade. This temporarily changes the audio “pitch” and is almost unnoticeable for speech.
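The first two strategies (a playout queue plus a periodic reset) might be sketched like this. The class name, `max_seconds` threshold, and chunk representation are illustrative, not from any real API:

```python
from collections import deque

class JitterBuffer:
    """Toy playout queue: resets when more than max_seconds of audio piles up."""

    def __init__(self, sample_rate=8000, max_seconds=2.0):
        self.max_samples = int(sample_rate * max_seconds)
        self.queued_samples = 0
        self.chunks = deque()

    def push(self, chunk):
        # Reset if too much audio has accumulated; loses some data
        # but bounds the long-term delay (acceptable for speech).
        if self.queued_samples + len(chunk) > self.max_samples:
            self.chunks.clear()
            self.queued_samples = 0
        self.chunks.append(chunk)
        self.queued_samples += len(chunk)

    def pop(self):
        # Called by the playback side once the previous chunk has finished.
        if not self.chunks:
            return None
        chunk = self.chunks.popleft()
        self.queued_samples -= len(chunk)
        return chunk
```

If the queue depth stays bounded over time, the overlap you saw was just jitter; if it keeps hitting the reset, the source is genuinely faster than the playback clock and the sample-dropping strategy is needed.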

Of course, it could turn out that I’ve completely misunderstood your question :)

Hi Petar! Good Day.

As you mentioned, I need to make adjustments to the audio, as I only hear distorted noise. I’ll detail my steps; it may help to see where the error is.

First, for each response I trim off the header, which leaves me the audio packet; then I assemble a buffer from all the audio packets (in bytes).

Second, once I have the full buffer, I create an audio file with the following characteristics (I receive G.711 mu-law):

numChannels = 1, sampleRate = 8000, bitDepth = 8-bit mu-Law, buffer. Lastly, I decode that mu-Law file and get it in Base64 to play it.

Library used: (https://www.npmjs.com/package/wavefile#create-wave-files-from-scratch)

thank you very much.

From what you describe, it seems to me you are missing the step of decoding the G.711 to raw PCM.

First you have to determine which encoding you have (u-law or A-law). After that, expand those compressed 8 bits per sample of G.711 to a 14- or 13-bit signed integer (respectively, per law type). Those 14 or 13 bit integers should be put into a standard 16-bit signed integer, preserving the sign. (Alternatively you could convert to unsigned integers, but that would be harder.) So at the end you should have PCM with bitDepth = 16.

There is a nice free program that deals with different audio formats, called “Audacity”. I would encourage you to try loading and converting a simple dumped buffer in it. You could also use it to compare your “algorithms” and conversions against what the program generates.

Hint - the most performant implementation of such a conversion (G.711 8-bit to 16-bit signed integer) is a look-up table.
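A sketch of that look-up table approach in Python (u-law assumed; A-law would need its own expansion and table). Since a G.711 byte has only 256 possible values, every decoded sample can be precomputed once, and decoding a buffer becomes one table lookup per byte:

```python
def _ulaw_to_pcm16(byte):
    # Standard u-law expansion (see the G.711 spec / Wikipedia article).
    u = ~byte & 0xFF
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
    return (0x84 - magnitude) if (u & 0x80) else (magnitude - 0x84)

# Precompute all 256 possible inputs once.
ULAW_TABLE = [_ulaw_to_pcm16(b) for b in range(256)]

def decode_ulaw_buffer(data):
    """Decode a bytes object of u-law samples to a list of 16-bit ints."""
    return [ULAW_TABLE[b] for b in data]
```

The table costs 256 entries of memory and removes all bit manipulation from the per-sample path, which is why it is the usual production implementation.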

Hi Petar!

Some of the clarifications were not very clear to me.

When I receive the G.711 mu-law data, I put it together in a WAV file with 8-bit depth, 8000 Hz and one channel. Then I decode that 8-bit file into a 16-bit one. Finally, I obtain its representation in base64 to play it.

Am I skipping a step?

What does “Those 14 or 13 bit integer should be put in standard 16 bit signed integer, preserving the sign” mean?

Regards.

Well,

I haven’t worked with this particular API/Lib so I cannot comment on that.

Okay, I understand that you do not know the library, but these doubts are about audio in general:

  1. Am I skipping a step?
  2. What does “Those 14 or 13 bit integer should be put in standard 16 bit signed integer, preserving the sign” mean?
  • I cannot be sure the steps are correct and in the correct order.

If I had been creating it, I would probably have done it in a completely different manner.

  • Well, if you directly place a 14-bit integer into 16 bits, it will be interpreted as unsigned and yield a wrong value. The easiest way is to left-shift it by 2 bits. This preserves the sign (in the case of one’s or two’s complement representation).
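A tiny Python illustration of why the left shift preserves the sign; the sample value is arbitrary. A 14-bit two’s-complement sample holds -8192..8191, so multiplying by 4 (shifting left by 2) moves its sign bit into bit 15 and keeps the result inside the 16-bit signed range:

```python
import struct

sample_14 = -1000                 # a negative sample in 14-bit range

# Shift into the full 16-bit range; the sign is preserved.
sample_16 = sample_14 << 2
assert -32768 <= sample_16 <= 32767

# Packing and unpacking as little-endian signed 16-bit shows the
# value round-trips with its sign intact.
packed = struct.pack("<h", sample_16)
(unpacked,) = struct.unpack("<h", packed)
```

Had the 14-bit pattern been copied into 16 bits without the shift (or without sign extension), the high bits would read as zero and a negative sample would come out as a large positive number, which is exactly the distorted-noise symptom described above.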