Hi Pablo and Lautaro,
What Bo already proposed probably makes sense.
Mobile Server (MoS) was created exactly for the purpose of converting audio and video into a Web/Mobile-friendly format.
Using the raw media directly is much harder.
You are right that MoS is harder to scale, although not impossible.
If you give it a try, you could use MoS Direct Streaming for Live Video. It packs raw H.264 or H.265 data into MP4 chunks that can then be played either directly in the browser or in native players (by loading them one after another in the player). And as this is not transcoding, the throughput is significantly higher - let's say 180 Full HD streams per i7 (Gen 4 to 8) based machine.
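To get a feeling for the scaling side, here is a rough back-of-the-envelope sketch. It only assumes the ballpark figure above (about 180 Full HD streams per machine); the function name and the example stream count are made up for illustration, and real numbers will depend on bitrate, hardware and load.

```python
import math

# Assumption: the rough figure quoted above, not a measured guarantee.
STREAMS_PER_MACHINE = 180


def machines_needed(total_streams: int, per_machine: int = STREAMS_PER_MACHINE) -> int:
    """Estimate how many MoS machines are needed for a number of live streams."""
    if total_streams < 0 or per_machine <= 0:
        raise ValueError("total_streams must be >= 0 and per_machine must be > 0")
    # Round up: a partially loaded machine is still a whole machine.
    return math.ceil(total_streams / per_machine)


if __name__ == "__main__":
    print(machines_needed(500))  # 3
```

You would probably want to leave some headroom on top of such an estimate.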
For audio, MoS can transcode to PCM or MP3. Both can usually be loaded directly in players (both Web and native ones). Audio transcoding is not as CPU intensive as video transcoding, so you have the option of using MoS only for audio. Unfortunately, I cannot give you any performance numbers for MoS with audio transcoding.
As for the protocol integration and using raw audio streams - that could become harder, especially if you do not have experience with it.
This is because you will get the audio exactly as it is received from the camera. You will first need to decode it to PCM (in most cases it is not PCM, but G.711, AAC, etc.). After that you may want to adjust some of the audio parameters (sample rate, bits per sample, number of channels), and then finally encode it to whatever is playable in your player (like OGG in your example).
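Just to illustrate the steps above, here is a minimal sketch in plain Python. It assumes the camera sends G.711 mu-law, mono, 8 kHz (your cameras may send A-law or AAC instead), uses a very naive linear interpolation for the sample rate change, and writes a WAV container as a stand-in for the final encoding step, since OGG encoding needs a third-party library. All function names are my own, not part of any Milestone API.

```python
import io
import struct
import wave

BIAS = 0x84  # bias constant used by G.711 mu-law


def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~u & 0xFF  # mu-law bytes are stored bit-inverted
    magnitude = (((u & 0x0F) << 3) + BIAS) << ((u & 0x70) >> 4)
    return BIAS - magnitude if (u & 0x80) else magnitude - BIAS


def decode_ulaw(data: bytes) -> list[int]:
    """Step 1: decode the received audio to PCM."""
    return [ulaw_byte_to_pcm16(b) for b in data]


def upsample_x2(samples: list[int]) -> list[int]:
    """Step 2a: double the sample rate (naive linear interpolation)."""
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)
    return out


def mono_to_stereo(samples: list[int]) -> list[int]:
    """Step 2b: change the channel count by duplicating each sample."""
    out = []
    for s in samples:
        out.extend((s, s))
    return out


def to_wav_bytes(interleaved: list[int], sample_rate: int, channels: int) -> bytes:
    """Step 3 (stand-in): wrap 16-bit PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 16 bits per sample
        w.setframerate(sample_rate)
        w.writeframes(struct.pack("<%dh" % len(interleaved), *interleaved))
    return buf.getvalue()


if __name__ == "__main__":
    raw = bytes([0xFF, 0x00, 0x80])  # three mu-law bytes as an example
    pcm = decode_ulaw(raw)
    wav = to_wav_bytes(mono_to_stereo(upsample_x2(pcm)), 16000, 2)
    print(pcm, len(wav))
```

For real use you would replace the interpolation with a proper resampler and the WAV step with a real encoder from one of the libraries.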
Most probably you will need third-party libraries for that.
For decoding to PCM you could use one of the Milestone Toolkits (C++, Windows).
Third-party libraries that come to mind (C++, cross-platform):
- Live555
- ffmpeg
You can use them directly or extract some code from them.
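If you go the ffmpeg route, the simplest way is often to call the command line tool instead of linking the libraries. A small sketch, assuming raw G.711 mu-law input at 8 kHz mono and an ffmpeg build that includes the libvorbis encoder; the file names and function names are hypothetical.

```python
import shutil
import subprocess


def build_ffmpeg_cmd(src: str, dst: str, in_rate: int = 8000, in_channels: int = 1) -> list[str]:
    """Build an ffmpeg command that reads raw mu-law audio and writes Ogg Vorbis."""
    return [
        "ffmpeg", "-y",
        "-f", "mulaw",            # input is headerless G.711 mu-law
        "-ar", str(in_rate),      # input sample rate (assumed 8 kHz)
        "-ac", str(in_channels),  # input channel count (assumed mono)
        "-i", src,
        "-c:a", "libvorbis",      # needs an ffmpeg build with libvorbis
        dst,
    ]


def transcode(src: str, dst: str) -> None:
    """Run ffmpeg; raises if ffmpeg is missing or the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg executable not found on PATH")
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)


if __name__ == "__main__":
    # Hypothetical file names, only to show the resulting command line.
    print(" ".join(build_ffmpeg_cmd("camera_audio.ulaw", "camera_audio.ogg")))
```

For live audio you would feed ffmpeg through a pipe instead of a file, but the options stay the same.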
For Windows:
- NAudio (.NET)
- DirectShow
- Windows Media Foundation
- DirectSound (presentation only)
In short - not a trivial task, and there are several possible approaches.