Live Audio Comparison: System Architecture

Live audio is audio that a user can stream as soon as it is generated, whether it is commentary, songs, or concerts. Several platforms offer live audio services or are working towards them, including iTunes, Spotify, Amazon Music and many more. But before the audio is made available for streaming, it has to meet certain criteria to ensure a good customer experience.

When a new music album is launched or a live concert is streamed, the audio is very high quality: lossless, with no glitches anywhere in the stream, what you may call “The Original”. But making it available to customers raises several issues, and the major one is the Internet bandwidth on the user’s device. Streaming high-quality audio requires high network bandwidth, and glitches while streaming are always a big turn-off. To reach customers on all types of devices, these platforms transcode the audio streams into lower bitrates and different audio formats for faster and more reliable streaming.

Why do we need Audio Comparison?

While transcoding an audio stream, there is a chance that the output stream contains glitches such as noise or silence. They may occur at the start, at the end, or even in the middle of the stream. These glitches need to be detected and removed; otherwise they cause a poor customer experience. Having a dedicated team listen for them is not feasible. A better solution is to build a service that compares the original stream (source) with the transcoded stream chunks (target) and provides a similarity coefficient between the two. If the coefficient is below a certain threshold, the source and target chunks have very little similarity, which might be because of glitches introduced during transcoding. For more info, refer here.
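To make the threshold idea concrete, here is a minimal sketch. It is not the similarity measure from the linked post: it swaps in plain cosine similarity over raw samples as a stand-in, and the names `similarity`, `has_glitch` and the value of `SIMILARITY_THRESHOLD` are assumptions made for illustration.

```python
import math

# Hypothetical cut-off; a real value would be tuned against known-good streams.
SIMILARITY_THRESHOLD = 0.8


def similarity(source, target):
    """Cosine similarity of two sample sequences, in the range [-1.0, 1.0]."""
    n = min(len(source), len(target))
    if n == 0:
        return 0.0
    src, tgt = source[:n], target[:n]
    dot = sum(s * t for s, t in zip(src, tgt))
    norm_s = math.sqrt(sum(s * s for s in src))
    norm_t = math.sqrt(sum(t * t for t in tgt))
    if norm_s == 0.0 or norm_t == 0.0:
        # At least one side is pure silence: similar only if both are silent.
        return 1.0 if norm_s == norm_t else 0.0
    return dot / (norm_s * norm_t)


def has_glitch(source, target, threshold=SIMILARITY_THRESHOLD):
    """Flag a target chunk whose similarity to the source is below threshold."""
    return similarity(source, target) < threshold
```

A quieter but otherwise identical chunk still scores close to 1.0 with this measure, while a chunk that was replaced by silence scores 0.0 and gets flagged.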


Assumptions:

  1. The source (original) stream comes in as an RTMP stream.
  2. The transcoding service creates a chunk from the source stream for every time interval, CHUNK_SIZE.
  3. The database is updated in real time when the source stream starts and whenever a transcoded chunk is generated.

Service Architecture:

Audio Comparison Service Architecture

Add a listener to the database to capture update messages on the corresponding DB tables. As soon as a stream-start message arrives, capture the RTMP stream URL and start recording. Several tools can record an RTMP stream; for our architecture, FFMPEG should do the trick. Record the stream in WAV format. This format allows slicing the audio file from any given point in time, although it takes up a large amount of disk space.
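As a rough sketch of the recording step, the snippet below builds an FFMPEG command that reads an RTMP URL and writes 16-bit PCM WAV. The function names, the sample rate and the channel count are assumptions for illustration; the FFMPEG options themselves (`-i`, `-vn`, `-acodec`, `-ar`, `-ac`, `-y`) are standard ones.

```python
import subprocess


def build_record_command(rtmp_url, wav_path, sample_rate=44100, channels=2):
    """Build the FFMPEG argument list that records an RTMP stream as WAV."""
    return [
        "ffmpeg",
        "-y",                    # overwrite an existing output file
        "-i", rtmp_url,          # input: the live RTMP stream
        "-vn",                   # ignore any video track
        "-acodec", "pcm_s16le",  # uncompressed 16-bit PCM, i.e. plain WAV audio
        "-ar", str(sample_rate),
        "-ac", str(channels),
        wav_path,
    ]


def start_recording(rtmp_url, wav_path):
    """Start FFMPEG in the background and return the process handle."""
    return subprocess.Popen(build_record_command(rtmp_url, wav_path))
```

The listener would call `start_recording` once for each stream-start message and keep the returned handle so the recording can be stopped when the stream ends.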

Once the source stream is being recorded, as soon as the first transcoded chunk arrives, create a new target file on disk for storing all the incoming transcoded chunks. The format for storing the chunks is the same as for the source, WAV. Keep appending chunks to the target file as soon as they are available.
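One possible way to keep a single growing target file is sketched below with Python's standard `wave` module. It assumes each transcoded chunk has already been decoded to a WAV file with the same sample rate, channel count and sample width as the target (for example by FFMPEG); the class name `TargetFile` is hypothetical.

```python
import wave


class TargetFile:
    """A single growing WAV file that decoded chunks are appended to."""

    def __init__(self, path, sample_rate=44100, channels=2, sample_width=2):
        self._writer = wave.open(path, "wb")
        self._writer.setnchannels(channels)
        self._writer.setsampwidth(sample_width)
        self._writer.setframerate(sample_rate)

    def append_chunk(self, chunk_wav_path):
        """Append all frames of one decoded chunk to the target file."""
        with wave.open(chunk_wav_path, "rb") as chunk:
            same_format = (
                chunk.getnchannels() == self._writer.getnchannels()
                and chunk.getsampwidth() == self._writer.getsampwidth()
                and chunk.getframerate() == self._writer.getframerate()
            )
            if not same_format:
                raise ValueError("chunk format differs from target format")
            frames = chunk.readframes(chunk.getnframes())
        # writeframes also corrects the frame count stored in the WAV header.
        self._writer.writeframes(frames)

    def close(self):
        self._writer.close()
```

The format check is there because mixing sample rates or channel counts in one file would silently distort every comparison that follows.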

When the first transcoded chunk comes in, start the comparison service. When the service starts, it spawns a thread that handles all updates on the comparison side. A reference to a ComparisonResult object is returned before the thread starts its work, and the thread is responsible for updating the values in that object. This decouples the two components: one is responsible for storing the source and target files, and the other is responsible for comparing them.
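The hand-off can be sketched roughly as follows. Only the name ComparisonResult comes from the design above; its fields, the `start_comparison` function and the `windows` and `compare` parameters are assumptions made for this example.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ComparisonResult:
    """Shared object: written by the comparison thread, read by the caller."""
    coefficients: list = field(default_factory=list)
    finished: bool = False
    lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def add(self, coefficient):
        with self.lock:
            self.coefficients.append(coefficient)

    def snapshot(self):
        """Return a copy of the coefficients produced so far."""
        with self.lock:
            return list(self.coefficients)


def start_comparison(windows, compare):
    """Create the result object, start the worker, return both immediately."""
    result = ComparisonResult()

    def worker():
        for window in windows:
            result.add(compare(window))
        with result.lock:
            result.finished = True

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return result, thread
```

The caller can poll `result.snapshot()` at any time for the latest coefficients, while the recording and appending code never has to wait for a comparison to finish.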

Before comparing the source and target files, the offset between them needs to be set to get the best results. The offset may be caused by silence or by a delay in transcoding the source stream. To calculate the offset, both files must contain a certain duration of audio content; otherwise the offset might be erroneous. That initial delay depends on the algorithm used to calculate the fingerprints of the audio files. For more details on audio fingerprinting and offset calculation, refer here.
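To illustrate what "calculating the offset" means, here is a small sketch that uses brute-force cross-correlation on raw samples. This is a deliberately simple stand-in and not the fingerprint-based method referred to above; the function names and the `max_lag` parameter are assumptions.

```python
def estimate_offset(source, target, max_lag):
    """Return the lag, in samples, at which target lines up best with source.

    A lag of k means target[i] corresponds to source[i + k].
    """
    best_lag = 0
    best_score = float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        overlap = 0
        for i, sample in enumerate(target):
            j = i + lag
            if 0 <= j < len(source):
                total += source[j] * sample
                overlap += 1
        if overlap == 0:
            continue
        # Average over the overlap so shorter overlaps are not penalised.
        score = total / overlap
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def lag_to_seconds(lag, sample_rate):
    """Convert a lag in samples to seconds."""
    return lag / sample_rate
```

This also shows why a minimum duration of audio is needed first: with only a handful of samples, many lags score about the same and the estimate becomes unreliable.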

Workflow Diagram: Audio Comparison

Once the initial offset is set, the thread starts generating comparison results for each new chunk appended to the target file and the corresponding part of the source file. These results are written to the ComparisonResults object reference that was returned when the thread started, which provides real-time updates.
This comparison is done every MINIMUM_COMPARISON_INTERVAL. Each time, use FFMPEG to slice the source and target files from the current analyzed length to the current analyzed length + MINIMUM_COMPARISON_INTERVAL. Calculate the similarity between the two new files and update the ComparisonResults reference object. For calculating the similarity between audio files, refer here.
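The slicing step might look roughly like the sketch below, which builds one FFMPEG command per file using the standard `-ss` (start position) and `-t` (duration) options. The sign convention for the offset, the 10-second interval and the output file names are assumptions made for this example.

```python
# Hypothetical value, in seconds; the real interval is a tuning decision.
MINIMUM_COMPARISON_INTERVAL = 10.0


def build_slice_command(input_path, output_path, start, duration):
    """Build the FFMPEG command that cuts [start, start + duration) to a file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the previous slice
        "-ss", f"{start:.3f}",    # seek to the start position in seconds
        "-t", f"{duration:.3f}",  # read only this many seconds
        "-i", input_path,
        output_path,
    ]


def slice_commands(analyzed, offset, source_path, target_path,
                   interval=MINIMUM_COMPARISON_INTERVAL):
    """Return (source_cmd, target_cmd) for the next comparison window.

    Assumption: a positive offset means the source has `offset` extra seconds
    at its start; a negative offset means the target does.
    """
    source_start = analyzed + max(offset, 0.0)
    target_start = analyzed + max(-offset, 0.0)
    return (
        build_slice_command(source_path, "source_slice.wav", source_start, interval),
        build_slice_command(target_path, "target_slice.wav", target_start, interval),
    )
```

After both slices are compared, the analyzed length advances by one interval and the same two commands are built again for the next window.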

There may be cases where the source stream had issues while recording, or the transcoding service was not able to generate proper chunks, which introduces a new offset between the source and target files in the middle of the comparison. This leads to very low similarity coefficients between the two audio files. One way to resolve this is to re-calculate the offset whenever there is a dip in the similarity coefficient. This approach has a performance issue: when the source and target files genuinely have very low similarity, it keeps re-calculating the offset, which reduces throughput. Another way is to re-calculate the offset at a fixed time interval, OFFSET_UPDATE_INTERVAL. This ensures that the source and target files stay in sync with each other.

In the end, the comparison thread has the following responsibilities:

  1. Set the initial offset once a certain duration of the source and target files is available.
  2. Calculate the similarity coefficient every MINIMUM_COMPARISON_INTERVAL.
  3. Re-calculate the offset between the files every OFFSET_UPDATE_INTERVAL.
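The three responsibilities above can be sketched as a simple schedule. The function below does no audio work; it only lists which action would run at which point of the stream, with all times in seconds. The function name, the action labels and the `initial_delay` parameter are hypothetical.

```python
def plan_actions(stream_seconds, initial_delay, comparison_interval,
                 offset_update_interval):
    """Return the ordered (time, action) pairs for a stream of given length."""
    if stream_seconds < initial_delay:
        return []  # not enough audio yet to set the initial offset
    actions = [(initial_delay, "set_offset")]
    next_compare = initial_delay + comparison_interval
    next_offset = initial_delay + offset_update_interval
    while min(next_compare, next_offset) <= stream_seconds:
        if next_offset <= next_compare:
            # On a tie, refresh the offset before comparing with it.
            actions.append((next_offset, "update_offset"))
            next_offset += offset_update_interval
        else:
            actions.append((next_compare, "compare"))
            next_compare += comparison_interval
    return actions
```

For a 60-second stream with a 20-second initial delay, a 10-second comparison interval and a 30-second offset update interval, this gives comparisons at 30, 40, 50 and 60 seconds, with one offset update at 50 seconds just before the comparison at that time.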

Problems faced:

  • Slicing the source file while it is still being recorded. Because the source stream is written as one continuous file, slicing it to calculate similarity coefficients was a problem. With some audio formats, the metadata of the audio is appended only when the recording is complete. While the file is still being recorded, that metadata is not present, so the file cannot be sliced.
    To solve this, the source and target files are stored in WAV format. WAV keeps its format information in a header at the start of the file rather than appending it when recording finishes, so the file can be sliced at any given point in time.
  • Appending transcoded chunks to the target file. The update message for a newly available transcoded chunk might be delayed or might never arrive, so chunk information can get lost. To handle this, whenever a message for a new chunk arrives, fetch all the chunks from the current chunk number up to the chunk number in the message. This keeps the target file reliable and available.
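The catch-up rule for lost chunk messages is small enough to sketch directly. The names below are hypothetical, and chunk numbers are assumed to be consecutive integers.

```python
def chunks_to_fetch(last_appended, announced):
    """Return every chunk number that still has to be appended, in order.

    last_appended: number of the last chunk already in the target file.
    announced: chunk number carried by the update message just received.
    """
    if announced <= last_appended:
        return []  # duplicate or late message, nothing new to append
    return list(range(last_appended + 1, announced + 1))


def on_chunk_message(last_appended, announced, fetch_and_append):
    """Append all missing chunks and return the new last appended number."""
    for number in chunks_to_fetch(last_appended, announced):
        fetch_and_append(number)
    return max(last_appended, announced)
```

If chunk 3 was the last one appended and the next message announces chunk 6, chunks 4, 5 and 6 are fetched in order, so the messages for 4 and 5 can be lost without leaving a hole in the target file.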

Any suggestions or thoughts? Let me know:
Insta + Twitter + LinkedIn + Medium | @shivama205


