Processing Speed of ASR (II): Streaming Speech-to-Text

In the previous article, we introduced Real-Time Factor (RTF), the speed metric for audio file transcription. RTF is a suitable measure of recognition speed when the duration of the recording is known at the outset.

Besides audio file transcription, another common scenario is real-time speech recognition, used for example in real-time voice notes and real-time meeting subtitles. Here both the speech input and the recognition output are continuous. Users expect the results to keep pace with the speech, so any delay in the output directly affects the user experience.

Therefore, evaluating real-time speech recognition requires latency-related metrics, along with techniques that reduce the delay users perceive.

Tail Packet Latency (TPL)

Tail Packet Latency is a core metric for measuring the total time taken from the end of audio input to the output of transcription results. In real-time speech recognition, the client streams audio data packets, and the server streams back the speech recognition results. The calculation of tail packet latency is as follows: timing starts after sending the last data packet of an audio segment and stops upon receiving the speech recognition result corresponding to this data packet. Tail packet latency reflects the user's intuitive perception of delay in real-time speech recognition scenarios.
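The timing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the TailPacketTimer class and its method names are hypothetical and not part of any DolphinVoice SDK, and the fake clock stands in for real network events so that the example is deterministic.

```python
import time
from typing import Callable, Optional


class TailPacketTimer:
    """Hypothetical helper: times the gap between sending the last audio
    packet of a segment and receiving the result for that packet."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._sent_at: Optional[float] = None

    def mark_last_packet_sent(self) -> None:
        # Timing starts right after the last packet of the segment is sent.
        self._sent_at = self._clock()

    def mark_final_result_received(self) -> float:
        # Timing stops when the result for that packet arrives.
        # Returns the tail packet latency in milliseconds.
        if self._sent_at is None:
            raise RuntimeError("the last packet has not been sent yet")
        return (self._clock() - self._sent_at) * 1000.0


def demo_latency_ms() -> float:
    # Fake clock readings (in seconds) instead of a real network round trip.
    ticks = iter([10.000, 10.150])
    timer = TailPacketTimer(clock=lambda: next(ticks))
    timer.mark_last_packet_sent()
    return timer.mark_final_result_received()


if __name__ == "__main__":
    print(f"tail packet latency: {demo_latency_ms():.0f} ms")
```

In a real client, mark_last_packet_sent would be called right after the final audio frame is written to the connection, and mark_final_result_received when the matching result message arrives. A monotonic clock is used by default so that system clock adjustments do not distort the measurement.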

[Figure: Sketch map of tail packet latency]

In real-time interactive scenarios, tail packet latency must stay within a range users find acceptable: excessive delay desynchronizes speech and transcription and disrupts the flow of the interaction. With DolphinVoice's real-time speech recognition service, tail packet latency can be reduced to as low as 150 ms.

Intermediate Results and Final Results

For streaming APIs (the WebSocket APIs for short speech recognition and real-time speech recognition), recognition results are returned in real time as the speech is input. For example, the sentence "It's a nice day today." might produce the following sequence of recognition results:

Its
It's an 
It's a nice day
It's a nice day today.

Among these, the first three lines are called Intermediate Results, and the last one is called the Final Result. For streaming recognition products, you can control whether to return intermediate results by setting the enable_intermediate_result parameter. If intermediate results are disabled, only the final result will be returned, appearing to the user as receiving the complete recognition result in one go. Enabling intermediate results helps reduce the user's sense of waiting and improves user experience.
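As a rough illustration of how a client might display these results, the sketch below overwrites the current line with each intermediate result and commits the text only when the final result arrives. The (text, is_final) tuple is an assumed, simplified message shape used for illustration; it is not the actual DolphinVoice response format, and render_transcript is a hypothetical helper.

```python
from typing import Iterable, List, Tuple

# Assumed message shape: (recognized text, whether it is the final result).
Result = Tuple[str, bool]


def render_transcript(results: Iterable[Result]) -> List[str]:
    """Return what the user would see on screen after each incoming result."""
    committed: List[str] = []        # sentences whose final result has arrived
    display_states: List[str] = []   # one on-screen snapshot per result
    for text, is_final in results:
        if is_final:
            # The final result replaces the provisional text and is kept.
            committed.append(text)
            display_states.append(" ".join(committed))
        else:
            # An intermediate result is shown but may still be revised.
            display_states.append(" ".join(committed + [text]))
    return display_states


EXAMPLE: List[Result] = [
    ("Its", False),
    ("It's an", False),
    ("It's a nice day", False),
    ("It's a nice day today.", True),
]

if __name__ == "__main__":
    for state in render_transcript(EXAMPLE):
        print(state)
```

Filtering out the non-final entries before calling render_transcript mimics what the user sees when intermediate results are disabled: a single update containing the complete sentence.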

Non-streaming APIs (the Short Speech Recognition POST API and the Audio File Transcription API) do not produce intermediate results; they return only the final result.

Summary

In real-time speech recognition scenarios, processing speed and latency control are key to the user experience. Reducing tail packet latency keeps the transcription in step with the speech and makes interaction feel fluid. Returning intermediate results gives users a more natural experience, particularly in applications that require instant responses, such as real-time meeting transcription and voice assistants, because progressive output reduces the perceived wait. DolphinVoice, with its streaming interface design, balances real-time capability and accuracy, offering developers and end users an efficient and reliable speech recognition solution.

DolphinAI, K.K. is SOC 2 Type 1 and ISMS (ISO/IEC 27001) certified, providing high-accuracy speech recognition in a secure environment, with an average daily usage of approximately 7,000 hours. In the call center industry, DolphinVoice's services have been officially integrated and commercialized in Cloopen's SimpleConnect platform. We have also collaborated with Sanntsu Telecom Service Corporation to jointly develop and launch the AI Call Memo Service.

For inquiries regarding access to the speech recognition system or related questions, feel free to reach out.

About the Author

Masahiro Asakura / Andy Yan

CEO, DolphinAI, K.K.
Former Director of Global Business, Advanced Media Inc. (8 years)
12 years of hands-on experience deploying voice-AI solutions
Track Record: Supported voice AI deployment for over 30 enterprises
Domains: ASR, TTS, call center AI, AI meeting minutes, voice-interaction devices
Markets: Japan, Mainland China, Taiwan, Hong Kong
Publications: 100+ technical articles

Public Presentations

"AI New Forces · Product Open Day" by Tokyo Generative AI Development Community (October 25, 2025)
"TOPAI International AI Ecosystem Frontier Private Salon" by TOPAI & Inspireland Incubator (July 29, 2025)
"Global AI Conference & Hackathon" by WaytoAGI (June 7, 2025)

Contact

Email: mh.asakura@dolphin-ai.jp
LinkedIn: https://www.linkedin.com/in/14a9b882/

About DolphinAI, K.K.

An AI company specializing in speech recognition, speech synthesis and voice dialogue technologies for Japanese and other languages.

Product: DolphinVoice (Voice-Interactive AI SaaS Platform)
Key Features: ASR (Japanese, Mandarin, English, Mandarin-English mixed, Japanese-English mixed), TTS (Japanese, Mandarin, English)
Usage: Approximately 7,000 hours per day in total across call center and AI meeting minutes scenarios

■ Security & Compliance