Processing Speed of ASR (II): Streaming Speech-to-Text
This article introduces the speed metric for real-time speech recognition: Tail Packet Latency. Through aggressive optimization of tail packet latency, DolphinVoice delivers a highly responsive user experience in real-time speech recognition scenarios.

In the previous article, we introduced the speed metric for audio file transcription: the Real-Time Factor (RTF). In scenarios where the recording duration is known at the outset, RTF serves as a suitable metric for evaluating speech recognition speed.
Besides audio file transcription, there is another common speech recognition scenario, which is real-time speech recognition, such as real-time voice notes and real-time meeting subtitles. In real-time speech recognition scenarios, both the input of speech signals and the output of recognition results are continuous. Users have relatively high requirements for real-time performance in such scenarios, so any delay in the output of speech recognition results often directly affects user experience.
Therefore, evaluating the performance of real-time speech recognition requires latency-related metrics, and optimizing perceived latency is key to improving user experience.
Tail Packet Latency (TPL)
Tail Packet Latency is a core metric for measuring the total time taken from the end of audio input to the output of transcription results. In real-time speech recognition, the client streams audio data packets, and the server streams back the speech recognition results. The calculation of tail packet latency is as follows: timing starts after sending the last data packet of an audio segment and stops upon receiving the speech recognition result corresponding to this data packet. Tail packet latency reflects the user's intuitive perception of delay in real-time speech recognition scenarios.
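The measurement above can be sketched in Python. Note that `send_packet` and `recv_final_result` are hypothetical placeholders standing in for a streaming client's send and receive calls, not the DolphinVoice API itself:

```python
import time

def measure_tail_packet_latency(send_packet, recv_final_result, packets):
    """Time from sending the last audio packet to receiving its final result."""
    for pkt in packets:
        send_packet(pkt)            # stream audio chunks to the server
    t_last_sent = time.monotonic()  # clock starts after the last packet is sent
    recv_final_result()             # block until the final transcript arrives
    return (time.monotonic() - t_last_sent) * 1000.0  # latency in milliseconds

# Simulated round trip: the "server" takes ~50 ms to answer the last packet.
if __name__ == "__main__":
    latency_ms = measure_tail_packet_latency(
        send_packet=lambda pkt: None,
        recv_final_result=lambda: time.sleep(0.05),
        packets=[b"\x00" * 3200] * 10,
    )
    print(f"tail packet latency: {latency_ms:.0f} ms")
```

In a real client the receive call would wait on the WebSocket message that carries the final result for the last packet; the timing logic stays the same.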
Figure: Sketch of tail packet latency
In real-time interactive scenarios, tail packet latency needs to be controlled within a range acceptable to users, as excessive delay can cause desynchronization between speech and transcription content, disrupting interactive coherence. By using DolphinVoice's real-time speech recognition service, tail packet latency can be reduced to as low as 150ms.
Intermediate Results & Final Results
For streaming APIs (WebSocket API of short speech recognition and real-time speech recognition), recognition results are returned in real-time as the speech is input. For example, the sentence "It's a nice day today." might produce the following recognition results during the recognition process:
Its
It's an
It's a nice day
It's a nice day today.

Among these, the first three lines are called Intermediate Results, and the last one is called the Final Result. For streaming recognition products, you can control whether to return intermediate results by setting the enable_intermediate_result parameter. If intermediate results are disabled, only the final result is returned, so the user receives the complete recognition result in one go. Enabling intermediate results helps reduce the user's sense of waiting and improves user experience.
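A client consuming such a stream typically distinguishes the two result types by a flag on each message. The message shape below (`text`, `is_final`) is a hypothetical sketch for illustration, not the actual DolphinVoice response schema:

```python
# Hypothetical shape of streaming recognition messages: each carries the
# current transcript plus a flag marking whether it is the final result.
def displayed_transcripts(messages, enable_intermediate_result=True):
    """Yield transcripts to display; skip intermediates when disabled."""
    for msg in messages:
        if msg["is_final"] or enable_intermediate_result:
            yield msg["text"]

stream = [
    {"text": "Its", "is_final": False},
    {"text": "It's a", "is_final": False},
    {"text": "It's a nice day", "is_final": False},
    {"text": "It's a nice day today.", "is_final": True},
]

# With intermediates disabled, only the final result reaches the user.
print(list(displayed_transcripts(stream, enable_intermediate_result=False)))
# → ["It's a nice day today."]
```

In a live UI, each intermediate transcript would overwrite the previous one on screen, so the text appears to grow as the user speaks.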
Summary
In real-time speech recognition scenarios, processing speed and latency control are key elements in determining user experience. Optimizing tail packet latency can significantly enhance the fluidity and synchronicity of interactions. Additionally, the real-time feedback mechanism for intermediate results provides users with a more natural experience, particularly in applications requiring instant responses, such as real-time meeting transcription and voice assistants. This progressive output method effectively reduces users' perception of waiting. DolphinVoice, with its streaming interface design, balances real-time capabilities and accuracy, offering developers and end-users an efficient and reliable speech recognition solution.
DolphinAI, K.K. is SOC 2 Type 1 and ISMS (ISO/IEC 27001) certified, providing high-accuracy speech recognition in a secure environment, with an average daily usage of approximately 7,000 hours. In the call center industry, DolphinVoice's services have been officially integrated and commercialized in Cloopen's SimpleConnect platform. We have also collaborated with Sanntsu Telecom Service Corporation to jointly develop and launch the AI Call Memo Service.
For inquiries regarding access to the speech recognition system or related questions, feel free to reach out.
Get started now
- Log in to DolphinVoice – start your free trial
- Browse the API docs – technical specs & guides
- Visit our website – service details & case studies
About the Author
Masahiro Asakura / Andy Yan
- CEO, DolphinAI, K.K.
- Former Director of Global Business, Advanced Media Inc. (8 years)
- 12 years of hands-on experience deploying voice-AI solutions
- Track Record: Supported voice AI deployment for over 30 enterprises
- Domains: ASR, TTS, call center AI, AI meeting minutes, voice-interaction devices
- Markets: Japan, Mainland China, Taiwan, Hong Kong
- Publications: 100+ technical articles
Public Presentations
- "AI New Forces · Product Open Day" by Tokyo Generative AI Development Community (October 25, 2025)
- "TOPAI International AI Ecosystem Frontier Private Salon" by TOPAI & Inspireland Incubator (July 29, 2025)
- "Global AI Conference & Hackathon" by WaytoAGI (June 7, 2025)
Contact
Email: mh.asakura@dolphin-ai.jp
LinkedIn: https://www.linkedin.com/in/14a9b882/
About DolphinAI, K.K.
An AI company specializing in speech recognition, speech synthesis and voice dialogue technologies for Japanese and other languages.
Product: DolphinVoice (Voice-Interactive AI SaaS Platform)
Key Features: ASR (Japanese, Mandarin, English, Mandarin-English mixed, Japanese-English mixed), TTS (Japanese, Mandarin, English)
Usage: About 7,000 accumulated hours per day in call center and AI meeting minutes scenarios
■ Security & Compliance
- ISMS (ISO/IEC 27001) Certified
- SOC 2 Type 1 Report Obtained
- Details
■ Contact
Phone: (+81) 03-6161-7298