Processing Speed of ASR (I): Audio File Transcription
This article explains how to quantitatively evaluate the transcription speed of audio files and explores the role of parallel processing in speeding it up.

DolphinVoice's Audio File Transcription service is an application of Automatic Speech Recognition (ASR) technology that automatically converts the speech in recordings or audio files into text. It is widely used in fields such as meeting minutes, interview organization, customer service, and monitoring systems. For example, after a meeting we may be left with a one-hour recording that we want to transcribe quickly into text for easier reading, editing, and storage. This is exactly where audio file transcription comes in.
When we upload a video or audio file for transcription, the system needs to convert the audio signal into text in the shortest possible time, and the core metric for measuring the efficiency of this process is the Real-time Factor (RTF).
Real-time factor (RTF): A speed gauge for non-streaming speech recognition
What is RTF?
RTF is the ratio of the time the speech recognition system needs to process the speech to the duration of the original speech, calculated using the formula:

RTF = Processing Time / Audio Duration

- Processing Time: total time required to complete the transcription
- Audio Duration: duration of the original audio file
Examples:
- If the system transcribes a 60-second recording in 50 seconds, then RTF = 50 / 60 ≈ 0.83, indicating that processing is faster than real-time playback. (RTF < 1)
- If the system takes 70 seconds to complete the transcription, then RTF = 70 / 60 ≈ 1.17, indicating that processing lags behind real-time playback. (RTF > 1)
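The calculation above can be sketched as a small helper function (a minimal illustration; the function name is ours, not part of any DolphinVoice API):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time Factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# A 60-second recording transcribed in 50 seconds: RTF < 1 (faster than real time).
print(round(real_time_factor(50, 60), 2))   # 0.83
# The same recording transcribed in 70 seconds: RTF > 1 (slower than real time).
print(round(real_time_factor(70, 60), 2))   # 1.17
```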
The significance of RTF
As the definition shows, the lower the RTF, the faster an audio file is processed. In a speech recognition system, RTF is an important indicator of responsiveness, especially in applications that require quick feedback, such as real-time subtitle generation, online translation, and call centers. RTF directly affects user experience and service efficiency.
For speech recognition systems, RTF is mainly limited by server hardware performance. Hardware conditions such as processor speed, memory size, and network bandwidth directly determine how efficiently the system processes audio. Additionally, the complexity of the algorithm and the quality and length of the audio file also affect RTF.
So, how can one improve processing speed under certain hardware conditions? The answer is parallel processing.
Parallel Processing: The Key to Breaking Speed Limits
Parallel processing is a method to improve processing efficiency by breaking tasks into multiple parts that can be executed simultaneously. This technique is widely used across various fields in computer science, including speech recognition. In the DolphinVoice service, task segmentation and multithreading methods are primarily used to achieve parallel processing, thereby enhancing processing speed.
For audio file transcription, the server receives the entire audio file along with the client's task request, so the system can segment the audio internally before recognition. Typically, a VAD (Voice Activity Detection) module segments the audio at silent positions, ensuring that each resulting slice is no longer than 60 seconds, and the slices are then allocated to different threads for processing.
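The silence-based segmentation step can be sketched as a greedy cut over VAD-detected silence positions. This is a simplified illustration, not DolphinVoice's actual implementation; it assumes the VAD module has already produced a list of silence timestamps in seconds:

```python
def segment_at_silence(silence_points, total_duration, max_slice=60.0):
    """Greedily cut at the farthest silence within max_slice seconds of the
    current slice start; fall back to a hard cut if no silence fits."""
    cuts = sorted(p for p in silence_points if 0.0 < p < total_duration)
    slices, start = [], 0.0
    while total_duration - start > max_slice:
        within = [p for p in cuts if start < p <= start + max_slice]
        cut = within[-1] if within else start + max_slice  # hard cut as last resort
        slices.append((start, cut))
        start = cut
    slices.append((start, total_duration))
    return slices

# A 150-second file with silences at 25s, 55s, 90s, and 130s:
print(segment_at_silence([25, 55, 90, 130], 150))
# Every slice ends at a silence position and is at most 60 seconds long.
```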
Generally, the system uses up to 4 threads per task (if the audio is split into fewer than 4 slices, the number of threads matches the number of slices). For transcription tasks (speed version), the system uses up to 16 threads, significantly increasing processing speed.
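The thread allocation described above can be illustrated with Python's `ThreadPoolExecutor`. Here `transcribe_slice` is a hypothetical stand-in for a real ASR call on one audio slice; only the pooling pattern reflects the text:

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_slice(slice_id):
    # Placeholder for a real ASR call on one audio slice (hypothetical).
    return f"text of slice {slice_id}"

def transcribe_parallel(slice_ids, max_threads=4):
    """Process slices with at most max_threads workers; use fewer
    workers when there are fewer slices than threads."""
    if not slice_ids:
        return []
    workers = min(max_threads, len(slice_ids))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so results concatenate correctly.
        return list(pool.map(transcribe_slice, slice_ids))

print(transcribe_parallel([0, 1, 2]))
```

Because the slices are independent, the transcript is simply the in-order concatenation of the per-slice results.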
In the DolphinVoice Audio File Transcription (Standard Version) service, processing 1 hour of audio typically takes 6-10 minutes. In the Audio File Transcription (VIP Version) service, processing 1 hour of audio takes only 1-2 minutes. Customers can choose the appropriate service type according to their business needs and usage scenarios.
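These service-level figures can be translated back into RTF terms using the formula introduced earlier (the minute values below are the ranges quoted above):

```python
def rtf(processing_minutes: float, audio_minutes: float = 60.0) -> float:
    """Effective RTF for a given processing time per hour of audio."""
    return processing_minutes / audio_minutes

# Standard Version: 6-10 minutes per hour of audio -> RTF of 0.10 to about 0.17.
print(rtf(6), round(rtf(10), 2))
# VIP Version: 1-2 minutes per hour of audio -> RTF of about 0.017 to 0.033.
print(round(rtf(1), 3), round(rtf(2), 3))
```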
If you need to use the Audio File Transcription (VIP Version) service, please contact us.
DolphinAI, K.K. is SOC 2 Type 1 and ISMS (ISO/IEC 27001) certified, providing high-accuracy speech recognition in a secure environment, with an average daily usage of approximately 7,000 hours. In the call center industry, DolphinVoice's services have been officially integrated and commercialized in Cloopen's SimpleConnect platform. We have also collaborated with Sanntsu Telecom Service Corporation to jointly develop and launch the AI Call Memo Service.
For inquiries regarding access to the speech recognition system or related questions, feel free to reach out.
Get started now
- Log in to DolphinVoice – start your free trial
- Browse the API docs – technical specs & guides
- Visit our website – service details & case studies
About the Author
Masahiro Asakura / Andy Yan
- CEO, DolphinAI, K.K.
- Former Director of Global Business, Advanced Media Inc. (8 years)
- 12 years of hands-on experience deploying voice-AI solutions
- Track Record: Supported voice AI deployment for over 30 enterprises
- Domains: ASR, TTS, call center AI, AI meeting minutes, voice-interaction devices
- Markets: Japan, Mainland China, Taiwan, Hong Kong
- Publications: 100+ technical articles
Public Presentations
- "AI New Forces · Product Open Day" by Tokyo Generative AI Development Community (October 25, 2025)
- "TOPAI International AI Ecosystem Frontier Private Salon" by TOPAI & Inspireland Incubator (July 29, 2025)
- "Global AI Conference & Hackathon" by WaytoAGI (June 7, 2025)
Contact
Email: mh.asakura@dolphin-ai.jp
LinkedIn: https://www.linkedin.com/in/14a9b882/
About DolphinAI, K.K.
An AI company specializing in speech recognition, speech synthesis and voice dialogue technologies for Japanese and other languages.
Product: DolphinVoice (Voice-Interactive AI SaaS Platform)
Key Features: ASR (Japanese, Mandarin, English, Mandarin-English mixed, Japanese-English mixed), TTS (Japanese, Mandarin, English)
Usage: About 7,000 accumulated hours per day in call center and AI meeting minutes scenarios
■ Security & Compliance
- ISMS (ISO/IEC 27001) Certified
- SOC 2 Type 1 Report Obtained
- Details
■ Contact
(+81) 03-6161-7298