Fundamentals

Grasping CER and WER in Speech Recognition

When evaluating the performance of speech recognition systems, CER and WER are two very important metrics. This article introduces the definitions, calculation methods, and limitations of these two metrics, emphasizing the need to consider other indicators for a comprehensive assessment of speech recognition engine performance.

Speech recognition (Speech to Text) has become an indispensable part of our daily lives. From voice assistants on mobile phones to smart home devices, these systems rely on the ability to convert speech into text. When evaluating the performance of speech recognition systems, CER and WER are two very important metrics. What do they represent, and what is the difference between them?

What is WER?

WER, short for Word Error Rate, is a commonly used metric for evaluating the performance of speech recognition systems. It measures how accurately the system recognizes words, and its calculation formula is:

\text{WER} = \frac{S + D + I}{N}

Among them:

  • S is the number of Substitution errors, that is, the system recognizes an incorrect word instead of the correct one.

  • D is the number of Deletion errors, that is, the system misses a word that should have been recognized.

  • I is the number of Insertion errors, that is, the system outputs extra words that do not exist in the reference text.

  • N is the total number of words in the reference text (standard answer).

The lower the WER value, the lower the recognition error rate, which also indicates higher accuracy.
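
As a quick illustration with a made-up sentence pair: if the reference is "the cat sat on the mat" (N = 6) and the recognized text is "the cat sit on mat", there is one substitution (sat → sit) and one deletion (the), so:

\text{WER} = \frac{1 + 1 + 0}{6} \approx 0.333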

What is CER?

CER, short for Character Error Rate, is a metric similar to WER, but it calculates the error rate based on characters rather than words. This metric is more effective in certain cases (for example, when dealing with pinyin, languages without clear word boundaries, or character-based language models). The formula for CER is:

\text{CER} = \frac{S + D + I}{N}

The meanings of the symbols in the formula are the same as those in WER, except that here S, D, I, and N are based on character level rather than word level.
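
To see how the two metrics differ, consider a hypothetical one-word example: if the reference is "hello" and the recognition result is "hallo", the WER is 1/1 = 1.0 because the whole word is counted as wrong, while the CER is only 1/5 = 0.2 because just one of the five characters is substituted.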

Applications of CER and WER

For languages that separate words with spaces, such as English, WER is a commonly used and intuitive metric. However, for Japanese, Chinese, and other languages written without explicit word boundaries, CER is typically used instead, since individual characters in these languages often represent a semantic unit.

Additionally, in certain application scenarios, such as spelling-sensitive recognition and character-level correction tasks, CER may reflect system performance better than WER.

Calculation Example

Let's walk through the calculation with a simple Japanese example. Suppose we have the following Japanese text:

  • Reference Text (Reference, i.e., the standard answer): 今日は天気がいいね

  • Recognized Text (Hypothesis, i.e., the result of speech recognition): 今日天気はいいよね

In this example, since the text is Japanese, CER is the more appropriate metric, so we calculate the CER next.

Calculate CER manually

Reference  | 今 | 日 | は | 天 | 気 | が | い | い |    | ね
Hypothesis | 今 | 日 |    | 天 | 気 | は | い | い | よ | ね
Error Type |    |    | D  |    |    | S  |    |    | I  |
  1. Substitution (S):

    • The character が was incorrectly recognized as は. Therefore, the number of substitution errors S=1.
  2. Deletion (D):

    • The character は was not recognized. Therefore, the number of deletion errors D=1.
  3. Insertion (I):

    • The character よ was inserted additionally. Therefore, the number of insertion errors I=1.
  4. Total number of characters (N):

    • There are 9 characters in the reference text. Therefore, N=9.

Substitute these values into the CER formula:

\text{CER} = \frac{S + D + I}{N} = \frac{1 + 1 + 1}{9} \approx 0.333

Therefore, the CER in this example is approximately 0.333, or 33.3%.

Calculate CER with Python

The simple example above shows how CER is calculated by hand; WER is calculated in the same way, so we will not repeat it here. Manual calculation is tedious, however, and fortunately there are many tools that can do it for us. Below, we use the JiWER Python library as an example:

Prepare Environment

Requires Python >=3.8

pip install jiwer

Calculate CER

import jiwer

reference = "今日は天気がいいね"
hypothesis = "今日天気はいいよね"
# Calculate CER
out = jiwer.process_characters(reference, hypothesis)
# Print CER only
print(f"CER={out.cer}\n-------")
# Print detailed information
print(jiwer.visualize_alignment(out))

After executing the above code, the following information will be output:

CER=0.3333333333333333
-------
=== SENTENCE 1 ===

REF: 今日は天気がいい*ね
HYP: 今日*天気はいいよね
     D  S   I

=== SUMMARY ===
number of sentences: 1
substitutions=1 deletions=1 insertions=1 hits=7

cer=33.33%

JiWER also supports the calculation of WER, and you can access the documentation to explore more features.
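
As a brief sketch of the word-level counterpart, assuming a recent JiWER version (the sentence pair below is made up for illustration):

import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"

# Word-level error rate: one substitution (sat -> sit) and one deletion (the)
# out of six reference words, so roughly 0.333
print(f"WER={jiwer.wer(reference, hypothesis)}")

# process_words returns the detailed word-level alignment,
# analogous to process_characters used above
out = jiwer.process_words(reference, hypothesis)
print(jiwer.visualize_alignment(out))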

Additional processing before calculating CER and WER

When calculating CER and WER, we usually only focus on the errors of the characters/words themselves, so some preprocessing work needs to be done on the reference text and the recognition results before calculation:

  • Expand abbreviations: Expand abbreviations in the text into their actual pronunciations, for example, "IEEE" expands to "I triple E"

  • Unify case: For languages like English that distinguish between uppercase and lowercase letters, generally convert everything to lowercase

  • Remove punctuation: Remove all punctuation in the reference text and recognition results, keeping only the characters/words

  • Handle ITN transformations: Most speech recognition systems have a built-in ITN (inverse text normalization) function that formats dates, numbers, and similar items in their conventional written form, as shown in the table below:

ITN Disabled                                  | ITN Enabled
twenty percent                                | 20%
one thousand two hundred thirty-four dollars  | $1,234
April third                                   | April 3

Before calculating CER or WER, we need to unify the reference text and recognition results into the "ITN-disabled" style, meaning that all text uses the native script of the language and matches the spoken content exactly, without any Arabic numerals, symbols, or other formatted output. If you are using the DolphinVoice speech recognition API, you can disable the ITN feature by setting the parameter enable_inverse_text_normalization = false, so that the speech recognition results meet the above requirements. For more information on ITN, please refer to the article: A Brief Look at ITN Technology in Speech Recognition.
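
For English text, some of these preprocessing steps (case folding, punctuation removal, whitespace cleanup) can be done with JiWER's built-in transforms. The snippet below is a minimal sketch, assuming a recent JiWER version; the sentence pair is made up for illustration:

import jiwer

# Normalization pipeline: lowercase, strip punctuation, collapse extra whitespace
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

reference = "Hello, World! This is a TEST."
hypothesis = "hello world this is a test"

# Apply the same normalization to both sides, then compute WER as usual
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # 0.0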

Considerations for the Application of CER and WER

Although CER and WER are important indicators for evaluating the performance of speech recognition systems, they also have their limitations. For example, the calculation of CER and WER cannot cover many factors that affect the readability of text. Factors such as the position and type of punctuation, the formatting of numbers and symbols (like dates and times), and the handling of text fluency are all important aspects that impact user experience, but these fall outside the evaluation scope of CER and WER.

Additionally, in Japanese, the same word may have multiple written forms. For example, if the reference text is 全て while the speech recognition result is すべて, both forms are correct, but the CER calculation treats this as one substitution error and one insertion error, which affects the CER result. Although this has almost no impact on readability or actual understanding, it still increases the CER.
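
Continuing with the 全て / すべて example: the reference has N = 2 characters, and the alignment gives S = 1 (全 → す) and I = 1 (べ), so:

\text{CER} = \frac{1 + 0 + 1}{2} = 1.0

In other words, a transcript that a human reader would consider perfectly acceptable is scored as a 100% character error rate.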

Therefore, when evaluating the performance of speech recognition engines, we cannot solely rely on CER or WER. We need to conduct a comprehensive analysis that includes multiple indicators such as CER/WER, readability, and user experience, using them as reference criteria for performance evaluation to fully understand the system's true capabilities and the actual user experience.


DolphinAI, K.K. is SOC 2 Type 1 and ISMS (ISO/IEC 27001) certified, providing high-accuracy speech recognition in a secure environment, with an average daily usage of approximately 7,000 hours. In the call center industry, DolphinVoice's services have been officially integrated and commercialized in Cloopen's SimpleConnect platform. We have also collaborated with Sanntsu Telecom Service Corporation to jointly develop and launch the AI Call Memo Service.

For inquiries regarding access to the speech recognition system or related questions, feel free to reach out.



About the Author

Masahiro Asakura / Andy Yan

  • CEO, DolphinAI, K.K.
  • Former Director of Global Business, Advanced Media Inc. (8 years)
  • 12 years of hands-on experience deploying voice-AI solutions
  • Track Record: Supported voice AI deployment for over 30 enterprises
  • Domains: ASR, TTS, call center AI, AI meeting minutes, voice-interaction devices
  • Markets: Japan, Mainland China, Taiwan, Hong Kong
  • Publications: 100+ technical articles

Public Presentations

  • "AI New Forces · Product Open Day" by Tokyo Generative AI Development Community (October 25, 2025)
  • "TOPAI International AI Ecosystem Frontier Private Salon" by TOPAI & Inspireland Incubator (July 29, 2025)
  • "Global AI Conference & Hackathon" by WaytoAGI (June 7, 2025)

Contact

Email: mh.asakura@dolphin-ai.jp
LinkedIn: https://www.linkedin.com/in/14a9b882/

About DolphinAI, K.K.

An AI company specializing in speech recognition, speech synthesis and voice dialogue technologies for Japanese and other languages.

Product: DolphinVoice (Voice-Interactive AI SaaS Platform)
Key Features: ASR (Japanese, Mandarin, English, Mandarin-English mixed, Japanese-English mixed), TTS (Japanese, Mandarin, English)
Usage: About 7,000 accumulated hours per day in call center and AI meeting minutes scenarios

■ Security & Compliance

  • ISMS (ISO/IEC 27001) Certified
  • SOC 2 Type 1 Report Obtained

■ Contact

Phone: (+81) 03-6161-7298

Email: voice.contact@dolphin-ai.jp

Website: https://dolphin-ai.jp/
