A Brief Look at ITN Technology in Speech Recognition
Inverse Text Normalization (ITN) is the process of converting the "normalized" textual output from AI speech recognition back into a "non-normalized" form that matches written conventions.

What is ITN?
Inverse Text Normalization (ITN) is the process of converting the "normalized" textual form produced by AI speech recognition into a "non-normalized" textual form. For example, when you say "three point five four", the ASR system may output the literal string "three point five four"; ITN transforms it into "3.54" to conform to written conventions.
"Normalized" here is defined relative to the modeling units used to train the recognition model. In an English ASR model the basic unit is an English character, so the character sequence "three point five four" is regarded as the normalized form, whereas "3.54" is the non-normalized form.
ITN in DolphinVoice
When you use the DolphinVoice Speech Recognition API, ITN is enabled by default. You can toggle it explicitly with the parameter enable_inverse_text_normalization. DolphinVoice’s ITN module handles the following transformations:
- Numbers & Symbols: With ITN, spoken numbers and symbols are turned into written form, e.g. "twenty percent" → "20%".
- Currencies & Units: With ITN, spoken currency and unit phrases are converted to standard written forms, e.g. "twenty dollars" → "$20".
- Dates & Times: With ITN, diverse spoken date/time expressions are normalized to consistent written forms, e.g. "April twenty-third, twenty twenty-five" → "April 23, 2025".
Examples:
| ITN Disabled | ITN Enabled |
|---|---|
| twenty percent | 20% |
| one thousand two hundred thirty-four dollars | $1,234 |
| April third | April 3 |
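The toggle itself is a single request parameter. Below is a minimal sketch of how a request might look from Python; only the enable_inverse_text_normalization parameter name comes from this article, while the endpoint URL, authentication header, and response field names are hypothetical placeholders — consult the API docs for the actual request format.

```python
# Minimal sketch of toggling ITN in a recognition request.
# Only enable_inverse_text_normalization is taken from the article above;
# the endpoint URL, auth header, and response field are hypothetical placeholders.
import requests

API_URL = "https://api.dolphin-ai.jp/v1/asr"   # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                        # hypothetical credential

def transcribe(audio_path: str, itn: bool = True) -> str:
    """Send an audio file for recognition, with ITN switched on or off."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},   # hypothetical auth scheme
            files={"audio": f},
            data={"enable_inverse_text_normalization": str(itn).lower()},
        )
    resp.raise_for_status()
    return resp.json().get("text", "")           # hypothetical response field

# With ITN (default) the result reads "3.54"; without it, "three point five four".
# print(transcribe("sample.wav", itn=True))
```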
How ITN is Implemented
DolphinVoice implements ITN with Finite State Transducers (FSTs), defining a series of transformation rules.
An FST is an extended finite state machine that not only performs state transitions but can also emit characters or symbols during each transition. A transducer consists of a set of states and the transitions between them, with each transition carrying an input and an output. The FST maps input to output through these states and transformation rules, where each rule pairs an input expression with its corresponding output expression.
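As a toy illustration of this structure (not DolphinVoice's actual rule set), the sketch below encodes a handful of transitions as a Python dictionary keyed by (state, input token), each yielding a next state and an output string, then walks them to turn "three point five four" into "3.54".

```python
# A toy finite state transducer: transitions map (state, input token) to
# (next state, output string). This is a simplified illustration of the idea,
# not DolphinVoice's production rule set.
TRANSITIONS = {
    ("start", "three"):  ("number", "3"),
    ("start", "five"):   ("number", "5"),
    ("number", "point"): ("decimal", "."),
    ("decimal", "five"): ("decimal", "5"),
    ("decimal", "four"): ("decimal", "4"),
}

def run_fst(tokens):
    """Walk the transducer over the token stream, emitting one output per transition."""
    state, output = "start", []
    for tok in tokens:
        state, out = TRANSITIONS[(state, tok)]
        output.append(out)
    return "".join(output)

print(run_fst("three point five four".split()))  # -> 3.54
```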
When an FST is used for text transformation, the input text stream passes through the transducer, and every matching rule fires to produce the corresponding output. Some transformations depend on context, such as currency symbols or units: converting "twenty dollars" to "$20", for example, requires taking the surrounding word "dollars" into account. Because an FST maintains state, it can remember this context, which enables more complex transformations.
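The sketch below illustrates this idea of state-held context in plain Python rather than a real FST toolkit: a pending number is kept as part of the transducer's state and is only written out once the following token reveals whether it is a currency amount.

```python
# Context-handling sketch: the transducer remembers a pending number in its state
# and decides how to write it only after seeing what follows. "twenty dollars"
# becomes "$20", while a bare "twenty" stays "20". A toy illustration only.
NUMBER_WORDS = {"twenty": "20", "five": "5"}

def itn_with_context(tokens):
    out, pending = [], None              # pending holds a number awaiting context
    for tok in tokens:
        if tok in NUMBER_WORDS:
            pending = NUMBER_WORDS[tok]
        elif tok == "dollars" and pending is not None:
            out.append("$" + pending)    # context "dollars" -> currency notation
            pending = None
        else:
            if pending is not None:
                out.append(pending)      # no currency context: emit plain number
                pending = None
            out.append(tok)
    if pending is not None:
        out.append(pending)
    return " ".join(out)

print(itn_with_context("i paid twenty dollars".split()))  # -> i paid $20
print(itn_with_context("twenty people came".split()))     # -> 20 people came
```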
The advantage of this approach is that as usage scenarios expand, more complex syntactic and semantic rules can be added to continuously refine the rule set. Machine learning methods can also be used to optimize the FST rule set to accommodate more complex natural language processing needs. As the technology advances, ITN will increasingly combine rule-based and data-driven methods, further improving its accuracy and applicability.
Why ITN Matters
ITN dramatically improves readability. Everyday spoken language is expressively flexible, but when that flexibility is transcribed verbatim it can produce text that is unintuitive or at odds with reading habits. This is especially evident for dates, times, currencies, and percentages, where spoken expressions differ significantly from their written forms; numbers transcribed word for word, for example, can easily confuse readers. By converting these colloquial expressions into conventional written forms, ITN makes the generated text easier to understand and use. Information is delivered more precisely, and readers can grasp the meaning without extra cognitive effort, significantly enhancing both the readability and the usability of the output.
FAQs
DolphinAI, K.K. is SOC 2 Type 1 and ISMS (ISO/IEC 27001) certified, providing high-accuracy speech recognition in a secure environment, with an average daily usage of approximately 7,000 hours. In the call center industry, DolphinVoice's services have been officially integrated and commercialized in Cloopen's SimpleConnect platform. We have also collaborated with Sanntsu Telecom Service Corporation to jointly develop and launch the AI Call Memo Service.
For inquiries regarding access to the speech recognition system or related questions, feel free to reach out.
Get started now
- Log in to DolphinVoice – start your free trial
- Browse the API docs – technical specs & guides
- Visit our website – service details & case studies
About the Author
Masahiro Asakura / Andy Yan
- CEO, DolphinAI, K.K.
- Former Director of Global Business, Advanced Media Inc. (8 years)
- 12 years of hands-on experience deploying voice-AI solutions
- Track Record: Supported voice AI deployment for over 30 enterprises
- Domains: ASR, TTS, call center AI, AI meeting minutes, voice-interaction devices
- Markets: Japan, Mainland China, Taiwan, Hong Kong
- Publications: 100+ technical articles
Public Presentations
- "AI New Forces · Product Open Day" by Tokyo Generative AI Development Community (October 25, 2025)
- "TOPAI International AI Ecosystem Frontier Private Salon" by TOPAI & Inspireland Incubator (July 29, 2025)
- "Global AI Conference & Hackathon" by WaytoAGI (June 7, 2025)
Contact
Email: mh.asakura@dolphin-ai.jp
LinkedIn: https://www.linkedin.com/in/14a9b882/
About DolphinAI, K.K.
An AI company specializing in speech recognition, speech synthesis and voice dialogue technologies for Japanese and other languages.
Product: DolphinVoice (Voice-Interactive AI SaaS Platform)
Key Features: ASR (Japanese, Mandarin, English, Mandarin-English mixed, Japanese-English mixed), TTS (Japanese, Mandarin, English)
Usage: Approximately 7,000 hours per day in call center and AI meeting minutes scenarios
■ Security & Compliance
- ISMS (ISO/IEC 27001) Certified
- SOC 2 Type 1 Report Obtained
- Details
■ Contact
(+81) 03-6161-7298