Text normalization
Text normalization is the process of transforming text into a single, consistent form that it may not have had before. It is commonly performed before the text is processed further, for example before generating synthesized speech, automated language translation, storage in a database, or comparison.
Examples of text normalization:
- Unicode normalization
- converting all letters to lower or upper case
- removing punctuation
- removing accent marks and other diacritics from letters
- expanding abbreviations
While this may be done manually, and usually is in the case of ad hoc and personal documents, many programming languages and standard libraries provide mechanisms for text normalization.
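For example, Python's standard library exposes Unicode normalization through the unicodedata module. The following is a minimal sketch that combines several of the steps listed above; the function name and the particular choice and order of steps are illustrative rather than prescriptive.

```python
import string
import unicodedata

def normalize_text(text: str) -> str:
    """Apply several common text-normalization steps (illustrative only)."""
    # Unicode normalization: compose characters into a canonical form (NFC).
    text = unicodedata.normalize("NFC", text)
    # Case folding: map all letters to a single, comparable case.
    text = text.casefold()
    # Strip diacritics: decompose characters (NFD), then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove punctuation (ASCII punctuation only, for simplicity).
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text

print(normalize_text("Café-Crème, s'il vous plaît!"))  # -> "cafecreme sil vous plait"
```

Which steps are appropriate depends on the application: a search engine might strip diacritics and punctuation for looser matching, while a database key might only require a canonical Unicode form and consistent casing.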