Language Identification

Many natural language processing (NLP) techniques require language-dependent processing. For example, topic modeling typically begins by removing stop words, and a word that serves as a stop word in one language may be meaningful in another. The analysis software therefore has to know the language of the text it is processing.

We use a custom implementation of the N-gram based cumulative frequency addition method of Bashir Ahmed, Sung-Hyuk Cha, and Charles Tappert (Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004).
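
The general idea of cumulative frequency addition can be illustrated with a minimal Python sketch. This is not the custom implementation mentioned above: the function names, the n-gram lengths (2 and 3), and the tiny training samples are all assumptions made for illustration. Each language gets a profile of relative n-gram frequencies, and a text is assigned to the language whose profile yields the largest sum of frequencies over the text's n-grams.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams of length n_min..n_max (assumed lengths)."""
    text = " ".join(text.lower().split())  # lowercase and normalize whitespace
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def train_profile(sample):
    """Map each n-gram in the training sample to its relative frequency."""
    counts = Counter(char_ngrams(sample))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def identify(text, profiles):
    """Add up, per language, the profile frequency of every n-gram in text
    and return the language with the highest cumulative sum."""
    scores = {lang: 0.0 for lang in profiles}
    for gram in char_ngrams(text):
        for lang, profile in profiles.items():
            scores[lang] += profile.get(gram, 0.0)
    return max(scores, key=scores.get)


# Toy training data (hypothetical); a real system needs far larger samples.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "nl": "de snelle bruine vos springt over de luie hond en de kat "
          "zat op de mat met de andere dieren",
}
PROFILES = {lang: train_profile(sample) for lang, sample in SAMPLES.items()}

print(identify("the dog and the cat", PROFILES))
print(identify("de hond en de kat", PROFILES))
```

With realistic training corpora the profiles would be built once and stored, so that identification only involves dictionary lookups and additions.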

Because the language can vary within a single letter, a language is assigned to each paragraph rather than to the letter as a whole.
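
Paragraph-level assignment might look roughly like the Python sketch below. It assumes paragraphs are separated by blank lines, and it uses a deliberately simple stop-word counter (`toy_identify`, a hypothetical stand-in) in place of the actual N-gram identifier; any function that maps a string to a language code could be passed in instead.

```python
import re

# Hypothetical, tiny stop-word lists used only by the stand-in identifier.
TOY_STOP_WORDS = {
    "en": {"the", "and", "of"},
    "nl": {"de", "en", "het"},
}


def toy_identify(paragraph):
    """Stand-in identifier: pick the language with the most stop-word hits."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    scores = {
        lang: sum(word in stop_words for word in words)
        for lang, stop_words in TOY_STOP_WORDS.items()
    }
    return max(scores, key=scores.get)


def assign_paragraph_languages(letter_text, identify=toy_identify):
    """Split a letter on blank lines and label each paragraph separately."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", letter_text)]
    return [(identify(p), p) for p in paragraphs if p]


letter = (
    "The ship arrived and the cargo was sold.\n"
    "\n"
    "De brief en het pakket zijn verzonden."
)
for lang, paragraph in assign_paragraph_languages(letter):
    print(lang, "|", paragraph)
```

Labeling each paragraph independently means a letter that opens in one language and continues in another gets language-appropriate preprocessing for both parts.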

User Interface

Language identification runs as a preprocessing step for the analysis techniques. The user interface does not show the assigned language.