Topic Modeling

Introduction

Topic modeling is a statistical approach to content extraction: the major methods identify hidden variables that can be interpreted as “topics”. In the CKCC project three methods were considered: latent Dirichlet allocation (LDA), latent semantic analysis (LSA), and random indexing (RI). Each of these approaches is derived from the so-called word space model. Intuitively, if text fragments of two documents address similar topics, they are likely to share many relevant terms. Conversely, if two terms occur together in many documents, the terms are likely to be related. An overview of the three methods is given in [Wittek, P. and Ravenek, W. (2011)].

As an aid to selecting a topic modeling approach, we performed an experiment in which we assessed the ability of the three methods to reproduce topic labels assigned by human experts to a randomly selected subset of letters from our corpus [Wittek, P. and Ravenek, W. (2011)]. The outcome was that RI performed best at reproducing the labels assigned by the experts. From a computational point of view, RI has the benefit that it does not rely on computationally intensive matrix operations, as LSA does. Instead, RI builds an incremental word space model that scales very well with increasing corpus size. For these reasons we decided to adopt random indexing as the topic modeling method in the ePistolarium.
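The incremental construction that makes RI scale well can be sketched as follows. This is a minimal illustration of the technique, not the implementation used in the ePistolarium; the dimensionality, the number of non-zero seeds, and the context window size are illustrative values.

```python
import random
from collections import defaultdict

DIM = 512    # dimensionality of the word space (illustrative value)
SEEDS = 8    # number of +1/-1 entries per index vector (illustrative value)

def index_vector(word, dim=DIM, seeds=SEEDS):
    """Sparse random index vector, generated deterministically from the word."""
    rng = random.Random(word)
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), seeds):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec

def build_word_vectors(documents, window=2):
    """Incrementally accumulate context vectors: every time a word occurs,
    add the index vectors of its neighbours within the window. No matrix
    decomposition is needed, so the model grows with the corpus."""
    vectors = defaultdict(lambda: [0.0] * DIM)
    for tokens in documents:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    neighbour = index_vector(tokens[j])
                    target = vectors[word]
                    for k in range(DIM):
                        target[k] += neighbour[k]
    return vectors
```

Because each document only adds contributions to the affected context vectors, new letters can be folded into the model without recomputing it from scratch.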


Implementation

For the random indexing method employed in the ePistolarium we use the implementation provided by the Semantic Vectors project. One of the strong points of this implementation is that it constructs the topic model from a Lucene index. Lucene is the de facto standard for text searching, and by basing the topic model calculation on this technology a range of text (pre)processing techniques becomes available. In our implementation we apply the following operations to the tokenized text:

  • Conversion to lower case;
  • Removal of tokens with digits;
  • Removal of tokens with length less than 3 characters;
  • Removal of diacritical marks;
  • Spelling normalization;
  • Removal of stop words;
  • Stemming (for Latin).

The first four operations are generic; the last three are language dependent. Therefore, language identification is required as the first step in the preprocessing of the letters.
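The generic operations and the language-dependent substitutions listed above can be sketched as a simple token filter. This is only an illustration: the stop word set and spelling-normalization pairs below are made-up placeholders, not the project's actual resources, and Latin stemming is omitted.

```python
import unicodedata

STOP_WORDS = {"de", "het", "een", "et", "la"}   # illustrative fragment only
SPELLING = {"sijn": "zijn"}                     # illustrative normalization pairs

def strip_diacritics(token):
    """Decompose to NFD and drop the combining marks."""
    return "".join(c for c in unicodedata.normalize("NFD", token)
                   if not unicodedata.combining(c))

def preprocess(tokens):
    out = []
    for tok in tokens:
        tok = tok.lower()                        # 1. conversion to lower case
        if any(ch.isdigit() for ch in tok):      # 2. drop tokens with digits
            continue
        if len(tok) < 3:                         # 3. drop tokens shorter than 3 chars
            continue
        tok = strip_diacritics(tok)              # 4. remove diacritical marks
        tok = SPELLING.get(tok, tok)             # 5. spelling normalization
        if tok in STOP_WORDS:                    # 6. stop word removal
            continue
        out.append(tok)                          # (7. stemming for Latin omitted)
    return out
```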

The topic model is constructed from the text of the three major languages in the corpus – Dutch, French, and Latin – taken together in a single model. Each topic almost exclusively contains words from a single language, with little coupling between words from the various languages. The coupling that does occur is due to multilingual text within paragraphs.


User Interface

In the ePistolarium the results of topic modeling are used to enhance the exploration of the CKCC corpus in three different ways. The common feature of these functionalities is the use of the calculated similarity between texts.


Similar letters

The topic model allows one to calculate the similarity between two letters on a scale from 0% to 100%. Thus, one can find the letters most similar to any given letter in the corpus. We performed this calculation for every letter and show the results at the bottom of each letter text.
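A cosine similarity between two letter vectors, mapped onto this 0%–100% scale, could look as follows. This is a sketch of the general idea; the clamping of negative cosines to 0% is an assumption of this illustration, not a documented detail of the ePistolarium.

```python
import math

def similarity_percent(v1, v2):
    """Cosine similarity between two letter vectors, mapped to 0-100%.
    Negative cosines are clamped to 0 (an assumption of this sketch)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    if norm == 0:
        return 0.0
    return max(0.0, dot / norm) * 100.0
```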

After clicking on an item in this list, you are presented with a side-by-side view of the two letter texts.


Similarity search

Apart from calculating the similarity between letters, one can also calculate the similarity between an arbitrary text and the letters in the corpus. We have implemented a feature, called similarity search, that allows users to perform such a calculation dynamically.

After clicking the “similarity search” tab in the upper left part of the main page of the ePistolarium, you see a box in which text can be typed or pasted. By clicking the “Search” button you start the calculation for the entered query text. Results are presented in a list (currently limited to five items) in the same way as for the faceted search.

As part of the calculation, the language of the query text is detected and the same preprocessing is applied as was used for the letters in the corpus when the topic model was constructed. Words in the processed text that do not occur in the topic model do not contribute to the calculated similarities.
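The contribution rule described above – unknown words are simply skipped when building the query vector – can be sketched as follows. The function names and the top-5 ranking step are illustrative, not taken from the actual implementation.

```python
import math

def cosine(v1, v2):
    """Plain cosine similarity; 0 for a zero-norm vector."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / n if n else 0.0

def query_vector(tokens, word_vectors, dim):
    """Sum the model vectors of the query tokens; out-of-vocabulary
    tokens make no contribution."""
    vec = [0.0] * dim
    for tok in tokens:
        wv = word_vectors.get(tok)
        if wv is None:
            continue            # token unknown to the topic model: skipped
        for k in range(dim):
            vec[k] += wv[k]
    return vec

def similarity_search(tokens, word_vectors, letter_vectors, dim, top_n=5):
    """Rank letters by similarity to the query and return the top results."""
    q = query_vector(tokens, word_vectors, dim)
    ranked = sorted(letter_vectors,
                    key=lambda lid: cosine(q, letter_vectors[lid]),
                    reverse=True)
    return ranked[:top_n]
```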

The user has the option to choose the unit of analysis: either the whole letter or the individual paragraphs of the letter. In both cases the opening and closing sections of the letters are excluded from the analysis.


Search suggestions

The topic model allows one to calculate the similarity between individual words. We used this to implement search suggestions.

After entering one or more words in the full text search box, you can click the “Give suggestions” link immediately below it. You are then presented with a list of words from the corpus that have the largest calculated similarity to the query. By clicking on a word in the list you can add it to the full text query.

Some restrictions apply:

  • If you enter a word that does not occur in the corpus, or a word in a language other than the three modeled ones (Dutch, French, and Latin), the topic model does not “know” the word and no suggestions can be calculated.
  • If you enter a word with a relatively low frequency in the corpus, its relations with other words are less significant, and the suggestions will be less reliable.
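The suggestion mechanism, including the first restriction above (no suggestions when no query word is known to the model), can be sketched as follows. The function name and the limit of five suggestions are illustrative assumptions.

```python
import math

def cosine(v1, v2):
    """Plain cosine similarity; 0 for a zero-norm vector."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / n if n else 0.0

def suggestions(query_words, word_vectors, top_n=5):
    """Return the corpus words most similar to the query; return an empty
    list when none of the query words occur in the model."""
    known = [w for w in query_words if w in word_vectors]
    if not known:
        return []               # model does not "know" any query word
    dim = len(next(iter(word_vectors.values())))
    q = [sum(word_vectors[w][k] for w in known) for k in range(dim)]
    scored = [(w, cosine(q, v)) for w, v in word_vectors.items()
              if w not in known]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [w for w, _ in scored[:top_n]]
```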