We performed a keyword analysis of the letters in the CKCC corpus to identify, for each letter, the words that occur significantly more often in that letter than in the corpus as a whole. Such keywords may indicate what distinguishes a letter from the rest of the corpus.
The analysis is based on frequency profiling: we build a word frequency profile for each individual letter and compare it with the corresponding profile of the full letter collection, which serves as the reference corpus. Keywords are selected with a log-likelihood estimator, using a significance threshold of 99% confidence. The analysis is carried out for each of the three main languages: Dutch, French and Latin. We use the language assignments of paragraphs (not of full letters) as obtained with our language identification software.
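The following is a minimal sketch of how such a log-likelihood keyword test can be computed; it is an illustration under stated assumptions, not the code used for the CKCC corpus. It assumes the two-corpus log-likelihood statistic commonly used in corpus linguistics, compared against the chi-square critical value of 6.63 (1 degree of freedom, p < 0.01) as the 99% threshold. The function names, the tokenised input, and the restriction to overused words are hypothetical choices made for the example.

```python
import math
from collections import Counter

# Chi-square critical value for 1 degree of freedom at p < 0.01 (99% confidence).
CRITICAL_99 = 6.63


def log_likelihood(a, b, c, d):
    """Log-likelihood statistic for one word.

    a: frequency of the word in the letter
    b: frequency of the word in the reference corpus
    c: total number of tokens in the letter
    d: total number of tokens in the reference corpus
    """
    # Expected frequencies if the word were equally common in both texts.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(letter_tokens, reference_counts, reference_size, threshold=CRITICAL_99):
    """Return (word, score) pairs for words overused in the letter, best first.

    Assumption: only words that are relatively MORE frequent in the letter
    than in the reference are kept; underused words are ignored.
    """
    letter_counts = Counter(letter_tokens)
    c = len(letter_tokens)
    result = []
    for word, a in letter_counts.items():
        b = reference_counts.get(word, 0)
        if a / c <= b / reference_size:
            continue  # not overused in the letter
        score = log_likelihood(a, b, c, reference_size)
        if score >= threshold:
            result.append((word, score))
    return sorted(result, key=lambda pair: -pair[1])


# Toy data, invented purely for illustration.
letter = ["comet"] * 10 + ["the"] * 90
reference = {"comet": 10, "the": 90000, "of": 9990}
found = keywords(letter, reference, reference_size=100000)
print([word for word, _ in found])
```

In this toy example "the" has the same relative frequency in the letter and in the reference, so it is not selected, while "comet" is strongly overrepresented in the letter and is. To follow the per-language setup described above, one would run such a test on the tokens of one language at a time, against a reference profile for that same language.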
Our approach is similar to the one implemented in Paul Rayson's Wmatrix corpus analysis tool and in the WordSmith toolkit.
User Interface
At the bottom of each letter we display the keywords obtained from the analysis. The user can click an individual keyword to highlight its occurrences in the letter text.