One of the analyses performed on the corpus is Named Entity Recognition (NER) of person names. We use a semi-automatic, rule-based approach for both the recognition and the identification of the names.
In the table below we show the current status (as of March 8, 2013) of the recognition and identification of person names. Recognized names are annotated in the letter texts; identified names are used in the analysis.
Correspondence | Recognized | Identified |
Barlaeus | 3,073 | 1,460 |
Beeckman | 139 | 120 |
Descartes | 4,096 | 3,917 |
De Groot | 76,790 | 57,286 |
Chr. Huygens | 21,647 | 17,411 |
Const. Huygens | 17,354 | 12,632 |
Van Leeuwenhoek | 899 | 869 |
Van Nierop | 394 | 326 |
Swammerdam | 567 | 532 |
CKCC corpus (total) | 124,959 | 94,553 |
Approach
We use an iterative, rule-based approach to build gazetteers (lookup lists of names), which are extended with hand-annotated names and, where available, names from the indexes of book editions. Further names are generated by applying rules to latinized names: for instance, if “Grotius” and “Grotio” occur, the forms “Grotium” and “Grotii” are generated as well. For the actual matching, the well-known Aho-Corasick algorithm is applied to a normalized representation of the gazetteers and the letter texts.
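To make the steps above concrete, the following Python sketch generates Latin case forms from a small gazetteer, normalizes both gazetteer and text, and matches with a compact Aho-Corasick automaton. It is an illustration under assumptions, not the project's actual code: the ending list `LATIN_ENDINGS`, the rule "at least two observed endings per stem", the normalization (diacritics stripped, case folded), and the whole-word filter are all hypothetical choices that the text does not specify.

```python
import unicodedata
from collections import deque

# Assumed endings for latinized names; the real rule set is not given in the text.
LATIN_ENDINGS = ("us", "um", "i", "o")

def latin_variants(names):
    """Add all assumed case forms for every stem seen with at least two endings."""
    stems = {}
    for name in names:
        for ending in LATIN_ENDINGS:
            if name.endswith(ending):
                stems.setdefault(name[: -len(ending)], set()).add(ending)
                break
    generated = set(names)
    for stem, seen in stems.items():
        if len(seen) >= 2:
            generated.update(stem + ending for ending in LATIN_ENDINGS)
    return generated

def normalize(text):
    """Assumed normalization: strip combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

class AhoCorasick:
    """Minimal Aho-Corasick automaton over a set of patterns."""

    def __init__(self, patterns):
        self.goto = [{}]   # trie transitions per node
        self.fail = [0]    # failure link per node
        self.out = [[]]    # patterns ending at each node
        for pattern in set(patterns):
            self._add(pattern)
        self._build_failure_links()

    def _add(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(pattern)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find(self, text):
        """Yield (start offset, pattern) for every occurrence in text."""
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                yield i - len(pattern) + 1, pattern

def is_whole_word(text, start, length):
    """Assumed filter: keep only matches bounded by non-alphanumeric characters."""
    end = start + length
    before_ok = start == 0 or not text[start - 1].isalnum()
    after_ok = end == len(text) or not text[end].isalnum()
    return before_ok and after_ok

gazetteer = latin_variants({"Grotius", "Grotio"})
matcher = AhoCorasick(normalize(name) for name in gazetteer)
letter = normalize("Vidi Grotium et librum Grotii; Grotius abest.")
hits = [(s, p) for s, p in matcher.find(letter) if is_whole_word(letter, s, len(p))]
print(hits)
```

Offsets refer to the normalized text; a real pipeline would need to map them back to the original letter text before annotating it. The whole-word filter is there because a plain substring matcher would otherwise report a short name inside a longer one.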
Example
As an example of the variation of person names in 17th-century texts we consider the Swedish general Lennart Torstensson (1603-1651) who is mentioned in the Hugo de Groot correspondence. Torstensson is referred to 486 times with 39 different names:
Leonard Torstenson (1), Leonardo Torstensonio (2), Leonhardo Torstens (1), Leonhardo Torstensonio (1), Leonhardt Torstenson (1), Leonhardum Torstensonium (1), Leonhardus Torstensohnius (1), Linnhardt Torstensohn (1), Torsenson (8), Torsonson (7), Torstensohn (2), Torstensohnius (1), Torstenson (158), Torstensoni (1), Torstensoniana (1), Torstensonianis (1), Torstensonii (73), Torstensonio (50), Torstensonium (31), Torstensonius (104), Torstensono (1), Torstensons (6), Torstensonse (1), Torstensonss (1), Torstensonum (1), Torstensoon (9), Torstensoons (1), Torstenssonius (1), Torstenssons (1), Torstenton (1), Torstenzon (2), Torstenzoon (2), Tortensoon (1), Tortenston (1), Tosterson (2), de Torsenson (1), de Torstenson (6), de Torstensoon (2), van Torstensoon (1).
Note that “Torstensson” is a relatively easy name, because it is unique within the correspondence, and because it does not refer to a location.
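The difference between an easy and a hard name can be sketched as a lookup from surface form to candidate persons: a form is identified only when it has exactly one candidate. The Python fragment below is a minimal illustration, not the project's implementation. The person identifiers are hypothetical, and the ambiguous “Huygens” entry is our own example of a surname shared by two correspondents in the corpus.

```python
from collections import defaultdict

# Hypothetical gazetteer entries: (surface form, person identifier).
GAZETTEER = [
    ("Torstenson", "torstensson_lennart"),
    ("Torstensonius", "torstensson_lennart"),
    ("de Torstensoon", "torstensson_lennart"),
    ("Huygens", "huygens_christiaan"),
    ("Huygens", "huygens_constantijn"),
]

def build_index(entries):
    """Map each case-folded surface form to the set of persons it may denote."""
    index = defaultdict(set)
    for form, person in entries:
        index[form.casefold()].add(person)
    return index

def identify(form, index):
    """Return the person identifier if the form is unambiguous, else None."""
    candidates = index.get(form.casefold(), set())
    return next(iter(candidates)) if len(candidates) == 1 else None

index = build_index(GAZETTEER)
print(identify("Torstensonius", index))  # single candidate, so identified
print(identify("Huygens", index))        # two candidates, so left unidentified
```

In this sketch an ambiguous form stays recognized but unidentified, which is one way the two counts reported for each correspondence could diverge.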
User Interface
In the ePistolarium the recognized person names are used in three different ways: