Named Entity Recognition

One of the analyses performed on the corpus is Named Entity Recognition (NER) of person names. We use a semi-automatic, rule-based approach for both the recognition and the identification of the names.

The table below shows the status, as of March 8, 2013, of the recognition and identification of person names. Recognized names are annotated in the letter texts; identified names are used in the analysis.

 

Correspondence        Recognized   Identified
Barlaeus                   3,073        1,460
Beeckman                     139          120
Descartes                  4,096        3,917
De Groot                  76,790       57,286
Chr. Huygens              21,647       17,411
Const. Huygens            17,354       12,632
Van Leeuwenhoek              899          869
Van Nierop                   394          326
Swammerdam                   567          532
CKCC corpus (total)      124,959       94,553

 

Approach

We use an iterative, rule-based approach to build gazetteers (lookup lists of names), which are extended with hand-annotated names and, where available, names from the indexes of book editions. Further names are generated by applying rules to latinized names: for instance, if “Grotius” and “Grotio” occur, the forms “Grotium” and “Grotii” are generated as well. The actual matching uses the Aho-Corasick algorithm on a normalized representation of the gazetteers and the letter texts.
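The following Python sketch illustrates the two steps described above: generating Latin case forms and matching with Aho-Corasick. It is not the project's actual code. The single expansion rule (a stem attested with both -us and -o receives all four second-declension endings), the lowercase-only `normalize`, the word-boundary check, and the person identifiers `groot` and `torstensson` are all assumptions made for illustration.

```python
from collections import deque

# Second-declension endings: nominative, genitive, dative/ablative, accusative.
LATIN_ENDINGS = ("us", "i", "o", "um")


def expand_latinized(forms):
    """Generate missing case forms for stems attested with both -us and -o."""
    forms = set(forms)
    generated = set()
    for form in forms:
        if form.endswith("us") and form[:-2] + "o" in forms:
            generated.update(form[:-2] + ending for ending in LATIN_ENDINGS)
    return forms | generated


def normalize(text):
    """Assumed normalization: lowercasing only (the real one may do more)."""
    return text.lower()


def build_automaton(patterns):
    """Build an Aho-Corasick automaton from {normalized form: person id}."""
    goto, fail, output = [{}], [0], [[]]
    for pattern, person in patterns.items():
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto[state][ch] = len(goto)
                goto.append({})
                fail.append(0)
                output.append([])
            state = goto[state][ch]
        output[state].append((len(pattern), person))
    # Breadth-first pass: set failure links and inherit outputs along them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_names(text, automaton):
    """Return (start, end, person) for every whole-word gazetteer match."""
    goto, fail, output = automaton
    hits, state = [], 0
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for length, person in output[state]:
            start, end = pos - length + 1, pos + 1
            before_ok = start == 0 or not text[start - 1].isalnum()
            after_ok = end == len(text) or not text[end].isalnum()
            if before_ok and after_ok:
                hits.append((start, end, person))
    return hits


# Hypothetical gazetteer: person id -> attested name forms.
gazetteer = {
    "groot": {"Grotius", "Grotio"},
    "torstensson": {"Torstensonius", "Torstensonio"},
}
patterns = {
    normalize(form): person
    for person, forms in gazetteer.items()
    for form in expand_latinized(forms)
}
automaton = build_automaton(patterns)
hits = find_names(normalize("Epistola Grotii ad Torstensonium."), automaton)
print(hits)
```

The offsets in `hits` refer to the normalized text. In this sketch, "Grotii" and "Torstensonium" are found even though neither form was listed in the gazetteer, because both were generated from the attested -us and -o forms.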

 

Example

As an example of the variation of person names in 17th-century texts, consider the Swedish general Lennart Torstensson (1603–1651), who is mentioned in the Hugo de Groot correspondence. Torstensson is referred to 486 times with 39 different names:

Leonard Torstenson (1), Leonardo Torstensonio (2), Leonhardo Torstens (1), Leonhardo Torstensonio (1), Leonhardt Torstenson (1), Leonhardum Torstensonium (1), Leonhardus Torstensohnius (1), Linnhardt Torstensohn (1), Torsenson (8), Torsonson (7), Torstensohn (2), Torstensohnius (1), Torstenson (158), Torstensoni (1), Torstensoniana (1), Torstensonianis (1), Torstensonii (73), Torstensonio (50), Torstensonium (31), Torstensonius (104), Torstensono (1), Torstensons (6), Torstensonse (1), Torstensonss (1), Torstensonum (1), Torstensoon (9), Torstensoons (1), Torstenssonius (1), Torstenssons (1), Torstenton (1), Torstenzon (2), Torstenzoon (2), Tortensoon (1), Tortenston (1), Tosterson (2), de Torsenson (1), de Torstenson (6), de Torstensoon (2), van Torstensoon (1).

Note that “Torstensson” is a relatively easy name to handle, because it is unique within the correspondence and does not also refer to a location.

 

User Interface

In the ePistolarium the recognized person names are used in three different ways:

  • The names are marked up in the letter texts. For identified names, the normalized person name is shown when the user moves the mouse pointer over the name as it occurs in the text.
  • In the faceted search, the “Named Persons” facet allows you to select letters that refer to particular persons, irrespective of the spelling of the name.
  • The names in the current result set are used for a dynamic co-citation analysis; see the section on visualizations.
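
As a rough illustration of the co-citation idea in the last item, the Python sketch below counts, for each pair of identified persons, the number of letters in a result set that mention both. It is an assumption that this is the quantity the analysis is based on; the function name `cocitation_counts` and the sample letters are invented for the example.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(letters):
    """Count, for each pair of persons, the letters that mention both."""
    counts = Counter()
    for persons in letters:
        # Sort so that each pair always appears in the same order.
        for pair in combinations(sorted(set(persons)), 2):
            counts[pair] += 1
    return counts


# Invented sample data: identified persons per letter in a result set.
result_set = [
    {"Descartes", "Beeckman", "Const. Huygens"},
    {"Descartes", "Const. Huygens"},
    {"Beeckman"},
]
counts = cocitation_counts(result_set)
for pair, n in counts.most_common():
    print(pair, n)
```

Because the counts are computed from whatever letters are passed in, rerunning the function on a new result set would give the dynamic behaviour described above.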