Name Variant Matching

The Name Variant Matching module matches entries from the Name Variant Database against the document text. Each match is then marked as an entity with the type and id stored in the database.
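The matching step can be illustrated with a minimal sketch. It is not the module's actual implementation: the function name `find_entities`, the `Match` record, and the dictionary layout (variant string mapped to an entity type and id) are assumptions made for illustration, and plain case-sensitive substring matching stands in for whatever tokenisation the real module applies.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class Match(NamedTuple):
    """One marked entity in a document (hypothetical record layout)."""
    start: int
    end: int
    text: str
    entity_type: str
    entity_id: int


def find_entities(text: str, variants: Dict[str, Tuple[str, int]]) -> List[Match]:
    """Mark every occurrence of a known variant in `text`.

    `variants` maps a variant string to (entity type, id), standing in
    for a lookup against the Name Variant Database.
    """
    if not variants:
        return []
    # Longer variants come first so the longest variant wins at each position.
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    matches = []
    for m in pattern.finditer(text):
        entity_type, entity_id = variants[m.group(0)]
        matches.append(Match(m.start(), m.end(), m.group(0), entity_type, entity_id))
    return matches
```

Called with a dictionary that maps several spellings to the same id, the function returns one `Match` per occurrence, each carrying that shared id. Because the sketch matches substrings, it would also match a variant inside a longer word; a production matcher would need boundary handling suited to each script.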

Name Variant Database

The Name Variant Database contains entities of various types (e.g. Person, Organisation, etc.). It is amended each time the Entity Normalisation process finds a new entity.

Note: The initial Name Variant Database is created automatically from the EMM NewsBrief system. Therefore, the quality of the entries may vary.

Each possible spelling of an entity's name is called a Name Variant (“variant” for short). Since a person entity can have many different spellings of its name, the variants are grouped into a so-called Name Variant Profile (“profile” for short). The name of a profile is taken from one of its variants, which is called the canonical variant.

For example, the profile for “Franz Beckenbauer” (a former German soccer player) contains several variants, including misspellings of his name and spellings in other scripts:

  • Franz Beckenbauer (canonical variant)

  • Franz Beckenabuer

  • Franz Beckenbaur

  • פרנץ בקנבאואר

  • 贝肯鲍尔

The profile is named “Franz Beckenbauer” after the canonical variant. Each profile has a unique id in the system, so every variant belonging to the same profile is assigned the same profile id and represents the same entity. The Name Variant Database is amended automatically by the Entity Normalisation module, which tries to identify variants that belong to the same profile. In addition, the profiles and variants in the database can be edited manually.
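The relationship between profiles, variants, and profile ids can be sketched as a small data model. The class and method names below (`Profile`, `NameVariantDatabase`, `add_variant`, `lookup`) are hypothetical and do not reflect the system's real schema. The sketch also assumes that each variant belongs to exactly one profile, which the text above does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Profile:
    """A Name Variant Profile: one entity with all known spellings."""
    profile_id: int
    entity_type: str          # e.g. "Person" or "Organisation"
    canonical: str            # the variant that gives the profile its name
    variants: Set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # The canonical variant is itself one of the profile's variants.
        self.variants.add(self.canonical)


class NameVariantDatabase:
    """In-memory stand-in for the database (illustrative only)."""

    def __init__(self) -> None:
        self._profiles: Dict[int, Profile] = {}
        self._by_variant: Dict[str, int] = {}

    def add_profile(self, profile: Profile) -> None:
        self._profiles[profile.profile_id] = profile
        for variant in profile.variants:
            self._by_variant[variant] = profile.profile_id

    def add_variant(self, profile_id: int, variant: str) -> None:
        # Covers both automatic amendment and manual editing.
        self._profiles[profile_id].variants.add(variant)
        self._by_variant[variant] = profile_id

    def lookup(self, variant: str) -> Optional[Profile]:
        profile_id = self._by_variant.get(variant)
        return None if profile_id is None else self._profiles[profile_id]
```

With this layout, looking up “Franz Beckenabuer” or “贝肯鲍尔” returns the same profile object, so both spellings resolve to one profile id and one canonical name.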