Importing a Name Variant File
The EMM OSINT Suite uses a database of named entities, mainly persons and organizations (see Name Variant Matching for more information). In some cases you may want the entity extraction process to match entities from your own database. This tutorial shows how to import your own data.
Creating an import file with custom name variants
Basic Requirements
-
The file needs to be encoded in UTF-8 (refer to Encoding a File in UTF-8)
-
The file must not have empty lines
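Both requirements can be checked before attempting an import. The following is a minimal sketch, not part of the EMM OSINT Suite: the function name check_basic_requirements is hypothetical, and treating whitespace-only lines as empty is an assumption of this sketch.

```python
from pathlib import Path


def check_basic_requirements(path):
    """Return a list of problems found; an empty list means both checks passed."""
    raw = Path(path).read_bytes()

    # Requirement 1: the file must decode cleanly as UTF-8.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8 (byte offset {exc.start})"]

    # Requirement 2: no empty lines. Whitespace-only lines are also
    # reported here, which is an assumption of this sketch.
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append(f"line {number} is empty")
    return problems
```

Calling check_basic_requirements on a candidate file and getting an empty list back suggests the file meets the two basic requirements; it does not validate the column layout.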
Format of the import file:
The file is a TSV (Tab-Separated Values) file and should contain four columns:
-
KEY. The import process ignores the values in this column because it assigns its own primary key to each new entity. The column must nevertheless be present as the first column of the TSV file, even though its values are discarded.
-
PID (Profile Identification). It is the identification value of an entity. This numerical value is very important. For variants (different names for the same entity) that belong to the same entity, this value must be identical. The first occurrence found in the TSV file will be considered as the canonical entity (original name form of the entity) and the following ones as variants of this canonical form (see the example below).
-
TYPE. It is the type of the entity. OSINT accepts four main entity types:
-
o, for organizations
-
p, for persons
-
t, for toponyms (locations)
-
u, for unknown types of entities
-
VARIANT. The name variant of the entity. Matching is not exact; the variant is matched against the text according to a set of rules.
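The four-column layout described above could be generated with a short script. This is only a sketch under stated assumptions: the function name write_name_variants and the input structure are hypothetical, KEY is filled with a running counter purely as a placeholder (the import discards it), and no header line is written because the source does not say whether one is expected.

```python
ALLOWED_TYPES = {"o", "p", "t", "u"}


def write_name_variants(path, entities):
    """Write (pid, type, [variants]) tuples as KEY, PID, TYPE, VARIANT rows.

    The first variant listed for each entity is written first, so it is
    the one the import would treat as the canonical form.
    """
    key = 1  # placeholder value; the import assigns its own keys
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for pid, entity_type, variants in entities:
            if entity_type not in ALLOWED_TYPES:
                raise ValueError(f"unknown entity type: {entity_type!r}")
            for variant in variants:
                # A tab or line break inside a variant would corrupt the row.
                if "\t" in variant or "\n" in variant or not variant.strip():
                    raise ValueError(f"invalid variant: {variant!r}")
                fields = [str(key), str(pid), entity_type, variant]
                handle.write("\t".join(fields) + "\n")
                key += 1


write_name_variants(
    "name_variants.tsv",  # hypothetical output file name
    [(11, "p", ["Aaron Albert", "A. Albert"]), (41, "t", ["Milano"])],
)
```

Writing with an explicit UTF-8 encoding and one row per line also keeps the file in line with the basic requirements.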
Matching the name variant against the real text
The import file contains name variants as the fourth column. These variants are matched against the real text using the following rules:
| Rule | Description | Example |
| Lower case matches both cases | If the name variant is imported in lower case, it matches both upper and lower case in the text. | Name variant "procter and gamble" matches "Procter and Gamble" and "PROCTER AND GAMBLE". |
| Upper case matches only upper case | If the name variant contains upper-case characters, these characters match only upper case. | Name variant "Procter and Gamble" matches "Procter and Gamble" but not "procter and gamble". Name variant "PROCTER AND GAMBLE" matches only "PROCTER AND GAMBLE". |
| Some characters are ignored | A number of special characters are ignored. These characters are "." (dot), | Name variant "Procter & Gamble" matches "Procter & Gamble", "Procter - Gamble", "Procter:Gamble", |
| Using wildcard character '%' | The percent character matches zero or more characters. | Name variant "Procter%" matches "Procter & Gamble", "Procter - Gamble", "Procter:Gamble", "Procter Gamble", etc. |
| Using wildcard character '_' | The underscore character matches any single character. | Name variant "Procter_Gamble" matches "Procter&Gamble", "Procter-Gamble", "Procter:Gamble" but does not match "Procter & Gamble" (the first whitespace is taken up by the wildcard character). |
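To get a feel for how the case and wildcard rules interact, they can be approximated with regular expressions. The sketch below is an illustration only and is not how the suite implements matching: the function names are hypothetical, the ignored-characters rule is left out, and a variant is compared against a whole candidate string rather than searched for in running text.

```python
import re


def variant_to_regex(variant):
    """Translate a name variant into a regular expression (approximation)."""
    parts = []
    for ch in variant:
        if ch == "%":
            parts.append(".*")  # zero or more characters
        elif ch == "_":
            parts.append(".")   # exactly one character
        elif ch.islower():
            # A lower-case character matches both cases.
            parts.append("[" + re.escape(ch) + re.escape(ch.upper()) + "]")
        else:
            # Upper-case and all other characters must match literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches(variant, text):
    """Return True if the whole text matches the variant pattern."""
    return variant_to_regex(variant).fullmatch(text) is not None


print(matches("procter and gamble", "PROCTER AND GAMBLE"))  # True
print(matches("Procter and Gamble", "procter and gamble"))  # False
print(matches("Procter_Gamble", "Procter & Gamble"))        # False
```

Trying planned variants against expected text forms in this way can help catch a variant that is too strict (unintended upper case) or too loose (a trailing '%') before importing it.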
Here is an example of a TSV file used for importing new name variants into OSINT:
| Key | Pid | Type | Variant |
| 2 | 11 | p | Aaron Albert |
| 3 | 11 | p | A. Albert |
| 4 | 11 | p | A. M. Albert |
| 5 | 21 | o | Chad Calvin Christian |
| 6 | 21 | o | CCC |
| 7 | 21 | o | C.C. Christian |
| 8 | 41 | t | Milano |
| 9 | 61 | u | Harold Hugh |
| 10 | 61 | u | Henry Hugh |
In this example, the entity Aaron Albert (a person) has a PID value of 11. The first occurrence is the canonical (original) form of that entity, and the subsequent rows with the same PID (A. Albert, A. M. Albert) are treated as variants of that canonical form. All of these rows represent the same real-world entity, the person Aaron Albert. Another example is the organization Chad Calvin Christian (PID 21), which has one canonical form and two variants (CCC, C.C. Christian). Finally, the toponym Milano (PID 41) has only its canonical form, and the entity of unknown type Harold Hugh (PID 61) has a canonical form and one variant (Henry Hugh).
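The grouping rule described here (rows sharing a PID belong to one entity, and the first row is the canonical form) can be sketched as follows. The function name group_by_pid and the dictionary layout are hypothetical and serve only to illustrate how the rows are interpreted; the rows are assumed to be tab-separated data lines without a header.

```python
def group_by_pid(lines):
    """Map each PID to its type, canonical form, and remaining variants."""
    entities = {}
    for line in lines:
        # KEY is read but ignored, mirroring what the import does with it.
        _key, pid, entity_type, variant = line.rstrip("\n").split("\t", 3)
        if pid not in entities:
            # First occurrence of a PID: this row is the canonical form.
            entities[pid] = {"type": entity_type, "canonical": variant, "variants": []}
        else:
            entities[pid]["variants"].append(variant)
    return entities


rows = [
    "2\t11\tp\tAaron Albert",
    "3\t11\tp\tA. Albert",
    "4\t11\tp\tA. M. Albert",
    "8\t41\tt\tMilano",
]
entities = group_by_pid(rows)
print(entities["11"]["canonical"])  # Aaron Albert
print(entities["11"]["variants"])   # ['A. Albert', 'A. M. Albert']
print(entities["41"]["variants"])   # []
```

Running such a check on a file before import makes it easy to spot a PID that was reused by mistake, since unrelated names would then appear as variants of the same entity.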
Importing the Name Variant File
Perform the following steps to import the name variant file:
-
Open the EMM OSINT Suite and, in the main menu, click File > Import > Entity Extraction > Import Name Variant Database File.
-
Click Browse and select the TSV file to import, then click Finish to perform the import.
Finally, the system imports the new database and prints a status message in the Console view.