Adding a Custom Entity Type
The system lets you add custom entity types to the predefined types that ship with the software (e.g. person, organization, location). A custom entity type extracts data that the predefined types do not cover.
This tutorial describes how to add a custom entity type for Swedish number plates.
Introduction
Swedish vehicle registration plates are used for most types of vehicles and consist of three letters followed by three digits. The combination is simply a serial and has no connection with a geographic location. Vehicles like police cars, fire trucks, public buses and trolley buses use the same type of plate as normal private cars and cannot be distinguished by the plate alone. Military vehicles have special plates.
The only information encoded in the plate itself is when the vehicle must undergo inspection, which is denoted by the last digit of the plate.
| Last Digit | Inspection Month | Inspection Period |
| 1 | January | November-March |
| 2 | February | December-April |
| 3 | March | January-May |
| 4 | April | February-June |
| 5 | July | May-September |
| 6 | August | June-October |
| 7 | September | July-November |
| 8 | October | August-December |
| 9 | November | September-January |
| 0 | December | October-February |
All letters in the Swedish alphabet are used, except the letters I, Q, V, Å, Ä and Ö. In addition, 91 letter combinations are not used since they may be politically offensive or otherwise unsuitable.
(Source: Wikipedia: Vehicle Registration Plates of Sweden )
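The inspection table above can be expressed as a simple lookup on the last digit of the plate. The following is only an illustrative sketch in plain Java; the class and method names (PlateInspection, inspectionMonth) are hypothetical and are not part of the system.

```java
import java.util.Map;

public class PlateInspection {
    // Last digit -> inspection month, copied from the table above.
    private static final Map<Character, String> INSPECTION_MONTH = Map.of(
        '1', "January", '2', "February", '3', "March", '4', "April",
        '5', "July", '6', "August", '7', "September", '8', "October",
        '9', "November", '0', "December");

    /** Returns the inspection month for a plate such as "WNF766" (hypothetical helper). */
    public static String inspectionMonth(String plate) {
        char lastDigit = plate.charAt(plate.length() - 1);
        String month = INSPECTION_MONTH.get(lastDigit);
        if (month == null) {
            throw new IllegalArgumentException("Plate does not end in a digit: " + plate);
        }
        return month;
    }

    public static void main(String[] args) {
        System.out.println(inspectionMonth("WNF766")); // prints "August"
    }
}
```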
To add a custom entity type that recognizes Swedish number plates, perform the following steps:
- Create a new custom entity type definition file in the Entity Extraction folder.
- Edit the new custom entity type definition file.
Creating a new custom entity type definition file
To create a new custom entity type definition, copy the type-template.xml file to the Active Entities folder.
The Active Entities folder contains all active custom entity type definitions. You can move them to the Available Entities folder to temporarily deactivate them.
Inside the Active Entities folder, rename the newly copied file:
- Right-click the type-template.xml file, then select Rename...
- Enter the name number-plates.xml and confirm.
Editing the new custom entity definition file
In the Active Entities folder:
- Right-click the number-plates.xml file, then click Open With > Text Editor.
The file opens in a text editor in the editor area. It contains many comments that explain how to fill in the different tags.
Defining the Entity Type
In the text editor, navigate to just after the <declaration></declaration> section and add a type entry for the new entity type as follows:
<type id="pn" description="number plate"/>
Defining the Pattern to match the entity
In the text editor navigate to the <expressions> tag and add a new <expression></expression> child tag to hold the pattern definition as follows:
<expression>
<regex><![CDATA[[A-ZÅÄÖ]{3}[ \-]?<000-999>]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
This pattern will match three uppercase letters, optionally followed by a space or a dash, followed by a number between 000 and 999. This is a list of example terms that would match the above pattern:
WNF766
WNF 766
WNF-766
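To try the pattern outside the system, it can be approximated with the standard java.util.regex package. This is a sketch under one assumption: the numeric range <000-999> is not java.util.regex syntax, so it is replaced here by [0-9]{3}. The class name PlatePatternCheck is hypothetical.

```java
import java.util.regex.Pattern;

public class PlatePatternCheck {
    // Approximation of the tutorial pattern for java.util.regex.
    // Assumption: [0-9]{3} stands in for the numeric range <000-999>.
    // The Swedish letters are written as Unicode escapes so the source
    // compiles regardless of the file encoding.
    private static final Pattern PLATE =
        Pattern.compile("[A-Z\u00C5\u00C4\u00D6]{3}[ \\-]?[0-9]{3}");

    /** Returns true when the whole term looks like a Swedish number plate. */
    public static boolean isPlate(String term) {
        return PLATE.matcher(term).matches();
    }

    public static void main(String[] args) {
        String[] terms = {"WNF766", "WNF 766", "WNF-766", "wnf766", "WNF_766"};
        for (String term : terms) {
            System.out.println(term + " -> " + isPlate(term));
        }
    }
}
```

The first three terms match; the lowercase and underscore variants do not.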
Understanding the definition of a Custom Entity Type
As shown above, the pattern (regular expression) used to recognize Swedish number plates, defined within the <regex> element, is:
[A-ZÅÄÖ]{3}[ \-]?<000-999>
This regular expression has three main parts:
| Pattern | Meaning |
| [A-ZÅÄÖ]{3} | Matches three uppercase letters (A to Z, plus the letters Å, Ä and Ö) |
| [ \-]? | Optionally (the ? symbol) followed by a space or a dash |
| <000-999> | Followed by a number between 000 and 999 (required) |
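Many regular expression testers are available on the Internet (regexpal, regexr, etc.) for trying out regular expressions. Note that the numeric range syntax <000-999> is not part of the common regular expression dialects, so such testers may need an equivalent such as [0-9]{3} instead.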
Under the hood, the system matches this expression against the text content of every file processed by the entity extraction. When a term matches, the system adds a meta tag to the meta data of the file, in the form of an XML element such as:
<emm:custom type="pn" country="sweden" name="the term that matched the pattern" pos="the position of the term that matched the pattern" id="a unique identifier for name">the term that matched the pattern</emm:custom>
In our above example, let's say we have a document which contains some text interspersed with some number plate terms as follows:
.... WNF766 ... ADE-683 ... OWA 882 ...
If the Entity Extraction process analyzes this text, it will produce the following tags to be included in the meta data of the file:
<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF766</emm:custom>
<emm:custom type="pn" country="sweden" name="ADE-683" id="8">ADE-683</emm:custom>
<emm:custom type="pn" country="sweden" name="OWA 882" id="9">OWA 882</emm:custom>
In other words, the system assigns a separate id to each distinct matched term. This becomes a problem when the same number plate is spelled in slightly different ways (for example WNF766, WNF 766 and WNF-766): the system then thinks it has found three different number plates. To overcome this problem, the matched terms need to be output in a standardised form.
Further improving the custom regular expressions
This section shows how to customize the regular expressions used for recognizing entities in the EMM-OSINT Suite.
Continuing the Swedish number plate example above, the following topics are covered:
- Standardizing the output name of the entity. This is useful when OSINT finds terms that are not spelled in exactly the same way but refer to the same entity.
- Defining capturing groups in the pattern. By default, the regular expression library used by OSINT does not support capturing groups. To work around this, set the "mode" attribute on the <regex> element.
- Using keywords as patterns. A text file containing the keywords defines the exact terms to recognize.
- Using a script to generate the patterns on-the-fly. OSINT lets you define your own Java-style scripts to recognize entities.
Standardizing the output name
The "name" output key of an entity recognized by OSINT can be standardized. Recall the expression defined to recognize Swedish number plates:
<expression>
<regex><![CDATA[[A-ZÅÄÖ]{3}[ \-]?<000-999>]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
With this definition, the output in OSINT for a text containing "... WNF766 ... WNF 766 ... WNF-766 ..." would be:
<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF 766" id="8">WNF 766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF-766" id="9">WNF-766</emm:custom>
Notice that there are three different entities with different "name" output keys. To standardize this key so that OSINT reports the same entity (WNF766, for instance) in all three cases, add the "name" output key to the definition of the pattern as follows:
<expression>
<regex><![CDATA[[A-ZÅÄÖ]{3}[ \-]?<000-999>]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
<output key="name"><![CDATA[
name = term.replaceAll("[ \\-]", "");
return name;]]>
</output>
</expression>
If the Entity Extraction module processes the text again, it now produces the following meta tags:
<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF 766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF-766</emm:custom>
As shown, now the output for the "name" key is unified for all the entities.
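The effect of the replaceAll call in the "name" output can be checked in plain Java. The sketch below reuses the same replacement expression; the class name PlateNameNormalizer and the surrounding code are hypothetical and only illustrate that the three spellings collapse into one name.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class PlateNameNormalizer {
    /** Same replacement as in the "name" output script: drop spaces and dashes. */
    public static String normalize(String term) {
        return term.replaceAll("[ \\-]", "");
    }

    public static void main(String[] args) {
        // A set keeps each distinct name only once.
        Set<String> names = new LinkedHashSet<>();
        for (String term : new String[] {"WNF766", "WNF 766", "WNF-766"}) {
            names.add(normalize(term));
        }
        System.out.println(names); // prints [WNF766]
    }
}
```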
Defining capturing groups in the pattern
By default, the regular expression library used by OSINT does not support capturing groups within the definition of the pattern. To work around this, set the "mode" attribute on the <regex> element as follows:
<expression>
<regex mode="groups"><![CDATA[([A-ZÅÄÖ]{3})[ \-]?(<000-999>)]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
Notice that the "mode" attribute is added to the <regex> element and set to "groups" in order to allow groups (delimited by parentheses) in the regular expression. The groups can then be used for further processing by referring to them as groups[1], groups[2], etc. In the example above, if OSINT finds the entity "WNF766" in the text, groups[1] refers to the string matched by the first group of the regular expression ("WNF"), while groups[2] refers to the string matched by the second group ("766"). These groups can therefore be used within a script, which can also be defined to generate patterns on-the-fly, as explained below.
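For readers unfamiliar with capturing groups, the same idea can be illustrated with the standard java.util.regex package, where group(1) and group(2) play the role of groups[1] and groups[2]. This is only an analogy in plain Java, not the OSINT API; the class name PlateGroups is hypothetical and [0-9]{3} is assumed in place of <000-999>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlateGroups {
    // Two capturing groups: the three letters and the three digits.
    // Assumption: [0-9]{3} stands in for the numeric range <000-999>.
    private static final Pattern PLATE =
        Pattern.compile("([A-Z\u00C5\u00C4\u00D6]{3})[ \\-]?([0-9]{3})");

    /** Returns {letters, digits} for a plate term, or null when it does not match. */
    public static String[] split(String term) {
        Matcher m = PLATE.matcher(term);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String[] parts = split("WNF-766");
        System.out.println(parts[0] + " / " + parts[1]); // prints "WNF / 766"
    }
}
```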
Using keywords as patterns
OSINT can use keywords as patterns by reading them from an external text file. To use this feature, set the "mode" attribute of the <regex> element to "file" as follows:
<expression>
<regex mode="file"><![CDATA[keywords.txt]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
The CDATA section holds the path to the keywords file, relative to the location from which expressions.xml is loaded. The file must be a text file in UTF-8 format. Each line that is neither empty nor a comment is treated as one keyword term; terms are case sensitive and are matched as-is (special characters need no escaping). An example of the "keywords.txt" file would be:
#comment lines begin with # and are ignored
#empty lines (as the one below) are also ignored

#if possible, the first and the last line of the file
#should be either an empty line or a comment line
#keywords start here
WNF766
wnf766
WNF-766
wnf-766
Given the "keywords" file above and the example text "... WNF766 ... WNF 766 ... WNF-766 ...", OSINT would produce two XML elements, since "WNF 766" is not listed as a keyword:
<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF-766" id="8">WNF-766</emm:custom>
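The rules for the keywords file (skip comments and empty lines, match the remaining lines literally and case sensitively) can be sketched in plain Java as follows. This is not the system's implementation: the class KeywordFileSketch is hypothetical, whitespace-only lines are assumed to count as empty, and matching is simplified to substring search.

```java
import java.util.ArrayList;
import java.util.List;

public class KeywordFileSketch {
    /** Keeps every line that is neither empty nor a comment (starting with '#'). */
    public static List<String> parseKeywords(List<String> lines) {
        List<String> keywords = new ArrayList<>();
        for (String line : lines) {
            // Assumption: whitespace-only lines are treated as empty.
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            keywords.add(line);
        }
        return keywords;
    }

    /** Returns the keywords that occur verbatim (case sensitive) in the text. */
    public static List<String> findMatches(List<String> keywords, String text) {
        List<String> found = new ArrayList<>();
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                found.add(keyword);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "#keywords start here", "", "WNF766", "wnf766", "WNF-766", "wnf-766");
        String text = "... WNF766 ... WNF 766 ... WNF-766 ...";
        System.out.println(findMatches(parseKeywords(lines), text)); // prints [WNF766, WNF-766]
    }
}
```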
Using a script to generate the patterns on-the-fly
OSINT lets you define your own Java-style scripts to recognize entities. To use this feature, set the "mode" attribute of the <regex> element to "script" as follows:
<expression>
<regex mode="script"><![CDATA[
List<String> ls = new ArrayList<String>();
ls.add("WNF[ \\-]?[0-9]{3}");
ls.add("UCD[ \\-]?[0-9]{3}");
return ls;
]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
</expression>
As can be seen, the script (in the <regex> CDATA section) is written in Java. The following applies to the "script" mode:
- The script can reference the path from which expressions.xml is loaded as "resourcespath" (a String).
- The script must return a List<String>, where each element of the list is a regular expression pattern.
For the example above, the script will recognize entities such as "WNF777", "WNF-876", "WNF 987", "UCD465", "UCD-999" or "UCD 112".
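The two patterns returned by the script can be tried out in plain Java before they are placed in the definition file. The sketch below writes them as Java string literals (with the dash escaped as \\-) and checks them with java.util.regex; the class name ScriptPatternsCheck and its methods are hypothetical, and the system's own regular expression library may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ScriptPatternsCheck {
    /** Builds a pattern list of the kind a "script" mode definition returns. */
    public static List<String> buildPatterns() {
        List<String> ls = new ArrayList<String>();
        ls.add("WNF[ \\-]?[0-9]{3}");
        ls.add("UCD[ \\-]?[0-9]{3}");
        return ls;
    }

    /** Returns true when the whole term matches at least one of the patterns. */
    public static boolean matchesAny(String term) {
        for (String regex : buildPatterns()) {
            if (Pattern.matches(regex, term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"WNF777", "WNF-876", "WNF 987", "UCD465", "UCD-999", "UCD 112", "ABC123"};
        for (String term : terms) {
            System.out.println(term + " -> " + matchesAny(term));
        }
    }
}
```

All terms listed above match except "ABC123", which starts with neither WNF nor UCD.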