Adding a Custom Entity Type

The system lets you add custom entity types to the predefined types that ship with the software (e.g. person, organization, location). A custom entity type extracts data that the predefined types do not cover.

This tutorial describes how to add a custom entity type for Swedish number plates.

Introduction

Swedish vehicle registration plates are used for most types of vehicles and consist of three letters followed by three digits. The combination is simply a serial and has no connection with a geographic location, although the last digit shows in which month the vehicle has to undergo inspection. Vehicles like police cars, fire trucks, public buses and trolley buses use the same type of plate as normal private cars and cannot be distinguished by the plate alone. Military vehicles have special plates.

The only information that can be read from the plate alone is when the vehicle must undergo inspection. The last digit of the plate denotes this:

Last Digit   Inspection Month   Inspection Period
1            January            November-March
2            February           December-April
3            March              January-May
4            April              February-June
5            July               May-September
6            August             June-October
7            September          July-November
8            October            August-December
9            November           September-January
0            December           October-February

All letters of the Swedish alphabet are used, except I, Q, V, Å, Ä and Ö. 91 letter combinations are not used since they may be politically offensive or otherwise unsuitable.

(Source: Wikipedia, Vehicle Registration Plates of Sweden)
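
The last-digit rule from the table above can be sketched in a few lines of Java. This is only an illustration: the class InspectionMonth and its method forPlate are hypothetical names and are not part of the OSINT Suite.

```java
// Hypothetical helper (not part of the OSINT Suite): maps the last digit of a
// Swedish number plate to its inspection month, following the table above.
class InspectionMonth {

    // Index = last digit of the plate, value = inspection month.
    private static final String[] MONTH_BY_DIGIT = {
        "December",  // 0
        "January",   // 1
        "February",  // 2
        "March",     // 3
        "April",     // 4
        "July",      // 5
        "August",    // 6
        "September", // 7
        "October",   // 8
        "November"   // 9
    };

    static String forPlate(String plate) {
        if (plate == null || plate.isEmpty()) {
            throw new IllegalArgumentException("plate must not be empty");
        }
        char last = plate.charAt(plate.length() - 1);
        if (last < '0' || last > '9') {
            throw new IllegalArgumentException("plate must end with a digit: " + plate);
        }
        return MONTH_BY_DIGIT[last - '0'];
    }

    public static void main(String[] args) {
        System.out.println(forPlate("WNF766")); // prints "August"
    }
}
```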

To add a custom entity type that recognizes Swedish number plates, perform the following steps:

  1. Create a Configuration Project.

  2. Create a new custom entity type definition file in the Entity Extraction folder.

  3. Edit the new custom entity type definition file.

Creating a new custom entity type definition file

To create a new custom entity type definition, copy the type-template.xml file to the Active Entities folder:

[Image: type-template.png]

The Active Entities folder contains all active custom entity type definitions. You can move them to the Available Entities folder to temporarily deactivate them.

Inside the Active Entities folder, rename the newly copied file:

  • Right-click the type-template.xml file, then select Rename...

  • Enter the name number-plates.xml and confirm.

Editing the new custom entity definition file

In the Active Entities folder,

  • Right-click the number-plates.xml file, then click Open With > Text Editor.

[Image: edit-number-plates.png]

The file opens in a text editor in the editor area. It contains many comments that explain how to fill in the different tags.

Defining the Entity Type

In the text editor, navigate to the position after the <declaration></declaration> section and add a type entry for the new entity type as follows:

<type id="pn" description="number plate"/>

  • "id" is mandatory; it is a code of at most two characters and must be unique in the OSINT Suite namespace. The letters p, o, u and t are already used for the internal types. We suggest choosing a two-character code that is not yet used in expressions.xml.

  • "description" is mandatory; it is free text that describes the data type. This description is shown in the user interface to denote the entity type.

Defining the Pattern to match the entity

In the text editor, navigate to the <expressions> tag and add a new <expression></expression> child tag to hold the pattern definition as follows:

<expression> 
 <regex><![CDATA[[A-ZÅÄÖ]{3}[ \-]?<000-999>]]></regex>
 <description>swedish plate number, example AAA-111</description>
 <output key="type" value="pn"/>
 <output key="country" value="sweden"/>
</expression> 

  • <regex> is mandatory; when mode is not set, or when mode="basic" is set, its CDATA section holds the regular expression pattern, written in the syntax of the dk.brics.automaton library [http://www.brics.dk/automaton/doc/dk/brics/automaton/RegExp.html].

  • <description> is optional; it is free text that describes the pattern.

  • An <expression> tag can contain zero or more <output> tags (in practice at least one); each <output> adds a piece of information to the meta tag that records the found entity in the meta data of the file.

  • One <output> tag must identify the data type of the term that matched the pattern; in the above example this is <output key="type" value="pn"/>, which tells the system that any term matching the pattern is of data type "pn" (plate number).

This pattern will match three uppercase letters, optionally followed by a space or a dash, followed by a number between 000 and 999. This is a list of example terms that would match the above pattern:

WNF766
WNF 766
WNF-766
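
The example terms can be checked outside the system with a small Java sketch. It uses java.util.regex rather than dk.brics.automaton, so the brics numeric interval <000-999> is rewritten as [0-9]{3} here; that rewrite, the class name PlatePatternCheck and the method isPlate are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Illustrative check of the number plate pattern with java.util.regex.
// Assumption: the brics interval <000-999> is equivalent to [0-9]{3}.
class PlatePatternCheck {

    // \u00C5, \u00C4 and \u00D6 are the letters Å, Ä and Ö.
    static final Pattern PLATE =
            Pattern.compile("[A-Z\u00C5\u00C4\u00D6]{3}[ \\-]?[0-9]{3}");

    // True only if the whole term is a plate (no partial matches).
    static boolean isPlate(String term) {
        return PLATE.matcher(term).matches();
    }

    public static void main(String[] args) {
        String[] terms = {"WNF766", "WNF 766", "WNF-766", "WNF_766", "wnf766"};
        for (String term : terms) {
            System.out.println(term + " -> " + isPlate(term));
        }
    }
}
```

The first three terms match; "WNF_766" and "wnf766" do not, because the underscore is not an allowed separator and the pattern only accepts uppercase letters.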

Understanding the definition of a Custom Entity Type

As can be seen, the pattern (regular expression) used as an example for recognizing Swedish number plates, defined within the <regex> tag, is

[A-ZÅÄÖ]{3}[ \-]?<000-999>

Three main parts can be detected in this regular expression:

Pattern        Meaning
[A-ZÅÄÖ]{3}    Matches three uppercase letters (from A to Z, including the letters Å, Ä and Ö)
[ \-]?         Optionally (the ? symbol) matches a space or a dash
<000-999>      Must be followed by a number between 000 and 999

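Many regular expression testers are available on the Internet for trying out regular expressions (regexpal, regexr, etc.). Keep in mind that these testers generally use a different regular expression syntax, so constructs specific to dk.brics.automaton, such as the numeric interval <000-999>, may have to be rewritten (for example as [0-9]{3}) before testing.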

Under the hood, the system matches this expression against all text contents of the files processed by the entity extraction. If a term matches, the system adds a meta tag to the meta data of the file, which is an XML element such as:

<emm:custom type="pn" country="sweden" name="the term that matched the pattern" pos="the position of the term that matched the pattern" id="a unique identifier for name">the term that matched the pattern</emm:custom>

In our above example, let's say we have a document which contains some text interspersed with some number plate terms as follows:

.... WNF766 ... ADE-683 ... OWA 882 ...

If the Entity Extraction process analyzes this text, it will produce the following tags to be included in the meta data of the file:

<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF766</emm:custom>
<emm:custom type="pn" country="sweden" name="ADE-683" id="8">ADE-683</emm:custom>
<emm:custom type="pn" country="sweden" name="OWA 882" id="9">OWA 882</emm:custom>

In this example the system reports three number plates, each with its own id value. The same happens when a single plate appears in slightly different spellings (for example WNF766, WNF 766 and WNF-766): the system treats each spelling as a different number plate, even though they all describe the same one. To overcome this problem we need to output the matched terms in a standardised form.

Further improving the custom regular expressions

This section shows how to customize the regular expressions used for recognizing entities in the EMM-OSINT Suite.

Following the example above about recognizing Swedish number plates, the following steps are explained:

  • Standardizing the output name of the entity. This is useful when OSINT finds several terms that are spelled differently but refer to the same entity.

  • Defining capturing groups in the pattern. The regular expression library currently used in OSINT doesn't support capturing groups by default. To work around this, we can set the "mode" attribute on the <regex> tag.

  • Using keywords as patterns. A text file containing the keywords defines the exact terms to recognize.

  • Using a script to generate the patterns on-the-fly. OSINT allows you to define your own scripts, written in a Java style, to recognize entities.

Standardizing the output name

The output key called "name" of the entity recognized by OSINT can be standardized. Returning to the example of the regular expression (pattern) defined to recognize Swedish number plates:

<expression> 
 <regex><![CDATA[[A-ZÅÄÖ]{3}[ \-]?<000-999>]]></regex>
 <description>swedish plate number, example AAA-111</description>
 <output key="type" value="pn"/>
 <output key="country" value="sweden"/>
</expression> 

the output in OSINT for a text including "... WNF766 ... WNF 766 ... WNF-766 ..." would be:

<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF 766" id="8">WNF 766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF-766" id="9">WNF-766</emm:custom>

Notice that there are three different entities with different "name" output keys. If we want to standardize this key so that OSINT shows the same entity (WNF766, for instance) in all three cases, we add a "name" output key to the definition of the pattern as follows:

<expression> 
<regex><![CDATA[[A-ZÅÄÖ]{3}[ \-]?<000-999>]]></regex>
<description>swedish plate number, example AAA-111</description>
<output key="type" value="pn"/>
<output key="country" value="sweden"/>
<output key="name"><![CDATA[ 
     name = term.replaceAll("[ \\-]", ""); 
     return name;]]>
</output>
</expression>

If the Entity Extraction module processes the text again, it will now produce the following meta tags:

<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF 766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF-766</emm:custom>

As shown, now the output for the "name" key is unified for all the entities.
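
The effect of the "name" output script can be reproduced in plain Java. The sketch below applies the same replaceAll call to the three spellings and collects the results in a set; the class NameNormalizer and its method normalize are hypothetical names, and the way OSINT itself assigns ids is not modelled.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: applies the replaceAll call from the "name" output
// script to several spellings of the same plate.
class NameNormalizer {

    // Removes spaces and dashes, as in the output script above.
    static String normalize(String term) {
        return term.replaceAll("[ \\-]", "");
    }

    public static void main(String[] args) {
        Set<String> names = new LinkedHashSet<>();
        for (String term : new String[] {"WNF766", "WNF 766", "WNF-766"}) {
            names.add(normalize(term));
        }
        System.out.println(names); // prints "[WNF766]"
    }
}
```

All three spellings collapse to the single name WNF766, which is why the three meta tags above share one name and one id.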

Defining capturing groups in the pattern

By default, the regular expression library currently used in OSINT doesn't support capturing groups within the definition of the pattern. To work around this, we can set the "mode" attribute on the <regex> tag as follows:

<expression>
  <regex mode="groups"><![CDATA[([A-ZÅÄÖ]{3})[ \-]?(<000-999>)]]></regex>
  <description>swedish plate number, example AAA-111</description>
  <output key="type" value="pn"/>
  <output key="country" value="sweden"/>
</expression>

Notice how the "mode" attribute is added to the <regex> tag and set to "groups" in order to allow groups (delimited by parentheses) in the regular expression. We can then use the defined groups for further processing, referring to them as groups[1], groups[2], etc. In the example above, if OSINT finds the entity "WNF766" in the text, groups[1] refers to the string matched by the first group of the regular expression ("WNF"), while groups[2] refers to the string matched by the second group ("766"). These groups might therefore be used within a script that we can also define for generating patterns on-the-fly, as explained below.
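
To see which part of a term ends up in which group, the grouped pattern can be tried with java.util.regex. This is a sketch under assumptions: <000-999> is rewritten as [0-9]{3}, and the array returned by the hypothetical method groups only mimics the groups[1], groups[2] numbering described above; it is not the OSINT implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows what the two capturing groups contain.
class PlateGroups {

    // Group 1 = the three letters, group 2 = the three digits.
    static final Pattern PLATE =
            Pattern.compile("([A-Z\u00C5\u00C4\u00D6]{3})[ \\-]?([0-9]{3})");

    // Returns {whole match, group 1, group 2}, or null if the term is no plate.
    static String[] groups(String term) {
        Matcher m = PLATE.matcher(term);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(0), m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String[] groups = groups("WNF-766");
        System.out.println(groups[1]);             // prints "WNF"
        System.out.println(groups[2]);             // prints "766"
        System.out.println(groups[1] + groups[2]); // prints "WNF766"
    }
}
```

Concatenating the two groups gives the same standardized name as removing the separator, whichever spelling was found in the text.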

Using keywords as patterns

In OSINT we can use keywords as patterns by listing them in an external text file. To use this feature we have to set the "mode" attribute on the <regex> tag as follows:

<expression>
  <regex mode="file"><![CDATA[keywords.txt]]></regex>
  <description>swedish plate number, example AAA-111</description>
  <output key="type" value="pn"/>
  <output key="country" value="sweden"/>
</expression>

The CDATA section holds the relative path (starting from where the expressions.xml is loaded) to a file which contains keyword terms. This file must be a text file in UTF-8 format. Each line that is neither empty nor a comment line is treated as one keyword term; it is case sensitive and is matched as-is (there is no need to escape special characters). An example of the "keywords.txt" file would be:

#comment lines begin with # and are ignored
#empty lines (as the one below) are also ignored
#if possible, the first and the last line of the file
#should be either an empty line or a comment line
#keywords start here
WNF766
wnf766
WNF-766
wnf-766

Taking into account the "keywords" file above, for the example text "... WNF766 ... WNF 766 ... WNF-766 ...", OSINT would produce two XML elements:

<emm:custom type="pn" country="sweden" name="WNF766" id="7">WNF766</emm:custom>
<emm:custom type="pn" country="sweden" name="WNF-766" id="8">WNF-766</emm:custom>

Whitespace is allowed inside the keywords defined within the "keywords" file.
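
The rules for the keywords file (comment lines, empty lines, case-sensitive literal matching) can be sketched as follows. The class KeywordFile and its methods are hypothetical, and simple substring search is used as a stand-in for whatever matching OSINT performs internally.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: parses keyword file content and looks the keywords up
// in a text, case sensitive and without regular expression escaping.
class KeywordFile {

    // Keeps every line that is neither empty nor a comment (starting with #).
    static List<String> parse(String content) {
        List<String> keywords = new ArrayList<>();
        for (String line : content.split("\\R")) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            keywords.add(line);
        }
        return keywords;
    }

    // Returns the keywords that occur literally in the text.
    // Assumption: plain substring search stands in for the real matcher.
    static List<String> findAll(String text, List<String> keywords) {
        List<String> found = new ArrayList<>();
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                found.add(keyword);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String content = "#keywords start here\n\nWNF766\nwnf766\nWNF-766\nwnf-766\n";
        String text = "... WNF766 ... WNF 766 ... WNF-766 ...";
        System.out.println(findAll(text, parse(content))); // prints "[WNF766, WNF-766]"
    }
}
```

"WNF 766" is not reported because no keyword contains a space, and the lowercase keywords do not match the uppercase terms in the text.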

Using a script to generate the patterns on-the-fly

OSINT allows you to define your own scripts, written in a Java style, to recognize entities. To use this feature we have to set the "mode" attribute of the <regex> tag to "script", as follows:

<expression>
  <regex mode="script"><![CDATA[
      List<String> ls = new ArrayList<String>();
      // the backslash is doubled because "\-" is not a valid escape in a Java string literal
      ls.add("WNF[ \\-]?[0-9]{3}");
      ls.add("UCD[ \\-]?[0-9]{3}");
      return ls;
    ]]></regex>
  <description>swedish plate number, example AAA-111</description>
  <output key="type" value="pn"/>
  <output key="country" value="sweden"/>
</expression>

As can be seen, the script (in the <regex> CDATA section) is written in the Java language. The "script" mode has the following features:

  • the script can reference the path from where the expressions.xml is loaded as "resourcespath" ("resourcespath" is a String)

  • the script must return a List<String>, where each element in the list is a regular expression pattern

For the example above, the script will recognize entities such as "WNF777", "WNF-876", "WNF 987", "UCD465", "UCD-999" or "UCD 112".
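
The idea of generating patterns on-the-fly can be tried outside OSINT with a standalone Java class. The sketch below builds one pattern per letter prefix and checks terms with java.util.regex; the class ScriptPatterns and its methods patternsFor and matchesAny are hypothetical names and are not part of the OSINT scripting interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: builds a list of patterns like the script above and
// checks terms against them.
class ScriptPatterns {

    // One pattern per prefix: prefix, optional space or dash, three digits.
    static List<String> patternsFor(String... prefixes) {
        List<String> ls = new ArrayList<String>();
        for (String prefix : prefixes) {
            ls.add(prefix + "[ \\-]?[0-9]{3}");
        }
        return ls;
    }

    // True if the whole term matches at least one of the patterns.
    static boolean matchesAny(String term, List<String> patterns) {
        for (String pattern : patterns) {
            if (Pattern.matches(pattern, term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> patterns = patternsFor("WNF", "UCD");
        for (String term : new String[] {"WNF777", "UCD-999", "UCD 112", "ABC123"}) {
            System.out.println(term + " -> " + matchesAny(term, patterns));
        }
    }
}
```

Building the list in a loop is where a script pays off over a fixed pattern: the prefixes could come from any source the script can read, for example a file located via "resourcespath".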