Defining categories or alerts
Prerequisite: Creating a Category Definition File
EMM OSINT Suite provides a Domain-Specific Language (DSL) for defining a category or alert in the Category Definition File editor view. A Category Definition File (adf file) consists of two main sections, both of which are optional:
- Patterns section: a simple list of keywords and weights, with an optional threshold.
- Combinations section: a list of keyword combinations, each with an optional proximity attribute.
The keywords to be used in the definitions should reflect the language used in the documents that are being classified.
The basic structure of a Category Definition File is the following:
define alert <ID>
(
(words threshold <INT>)?
define patterns
<ID> <INT> (, <ID> <INT>)*
end patterns
)?
(
define combination
(proximity <INT>)?
( begin or
<ID> (, <ID>)*
end or )+
( begin not
<ID> (, <ID>)*
end not )?
end combination
)*
end alert
Category names
Category names must be unique in the system: a name can be used for exactly one category. A name should contain only non-accented alphanumerical characters. Names are case sensitive, but case should nevertheless NOT be used to distinguish between categories, i.e. the categories 'myTest' and 'mytest' should not be used at the same time.
Patterns section
The first way to define an alert is to use a list of keywords with associated weights. This simple keyword method is preferred for performance reasons and is very effective when you are looking for precise, unambiguous terms or names (e.g. Gazprom Media, brucellosis, Michael Phelps). If a precise term consists of two or more words, join them with the “+” symbol. For instance, if you are interested in “yellow fever”, you do not want every document that contains the words “yellow” and “fever” somewhere in the text; you want the two words to appear together and in the same order, so you write “yellow+fever”. The “+” effectively skips the white space between the two words. Note that “+” also skips punctuation marks, so “yellow, fever” would match as well.
An important feature of the Category Matcher is that it is multilingual. Users who want to categorize articles in several languages have to define their alerts in each of those languages. The system accepts terms in any language, and an alert may have any number of keywords.
Words threshold and keywords weight
The words threshold of an alert and the weight of each pattern are chosen by the user, and both are optional. If they are set, they must be integer values. The system keeps track of the total weight of the patterns found in a document, and assigns the document to the category only if this total reaches the threshold. The words threshold is ONLY used within the current definition; its value has no meaning other than as a limit against which the total weight of the matched patterns is checked. Individual weights may be lower than the threshold, in which case several matches are needed to reach it.
The system counts at most 8 occurrences of the same pattern and assigns a decreasing weight to successive occurrences, using the following multipliers: 1.0, 1.0, 0.6, 0.4, 0.4, 0.2, 0.2, 0.2. So, if a pattern matches multiple times, the system automatically reduces the weight of the later matches, and one pattern will never score more than 4 times its full weight (the multipliers add up to 4.0).
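The effect of these multipliers can be illustrated with a small Python sketch. It is not part of EMM OSINT Suite; the function name pattern_score is hypothetical, and the sketch assumes the multipliers are applied to the occurrences in order, with no rounding.

```python
# Multipliers from the text (1.0, 1.0, 0.6, 0.4, 0.4, 0.2, 0.2, 0.2),
# stored in tenths so that the arithmetic stays exact.
OCCURRENCE_MULTIPLIERS_TENTHS = (10, 10, 6, 4, 4, 2, 2, 2)


def pattern_score(weight: int, occurrences: int) -> float:
    """Total contribution of one pattern that matched `occurrences` times.

    Assumption: occurrences beyond the 8th contribute nothing.
    """
    counted = OCCURRENCE_MULTIPLIERS_TENTHS[:max(occurrences, 0)]
    return weight * sum(counted) / 10


print(pattern_score(10, 1))   # 10.0
print(pattern_score(10, 3))   # 26.0  (10 + 10 + 6)
print(pattern_score(10, 8))   # 40.0  (4 times the full weight)
print(pattern_score(10, 20))  # 40.0  (capped after 8 occurrences)
```

With a weight of 10 and a threshold of 20, for example, two occurrences of the same pattern are enough to reach the threshold on their own.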
Combinations section
The second way to define an alert is to create one or more combinations of keyword lists. A combination section is composed of one or more "OR" lists of patterns (“or” sub-sections) and, optionally, one "NOT" list of patterns (“not” sub-section). When a combination is defined, at least one keyword from each “or” sub-section must be found in the document for the category to be assigned to it. If any keyword from the “not” sub-section is found, the document is discarded even if the “or” sub-sections matched. The "NOT" list therefore holds the "unwanted words".
Use a combination for a broader term or concept (e.g. imported disease, release of toxic substances, equal rights, or combinations such as Russian peacekeeping mission in Caucasus, Russian Georgian Conflict). As explained above, each "OR" list of patterns expresses one concept; a document is considered for the alert if every concept is found in it, and rejected if the "NOT" concept is found.
Proximity
The “proximity” value is optional, but it can be very handy. It defines the size, in words, of the context within which the combination terms have to occur. If the user does not define this value, the Category Matcher uses 10 as the default.
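The following Python sketch shows one possible reading of how a combination with a proximity value could be evaluated. It is an illustration only, not the Category Matcher's implementation. It assumes that the proximity is a window of consecutive words that must contain at least one term from every "or" list, that the "not" list is checked against the whole document, and that the terms are plain lowercase words without wildcards. The names tokenize and combination_matches are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def combination_matches(text, or_lists, not_list=(), proximity=10):
    """Return True if the combination would fire under the stated assumptions."""
    tokens = tokenize(text)
    # Any unwanted word anywhere in the document rejects it.
    if any(token in not_list for token in tokens):
        return False
    # Slide a window of `proximity` words over the document and require
    # at least one term from every "or" list inside the same window.
    for start in range(len(tokens)):
        window = set(tokens[start:start + proximity])
        if all(window & set(terms) for terms in or_lists):
            return True
    return False


text = "Tensions rose as the conflict over water escalated near the border"
or_lists = [{"water", "eau"}, {"conflict"}]
print(combination_matches(text, or_lists, proximity=5))  # True
print(combination_matches(text, or_lists, proximity=2))  # False
```

In the example, "conflict" and "water" are two words apart, so they fit in a window of 5 words but not in a window of 2.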
Wildcards
The Category Matcher in EMM OSINT Suite allows using several wildcard characters:
- % (percent) stands for zero or more characters. E.g. origin% would match original, originality, originally, originate, originating, originator, origination... This wildcard can be very useful with inflecting/fusional languages such as Russian.
- _ (underscore) stands for exactly one character; it does not denote a blank. E.g. p_t would match pot, put, pat, and “organi_ation” would match both “organization” and “organisation”.
- Set: [abc] in a pattern definition means that the system will match either an ‘a’, a ‘b’ or a ‘c’ in that position. E.g. c[aou]t would match ‘cat’, ‘cot’ and ‘cut’.
- It is possible to introduce prefixes in the following way: @prefix]. Please note that a prefix needs to be introduced only once in the whole system. E.g. together with words like bug, bunk, but, claim you can introduce @de] and the system will automatically get debug, debunk, debut, declaim etc. This symbol should be used with caution, as it will affect all other alert definitions.
- The “+” (white space) sign can be used to build or unite term strings. E.g. Olympic+games, News+Brief; dmitry+medvedev would match dmitry (white space) medvedev.
These wildcards can be very helpful for building common patterns for multiple languages, because they can substitute accented characters. E.g. ent_rotox% would match enterotoxine (de), entérotoxin (fr), enterotoxín (sk) etc.
It is possible to use word-initial wildcards (both _ and %), but these should be used as little as possible because they are computationally heavy. If you only want to cover one word-initial letter, you should use _ (underscore) instead of % (percent). If there are only two or three variants of your chosen word/pattern, it is much better to list them explicitly instead of using a wildcard.
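To make the wildcard semantics concrete, the Python sketch below translates a pattern into a regular expression. This is only an approximation for illustration: it assumes that a pattern must match whole words, that “+” requires at least one white space or punctuation character, and that patterns are written in lowercase (so matching is case-insensitive). The @prefix] notation is not covered, and the name pattern_to_regex is hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a category pattern into a regex (illustrative approximation)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":            # zero or more characters
            parts.append(r"\w*")
        elif ch == "_":          # exactly one character, never a blank
            parts.append(r"\w")
        elif ch == "+":          # white space and/or punctuation between words
            parts.append(r"\W+")
        elif ch == "[":          # set, e.g. [aou]
            end = pattern.index("]", i)
            parts.append("[" + re.escape(pattern[i + 1:end]) + "]")
            i = end
        else:                    # literal character
            parts.append(re.escape(ch))
        i += 1
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


print(bool(pattern_to_regex("origin%").search("the originator")))       # True
print(bool(pattern_to_regex("organi_ation").search("Organisation")))    # True
print(bool(pattern_to_regex("c[aou]t").search("a cit")))                # False
print(bool(pattern_to_regex("yellow+fever").search("yellow, fever")))   # True
print(bool(pattern_to_regex("yellow+fever").search("yellow and fever")))  # False
```

A word-initial wildcard such as %toxin forces the matcher to consider every position inside every word, which gives an intuition for why such patterns are computationally heavy.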
Uppercase/Lowercase definitions
A pattern definition should normally be in lowercase, but it can contain uppercase characters. A lowercase character in a pattern matches both lowercase and uppercase in the incoming text, whereas an UPPERCASE character only matches uppercase in the same position. So the pattern “abc” would match ABC, ABc, Abc, aBC, AbC etc., but the pattern “Abc” would only match Abc, ABC, ABc etc., all with the uppercase “A”. Forcing uppercase can be used for acronyms that would otherwise cause problems.
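The case rule can be expressed as a character-by-character comparison. The Python sketch below is a minimal illustration for literal patterns without wildcards; the function name case_aware_match is hypothetical.

```python
def case_aware_match(pattern: str, word: str) -> bool:
    """Lowercase pattern characters match either case; uppercase must match exactly."""
    if len(pattern) != len(word):
        return False
    for p, w in zip(pattern, word):
        if p.isupper():
            # An uppercase pattern character only matches the same uppercase character.
            if w != p:
                return False
        elif w.lower() != p:
            # A lowercase (or caseless) pattern character ignores the case of the text.
            return False
    return True


print(case_aware_match("abc", "AbC"))  # True
print(case_aware_match("Abc", "ABC"))  # True
print(case_aware_match("Abc", "abc"))  # False
print(case_aware_match("WHO", "who"))  # False
```

The last line shows why forcing uppercase helps with acronyms: the pattern “WHO” does not fire on the ordinary English word “who”.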
Useful tricks
- Negative weights can be used. They are useful if a search word is homographic with some other unrelated word or with a person name, or if a search word has many meanings. E.g. if you are interested in finding texts that mention “tsunami” in the sense of the sea storm, you could add several words with a negative weight of, say, -999:
  - rock-band -999 (there is a famous Indian rock band “Tsunami”)
  - Arashi+Tsunami -999 (a Japanese voice actor)
  - Satoshi+Tsunami -999 (a Japanese football player)
  - deodorant -999 (“Tsunami” fragrance by Axe)
  - politics -999 (“Tsunami” term used to describe an overwhelming victory by a political party)
  Another example: if you are interested in Michael Jackson the Canadian actor and not the musician, you could give words like pop music, songwriter, dancer etc. a negative weight.
- Use weights and a threshold (e.g. 50) so that some words can trigger the alert on their own (weight = 50), while other words (e.g. weight = 20) need to occur several times, or together with other words, before the threshold is reached. For example:
define alert Biotechnology
words threshold 50
define patterns
genetic_ 40, cancer 20, genomics 50, antibodies 40, biotechnology 50
end patterns
end alert
- Be careful with abbreviations. E.g. a very simple (at first sight) abbreviation “ABC” can stand for the following: Latin Alphabet, American Broadcasting Company, Australian Broadcasting Company, Associated British Company, Appalachian Brewing company, Atlanta Bread company, Agricultural Bank of China, ABC (programming language), abc conjecture, All Lesotho Convention, ABC (island in Alaska) etc. Please keep in mind that a simple word (or an abbreviation) in English can mean a completely different thing in Italian, German, Russian, Bulgarian… E.g. a Portuguese word for “vomiting” is “emese”, and “Emese” is a very common female first name in Hungary. Another typical example is the word ‘mais’, which means the agricultural crop in many languages but means ‘but’ in French, where the word for the crop is written with an accented ‘i’ (‘maïs’). To avoid these conflicts, it is sometimes useful to define multiple combinations, with the word lists in each combination reflecting one language or language group. A combination is unlikely to trigger the category if one or-list consists of English words and the other of French words.
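As a sanity check of the weights-and-threshold trick, the Python sketch below scores a few hypothetical match counts against the Biotechnology definition above. It assumes that the occurrence multipliers described under "Words threshold and keywords weight" are applied exactly as listed, and that the threshold is reached when the total is greater than or equal to it. The names total_score and triggers_alert are illustrative only.

```python
# Occurrence multipliers from the text, in tenths to keep the arithmetic exact.
MULTIPLIERS_TENTHS = (10, 10, 6, 4, 4, 2, 2, 2)

THRESHOLD = 50
WEIGHTS = {
    "genetic_": 40,
    "cancer": 20,
    "genomics": 50,
    "antibodies": 40,
    "biotechnology": 50,
}


def total_score(match_counts: dict[str, int]) -> float:
    """Sum the weighted, occurrence-discounted scores of all matched patterns."""
    tenths = 0
    for pattern, count in match_counts.items():
        tenths += WEIGHTS[pattern] * sum(MULTIPLIERS_TENTHS[:count])
    return tenths / 10


def triggers_alert(match_counts: dict[str, int]) -> bool:
    return total_score(match_counts) >= THRESHOLD


print(triggers_alert({"genomics": 1}))                # True  (50)
print(triggers_alert({"cancer": 2}))                  # False (20 + 20 = 40)
print(triggers_alert({"cancer": 3}))                  # True  (20 + 20 + 12 = 52)
print(triggers_alert({"genetic_": 1, "cancer": 1}))   # True  (40 + 20 = 60)
```

Under these assumptions, “cancer” alone needs three occurrences to trigger the alert, while “genomics” triggers it with a single occurrence.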
Examples
Here you can find some examples of using the DSL provided by EMM OSINT Suite to define Category Definition Files:
define alert TerroristAttack
words threshold 20
define patterns
stratégiai+bombázás% 10, film% -999, car+bomb% 10, bomb%+detonat% 10,
attentato+suicide 10, camion+bomba 10
end patterns
define combination
proximity 5
begin or
terroris%, bioterrorism%, attack%, attentat%
end or
begin or
attacco+chimico, attacco+tossico, allarm_, sostanz_+tossic%, sostanz_+chimic%
end or
end combination
define combination
proximity 15
begin or
ETA%, IRA%, Al-Kaida%, szélsőséges
end or
begin or
bomba%, terror%, tömeg+gyilkosság%, gyilkosság%, merényl%, mészárl%, csoport%
end or
begin not
könyv%, gól, film%
end not
end combination
end alert
define alert WaterConflict
define combination
begin or
water, eau
end or
begin or
conflict, conflict%
end or
begin not
book%, film%, movie%, game%, song%, mostra+fotografica
end not
end combination
end alert
define alert NaturalDisasters
define combination
begin or
tsunami, volkanik, erozyon, volkan%+pat%, volkan%+kül%, sel%+felaket%, heyelan%
end or
begin or
ferit_, ferida%, vittim_, crash%, explosion%
end or
end combination
end alert