Text Extraction
Text Extraction is the process of extracting raw text from multiple input file formats. The Text Extraction module of EMM OSINT Suite is based on the open-source project Apache Tika.
Currently, the module supports the following input file formats:
- Plain text (no text extraction needed)
- XML
- HTML
- PDF
- Microsoft Office 97 formats (doc, xls, ppt)
- Microsoft Office Open XML (2007) (docx, xlsx, pptx, thmx)
- Open Office Text, Presentation, Spreadsheet (odt, odp, ods)
In addition to extracting the text, the module identifies the language of the text and stores it as metadata.
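
As a rough illustration of the supported-format list above, the following sketch maps a file extension to its format family. It is not part of EMM OSINT Suite or Apache Tika: the names SUPPORTED_FORMATS and supported_format are hypothetical, and the extensions txt, xml, html, htm and pdf are assumptions, since the list names no extensions for those formats. A lookup by extension is also only an approximation; the module itself may decide differently, for example by inspecting file content.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table built from the list of supported formats.
SUPPORTED_FORMATS = {
    # Assumed extensions (not stated in the list above).
    "txt": "Plain text",
    "xml": "XML",
    "html": "HTML",
    "htm": "HTML",
    "pdf": "PDF",
    # Extensions taken from the list above.
    "doc": "Microsoft Office 97",
    "xls": "Microsoft Office 97",
    "ppt": "Microsoft Office 97",
    "docx": "Microsoft Office Open XML",
    "xlsx": "Microsoft Office Open XML",
    "pptx": "Microsoft Office Open XML",
    "thmx": "Microsoft Office Open XML",
    "odt": "Open Office",
    "odp": "Open Office",
    "ods": "Open Office",
}


def supported_format(filename: str) -> Optional[str]:
    """Return the format family for a file name, or None if unsupported."""
    # Take the last extension, ignore case, and drop the leading dot.
    extension = Path(filename).suffix.lower().lstrip(".")
    return SUPPORTED_FORMATS.get(extension)


if __name__ == "__main__":
    for name in ("report.DOCX", "slides.odp", "photo.png"):
        print(name, "->", supported_format(name))
```

A check like this could be used to skip unsupported files before they are handed to the extraction step.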