Crawling a Web Site
Prerequisite: Creating a Case Project
See also: Setting HTTP Proxy Information
The application contains a crawler component that can be used to crawl (often also called "spider") a targeted web site.
The crawler component starts at a given URL and then follows the links on the web site until a predefined depth has been reached.
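The depth-limited crawl described above can be sketched as a breadth-first traversal. The sketch below is an illustration only, not the application's actual implementation; `fetch_links` is a hypothetical helper assumed to return the URLs linked from a page.

```python
from collections import deque

def crawl(start_url, max_depth, fetch_links):
    """Breadth-first crawl: follow links from start_url up to max_depth hops.

    fetch_links(url) is assumed to return the URLs linked from that page.
    """
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # predefined depth reached: do not follow further links
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

With a depth of 1 (the default, see the parameter table below), only the start page and the pages it links to directly are visited.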
Creating a Crawler Configuration
The crawler component needs a configuration file which defines the starting URL and some parameters. To create one, do the following:
-
In the Workspace Navigator view, expand and right-click the Crawler folder, then click New > Crawler Configuration. An OSINT Crawler Configuration creation dialog opens.
-
In the OSINT Crawler Configuration dialog enter a file name and click Finish. The configuration file is created and opened in a Crawler Configuration editor.
In the Crawler Configuration Editor the following parameters can be set:
| Name | Type | Description |
| --- | --- | --- |
| Targeted Websites | list of URLs | The start URLs of the web sites that should be crawled. |
| Max Depth | number | The maximum link depth to follow. Default is 1 (crawl all pages linked directly from the start URL). |
| Minimum Text Size | number | The minimum extracted text size. Default is 200 characters; pages with less text are ignored. |
| Concurrent Workers | number | The number of concurrent worker threads performing downloads. Default is 1. Keep this number low to avoid being blacklisted. |
| Random Delay (ms) | number | The waiting time between requests made by the worker threads. Default is 2500 milliseconds. |
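The download-related parameters in the table above interact in a simple way: each worker waits between requests and discards pages with too little extracted text. The following sketch illustrates that interplay under stated assumptions; `fetch_text` is a hypothetical helper, and none of these names belong to the application's API.

```python
import time

def worker(urls, fetch_text, min_text_size=200, delay_ms=2500):
    """Download pages one by one, waiting delay_ms between requests and
    discarding pages whose extracted text is shorter than min_text_size."""
    kept = []
    for url in urls:
        text = fetch_text(url)
        if len(text) >= min_text_size:
            kept.append((url, text))  # page stored as a document
        # else: page ignored (too little extracted text)
        time.sleep(delay_ms / 1000.0)  # throttle to avoid being blacklisted
    return kept
```

Raising Concurrent Workers runs several such loops in parallel, which multiplies the request rate; this is why the default is 1 and the delay between requests matters.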
Adding a Web Site for Crawling
In the Crawler Configuration editor do the following:
-
In the Targeted Websites section click Add. The Add a new URL dialog opens.
-
In the Add a new URL dialog enter a full URL, such as http://www.europa.eu, and click OK. The added URL appears in the Targeted Websites list box.
Performing a Crawl
To perform the crawl using the Crawler Configuration file created do the following:
-
Right-click on the Crawler Configuration file and click Start Crawl.
The crawler module starts in the background. After it has finished, the crawled HTML pages appear in the Documents folder of the case project that contains the crawler configuration file.