Crawling a Web Site

Prerequisite: Creating a Case Project

See also: Setting HTTP Proxy Information

The application contains a crawler component that can be used to crawl (often also called "spider") a targeted web site.

The crawler component starts at a given URL and follows the links on the web site until a predefined depth has been reached.

Creating a Crawler Configuration

The crawler component needs a configuration file that defines the starting URL and some parameters. To create one, do the following:

  • In the Workspace Navigator view, right-click the Crawler folder, then click New > Crawler Configuration. The OSINT Crawler Configuration creation dialog opens.

[Image: crawler-configuration.png]
  • In the OSINT Crawler Configuration dialog, enter a file name and click Finish. The configuration file is created and opened in the Crawler Configuration editor.

[Image: crawler-configuration2.png]

In the Crawler Configuration editor the following parameters can be set:

  • Targeted Websites (list of URLs): The start URLs of the web sites to be crawled.

  • Max Depth (number): The maximum link depth to follow from a start URL. The default is 1, which crawls the start URL and all pages linked directly from it.

  • Minimum Text Size (number): The minimum amount of extracted text, default 200 characters. Pages with less text are ignored.

  • Concurrent Workers (number): The number of concurrent worker threads performing downloads, default 1. Keep this value low to avoid being blacklisted by the target site.

  • Random Delay (ms) (number): The waiting time between requests made by the worker threads. The default is 2500 milliseconds.
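How these parameters interact can be illustrated with a short sketch. This is not the application's actual implementation, only a minimal single-worker example (Concurrent Workers = 1) showing how Max Depth, Minimum Text Size, and Random Delay might govern a breadth-first crawl; the fetch and extraction functions are hypothetical placeholders supplied by the caller:

```python
import random
import time
from collections import deque

def crawl(start_url, fetch, extract_links, extract_text,
          max_depth=1, min_text_size=200, delay_ms=2500):
    """Breadth-first crawl from start_url up to max_depth link hops.

    fetch(url) returns a page, extract_links(page) returns the URLs it
    links to, and extract_text(page) returns its plain text. Pages whose
    text is shorter than min_text_size are ignored, and a random delay
    of up to delay_ms milliseconds is inserted between requests.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    results = []
    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        text = extract_text(page)
        if len(text) >= min_text_size:
            results.append((url, text))
        if depth < max_depth:
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        # Polite delay between requests to avoid being blacklisted.
        time.sleep(random.uniform(0, delay_ms) / 1000.0)
    return results
```

With the defaults (max_depth=1), the start URL and every page it links to directly are fetched, matching the Max Depth default described above.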

Adding a Web Site for Crawling

In the Crawler Configuration editor do the following:

  1. In the Targeted Websites section click Add. The Add a new URL dialog opens.

  2. In the Add a new URL dialog, enter a full URL, such as http://www.europa.eu, and click OK. The added URL appears in the Targeted Websites list box.

To save the crawler configuration, click File > Save in the main menu.

Performing a Crawl

To perform a crawl using the Crawler Configuration file you created, do the following:

  1. Right-click on the Crawler Configuration file and click Start Crawl.

[Image: crawler-configuration3.png]

The crawler module starts in the background. After it has finished, the crawled HTML pages appear in the Documents folder of the case project that contains the crawler configuration file.