Sparkler Usage

Basics

A Simple Crawl

Once you have Sparkler installed and configured, you can kick off your first crawl. There are various command line flags to help you do this.

./sparkler.sh inject -su bbc.co.uk -id test
./sparkler.sh crawl -id test

This example says: crawl bbc.co.uk and label the job with the id test. The id is optional; if you don't supply one, you'll get a generated job id in return.

Crawls always run in two steps: the inject phase pre-seeds the database, then the crawl phase iterates through the seeded URLs and populates the database with the crawl results.

Configuration

The default configuration file, sparkler-default.yaml, lives in the conf directory. In it you will find sensible defaults for most things, and you can set various plugins, headers, Kafka config and more.

Fetcher Properties

The main place to tweak settings is the fetcher properties. Here you can set the server delay, the pause between crawl requests. This stops Sparkler spamming servers and causing undue load, and also makes us look a little less like a robot.

You can also set the fetcher headers: these are the standard headers sent with each request to make you look like a browser.

You can also enable the fetcher.user.agents property, which cycles through the user agent strings in the supplied file. This is ideal if you're crawling the same site repeatedly and would like to look a little less obvious to any detection scripts the target runs.
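
Putting those together, the relevant fetcher block of sparkler-default.yaml looks roughly like the sketch below. The key names and values here are illustrative, so check your copy of the file for the exact defaults:

fetcher.server.delay: 1000   # pause between requests to the same server, in milliseconds (illustrative value)
fetcher.headers:
  User-Agent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Sparkler/0.2.1-SNAPSHOT"
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
  Accept-Language: "en-US,en"
#fetcher.user.agents: user-agents.txt   # uncomment to rotate user agent strings from this file (file name illustrative)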

Enabling Plugins

You enable plugins by editing the plugins.active block, as shown below. This list contains the default plugins shipped with Sparkler, and you can enable or disable any of them by removing or adding the # comment symbol.
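
For example, a plugins.active block with the two URL filters enabled and everything else commented out might look like this (the plugin list is illustrative):

plugins.active:
  - urlfilter-regex
  - urlfilter-samehost
  #- url-injector
  #- fetcher-chrome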

Basic Plugins

Enabled by default are the urlfilter-regex and urlfilter-samehost plugins.

These plugins provide a couple of sensible functions that allow Sparkler to crawl without downloading the world. Regex filters out some of the URLs and links it picks up so it doesn't download loads of useless content. Samehost, by default, ensures your crawl is limited to the same domain.

Samehost

This plugin does what it says on the tin: it ensures the crawl is limited to the same host, so you don't wander off into a completely different domain crawling completely different content. Of course, you may want exactly that, in which case disable this plugin.

Regex

This plugin provides more flexibility than the samehost plugin. Out of the box it prevents a number of file URLs from being picked up, so, for example, you don't crawl PDFs, videos, images and so on. It also filters out ftp sites, mailto addresses, infinite loops and local files.

To adjust the filtering, simply edit the regex-urlfilter.txt file, which holds all the regular expressions used for matching.
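
As an illustration, entries in regex-urlfilter.txt follow a one-pattern-per-line convention where a leading - rejects matching URLs and a leading + accepts them. The patterns below are a sketch rather than the shipped defaults:

# reject non-http schemes such as file, ftp and mailto
-^(file|ftp|mailto):
# reject binary and media file extensions, so PDFs, videos, images and archives are skipped
-\.(gif|jpg|png|mp4|avi|pdf|zip|gz)$
# reject URLs with repeated path segments, which usually indicate an infinite loop
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept everything else
+.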

Fetcher HTMLUnit

Also supplied with Sparkler is the fetcher htmlunit plugin. This plugin is a slightly different browser backend that lets you crawl sites using a different engine. If you find the basic default (and fastest) fetcher doesn't work, have a look at this one; if this doesn't work either, check out the other plugins below for more support.
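
To try it, add it to the plugins.active block described above. A minimal sketch, assuming the plugin id in your build is fetcher-htmlunit (check the shipped config for the exact name):

plugins.active:
  - urlfilter-regex
  - urlfilter-samehost
  - fetcher-htmlunit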

Advanced Usage

Plugins

Fetcher Chrome

Fetcher Chrome is a plugin that allows you to run crawls using a full headless Chrome browser. It uses the Selenium engine to drive the browser, so the easiest way to get started is to use the browserless/chrome Docker image and point Sparkler at it. You can of course run your own Chrome instance, but this setup is tried and tested.
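
If you go the Docker route, something like the following should give you a Chrome instance listening on port 3000, which matches the chrome.dns default shown below (the port mapping is the usual browserless default, but check the image documentation for your version):

docker run -d -p 3000:3000 browserless/chrome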

There are various settings you can configure to alter the default behaviour.

  fetcher.chrome:
    #Set timeout to > -1 to enable the wait for element visibility for some ajax sites.
    chrome.wait.timeout: -1
    #Element name
    chrome.wait.element: "some element"
    #What type of element, class, name, id
    chrome.wait.type: "class"
    chrome.dns: "http://localhost:3000/webdriver"
    #chrome.selenium.enabled: "true"
    #chrome.selenium.script.click: "id:txtName"
    #chrome.selenium.script.keys: "COR"
    #chrome.selenium.script.click: "id:btnSearch" 
    #chrome.proxy.address: 127.0.0.1:9998

Firstly, there is the chrome.dns setting. This lets you point Sparkler at the address of the Chrome instance's WebDriver endpoint.

Then there is chrome.proxy.address. Because Selenium is a user interaction emulator, it doesn't return HTTP status codes, header information and the like; to fix that problem we run the BrowserUp proxy. If you don't provide an address, Sparkler will launch a local proxy, configure it appropriately and run it inside the Sparkler instance. If you require more flexibility, you can launch your own BrowserUp proxy and point this configuration variable at it.

Lastly, there are a number of Selenium configuration options. These let you hit a site, let it render, then interact with it before grabbing the output. This might allow you to enter something in a search box, click a button, filter a list, whatever you need. The syntax is reasonably strict: it understands click and keys operations, and within those the id, class and name selector types followed by the element name. You can also set chrome.wait.element, because you may be waiting on a specific element to be rendered before grabbing the text. If your site loads asynchronously, Sparkler will not wait for those elements by default and you will end up with a semi-rendered page. To resolve this, set the wait element and Sparkler will wait for that element to render before running the scrape. You can also set the wait type and the timeout when configuring this option.

URL Injector

The url-injector plugin is an advanced plugin that lets you interact with sites in different ways. Its basic usage allows you to specify a single site plus a list of tokens, and create a seed URL for each combination. This can help if you're interested in crawling a search engine, or a site with a query string URL that you want to hit repeatedly.

For this you need a simple replace configuration:

url.injector:
  mode: replace
  values:
    - Tom
    - Dick
    - Harry
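
With this configuration, each value is substituted into the seed URL in turn. Assuming replace mode uses the same ${token} placeholder that the form and selenium modes use, an inject along these lines (the URL here is purely illustrative) would create three seed URLs, one each for Tom, Dick and Harry:

./sparkler.sh inject -su "https://example.com/search?q=${token}" -id test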

Equally, if you wanted to submit a form, you can use the form mode:

url.injector:
  mode: form
  values:
    - Tom 
    - Dick
    - Harry
  form:
    hdnField: "submit"
    txtRequired: ""
    radSearchBy: "drugname"
    txtName: "${token}"
    selTC: ""
    selProgram: "MA"
    txtDateOfService: "12/01/2020"

In the example above, the form fields are the data fields defined in the POST request, which you can find in your browser's developer tools. ${token} is the placeholder for the values to be injected. So in this example, if you submitted your crawl with one URL and three values, you'd end up with three entries in the database, each ready to perform a form submission against the site with a different value.

There is also a selenium mode. This mode is the most involved, but it allows you to crawl the same site numerous times with various values, using Selenium to interact with the site.

For example:

url.injector:
  mode: selenium
  values:
    - Tom
    - Dick
    - Harry
  selenium:
    1:
      operation: click
      value: id:ctl00_ContentPlaceHolder1_rcbDrugName_Arrow
    2:
      operation: keys
      value: "id:ctl00_ContentPlaceHolder1_rcbDrugName_Input:${token}"
    3:
      operation: click
      value: "id:ctl00_ContentPlaceHolder1_DrugSearch"

The order in which the script executes matters, so the steps are arranged numerically. Each step can either click an element or send keys to it, and elements can be located by id, class or name, followed by the element's value. In this example we click an element found by its id, then enter the token value into an input box (again found by its id), and finally click a search button, also found by its id. This way we can search a site many times with different inputs without having to write a long script to do the injection for us.

POST/PUT Commands

Sometimes simple GETs aren't enough. For example, if you wanted to crawl the results of a search engine, it might be better to create a POST request, submit it, and read the response. To run a POST or PUT, prefix the URL you want to crawl with POST| or PUT|. So a POST might look something like POST|https://myservice.com?q=somequeryparam, and Sparkler would issue a POST.
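
For instance, injecting a POST seed might look like the sketch below (the URL is hypothetical, and quoting the value keeps the shell from interpreting the | character):

./sparkler.sh inject -su "POST|https://myservice.com?q=somequeryparam" -id test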

Config Override

Sometimes it makes sense to set up a set of default configuration options and then change some of them on the fly. For example, if you're using Sparkler as a service, it might be impractical to update a YAML file every time you want to search a different site with different settings. For this you can use the -co option to update any of the config options. To do so, convert your YAML config to JSON using an online converter of your choice, then pass the converted JSON, or just a subsection of it, in the flag. You can either override existing settings or add new ones; either will work. You do need to keep the same structure, though: Sparkler merges the original file with the override, and override values take precedence over the file defaults.

Also, depending on what you're planning to override, you'll need to pass the override to either the inject command or the crawl command. If you don't know which, just provide the same -co flag to both and it will be fine.
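
As a sketch, passing the same override to both phases might look like this (the override body is a trimmed, illustrative fragment of the full example below):

./sparkler.sh inject -su https://example.com -id test -co '{"plugins.active":["urlfilter-regex","urlfilter-samehost"]}'
./sparkler.sh crawl -id test -co '{"plugins.active":["urlfilter-regex","urlfilter-samehost"]}'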

For example:

{
   "fetcher.headers":{
      "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Sparkler/0.2.1-SNAPSHOT",
      "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
      "Accept-Language":"en-US,en"
   },
   "plugins.active":[
      "urlfilter-regex",
      "urlfilter-samehost",
      "url-injector"
   ],
   "plugins":{
      "form":{
         "selTC":"",
         "txtRequired":"",
         "radSearchBy":"drugname",
         "txtDateOfService":"12/01/2020",
         "selProgram":"MA",
         "hdnField":"submit",
         "txtName":"${token}"
      }
   }
}

Of course on the command line you can pass it in compressed like this:

./sparkler.sh inject -su https://news.bbc.co.uk -co '{"fetcher.headers":{"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Sparkler/0.2.1-SNAPSHOT","Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8","Accept-Language":"en-US,en"},"plugins.active":["urlfilter-regex","urlfilter-samehost","url-injector"],"plugins":{"form":{"selTC":"","txtRequired":"","radSearchBy":"drugname","txtDateOfService":"12/01/2020","selProgram":"MA","hdnField":"submit","txtName":"${token}"}}}'