Using SCE

Tom Barber edited this page Sep 25, 2020 · 1 revision

Before you begin!

Run:

docker pull registry.gitlab.com/sparkler-crawl-environment/sparkler/sparkler:memex-dd

This isn't mandatory but will speed up your execution time on your first run.
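
If you are not sure whether the image is already on your machine, a small wrapper along these lines can skip the download when it is. This is only a sketch: the `ensure_image` function name and the `IMAGE` variable are made up for illustration; the image reference is the one from the command above.

```shell
IMAGE="registry.gitlab.com/sparkler-crawl-environment/sparkler/sparkler:memex-dd"

# Hypothetical helper: pull the image only when it is not already stored locally.
ensure_image() {
  image="$1"
  # "docker image inspect" exits non-zero when the image is not available locally.
  if docker image inspect "$image" >/dev/null 2>&1; then
    echo "already present: $image"
  else
    docker pull "$image"
  fi
}
```

Call it with `ensure_image "$IMAGE"` before starting SCE for the first time.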

Create a new model or select an existing model

To create a new model, click on the Models button in the toolbar. Then go to New Model, enter your model name and click Create Model.

Searching Web Pages

Enter your search term in the Search terms box on the left and press Go.

The search may take a few minutes because, under the hood, it's rendering each website and creating a screenshot.

To check it's running, you can look at the log output from the API container or the Splash container, or run top or a similar tool to confirm you've got a reasonably high CPU load.
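
One possible way to do that check from a terminal is sketched below. The `check_containers` helper is hypothetical, and the name filters you pass to it (for example `api` or `splash`) are assumptions: run `docker ps` first to see what the containers are actually called in your setup.

```shell
# Hypothetical helper: show recent logs and a resource snapshot for every
# running container whose name matches the given filter.
check_containers() {
  filter="$1"
  names=$(docker ps --filter "name=$filter" --format '{{.Names}}')
  if [ -z "$names" ]; then
    echo "no running container matches '$filter'" >&2
    return 1
  fi
  for name in $names; do
    echo "== last log lines from $name =="
    docker logs --tail 20 "$name" 2>&1
  done
  # One-off snapshot of CPU and memory use for the matched containers.
  # $names is left unquoted on purpose so each name becomes its own argument.
  docker stats --no-stream $names
}
```

For example, `check_containers splash` would print the last 20 log lines and the current CPU figures for any container with "splash" in its name.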

Eventually, it will render the images in the containers.

Ranking Pages

From here you can mark each preview as Highly Relevant, Relevant or Not Relevant. Once you are happy with your selection, press the Update Model button.

Uploading Seed URLs

To upload seed URLs, select the Paste Seed URLs button and then insert your URLs, one URL per row. Press Save and it will update the index with the URLs you've requested.
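
If your URLs come from a messy source, it can help to tidy the list into the one-URL-per-row form before pasting. The following is a minimal sketch and not part of SCE itself: `clean_seed_urls` is a hypothetical helper, and keeping only `http://` and `https://` lines is an assumption about what you want to crawl.

```shell
# Hypothetical helper: read URLs on stdin and print a clean list, one per line.
clean_seed_urls() {
  tr -d '\r' |                                            # drop Windows line endings
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |   # trim surrounding whitespace
    grep -E '^https?://' |                                # keep http(s) lines, which also drops blanks
    awk '!seen[$0]++'                                     # remove duplicates, keeping first-seen order
}
```

For example, `clean_seed_urls < urls.txt` prints a list you can copy straight into the Paste Seed URLs box (`urls.txt` being whatever file holds your raw list).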

Running a Crawl

Finally, you can run a Crawl by pressing the Start Crawler button. The duration of the crawl depends on how many pages it's attempting to index. You can run more crawls by pressing the button once more after it has completed a crawl.

You can also kill a crawl by pressing the Kill Crawl button.