web-auditor

This tool provides 4 scripts: audit, create, upload and clean-out, each described in the sections below.

The idea behind splitting the audit from the upload is that the audit is very network- and CPU-intensive. Under those circumstances, the audit script might not succeed on the first run. So re-run the audit until it passes, then launch the upload.
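In practice this means repeating the audit until it completes, and only then uploading. A minimal sketch of that workflow, assuming audit.js exits with a non-zero status when the crawl does not complete:

until node src/audit.js https://elasticms.fgov.be; do echo "Audit did not complete, retrying..."; done
node src/upload.js https://elasticms.fgov.be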

Prerequisite

You may want to use Docker; see the Docker chapter below.

Both Puppeteer and Chromium must be installed and working. Here is an example for Ubuntu:

sudo apt-get install chromium-browser
sudo apt-get install libx11-xcb1 libxcomposite1 libasound2 libatk1.0-0 libatk-bridge2.0-0 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgbm1 libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6
npm install

Check the Puppeteer documentation for your platform.
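To quickly verify that Puppeteer can actually drive the installed Chromium, a one-off launch test can help. This is a minimal sketch; it assumes the Chromium revision bundled by npm install is used:

node -e "require('puppeteer').launch().then(browser => { console.log('Puppeteer OK'); return browser.close(); })"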

Audit

This script launches the audit. One JSON file per audited URL is saved in the directory storage/datasets/https__elasticms.fgov.be/

node src/audit.js https://elasticms.fgov.be
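Once the crawl has finished, you can for instance check how many pages were audited by counting the JSON files in that dataset directory (a simple sketch, assuming the same base URL as above):

ls storage/datasets/https__elasticms.fgov.be/ | wc -l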

Create local report

create.js generates a summary report for a prior audit and launches a local server for review. The report, which includes the total error count, the number of pages with errors, and the audit date, is saved in storage/reports/ and is easy to share (for analysis and corrections). It also provides a list of error types and a breakdown of pages with errors, including specific error information.

Running an audit (using audit.js) on the URL before executing create.js is mandatory.

node src/create.js  https://elasticms.fgov.be
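The generated report files can then be found under storage/reports/, for example:

ls storage/reports/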

Upload

This script uploads the JSON files currently present in the folder storage/datasets/https__elasticms.fgov.be/ to elasticms. First ensure that the audit has completed.

The audit base URL is mandatory in order to identify the right dataset to upload.

Also define these 2 environment variables, either in the environment or in a .env file (a sample is shown below):

  • WEB_AUDIT_EMS_ADMIN
  • WEB_AUDIT_EMS_AUTHKEY
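For example, a minimal .env file could look like this (both values are placeholders; use your own elasticms admin URL and authentication key):

WEB_AUDIT_EMS_ADMIN=https://my-elasticms-admin.example.org
WEB_AUDIT_EMS_AUTHKEY=my-secret-auth-key

Then run the upload: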
node src/upload.js  https://elasticms.fgov.be/

Cleaning

This script cleans out audit results that are older than storage/datasets/https__elasticms.fgov.be/000000001.json.

Caution: this script does not currently give live feedback. Check the elasticms job's logs for live status.

The audit base URL is mandatory in order to identify the right dataset to clean out.

node src/clean-out.js  https://elasticms.fgov.be

All in one

A shell script at the root of the repository is available to audit, upload and clean a website with a single command:

./audit.sh  https://elasticms.fgov.be

Script's arguments

  • URL used to start the audit [mandatory]
  • Dataset ID, used to identify the audit's dataset [optional, generated from the URL argument by default]

Script's options

All options can be provided to all WebAuditor scripts, so that they can be passed through the audit.sh script, but they don't always have an effect on all scripts:

  • --ignore-ssl=true: used to ignore SSL errors (only for the audit.js, clean-out.js and upload.js scripts)
  • --content=true: also extract the text content, when supported, into a content field (for HTML and via textract)
  • --status-code=200: Display all links with a return code above the one provided (only for the create.js script)
  • --max-pages=5000: Limit the summary overview to the first x audited pages (performance issue if the website contains too many accessibility (a11y) issues and/or too many broken links) (only for the create.js script). Try --max-pages=all to load all pages.
  • --wait-until=load: If defined, the page audit will be initiated only after the provided event is triggered. Check this blog page. (only for the audit.js script)
  • --pa11y-limit=100: Limit the upload of Pa11y errors to the first x ones. Default value: 100. (only for the upload.js script)
  • --status-code-limit=404: If defined, limit the upload of links to those with a status code greater than or equal to x. (only for the upload.js script)

And then you can run:

./audit.sh --ignore-ssl=true  https://elasticms.fgov.be/

Docker

Build the image

docker compose build

Run the web-auditor scripts

Script by script:

docker compose run --rm web-auditor audit --ignore-ssl=true --content=true https://elasticms.fgov.be/
docker compose run --rm --service-ports web-auditor create https://elasticms.fgov.be/
docker compose run --rm web-auditor upload --pa11y-limit=10 --status-code-limit=404 https://elasticms.fgov.be/
docker compose run --rm web-auditor clean-out https://elasticms.fgov.be/
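When running the upload step in Docker, the two elasticms variables must also be available inside the container. Assuming they are set in your shell, they can for instance be forwarded explicitly (a sketch; your compose file may already load them from a .env file):

docker compose run --rm -e WEB_AUDIT_EMS_ADMIN="$WEB_AUDIT_EMS_ADMIN" -e WEB_AUDIT_EMS_AUTHKEY="$WEB_AUDIT_EMS_AUTHKEY" web-auditor upload https://elasticms.fgov.be/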

Or all in one (without the create script):

docker compose run --rm web-auditor all --ignore-ssl=true --pa11y-limit=10 --status-code-limit=404 https://elasticms.fgov.be/

How to

How to keep current results

By default, previous results are purged when a new run starts. To keep them, set the environment variable CRAWLEE_PURGE_ON_START to 0:

CRAWLEE_PURGE_ON_START=0 node src/audit.js https://elasticms.fgov.be

Increase the memory available for Puppeteer

By default, Crawlee is set to use only 25% of the available memory. You can update the configuration by setting the environment variable CRAWLEE_AVAILABLE_MEMORY_RATIO. I would recommend setting it to 0.8, especially if you want to scan a large website (>5,000 pages):

CRAWLEE_AVAILABLE_MEMORY_RATIO=0.8 node src/audit.js https://elasticms.fgov.be