Add indexing metrics and URL spot checks #18

Merged Nov 11, 2020 (6 commits)

@anjackson (Contributor) commented Nov 6, 2020

This extends the set of metrics to cover CDX indexing and some crawl activity. It queries the tracking database for the total number of WARCs that have been marked as indexed and for the timestamp of the most recent one. It also queries an open-access pywb for the most recent capture timestamp of a URL that should be crawled every day (bl.uk/robots.txt). Corresponding alerts can later be set up on both of these timestamps.

These two metrics should help us check that the CDX indexing is happening.
If records are no longer being updated, the files may have gone missing.
Add a check that queries the public CDX for the most recent timestamp of a page that should be updated every day.
Records WARCs count and most recent WARC timestamp.
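
As a rough illustration of the URL spot check described above, here is a minimal sketch rather than the code in this pull request. It assumes the CDX service returns one capture per line with the 14-digit timestamp as the second whitespace-separated field; `CDX_ENDPOINT` and all function names are hypothetical placeholders.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: substitute the real public CDX service URL.
CDX_ENDPOINT = "https://example.org/archive/cdx"


def parse_cdx_timestamp(ts: str) -> datetime:
    """Parse a 14-digit CDX timestamp (YYYYMMDDhhmmss) as a UTC datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def latest_capture(cdx_lines: Iterable[str]) -> Optional[datetime]:
    """Return the most recent capture time found in CDX lines, or None.

    Assumes each line looks like 'urlkey timestamp ...'; lines that do not
    fit that shape are skipped rather than treated as errors.
    """
    latest = None
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            ts = parse_cdx_timestamp(fields[1])
        except ValueError:
            continue
        if latest is None or ts > latest:
            latest = ts
    return latest


def fetch_cdx_lines(url: str, endpoint: str = CDX_ENDPOINT,
                    timeout: float = 10.0) -> List[str]:
    """Fetch raw CDX lines for a URL (assumes a plain 'url' query parameter)."""
    query = urlencode({"url": url})
    with urlopen(f"{endpoint}?{query}", timeout=timeout) as resp:
        return resp.read().decode("utf-8").splitlines()


def capture_age_seconds(latest: datetime,
                        now: Optional[datetime] = None) -> float:
    """Seconds elapsed since the most recent capture."""
    now = now or datetime.now(timezone.utc)
    return (now - latest).total_seconds()
```

An alert could then fire when `capture_age_seconds(latest_capture(fetch_cdx_lines("bl.uk/robots.txt")))` exceeds a little over one day, since that URL is expected to be crawled daily.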
@anjackson (Contributor, Author) commented

Okay, this has now been tested on dev. I had to remove urllib from the dependencies because it's not called that anymore, and in any case it gets pulled in as a requirement of the other dependencies.

I'll clean up this pull-request to focus on the new metrics and open a couple of issues on other things we might look at later on.

@anjackson (Contributor, Author) commented

Added #19 and #20 as issues that arose while working on this, but they need not block this, I think.

@anjackson anjackson marked this pull request as ready for review November 11, 2020 14:22
@GilHoggarth GilHoggarth merged commit 2c2757b into ukwa:master Nov 11, 2020