WARC ArchiveRecordHeader incorrectly handles some fields #19

Open
kris-sigur opened this issue Jul 28, 2014 · 0 comments

Comments

@kris-sigur
Member

It seems that WARCRecord doesn't always do the right thing when it reads from the WARC header. I've run into two instances where using the provided methods led to wrong or null content.

  1. See line 201, where WARCRecord reads recordIdentifier. One might assume that this would return the WARC-Record-ID field, but it doesn't. Instead, this looks like a copy-paste error from how ARCRecord does things. In practice it will always be null.
  2. Line 146 has an even more blatant bug: getDigest simply returns null instead of the WARC-Payload-Digest value.

There may be other errors.

The only safe way to access WARC headers is via the getHeaderValue() method. Even that can be tricky if you want to use the constants from WARCConstants, as their names aren't always aligned with the WARC spec (e.g. HEADER_KEY_ID, line 162, really should be HEADER_RECORD_ID).
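
To illustrate the workaround, here is a minimal sketch of looking the fields up by their WARC spec names instead of relying on the convenience getters or the WARCConstants names. The HeaderSource interface, the WarcHeaderFields class and its helper methods are hypothetical names made up for this example; HeaderSource only stands in for whatever object exposes getHeaderValue(), so that the snippet runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

final class WarcHeaderFields {
    // Field names exactly as they are written in the WARC specification.
    static final String WARC_RECORD_ID = "WARC-Record-ID";
    static final String WARC_PAYLOAD_DIGEST = "WARC-Payload-Digest";

    /** Hypothetical stand-in for the header object's getHeaderValue(String) lookup. */
    interface HeaderSource {
        Object getHeaderValue(String key);
    }

    /** Returns the named field as a String, or null if the header does not carry it. */
    static String field(HeaderSource header, String name) {
        Object value = header.getHeaderValue(name);
        return value == null ? null : value.toString();
    }

    static String recordId(HeaderSource header) {
        return field(header, WARC_RECORD_ID);
    }

    static String payloadDigest(HeaderSource header) {
        return field(header, WARC_PAYLOAD_DIGEST);
    }

    public static void main(String[] args) {
        // Placeholder header fields; a real record would supply these.
        Map<String, Object> fields = new HashMap<>();
        fields.put(WARC_RECORD_ID, "<urn:uuid:00000000-0000-0000-0000-000000000000>");
        HeaderSource header = fields::get;

        System.out.println(recordId(header));      // the placeholder record ID
        System.out.println(payloadDigest(header)); // null: field absent in this example
    }
}
```

With this approach a null result means the field is genuinely missing from the header, not that a getter ignored it.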

Some of this seems to have been made possible by the fact that WARCConstants extends ArchiveFileConstants. It might be best to sever this connection.

WARCRecord also implements a deprecated version of WARCConstants; we should really fix that while we are at it.
