Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify container header fields. Resolves #396 #398

Merged
merged 4 commits into from
Apr 4, 2019

Conversation

jmthibault79
Copy link
Contributor

No description provided.

@jmarshall
Copy link
Member

(As a reminder: please create PR branches in a fork of the hts-specs repository — see MAINTAINERS.md. But this time it was a useful test to highlight #399 😄)

@jmthibault79
Copy link
Contributor Author

Oops, sorry about that.

CRAMv3.tex Outdated
blocks byte array. Landmarks are used for random access indexing.\tabularnewline
itf8[ ] & landmarks & the locations of slices in this container as byte offsets from the end of
this container header, used for random access indexing.
The landmark count must equal the slice count.\tabularnewline
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this definition imply that the first landmark entry would always be 0, since the first slice header would always be immediately after the container header ? That doesn't seem to match the cram_dump output that @jkbonfield put in the comment in #396, nor what I see in htsjdk code:

final int containerHeaderSize = landmarks[0];
final int containerTotalByteSize = containerHeaderSize + containerByteSize;

Not sure which is incorrect, but the code seems to count the header size in the landmark offset, as if it were an offset from the beginning of the header/container, not the end of the header ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

landmarks[0] = compression header length. We could add a note here saying so.

I can make a quick PR to name this value more precisely in htsjdk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, these states are equivalent for containers:

  • no landmarks = no slices = no compression header

Copy link
Contributor

@cmnbroad cmnbroad Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see - I wasn't properly distinguishing compression header from container header. I was thinking of them as a single header, but that doesn't match the spec treatment. I think your suggestion of noting that here is a good one, and would make that explicit.

Copy link
Contributor

@jkbonfield jkbonfield Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. This is a case of "if only!". It doesn't make sense as if we have slices we have to decode container structure and the compression header, so the seek should be relative to that location, not relative to between the structure end and the compression header start.

It's a case of "I wouldn't start from here"...

I can try to make it more explicit. (Edit: although isn't that what the picture is stating?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new picture. Yes, it has helped to clarify this confusion.

I updated the text here in a new commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got muddled with which this PR was (too many related ones), sorry, but yes the picture elsewhere clarifies things there and your explicit comment regarding landmark[0] being size of the compression header helps a lot.

jmthibault79 added a commit to samtools/htsjdk that referenced this pull request Apr 1, 2019
- added comments clarifying the real situation
- see samtools/hts-specs#396
- see samtools/hts-specs#398
@hts-specs-bot
Copy link

Changed PDFs as of bd2c42c: CRAMv3 (diff).

@jkbonfield
Copy link
Contributor

Changed PDFs as of bd2c42c: CRAMv3 (diff).

The diffs artifact doesn't seem to contain the contents of this commit. It's just one page long. @yfarjoun

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor quibble only on block length wording that I'll accept either way. A good improvement especially the landmark clarification. Approved.

CRAMv3.tex Outdated
@@ -440,7 +440,8 @@ \section{\textbf{Container structure}}
\textbf{Data type} & \textbf{Name} & \textbf{Value}
\tabularnewline
\hline
int32 & length & byte size of the container data (blocks)\tabularnewline
int32 & length & the sum of the byte lengths of all blocks in this container;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"byte lengths of all blocks" may be ambiguous. Are we talking the size of all data[] parts of blocks, or also including the block structure itself? We know it's the latter, but it may not be clear from this wording given "byte lengths" and the fact that blocks have a "byte[]" array in their structure.

To be totally unambiguous, maybe use "the sum of the lengths of all blocks (both structures and data) in this container".

@jmarshall
Copy link
Member

The diffs artifact doesn't seem to contain the contents of this commit. It's just one page long.

Yep, it's the wrong one page. Yossi's been looking into that — cf #400 and the issue linked from there.

@hts-specs-bot
Copy link

Changed PDFs as of 835c877: CRAMv3 (diff).

jmthibault79 added a commit to samtools/htsjdk that referenced this pull request Apr 3, 2019
* Revert #1326 because it was correct before
- added comments clarifying the real situation
- see samtools/hts-specs#396
- see samtools/hts-specs#398
- comments and rename CRAIEntry.sliceByteOffset
- blockCount comments
CRAMv3.tex Outdated Show resolved Hide resolved
@cmnbroad
Copy link
Contributor

cmnbroad commented Apr 3, 2019

One additional issue, though perhaps for another PR. Otherwise this looks good to me.

Co-Authored-By: jmthibault79 <thibault@broadinstitute.org>
@hts-specs-bot
Copy link

Changed PDFs as of 180524d: CRAMv3 (diff).

@jmthibault79 jmthibault79 merged commit c689122 into master Apr 4, 2019
@jmthibault79 jmthibault79 deleted the jt_container_header branch April 4, 2019 15:25
jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Jun 27, 2023
This was added as clarification in samtools#398 after discussion in samtools#396, but
this was in error.  In our attempts to clarify and nail down these
corner cases, we failed to recall that the SAM header is permitted to
be padded out by non-block allocated space.

History on this decision dates back to 2013 and is show in Samtools
issue samtools/samtools#1852.

There are good reasons for changing away from the decision of padding
via a second block, as changing block sizes can also change block
structure size (if we're using a generic shared piece of code, due to
ITF8 being a variable length integer), and this in turn makes it
cumbersome to handle every possible change in SAM header size.  It is
far easier and simpler to just have unallocated space after the block
and before the end of the container.  This is how htslib works since
CRAM 3.0 and I believe how CRAMtools.jar works.

Fixes samtools/samtools#1852.
jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Sep 27, 2023
This was added as clarification in samtools#398 after discussion in samtools#396, but
this was in error.  In our attempts to clarify and nail down these
corner cases, we failed to recall that the SAM header is permitted to
be padded out by non-block allocated space.

History on this decision dates back to 2013 and is show in Samtools
issue samtools/samtools#1852.

There are good reasons for changing away from the decision of padding
via a second block, as changing block sizes can also change block
structure size (if we're using a generic shared piece of code, due to
ITF8 being a variable length integer), and this in turn makes it
cumbersome to handle every possible change in SAM header size.  It is
far easier and simpler to just have unallocated space after the block
and before the end of the container.  This is how htslib works since
CRAM 3.0 and I believe how CRAMtools.jar works.

Fixes samtools/samtools#1852.
jkbonfield added a commit that referenced this pull request Aug 6, 2024
This was added as clarification in #398 after discussion in #396, but
this was in error.  In our attempts to clarify and nail down these
corner cases, we failed to recall that the SAM header is permitted to
be padded out by non-block allocated space.

History on this decision dates back to 2013 and is show in Samtools
issue samtools/samtools#1852.

There are good reasons for changing away from the decision of padding
via a second block, as changing block sizes can also change block
structure size (if we're using a generic shared piece of code, due to
ITF8 being a variable length integer), and this in turn makes it
cumbersome to handle every possible change in SAM header size.  It is
far easier and simpler to just have unallocated space after the block
and before the end of the container.  This is how htslib works since
CRAM 3.0 and I believe how CRAMtools.jar works.

Fixes samtools/samtools#1852.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants