Revert defn of CRAM container size to be sum of block sizes (PR#731)
This was added as clarification in #398 after discussion in #396, but
this was in error.  In our attempts to clarify and nail down these
corner cases, we failed to recall that the SAM header is permitted to
be padded out by non-block allocated space.

History on this decision dates back to 2013 and is shown in Samtools
issue samtools/samtools#1852.

There are good reasons for changing away from the decision of padding
via a second block, as changing block sizes can also change block
structure size (if we're using a generic shared piece of code, due to
ITF8 being a variable length integer), and this in turn makes it
cumbersome to handle every possible change in SAM header size.  It is
far easier and simpler to just have unallocated space after the block
and before the end of the container.  This is how htslib works since
CRAM 3.0 and I believe how CRAMtools.jar works.
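
To make the ITF8 point above concrete, here is a minimal Python sketch of how the size of a padding block jumps when its payload crosses an ITF8 length boundary. The helper names `itf8_size` and `padding_block_size` are hypothetical, and the block layout (method byte, content type byte, ITF8 content id taken as 0, two ITF8 sizes, data, 4-byte CRC32) is an assumption based on the CRAM 3.0 block structure, not something stated in this commit.

```python
# Sketch only: the ITF8 thresholds and block layout are assumptions (see lead-in).

def itf8_size(value: int) -> int:
    """Bytes needed to store a 32-bit value in ITF8 (value treated as unsigned)."""
    value &= 0xFFFFFFFF
    for nbytes, bits in ((1, 7), (2, 14), (3, 21), (4, 28)):
        if value < (1 << bits):
            return nbytes
    return 5


def padding_block_size(data_len: int) -> int:
    """Total size of a hypothetical uncompressed padding block of data_len bytes.

    Assumed layout: method (1) + content type (1) + content id (ITF8, taken as 0)
    + compressed size (ITF8) + raw size (ITF8) + data + CRC32 (4).
    """
    header = 1 + 1 + itf8_size(0) + 2 * itf8_size(data_len)
    return header + data_len + 4


if __name__ == "__main__":
    for n in (126, 127, 128):
        print(n, padding_block_size(n))
```

Under these assumptions a 127-byte payload gives a 136-byte block and a 128-byte payload gives a 139-byte block, so a single padding block cannot fill a gap of exactly 137 or 138 bytes. Unallocated space after the header block has no such gaps.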

Fixes samtools/samtools#1852.
jkbonfield committed Aug 6, 2024
1 parent f907ead commit ebebbc8
Showing 1 changed file with 14 additions and 3 deletions.
17 changes: 14 additions & 3 deletions CRAMv3.tex
@@ -361,6 +361,13 @@ \section{\textbf{File structure}}

Containers consist of one or more blocks. The first container, called the CRAM header container,
is used to store a textual header as described in the SAM specification (see the section 7.1).
+This container may have additional padding bytes present for purposes
+of permitting inline rewriting of the SAM header with small changes in
+size. These padding bytes are undefined, but we recommend filling with
+nuls. The padding bytes can either be in explicit uncompressed Block
+structures, or as unallocated extra space where the size of the
+container is larger than the combined size of blocks held within it.
+

\begin{center}
\begin{tikzpicture}[
@@ -377,12 +384,14 @@ \section{\textbf{File structure}}
\nodepart[text width=8em]{six}CRAM EOF Container
};

-\node (header) [boxes=2,below=1 of file.three south, text width=15em] {
+\node (header) [boxes=3,below=1 of file.three south, text width=12em] {
\nodepart{one}Block 1:\break
CRAM Header\break
-(optionally compressed)
+(optionally compressed)\
\nodepart{two}Optional Block 2:\break
nul padding bytes\break
+(uncompressed)\
+\nodepart{three}Optional padding:\break
(uncompressed)
};
\draw (file.one split south) to (header.north west);
@@ -548,7 +557,9 @@ \section{\textbf{Container header structure}}
\textbf{Data type} & \textbf{Name} & \textbf{Value}
\tabularnewline
\hline
-int32 & length & the sum of the lengths of all blocks in this container (headers and data);
+int32 & length & the sum of the lengths of all blocks in this
+container (headers and data) and any padding bytes (CRAM header
+container only);
equal to the total byte length of the container minus the byte length of this header structure\tabularnewline
\hline
itf8 & reference sequence id & reference sequence identifier or\linebreak{}
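
As an illustration of the revised `length` definition in the diff above, the following Python sketch recovers the unallocated padding in a CRAM header container as the container length minus the combined size of its blocks. The function name and arguments are hypothetical and do not correspond to any htslib or CRAMtools API; block sizes are taken to include each block's own header and CRC.

```python
# Sketch only: names are illustrative, not taken from htslib or CRAMtools.

def header_container_padding(container_length: int, block_sizes: list[int]) -> int:
    """Unallocated bytes between the last block and the end of the container.

    container_length is the 'length' field of the container header;
    block_sizes are the full on-disk sizes of the blocks it holds.
    """
    used = sum(block_sizes)
    if used > container_length:
        raise ValueError("blocks overrun the declared container length")
    return container_length - used


def fits_in_place(container_length: int, new_block_sizes: list[int]) -> bool:
    """True if rewritten blocks fit without changing the container length."""
    return sum(new_block_sizes) <= container_length


if __name__ == "__main__":
    print(header_container_padding(1000, [600]))
    print(fits_in_place(1000, [650]))
```

A reader that follows this definition would skip the leftover bytes after the last block, and a writer could grow the SAM header in place as long as `fits_in_place` stays true.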
