From ebebbc8c2910ef2d4c5e7119c6f9ffac3bb6a0cb Mon Sep 17 00:00:00 2001
From: James Bonfield
Date: Tue, 27 Jun 2023 16:49:01 +0100
Subject: [PATCH] Revert defn of CRAM container size to be sum of block sizes (PR#731)

This was added as clarification in #398 after discussion in #396, but
this was in error. In our attempts to clarify and nail down these
corner cases, we failed to recall that the SAM header is permitted to
be padded out by non-block allocated space. History on this decision
dates back to 2013 and is shown in Samtools issue
samtools/samtools#1852.

There are good reasons for changing away from the decision of padding
via a second block, as changing block sizes can also change block
structure size (if we're using a generic shared piece of code, due to
ITF8 being a variable length integer), and this in turn makes it
cumbersome to handle every possible change in SAM header size. It is
far easier and simpler to just have unallocated space after the block
and before the end of the container. This is how htslib works since
CRAM 3.0 and I believe how CRAMtools.jar works.

Fixes samtools/samtools#1852.
---
 CRAMv3.tex | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/CRAMv3.tex b/CRAMv3.tex
index 25e98cb6..c54abd35 100644
--- a/CRAMv3.tex
+++ b/CRAMv3.tex
@@ -361,6 +361,13 @@ \section{\textbf{File structure}}
 Containers consist of one or more blocks. The first container, called the CRAM header container, is used to store a textual header as described in the SAM specification (see the section 7.1).
 
+This container may have additional padding bytes present for purposes
+of permitting inline rewriting of the SAM header with small changes in
+size. These padding bytes are undefined, but we recommend filling with
+nuls. The padding bytes can either be in explicit uncompressed Block
+structures, or as unallocated extra space where the size of the
+container is larger than the combined size of blocks held within it.
+
 \begin{center}
 \begin{tikzpicture}[
@@ -377,12 +384,14 @@ \section{\textbf{File structure}}
 \nodepart[text width=8em]{six}CRAM EOF Container
 };
 
-\node (header) [boxes=2,below=1 of file.three south, text width=15em] {
+\node (header) [boxes=3,below=1 of file.three south, text width=12em] {
 \nodepart{one}Block 1:\break
 CRAM Header\break
-(optionally compressed)
+(optionally compressed)\
 \nodepart{two}Optional Block 2:\break
 nul padding bytes\break
+(uncompressed)\
+\nodepart{three}Optional padding:\break
 (uncompressed)
 };
 \draw (file.one split south) to (header.north west);
@@ -548,7 +557,9 @@ \section{\textbf{Container header structure}}
 \textbf{Data type} & \textbf{Name} & \textbf{Value}
 \tabularnewline
 \hline
-int32 & length & the sum of the lengths of all blocks in this container (headers and data);
+int32 & length & the sum of the lengths of all blocks in this
+container (headers and data) and any padding bytes (CRAM header
+container only);
 equal to the total byte length of the container minus the byte length of this header structure\tabularnewline
 \hline
 itf8 & reference sequence id & reference sequence identifier or\linebreak{}