
Go: add support for stream handling of attachments #737

Merged: 1 commit merged into main, Nov 21, 2022

Conversation

wkalt (Contributor) commented Nov 20, 2022

Adds support for streaming reads and writes of attachment data. With this change, ParseAttachment is dropped as a public function and replaced with a private function, parseAttachmentReader.

A new type AttachmentReader is exposed, which exposes the Data portion of an attachment as a reader that can be streamed. The Attachment type is modified to include a new ContentLength field, to allow the writer to write the attachment length before stream processing the body.

A new option, AttachmentCallback, is exposed in the Lexer. Rather than emitting attachments, the Lexer now parses an attachment reader and calls the supplied callback (if supplied), and consumes any unconsumed bytes from the attachment record (whether or not a callback is supplied).
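The callback-and-drain flow described above can be sketched with standard-library types only. Everything here (`attachmentReader`, `demo`, the field names) is an illustrative stand-in under assumed semantics, not the SDK's actual API: the real `AttachmentReader` is produced by the lexer, not constructed by hand.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// attachmentReader is a stand-in for the SDK's AttachmentReader type.
type attachmentReader struct {
	Name          string
	ContentLength uint64
	data          *io.LimitedReader
}

// Data exposes the attachment body as a plain reader that can be streamed.
func (a *attachmentReader) Data() io.Reader { return a.data }

// drain consumes whatever the callback left unread, so lexing can resume
// at the next record boundary whether or not a callback was supplied.
func (a *attachmentReader) drain() error {
	_, err := io.Copy(io.Discard, a.data)
	return err
}

// demo runs a callback that reads only part of the body, then drains the
// rest; it returns the bytes the callback read and the bytes left unread.
func demo() (int, int64) {
	body := strings.NewReader("attachment bytes")
	ar := &attachmentReader{
		Name:          "example.bin",
		ContentLength: uint64(body.Len()),
		data:          &io.LimitedReader{R: body, N: int64(body.Len())},
	}
	var nRead int
	callback := func(ar *attachmentReader) error {
		buf := make([]byte, 10)
		n, err := io.ReadFull(ar.Data(), buf)
		nRead = n
		return err
	}
	if err := callback(ar); err != nil {
		panic(err)
	}
	if err := ar.drain(); err != nil {
		panic(err)
	}
	return nRead, ar.data.N
}

func main() {
	n, rem := demo()
	fmt.Println("callback read", n, "bytes; unread after drain:", rem)
}
```

The key property is the drain step: the lexer stays correct even when the callback reads nothing at all.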

This is a breaking change in the Go SDK, and it introduces some awkwardness due to the new ContentLength field in the Attachment record. In the specification, that value is folded into the Data field; however, that is not a tenable approach for stream handling. The actual binary format is not affected by this, only the way the binary format is described.

This change is required for readers and writers to handle attachments bigger than available RAM. Without it, a sufficiently large attachment will crash the reader or writer with an out-of-memory error.
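The point above that only the spec's wording changes can be made concrete: whether the uint64 length prefix is described as part of Data or as a separate ContentLength field, the bytes on the wire are identical. A minimal sketch (not the SDK's writer; `writeFolded`, `writeSplit`, and `identical` are hypothetical names) serializing both descriptions:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// writeFolded serializes with the length folded into the Data field,
// as the spec currently describes it.
func writeFolded(data []byte) []byte {
	var buf bytes.Buffer
	_ = binary.Write(&buf, binary.LittleEndian, uint64(len(data)))
	buf.Write(data)
	return buf.Bytes()
}

// writeSplit serializes ContentLength up front, which is what lets a
// writer stream the body without buffering it.
func writeSplit(contentLength uint64, data []byte) []byte {
	var buf bytes.Buffer
	_ = binary.Write(&buf, binary.LittleEndian, contentLength)
	buf.Write(data)
	return buf.Bytes()
}

// identical reports whether both descriptions produce the same bytes.
func identical(payload []byte) bool {
	return bytes.Equal(writeFolded(payload), writeSplit(uint64(len(payload)), payload))
}

func main() {
	fmt.Println(identical([]byte("attachment body")))
}
```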


wkalt (Contributor, author) commented Nov 20, 2022

The point raised in the commit message about ContentLength relates to this issue: #736

Before merging this patch we should pick what name we want to use for that field, if people are on board with the idea of splitting Data into ContentLength and Data. I don't feel too strongly about whether that spec change needs to go in before or after this merges, but before is probably slightly preferable.

Also whatever approach we go with for stream handling (a callback mechanism in this patch) should probably land in the other SDKs at some point, so we should make sure we are all comfortable with the approach.

go/mcap/lexer.go (outdated, resolved)

go/mcap/parse.go (outdated):

```go
logTime, offset, err := getUint64(buf, 0)

func parseAttachmentReader(r io.Reader) (*AttachmentReader, error) {
	buf := make([]byte, 8)
	crcReader := newCRCReader(r)
```
Collaborator commented:

Can we make CRC validation optional somehow?

wkalt (Contributor, author) replied:

Yeah, I can add an option to skip the computation of the CRC. Actual validation of the CRC is up to the callback function the user supplies.
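A minimal sketch of what an optional CRC computation could look like, using only `hash/crc32`. The `crcReader` name mirrors the diff above, but this implementation and the `compute` flag are assumptions, not the SDK's code:

```go
package main

import (
	"fmt"
	"hash/crc32"
	"io"
	"strings"
)

// crcReader wraps an io.Reader and, when enabled, folds every byte read
// into a running CRC32. With computation disabled it is a pure
// pass-through, which is what the proposed skip option would select.
type crcReader struct {
	r       io.Reader
	crc     uint32
	compute bool
}

func (c *crcReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	if c.compute && n > 0 {
		c.crc = crc32.Update(c.crc, crc32.IEEETable, p[:n])
	}
	return n, err
}

// checksum drains the input through a crcReader and returns the CRC
// (zero when computation is disabled).
func checksum(data string, compute bool) uint32 {
	cr := &crcReader{r: strings.NewReader(data), compute: compute}
	_, _ = io.Copy(io.Discard, cr)
	return cr.crc
}

func main() {
	fmt.Printf("computed=%08x skipped=%08x\n",
		checksum("hello", true), checksum("hello", false))
}
```

This keeps the choice with the reader configuration, while leaving validation of the resulting value to the user's callback, as described above.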

go/mcap/parse.go (outdated, resolved)
go/mcap/mcap.go (outdated, resolved)
go/mcap/mcap.go (outdated, resolved)
In type LexerOptions struct:

```go
// MaxRecordSize defines the maximum size record the lexer will read.
// Records larger than this will result in an error.
MaxRecordSize int

// AttachmentCallback
AttachmentCallback func(*AttachmentReader) error
```
wkalt (Contributor, author) commented:

expand this comment

```go
})

for !errors.Is(err, io.EOF) {
	_, _, err = lexer.Next(nil)
```
wkalt (Contributor, author) commented:

Make some assertion here for clarity; the error should be nil or io.EOF.

go/mcap/mcap.go (outdated):

```go
Name          string
MediaType     string
ContentLength uint64
data          *io.LimitedReader
```
wkalt (Contributor, author) commented:

Group this with the other private fields.

```go
8 + // create time
4 + // attachment name length
4 + // media type length
4 + // data size
```
wkalt (Contributor, author) commented:

This seems like it should be 8. Why doesn't it cause a parse failure? I guess because the test isn't doing anything with the CRC and is interpreting the CRC bytes as part of the data size. Since it's all zeros due to the empty content, that would erroneously pass.
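The arithmetic behind the comment, as a self-contained sketch. The field widths follow the excerpt quoted above with the data-size prefix corrected to 8 bytes; `fixedOverhead` and the log-time/CRC fields included here are assumptions based on the surrounding diff, not a definitive statement of the record layout:

```go
package main

import "fmt"

// Fixed-size portion of an attachment record body, with the data size
// as a uint64 (8 bytes) rather than the 4 the test used.
const (
	logTimeLen    = 8 // log time
	createTimeLen = 8 // create time
	nameLenLen    = 4 // attachment name length prefix
	mediaTypeLen  = 4 // media type length prefix
	dataSizeLen   = 8 // data size -- the field the review flags as wrong
	crcLen        = 4 // trailing CRC
)

// fixedOverhead sums the fixed-width fields surrounding the variable-length
// name, media type, and data payload.
func fixedOverhead() int {
	return logTimeLen + createTimeLen + nameLenLen + mediaTypeLen + dataSizeLen + crcLen
}

func main() {
	fmt.Println(fixedOverhead())
}
```

With an empty payload, an off-by-four here lands the 4-byte size read on the all-zero CRC bytes, which is why the test passed anyway.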

james-rms (Collaborator) left a review:
This looks like it's going in a good direction - once it's possible to skip CRC calculation when reading attachments, I'm happy for this to merge.

@wkalt force-pushed the wyatt/go/attachment-stream-handling branch from 3fc6c52 to d6f8b05 (November 21, 2022 22:11)
@wkalt merged commit ea4b40d into main, Nov 21, 2022
@wkalt deleted the wyatt/go/attachment-stream-handling branch (November 21, 2022 22:20)