
Support for embedded raw binary #623

Closed · elrnv opened this issue Jul 16, 2023 · 10 comments · Fixed by #783

elrnv (Contributor) commented Jul 16, 2023

There are use cases where embedding binary data into an XML document is convenient. Although not officially supported by the XML spec, there are valid and popular uses in the wild (e.g. VTK). Binary VTK files are quite common; in fact the <AppendedData> tag (which can contain binary data) is the default when writing VTK files, and there are plenty of files like this in the wild.

In the past quick-xml was able to support binary data, but this has changed, and in the current state some non-trivial changes are required to add support for writing and reading uninterpreted binary data.

My first attempt didn't really address the issue correctly, since binary data can contain valid '<' and '>' bytes, which can trip up the parser. This means that at least reading must be augmented at the lowest level (probably as part of Reader).

As discussed previously, the implementation for this should work as an additional feature.

Reading

I think it is not difficult to add a customization point for the reader in quick-xml without being too invasive: add a "context" field (supplied externally) to Reader which can be asked how many bytes should be read next, given context about what has already been read. A rough sketch of the idea follows.
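A minimal sketch of what such a context hook might look like; the BinaryContext trait and its method are hypothetical names for illustration, not existing quick-xml API:

```rust
/// Hypothetical customization point supplied to the Reader. After each
/// start tag the reader would ask it whether (and how many) raw bytes
/// follow that should be passed through uninterpreted.
trait BinaryContext {
    /// `element_name` is the name of the element that was just opened.
    /// Return `Some(n)` to read `n` bytes as raw binary, or `None` to
    /// continue normal XML parsing.
    fn binary_len(&mut self, element_name: &[u8]) -> Option<usize>;
}

/// Example context for VTK-style appended data; the fixed length here is
/// a placeholder for whatever the caller derives from the size prefix.
struct VtkAppendedData;

impl BinaryContext for VtkAppendedData {
    fn binary_len(&mut self, element_name: &[u8]) -> Option<usize> {
        (element_name == b"AppendedData").then_some(1024)
    }
}
```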

Writing

All types in the serialize part of the code are bound by fmt::Write. I haven't found a way to maintain that bound everywhere while also supporting writing binary data to an optional or custom io::Write type. To enable this, I think we have to switch back to io::Write. I understand that there are some performance concerns about having to write to a buffer if io::Write is used, but I don't see another way of adding support for binary writing without io::Write. I could really use some help from the maintainers on this point.
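For context on that constraint: std::fmt::Write only accepts &str, so everything funnelled through it must be valid UTF-8, while std::io::Write works on raw bytes. A small illustration (not quick-xml code):

```rust
use std::fmt::Write as FmtWrite;
use std::io::Write as IoWrite;

// fmt::Write only exposes write_str/write_fmt, so arbitrary, possibly
// non-UTF-8 bytes cannot be routed through it.
fn write_markup<W: FmtWrite>(out: &mut W, s: &str) -> std::fmt::Result {
    out.write_str(s)
}

// io::Write works on &[u8], so raw binary blobs are fine.
fn write_blob<W: IoWrite>(out: &mut W, bytes: &[u8]) -> std::io::Result<()> {
    out.write_all(bytes)
}

fn main() -> std::io::Result<()> {
    let mut markup = String::new();
    write_markup(&mut markup, "<AppendedData encoding=\"raw\">_").unwrap();

    let mut sink = Vec::new(); // stand-in for a file or any io::Write
    write_blob(&mut sink, &[0x00, 0xFF, 0x3C, 0x3E])?;
    Ok(())
}
```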

Mingun (Collaborator) commented Jul 16, 2023

It would be valuable if you could link or attach some example files and provide links to an explanation of how a VTK parser should understand that some byte is not markup but part of the binary data.

I found an archive with some files in elrnv/vtkio#27, and right now I suspect the following algorithm for resolving the ambiguity (a rough sketch follows the list):

  • when an opening tag with a special mark is read (<AppendedData encoding="raw"> in the example), the parser switches into a special binary reading mode
  • in that mode any sequence of bytes is allowed until the closing tag sequence is encountered (</AppendedData> in the example)
  • the closing tag sequence should be interpreted in the document encoding (UTF-8 in the example)
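A minimal sketch of that scan, independent of quick-xml, assuming the relevant bytes are already in memory and the closing tag is encoded in UTF-8:

```rust
/// Return the raw payload that precedes the closing tag sequence, plus the
/// offset just past the closing tag. This is only the algorithm sketched
/// above, not quick-xml API.
fn split_at_closing_tag<'a>(input: &'a [u8], closing: &[u8]) -> Option<(&'a [u8], usize)> {
    let pos = input.windows(closing.len()).position(|w| w == closing)?;
    Some((&input[..pos], pos + closing.len()))
}

fn main() {
    // Raw bytes may legitimately contain '<' (0x3C) and '>' (0x3E).
    let data: &[u8] = b"\x01\x3C\x3E\xFF\x00</AppendedData>";
    let (binary, end) = split_at_closing_tag(data, b"</AppendedData>").unwrap();
    assert_eq!(binary, &[0x01, 0x3C, 0x3E, 0xFF, 0x00]);
    assert_eq!(end, data.len());
}
```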

So I think we could work on this issue in a few steps:

  • Implement a way to read binary data. I think it should be a special method (or methods) on Reader, called after consuming Event::Start(r#"<AppendedData encoding="raw">"#), that fills a provided buffer with the data and returns a Span. A special Event::Binary is unlikely to be needed at this point, because there is no generic way to know that we should read binary data instead of text data. But we could think about a callback that takes a &BytesStart and returns a parsing mode (see the sketch after this list). I think this part could be available without feature flags
  • Think about an API for the Writer. Actually, I think we need to split the current Event into reader::Event and writer::Event, because that could benefit this and other cases
  • Think about the serde API and implementation
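A sketch of the callback idea from the first step; ParseMode and the detector function are hypothetical names, not existing quick-xml API (only BytesStart is real):

```rust
use quick_xml::events::BytesStart;

/// Hypothetical mode returned by a user-supplied callback after each start tag.
enum ParseMode {
    /// Continue normal XML parsing.
    Xml,
    /// Treat the element content as raw bytes until the matching end tag.
    Binary,
}

/// Hypothetical detector for VTK-style appended data.
fn detect_mode(start: &BytesStart) -> ParseMode {
    if start.name().as_ref() == b"AppendedData" {
        ParseMode::Binary
    } else {
        ParseMode::Xml
    }
}
```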

elrnv (Contributor, Author) commented Jul 16, 2023

Thank you for the quick response!

It would be valuable if you could link or attach some example files and provide links to an explanation of how a VTK parser should understand that some byte is not markup but part of the binary data.

Here are a few examples:

There doesn't seem to be any documentation on how to parse this kind of binary section, just on how to decode it. The official VTK file format is documented here.
The binary section is always prefixed with an underscore, and the first few bytes determine how large the binary section is (a sketch of decoding that prefix follows).
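A minimal sketch of decoding that prefix, assuming a little-endian UInt32 size header (the actual header type and byte order are declared elsewhere in the file, so these are illustrative assumptions):

```rust
/// Given the appended-data section (starting at the underscore), return the
/// length-prefixed binary payload. Assumes a little-endian u32 size header;
/// real files may declare u64 headers or big-endian byte order.
fn decode_appended_block(section: &[u8]) -> Option<&[u8]> {
    // The appended data always starts with a single '_' marker.
    let rest = section.strip_prefix(b"_")?;
    // The next few bytes give the size of the following binary block.
    let len = u32::from_le_bytes(rest.get(..4)?.try_into().ok()?) as usize;
    rest.get(4..4 + len)
}

fn main() {
    // A 5-byte payload with its length (5) encoded as a little-endian u32.
    let section = b"_\x05\x00\x00\x00hello</AppendedData>";
    assert_eq!(decode_appended_block(section), Some(&b"hello"[..]));
}
```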

  • when an opening tag with a special mark is read (<AppendedData encoding="raw"> in the example), the parser switches into a special binary reading mode
  • in that mode any sequence of bytes is allowed until the closing tag sequence is encountered (</AppendedData> in the example)
  • the closing tag sequence should be interpreted in the document encoding (UTF-8 in the example)

That makes sense to me, and it should be robust enough to support practically all such VTK files with embedded binary blobs. Although I think this logic is ideally best left in vtkio; here it seems we should just provide a way to do context-dependent parsing that is general enough for this use case.

So I think we could work on this issue in a few steps ...

This makes sense to me, but for the writer I am still in the dark as to how io::Write should be supported throughout the serde modules.

It is easiest for me to prototype a complete solution for reading and writing and test it directly using the tests I already have in vtkio. I am happy to start a PR, though, and try to separate the reading and writing changes to make them easier to review.

elrnv added a commit to elrnv/vtkio that referenced this issue Jul 16, 2023
This commit sadly breaks support for VTK files with binary AppendedData
blobs. The code needed to support this is currently not functional and
depends on the completion of this feature in quick-xml
(tracked in tafia/quick-xml#623).
All binary-related code is now feature gated, which is a regression
from previous versions of vtkio, but it will be re-enabled in the future.

Mingun (Collaborator) commented Jul 11, 2024

Reading binary data is probably possible today if you get the inner reader when you're ready to read binary data and read from it directly. The possible drawback is that the internal offsets in the Reader won't be updated, so you will need to track the bytes you read and correct the results of .error_position() / .buffer_position() accordingly.
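A rough sketch of that approach, assuming a recent quick-xml where read_event_into and get_mut are available; the 5-byte count here is a stand-in for whatever the application derives from the VTK size prefix:

```rust
use std::io::{BufReader, Read};

use quick_xml::events::Event;
use quick_xml::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let xml: &[u8] = b"<AppendedData encoding=\"raw\">_\x01\x02\x03\x04</AppendedData>";
    let mut reader = Reader::from_reader(BufReader::new(xml));
    let mut buf = Vec::new();

    loop {
        match reader.read_event_into(&mut buf)? {
            Event::Start(e) if e.name().as_ref() == b"AppendedData" => {
                // Bypass the XML parser and read raw bytes from the inner reader.
                // The caller decides how many bytes to consume; here it is the
                // '_' marker plus a 4-byte payload.
                let mut raw = [0u8; 5];
                reader.get_mut().read_exact(&mut raw)?;
                println!("raw bytes: {:?}", raw);
                // Caveat from above: buffer_position()/error_position()
                // are now off by the 5 bytes read behind the parser's back.
            }
            Event::Eof => break,
            _ => {}
        }
        buf.clear();
    }
    Ok(())
}
```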

elrnv (Contributor, Author) commented Jul 13, 2024

Would this also work from the Serde API?

Mingun (Collaborator) commented Jul 13, 2024

No, serde requires more work. The suggested approach leaves it to the user to decide where to stop reading binary data and return to reading XML events. In the serde API you would somehow have to tell the Reader these things. That is a subject for design. I do not plan to design such an API in the near term, so you can try to do that yourself.

ssokolow commented Aug 1, 2024

Just to get some clarification: am I correct in understanding that this would also allow proper handling of embedded DSLs or de facto embedded DSLs (e.g. HTML's <textarea>, <script>), as long as the apparent lack of a way to push back/un-read data is compensated for by disabling check_end_names and matching the closing tag as if it were part of the "binary data", with the "binary data" designed to be self-terminating?

(I know that, given a restricted HTML subset that lacks those "embedded DSL elements" but contains self-closing tags like <br>, such as a Pidgin chat log, older versions of quick-xml can parse it an order of magnitude more quickly than html5ever by turning off paired-tag enforcement and enforcing the expected well-formedness rules manually.)

Mingun (Collaborator) commented Aug 1, 2024

Yes, you understand correctly. You don't even need to disable check_end_names, because when you read from .stream() you effectively hide some data from the XML parser. As long as the content that is not cut out represents valid XML with proper nesting, everything will work smoothly.

ssokolow commented Aug 1, 2024

Yes, but unless the cut content is length-prefixed, which embedded DSLs generally aren't, there'd need to be some way to "rewind" the stream to put the closing tag back into what the XML parser sees after the DSL parser has seen it. Hence disabling check_end_names.

Mingun (Collaborator) commented Aug 1, 2024

While that may be needed in some cases, I think that usually you shouldn't do that. .stream() returns an impl BufRead, so you can look ahead without consuming data. After all, it wraps the type that you pass to the from_reader method, so you can pass a type which is able to buffer an arbitrary amount of data without consuming it. (A small look-ahead sketch follows.)
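A small illustration of that kind of look-ahead using plain std::io::BufRead (a &[u8] slice stands in here for the impl BufRead returned by .stream()); fill_buf only shows what is currently buffered, which is why a reader type that can buffer enough data matters:

```rust
use std::io::BufRead;

fn main() -> std::io::Result<()> {
    // Stand-in for the stream underneath the XML reader.
    let mut stream: &[u8] = b"if (a < b) { x(); }</script><p/>";

    // fill_buf() exposes buffered data without consuming it.
    let peeked = stream.fill_buf()?;

    // Let the embedded-DSL parser decide where its content ends...
    let dsl_len = peeked
        .windows(b"</script>".len())
        .position(|w| w == b"</script>")
        .unwrap_or(peeked.len());

    // ...and consume only that much. The closing tag is still in the
    // stream for the XML parser, so no rewind is necessary.
    stream.consume(dsl_len);
    assert!(stream.starts_with(b"</script>"));
    Ok(())
}
```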

ssokolow commented Aug 3, 2024

Huh. You're right... and it looks like I'm not the only person who was confused by the naming and documentation of fill_buf() into thinking that BufRead had no peek().

(Sorry for the delayed response. It took me a couple of days to realize that I was remembering a response I'd intended to send instead of one I'd actually sent.)
