Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quick-xml captures UTF BOM as Event::Text #191

Closed
TakaakiFuruse opened this issue Jan 25, 2020 · 7 comments · Fixed by #399
Closed

quick-xml captures UTF BOM as Event::Text #191

TakaakiFuruse opened this issue Jan 25, 2020 · 7 comments · Fixed by #399
Labels
encoding Issues related to support of various encodings of the XML documents

Comments

@TakaakiFuruse
Copy link
Contributor

TakaakiFuruse commented Jan 25, 2020

Hi, thank you for creating a cool crate, I'm using your crate in my project!

Problem

It looks quick-xml captures UTF BOM as an Event::Text.

Normally, If you parse following XML with the Reader code,

<tag1 att1 = "test">HELLO</tag1>

you would capture

  1. "test" as Event::Start,
  2. "HELLO" as Event::Text

However, if BOM was embedded in the file, the Reader code captures

  1. "test" as Event::Start,
  2. "\u{feff}" as Event::Text and
  3. "HELLO" as Event::Text

The second one should be BOM code.

I think it's better not to capture such code and I happy to make a pull request if you think that should be fixed.

(Please also have a look at my branch, I have added a test to parse an XML with BOM. It might be able to reproduce the problem.)

@tafia
Copy link
Owner

tafia commented Feb 1, 2020

Thanks for the issue and the PR.

Out of curiosity, do you have 2 events or just one with the bom unicode before the hello?

At first look, I believe we should not hide the bom when working with events (they are just bytes), like what utf8 advises (quoting wikipedia The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.).

What should be done instead is probably work on the decoder part and ignore the bom there only.

Also most of the time I believe the bom is at the beginning of the file so checking for it every time is probably too much, this should be optional (like unescape_and_decode_without_bom)

@TakaakiFuruse
Copy link
Contributor Author

Thanks for your response.

Out of curiosity, do you have 2 events or just one with the bom unicode before the hello?

I just tried to make some code to make sure what's going on. If you try following code with Rust 1.41 and quick-xml 0.17.2...

use quick_xml::events::Event;
use quick_xml::Reader;

fn main() {
    let xml: &str = std::str::from_utf8(b"\xEF\xBB\xBF<tag1 att1 = \"test\">HELLO</tag1>").unwrap();

    let mut reader = Reader::from_str(&xml);
    reader.trim_text(true);

    let mut txt = Vec::new();
    let mut buf = Vec::new();

    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Start(ref e)) => match e.name() {
                b"tag1" => println!(
                    "Event Start: {:?}",
                    e.attributes().map(|a| a.unwrap().value).collect::<Vec<_>>()
                ),
                _ => (),
            },
            Ok(Event::Text(e)) => txt.push(println!(
                "TextEvent: {:?}",
                e.unescape_and_decode(&reader).unwrap()
            )),
            Ok(Event::Eof) => break,
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            _ => (),
        }
        buf.clear();
    }
}

Shows this.

❯ cargo run
   Compiling quickxml_test v0.1.0 (/home/xxxx/quickxml_test)
    Finished dev [unoptimized + debuginfo] target(s) in 0.25s
     Running `/home/xxxx/quickxml_test/target/debug/quickxml_test`
TextEvent: "\u{feff}"
Event Start: [[116, 101, 115, 116]]
TextEvent: "HELLO"

It was TextEvent with Bom -> EventStart -> TextEvent with Hello.
(I first post was wrong. Sorry about that.)

I believe we should not hide the bom when working with events (they are just bytes)

I agree with that. Some people might need to deal with bom.

this should be optional (like unescape_and_decode_without_bom)

As for the solution, except my current implementation, I tried to come up with several ideas.
How do you think?

1. Implement unescape_and_decode_without_bom

Good things are

  • Developers who don't want BOM can ignore bom.

Bad things are

  • It looks there's 11 BOMs, we need probably to cover all someday.
  • If we return empty string when we had BOM, it also make confusion.
    • If I tried to create {"tag1": "HELLO"} Hashmap from the HELLO example, I would be annoyed to have {"tag1": "\u{feff}"} Hashmap. If we implement this solution, we would have {"tag1": ""} Hashmap instead.

2. Create another event

Create another event specially tailored for BOM, say Event::Bom.

Good things are

  • Very simple and clear, developers would need to add another match branch for dealing with BOM
  • No more confusion for dealing with tag name and text values like creating {"tag1": "HELLO"} Hashmap.

Bad things are

  • Probably an overkill solution.
  • Might need more coding and testing than the previous solution.

3. Instruct developers to use other crate to find out BOM

Add few lines on tutorial part of documentation and shows how to deal with bom with other crate.
There might be good crate to check BOM like unicord-bom

Good things are

  • Zero coding solution, easy.

Bad things are

  • It will create another dependency.

@tafia
Copy link
Owner

tafia commented Feb 3, 2020

Thanks for trying it again. It is more in line with what I would expect.

I think I prefer solution 1 over solution 2 because the first one deals with bom in attribute values.

I am not sure I understand the bad thing from the hello example. We could definitely have yet another *_with_bom variant whenever we try to decode (be it text event or attribute values).

Also I do think that we should implement all the bom variants because that is the whole point of supporting bom. It doesn't look too complicated.

@TakaakiFuruse
Copy link
Contributor Author

Let me fix my pull request this weekend. I would implement *_with_bom method.

@TakaakiFuruse
Copy link
Contributor Author

@tafia
I have made a pull request the other day with some comments.
#192

Please have a look.

@Mingun Mingun added the encoding Issues related to support of various encodings of the XML documents label May 21, 2022
Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 19, 2022
Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 19, 2022
Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 20, 2022
Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 20, 2022
@dralley
Copy link
Collaborator

dralley commented Aug 17, 2022

After thinking through this problem a bit more I think we should take a different approach. The issue is that you likely want to decide on the writer side whether or not to include a BOM and doing so by switching on events writer side is not ideal. e.g. you may want to ensure it gets written without writing it more than once by accident.

The approach I came up with is to just strip all data before the first XML event by default (apart from the BOM such data is almost never valid). If you want to retain this data, it can be switched off, in which case the BOM will be returned in the text event.

@Mingun
Copy link
Collaborator

Mingun commented Aug 17, 2022

BOM definitely should be written by the Writer transparently if corresponding encoding allows it and mode with BOM writing will be activated. I agree, that we shouldn't allow to write BOM by writing some event. That has a disadvantage, although: you will not able to get events from reader, inspect them and write them to the writer in a loop and get the same document. Unfortunately, you will have to choose -- either an event for BOM (and the ability to write it twice) or BOM safety, but complication of the inspect-and-pass loop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
encoding Issues related to support of various encodings of the XML documents
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants