quick-xml captures UTF BOM as Event::Text #191

TakaakiFuruse · 2020-01-25T13:19:27Z

Hi, thank you for creating a cool crate, I'm using your crate in my project!

Problem

It looks quick-xml captures UTF BOM as an Event::Text.

Normally, If you parse following XML with the Reader code,

<tag1 att1 = "test">HELLO</tag1>

you would capture

"test" as Event::Start,
"HELLO" as Event::Text

However, if BOM was embedded in the file, the Reader code captures

"test" as Event::Start,
"\u{feff}" as Event::Text and
"HELLO" as Event::Text

The second one should be BOM code.

I think it's better not to capture such code and I happy to make a pull request if you think that should be fixed.

(Please also have a look at my branch, I have added a test to parse an XML with BOM. It might be able to reproduce the problem.)

tafia · 2020-02-01T14:24:14Z

Thanks for the issue and the PR.

Out of curiosity, do you have 2 events or just one with the bom unicode before the hello?

At first look, I believe we should not hide the bom when working with events (they are just bytes), like what utf8 advises (quoting wikipedia The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.).

What should be done instead is probably work on the decoder part and ignore the bom there only.

Also most of the time I believe the bom is at the beginning of the file so checking for it every time is probably too much, this should be optional (like unescape_and_decode_without_bom)

TakaakiFuruse · 2020-02-03T03:49:41Z

Thanks for your response.

Out of curiosity, do you have 2 events or just one with the bom unicode before the hello?

I just tried to make some code to make sure what's going on. If you try following code with Rust 1.41 and quick-xml 0.17.2...

use quick_xml::events::Event;
use quick_xml::Reader;

fn main() {
    let xml: &str = std::str::from_utf8(b"\xEF\xBB\xBF<tag1 att1 = \"test\">HELLO</tag1>").unwrap();

    let mut reader = Reader::from_str(&xml);
    reader.trim_text(true);

    let mut txt = Vec::new();
    let mut buf = Vec::new();

    loop {
        match reader.read_event(&mut buf) {
            Ok(Event::Start(ref e)) => match e.name() {
                b"tag1" => println!(
                    "Event Start: {:?}",
                    e.attributes().map(|a| a.unwrap().value).collect::<Vec<_>>()
                ),
                _ => (),
            },
            Ok(Event::Text(e)) => txt.push(println!(
                "TextEvent: {:?}",
                e.unescape_and_decode(&reader).unwrap()
            )),
            Ok(Event::Eof) => break,
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            _ => (),
        }
        buf.clear();
    }
}

Shows this.

❯ cargo run
   Compiling quickxml_test v0.1.0 (/home/xxxx/quickxml_test)
    Finished dev [unoptimized + debuginfo] target(s) in 0.25s
     Running `/home/xxxx/quickxml_test/target/debug/quickxml_test`
TextEvent: "\u{feff}"
Event Start: [[116, 101, 115, 116]]
TextEvent: "HELLO"

It was TextEvent with Bom -> EventStart -> TextEvent with Hello.
(I first post was wrong. Sorry about that.)

I believe we should not hide the bom when working with events (they are just bytes)

I agree with that. Some people might need to deal with bom.

this should be optional (like unescape_and_decode_without_bom)

As for the solution, except my current implementation, I tried to come up with several ideas.
How do you think?

1. Implement unescape_and_decode_without_bom

Good things are

Developers who don't want BOM can ignore bom.

Bad things are

It looks there's 11 BOMs, we need probably to cover all someday.
If we return empty string when we had BOM, it also make confusion.
- If I tried to create {"tag1": "HELLO"} Hashmap from the HELLO example, I would be annoyed to have {"tag1": "\u{feff}"} Hashmap. If we implement this solution, we would have {"tag1": ""} Hashmap instead.

2. Create another event

Create another event specially tailored for BOM, say Event::Bom.

Good things are

Very simple and clear, developers would need to add another match branch for dealing with BOM
No more confusion for dealing with tag name and text values like creating {"tag1": "HELLO"} Hashmap.

Bad things are

Probably an overkill solution.
Might need more coding and testing than the previous solution.

3. Instruct developers to use other crate to find out BOM

Add few lines on tutorial part of documentation and shows how to deal with bom with other crate.
There might be good crate to check BOM like unicord-bom

Good things are

Zero coding solution, easy.

Bad things are

It will create another dependency.

tafia · 2020-02-03T04:38:39Z

Thanks for trying it again. It is more in line with what I would expect.

I think I prefer solution 1 over solution 2 because the first one deals with bom in attribute values.

I am not sure I understand the bad thing from the hello example. We could definitely have yet another *_with_bom variant whenever we try to decode (be it text event or attribute values).

Also I do think that we should implement all the bom variants because that is the whole point of supporting bom. It doesn't look too complicated.

TakaakiFuruse · 2020-02-05T00:40:14Z

Let me fix my pull request this weekend. I would implement *_with_bom method.

TakaakiFuruse · 2020-02-27T00:23:42Z

@tafia
I have made a pull request the other day with some comments.
#192

Please have a look.

…irst XML element

dralley · 2022-08-17T01:27:57Z

After thinking through this problem a bit more I think we should take a different approach. The issue is that you likely want to decide on the writer side whether or not to include a BOM and doing so by switching on events writer side is not ideal. e.g. you may want to ensure it gets written without writing it more than once by accident.

The approach I came up with is to just strip all data before the first XML event by default (apart from the BOM such data is almost never valid). If you want to retain this data, it can be switched off, in which case the BOM will be returned in the text event.

Mingun · 2022-08-17T16:41:27Z

BOM definitely should be written by the Writer transparently if corresponding encoding allows it and mode with BOM writing will be activated. I agree, that we shouldn't allow to write BOM by writing some event. That has a disadvantage, although: you will not able to get events from reader, inspect them and write them to the writer in a loop and get the same document. Unfortunately, you will have to choose -- either an event for BOM (and the ability to write it twice) or BOM safety, but complication of the inspect-and-pass loop

TakaakiFuruse mentioned this issue Jan 26, 2020

Added procedure to ignore BOM #192

Merged

Mingun added the encoding Issues related to support of various encodings of the XML documents label May 21, 2022

Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 19, 2022

tafia#191: Add new event StartText which contains text before the f…

727909f

…irst XML element

Mingun mentioned this issue Jun 19, 2022

Rework BOM handling and encoding API #399

Merged

Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 19, 2022

tafia#191: Add new event StartText which contains text before the f…

c93d247

…irst XML element

Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 20, 2022

tafia#191: Add new event StartText which contains text before the f…

0076942

…irst XML element

Mingun added a commit to Mingun/quick-xml that referenced this issue Jun 20, 2022

tafia#191: Add new event StartText which contains text before the f…

9f1e655

…irst XML element

Mingun closed this as completed in #399 Jun 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

quick-xml captures UTF BOM as Event::Text #191

quick-xml captures UTF BOM as Event::Text #191

TakaakiFuruse commented Jan 25, 2020 •

edited by Mingun

Loading

tafia commented Feb 1, 2020 •

edited

Loading

TakaakiFuruse commented Feb 3, 2020

tafia commented Feb 3, 2020

TakaakiFuruse commented Feb 5, 2020

TakaakiFuruse commented Feb 27, 2020

dralley commented Aug 17, 2022

Mingun commented Aug 17, 2022

quick-xml captures UTF BOM as Event::Text #191

quick-xml captures UTF BOM as Event::Text #191

Comments

TakaakiFuruse commented Jan 25, 2020 • edited by Mingun Loading

Problem

tafia commented Feb 1, 2020 • edited Loading

TakaakiFuruse commented Feb 3, 2020

1. Implement unescape_and_decode_without_bom

2. Create another event

3. Instruct developers to use other crate to find out BOM

tafia commented Feb 3, 2020

TakaakiFuruse commented Feb 5, 2020

TakaakiFuruse commented Feb 27, 2020

dralley commented Aug 17, 2022

Mingun commented Aug 17, 2022

TakaakiFuruse commented Jan 25, 2020 •

edited by Mingun

Loading

tafia commented Feb 1, 2020 •

edited

Loading