Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow � character references #750

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Mingun
Copy link
Collaborator

@Mingun Mingun commented Jun 1, 2024

We should either restrict all invalid characters both in literal form and as character references, or none of them. Disallowing only the one character is inconsistently. Because checking literal forms means that we should decode and check all the input, this will influence performance. We are not ready to get that performance lost for now. Users of the Reader API could do their own checks themselves.

Ironically, this is what firstly was proposed in #496. After trying to finish that PR I found that we have some unanswered questions (above) which we should work out before we will be create a consistent solution. I think, it will be tight linked with #749.

I think, the best what we can do now is to not check validity of any characters and allow users to that themselves.

cc @sashka

We should either restrict all invalid characters both in literal form and
as character references, or none of them. Disallowing only the one character
is inconsistently. Because checking literal forms means that we should decode
and check all the input, this will influence performance. We are not ready to
get that performance lost for now. Users of the Reader API could do their own
checks themselves.
@Mingun Mingun requested a review from dralley June 1, 2024 15:52
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.67%. Comparing base (5d76174) to head (4522d9d).
Report is 23 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #750      +/-   ##
==========================================
+ Coverage   61.24%   61.67%   +0.42%     
==========================================
  Files          39       39              
  Lines       16277    16626     +349     
==========================================
+ Hits         9969    10254     +285     
- Misses       6308     6372      +64     
Flag Coverage Δ
unittests 61.67% <100.00%> (+0.42%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dralley
Copy link
Collaborator

dralley commented Jun 4, 2024

We should either restrict all invalid characters both in literal form and as character references, or none of them. Disallowing only the one character is inconsistently.

Some characters (like control characters, at least in XML 1.1), are legal when escaped but illegal otherwise, while others (like NUL) are illegal in both cases.

While we definitely are not 100% conformant with the standard, I disagree that this particular "inconsistency" is necessarily wrong. As far as I can tell, it's one we're going to have even once we do conform, because the XML standard doesn't treat all invalid characters as equally invalid.

@Mingun
Copy link
Collaborator Author

Mingun commented Jun 4, 2024

Some characters (like control characters, at least in XML 1.1), are legal when escaped but illegal otherwise, while others (like NUL) are illegal in both cases.

Yes, it is. My point is that:

  • we agreed in Allow &#0; values #496 that this checks should be optional although disabling them will create non-conforming parser;
  • to create good solution some investigation required;
  • checking NUL only when resolving character references is anyway does only half of work, we still need to check literal NULs, but this will impact performance; so this should be done carefully and maybe even as an explicit preprocess step from the user side;
  • the restriction is anyway seems artificial with roots in С null-terminated strings;
  • for some of our users this check is undesirable.

So I think that removing this check should not make things worse and I propose to accept a short-term solution (yes, I know what you thought now) of removing this check and add it later with understanding and formal tests (which we should get after resolving #749).

@dralley
Copy link
Collaborator

dralley commented Jun 4, 2024

the restriction is anyway seems artificial with roots in С null-terminated strings; for some of our users this check is undesirable.

I'm still not certain that disabling them will help this user. I have a vague recollection (though I cannot find where it was written) that whatever weird nonstandard format they were using was dumping raw binary data into the space between tags (i.e. text event in our parlance) and using an attribute to denote the length so that the parser could entirely skip over that region.

I could be mixing it up with another issue though.

@Mingun
Copy link
Collaborator Author

Mingun commented Jun 4, 2024

You are talking about #623, I meant this and this comments. With this change we can make life for some people easier untilw we implement proper solution that will check all according to the standard.

@dralley
Copy link
Collaborator

dralley commented Jun 5, 2024

You are talking about #623

Ah, correct.

Copy link
Collaborator

@dralley dralley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have mixed feelings but I'm OK with going forwards with this if @sashka confirms that it would be helpful to them

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants