Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fixes for XML Encoding and Decoding #288

Merged
merged 9 commits into from
Nov 24, 2016

Conversation

johnjaylward
Copy link
Contributor

@johnjaylward johnjaylward commented Sep 22, 2016

This pull request is in reference to issue #285. It modifies how the XML parser handles string values and also updates the escape function to properly encode control characters using XML entities. Better reversibility is now available that entities are decoded on read. Previously they were encoded but not decoded.

What problem does this code solve?
Fixes bug where control characters were being placed in XML as unescaped values. Also unescapes encoded values when reading XML into a JSONObject to support full reversibility of XML -> JSON -> XML.

Risks
Control Character Changes
Low risk bug fix. No users should be expecting control characters output as unicode characters. They will now see ISO Control characters encoded using the &#xXXXX; notation as defined by the running JVMs Character.isISOControl(int) method.

Reversibility Changes
Low risk. Full reversibility is not offered with the XML class but improvements have been made in recent releases for Arrays (#170) and other minor items. Full reversibility is supposed to be offered through there JSONML class as it uses a normalized XML structure.

HTML Entities are still not supported and will still break reversibility in both XML and JSONML classes, however regular XML entities are now fully supported and are properly converted between the two object types. Most XML parsers will fail if they see an HTML specific entity when run in strict mode, so this is not a major concern here.

Example of broken HTML entities (not valid XML and should break most parsers):

<?xml version="1.0" encoding="UTF-8"?>
<root><amount>10,00 &#x20ac;</amount><description>A&ccedil;&atilde;o V&aacute;lida</description></root>

This would turn into JSON that looks like this:

{"root":{"amount":"10,00 €","description":"A&ccedil;&atilde;o V&aacute;lida"}}

If the JSON is converted back to XML, proper XML would be generated like so:

<?xml version="1.0" encoding="UTF-8"?>
<root><amount>10,00 €</amount><description>A&amp;ccedil;&amp;atilde;o V&amp;aacute;lida</description></root>

The resulting XML does not match the original. This is probably the desired behavior, but should be noted. This has not changed in this pull request. This is current behaviour that is being noted here.

The parsing that has changed is in the case of valid XML entities. Now when valid XML entities are encountered in an XML document, they will be decoded upon conversion to JSON. The JSON -> XML conversion will still encode the entities (< > ' " &) as well as now encoding control characters. The XML -> JSON -> XML -> JSON will have the proper values over multiple iterations.

Changes to the API?
Yes, I removed the deprecation that I previously added to XML.stringToValue. I also added a new public static method to the XML class called unescape to complement the existing public static method escape.

Will this require a new release?
Can be rolled into the next release, pending comments.

Should the documentation be updated?
No documentation updates required.

Does it break the unit tests?
No, but new unit tests have been added in a separate pull request.

* Removes deprecation on XML.stringToValue(). It now provides unescaping for strings to convert XML entities back into values.
* New unescape function to handle XML entities -> value conversion.
@johnjaylward
Copy link
Contributor Author

johnjaylward commented Sep 22, 2016

@eduveks. Please try this pull request. The implementation does not encode all unicode values. In your example, only the Euro sign would be escaped. Please ensure that you place the following at the top of your generated XML before transporting to another system:

<?xml version=\"1.0\" encoding=\"UTF-8\"?>

So for your example in #285, the xml sent to another system would look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<root><amount>10,00 &#x20ac;</amount><description>Ação Válida</description></root>

At this time, I'm not willing to encode more than what is in this PR as it now matches what is in the JSONObject.quote method.

Also, looking at XML specs, as long as you specify the encoding at the top as UTF-8, all readers should properly read the unicode characters.

@stleary
Copy link
Owner

stleary commented Sep 22, 2016

@johnjaylward Thanks for the quick response. What would it take to implement this in such a way that existing applications are not changed at all?

@johnjaylward
Copy link
Contributor Author

johnjaylward commented Sep 22, 2016

@stleary most applications should not be affected. This is really only performing the entity encode on the same character space that JSONObject.quote manages. That mostly encodes just control characters and other non-printables. For some reason the Euro sign falls into that region, but most characters will not. It's a total of around 300 characters out of the entire unicode spec.

@stleary
Copy link
Owner

stleary commented Sep 22, 2016

No applications affected would be preferable. Going forward, changes to existing behavior will only be accepted in the case of bugs where the existing behavior is clearly worse than the fix. I called this out in the wiki, but probably did not phrase or emphasize it enough. Fixed now.

@johnjaylward
Copy link
Contributor Author

If you want I can replace the check in the XML to only escape ISO Control characters:
https://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#isISOControl(int)

That is a much tighter range, and those really should be escaped. Do you think that range is acceptable?

…he encoding to only encode control characters as defined by ISO standard.
@johnjaylward
Copy link
Contributor Author

@stleary the latest push has the check for ISO control characters only. No applications should be affected for normal characters now. I will update the PR description accordingly.

@johnjaylward johnjaylward changed the title Implements unicode escaping similar to JSONObject. Bug fixes for XML Encoding and Decoding Sep 22, 2016
@stleary
Copy link
Owner

stleary commented Sep 22, 2016

I think we need to try to do this in such a way that there is no risk of changing existing applications. Or are you saying that this is a larger problem, and JSON-Java does not handle ISO control chars according to the spec?

@johnjaylward
Copy link
Contributor Author

Correct. Control characters are technically not allowed in XML. Most parsers will allow things like \n (new line) but it's supposed to be encoded.

@johnjaylward
Copy link
Contributor Author

I fixed the wording in the description. I incorrectly noted that the HTM LEntities reversibility was a change in this PR, but it is not. The handling of HTMLEntities (or any unrecognized named entity) is the same now as on master.

The change is the handling of valid XML entities when encountered in an XML document and parsing to JSON.

@stleary
Copy link
Owner

stleary commented Sep 23, 2016

What would it take to make these changes optional? In other words, the developer has to opt-in in order to get the new behavior?

@eduveks
Copy link

eduveks commented Sep 23, 2016

Good news!

@johnjaylward, to work correctly I was forced to change your code:

if (Character.isISOControl(cp)) ...

Do not works properly because the latin character with accents is not detected as special.

But if you use:

if (!asciiEncoder.canEncode((char)cp)) ...

Works very well!

Before the for I declared the asciiEncoder:

CharsetEncoder asciiEncoder = Charset.forName("US-ASCII").newEncoder();
for (final int cp : codePointIterator(string)) {
...
}

By this way in the XMLTest.testJsonToXmlEscape now the result of XML.toString(json) will be:

<amount>10,00 &#x20ac;</amount><description>A&#xe7;&#xe3;o V&#xe1;lida&#x85;</description><xmlEntities>&quot; &apos; &amp; &lt; &gt;</xmlEntities>

It's is perfect because ç, ã, á, ... will be now very well encoded!

With this change I already tested in my services and works like a charm.

Right?

@johnjaylward
Copy link
Contributor Author

johnjaylward commented Sep 23, 2016

@eduveks
If that works for you feel free to fork it and maintain that change. I can't see that ever being placed in this project since unicode is supposed to be supported by the XML specification and shouldn't need to be encoded to entities. I don't see a reason to encode all non-ascii. I believe the real issue is somewhere in your processing path, one of the processors is not properly respecting your header encoding.

2 things to note with your modification to my pull request:

  • It will fail for character points that are surrogate pairs. this is because a java char is a single byte, while a code point is an integer. You will end up encoding 2 characters in some cases instead of a single value.
  • You are probably not encoding control characters anymore. \u+0000 or \u+0008 look like they will be output raw instead of encoded as &#x0; or &#x8;

@stleary I could look into either adding another flag parameter, or creating a new method. However, I feel that both changes in this PR are bug fixes, not features. When decoding XML, it should be the correct and proper behaviour to also decode the escape entities. Then when output somewhere else it should be the callers responsibility to re-encode it to whatever format they need. By leaving encoded entities like < or & it could lead to un-noticed bugs elsewhere.

I think I know where you are coming from, that people using this feature probably have a decoding call themselves somewhere to make up for this bug. But I just can't see that as a good reason to leave this bug in there.

@johnjaylward
Copy link
Contributor Author

@stleary I'm going through the XML specification and want to clean this up to better match the XML specification. The 3 sections I'm looking at in particular are:

I'd like to change the processing to better match the specs in these areas. My proposed changes include:

  • Support section 2.2: A new method called mustEncode which will return true for a code point that is outside the range defined. False will be returned for any code point that is fully supported by the XML specification. This method would replace the call to Character.isISOControl that is currently in this pull request.
  • Support section 2.10: No changes. We will always preserve whitespace. Preserved whitespace will be our "default". This should probably just be noted in the documentation (javadoc and README)
  • Support section 2.11: In the XMLTokener(String) constructor, add code to replace all new line sequences containing a #xD#xA or just #xD with the with a single #xA as defined in the spec before processing begins. This will likely be a new private static final method in the XMLTokener class that performs this before passing the value to the super constructor.

I think the changes for section 2.2 are important, but I'm not sure on 2.11. The spec says it has to perform "as if it normalized all line breaks". However, the note in section 2.3 states that it expectes the processor to remove all #xD characters. This not only seems unclear, but our parser currently doesn't seem to have any issues with new lines; I'm not sure we need any changes there. Maybe some new test cases are in order that have mixed new line characters?

@stleary
Copy link
Owner

stleary commented Sep 29, 2016

Thanks, have not had time to review this yet. Suddenly got super-busy at work. Some quick thoughts:

  • Doc changes are always OK.
  • If we are unsure about a change, it should not be made, until/unless we are sure.
  • Even in the cases where the code does not match the spec, in most cases the default position will be to continue with the existing code. At first I thought there was some wiggle-room in the XML implementation because it was once a candidate for deletion. But we are seeing now that there are a number of developers who have been using it for some time. The ongoing risk is that helping one application may cause another application to break. On the other hand, opt-in changes that only take effect when explicitly enabled are a lot easier to accept.

@stleary
Copy link
Owner

stleary commented Nov 19, 2016

Recommend accepting this change, and starting the 3 day timer. After merge, new Maven releases will be put on hold for several months, in case any problems surface.

@stleary stleary merged commit 413bb53 into stleary:master Nov 24, 2016
@johnjaylward johnjaylward deleted the XmlEscape branch February 10, 2017 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants