SPARQL result parsing does not work with lxml #1847

Closed
aucampia opened this issue Apr 18, 2022 · 1 comment · Fixed by #2044

@aucampia
Member

Thanks to @edmondchuc for reporting this. It came up while he was working on integrating pyld in #1836: pyld pulls in lxml, and with lxml installed some tests fail that otherwise pass:

https://github.com/RDFLib/rdflib/runs/6061070121?check_suite_focus=true#step:11:674

  =========================== short test summary info ============================
  FAILED test/test_sparql/test_result.py::test_select_result_serialize_parse[xml-None-utf-8]
  FAILED test/test_sparql/test_result.py::test_select_result_serialize_parse[xml-TEXT_IO-utf-8]
  FAILED test/test_sparql/test_result.py::test_select_result_serialize_parse[xml-BINARY_IO-utf-8]
  FAILED test/test_sparql/test_result.py::test_select_result_serialize_parse[xml-STR_PATH-utf-8]
  FAILED test/test_sparql/test_result.py::test_select_result_parse_serialized[xml-TEXT_IO-utf-8]
  = 5 failed, 5115 passed, 2 skipped, 115 xfailed, 54 warnings in 129.80s (0:02:09) =

Exceptions all occur inside XMLResultParser.parse:

__________________________________________________________ test_select_result_serialize_parse[xml-BINARY_IO-utf-8] __________________________________________________________
Traceback (most recent call last):
  File "/home/iwana/sw/d/github.com/iafork/rdflib.cleanish/test/test_sparql/test_result.py", line 318, in test_select_result_serialize_parse
    check_serialized(format.name, select_result, serialized_data)
  File "/home/iwana/sw/d/github.com/iafork/rdflib.cleanish/test/test_sparql/test_result.py", line 110, in check_serialized
    parsed_result = Result.parse(StringIO(data), format=format)
  File "/home/iwana/sw/d/github.com/iafork/rdflib.cleanish/rdflib/query.py", line 215, in parse
    return parser.parse(source, content_type=content_type, **kwargs)
  File "/home/iwana/sw/d/github.com/iafork/rdflib.cleanish/rdflib/plugins/sparql/results/xmlresults.py", line 31, in parse
    return XMLResult(source)
  File "/home/iwana/sw/d/github.com/iafork/rdflib.cleanish/rdflib/plugins/sparql/results/xmlresults.py", line 40, in __init__
    tree = etree.parse(source, parser)
  File "src/lxml/etree.pyx", line 3536, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1893, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1908, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
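
The failure above can probably be reproduced outside the test suite. The following is a minimal sketch, assuming lxml is installed; the names `XML_TEXT` and `parse_error_for_text_input` are made up for illustration and are not part of rdflib. It passes a `StringIO` holding an XML document with an encoding declaration to `lxml.etree.parse`, as `XMLResult.__init__` does in the traceback.

```python
from io import StringIO

try:
    from lxml import etree  # third-party; needed to see the failure
except ImportError:
    etree = None

# Hypothetical minimal document: a str that carries an encoding declaration.
XML_TEXT = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<sparql xmlns="http://www.w3.org/2005/sparql-results#"/>'
)


def parse_error_for_text_input():
    """Return the message lxml raises for text input, or None if nothing is raised."""
    if etree is None:
        return None  # lxml is not installed, so there is nothing to reproduce
    try:
        etree.parse(StringIO(XML_TEXT))
    except ValueError as error:
        return str(error)
    return None


if __name__ == "__main__":
    print(parse_error_for_text_input())
```

Encoding the same text to `bytes` and wrapping it in `BytesIO` is expected to avoid the error, since the message itself asks for bytes input.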
@aucampia aucampia added the bug Something isn't working label Apr 18, 2022
@aucampia
Member Author

This may also affect RDF/XML parsing. We should run some of the test jobs on GitHub Actions with lxml installed to ensure everything actually works with it.

@aucampia aucampia self-assigned this Jul 20, 2022
aucampia added a commit that referenced this issue Jul 26, 2022
Fixed the following problems with the SPARQL XML result parsing:

- The `parse` functions of both the `lxml` and `xml` modules do not
  work well with `TextIO` objects: the `xml` module works with `TextIO`
  objects if the XML encoding is `utf-8` but not if it is `utf-16`, and
  with `lxml` parsing fails for both `utf-8` and `utf-16`. To fix this I
  changed the XML result parser to first convert `TextIO` to `bytes` and
  then feed the `bytes` to `parse()` using `BytesIO`.
- The parser was operating on all elements inside `results` and
  `result` elements, even if those elements were not `result` and `binding`
  elements respectively. This was causing problems with `lxml`, as `lxml`
  also returns comments when iterating over elements. To fix this I added
  a check for the element tags so that only the correct elements are
  considered.
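
The two fixes above can be sketched roughly as follows. This is an illustration only, not the code from the commit: it uses the standard library's `xml.etree.ElementTree` so that it runs without `lxml`, and the function name `parse_bindings` and the choice to re-encode text as UTF-8 with an explicit parser encoding are assumptions.

```python
from io import BytesIO, StringIO, TextIOBase
import xml.etree.ElementTree as ET

SPARQL_NS = "http://www.w3.org/2005/sparql-results#"
RESULTS_TAG = f"{{{SPARQL_NS}}}results"
RESULT_TAG = f"{{{SPARQL_NS}}}result"
BINDING_TAG = f"{{{SPARQL_NS}}}binding"


def parse_bindings(source):
    """Parse SPARQL XML results from a text or binary stream into a list of dicts."""
    if isinstance(source, TextIOBase):
        # Fix 1: convert text input to bytes and parse it through BytesIO.
        # Assumption: the text is re-encoded as UTF-8 and the parser is told
        # so explicitly, overriding any encoding named in the XML declaration.
        source = BytesIO(source.read().encode("utf-8"))
        parser = ET.XMLParser(encoding="utf-8")
    else:
        parser = ET.XMLParser()
    root = ET.parse(source, parser).getroot()

    rows = []
    results = root.find(RESULTS_TAG)
    if results is None:
        return rows
    for result in results:
        # Fix 2: only consider `result` elements; with lxml, iteration also
        # yields comments, whose tag is not a matching string.
        if result.tag != RESULT_TAG:
            continue
        row = {}
        for binding in result:
            if binding.tag != BINDING_TAG:
                continue
            # Simplification: keep only the text of the bound term.
            row[binding.get("name")] = "".join(binding.itertext()).strip()
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0" encoding="utf-8"?>'
        f'<sparql xmlns="{SPARQL_NS}"><head><variable name="s"/></head>'
        '<results><!-- a comment --><result><binding name="s">'
        "<uri>http://example.org/a</uri></binding></result></results></sparql>"
    )
    print(parse_bindings(StringIO(sample)))
```

The tag checks make no difference with the standard library parser, which drops comments by default, but they are what keeps a loop like this from treating `lxml` comment nodes as results.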

Other changes:

- Added type hints to `rdflib.plugins.sparql.results.xmlresults`.
- Run with `lxml` on some permutations in the test matrix.
- Removed `rdflib.compat.etree`, as this was not very helpful for the
  SPARQL XML Result parser and it was not used elsewhere.
- Added an `lxml` environment to tox which installs `lxml` and
  `lxml-stubs`.
- Expanded SPARQL result testing by adding some additional
  parameters.

Related issues:

- Fixes #2035
- Fixes #1847