
Unicode string silently truncated

ebranca edited this page Jun 21, 2014 · 1 revision

Classification

  • Affected Components : codecs

  • Operating System : Linux

  • Python Versions : 2.6.x, 2.7.x

  • Reproducible : Yes

Source code

# -*- coding: utf-8 -*-
import codecs
import io
import sys

try:
    ascii
except NameError:
    ascii = repr

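# 0xF5 is never valid in UTF-8; the trailing 0xF4 starts a 4-byte sequence that is never completed.
b = b'\x41\xF5\x42\x43\xF4'
# Reference result: decode the whole byte string in one step.
print("Correct-String %r" % (ascii(b.decode('utf8', 'replace'))))

with open('temp.bin', 'wb') as fout:
    fout.write(b)

# TEST1: read the same bytes back through codecs.open.
with codecs.open('temp.bin', encoding='utf8', errors='replace') as fin:
    print("TEST1-String %r" % (ascii(fin.read())))

# TEST2: read the same bytes back through io.open.
with io.open('temp.bin', 'rt', encoding='utf8', errors='replace') as fin:
    print("TEST2-String %r" % (ascii(fin.read())))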

sys.exit(0)

Steps to Produce/Reproduce

To reproduce the problem, copy the source code into a file and execute the script with the following command:

$ python -OOBRtt test.py

Alternatively you can open python in interactive mode:

$ python -OOBRtt <press enter>

Then copy the lines of code into the interpreter.

Description

Execution of the test script produces the following results:

Correct-String "u'A\\ufffdBC\\ufffd'"
TEST1-String "u'A\\ufffdBC'"
TEST2-String "u'A\\ufffdBC\\ufffd'"

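The truncation originates in the codecs module. Its stream reader sees the byte F4, treats it as the lead byte of a four-byte UTF-8 sequence, and waits for the three continuation bytes that should follow. Those bytes never arrive, and the pending byte is not reported when the end of the file is reached, so the resulting string is silently truncated instead of ending with a replacement character.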
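The buffering behaviour described above can be observed directly with an incremental decoder. The following is a minimal sketch, not the code path used internally by `codecs.open`; it only illustrates how a trailing lead byte is held back until the decoder is told that the input is complete. The variable names are illustrative.

```python
import codecs

data = b'\x41\xF5\x42\x43\xF4'

# Build an incremental UTF-8 decoder that substitutes U+FFFD for bad bytes.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')

# final=False: the decoder assumes more input may follow, so the trailing
# 0xF4 (lead byte of a 4-byte sequence) is buffered, not decoded.
partial = decoder.decode(data, final=False)

# final=True: the decoder is told the input has ended, so the buffered
# byte is reported as a replacement character.
flushed = decoder.decode(b'', final=True)

print(ascii(partial))
print(ascii(flushed))
```

Concatenating `partial` and `flushed` gives the same string as the one-step reference decoding; dropping `flushed` gives the truncated TEST1 result.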

Source string used as reference:
Correct-String "u'A\\ufffdBC\\ufffd'"

How the string is printed if processed by the codecs module:
TEST1-String "u'A\\ufffdBC'"

A safer approach is to read the entire stream first and only then decode it, so that the decoder knows the input is complete. Reading the file through the io module gives the expected result.

How the string is printed if processed by the io module:
TEST2-String "u'A\\ufffdBC\\ufffd'"

Workaround

We are not aware of any easy solution other than avoiding 'codecs' for reading files in cases like the one examined.
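One way to avoid 'codecs' is to read the file as raw bytes and decode the whole buffer in a single call, which matches how the reference string is produced in the test script. The sketch below assumes the file fits in memory; `read_text_whole` is a hypothetical helper name, not part of any library.

```python
import io
import os
import tempfile


def read_text_whole(path, encoding='utf8', errors='replace'):
    """Hypothetical helper: read all bytes, then decode them in one step."""
    with io.open(path, 'rb') as fin:
        raw = fin.read()
    # Decoding the complete buffer means a trailing incomplete sequence
    # is handled by the error handler instead of being held back.
    return raw.decode(encoding, errors)


# Demonstration with the same bytes used in the test script.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as fout:
        fout.write(b'\x41\xF5\x42\x43\xF4')
    text = read_text_whole(path)
finally:
    os.remove(path)

print(ascii(text))
```

For large files, opening the file in text mode with `io.open`, as in TEST2 of the script, avoids loading everything at once.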

Secure Implementation

WORK IN PROGRESS

References

[Python module io](https://docs.python.org/2/library/io.html)

[Python module codecs](https://docs.python.org/2/library/codecs.html)

[Python bug 12508](http://bugs.python.org/issue12508)
