pickle does not work with unbuffered streams #93050

manueljacob · 2022-05-21T14:27:11Z

The documentation for pickle.load() says:

The argument file must have two methods, a read() method that takes an integer argument, and a readline() method that requires no arguments. Both methods should return bytes. Thus file can be an on-disk file opened for binary reading, an io.BytesIO object, or any other custom object that meets this interface.

However, the following code doesn’t work:

import pickle

large_bytes = b'x' * (1 << 31)

with open('test.pickle', 'wb') as w:
    pickle.dump(large_bytes, w)

with open('test.pickle', 'rb', 0) as r:
    assert pickle.load(r) == large_bytes

It fails with:

Traceback (most recent call last):
  File "test_pickle.py", line 9, in <module>
    assert pickle.load(r) == large_bytes
_pickle.UnpicklingError: pickle data was truncated

Contrary to the documentation, pickle.load() requires that the file’s read() method returns as many bytes as requested. This is the case for buffered binary streams unless the underlying raw stream is interactive (source). However, it is not the case for unbuffered binary streams if the operating system can’t read enough bytes at once. On my system this is the case for bytestrings longer than (1 << 31) - 4096 bytes. For pipes, the limit is 1 << 16 bytes on my system.
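The short-read behaviour can be reproduced without a 2 GiB file. The following is a minimal sketch, not part of the original report: `ShortReader` and `try_load` are hypothetical names, and the 64-byte limit is an arbitrary stand-in for whatever the operating system returns per call.

```python
import io
import pickle


class ShortReader:
    """Hypothetical file object whose read() returns at most `limit` bytes.

    It offers read(n) and readline(), both returning bytes, which is all
    the pickle.load() documentation quoted above asks for.
    """

    def __init__(self, data, limit):
        self._stream = io.BytesIO(data)
        self._limit = limit

    def read(self, n):
        # Like an unbuffered stream, return fewer bytes than requested
        # whenever the request exceeds the per-call limit.
        return self._stream.read(min(n, self._limit))

    def readline(self):
        return self._stream.readline()


def try_load(limit, size=1000):
    """Pickle `size` bytes, then load them back through a ShortReader."""
    payload = b"x" * size
    data = pickle.dumps(payload, protocol=4)
    try:
        ok = pickle.load(ShortReader(data, limit)) == payload
        return "ok" if ok else "mismatch"
    except (pickle.UnpicklingError, EOFError) as exc:
        # The exact exception may depend on the unpickler implementation.
        return type(exc).__name__


print(try_load(limit=1 << 20))  # per-call limit larger than the pickle
print(try_load(limit=64))       # per-call limit smaller than the pickle
```

With the small limit the load is expected to fail even though the data is complete, because the unpickler treats a short read() as truncation.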

pickle.dump() has a similar problem. Its documentation says:

The file argument must have a write() method that accepts a single bytes argument. It can thus be an on-disk file opened for binary writing, an io.BytesIO instance, or any other custom object that meets this interface.

The above code with an unbuffered writer and buffered reader results in the same exception.

If the bytestring is one byte longer than what the operating system can write at once, loading succeeds but returns a wrong result:

import pickle

large_bytes = b'x' * ((1 << 31) - 4095)

with open('test.pickle', 'wb', 0) as w:
    pickle.dump(large_bytes, w)

with open('test.pickle', 'rb') as r:
    assert pickle.load(r) == large_bytes

It fails with:

Traceback (most recent call last):
  File "test_pickle.py", line 9, in <module>
    assert pickle.load(r) == large_bytes
AssertionError

because the last byte of the unpickled bytestring is b'\x94' (MEMOIZE opcode).
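The corruption can be sketched at a smaller scale. This example is an illustration under assumptions, not code from the report: `ShortWriter` is a hypothetical stand-in for an unbuffered file, and the sketch relies on CPython passing a large bytes payload to write() in a call of its own, which is an implementation detail.

```python
import io
import pickle


class ShortWriter:
    """Hypothetical stream whose write() accepts at most `limit` bytes."""

    def __init__(self, limit):
        self.buffer = io.BytesIO()
        self._limit = limit

    def write(self, data):
        # Keep only the first `limit` bytes and silently drop the rest.
        chunk = bytes(memoryview(data)[: self._limit])
        self.buffer.write(chunk)
        return len(chunk)  # pickle does not look at this return value


payload = b"x" * (1 << 17)
writer = ShortWriter(limit=len(payload) - 1)  # one byte short of the payload
pickle.dump(payload, writer, protocol=4)

# The stored length field still claims len(payload) bytes, so the loader
# consumes the opcode that follows the truncated payload as its last byte.
result = pickle.loads(writer.buffer.getvalue())
print(len(result) == len(payload), result == payload, result[-1:])
```

The loaded object has the right length but differs from the original in its final byte, and no exception is raised at any point.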

marshal.load() / marshal.dump() have a similar problem. However, I couldn’t find an example like the previous one in which the data is silently corrupted, because marshal builds a buffer for the whole output and writes it to the stream in a single call. Also, marshal’s maximum supported bytes length is (1 << 31) - 1, so the above example has to be adapted.

Possible solutions

The documentation should match the actual requirements of the implementation. The documentation could be changed to mention the additional restrictions, or the implementation could be changed to call read() / write() multiple times if necessary.
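Until either change is made, the "call read() / write() multiple times" idea can be approximated in user code. The sketch below is one possible shape for it, not a proposed patch: `RetryingFile` and `ChunkedRaw` are hypothetical names, `ChunkedRaw` only simulates an unbuffered stream with a 64-byte per-call limit, and non-blocking streams are not handled.

```python
import io
import pickle


class ChunkedRaw:
    """Hypothetical stand-in for an unbuffered stream: each read() or
    write() call transfers at most `limit` bytes."""

    def __init__(self, limit):
        self.buffer = io.BytesIO()
        self._limit = limit

    def read(self, n):
        return self.buffer.read(min(n, self._limit))

    def readline(self):
        return self.buffer.readline()

    def write(self, data):
        chunk = bytes(data[: self._limit])
        self.buffer.write(chunk)
        return len(chunk)


class RetryingFile:
    """Hypothetical adapter that repeats read()/write() on the wrapped
    stream until the whole request is satisfied or EOF is reached."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, n):
        chunks = []
        while n > 0:
            chunk = self._raw.read(n)
            if not chunk:  # EOF (a non-blocking None is treated the same)
                break
            chunks.append(chunk)
            n -= len(chunk)
        return b"".join(chunks)

    def readline(self):
        return self._raw.readline()

    def write(self, data):
        view = memoryview(data).cast("B")
        written = 0
        while written < len(view):
            count = self._raw.write(view[written:])
            if not count:
                raise OSError("stream accepted no data")
            written += count
        return written


payload = b"x" * 100_000
raw = ChunkedRaw(limit=64)
pickle.dump(payload, RetryingFile(raw))
raw.buffer.seek(0)
restored = pickle.load(RetryingFile(raw))
print(restored == payload)
```

For regular files, opening with the default buffering avoids the need for such an adapter, since buffered binary streams already retry on the caller's behalf.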

If it is decided that the implementation should not call write() multiple times, I think that at least an exception should be raised to avoid silent data corruption.
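As a rough illustration of the "raise instead of corrupting" option, a wrapper can check the return value of write(). This is a sketch with hypothetical names (`StrictWriter`, `TruncatingRaw`, `dump_strict`); it assumes the wrapped write() reports the number of bytes written, as io.RawIOBase.write() does, and the choice of OSError is arbitrary.

```python
import io
import pickle


class StrictWriter:
    """Hypothetical wrapper that turns a short write into an exception."""

    def __init__(self, raw):
        self._raw = raw

    def write(self, data):
        expected = memoryview(data).nbytes
        written = self._raw.write(data)
        # Assumption: write() returns the byte count, or None if nothing
        # could be written (the io.RawIOBase convention).
        if written is None or written < expected:
            raise OSError(f"short write: {written} of {expected} bytes")
        return written


class TruncatingRaw:
    """Hypothetical stream that accepts at most `limit` bytes per call."""

    def __init__(self, limit):
        self.buffer = io.BytesIO()
        self._limit = limit

    def write(self, data):
        chunk = bytes(memoryview(data)[: self._limit])
        self.buffer.write(chunk)
        return len(chunk)


def dump_strict(obj, raw):
    """Return None on success, or the OSError raised by a short write."""
    try:
        pickle.dump(obj, StrictWriter(raw))
    except OSError as exc:
        return exc
    return None


print(dump_strict(b"x" * 1000, TruncatingRaw(limit=64)))
```

The dump then fails loudly at the first short write rather than leaving a file that loads to a wrong value.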

Environment

CPython versions tested on: 3.10.4
Operating system and architecture: Linux x86_64


manueljacob added the type-bug An unexpected behavior, bug, or error label May 21, 2022

