pickle does not work with unbuffered streams #93050

Open
manueljacob opened this issue May 21, 2022 · 0 comments
Labels
type-bug An unexpected behavior, bug, or error

The documentation for pickle.load() says:

The argument file must have two methods, a read() method that takes an integer argument, and a readline() method that requires no arguments. Both methods should return bytes. Thus file can be an on-disk file opened for binary reading, an io.BytesIO object, or any other custom object that meets this interface.

However, the following code doesn’t work:

import pickle

large_bytes = b'x' * (1 << 31)  # 2 GiB payload

with open('test.pickle', 'wb') as w:
    pickle.dump(large_bytes, w)

with open('test.pickle', 'rb', 0) as r:  # buffering=0: unbuffered reader
    assert pickle.load(r) == large_bytes

It fails with:

Traceback (most recent call last):
  File "test_pickle.py", line 9, in <module>
    assert pickle.load(r) == large_bytes
_pickle.UnpicklingError: pickle data was truncated

Contrary to the documentation, pickle.load() requires that the file’s read() method returns as many bytes as requested. This is the case for buffered binary streams unless the underlying raw stream is interactive (source). However, it is not the case for unbuffered binary streams if the operating system can’t read enough bytes at once. On my system this is the case for bytestrings longer than (1 << 31) - 4096 bytes. For pipes, the limit is 1 << 16 bytes on my system.
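The short-read behavior can be reproduced without a multi-gigabyte file. The following is a minimal sketch, not code from CPython: `ShortReadStream` is a hypothetical file-like object that satisfies the documented interface but, like an unbuffered stream, returns at most `limit` bytes per read() call. On CPython 3.10 it is expected to end in the same "pickle data was truncated" error.

```python
import io
import pickle


class ShortReadStream:
    """Hypothetical file-like object: read() returns at most `limit` bytes per call."""

    def __init__(self, data, limit):
        self._buf = io.BytesIO(data)
        self._limit = limit

    def read(self, n=-1):
        # Like a raw stream, hand back fewer bytes than requested.
        if n is None or n < 0:
            return self._buf.read()
        return self._buf.read(min(n, self._limit))

    def readline(self):
        return self._buf.readline()


payload = pickle.dumps(b"x" * 1000, protocol=4)

try:
    pickle.load(ShortReadStream(payload, limit=64))
    outcome = "loaded"
except pickle.UnpicklingError as exc:
    outcome = f"UnpicklingError: {exc}"

print(outcome)
```

The stream never reports an error and never reaches end of file early; it simply returns the data in smaller pieces than pickle asks for.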

pickle.dump() has a similar problem. Its documentation says:

The file argument must have a write() method that accepts a single bytes argument. It can thus be an on-disk file opened for binary writing, an io.BytesIO instance, or any other custom object that meets this interface.

The above code with an unbuffered writer and buffered reader results in the same exception.

If the bytestring is one byte longer than what the operating system can write at once, loading it succeeds but returns a wrong result.

import pickle

large_bytes = b'x' * ((1 << 31) - 4095)  # one byte more than a single write() accepts on my system

with open('test.pickle', 'wb', 0) as w:  # buffering=0: unbuffered writer
    pickle.dump(large_bytes, w)

with open('test.pickle', 'rb') as r:
    assert pickle.load(r) == large_bytes

It fails with:

Traceback (most recent call last):
  File "test_pickle.py", line 9, in <module>
    assert pickle.load(r) == large_bytes
AssertionError

The assertion fails because the last byte of the unpickled bytestring is b'\x94' (the MEMOIZE opcode).
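The same corruption can be simulated at a small scale. This is only a sketch under stated assumptions: `ShortWriteStream` is a hypothetical writer that accepts at most `limit` bytes per write() call, and the example relies on CPython passing a bytes payload of 64 KiB or more to write() in a call of its own, which is an implementation detail rather than documented behavior.

```python
import io
import pickle


class ShortWriteStream:
    """Hypothetical raw-like writer: each write() accepts at most `limit` bytes."""

    def __init__(self, limit):
        self._buf = io.BytesIO()
        self._limit = limit

    def write(self, data):
        # Store only the first `limit` bytes and report how many were written,
        # as a raw stream may do. pickle does not check this return value.
        return self._buf.write(memoryview(data)[: self._limit])

    def getvalue(self):
        return self._buf.getvalue()


# Assumption: payloads of at least 64 KiB get their own write() call in CPython.
large_bytes = b"x" * 70_000

w = ShortWriteStream(limit=len(large_bytes) - 1)  # one byte short of the payload
pickle.dump(large_bytes, w, protocol=4)

loaded = pickle.loads(w.getvalue())
print(len(loaded) == len(large_bytes), loaded == large_bytes, loaded[-1:])
```

Because exactly one payload byte is dropped, the unpickler consumes the following opcode byte as the last byte of the data and then still finds a STOP opcode, so no error is raised.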

marshal.load() and marshal.dump() have a similar problem. However, I couldn't find an example like the previous one in which the data was corrupted in the middle of the stream, because marshal creates a buffer for the whole output and writes it to the stream at once. Also, marshal's maximum supported bytes length is (1 << 31) - 1, so the above example has to be adapted.

Possible solutions

The documentation should match the actual requirements of the implementation. The documentation could be changed to mention the additional restrictions, or the implementation could be changed to call read() / write() multiple times if necessary.
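Until either change is made, calling code can do the retrying itself. The sketch below shows one possible workaround, assuming the stream follows the usual raw-stream conventions (read() returns an empty bytes object at end of file, write() returns the number of bytes accepted). `FullReader`, `FullWriter` and `Trickle` are hypothetical names introduced only for this illustration.

```python
import io
import pickle


class FullReader:
    """Hypothetical wrapper: keeps calling raw.read() until n bytes arrive or EOF."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, n):
        chunks = []
        remaining = n
        while remaining > 0:
            chunk = self._raw.read(remaining)
            if not chunk:  # EOF (or no data available)
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)

    def readline(self):
        return self._raw.readline()


class FullWriter:
    """Hypothetical wrapper: keeps calling raw.write() until all data is written."""

    def __init__(self, raw):
        self._raw = raw

    def write(self, data):
        view = memoryview(data).cast("B")
        written = 0
        while written < len(view):
            n = self._raw.write(view[written:])
            if not n:
                raise OSError("raw stream accepted no data")
            written += n
        return written


class Trickle:
    """Hypothetical raw-like stream that moves at most `limit` bytes per call."""

    def __init__(self, limit, data=b""):
        self.buf = io.BytesIO(data)
        self.limit = limit

    def read(self, n):
        return self.buf.read(min(n, self.limit))

    def readline(self):
        return self.buf.readline()

    def write(self, data):
        return self.buf.write(memoryview(data)[: self.limit])


obj = b"x" * 100_000
sink = Trickle(limit=4096)
pickle.dump(obj, FullWriter(sink))

source = Trickle(limit=4096, data=sink.buf.getvalue())
restored = pickle.load(FullReader(source))
print(restored == obj)
```

For real files, opening them with the default buffering achieves much the same effect, since buffered streams already loop over the raw stream in the non-interactive case.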

If it is decided that the implementation should not call write() multiple times, I think that at least an exception should be raised to avoid silent data corruption.
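As an illustration of that second option, a check of write()'s return value is enough to turn silent corruption into an error. This is a sketch in user code, not a proposed patch: `StrictWriter` and `ShortWriteStream` are hypothetical, and the check assumes the underlying stream returns the number of bytes written, as io streams do.

```python
import io
import pickle


class ShortWriteStream:
    """Hypothetical raw-like writer: each write() accepts at most `limit` bytes."""

    def __init__(self, limit):
        self._buf = io.BytesIO()
        self._limit = limit

    def write(self, data):
        return self._buf.write(memoryview(data)[: self._limit])


class StrictWriter:
    """Hypothetical wrapper: raises if raw.write() accepts fewer bytes than given."""

    def __init__(self, raw):
        self._raw = raw

    def write(self, data):
        expected = memoryview(data).nbytes
        n = self._raw.write(data)
        # Assumption: the stream follows the io convention of returning the
        # byte count (or None when a non-blocking raw stream would block).
        if n is None or n < expected:
            raise OSError(f"short write: {n} of {expected} bytes")
        return n


try:
    pickle.dump(b"x" * 70_000, StrictWriter(ShortWriteStream(limit=4096)), protocol=4)
    result = "no error"
except OSError as exc:
    result = f"OSError: {exc}"

print(result)
```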

Environment

  • CPython versions tested on: 3.10.4
  • Operating system and architecture: Linux x86_64

manueljacob added the type-bug (An unexpected behavior, bug, or error) label May 21, 2022