The documentation for pickle.load() says:
The argument file must have two methods, a read() method that takes an integer argument, and a readline() method that requires no arguments. Both methods should return bytes. Thus file can be an on-disk file opened for binary reading, an io.BytesIO object, or any other custom object that meets this interface.
However, code written against this description doesn't work. The test script fails with:
Traceback (most recent call last):
  File "test_pickle.py", line 9, in <module>
    assert pickle.load(r) == large_bytes
_pickle.UnpicklingError: pickle data was truncated
Contrary to the documentation, pickle.load() requires that the file’s read() method returns as many bytes as requested. This is the case for buffered binary streams unless the underlying raw stream is interactive (source). However, it is not the case for unbuffered binary streams if the operating system can’t read enough bytes at once. On my system this is the case for bytestrings longer than (1 << 31) - 4096 bytes. For pipes, the limit is 1 << 16 bytes on my system.
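The difference between the two kinds of stream can be sketched without huge files or pipes. The following is a minimal illustration, not the original test script: ChunkedRaw is a hypothetical raw stream that hands out at most `limit` bytes per call, standing in for an unbuffered stream whose operating-system read comes back short. On the version from this report, loading from it directly is expected to raise the truncation error; the sketch only records that outcome, since other versions may behave differently. Wrapping the same stream in io.BufferedReader makes read() keep asking until the request is satisfied.

```python
import io
import pickle


class ChunkedRaw(io.RawIOBase):
    """Hypothetical raw stream returning at most `limit` bytes per read."""

    def __init__(self, data, limit):
        super().__init__()
        self._data = data
        self._pos = 0
        self._limit = limit

    def readable(self):
        return True

    def readinto(self, buffer):
        # Deliberately short: never fill more than `limit` bytes at once.
        n = min(len(buffer), self._limit, len(self._data) - self._pos)
        buffer[:n] = self._data[self._pos:self._pos + n]
        self._pos += n
        return n


payload = bytes(1000)
data = pickle.dumps(payload, protocol=4)

# Unbuffered: a single read() may return fewer bytes than pickle asked for.
try:
    pickle.load(ChunkedRaw(data, limit=16))
    outcome = "loaded"
except (pickle.UnpicklingError, EOFError) as exc:
    outcome = f"failed: {exc}"

# Buffered: BufferedReader issues as many raw reads as it needs.
restored = pickle.load(io.BufferedReader(ChunkedRaw(data, limit=16)))
```

The limit of 16 bytes is arbitrary; it only needs to be smaller than the largest read that pickle issues for this payload.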
pickle.dump() has a similar problem. Its documentation says:
The file argument must have a write() method that accepts a single bytes argument. It can thus be an on-disk file opened for binary writing, an io.BytesIO instance, or any other custom object that meets this interface.
The above code with an unbuffered writer and buffered reader results in the same exception.
If the bytestring is one byte longer than what the operating system can write at once, loading it works but returns a wrong result. The test script then fails with:
Traceback (most recent call last):
  File "test_pickle.py", line 9, in <module>
    assert pickle.load(r) == large_bytes
AssertionError
because the last byte of the unpickled bytestring is b'\x94' (the MEMOIZE opcode).
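Where the b'\x94' comes from can be sketched in memory, without an actual short write. This is an illustration under assumptions, not the reporter's code: it pickles a 70000-byte payload with protocol 4 (large enough that the payload is not wrapped in a frame), then removes the final payload byte by hand to imitate a write that stopped one byte early. The unpickler still reads the announced length, so it swallows the MEMOIZE opcode that follows the payload and then finds the STOP opcode as usual.

```python
import pickle

payload = bytes(70000)
data = pickle.dumps(payload, protocol=4)

# The stream ends with the MEMOIZE opcode followed by STOP.
tail = data[-2:]

# Imitate a write that dropped the last payload byte: everything after
# the payload (MEMOIZE, STOP) shifts one position forward.
corrupted = data[:-3] + data[-2:]

# Loading succeeds, but the last byte of the result is the MEMOIZE opcode.
restored = pickle.loads(corrupted)
last_byte = restored[-1:]
```

No exception is raised here, which is what makes this case silent corruption rather than a detectable failure.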
marshal.load() / marshal.dump() have a similar problem, except that I couldn’t find an example like the previous one in which the data was silently corrupted, as marshal creates a buffer for the whole output and writes it to the stream at once. Also, marshal’s maximum supported bytes length is (1 << 31) - 1, so the above example has to be adapted.
Possible solutions
The documentation should match the actual requirements of the implementation. The documentation could be changed to mention the additional restrictions, or the implementation could be changed to call read() / write() multiple times if necessary.
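Calling read() multiple times can be sketched at user level as a wrapper around the file object. This is only an illustration of the idea, not a proposed patch: ReadFully and ShortReads are hypothetical names, and ShortReads merely simulates a stream that returns fewer bytes than requested.

```python
import pickle


class ShortReads:
    """Hypothetical file-like object whose read() may come back short."""

    def __init__(self, data, limit):
        self._data = data
        self._pos = 0
        self._limit = limit

    def read(self, n):
        chunk = self._data[self._pos:self._pos + min(n, self._limit)]
        self._pos += len(chunk)
        return chunk

    def readline(self):
        end = self._data.find(b"\n", self._pos)
        end = len(self._data) if end < 0 else end + 1
        line = self._data[self._pos:end]
        self._pos = end
        return line


class ReadFully:
    """Repeat read() until n bytes have arrived or the stream is exhausted."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, n):
        chunks = []
        while n > 0:
            chunk = self._raw.read(n)
            if not chunk:  # EOF: return what we have
                break
            chunks.append(chunk)
            n -= len(chunk)
        return b"".join(chunks)

    def readline(self):
        return self._raw.readline()


payload = bytes(1000)
data = pickle.dumps(payload, protocol=4)
restored = pickle.load(ReadFully(ShortReads(data, limit=16)))
```

The wrapper exposes exactly the two methods that the quoted documentation asks for, so it can be handed to pickle.load() in place of the original stream.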
If it is decided that the implementation should not call write() multiple times, I think that at least an exception should be raised to avoid silent data corruption.
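The "at least raise an exception" option can also be sketched as a wrapper. Again this is an assumption-laden illustration rather than the implementation: StrictWriter and LossyWriter are hypothetical names, OSError is an arbitrary choice of exception, and the wrapper assumes the underlying write() returns the number of bytes it accepted.

```python
import io
import pickle


class StrictWriter:
    """Hypothetical wrapper that refuses short writes instead of losing data."""

    def __init__(self, raw):
        self._raw = raw

    def write(self, data):
        expected = memoryview(data).nbytes
        written = self._raw.write(data)
        if written != expected:
            raise OSError(f"short write: {written} of {expected} bytes")
        return written


class LossyWriter:
    """Hypothetical stream that accepts at most `limit` bytes per write()."""

    def __init__(self, limit):
        self.buffer = bytearray()
        self._limit = limit

    def write(self, data):
        chunk = bytes(data)[:self._limit]
        self.buffer += chunk
        return len(chunk)


payload = bytes(1000)

# A stream that accepts everything passes through unchanged.
sink = io.BytesIO()
pickle.dump(payload, StrictWriter(sink), protocol=4)
round_trip = pickle.loads(sink.getvalue())

# A stream that accepts only part of the data is reported, not ignored.
try:
    pickle.dump(payload, StrictWriter(LossyWriter(limit=16)), protocol=4)
    short_write_detected = False
except OSError:
    short_write_detected = True
```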
Environment
CPython versions tested on: 3.10.4
Operating system and architecture: Linux x86_64
manueljacob commented May 21, 2022