[Python-Dev] pymalloc and overallocation (unicodeobject.c,2.139,2.140 checkin)

Fri, 26 Apr 2002 16:59:23 -0400

[Tim]
>> But Marc-Andre uses realloc at the end to return the excess.  The
>> excess bytes will get reused (and some returned yet again) by the
>> next overallocation, and so on.

[Martin]
> Right. I confused this with the fact that PyMem_Realloc won't return
> the excess memory,

PyMem_Realloc does whatever the system realloc does -- PyMem_Realloc doesn't
go thru pymalloc today (except in a PYMALLOC_DEBUG build).  Doesn't matter,
though, since strings use the PyObject_{Malloc, Free, Realloc} family today,
and that does use pymalloc.  OTOH, there's no reason PyObject_Realloc *has*
to hang on to all small-block memory on a shrinking realloc, and there's no
reason pymalloc couldn't grow another realloc entry point specifying what
the caller wants a shrinking realloc to do.  These things are all easy to
change, but I don't know what's truly desirable.
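
To make the "another realloc entry point" idea concrete, here is a minimal sketch of a shrinking realloc whose caller picks the policy. Everything in it is hypothetical: `shrink_policy`, `toy_realloc_shrink`, and the explicit `old_size` argument are invented for illustration, and plain malloc/free stand in for the allocator. It is not pymalloc's API or implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical policy flag: what the caller wants a shrinking realloc to do. */
typedef enum {
    SHRINK_KEEP_BLOCK,   /* leave the data where it is; the excess stays
                            attached to the block until it is freed */
    SHRINK_COPY_SMALLER  /* move the data to a right-sized block and
                            release the old, larger one */
} shrink_policy;

/* Toy stand-in for a policy-aware shrinking realloc.  The caller passes
 * old_size because plain malloc does not expose block sizes; a real
 * allocator would know its own size classes. */
void *toy_realloc_shrink(void *p, size_t old_size, size_t new_size,
                         shrink_policy policy)
{
    void *q;

    assert(p != NULL && new_size > 0 && new_size <= old_size);

    if (policy == SHRINK_KEEP_BLOCK)
        return p;               /* nothing moves, nothing is returned */

    q = malloc(new_size);
    if (q == NULL)
        return p;               /* a shrink need not fail: keep the old block */
    memcpy(q, p, new_size);     /* only the bytes that survive the shrink */
    free(p);
    return q;
}
```

The trade-off is the one under discussion: keeping the block costs no copy but ties up the excess for the object's lifetime, while copying pays for a memcpy to give the excess back.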

Note another subtlety:  I expect you brought up PyMem_Realloc because
unicodeobject.c uses the PyMem_XYZ family for managing the
PyUnicodeObject.str member today.  That means it normally never uses
pymalloc at all, except to allocate fixed-size PyUnicodeObject structs
(which use the PyObject_XYZ memory family).  I don't know whether that's the
best idea, but that's how it is today.

pymalloc gets into this because PyUnicode_EncodeUTF8 returns a plain string
object, and the latter uses pymalloc today.

> so the extra bytes in a small string will be wasted for the life
> time of the string object - that still could cause significant memory
> wastage.

It could.  Python generally aims to optimize the expected case, not jump
thru hoops to avoid worst cases (else we wouldn't use dicts at all <wink>).
But I don't know what the expected case is here, and given how often I use
Unicode in my own work it could be I'll never have a clue.  Note that the
expected uses of Unicode strings make no difference to
PyUnicode_EncodeUTF8:  what counts there is the expected lifetimes and sizes
of the "plain" utf8-encoded PyStringObjects it computes.  Indeed, pymalloc
has almost no implications for Unicode beyond the encode-as-a-plain-string
functions (unless unicodeobject.c is changed to manage the
PyUnicodeObject.str member using pymalloc too, as plain strings do today).

>> MAL, you should keep in mind that pymalloc is also managing the
>> small chunks in your scheme:  when you're fiddling with a 40-character
>> Unicode string, an overallocation "by a factor of 4" only amounts to
>> an 80-character UTF8 string.

> [I guess this is a terminology, not a math problem:

Nope!  Turns out it was an hallucination problem <wink>.

> a 40 character Unicode string has already 80 bytes; the UTF-8 of
> it can have up to 160 bytes].

You're right, of course.  The conclusion doesn't change, though:  that's
still in the range of block sizes pymalloc handles (and will remain so unless I
reduce pymalloc's small-object threshold below what's needed for pymalloc to
handle small dicts on its own -- which I'm unlikely to do).
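
The corrected arithmetic can be spelled out as a small sketch. The 2 bytes per character and 4 UTF-8 bytes per character are taken from the figures quoted above (40 characters, 80 bytes, up to 160 bytes). The threshold value of 256 is an assumption for illustration only, since this message does not state pymalloc's actual small-object threshold, and the string object's header overhead is ignored. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: illustrative small-object threshold in bytes; the message
 * above does not give the real value. */
#define ASSUMED_SMALL_REQUEST_THRESHOLD 256

/* In-memory size of the Unicode data at 2 bytes per character,
 * matching "a 40 character Unicode string has already 80 bytes". */
size_t unicode_data_bytes(size_t nchars)
{
    assert(nchars <= (size_t)-1 / 2);   /* guard against overflow */
    return 2 * nchars;
}

/* Worst-case UTF-8 overallocation at 4 bytes per character,
 * matching "the UTF-8 of it can have up to 160 bytes". */
size_t utf8_worst_case_bytes(size_t nchars)
{
    assert(nchars <= (size_t)-1 / 4);   /* guard against overflow */
    return 4 * nchars;
}

/* Nonzero if a request of nbytes is at or below the given threshold,
 * i.e. small enough for a small-block allocator to handle. */
int fits_small_block(size_t nbytes, size_t threshold)
{
    return nbytes <= threshold;
}
```

Under that assumed threshold, `fits_small_block(utf8_worst_case_bytes(40), ASSUMED_SMALL_REQUEST_THRESHOLD)` is true, which is the point being made: even the corrected 160-byte figure stays in small-block territory.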