[Python-Dev] Unicode
Guido van Rossum
guido@python.org
Sun, 28 Apr 2002 20:38:42 -0400
[Guido]
> > No syntactic changes, no. But the way we do things would become
> > significantly different. And think of binary I/O vs. textual I/O --
> > currently, file.read() returns a string. Code dealing with binary
> > files will look significantly different, and old code won't work.
[Jack]
> It could be argued that open(..., 'r').read() returns a text
> string and open(..., 'rb').read() returns a binary blob.
They might even return different kinds of objects -- arguably, binary
files don't need readline() etc., and text files may not need read(n)
(though the arg-less variant is handy).
If only I had the time to reinvent the I/O library...
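
For illustration, here is a minimal sketch of the text-vs-binary split Jack describes, written against the Python 3 open() built-in rather than anything specified in this thread. The file name and its contents are made up for the example.

```python
import os
import tempfile

# Hypothetical sample file: non-ASCII text stored as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("caf\u00e9\n".encode("utf-8"))

# Text mode: read() returns a str, decoded with the given encoding.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: read() returns the raw bytes, undecoded.
with open(path, "rb") as f:
    blob = f.read()

print(type(text).__name__, len(text))
print(type(blob).__name__, len(blob))
```

The two lengths differ because the accented character is one character of text but two bytes in UTF-8.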
> If textstrings and blobs become wholly different objects this
> shouldn't create too many problems [see below], except for code
> that opens a file in binary mode and (partially) reads the
> resulting file expecting text. But this code would need
> revisiting anyway if the normal textstring would become unicode.
Yeah, that's usually just stubborn Unix users who don't believe in the
distinction between binary and text mode. :-)
Anyway, the proper way to convert between blobs and textstrings would
be encodings. That's how Java does it.
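
A rough sketch of what "conversion through encodings" can look like, using the str.encode() and bytes.decode() methods of Python 3 as a stand-in (this thread does not define an API). The sample string is arbitrary.

```python
# Arbitrary sample text containing one non-ASCII character.
text = "na\u00efve"

# Textstring -> blob: the caller must name an encoding.
utf8_blob = text.encode("utf-8")
latin1_blob = text.encode("latin-1")

# Blob -> textstring: decode with the same encoding to round-trip.
round_trip = utf8_blob.decode("utf-8")

# Same text, different encodings, different bytes.
print(len(text), len(utf8_blob), len(latin1_blob))
print(round_trip == text, utf8_blob == latin1_blob)
```

The point is that a blob has no meaning as text until an encoding is named, and the same text yields different blobs under different encodings.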
> [here's below] To my surprise I think that having blobs and
> textstrings be unrelated objects creates fewer problems than
> having the one be a subtype of the other. At least, every time I
> try to do the subtyping in my head I flip back and forth between
> textstrings-are-a-subtype-of-general-binary-buffers and
> binary-buffers-are-a-special-case-of-python-strings every couple
> of seconds. I think having them both be subtypes of a common
> base type (basestring) might work, but I'm not sure.
I think they don't need anything in common (apart from their
sequence-ness). I think Java's byte[] vs. String distinction is about
right.
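
As a hedged illustration of "nothing in common apart from sequence-ness", the sketch below uses the str and bytes types of Python 3, which are one possible realization of this idea and not something proposed in this thread.

```python
from collections.abc import Sequence

text = "abc"
blob = b"abc"

# Both types are sequences...
both_sequences = isinstance(text, Sequence) and isinstance(blob, Sequence)

# ...but their items differ: characters for text, small integers for bytes.
first_char = text[0]
first_byte = blob[0]

# Mixing the two without an explicit encoding is an error.
try:
    text + blob
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(both_sequences, first_char, first_byte, mixing_allowed)
```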
--Guido van Rossum (home page: http://www.python.org/~guido/)