[Python-Dev] PEP 383 update: utf8b is now the error handler
Stephen J. Turnbull
stephen at xemacs.org
Thu May 7 04:35:52 CEST 2009
"Martin v. Löwis" writes:
> > Now, with Python's file system encoding == UTF-8 or any packed EUC,
> > and more than a handful of Shift JIS or Big5 characters in file names,
> > one is *almost certain* to encounter ASCII as the second byte of a
> > multibyte sequence. PEP 383 can't handle this
Ah, I see. Of course, the algorithm not only has to handle the ASCII
octet which is erroneous because it can't be a trailing byte, but
*also the leading byte that signalled to expect a trailing byte >127*.
So the algorithm backs up to the character boundary (which is
well-defined for all the "sane" encodings), encodes the high byte(s) in
the character as lone surrogates, and encodes the ASCII byte as itself
(promoted to a Unicode code point).
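
A minimal sketch of that behavior, for illustration only. It assumes a
file system encoding of UTF-8 and uses the handler under the name
"surrogateescape", which is what released Python 3 calls it (this thread
refers to it as utf8b). The helper names decode_filename and
encode_filename and the sample byte strings are hypothetical, chosen to
resemble Shift JIS sequences whose second byte is ASCII 0x5C.

```python
# Sketch: bytes that are invalid in the file system encoding (UTF-8 assumed)
# become lone surrogates U+DC80..U+DCFF; ASCII bytes decode as themselves.

def decode_filename(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode file name bytes, escaping undecodable bytes as lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def encode_filename(name: str, encoding: str = "utf-8") -> bytes:
    """Invert decode_filename, restoring the escaped bytes exactly."""
    return name.encode(encoding, "surrogateescape")


# Case 1: 0xE0 announces a three-byte UTF-8 sequence, but an ASCII
# backslash (0x5C) follows. Only the lead byte is escaped; the decoder
# resumes at the ASCII byte.
raw_lead = b"\xe0\x5c.txt"
name_lead = decode_filename(raw_lead)

# Case 2: 0x95 is a UTF-8 continuation byte with no lead byte before it.
raw_stray = b"\x95\x5c.txt"
name_stray = decode_filename(raw_stray)

print(ascii(name_lead))
print(ascii(name_stray))

# The original bytes survive the round trip unchanged.
assert encode_filename(name_lead) == raw_lead
assert encode_filename(name_stray) == raw_stray
```

In both cases the high byte turns into a single lone surrogate and the
backslash stays an ordinary backslash, so the original byte string can
be recovered exactly.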
Sorry, you're right, I was just confused. I withdraw the objection as
completely mistaken, and apologize for not thinking more carefully in
the first place.