Skip to content

Handling of surrogates on JSON encoding #93508

@JustAnotherArchivist

Description

@JustAnotherArchivist

When json.dumps is called with ensure_ascii = True (the default), the output does not distinguish between Unicode characters outside of the BMP and their corresponding surrogate characters. This means that json.loads(json.dumps(s)) is not guaranteed to return s:

>>> json.loads(json.dumps('𝄞')) == '𝄞'
True
>>> json.loads(json.dumps('\ud834\udd1e')) == '\ud834\udd1e'
False

The only way to get surrogate characters back is to have them in the JSON string as actual characters, not escaped:

>>> json.loads( '"\ud83d\udca9"')  # Actual surrogate characters
'\ud83d\udca9'
>>> json.loads(r'"\ud83d\udca9"')  # Escaped surrogates
'💩'

Python does handle the case of lone surrogates reasonably (although RFC 8259 basically declares this undefined behaviour), so the round-trip does work in some cases:

>>> json.loads(json.dumps('\ud83d'))
'\ud83d'

It would seem reasonable to expect that json.loads(json.dumps(s)) == s would hold for any string. This isn't really a bug but rather a limitation of pure-ASCII encoding. However, I think it does deserve a mention in the documentation. Possibly at the end of the Character encodings section? I'm not quite sure.

Metadata

Metadata

Assignees

No one assigned

    Labels

    docsDocumentation in the Doc dir

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions