-
-
Notifications
You must be signed in to change notification settings - Fork 33.7k
Description
When json.dumps is called with ensure_ascii = True (the default), the output does not distinguish between Unicode characters outside of the BMP and their corresponding surrogate characters. This means that json.loads(json.dumps(s)) is not guaranteed to return s:
>>> json.loads(json.dumps('𝄞')) == '𝄞'
True
>>> json.loads(json.dumps('\ud834\udd1e')) == '\ud834\udd1e'
False
The only way to get surrogate characters back is to have them in the JSON string as actual characters, not escaped:
>>> json.loads( '"\ud83d\udca9"') # Actual surrogate characters
'\ud83d\udca9'
>>> json.loads(r'"\ud83d\udca9"') # Escaped surrogates
'💩'
Python does handle the case of lone surrogates reasonably (although RFC 8259 basically declares this undefined behaviour), so the round-trip does work in some cases:
>>> json.loads(json.dumps('\ud83d'))
'\ud83d'
It would seem reasonable to expect that json.loads(json.dumps(s)) == s would hold for any string. This isn't really a bug but rather a limitation of pure-ASCII encoding. However, I think it does deserve a mention in the documentation. Possibly at the end of the Character encodings section? I'm not quite sure.
Metadata
Metadata
Assignees
Labels
Projects
Status