Handling of surrogates on JSON encoding

When `json.dumps` is called with `ensure_ascii = True` (the default), the output does not distinguish between Unicode characters outside of the BMP and their corresponding surrogate characters. This means that `json.loads(json.dumps(s))` is not guaranteed to return `s`:

```
>>> json.loads(json.dumps('𝄞')) == '𝄞'
True
>>> json.loads(json.dumps('\ud834\udd1e')) == '\ud834\udd1e'
False
```

The only way to get surrogate characters back is to have them in the JSON string as actual characters, not escaped:

```
>>> json.loads( '"\ud83d\udca9"')  # Actual surrogate characters
'\ud83d\udca9'
>>> json.loads(r'"\ud83d\udca9"')  # Escaped surrogates
'💩'
```

Python does handle the case of lone surrogates reasonably (although RFC 8259 basically declares this undefined behaviour), so the round-trip does work in some cases:

```
>>> json.loads(json.dumps('\ud83d'))
'\ud83d'
```

It would seem reasonable to expect that `json.loads(json.dumps(s)) == s` would hold for any string. This isn't really a bug but rather a limitation of pure-ASCII encoding. However, I think it does deserve a mention in the documentation. Possibly at the end of the [Character encodings](https://docs.python.org/3/library/json.html#character-encodings) section? I'm not quite sure.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Handling of surrogates on JSON encoding #93508

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Handling of surrogates on JSON encoding #93508

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions