Issue #205: Tweet bodies incorrectly encoded/decoded

Note: Depending on how you look at it, this can be either a back-end bug or a front-end bug. However, it's a bug in both cases.

To describe the bug, let me show the process of a high school teacher explaining HTML to one of his students, as a reply on Twitter.

  • Teacher opens twitter.com and types the tweet: "@student If you want to display the & symbol, use &amp; in HTML."
  • Twitter.com calls statuses/update to send the tweet
  • Internal API stuff
  • Student opens twitter.com to view the tweet
  • Twitter.com calls statuses/home_timeline and gets (roughly):
    [{..., "text":"@student If you want to display the & symbol, use &amp; in HTML.", ...}, ...]
  • Twitter.com displays this as "@student If you want to display the & symbol, use & in HTML."
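
The steps above can be reproduced with a minimal Python sketch. The function names are hypothetical, and both functions are assumptions about the observed behavior (the API escaping only < and >, the site decoding every entity), not Twitter's actual code:

```python
import html


def api_encode(text: str) -> str:
    """Assumed API behavior: escape only < and >, leave & untouched."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def site_decode(text: str) -> str:
    """Assumed twitter.com behavior: decode every HTML entity."""
    return html.unescape(text)


sent = "@student If you want to display the & symbol, use &amp; in HTML."
in_json = api_encode(sent)        # unchanged: the tweet has no < or >
received = site_decode(in_json)   # "&amp;" is decoded although it was never encoded

print(received)
```

The decode step undoes an encoding that was never applied, so the literal "&amp;" the teacher typed is lost.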

This is incorrect behavior. I believe one of Twitter's core values is that tweets sent are tweets received: the only alterations should be additions.

A long, long time ago Twitter decided to encode only < and > (as &lt; and &gt;) in the XML responses, to avoid executing scripts and other kinds of unwanted HTML code. This then migrated to JSON as well, and has been there ever since. "#oldtwitter" knew that only &lt; and &gt; should be decoded, so there weren't any issues. However, it looks like newer versions of Twitter forgot to decode only those entities.
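
As a rough illustration of why the old pairing worked, here is a sketch in which the decoder is the exact inverse of the encoder. Both function names are made up for this example, and the encoder is only my assumption of what the API does:

```python
def encode_angle_brackets(text: str) -> str:
    """Assumed API behavior: escape only < and >."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def decode_angle_brackets(text: str) -> str:
    """Assumed #oldtwitter behavior: decode only &lt; and &gt;."""
    return text.replace("&lt;", "<").replace("&gt;", ">")


sent = "Use <b> for bold & &amp; for an ampersand."
in_json = encode_angle_brackets(sent)
received = decode_angle_brackets(in_json)

print(in_json)
print(received == sent)
```

Because &amp; is never touched in either direction, this tweet survives the round trip.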

Possible fixes:
* Have the API fully encode the response according to the HTML spec, instead of only encoding < and >. At the very least & should be encoded as well. This fix might create some unexpected behavior on third party clients (although they probably also have a sloppy implementation of the API).
* Simply don't encode the responses at all. This is the "correct" solution as JSON will never get executed as HTML, so there's no need to worry about executing scripts. Developers may have to update their implementations to stop decoding HTML entities.
* Change Twitter.com to decode only &lt; and &gt;. However, this still mishandles tweets from people who type &lt; and &gt; manually, since those get decoded as well. It's also bad in that developers will have to write their own decoding algorithms to properly parse the data. (While it can be as simple as two basic replacements, they do have to be implemented. That, and they're not perfect.) If this path is chosen, I strongly urge the API team to write some guidelines on decoding data such as this - many developers will simply use (for example) PHP's html_entity_decode.
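
To compare the first and third options, here is a small Python sketch. The partial encode/decode helpers are hypothetical stand-ins for the current API behavior and a two-replacement client; the full variant uses Python's standard html.escape and html.unescape purely as an example of encoding & as well:

```python
import html


def encode_partial(text: str) -> str:
    """Assumed current API behavior: escape only < and >."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def decode_partial(text: str) -> str:
    """Third option: a client that decodes only &lt; and &gt;."""
    return text.replace("&lt;", "<").replace("&gt;", ">")


# A tweet where the user typed the entity by hand.
typed = "Write &lt; to show < in HTML."

# Third option: the typed entity and the encoded bracket become indistinguishable.
partial_round_trip = decode_partial(encode_partial(typed))

# First option: & is encoded too, so a full decode restores the original text.
full_round_trip = html.unescape(html.escape(typed, quote=False))

print(partial_round_trip == typed)
print(full_round_trip == typed)
```

The partial scheme is lossy here because the encoded form of "<" collides with text the user typed; encoding & first removes that ambiguity.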

This is a (very) minor issue but definitely an issue. It's an incredibly easy fix but it may have massive implications. But it's a bug and bugs should be fixed, correct?

tl;dr: The basic rule of encoding data is "in == out", yet that's not the case here: for example, &amp; is displayed as &.

Updates

  • Thanks, I've reported this to the twitter.com folks. Since the API appears to do the correct thing in this case, though, I'm closing this issue.