Update JSON::safeEncode to sanitize, rather than attempting convert/repair#2996
Update JSON::safeEncode to sanitize, rather than attempting convert/repair#2996anthonyryan1 wants to merge 1 commit into
Conversation
|
Just to clarify, this problem affects every endpoint that uses safeEncode, not just the list in this example. |
|
If all calls to safeEncode are affected why not fix the main function in |
|
That is the long term plan, yes. This isn't ready to merge, it's a discussion starting point for how we want to try solving these UTF8 normalization failures. |
|
What do you think if the json class is like this? class JSON
{
public static function safeEncode($value)
{
if (defined('JSON_THROW_ON_ERROR')) {
try {
return json_encode($value, JSON_THROW_ON_ERROR | JSON_INVALID_UTF8_SUBSTITUTE);
} catch (JsonException $e) {
throw $e;
}
}
$encoded = json_encode($value);
if ($encoded === false) {
require_once('utf.php');
return json_encode(UTF::utf8ize($value));
}
return $encoded;
}
}I would leave the old version just incase someone have PHP 5 still in use and then these substitude and stuff dont work. So far dont see anything break with this. |
…epair
UTF::utf8ize previously attempted to "repair" as many mangled strings
as possible, but I've observed a number of examples over the years where
it choked and broke rutorrent.
I would argue that taking arbitrary strings of bytes, and attempting
to convert them with perfect accuracy into valid UTF8 (when those strings
come to us mangled, corrupt and always with unknown encodings) isn't possible.
Here's a handful of examples of of byte sequences that choke the UTF8 repair code:
```php
require('php/utility/json.php');
var_dump(JSON::safeEncode("\xC0\x80"));
var_dump(JSON::safeEncode("\xED\xA0\x80"));
var_dump(JSON::safeEncode("\xED\xBF\xBF"));
var_dump(JSON::safeEncode("\xF5\x80\x80\x80"));
var_dump(JSON::safeEncode("\xF7\xBF\xBF\xBF"));
```
The problem is that I think this list is nowhere near comprehensive, and
even if we fix all of these, there will be many more corrupt sequences in
the wild we can't predict.
Instead of trying to repair all this corrupt data, let's just use the xFFFD
replacement character. Users who add files with corrupt strings will see the
replacement character (adding some awareness), but that becomes a content
problem and not a "ruTorrent is broken" problem.
|
There hasn't been much discussion on the trade-offs between attempting to repair vs sanitizing the corrupt strings. So I've just made a call, and amended this PR to my personal preference for how to handle this issue. UTF::utf8ize previously attempted to "repair" as many mangled strings I would argue that taking arbitrary strings of bytes, and attempting Here's a handful of examples of of byte sequences that choke the UTF8 repair code: require('php/utility/json.php');
var_dump(JSON::safeEncode("\xC0\x80"));
var_dump(JSON::safeEncode("\xED\xA0\x80"));
var_dump(JSON::safeEncode("\xED\xBF\xBF"));
var_dump(JSON::safeEncode("\xF5\x80\x80\x80"));
var_dump(JSON::safeEncode("\xF7\xBF\xBF\xBF"));The problem is that I think this list is nowhere near comprehensive, and Instead of trying to repair all this corrupt data, let's just use the xFFFD |
Right now, we've got a lot of code attempting to correct every malformed UTF8 string into something usable. This pull request is more of an alternate approach where we don't try and coax every string into something valid, but instead replace invalid sequences with a broken character encoding character.
Also ensures that on JSON encode failures we don't fail silently and return a HTTP 200.
One of two possible fixes for #2977
The other approcah is collecting every broken UTF8 name and adding unit tests for the UTF8 repair class. I have saved some strings that I've observed causing these problems in the wild, and can share them with someone motivated to write tests and fix the UTF8 fixer instead.