Skip to content

Update JSON::safeEncode to sanitize, rather than attempting convert/repair#2996

Open
anthonyryan1 wants to merge 1 commit into
Novik:masterfrom
anthonyryan1:utf8
Open

Update JSON::safeEncode to sanitize, rather than attempting convert/repair#2996
anthonyryan1 wants to merge 1 commit into
Novik:masterfrom
anthonyryan1:utf8

Conversation

@anthonyryan1

Copy link
Copy Markdown
Contributor

Right now, we've got a lot of code attempting to correct every malformed UTF8 string into something usable. This pull request is more of an alternate approach where we don't try and coax every string into something valid, but instead replace invalid sequences with a broken character encoding character.

Also ensures that on JSON encode failures we don't fail silently and return a HTTP 200.

One of two possible fixes for #2977

The other approcah is collecting every broken UTF8 name and adding unit tests for the UTF8 repair class. I have saved some strings that I've observed causing these problems in the wild, and can share them with someone motivated to write tests and fix the UTF8 fixer instead.

@anthonyryan1

Copy link
Copy Markdown
Contributor Author

Just to clarify, this problem affects every endpoint that uses safeEncode, not just the list in this example.

@ranirahn

Copy link
Copy Markdown
Contributor

If all calls to safeEncode are affected why not fix the main function in
Rutorrent/php/utility/json.php ?

@anthonyryan1

Copy link
Copy Markdown
Contributor Author

That is the long term plan, yes. This isn't ready to merge, it's a discussion starting point for how we want to try solving these UTF8 normalization failures.

@ranirahn

ranirahn commented May 5, 2026

Copy link
Copy Markdown
Contributor

What do you think if the json class is like this?

class JSON
{
    public static function safeEncode($value)
    {
        if (defined('JSON_THROW_ON_ERROR')) {
           try {
               return json_encode($value, JSON_THROW_ON_ERROR | JSON_INVALID_UTF8_SUBSTITUTE);
           } catch (JsonException $e) {
               throw $e;
           }
        }
        $encoded = json_encode($value);
        if ($encoded === false) {
            require_once('utf.php');
            return json_encode(UTF::utf8ize($value));
        }
        return $encoded;
    }
}

I would leave the old version just incase someone have PHP 5 still in use and then these substitude and stuff dont work. So far dont see anything break with this.

…epair

UTF::utf8ize previously attempted to "repair" as many mangled strings
as possible, but I've observed a number of examples over the years where
it choked and broke rutorrent.

I would argue that taking arbitrary strings of bytes, and attempting
to convert them with perfect accuracy into valid UTF8 (when those strings
come to us mangled, corrupt and always with unknown encodings) isn't possible.

Here's a handful of examples of of byte sequences that choke the UTF8 repair code:

```php
require('php/utility/json.php');
var_dump(JSON::safeEncode("\xC0\x80"));
var_dump(JSON::safeEncode("\xED\xA0\x80"));
var_dump(JSON::safeEncode("\xED\xBF\xBF"));
var_dump(JSON::safeEncode("\xF5\x80\x80\x80"));
var_dump(JSON::safeEncode("\xF7\xBF\xBF\xBF"));
```

The problem is that I think this list is nowhere near comprehensive, and
even if we fix all of these, there will be many more corrupt sequences in
the wild we can't predict.

Instead of trying to repair all this corrupt data, let's just use the xFFFD
replacement character. Users who add files with corrupt strings will see the
replacement character (adding some awareness), but that becomes a content
problem and not a "ruTorrent is broken" problem.
@anthonyryan1

Copy link
Copy Markdown
Contributor Author

There hasn't been much discussion on the trade-offs between attempting to repair vs sanitizing the corrupt strings.

So I've just made a call, and amended this PR to my personal preference for how to handle this issue.


UTF::utf8ize previously attempted to "repair" as many mangled strings
as possible, but I've observed a number of examples over the years where
it choked and broke rutorrent.

I would argue that taking arbitrary strings of bytes, and attempting
to convert them with perfect accuracy into valid UTF8 (when those strings
come to us mangled, corrupt and always with unknown encodings) isn't possible.

Here's a handful of examples of of byte sequences that choke the UTF8 repair code:

require('php/utility/json.php');
var_dump(JSON::safeEncode("\xC0\x80"));
var_dump(JSON::safeEncode("\xED\xA0\x80"));
var_dump(JSON::safeEncode("\xED\xBF\xBF"));
var_dump(JSON::safeEncode("\xF5\x80\x80\x80"));
var_dump(JSON::safeEncode("\xF7\xBF\xBF\xBF"));

The problem is that I think this list is nowhere near comprehensive, and
even if we fix all of these, there will be many more corrupt sequences in
the wild we can't predict.

Instead of trying to repair all this corrupt data, let's just use the xFFFD
replacement character. Users who add files with corrupt strings will see the
replacement character (adding some awareness), but that becomes a content
problem and not a "ruTorrent is broken" problem.

@anthonyryan1 anthonyryan1 changed the title [RFC] JSON::safeEncode not entirely safe Update JSON::safeEncode to sanitize, rather than attempting convert/repair Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants