Update JSON::safeEncode to sanitize, rather than attempting convert/repair by anthonyryan1 · Pull Request #2996 · Novik/ruTorrent

anthonyryan1 · 2026-01-12T00:34:01Z

Right now, we have a lot of code that tries to repair every malformed UTF8 string into something usable. This pull request takes an alternate approach: rather than coaxing every string into something valid, it replaces invalid sequences with the Unicode replacement character (U+FFFD).

It also ensures that a JSON encode failure no longer fails silently with an HTTP 200.
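
To illustrate the "don't return a silent 200" point, here is a minimal sketch. It is not the code in this PR: `encodeResponse` is a hypothetical helper name, and it assumes PHP 7.3+ so that `JSON_THROW_ON_ERROR` is available.

```php
<?php
// Hypothetical helper (not part of ruTorrent): returns [HTTP status, body]
// so that an encoding failure surfaces as a 500 instead of an empty 200.
function encodeResponse($value)
{
    try {
        // JSON_THROW_ON_ERROR (PHP 7.3+) raises JsonException instead of returning false.
        return [200, json_encode($value, JSON_THROW_ON_ERROR)];
    } catch (JsonException $e) {
        // The error body is built from a plain message, so it is itself valid JSON.
        return [500, json_encode(['error' => $e->getMessage()])];
    }
}

// An overlong NUL encoding is invalid UTF-8, so this call takes the 500 path.
[$status, $body] = encodeResponse(['name' => "bad \xC0\x80 bytes"]);
```

A caller would then pass `$status` to `http_response_code()` before echoing `$body`, so the frontend sees an explicit error rather than a parse failure on an empty response.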

One of two possible fixes for #2977

The other approach is to collect every broken UTF8 name and add unit tests for the UTF8 repair class. I have saved some strings that I've observed causing these problems in the wild, and can share them with anyone motivated to write tests and fix the UTF8 fixer instead.

anthonyryan1 · 2026-03-10T16:48:26Z

Just to clarify, this problem affects every endpoint that uses safeEncode, not just the list in this example.

ranirahn · 2026-04-24T08:21:22Z

If all calls to safeEncode are affected, why not fix the main function in Rutorrent/php/utility/json.php?

anthonyryan1 · 2026-05-05T01:12:29Z

That is the long-term plan, yes. This isn't ready to merge; it's a starting point for discussing how we want to solve these UTF8 normalization failures.

ranirahn · 2026-05-05T17:35:28Z

What do you think of changing the JSON class to something like this?

```php
class JSON
{
    public static function safeEncode($value)
    {
        // PHP 7.3+: substitute U+FFFD for invalid UTF-8 and throw on any other failure.
        if (defined('JSON_THROW_ON_ERROR')) {
            return json_encode($value, JSON_THROW_ON_ERROR | JSON_INVALID_UTF8_SUBSTITUTE);
        }
        // Older PHP: keep the existing repair path as a fallback.
        $encoded = json_encode($value);
        if ($encoded === false) {
            require_once('utf.php');
            return json_encode(UTF::utf8ize($value));
        }
        return $encoded;
    }
}
```

I would keep the old code path as a fallback in case someone still runs PHP 5, where JSON_THROW_ON_ERROR and JSON_INVALID_UTF8_SUBSTITUTE are not available. So far I haven't seen anything break with this.
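
One way to handle the version split is to detect each flag separately, since `JSON_INVALID_UTF8_SUBSTITUTE` arrived in PHP 7.2 and `JSON_THROW_ON_ERROR` in PHP 7.3. The sketch below is only an illustration under that assumption; `availableJsonFlags` is a hypothetical name, not something in ruTorrent.

```php
<?php
// Hypothetical helper: collect only the json_encode flags this PHP build defines.
// Looking constants up by name avoids referencing an undefined constant on old PHP.
function availableJsonFlags()
{
    $flags = 0;
    foreach (array('JSON_INVALID_UTF8_SUBSTITUTE', 'JSON_THROW_ON_ERROR') as $name) {
        if (defined($name)) {
            $flags |= constant($name);
        }
    }
    return $flags;
}

// With the substitute flag present this yields a string; with no flags it is false.
$encoded = json_encode("name\xC0\x80", availableJsonFlags());
```

On a build where neither flag exists the result is still `false` for invalid input, so a fallback branch like the one above would still be needed there.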

anthonyryan1 · 2026-06-22T00:19:12Z

There hasn't been much discussion on the trade-offs between attempting to repair vs sanitizing the corrupt strings.

So I've just made a call, and amended this PR to my personal preference for how to handle this issue.

UTF::utf8ize previously attempted to "repair" as many mangled strings
as possible, but I've observed a number of examples over the years where
it choked and broke ruTorrent.

I would argue that taking arbitrary strings of bytes, and attempting
to convert them with perfect accuracy into valid UTF8 (when those strings
come to us mangled, corrupt and always with unknown encodings) isn't possible.

Here's a handful of examples of byte sequences that choke the UTF8 repair code:

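```php
require('php/utility/json.php');
var_dump(JSON::safeEncode("\xC0\x80"));         // overlong encoding of U+0000
var_dump(JSON::safeEncode("\xED\xA0\x80"));     // UTF-16 surrogate U+D800
var_dump(JSON::safeEncode("\xED\xBF\xBF"));     // UTF-16 surrogate U+DFFF
var_dump(JSON::safeEncode("\xF5\x80\x80\x80")); // lead byte beyond U+10FFFF
var_dump(JSON::safeEncode("\xF7\xBF\xBF\xBF")); // code point 0x1FFFFF, beyond U+10FFFF
```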

The problem is that I think this list is nowhere near comprehensive, and
even if we fix all of these, there will be many more corrupt sequences in
the wild we can't predict.

Instead of trying to repair all this corrupt data, let's just use the U+FFFD
replacement character. Users who add files with corrupt strings will see the
replacement character (adding some awareness), but that becomes a content
problem and not a "ruTorrent is broken" problem.
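
As a rough sketch of what the sanitizing behaviour looks like from the caller's side (this is not the PR's actual diff; `sanitizeEncode` is a hypothetical name, and PHP 7.2+ is assumed for `JSON_INVALID_UTF8_SUBSTITUTE`):

```php
<?php
// Hypothetical wrapper: sanitize invalid UTF-8 instead of trying to repair it.
function sanitizeEncode($value)
{
    // JSON_INVALID_UTF8_SUBSTITUTE (PHP 7.2+) replaces invalid bytes with U+FFFD.
    return json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE);
}

// Two of the byte sequences listed above, each prefixed with valid text.
$samples = array("ok\xC0\x80", "ok\xED\xA0\x80");
$decoded = array();
foreach ($samples as $raw) {
    // Round-trip through JSON to see what a client would receive.
    $decoded[] = json_decode(sanitizeEncode($raw));
}
```

Each decoded string keeps its valid `ok` prefix and carries U+FFFD where the corrupt bytes were, which is what makes the corruption visible to the user without breaking the response.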

anthonyryan1 mentioned this pull request Jan 12, 2026

parse errors xmlrpc - Bad response from server: (200 [parsererror,list]) #2977

Closed


anthonyryan1 force-pushed the utf8 branch from b7bc0a0 to 2f0dda7 Compare June 22, 2026 00:17

anthonyryan1 changed the title ~~[RFC] JSON::safeEncode not entirely safe~~ Update JSON::safeEncode to sanitize, rather than attempting convert/repair Jun 22, 2026

anthonyryan1 wants to merge 1 commit into Novik:master from anthonyryan1:utf8
