add get_bytes by deankarn · Pull Request #5 · tidwall/gjson.rs

deankarn · 2022-04-04T14:50:11Z

This is a simplified version of #4 that only adds the new get_bytes function, for when one has a &[u8] instead of a &str, to avoid the need for a type conversion to a string.

@tidwall I'm hoping you might be able to merge and cut a release for this small change, and then take all the time you need to review the other PR. The new get_bytes function is really the only change I require at this time; the rest is minor optimizations. I did fork this repo; however, I am not able to release my crate without publishing the fork as a separate crate, which is undesirable.

If not, please let me know and I'll instead copy json into my codebase for the time being.

deankarn · 2022-04-04T14:52:52Z

+    if rpv.len() > 0 && rpv[0] == b'~' {
+        // convert to bool
+        rpv = &rpv[1..];
+        if value.bool() {
            tvalue.slice = "true";
            tvalue.info = INFO_TRUE;
-		} else {
+        } else {
            tvalue.slice = "false";
-			tvalue.info = INFO_FALSE;
-		}
+            tvalue.info = INFO_FALSE;
+        }
        value = &tvalue;
-	}
+    }


No changes happened here, only code formatting from my IDE.

tidwall · 2022-04-04T21:46:21Z

The PR is simple enough for me to take a look.

I think that your get_bytes function will lead to unsafe behavior for some inputs.

For example:

let json = b"{\"hello\":\"w\xFFld\"}";
let r = get_bytes(json, "hello").json().to_owned();
let s = String::from_utf8(r.as_bytes().to_vec()).unwrap();

This will panic with a FromUtf8Error because the JSON string is not valid UTF-8.

This is because the input json: &'a [u8] arg is not known to be valid UTF-8 at the time that get_bytes is called. GJSON expects that inputs are always UTF-8.
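As a small illustration of the point above, the standard library's checked conversion rejects the example bytes and reports where the first invalid sequence starts. This is only a sketch using std; the helper names is_valid_utf8 and first_invalid_offset are hypothetical and are not part of gjson.

```rust
/// Returns true when `bytes` could be viewed as a `&str` without any repair.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Byte offset of the first invalid UTF-8 sequence, or `None` if the
/// whole slice is valid. (Hypothetical helper, for illustration only.)
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}
```

Running a check like this on the input before calling a byte-slice API is what makes the later unchecked conversion sound.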

In your signature:

pub fn get_bytes<'a>(json: &'a [u8], path: &'a str) -> Value<'a>

The returned Value<'a> will contain strings that borrow the same memory as the original &[u8] slice. This is done in the util::tostr function, here:

https://github.com/deankarn/gjson.rs/blob/get-bytes/src/util.rs#L11-L21

You can see that unsafe { std::str::from_utf8_unchecked(v) } is used because the original get operation only accepts &'a str, which guarantees that the UTF-8 has already been checked, so there's no need to recopy or reencode the string.
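To make the reasoning concrete, here is a minimal sketch of that zero-copy pattern. It is not the library's util::tostr; slice_as_str and first_quoted are hypothetical names, and the example assumes the bytes always originate from a &str and are only split at ASCII bytes.

```rust
/// Hypothetical stand-in for a zero-copy helper; not the library's code.
fn slice_as_str(v: &[u8]) -> &str {
    // SAFETY: sound only if every caller passes bytes that came from a
    // valid `&str` and were cut at character boundaries.
    unsafe { std::str::from_utf8_unchecked(v) }
}

/// Returns the contents of the first double-quoted run in `json`.
/// The input is a `&str`, so it is already known to be valid UTF-8.
fn first_quoted(json: &str) -> Option<&str> {
    let bytes = json.as_bytes();
    let start = bytes.iter().position(|&b| b == b'"')? + 1;
    let len = bytes[start..].iter().position(|&b| b == b'"')?;
    // The byte 0x22 never appears inside a multi-byte UTF-8 sequence,
    // so both cut points fall on character boundaries.
    Some(slice_as_str(&bytes[start..start + len]))
}
```

The guarantee lives entirely in the &str parameter type. Once the entry point accepts &[u8], nothing upholds the safety comment any more.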

The fix could possibly be one of the following:

1. Add the unsafe keyword to the get_bytes function to give the user an option to use an unsafe function.
2. Check that the input JSON is valid UTF-8 first thing inside the get_bytes function. If it's not valid, perform a lossy UTF-8 conversion.
3. Remove the original unsafe, at the cost of extra byte copies.
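The two safe options in the list above could look roughly like the following, using only the standard library. The function names json_lossy and json_checked_copy are hypothetical, and the second function is just one reading of "extra byte copies", not what the library does.

```rust
use std::borrow::Cow;

/// Lossy option: borrows when the input is already valid UTF-8, and
/// otherwise allocates a copy with invalid sequences replaced by U+FFFD.
pub fn json_lossy(raw: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(raw)
}

/// Copying option (an assumption about what "extra copies" would mean):
/// always allocates, and returns an error instead of repairing the input.
pub fn json_checked_copy(raw: &[u8]) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(raw.to_vec())
}
```

The lossy variant only pays for an allocation when the input is actually invalid, but the validation scan itself runs on every call.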

If at all possible I would rather not expose unsafe functions to the end user in the library.

deankarn · 2022-04-04T22:04:22Z

@tidwall as with your original get function, this one also includes the same warning explicitly stating this fact: https://github.com/tidwall/gjson.rs/pull/5/files#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759R1060

In my case I already know that the bytes in question are valid UTF-8. If I want to read out two fields, it would be very inefficient to check validity multiple times.

I think that the warning alone should be more than sufficient, and that always checking for validity is not something that should be forced upon an end user who has the option to do so beforehand.
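For comparison, a caller who wants to stay in safe code can also pay for validation once and reuse the resulting &str for every lookup. The sketch below uses only std; validate_once and contains_key are hypothetical, and contains_key is a crude stand-in for a real path lookup, not a gjson function.

```rust
/// Validates the bytes a single time. The returned `&str` borrows the same
/// memory, so later lookups need no further UTF-8 checks and no copies.
fn validate_once(raw: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(raw)
}

/// Hypothetical stand-in for a path lookup on already-validated JSON.
fn contains_key(json: &str, key: &str) -> bool {
    json.contains(&format!("\"{}\":", key))
}
```

With this shape, reading two fields costs one validation scan rather than two.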

This is also to gain feature parity with the Go library of the same name.
I think adding unsafe would be overkill, especially given the comment.
Removing the original unsafe would change the function's signature and all of its callers, resulting in a much larger PR and set of changes.
Like any library, it's ultimately up to the user to use it correctly.

deankarn · 2022-04-04T22:30:28Z

If it is 100% required, though, I'd opt for marking the get_bytes function as unsafe, as it seems a reasonable tradeoff.

I'll try to make that change within the next few hours :)
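A rough sketch of what that tradeoff means for callers: the unsafe function documents its contract, and a caller discharges it once, for example through a small wrapper type. Everything here is hypothetical (bytes_as_json and Utf8Checked are illustration names, not the merged API) and uses only std.

```rust
/// Hypothetical byte-slice entry point, marked unsafe like the PR proposes.
///
/// # Safety
/// `json` must be valid UTF-8.
pub unsafe fn bytes_as_json(json: &[u8]) -> &str {
    unsafe { std::str::from_utf8_unchecked(json) }
}

/// Wrapper that records that the bytes were validated exactly once.
pub struct Utf8Checked<'a>(&'a [u8]);

impl<'a> Utf8Checked<'a> {
    /// Returns `None` when `raw` is not valid UTF-8.
    pub fn new(raw: &'a [u8]) -> Option<Self> {
        std::str::from_utf8(raw).ok().map(|_| Utf8Checked(raw))
    }

    /// Zero-copy view that can be called any number of times.
    pub fn as_str(&self) -> &'a str {
        // SAFETY: outside this module, `new` is the only way to build a
        // `Utf8Checked`, and it validated the bytes.
        unsafe { bytes_as_json(self.0) }
    }
}
```

The unsafe keyword moves the validity obligation into the type system's view: the call site has to acknowledge it explicitly instead of relying on a doc comment alone.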

deankarn · 2022-04-05T00:32:12Z

Thanks @tidwall, I also made get_bytes from #4 unsafe.

tidwall · 2022-04-05T00:47:55Z

Sounds good. FYI, I made some minor changes to the trunk following the merge.

deankarn · 2022-04-05T01:04:01Z

Yes I saw @tidwall

makes sense for this PR 👍

Exposes get_bytes(json: bytes, path: str, default=...) -> Value, mirroring the gjson.rs get_bytes API (tidwall/gjson.rs#5). The Python binding validates UTF-8 and raises ValueError on invalid input rather than using the unsafe from_utf8_unchecked path, keeping the binding safe for Python callers. https://claude.ai/code/session_01Ab8vgRMuBcE3fxEgTcfu6w

add get_bytes

8d32b44

add unsafe to get_bytes

3d0c7d4

tidwall merged commit 6786e1a into tidwall:main Apr 5, 2022

tidwall merged 2 commits into tidwall:main from deankarn:get-bytes
