Skip to content

add get_bytes#5

Merged
tidwall merged 2 commits into
tidwall:mainfrom
deankarn:get-bytes
Apr 5, 2022
Merged

add get_bytes#5
tidwall merged 2 commits into
tidwall:mainfrom
deankarn:get-bytes

Conversation

@deankarn

@deankarn deankarn commented Apr 4, 2022

Copy link
Copy Markdown

This is a simplified version of #4 which only adds the new get_bytes function for when one has a &[u8] instead of an &str to avoid the need for a type conversation to a string.

@tidwall I'm hoping you might be able merge and cut a release for this small change and then have all the time to review the other PR you need. The new get_bytes function is really the only change I require at this time, the rest is minor optimizations. I did fork this repo however, am not able to release my crate without publishing it as a separate crate which is undesirable.

If not please let me know and I'll instead copy json into my codebase for the time being.

Comment thread src/lib.rs
Comment on lines +809 to +820
if rpv.len() > 0 && rpv[0] == b'~' {
// convert to bool
rpv = &rpv[1..];
if value.bool() {
tvalue.slice = "true";
tvalue.info = INFO_TRUE;
} else {
} else {
tvalue.slice = "false";
tvalue.info = INFO_FALSE;
}
tvalue.info = INFO_FALSE;
}
value = &tvalue;
}
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no changes happened, only code formatting from my IDE

@tidwall

tidwall commented Apr 4, 2022

Copy link
Copy Markdown
Owner

The PR is simple enough for me to take a look.

I think that your get_bytes function will lead to unsafe behavior for some inputs.

For example:

let json = b"{\"hello\":\"w\xFFld\"}";
let r = get_bytes(json, "hello").json().to_owned();
let s = String::from_utf8(r.as_bytes().to_vec()).unwrap();

This will panic with an Utf8Error because the json string is not valid UTF-8.

This because the input json: &'a [u8] arg is not known to be a valid UTF-8 string at the time that get_bytes is called. GJSON expects that inputs are always UTF-8.

In your signature:

pub fn get_bytes<'a>(json: &'a [u8], path: &'a str) -> Value<'a>

The return Value<'a> type will contain strings that are created from the same memory space as the original slices of the &[u8]. This is done in the util::tostr function, here:

https://github.com/deankarn/gjson.rs/blob/get-bytes/src/util.rs#L11-L21

You can see that unsafe { std::str::from_utf8_unchecked(v) } is used because the original get operation already guarantees to that the UTF-8 has already been checked due to only accepting &'a str, and there's no need to recopy or reencode the string.

The fix could possibly be one of the following:

  • Add the unsafe keyword to the get_bytes function to give the user an option to use an unsafe function.
  • Check the input JSON that it's valid UTF-8 first thing from inside the get_bytes function. If it's not valid then perform a lossy UTF-8 conversion.
  • Remove the original unsafe, at the cost of extra bytes copies.

If at all possible I would rather not expose unsafe functions to the end user in the library.

@deankarn

deankarn commented Apr 4, 2022

Copy link
Copy Markdown
Author

@tidwall as with your original get function this one also includes the same warning explicitly stating this fact https://github.com/tidwall/gjson.rs/pull/5/files#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759R1060

In my case I already know that the bytes in question are valid utf8. If I want to read out two fields it would be very inefficient to check validity multiple times.

I think that the warning alone should be more than sufficient and that always checking for validity is not something that should be forced upon the end user that has the option to do so before hand.

  • This is also to gain feature parity with the Go library of the same name.
  • I think adding unsafe would be overkill especially given the comment.
  • removing the original unsafe would result in a signature change for the function and all callers resulting in a much larger PR and set of changes
  • Like any library it’s ultimately up to the user to use it correctly.

@deankarn

deankarn commented Apr 4, 2022

Copy link
Copy Markdown
Author

If it is 100% required though I’d opt for marking the get_bytes function as unsafe as it seems a reasonable tradeoff.

I’ll try to make that change within the next few hours:)

@tidwall tidwall merged commit 6786e1a into tidwall:main Apr 5, 2022
@deankarn

deankarn commented Apr 5, 2022

Copy link
Copy Markdown
Author

Thanks @tidwall I also made get_bytes from #4 to also be unsafe.

@tidwall

tidwall commented Apr 5, 2022

Copy link
Copy Markdown
Owner

Sounds good. FYI, I made some minor changes to the trunk following the merge.

@deankarn

deankarn commented Apr 5, 2022

Copy link
Copy Markdown
Author

Yes I saw @tidwall

makes sense for this PR 👍

minefuto pushed a commit to minefuto/pygjson that referenced this pull request Apr 8, 2026
Exposes get_bytes(json: bytes, path: str, default=...) -> Value, mirroring
the gjson.rs get_bytes API (tidwall/gjson.rs#5). The Python binding validates
UTF-8 and raises ValueError on invalid input rather than using the unsafe
from_utf8_unchecked path, keeping the binding safe for Python callers.

https://claude.ai/code/session_01Ab8vgRMuBcE3fxEgTcfu6w
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants