fix(postgres): preserve NUL bytes in internal JSONB tracking state#1798
fix(postgres): preserve NUL bytes in internal JSONB tracking state#1798exveria1015 wants to merge 1 commit intococoindex-io:mainfrom
Conversation
- add reversible zero-byte escaping for internal JSONB values and keys - apply the wrapper to tracking tables and setup metadata reads/writes - keep prefix-only literals unchanged for backward compatibility - add tests for JSON roundtrip and prefix-literal handling
|
@exveria1015 thanks for the fix! @georgeh0 can help take a look! |
There was a problem hiding this comment.
Pull request overview
This PR fixes Postgres internal JSONB handling for tracking/setup metadata by introducing reversible escaping for U+0000 (NUL) in JSON string values/keys, so internal state can round-trip without failing Postgres writes.
Changes:
- Added base64+magic based escaping/unescaping for NUL-containing JSON strings and a
sqlxJSON wrapper (ZeroCodeEscapedJson) with Encode/Decode support. - Wired the wrapper through tracking-table and setup-metadata reads/writes to preserve NUL bytes in internal JSONB state.
- Updated execution code paths to unwrap decoded tracking keys/state and added focused roundtrip/literal-prefix tests.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/utils/src/str_sanitize.rs | Implements NUL-preserving JSON escape/unescape and a sqlx Encode/Decode wrapper; adds tests. |
| rust/cocoindex/src/setup/db_metadata.rs | Reads/writes setup metadata using escaped JSON wrapper and converts back to plain serde_json::Value. |
| rust/cocoindex/src/execution/db_tracking.rs | Applies escaped JSON wrapper across tracking table reads/writes (source_key, memoization info, staged/committed keys, source state). |
| rust/cocoindex/src/execution/source_indexer.rs | Unwraps escaped source_key when constructing PK values. |
| rust/cocoindex/src/execution/row_indexer.rs | Unwraps escaped JSON tracking structures when processing staged/committed target keys. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let decoded = base64::engine::general_purpose::STANDARD | ||
| .decode(encoded) | ||
| .ok()?; | ||
| let payload = decoded.strip_prefix(ZERO_CODE_ESCAPE_MAGIC)?; |
There was a problem hiding this comment.
decode_zero_code_text will decode any string that matches the prefix + base64 + magic, even if the decoded payload does not contain any NUL bytes. This means a crafted literal like __cocoindex_zero_code_b64_v1__:... could be unexpectedly transformed on read. Consider adding a guard (e.g., require at least one \0 in the decoded payload after the magic) so only values that actually needed escaping are unescaped.
| let payload = decoded.strip_prefix(ZERO_CODE_ESCAPE_MAGIC)?; | |
| let payload = decoded.strip_prefix(ZERO_CODE_ESCAPE_MAGIC)?; | |
| if !payload.contains(&b'\0') { | |
| return None; | |
| } |
| serde_json::Value::Object(values) => { | ||
| let mut escaped = serde_json::Map::new(); | ||
| for (key, mut value) in std::mem::take(values) { | ||
| escape_zero_codes_in_json(&mut value); | ||
| escaped.insert(escape_zero_code_text(&key).unwrap_or(key), value); | ||
| } | ||
| *values = escaped; | ||
| } |
There was a problem hiding this comment.
When escaping JSON object keys, inserting into a new map can silently overwrite entries if an escaped key collides with an existing key (e.g., a preexisting key equals the encoded form of a NUL-containing key). This loses data and breaks round-tripping for such inputs. Consider detecting collisions and failing fast (e.g., return a Result) or otherwise ensuring a collision-free encoding strategy for keys.
| } | ||
|
|
||
| fn size_hint(&self) -> usize { | ||
| 0 |
There was a problem hiding this comment.
ZeroCodeEscapedJson::size_hint always returns 0, unlike other encoders in this module (e.g., ZeroCodeStrippedEncode uses the input length). This can cause extra allocations/copies when binding many JSON parameters. Consider delegating to the underlying sqlx::types::Json size_hint (or otherwise providing a non-zero estimate).
| 0 | |
| match serde_json::to_value(&self.0) { | |
| Ok(mut value) => { | |
| escape_zero_codes_in_json(&mut value); | |
| <sqlx::types::Json<serde_json::Value> as Encode<'q, Postgres>>::size_hint( | |
| &sqlx::types::Json(value), | |
| ) | |
| } | |
| Err(_) => 0, | |
| } |
| type EscapedJson<T> = utils::str_sanitize::ZeroCodeEscapedJson<T>; | ||
|
|
||
| fn escaped_json<T>(value: T) -> EscapedJson<T> { | ||
| utils::str_sanitize::ZeroCodeEscapedJson(value) | ||
| } |
There was a problem hiding this comment.
EscapedJson/escaped_json are duplicated here and in execution/db_tracking.rs. To keep the escaping behavior consistent and avoid drift, consider centralizing this helper (e.g., a shared module in utils::str_sanitize or a small db_json helper) rather than duplicating it per file.
| use utils::{db::WriteAction, fingerprint::Fingerprint}; | ||
|
|
||
| type EscapedJson<T> = utils::str_sanitize::ZeroCodeEscapedJson<T>; | ||
|
|
||
| fn escaped_json<T>(value: T) -> EscapedJson<T> { | ||
| utils::str_sanitize::ZeroCodeEscapedJson(value) | ||
| } | ||
|
|
There was a problem hiding this comment.
EscapedJson/escaped_json are duplicated here and in setup/db_metadata.rs. Consider centralizing this helper (or re-exporting a single helper from utils::str_sanitize) to avoid repeating the pattern and to make future adjustments (prefix/version changes, etc.) less error-prone.
| use utils::{db::WriteAction, fingerprint::Fingerprint}; | |
| type EscapedJson<T> = utils::str_sanitize::ZeroCodeEscapedJson<T>; | |
| fn escaped_json<T>(value: T) -> EscapedJson<T> { | |
| utils::str_sanitize::ZeroCodeEscapedJson(value) | |
| } | |
| use utils::{ | |
| db::WriteAction, | |
| fingerprint::Fingerprint, | |
| str_sanitize::{EscapedJson, escaped_json}, | |
| }; |
georgeh0
left a comment
There was a problem hiding this comment.
Thanks for taking care of this!
| } | ||
|
|
||
| fn escape_zero_code_text(s: &str) -> Option<String> { | ||
| s.contains('\0').then(|| encode_zero_code_text(s)) |
There was a problem hiding this comment.
Besides those containing '\0', I think we also want to escape those strings starting with ZERO_CODE_ESCAPE_PREFIX. So we'll be free from collision in the future.
And with this treatment, I think we won't need to have ZERO_CODE_ESCAPE_MAGIC.
Summary
This fixes internal Postgres JSONB handling for values that contain
U+0000(NUL).add reversible zero-byte escaping for internal JSONB values and keys
wire the escaping wrapper through tracking-table and setup-metadata reads/writes
avoid rewriting prefix-only literals so existing rows remain compatible
add focused tests for roundtrip behavior and literal-prefix handling
Verification
cargo test -p cocoindex_utils --features sqlx str_sanitize
cargo check -p cocoindex --lib
Fixes #933