Skip to content

fix(postgres): preserve NUL bytes in internal JSONB tracking state#1798

Open
exveria1015 wants to merge 1 commit intococoindex-io:mainfrom
exveria1015:fix/postgres-jsonb-nul
Open

fix(postgres): preserve NUL bytes in internal JSONB tracking state#1798
exveria1015 wants to merge 1 commit intococoindex-io:mainfrom
exveria1015:fix/postgres-jsonb-nul

Conversation

@exveria1015
Copy link
Copy Markdown

@exveria1015 exveria1015 commented Apr 3, 2026

Summary

This fixes internal Postgres JSONB handling for values that contain U+0000 (NUL).

  • add reversible zero-byte escaping for internal JSONB values and keys

  • wire the escaping wrapper through tracking-table and setup-metadata reads/writes

  • avoid rewriting prefix-only literals so existing rows remain compatible

  • add focused tests for roundtrip behavior and literal-prefix handling

Verification

  • cargo test -p cocoindex_utils --features sqlx str_sanitize

  • cargo check -p cocoindex --lib

Fixes #933

- add reversible zero-byte escaping for internal JSONB values and keys

- apply the wrapper to tracking tables and setup metadata reads/writes

- keep prefix-only literals unchanged for backward compatibility

- add tests for JSON roundtrip and prefix-literal handling
@exveria1015 exveria1015 marked this pull request as ready for review April 3, 2026 00:39
@badmonster0
Copy link
Copy Markdown
Member

@exveria1015 thanks for the fix! @georgeh0 can help take a look!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes Postgres internal JSONB handling for tracking/setup metadata by introducing reversible escaping for U+0000 (NUL) in JSON string values/keys, so internal state can round-trip without failing Postgres writes.

Changes:

  • Added base64+magic based escaping/unescaping for NUL-containing JSON strings and a sqlx JSON wrapper (ZeroCodeEscapedJson) with Encode/Decode support.
  • Wired the wrapper through tracking-table and setup-metadata reads/writes to preserve NUL bytes in internal JSONB state.
  • Updated execution code paths to unwrap decoded tracking keys/state and added focused roundtrip/literal-prefix tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
rust/utils/src/str_sanitize.rs Implements NUL-preserving JSON escape/unescape and a sqlx Encode/Decode wrapper; adds tests.
rust/cocoindex/src/setup/db_metadata.rs Reads/writes setup metadata using escaped JSON wrapper and converts back to plain serde_json::Value.
rust/cocoindex/src/execution/db_tracking.rs Applies escaped JSON wrapper across tracking table reads/writes (source_key, memoization info, staged/committed keys, source state).
rust/cocoindex/src/execution/source_indexer.rs Unwraps escaped source_key when constructing PK values.
rust/cocoindex/src/execution/row_indexer.rs Unwraps escaped JSON tracking structures when processing staged/committed target keys.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

let decoded = base64::engine::general_purpose::STANDARD
.decode(encoded)
.ok()?;
let payload = decoded.strip_prefix(ZERO_CODE_ESCAPE_MAGIC)?;
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decode_zero_code_text will decode any string that matches the prefix + base64 + magic, even if the decoded payload does not contain any NUL bytes. This means a crafted literal like __cocoindex_zero_code_b64_v1__:... could be unexpectedly transformed on read. Consider adding a guard (e.g., require at least one \0 in the decoded payload after the magic) so only values that actually needed escaping are unescaped.

Suggested change
let payload = decoded.strip_prefix(ZERO_CODE_ESCAPE_MAGIC)?;
let payload = decoded.strip_prefix(ZERO_CODE_ESCAPE_MAGIC)?;
if !payload.contains(&b'\0') {
return None;
}

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +77
serde_json::Value::Object(values) => {
let mut escaped = serde_json::Map::new();
for (key, mut value) in std::mem::take(values) {
escape_zero_codes_in_json(&mut value);
escaped.insert(escape_zero_code_text(&key).unwrap_or(key), value);
}
*values = escaped;
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When escaping JSON object keys, inserting into a new map can silently overwrite entries if an escaped key collides with an existing key (e.g., a preexisting key equals the encoded form of a NUL-containing key). This loses data and breaks round-tripping for such inputs. Consider detecting collisions and failing fast (e.g., return a Result) or otherwise ensuring a collision-free encoding strategy for keys.

Copilot uses AI. Check for mistakes.
}

fn size_hint(&self) -> usize {
0
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZeroCodeEscapedJson::size_hint always returns 0, unlike other encoders in this module (e.g., ZeroCodeStrippedEncode uses the input length). This can cause extra allocations/copies when binding many JSON parameters. Consider delegating to the underlying sqlx::types::Json size_hint (or otherwise providing a non-zero estimate).

Suggested change
0
match serde_json::to_value(&self.0) {
Ok(mut value) => {
escape_zero_codes_in_json(&mut value);
<sqlx::types::Json<serde_json::Value> as Encode<'q, Postgres>>::size_hint(
&sqlx::types::Json(value),
)
}
Err(_) => 0,
}

Copilot uses AI. Check for mistakes.
Comment on lines +12 to +16
type EscapedJson<T> = utils::str_sanitize::ZeroCodeEscapedJson<T>;

fn escaped_json<T>(value: T) -> EscapedJson<T> {
utils::str_sanitize::ZeroCodeEscapedJson(value)
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EscapedJson/escaped_json are duplicated here and in execution/db_tracking.rs. To keep the escaping behavior consistent and avoid drift, consider centralizing this helper (e.g., a shared module in utils::str_sanitize or a small db_json helper) rather than duplicating it per file.

Copilot uses AI. Check for mistakes.
Comment on lines 12 to +19
use utils::{db::WriteAction, fingerprint::Fingerprint};

type EscapedJson<T> = utils::str_sanitize::ZeroCodeEscapedJson<T>;

fn escaped_json<T>(value: T) -> EscapedJson<T> {
utils::str_sanitize::ZeroCodeEscapedJson(value)
}

Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EscapedJson/escaped_json are duplicated here and in setup/db_metadata.rs. Consider centralizing this helper (or re-exporting a single helper from utils::str_sanitize) to avoid repeating the pattern and to make future adjustments (prefix/version changes, etc.) less error-prone.

Suggested change
use utils::{db::WriteAction, fingerprint::Fingerprint};
type EscapedJson<T> = utils::str_sanitize::ZeroCodeEscapedJson<T>;
fn escaped_json<T>(value: T) -> EscapedJson<T> {
utils::str_sanitize::ZeroCodeEscapedJson(value)
}
use utils::{
db::WriteAction,
fingerprint::Fingerprint,
str_sanitize::{EscapedJson, escaped_json},
};

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

@georgeh0 georgeh0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking care of this!

}

fn escape_zero_code_text(s: &str) -> Option<String> {
s.contains('\0').then(|| encode_zero_code_text(s))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides those containing '\0', I think we also want to escape those strings starting with ZERO_CODE_ESCAPE_PREFIX. So we'll be free from collision in the future.

And with this treatment, I think we won't need to have ZERO_CODE_ESCAPE_MAGIC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Make U+0000 (NUL) work with Postgres

4 participants