fix: reject non-finite numeric metadata supplied as strings#405

Open
Mubashirrrr wants to merge 1 commit into
morphik-org:mainfrom
Mubashirrrr:fix/typed-metadata-nonfinite-number-string

@Mubashirrrr

Bug

typed_metadata._coerce_number is used by normalize_metadata / merge_metadata to coerce user-supplied document metadata according to a declared type. For numeric inputs it explicitly rejects NaN/infinite values:

if isinstance(value, (int, float)) and not isinstance(value, bool):
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        raise TypedMetadataError(f"Metadata field '{field}' cannot store NaN or infinite values.")
    return value

…because those values break JSON serialization (json.dumps(float('inf')) emits Infinity, which is invalid JSON) and Postgres double precision storage. But the string branch did not apply the same guard:

return float(text)

So when a field was declared as number (or an alias like int/float) and the value arrived as a string, "inf", "-inf", "Infinity", "nan", and overflowing literals such as "1e400" (which Python parses to inf) were coerced into non-finite floats and stored, corrupting the document row at write time. Both the metadata and the type hints come straight from the ingestion request body (see core/services/v2_document_service.py and core/services/ingestion_service.py), so this is reachable from a normal API call.
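
The parsing behavior described above can be checked with the standard library alone. This is a minimal sketch that does not touch the Morphik code; the sample list simply mirrors the strings named in this section.

```python
import json
import math

# Strings that float() accepts but that do not parse to a finite number.
samples = ["inf", "-inf", "Infinity", "nan", "1e400"]

parsed = {text: float(text) for text in samples}
non_finite = [text for text, value in parsed.items() if not math.isfinite(value)]

# By default json.dumps emits the bare token Infinity, which is not valid JSON.
encoded = json.dumps(float("inf"))

# With allow_nan=False the encoder refuses non-finite floats instead.
try:
    json.dumps(float("inf"), allow_nan=False)
    strict_rejects = False
except ValueError:
    strict_rejects = True

print(non_finite)      # all five samples are non-finite
print(encoded)         # Infinity
print(strict_rejects)  # True
```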

Reproduction

from core.utils.typed_metadata import normalize_metadata
bundle = normalize_metadata({"score": "inf"}, {"score": "number"})
print(bundle.values["score"])   # inf  -> later breaks json.dumps / Postgres insert

Fix

Apply the existing finite-value check to the parsed string result, raising the same TypedMetadataError the numeric path uses. Four lines added, one changed; valid numeric strings (including scientific notation like "1e5") are unaffected.
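
The shape of the change can be sketched roughly as follows. This is not the actual diff: `coerce_number_string` is a hypothetical helper that mirrors only the string branch, the `TypedMetadataError` class here is a local stand-in for the project's exception, and the message for unparseable input is an assumption. Only the NaN/infinite message is taken from the numeric branch quoted above.

```python
import math


class TypedMetadataError(ValueError):
    """Local stand-in; the real exception is defined in typed_metadata."""


def coerce_number_string(text: str, field: str) -> float:
    """Hypothetical helper mirroring the string branch of _coerce_number."""
    try:
        number = float(text.strip())
    except ValueError as exc:
        # Assumed wording; the project's real message may differ.
        raise TypedMetadataError(f"Metadata field '{field}' expects a number.") from exc
    # Same guard the numeric branch applies: reject NaN and +/- infinity.
    if math.isnan(number) or math.isinf(number):
        raise TypedMetadataError(f"Metadata field '{field}' cannot store NaN or infinite values.")
    return number


print(coerce_number_string("1e5", "score"))  # 100000.0
```

Checking the parsed result rather than pattern-matching the input text means overflowing literals such as "1e400" are caught by the same test as "inf" and "nan".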

Regression test

Extended test_number_coercion_rejects_nan_and_infinity with a string-input case (test_number_coercion_rejects_nan_and_infinity_strings) covering inf, -inf, Infinity, nan, and 1e400. It fails before the fix (DID NOT RAISE) and passes after. The full test_typed_metadata.py suite stays green (47 passed).

$ pytest core/tests/unit/test_typed_metadata.py -q
47 passed
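
For readers without the repository at hand, the cases covered by the new test can be sketched as below. This is an illustration, not the test from the PR: it uses `unittest` from the standard library instead of pytest so it runs without extra dependencies, and `coerce` and `TypedMetadataError` are hypothetical stand-ins for the fixed string branch and the project's exception.

```python
import math
import unittest


class TypedMetadataError(ValueError):
    """Local stand-in for the project's exception type."""


def coerce(text: str) -> float:
    """Hypothetical stand-in for the string branch after the fix."""
    number = float(text)
    if not math.isfinite(number):
        raise TypedMetadataError("cannot store NaN or infinite values")
    return number


class NonFiniteStringTest(unittest.TestCase):
    def test_rejects_non_finite_strings(self):
        # The five inputs listed in the regression test description.
        for text in ["inf", "-inf", "Infinity", "nan", "1e400"]:
            with self.subTest(text=text):
                with self.assertRaises(TypedMetadataError):
                    coerce(text)

    def test_accepts_finite_strings(self):
        # Valid scientific notation must keep working.
        self.assertEqual(coerce("1e5"), 100000.0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(NonFiniteStringTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```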

🤖 Generated with Claude Code

_coerce_number rejects NaN and infinite values for numeric inputs because
they break JSON serialization and Postgres double precision storage, but
the string branch returned float(text) directly. Strings like "inf",
"-inf", "Infinity", "nan", and overflowing literals such as "1e400" (which
Python parses to inf) slipped through type coercion when a field was
declared as a number, corrupting the stored document.

Apply the same finite-value check to the parsed string result.

Extends the existing rejection test to cover string inputs; the new cases
fail before and pass after the fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
