Allow disabling per-function memoization cache to reduce tracking table storage #1779

@petrarca

Description

Thank you for building CocoIndex — we're using it as the foundation for an enterprise code search platform and the incremental processing model is excellent.

Problem

We index ~18 source code repositories (2.4M chunks total) using `SentenceTransformerEmbed`. Our tracking tables have grown to 17 GB, larger than the actual vector data table (16 GB). About 90% of the tracking-table size is cached embedding vectors stored as JSON in `memoization_info.cache`.

We're scaling toward 200 sources. At that scale, the tracking tables alone would consume roughly 190 GB, essentially doubling our total database size for data that already exists in the target table.
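The projection is simple linear extrapolation from our current measurements (assuming per-source tracking storage stays roughly constant):

```python
# Project tracking-table size from measured values, assuming linear
# growth in the number of sources.
current_sources = 18
current_tracking_gb = 17
target_sources = 200

projected_gb = current_tracking_gb / current_sources * target_sources
print(f"{projected_gb:.0f} GB")  # -> 189 GB
```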

Why we think disabling the cache would work fine for us

CocoIndex's source fingerprinting (`processed_source_fp`) already handles the normal operations: unchanged files are skipped entirely without consulting the memoization cache, and modified or added files must be re-embedded regardless. The cache only pays off when the processing logic fingerprint changes (e.g., an embedding model upgrade), but we change models very rarely and would do a full re-index in that case anyway.
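To make the argument concrete, here is a toy model of the decision logic as we understand it (our simplified reading, not CocoIndex's actual implementation):

```python
def needs_processing(source_fp, logic_fp, tracked):
    """Decide whether a chunk must be (re)processed.

    tracked maps to the (source_fp, logic_fp) pair recorded on the
    last run, or None for a new file. Returns "skip" or "reprocess".
    """
    if tracked is None:
        return "reprocess"  # new file: no prior entry, cache can't help
    prev_source_fp, prev_logic_fp = tracked
    if source_fp != prev_source_fp:
        return "reprocess"  # modified file: must re-embed regardless
    if logic_fp != prev_logic_fp:
        # The only case where the memoization cache pays off: same
        # content, changed processing logic. Rare in our workflow.
        return "reprocess"
    return "skip"           # unchanged file: skipped without the cache
```

In the first three branches the cached embedding is never useful; only the fourth-case-adjacent path (same content, changed logic) can be served from cache, which is exactly the scenario we handle with a full re-index anyway.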

Proposal

Allow `enable_cache` to be configurable at the function spec level, defaulting to `true` for backward compatibility:

```python
# Current behavior (unchanged)
text.transform(
    cocoindex.functions.SentenceTransformerEmbed(model="all-MiniLM-L6-v2")
)

# Opt out of caching to save storage
text.transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="all-MiniLM-L6-v2",
        enable_cache=False,
    )
)
```

A flow-level or global setting would also work for our use case.
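A settings cascade along these lines would cover both granularities. This is purely illustrative: the names `FlowSettings` and `FunctionSpec` and the resolution order are our invention, not existing CocoIndex API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowSettings:
    # Flow-level (or global) default; True preserves today's behavior.
    enable_cache: bool = True

@dataclass
class FunctionSpec:
    # Per-function override; None means "inherit from the flow".
    enable_cache: Optional[bool] = None

def cache_enabled(flow: FlowSettings, fn: FunctionSpec) -> bool:
    """The function-level setting wins; otherwise fall back to the flow."""
    return flow.enable_cache if fn.enable_cache is None else fn.enable_cache

# cache_enabled(FlowSettings(), FunctionSpec())                    -> True
# cache_enabled(FlowSettings(), FunctionSpec(enable_cache=False))  -> False
# cache_enabled(FlowSettings(enable_cache=False), FunctionSpec())  -> False
```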
