Thank you for building CocoIndex — we're using it as the foundation for an enterprise code search platform and the incremental processing model is excellent.
Problem
We index ~18 source code repositories (2.4M chunks total) using SentenceTransformerEmbed. Our tracking tables have grown to 17 GB — larger than the actual vector data table (16 GB). About 90% of the tracking table size is cached embedding vectors stored as JSON in memoization_info.cache.
We're scaling toward 200 sources. At that scale, the tracking tables alone would consume roughly 190 GB, essentially doubling our total database size for data that already exists in the target table.
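For context on where the 90% figure comes from, here is a back-of-envelope estimate (our own arithmetic, not a CocoIndex measurement): all-MiniLM-L6-v2 produces 384-dimensional vectors, and a float serialized as JSON text takes roughly 18 bytes including the separator (the per-float byte count is an assumption).

```python
# Rough estimate of JSON-cached embedding overhead for our workload.
DIMS = 384                 # all-MiniLM-L6-v2 output dimensionality
BYTES_PER_FLOAT_JSON = 18  # assumed avg chars per float in JSON, incl. comma
CHUNKS = 2_400_000         # chunks we currently index

per_chunk = DIMS * BYTES_PER_FLOAT_JSON      # ~6.9 KB of JSON per chunk
total_gb = per_chunk * CHUNKS / 1e9          # ~16.6 GB across all chunks
print(f"{per_chunk / 1024:.1f} KiB/chunk, {total_gb:.1f} GB total")
```

That lands within a gigabyte or two of the ~15 GB of cached vectors we observe, so the JSON cache plausibly accounts for nearly all of the tracking-table growth.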
Why we think disabling the cache would work fine for us
CocoIndex's source fingerprinting (processed_source_fp) already handles the common operations well — unchanged files are skipped entirely without consulting the memoization cache, and modified or added files need to be re-embedded regardless. The cache is only valuable when the processing-logic fingerprint changes (e.g., an embedding model upgrade), but we change models very rarely and would do a full re-index in that case anyway.
Proposal
Allow enable_cache to be configurable at the function spec level, defaulting to True for backward compatibility:
```python
# Current behavior (unchanged)
text.transform(
    cocoindex.functions.SentenceTransformerEmbed(model="all-MiniLM-L6-v2")
)

# Opt out of caching to save storage
text.transform(
    cocoindex.functions.SentenceTransformerEmbed(
        model="all-MiniLM-L6-v2",
        enable_cache=False,
    )
)
```
A flow-level or global setting would also work for our use case.