Skip to content

[blog] Fluss storage hierarchy#8

Open
polyzos wants to merge 4 commits into
apache:mainfrom
polyzos:fluss-storage-hierarchy
Open

[blog] Fluss storage hierarchy#8
polyzos wants to merge 4 commits into
apache:mainfrom
polyzos:fluss-storage-hierarchy

Conversation

@polyzos
Copy link
Copy Markdown
Contributor

@polyzos polyzos commented May 25, 2026

No description provided.

Copy link
Copy Markdown
Member

@fresh-borzoni fresh-borzoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@polyzos Thank you so much for the very needed blog 🚀

I am frequently asked about Fluss tiering model and now I'll have a great post to reference, kudos 🙇

Left some comments, mostly non-blocking, except unreleased standby replica feature.
PTAL


**The two upload tracks are independent on the way in. The recovery story stitches them back together on the way out, and breaks if either piece is missing.**

## Standby Replicas
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't released as I see apache/fluss#2835


Remote log storage is **enabled by default**. It's controlled by `remote.log.task-interval-duration` (default `1min`), and is only disabled when that value is set to `0`. KV snapshot upload is independent of remote-log tiering and is governed by `kv.snapshot.interval` (default `10min`). Note that `remote.data.dir` itself has no default — you must configure it before either of these tracks can do anything useful.

**Tier 3 is the lakehouse.** Paimon, Iceberg, Hudi, or Lance are holding data in analytical file formats queryable by any engine. Reads from the lakehouse cost seconds.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hudi isn't released yet


### A Note On Single-Copy Storage

**Fluss is single-copy in steady state.** Hot data lives on the server, cold data lives in the lakehouse, and there is no permanent duplication.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: too strong statement, we still replicate data across replicas


Everything described so far is the cold-start path; the one that runs when no other copy of a bucket is still alive. Most production recoveries aren't cold restarts.

**Fluss replicates each bucket across multiple tablet servers**: one leader handling writes, plus followers continuously tailing the same log. One of those followers is the designated **standby**, the replica the controller will promote on leader failure, and the one that maintains a live RocksDB kept current with the leader's in near real time.
Copy link
Copy Markdown
Member

@fresh-borzoni fresh-borzoni May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit dangling the leader's - is it a typo?


* **Tier 1** looks like the tier that matters, because it's the only one on the live query path.
* **Tier 2** looks like an implementation detail, because it's **"just S3"**.
* **Tier 3** looks like a destination, because it's the lakehouse. Each shortcut is wrong in a way that only becomes visible after you've configured something based on it.
Copy link
Copy Markdown
Member

@fresh-borzoni fresh-borzoni May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sentence starting with "Each shortcut" - separate line, it doesn't belong to the bullet point


`table.log.tiered.local-segments` (default 2) is a count-based floor for local disk, only meaningful when remote-log tiering is on. The remote-log task keeps at least this many recent segments on local disk after upload, so consumers reading near the head don't pay an S3 round-trip for the freshest data.

A segment becomes a candidate for upload **the moment it is sealed and its records are below the high watermark** (i.e., committed/acked). Sealing happens when the active segment hits its size threshold, fills the offset or time index, or can no longer encode records as relative offsets, and then rolls over: Fluss closes the current active segment (which becomes immutable) and opens a new one for subsequent writes. The freshly-closed segment is now something the remote-log task can pick up on its next pass. This is the same model Kafka uses for tiered storage, and it has the same operational consequence: **data sitting in the active segment lives only on the local Fluss server until rollover**, which means the active segment's size threshold sets a lower bound on how recent your "remote-only" reads can be.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Lower bound on how recent" means "a floor on freshness", but the actual constraint is the opposite: the most recent data only lives locally, so what's in remote is at most N records behind the head.

Suggested: "the active segment's size sets an upper bound on how fresh remote-tier data can be - anything newer than the current rollover is local-only."


**There is only one small exception**, when lakehouse tiering is enabled, a remote log segment is only deleted once **both** its TTL has expired **and** the lakehouse has ingested it. That's a safety net against lakehouse lag, and it creates a bounded window where the same data lives in both Tier 2 and Tier 3 simultaneously, governed by `table.log.ttl` (default 7 days). Shorten the TTL if strict single-copy matters more to you than a long lakehouse catch-up window.

The overlap is a configurable, time-bounded transition, **not another copy** of your data.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with previous paragraph's "in both Tier 2 and Tier 3 simultaneously" this is a confusing contradiction.


### Structure 2: KV Snapshots

Every ten minutes by default (`kv.snapshot.interval=10min`), the tablet server takes a snapshot of the live RocksDB and writes it to remote storage. **This is the system's only durable record of the table's current state.** If the tablet server's local disk evaporates, the most recent snapshot is what brings the data back.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the tablet server's local disk evaporates, the most recent snapshot is what brings the data back

well, we need to apply changelog on top to get it fully back, no?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants