[blog] Fluss storage hierarchy by polyzos · Pull Request #8 · apache/fluss-blog

polyzos · 2026-05-25T10:43:06Z

No description provided.

fresh-borzoni

@polyzos Thank you so much for the very needed blog 🚀

I am frequently asked about Fluss tiering model and now I'll have a great post to reference, kudos 🙇

Left some comments, mostly non-blocking, except unreleased standby replica feature.
PTAL

fresh-borzoni · 2026-05-28T12:30:14Z

+
+**The two upload tracks are independent on the way in. The recovery story stitches them back together on the way out, and breaks if either piece is missing.**
+
+## Standby Replicas


This isn't released as I see apache/fluss#2835

fresh-borzoni · 2026-05-28T12:31:12Z

+
+Remote log storage is **enabled by default**. It's controlled by `remote.log.task-interval-duration` (default `1min`), and is only disabled when that value is set to `0`. KV snapshot upload is independent of remote-log tiering and is governed by `kv.snapshot.interval` (default `10min`). Note that `remote.data.dir` itself has no default — you must configure it before either of these tracks can do anything useful.
+
+**Tier 3 is the lakehouse.** Paimon, Iceberg, Hudi, or Lance are holding data in analytical file formats queryable by any engine. Reads from the lakehouse cost seconds.


Hudi isn't released yet

fresh-borzoni · 2026-05-28T12:32:42Z

+
+### A Note On Single-Copy Storage
+
+**Fluss is single-copy in steady state.** Hot data lives on the server, cold data lives in the lakehouse, and there is no permanent duplication.


nit: too strong statement, we still replicate data across replicas

fresh-borzoni · 2026-05-28T12:34:20Z

+
+Everything described so far is the cold-start path; the one that runs when no other copy of a bucket is still alive. Most production recoveries aren't cold restarts.
+
+**Fluss replicates each bucket across multiple tablet servers**: one leader handling writes, plus followers continuously tailing the same log. One of those followers is the designated **standby**, the replica the controller will promote on leader failure, and the one that maintains a live RocksDB kept current with the leader's in near real time.


a bit dangling the leader's - is it a typo?

fresh-borzoni · 2026-05-28T12:35:38Z

+
+* **Tier 1** looks like the tier that matters, because it's the only one on the live query path. 
+* **Tier 2** looks like an implementation detail, because it's **"just S3"**. 
+* **Tier 3** looks like a destination, because it's the lakehouse. Each shortcut is wrong in a way that only becomes visible after you've configured something based on it.


Sentence starting with "Each shortcut" - separate line, it doesn't belong to the bullet point

fresh-borzoni · 2026-05-28T12:37:36Z

+
+`table.log.tiered.local-segments` (default 2) is a count-based floor for local disk, only meaningful when remote-log tiering is on. The remote-log task keeps at least this many recent segments on local disk after upload, so consumers reading near the head don't pay an S3 round-trip for the freshest data.
+
+A segment becomes a candidate for upload **the moment it is sealed and its records are below the high watermark** (i.e., committed/acked). Sealing happens when the active segment hits its size threshold, fills the offset or time index, or can no longer encode records as relative offsets, and then rolls over: Fluss closes the current active segment (which becomes immutable) and opens a new one for subsequent writes. The freshly-closed segment is now something the remote-log task can pick up on its next pass. This is the same model Kafka uses for tiered storage, and it has the same operational consequence: **data sitting in the active segment lives only on the local Fluss server until rollover**, which means the active segment's size threshold sets a lower bound on how recent your "remote-only" reads can be.


"Lower bound on how recent" means "a floor on freshness", but the actual constraint is the opposite: the most recent data only lives locally, so what's in remote is at most N records behind the head.

Suggested: "the active segment's size sets an upper bound on how fresh remote-tier data can be - anything newer than the current rollover is local-only."

fresh-borzoni · 2026-05-28T12:38:42Z

+
+**There is only one small exception**, when lakehouse tiering is enabled, a remote log segment is only deleted once **both** its TTL has expired **and** the lakehouse has ingested it. That's a safety net against lakehouse lag, and it creates a bounded window where the same data lives in both Tier 2 and Tier 3 simultaneously, governed by `table.log.ttl` (default 7 days). Shorten the TTL if strict single-copy matters more to you than a long lakehouse catch-up window. 
+
+The overlap is a configurable, time-bounded transition, **not another copy** of your data.


with previous paragraph's "in both Tier 2 and Tier 3 simultaneously" this is a confusing contradiction.

fresh-borzoni · 2026-05-28T12:40:07Z

+
+### Structure 2: KV Snapshots
+
+Every ten minutes by default (`kv.snapshot.interval=10min`), the tablet server takes a snapshot of the live RocksDB and writes it to remote storage. **This is the system's only durable record of the table's current state.** If the tablet server's local disk evaporates, the most recent snapshot is what brings the data back.


If the tablet server's local disk evaporates, the most recent snapshot is what brings the data back

well, we need to apply changelog on top to get it fully back, no?

polyzos added 4 commits May 22, 2026 10:50

[blog] initial commit for fluss storage hierarchy

fa5046d

complete blog post

63ad2e9

update text

e914f93

update rocksdb image

c42046d

fresh-borzoni reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[blog] Fluss storage hierarchy#8

[blog] Fluss storage hierarchy#8
polyzos wants to merge 4 commits into
apache:mainfrom
polyzos:fluss-storage-hierarchy

polyzos commented May 25, 2026

Uh oh!

fresh-borzoni left a comment

Uh oh!

fresh-borzoni May 28, 2026

Uh oh!

fresh-borzoni May 28, 2026

Uh oh!

fresh-borzoni May 28, 2026

Uh oh!

fresh-borzoni May 28, 2026 •

edited

Loading

Uh oh!

fresh-borzoni May 28, 2026 •

edited

Loading

Uh oh!

fresh-borzoni May 28, 2026

Uh oh!

fresh-borzoni May 28, 2026

Uh oh!

fresh-borzoni May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		The two upload tracks are independent on the way in. The recovery story stitches them back together on the way out, and breaks if either piece is missing.

		## Standby Replicas


		Remote log storage is enabled by default. It's controlled by `remote.log.task-interval-duration` (default `1min`), and is only disabled when that value is set to `0`. KV snapshot upload is independent of remote-log tiering and is governed by `kv.snapshot.interval` (default `10min`). Note that `remote.data.dir` itself has no default — you must configure it before either of these tracks can do anything useful.

		Tier 3 is the lakehouse. Paimon, Iceberg, Hudi, or Lance are holding data in analytical file formats queryable by any engine. Reads from the lakehouse cost seconds.


		### A Note On Single-Copy Storage

		Fluss is single-copy in steady state. Hot data lives on the server, cold data lives in the lakehouse, and there is no permanent duplication.


		Everything described so far is the cold-start path; the one that runs when no other copy of a bucket is still alive. Most production recoveries aren't cold restarts.

		Fluss replicates each bucket across multiple tablet servers: one leader handling writes, plus followers continuously tailing the same log. One of those followers is the designated standby, the replica the controller will promote on leader failure, and the one that maintains a live RocksDB kept current with the leader's in near real time.


		`table.log.tiered.local-segments` (default 2) is a count-based floor for local disk, only meaningful when remote-log tiering is on. The remote-log task keeps at least this many recent segments on local disk after upload, so consumers reading near the head don't pay an S3 round-trip for the freshest data.

		A segment becomes a candidate for upload the moment it is sealed and its records are below the high watermark (i.e., committed/acked). Sealing happens when the active segment hits its size threshold, fills the offset or time index, or can no longer encode records as relative offsets, and then rolls over: Fluss closes the current active segment (which becomes immutable) and opens a new one for subsequent writes. The freshly-closed segment is now something the remote-log task can pick up on its next pass. This is the same model Kafka uses for tiered storage, and it has the same operational consequence: data sitting in the active segment lives only on the local Fluss server until rollover, which means the active segment's size threshold sets a lower bound on how recent your "remote-only" reads can be.


		There is only one small exception, when lakehouse tiering is enabled, a remote log segment is only deleted once both its TTL has expired and the lakehouse has ingested it. That's a safety net against lakehouse lag, and it creates a bounded window where the same data lives in both Tier 2 and Tier 3 simultaneously, governed by `table.log.ttl` (default 7 days). Shorten the TTL if strict single-copy matters more to you than a long lakehouse catch-up window.

		The overlap is a configurable, time-bounded transition, not another copy of your data.


		### Structure 2: KV Snapshots

		Every ten minutes by default (`kv.snapshot.interval=10min`), the tablet server takes a snapshot of the live RocksDB and writes it to remote storage. This is the system's only durable record of the table's current state. If the tablet server's local disk evaporates, the most recent snapshot is what brings the data back.

Conversation

polyzos commented May 25, 2026

Uh oh!

fresh-borzoni left a comment

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni May 28, 2026

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni May 28, 2026

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni May 28, 2026

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni May 28, 2026

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni May 28, 2026

Choose a reason for hiding this comment

Uh oh!

fresh-borzoni May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fresh-borzoni May 28, 2026 •

edited

Loading

fresh-borzoni May 28, 2026 •

edited

Loading