From 0f8ae01374335998e57363e763185e714aecd0b6 Mon Sep 17 00:00:00 2001 From: Ankit Master Date: Tue, 9 Jun 2026 16:03:53 -0700 Subject: [PATCH] feat(msfs): warm-read fixes + per-inode disk cache backend Improves the MSFS data-cache warm-read path and adds an opt-in disk cache backend, validated against the EC2 MSFS->S3 baseline on OCI/AIStore. Contention / optimistic-lock copy: - Move the cache-line -> FUSE reply copy out of the global lock using the data cache's contentGeneration counter (snapshot under lock, copy unlocked, re-check + retry on mismatch), removing warm-read lock serialization. Per-inode disk cache backend (cache_backend: disk, Option B, opt-in): - Store each inode's cache lines in a contiguous per-inode file; serve warm reads via pread issued outside the global lock and drop FOPEN_DIRECT_IO so kernel readahead/page cache apply. Default in-memory data cache unchanged. - cache_disk.go + cache_disk_{linux,other}.go (fallocate PUNCH_HOLE on Linux). - Bump contentGeneration before punching recycled clean lines so in-flight optimistic readers retry instead of reading punched bytes. - Treat a failed storeContentDisk on a non-empty read as a fetch failure (EIO via DoRead) rather than EOF. Results (ws-fits, P4.8 warm): baseline 1,861 -> optimistic 2,866 -> disk 4,496 MiB/s (0.99x EC2 4,536). Other: - Do not log backend-derived values in the backend setup-failure and manifest_gen mismatch warnings (CodeQL go/clear-text-logging): the secret taints those structures, so the warnings log a fixed message only. - Reject changing AIStore.manifest_gen_backend via SIGHUP. - config.go/globals.go: cache_backend knob and config plumbing. - README.md: document cache_backend; align manifest_gen_backend wording. - tests: config_test.go, fission_test.go, cache_disk_test.go. --- agent-learnings.md | 2 + multi-storage-file-system/README.md | 4 +- multi-storage-file-system/backend_aistore.go | 40 ++++- multi-storage-file-system/cache.go | 68 +++++++- multi-storage-file-system/cache_disk.go | 138 +++++++++++++++ multi-storage-file-system/cache_disk_linux.go | 17 ++ multi-storage-file-system/cache_disk_other.go | 13 ++ multi-storage-file-system/cache_disk_test.go | 146 ++++++++++++++++ multi-storage-file-system/config.go | 56 ++++++ multi-storage-file-system/config_test.go | 103 +++++++++++ multi-storage-file-system/fission.go | 161 ++++++++++++++---- multi-storage-file-system/fission_test.go | 96 +++++++++++ multi-storage-file-system/fs.go | 26 +-- multi-storage-file-system/globals.go | 16 ++ multi-storage-file-system/globals_lock.go | 87 +++++----- 15 files changed, 873 insertions(+), 100 deletions(-) create mode 100644 multi-storage-file-system/cache_disk.go create mode 100644 multi-storage-file-system/cache_disk_linux.go create mode 100644 multi-storage-file-system/cache_disk_other.go create mode 100644 multi-storage-file-system/cache_disk_test.go diff --git a/agent-learnings.md b/agent-learnings.md index 906cb5ab..66f60747 100644 --- a/agent-learnings.md +++ b/agent-learnings.md @@ -8,3 +8,5 @@ Mistakes and non-obvious patterns worth remembering. Any agent (any tool, any te - When `msc` is in a project venv (`.venv/bin/msc`), MCP server configs must use the absolute path — tools launch the server outside the venv. - `git commit --no-edit` on a merge commit uses `cleanup=whitespace`, which keeps any `#`-prefixed comment lines from the auto-generated `MERGE_MSG` template. To strip those without opening an editor, pass `--cleanup=strip` explicitly (e.g. `git commit --no-edit --cleanup=strip`) or supply `-m "..."` with the final message. Fixing it after the push requires `git commit --amend --cleanup=strip --no-edit` plus `git push --force-with-lease`, so prefer to get it right the first time. - Long-running `nix-shell` sessions set `TMPDIR` to a per-shell directory (e.g. `/tmp/nix-shell.XXXXXX`) that gets swept off disk when /tmp is cleaned or the original spawning shell exits. Any tool that calls `os.TempDir()` then fails opaquely: `git commit` ("could not create temporary file: No such file or directory" → "failed to write commit object"), `golangci-lint` ("typechecking error: go: creating work dir: stat /tmp/nix-shell.XXXX: no such file or directory" → exit 7 even though the lint itself found 0 issues), `go build` / `go test` with similar messages. Fix from inside the nix-shell with `export TMPDIR=/tmp` (or `unset TMPDIR`) before re-running. If you must persist it, set it in `shellHook` in `flake.nix`. Reading `pgrep -fa golangci-lint` showing fast-churn PIDs from Cursor's `go.lintOnSave` is a separate problem (parallel-runner lock); set `run.allow-parallel-runners: true` in `.golangci.yml` to let editor + CLI coexist. +- MSFS manifest **ingest** performance is governed by `process_memory_limit` (the Go soft-memory / GOMEMLIMIT knob), which **defaults to 4 GiB**. A 100M-inode ingest needs ~6-8 GiB resident, so on the default the heap sits permanently above the soft limit and Go GCs continuously — a death spiral where current throughput collapses to ~1K obj/sec, `gcPauses` climb to thousands, and a 100M ingest stretches past 24 min. Symptom in the log: `manifest-ingest: progress` lines where `current obj/sec` drops to ~1000 while `mem` hovers above the limit and `gcPauses` grows ~linearly. Fix: set `process_memory_limit` generously for the host (e.g. `68719476736` = 64 GiB on a big node); ingest then completed in ~3m22s @ ~518K obj/sec with only 94 GC pauses (vs 3317). On 256-vCPU hosts also cap `GOMAXPROCS` (e.g. `GOMAXPROCS=32` env at mount) so GC stop-the-world doesn't pause 256 Ps. NOTE: the old EC2 configs bounded ingest memory via `max_memory_inodes: 20000000`, but that key **no longer exists in the current code** (replaced by PebbleDB metadata-cache paging: `pebble_cache_size`, `inode_map_*`); it is silently ignored if present, so use `process_memory_limit` instead. +- MSFS AIStore backend supports `manifest_gen_backend: ` (in the `AIStore:` block) to route LIST/STAT-DIR (manifest generation, readdir) to another already-configured backend (e.g. a direct-S3 backend) while object reads still go via AIS for caching/fast-tier. Use this when AIS fronts a remote cloud bucket: AIS's cold cross-cloud LIST is extremely slow (proxy→target `/begin` control-plane timeouts under concurrency; a 100M-dir crawl never completes), whereas listing the underlying store directly is fast (100M objects in ~1m28s @ ~1.1M obj/sec, identical to native S3). Only the AIS backend should carry `manifest_path`; if a backend named as someone's `manifest_gen_backend` ALSO sets its own `manifest_path`, both run independent generate+ingest pipelines that compete on the same PebbleDB and memory budget (double the S3 LIST load; corrupt manifest if they share the same `manifest_path` dir). diff --git a/multi-storage-file-system/README.md b/multi-storage-file-system/README.md index 691480bb..c89ddbed 100644 --- a/multi-storage-file-system/README.md +++ b/multi-storage-file-system/README.md @@ -63,7 +63,8 @@ The MSFS-specific global (i.e. "top-level") settings are described in the follow | virtual_dir_ttl | decimal milliseconds | 1000000 | Amount of time a created but still empty directory should be maintained (should be at least evictable_inode_ttl) | | virtual_file_ttl | decimal milliseconds | 1000000 | Amount of time a created but still not flushed file should be maintained (should be at least evictable_inode_ttl) | | ttl_check_interval | decimal milliseconds | 250 | Amount of time between checking for evictions and cache pruning | -| mapped_cache | boolean | true | If true, the caching layer uses a memory-mapped file (which can be much larger than RAM); if false, the caching layer uses only RAM | +| mapped_cache | boolean | true | If true, the caching layer uses a memory-mapped file (which can be much larger than RAM); if false, the caching layer uses only RAM (only applies when cache_backend == "memory") | +| cache_backend | string | "memory" | Cache-line storage/serve backend: "memory" (shared mmap content buffer served via memcpy) or "disk" (per-inode contiguous files under /cachelines served via pread, with FOPEN_DIRECT_IO dropped) | | cache_line_size | decimal bytes | 10485760 (10Mi) | Granularity of caching layer for both file read and write traffic | | cache_lines | decimal | 128 | Number of cache lines provisioned | | cache_lines_to_prefetch | decimal | 4 | Maximum number of cache lines to prefetch while fetching a cache line to satisfy a read operation | @@ -146,6 +147,7 @@ the following table: | authnTokenFile | string | "${AIS_AUTHN_TOKEN_FILE:=~/.config/ais/cli/auth.token}" | If != "", specifies location of AUTHN Token file | | provider | string | "s3" | IF != "ais", specifies the backend of which bucket contents are cached | | timeout | decimal milliseconds | 30000 | Limit on allowed duration of requests (including retries) | +| manifest_gen_backend | string | "" | IF != "", `dir_name` of another (non-AIStore) backend used for LIST/STAT-DIR (manifest generation, readdir); object reads still go via AIS. Lets listing hit the underlying store (e.g. S3) directly while reads benefit from AIS caching. The referenced backend should target the same bucket/prefix; a mismatch is logged as a warning (not rejected). | ### GCS Backend Configuration diff --git a/multi-storage-file-system/backend_aistore.go b/multi-storage-file-system/backend_aistore.go index 19b5e241..271cc8df 100644 --- a/multi-storage-file-system/backend_aistore.go +++ b/multi-storage-file-system/backend_aistore.go @@ -24,6 +24,14 @@ type aistoreContextStruct struct { backend *backendStruct baseParams api.BaseParams // Connection parameters bck cmn.Bck // Bucket metadata/ structure + // `listBackend`, when non-nil, is the backend named by the AIStore backend's + // `manifest_gen_backend` (e.g. a direct S3 backend). LIST/STAT-DIR operations are + // delegated to its context so that manifest generation and readdir bypass AIS's + // (slow, cold) cross-cloud listing, while object reads continue to flow through + // AIS for its caching/fast-tier. We store the *backendStruct (not its context) + // and dereference `.context` at call time so a later re-setup can't leave a stale + // pointer here. + listBackend *backendStruct } // `backendCommon` is called to return a pointer to the context's common `backendStruct`. @@ -94,12 +102,27 @@ func (backend *backendStruct) setupAIStoreContext() (err error) { } // Store context - backend.context = &aistoreContextStruct{ + aisContext := &aistoreContextStruct{ backend: backend, baseParams: baseParams, bck: bck, } + // If a manifest_gen_backend is configured, route LIST/STAT-DIR operations to it + // (reads still go via AIS). Ensure its context is set up (it may not have been + // set up yet, depending on backend mount ordering). + if backendAIStore.manifestGenBackend != nil { + if backendAIStore.manifestGenBackend.context == nil { + if err = backendAIStore.manifestGenBackend.setupContext(); err != nil { + err = fmt.Errorf("[AIStore] failed to set up manifest_gen_backend %q: %v", backendAIStore.manifestGenBackend.dirName, err) + return + } + } + aisContext.listBackend = backendAIStore.manifestGenBackend + } + + backend.context = aisContext + // Record backendPath if backend.prefix == "" { backend.backendPath = backendAIStore.endpoint + "/" @@ -157,6 +180,11 @@ func (aisContext *aistoreContextStruct) deleteFile(deleteFileInput *deleteFileIn // indicates the `directory` has been completely enumerated. The `isTruncated` field will also // align with this convention. func (aisContext *aistoreContextStruct) listDirectory(listDirectoryInput *listDirectoryInputStruct) (listDirectoryOutput *listDirectoryOutputStruct, err error) { + // Delegate listing to the manifest_gen_backend (e.g. direct S3) when configured. + if aisContext.listBackend != nil { + return aisContext.listBackend.context.listDirectory(listDirectoryInput) + } + var ( backend = aisContext.backend fullDirPath = backend.prefix + listDirectoryInput.dirPath @@ -222,6 +250,11 @@ func (aisContext *aistoreContextStruct) listDirectory(listDirectoryInput *listDi // empty list of elements (`objects`) indicates the list of `objects` has been completely // enumerated. The `isTruncated` field will also align with this convention. func (aisContext *aistoreContextStruct) listObjects(listObjectsInput *listObjectsInputStruct) (listObjectsOutput *listObjectsOutputStruct, err error) { + // Delegate listing to the manifest_gen_backend (e.g. direct S3) when configured. + if aisContext.listBackend != nil { + return aisContext.listBackend.context.listObjects(listObjectsInput) + } + var ( backend = aisContext.backend lsmsg = &apc.LsoMsg{ @@ -337,6 +370,11 @@ func (aisContext *aistoreContextStruct) readFile(readFileInput *readFileInputStr // `statDirectory` is called to verify that the specified path refers to a `directory`. // An error is returned if either the specified path is not a `directory` or non-existent. func (aisContext *aistoreContextStruct) statDirectory(statDirectoryInput *statDirectoryInputStruct) (statDirectoryOutput *statDirectoryOutputStruct, err error) { + // Delegate directory stat to the manifest_gen_backend (e.g. direct S3) when configured. + if aisContext.listBackend != nil { + return aisContext.listBackend.context.statDirectory(statDirectoryInput) + } + var ( backend = aisContext.backend fullDirPath = backend.prefix + statDirectoryInput.dirPath diff --git a/multi-storage-file-system/cache.go b/multi-storage-file-system/cache.go index c84fbdd5..56d135de 100644 --- a/multi-storage-file-system/cache.go +++ b/multi-storage-file-system/cache.go @@ -21,7 +21,18 @@ func dataCacheUp() (err error) { dataCacheLineContentSize = globals.config.cacheLines * globals.config.cacheLineSize - if globals.config.mappedCache { + switch { + case globals.config.cacheBackend == "disk": + // Disk backend: no shared content arena. Each inode's cache-line bytes + // live in its own contiguous file under /cachelines/, served + // via pread. The tracker/LRU machinery below is unchanged and bounds the + // number of resident lines exactly as in memory mode. + globals.dataCacheLinesFile = nil + globals.dataCacheLinesContent = nil + if err = diskCacheUp(); err != nil { + return + } + case globals.config.mappedCache: globals.dataCacheLinesFile, err = os.OpenFile(filepath.Join(globals.cacheDir, DataCacheFileName), os.O_RDWR|os.O_CREATE|os.O_EXCL, 0o600) if err != nil { err = fmt.Errorf("os.OpenFile(filepath.Join(globals.cacheDir, DataCacheFileName), os.O_RDWR|os.O_CREATE|os.O_EXCL, 0o600) failed: %v", err) @@ -42,7 +53,7 @@ func dataCacheUp() (err error) { err = fmt.Errorf("syscall.Mmap(int(globals.dataCacheLinesFile.Fd()),,,,,) failed: %v", err) return } - } else { + default: // RAM-backed cache: use an anonymous private mmap (fd=-1, MAP_ANON) instead // of `make([]byte, N)`. The kernel only commits physical pages on first // touch, so workloads that never read file data (e.g. pure manifest ingest @@ -120,6 +131,10 @@ func dataCacheUp() (err error) { func dataCacheDown() (err error) { globals.dataCacheActivityWG.Wait() + if globals.config.cacheBackend == "disk" { + diskCacheDown() + } + // Both mapped_cache=true (file-backed) and mapped_cache=false (anonymous) paths // use mmap-allocated memory, so Munmap covers both. The file fd is only present // for the mapped_cache=true case. @@ -364,6 +379,15 @@ func allocateDataCacheLines(count uint64) (cacheLineNumbers []uint64, neededToBl globals.logger.Fatalf("[FATAL] globals.inodeMap.get(dataCacheLineTracker.inodeNumber[%v]) returned !ok", dataCacheLineTracker.inodeNumber) } + if globals.config.cacheBackend == "disk" { + // Slot is being recycled for a different (inode, line): release + // the evicted line's disk bytes. Bump the generation first so any + // in-flight unlocked optimistic reader fails its post-copy + // re-check and retries rather than accepting now-punched bytes. + dataCacheLineTracker.contentGeneration++ + dataCacheLineTracker.punchHoleDisk() + } + delete(inode.cacheMap, dataCacheLineTracker.lineNumber) cacheLineNumbers = append(cacheLineNumbers, dataCacheLineTracker.pos) @@ -393,7 +417,7 @@ func allocateDataCacheLines(count uint64) (cacheLineNumbers []uint64, neededToBl cacheLineWaiter.Wait() - globalsLock("cache.go:396:3:allocateDataCacheLines") + globalsLock("cache.go:416:3:allocateDataCacheLines") } } @@ -416,11 +440,15 @@ func releaseDataCacheLines(cacheLineNumbers []uint64) { // `free` resets a data cache line that is not currently on any LRU and returns // it to the Free LRU. The caller must hold the globals lock. func (dataCacheLineTracker *dataCacheLineTrackerStruct) free() { + if globals.config.cacheBackend == "disk" { + dataCacheLineTracker.punchHoleDisk() // no-op if this line had no disk backing + } dataCacheLineTracker.contentLength = 0 dataCacheLineTracker.contentGeneration++ dataCacheLineTracker.inodeNumber = 0 // not yet applicable dataCacheLineTracker.lineNumber = 0 // not yet applicable dataCacheLineTracker.eTag = "" // not yet applicable + dataCacheLineTracker.fetchFailed = false dataCacheLineTracker.waiters = make([]*sync.WaitGroup, 0, 1) globals.dataCacheLineFreeLRU.pushTail(dataCacheLineTracker) } @@ -441,7 +469,7 @@ func (dataCacheLineTracker *dataCacheLineTrackerStruct) fetch() { defer globals.dataCacheActivityWG.Done() - globalsLock("cache.go:444:2:(*dataCacheLineTrackerStruct).fetch") + globalsLock("cache.go:468:2:(*dataCacheLineTrackerStruct).fetch") inode, ok = globals.inodeMap.get(dataCacheLineTracker.inodeNumber) if !ok { @@ -468,12 +496,12 @@ func (dataCacheLineTracker *dataCacheLineTrackerStruct) fetch() { globalsUnlock() readFileOutput, err = readFileWrapper(backend.context, readFileInput) - if err == nil { + if err == nil && globals.config.cacheBackend == "memory" { content = globals.dataCacheLinesContent[dataCacheLineTracker.contentStart : dataCacheLineTracker.contentStart+globals.config.cacheLineSize] dataCacheLineTracker.contentLength = uint64(copy(content, readFileOutput.buf)) } - globalsLock("cache.go:476:2:(*dataCacheLineTrackerStruct).fetch") + globalsLock("cache.go:500:2:(*dataCacheLineTrackerStruct).fetch") inode, ok = globals.inodeMap.get(dataCacheLineTracker.inodeNumber) if ok { inode.inboundCacheLineCount-- @@ -489,12 +517,34 @@ func (dataCacheLineTracker *dataCacheLineTrackerStruct) fetch() { globals.dataCacheLineInboundLRU.popThis(dataCacheLineTracker) dataCacheLineTracker.contentGeneration++ - if err != nil { - globals.logger.Printf("[WARN] [TODO] (*dataCacheLineTrackerStruct) fetch() needs to handle error reading cache line") + switch { + case err != nil: + // The backend read failed. Mark the line as failed and move it to the + // Clean LRU so waiters are released and the state machine stays consistent; + // DoRead detects .fetchFailed, surfaces EIO to the caller, and evicts the + // line (so a subsequent read re-fetches rather than serving empty content). + // Leaving it un-flagged caused DoRead to compute an inverted slice + // (contentLength==0 < offset) and panic while holding the global lock. + globals.logger.Printf("[WARN] (*dataCacheLineTrackerStruct) fetch() backend read failed for inode %d line %d: %v", dataCacheLineTracker.inodeNumber, dataCacheLineTracker.lineNumber, err) dataCacheLineTracker.contentLength = 0 dataCacheLineTracker.eTag = "" - } else { + dataCacheLineTracker.fetchFailed = true + case globals.config.cacheBackend == "disk": + // Write the fetched bytes through to the inode's backing file under + // the lock (matches the memory path, which set contentLength above). + dataCacheLineTracker.contentLength = dataCacheLineTracker.storeContentDisk(readFileOutput.buf) + if len(readFileOutput.buf) > 0 && dataCacheLineTracker.contentLength == 0 { + // storeContentDisk failed (open/WriteAt error) on a non-empty read: + // treat as a fetch failure so DoRead surfaces EIO instead of EOF. + dataCacheLineTracker.eTag = "" + dataCacheLineTracker.fetchFailed = true + } else { + dataCacheLineTracker.eTag = readFileOutput.eTag + dataCacheLineTracker.fetchFailed = false + } + default: dataCacheLineTracker.eTag = readFileOutput.eTag + dataCacheLineTracker.fetchFailed = false } globals.dataCacheLineCleanLRU.pushTail(dataCacheLineTracker) diff --git a/multi-storage-file-system/cache_disk.go b/multi-storage-file-system/cache_disk.go new file mode 100644 index 00000000..52d72fc2 --- /dev/null +++ b/multi-storage-file-system/cache_disk.go @@ -0,0 +1,138 @@ +package main + +import ( + "fmt" + "os" + "path/filepath" + "sync/atomic" +) + +// Disk cache backend (cache_backend: "disk"). +// +// Instead of holding cache-line bytes in the shared mmap'd content buffer +// (globals.dataCacheLinesContent), each inode gets its own contiguous backing +// file under /cachelines/inode_.bin. A cache line for +// (inode, lineNumber) is stored at file offset lineNumber*cacheLineSize, so a +// file's consecutive lines are physically adjacent and the kernel readahead +// helps sequential reads. Reads are served via pread (os.File.ReadAt) straight +// into the FUSE reply buffer, issued OUTSIDE the global lock (guarded by the +// dataCacheLineTracker.contentGeneration optimistic re-check in DoRead), so +// concurrent warm reads neither serialize on the lock nor pay the arena memcpy. +// +// All map/refcount mutations here assume the caller holds the global lock; the +// pwrite/punch-hole syscalls touch only the per-inode fd (positioned I/O, safe). + +// inodeDiskCacheFileStruct tracks a single inode's on-disk cache file and how +// many of its cache lines are currently resident (so the file can be closed + +// removed when it drops to zero). +type inodeDiskCacheFileStruct struct { + file *os.File + residentLines uint64 +} + +// diskReadsServed counts DoRead cache-line requests served via pread from the +// on-disk cache. Purely observational (sample-logged) to confirm the disk path +// is actually hot. +var diskReadsServed atomic.Uint64 + +// diskCacheDirPath returns /cachelines. +func diskCacheDirPath() string { + return filepath.Join(globals.cacheDir, "cachelines") +} + +// diskCacheUp prepares the on-disk cache backend. Caller holds no lock (called +// once from dataCacheUp during startup, single-threaded). +func diskCacheUp() (err error) { + globals.inodeDiskCacheFiles = make(map[uint64]*inodeDiskCacheFileStruct) + if err = os.MkdirAll(diskCacheDirPath(), 0o700); err != nil { + return fmt.Errorf("os.MkdirAll(%q) failed: %v", diskCacheDirPath(), err) + } + return nil +} + +// diskCacheDown closes every open per-inode cache file. The cachelines/ dir +// itself is removed when fs.go tears down globals.cacheDir. Caller holds the +// global lock (called from dataCacheDown after dataCacheActivityWG drains). +func diskCacheDown() { + for inodeNumber, idcf := range globals.inodeDiskCacheFiles { + if idcf.file != nil { + _ = idcf.file.Close() + } + delete(globals.inodeDiskCacheFiles, inodeNumber) + } +} + +// storeContentDisk writes a just-fetched cache line's bytes to the inode's +// backing file and records the location on the tracker. Caller holds the global +// lock. Returns the number of valid bytes stored; on any error it returns 0 and +// leaves the tracker disk fields cleared so DoRead falls through to EOF/empty +// rather than serving stale bytes. +func (dataCacheLineTracker *dataCacheLineTrackerStruct) storeContentDisk(buf []byte) (contentLength uint64) { + dataCacheLineTracker.diskFile = nil + dataCacheLineTracker.diskOffset = 0 + dataCacheLineTracker.diskLength = 0 + + if len(buf) == 0 { + return 0 + } + + idcf, ok := globals.inodeDiskCacheFiles[dataCacheLineTracker.inodeNumber] + if !ok { + path := filepath.Join(diskCacheDirPath(), fmt.Sprintf("inode_%d.bin", dataCacheLineTracker.inodeNumber)) + f, openErr := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o600) + if openErr != nil { + globals.logger.Printf("[WARN] storeContentDisk: os.OpenFile(%q) failed: %v", path, openErr) + return 0 + } + idcf = &inodeDiskCacheFileStruct{file: f, residentLines: 0} + globals.inodeDiskCacheFiles[dataCacheLineTracker.inodeNumber] = idcf + } + + offset := int64(dataCacheLineTracker.lineNumber) * int64(globals.config.cacheLineSize) + n, writeErr := idcf.file.WriteAt(buf, offset) + if writeErr != nil { + globals.logger.Printf("[WARN] storeContentDisk: WriteAt(inode=%d line=%d off=%d) failed: %v", + dataCacheLineTracker.inodeNumber, dataCacheLineTracker.lineNumber, offset, writeErr) + return 0 + } + + dataCacheLineTracker.diskFile = idcf.file + dataCacheLineTracker.diskOffset = offset + dataCacheLineTracker.diskLength = int64(n) + idcf.residentLines++ + + return uint64(n) +} + +// punchHoleDisk releases the disk bytes of an evicted/freed cache line via +// fallocate(PUNCH_HOLE) and decrements the owning inode file's resident count, +// closing + removing the file when it reaches zero. No-op if the line has no +// disk backing. Caller holds the global lock. +func (dataCacheLineTracker *dataCacheLineTrackerStruct) punchHoleDisk() { + if dataCacheLineTracker.diskFile == nil { + return + } + + if err := punchHoleSyscall(dataCacheLineTracker.diskFile, dataCacheLineTracker.diskOffset, int64(globals.config.cacheLineSize)); err != nil { + globals.logger.Printf("[WARN] punchHoleDisk(inode=%d line=%d off=%d): %v", + dataCacheLineTracker.inodeNumber, dataCacheLineTracker.lineNumber, dataCacheLineTracker.diskOffset, err) + } + + if idcf, ok := globals.inodeDiskCacheFiles[dataCacheLineTracker.inodeNumber]; ok { + if idcf.residentLines > 0 { + idcf.residentLines-- + } + if idcf.residentLines == 0 { + path := idcf.file.Name() + _ = idcf.file.Close() + if removeErr := os.Remove(path); removeErr != nil && !os.IsNotExist(removeErr) { + globals.logger.Printf("[WARN] punchHoleDisk: os.Remove(%q) failed: %v", path, removeErr) + } + delete(globals.inodeDiskCacheFiles, dataCacheLineTracker.inodeNumber) + } + } + + dataCacheLineTracker.diskFile = nil + dataCacheLineTracker.diskOffset = 0 + dataCacheLineTracker.diskLength = 0 +} diff --git a/multi-storage-file-system/cache_disk_linux.go b/multi-storage-file-system/cache_disk_linux.go new file mode 100644 index 00000000..5489fa18 --- /dev/null +++ b/multi-storage-file-system/cache_disk_linux.go @@ -0,0 +1,17 @@ +//go:build linux + +package main + +import ( + "os" + "syscall" +) + +// punchHoleSyscall releases [offset, offset+length) of f back to the filesystem +// via fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE), so an evicted cache +// line stops consuming disk space while the file's logical size (and the offsets +// of other resident lines) are preserved. +func punchHoleSyscall(f *os.File, offset, length int64) error { + const punchFlags = 0x02 | 0x01 // FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE + return syscall.Fallocate(int(f.Fd()), punchFlags, offset, length) +} diff --git a/multi-storage-file-system/cache_disk_other.go b/multi-storage-file-system/cache_disk_other.go new file mode 100644 index 00000000..73933345 --- /dev/null +++ b/multi-storage-file-system/cache_disk_other.go @@ -0,0 +1,13 @@ +//go:build !linux + +package main + +import "os" + +// punchHoleSyscall is a no-op on non-Linux platforms (no portable hole-punch). +// The disk cache backend still functions; evicted lines simply keep their disk +// space until the inode's backing file is removed. MSFS production targets are +// Linux, so this path is for local dev/test builds (e.g. macOS) only. +func punchHoleSyscall(_ *os.File, _, _ int64) error { + return nil +} diff --git a/multi-storage-file-system/cache_disk_test.go b/multi-storage-file-system/cache_disk_test.go new file mode 100644 index 00000000..e9ba5236 --- /dev/null +++ b/multi-storage-file-system/cache_disk_test.go @@ -0,0 +1,146 @@ +package main + +import ( + "bytes" + "log" + "os" + "path/filepath" + "testing" +) + +// diskTestSetup wires up the minimal globals the disk-cache helpers touch +// (config, cacheDir, logger, the per-inode file map) against a temp dir. +func diskTestSetup(t *testing.T, cacheLineSize, cacheLines uint64) { + t.Helper() + globals.logger = log.New(os.Stderr, "", 0) + globals.cacheDir = t.TempDir() + globals.config = &configStruct{ + cacheBackend: "disk", + cacheLineSize: cacheLineSize, + cacheLines: cacheLines, + } + if err := diskCacheUp(); err != nil { + t.Fatalf("diskCacheUp() failed: %v", err) + } +} + +// TestDiskCacheStoreServePunch covers the round trip: store a line to the +// per-inode backing file, serve it back via pread (the DoRead serve path), +// then punch it and confirm the file is closed + removed at zero residency. +func TestDiskCacheStoreServePunch(t *testing.T) { + const lineSize = uint64(4096) + diskTestSetup(t, lineSize, 8) + + tracker := &dataCacheLineTrackerStruct{inodeNumber: 42, lineNumber: 3} + payload := []byte("hello disk cache backend - served via pread") + + n := tracker.storeContentDisk(payload) + if n != uint64(len(payload)) { + t.Fatalf("storeContentDisk returned %d, want %d", n, len(payload)) + } + if tracker.diskFile == nil { + t.Fatal("tracker.diskFile is nil after store") + } + if tracker.diskOffset != int64(tracker.lineNumber)*int64(lineSize) { + t.Fatalf("tracker.diskOffset = %d, want %d", tracker.diskOffset, int64(tracker.lineNumber)*int64(lineSize)) + } + if tracker.diskLength != int64(len(payload)) { + t.Fatalf("tracker.diskLength = %d, want %d", tracker.diskLength, len(payload)) + } + + idcf, ok := globals.inodeDiskCacheFiles[42] + if !ok || idcf.residentLines != 1 { + t.Fatalf("inode 42 disk file not tracked with residentLines==1 (ok=%v)", ok) + } + wantPath := filepath.Join(diskCacheDirPath(), "inode_42.bin") + if _, err := os.Stat(wantPath); err != nil { + t.Fatalf("backing file %q not present: %v", wantPath, err) + } + + // Serve path: pread straight into a reply-sized buffer at diskOffset. + out := make([]byte, len(payload)) + m, err := tracker.diskFile.ReadAt(out, tracker.diskOffset) + if err != nil || m != len(payload) { + t.Fatalf("ReadAt: n=%d err=%v", m, err) + } + if !bytes.Equal(out, payload) { + t.Fatalf("served bytes mismatch: got %q want %q", out, payload) + } + + // Punch: line evicted -> residency hits 0 -> file closed + removed. + tracker.punchHoleDisk() + if tracker.diskFile != nil { + t.Fatal("tracker.diskFile not cleared after punchHoleDisk") + } + if _, ok := globals.inodeDiskCacheFiles[42]; ok { + t.Fatal("inode 42 disk file still tracked after last line punched") + } + if _, err := os.Stat(wantPath); !os.IsNotExist(err) { + t.Fatalf("backing file %q should be removed at zero residency (err=%v)", wantPath, err) + } +} + +// TestDiskCacheResidencyAccounting checks that multiple lines of one inode +// share a single backing file, residentLines tracks them, and the file is only +// removed once the last line is punched. +func TestDiskCacheResidencyAccounting(t *testing.T) { + const lineSize = uint64(1024) + diskTestSetup(t, lineSize, 16) + + lineA := &dataCacheLineTrackerStruct{inodeNumber: 7, lineNumber: 0} + lineB := &dataCacheLineTrackerStruct{inodeNumber: 7, lineNumber: 5} + + bufA := bytes.Repeat([]byte{0xAA}, int(lineSize)) + bufB := bytes.Repeat([]byte{0xBB}, 512) + + lineA.storeContentDisk(bufA) + lineB.storeContentDisk(bufB) + + idcf, ok := globals.inodeDiskCacheFiles[7] + if !ok || idcf.residentLines != 2 { + t.Fatalf("expected residentLines==2 for inode 7, got ok=%v residentLines=%v", ok, idcf.residentLines) + } + if lineA.diskFile != lineB.diskFile { + t.Fatal("two lines of the same inode should share one backing file") + } + + // Non-contiguous offsets must not collide: lineB at offset 5*lineSize. + if lineB.diskOffset != int64(5)*int64(lineSize) { + t.Fatalf("lineB.diskOffset = %d, want %d", lineB.diskOffset, int64(5)*int64(lineSize)) + } + outB := make([]byte, len(bufB)) + if _, err := lineB.diskFile.ReadAt(outB, lineB.diskOffset); err != nil { + t.Fatalf("ReadAt lineB: %v", err) + } + if !bytes.Equal(outB, bufB) { + t.Fatal("lineB served bytes mismatch") + } + + // Punch one line: file stays (residency 1). + lineA.punchHoleDisk() + if idcf, ok := globals.inodeDiskCacheFiles[7]; !ok || idcf.residentLines != 1 { + t.Fatalf("after punching lineA expected residentLines==1, got ok=%v", ok) + } + + // Punch the last line: file closed + removed. + lineB.punchHoleDisk() + if _, ok := globals.inodeDiskCacheFiles[7]; ok { + t.Fatal("inode 7 disk file should be gone after both lines punched") + } +} + +// TestDiskCacheStoreEmpty confirms an empty payload stores nothing and leaves +// the tracker with no disk backing (so DoRead treats it as EOF, not garbage). +func TestDiskCacheStoreEmpty(t *testing.T) { + diskTestSetup(t, 4096, 8) + tracker := &dataCacheLineTrackerStruct{inodeNumber: 1, lineNumber: 0} + if n := tracker.storeContentDisk([]byte{}); n != 0 { + t.Fatalf("storeContentDisk(empty) = %d, want 0", n) + } + if tracker.diskFile != nil { + t.Fatal("empty store should not set diskFile") + } + if len(globals.inodeDiskCacheFiles) != 0 { + t.Fatal("empty store should not create a backing file") + } +} diff --git a/multi-storage-file-system/config.go b/multi-storage-file-system/config.go index f0f88870..1e537d3c 100644 --- a/multi-storage-file-system/config.go +++ b/multi-storage-file-system/config.go @@ -828,6 +828,16 @@ func checkConfigFile() (err error) { return } + config.cacheBackend, ok = parseString(configFileMap, "cache_backend", "memory") + if !ok { + err = errors.New("bad cache_backend value") + return + } + if config.cacheBackend != "memory" && config.cacheBackend != "disk" { + err = fmt.Errorf("bad cache_backend value (%q): must be \"memory\" or \"disk\"", config.cacheBackend) + return + } + config.metadataCachePagingMode, ok = parseString(configFileMap, "metadata_cache_paging_mode", "pebble") if !ok { err = errors.New("bad metadata_cache_paging_mode value") @@ -1305,6 +1315,12 @@ func checkConfigFile() (err error) { err = fmt.Errorf("bad AIStore.timeout at backends[%v (\"%s\")]", backendsAsInterfaceSliceIndex, backendAsStructNew.dirName) return } + + backendConfigAIStoreAsStruct.manifestGenBackendName, ok = parseString(backendConfigAIStoreAsMap, "manifest_gen_backend", "") + if !ok { + err = fmt.Errorf("bad AIStore.manifest_gen_backend at backends[%v (\"%s\")]", backendsAsInterfaceSliceIndex, backendAsStructNew.dirName) + return + } } else { backendConfigAIStoreAsStruct = &backendConfigAIStoreStruct{ endpoint: os.Getenv("AIS_ENDPOINT"), @@ -1790,6 +1806,36 @@ func checkConfigFile() (err error) { } } + // Resolve AIStore `manifest_gen_backend` references now that all backends are + // parsed. Each reference must name another, non-AIStore backend in this config; + // that backend serves LIST/STAT-DIR while reads continue via AIS. + for _, aisBackend := range config.backends { + if aisBackend.backendType != "AIStore" { + continue + } + aisCfg := aisBackend.backendTypeSpecifics.(*backendConfigAIStoreStruct) + if aisCfg.manifestGenBackendName == "" { + continue + } + if aisCfg.manifestGenBackendName == aisBackend.dirName { + err = fmt.Errorf("AIStore backend %q manifest_gen_backend cannot reference itself", aisBackend.dirName) + return + } + srcBackend, found := config.backends[aisCfg.manifestGenBackendName] + if !found { + err = fmt.Errorf("AIStore backend %q manifest_gen_backend %q not found among configured backends", aisBackend.dirName, aisCfg.manifestGenBackendName) + return + } + if srcBackend.backendType == "AIStore" { + err = fmt.Errorf("AIStore backend %q manifest_gen_backend %q must not be an AIStore backend", aisBackend.dirName, aisCfg.manifestGenBackendName) + return + } + if srcBackend.bucketContainerName != aisBackend.bucketContainerName || srcBackend.prefix != aisBackend.prefix { + globals.logger.Printf("[WARN] an AIStore manifest_gen_backend targets a different bucket/prefix than its AIStore backend; listing and reads may not align") + } + aisCfg.manifestGenBackend = srcBackend + } + if globals.config == nil { // Move all (local) config.backends to globals.backendsToMount @@ -1915,6 +1961,11 @@ func checkConfigFile() (err error) { return } + if globals.config.cacheBackend != config.cacheBackend { + err = errors.New("cannot change cache_backend via SIGHUP") + return + } + if globals.config.metadataCachePagingMode != config.metadataCachePagingMode { err = errors.New("cannot change metadata_cache_paging_mode via SIGHUP") return @@ -2161,6 +2212,11 @@ func checkConfigFile() (err error) { err = fmt.Errorf("cannot change AIStore.timeout in backends[\"%s\"]", dirName) return } + + if backendAsStructOld.backendTypeSpecifics.(*backendConfigAIStoreStruct).manifestGenBackendName != backendAsStructNew.backendTypeSpecifics.(*backendConfigAIStoreStruct).manifestGenBackendName { + err = fmt.Errorf("cannot change AIStore.manifest_gen_backend in backends[\"%s\"]", dirName) + return + } case "GCS": if backendAsStructOld.backendTypeSpecifics.(*backendConfigGCSStruct).apiKey != backendAsStructNew.backendTypeSpecifics.(*backendConfigGCSStruct).apiKey { err = fmt.Errorf("cannot change GCS.api_key in backends[\"%s\"]", dirName) diff --git a/multi-storage-file-system/config_test.go b/multi-storage-file-system/config_test.go index bfba77fc..8c5c334f 100644 --- a/multi-storage-file-system/config_test.go +++ b/multi-storage-file-system/config_test.go @@ -705,6 +705,109 @@ backends: [ } } +// TestManifestGenBackendReference verifies that an AIStore backend's +// manifest_gen_backend resolves to another configured backend. +func TestManifestGenBackendReference(t *testing.T) { + var ( + err error + ) + + initGlobals(testOsArgs(testGlobals.testConfigFilePathMap[".yaml"])) + + err = os.WriteFile(globals.configFilePath, []byte(` +msfs_version: 1 +backends: [ + { + dir_name: ais, + bucket_container_name: test, + prefix: "p/", + backend_type: AIStore, + AIStore: { + endpoint: "http://10.0.0.1:51080", + provider: s3, + manifest_gen_backend: s3src, + }, + }, + { + dir_name: s3src, + bucket_container_name: test, + prefix: "p/", + backend_type: S3, + S3: { + region: us-east-1, + endpoint: "http://minio:9000", + access_key_id: minioadmin, + secret_access_key: minioadmin, + }, + }, +] +`), 0o600) + if err != nil { + t.Fatalf("os.WriteFile() failed: %v", err) + } + + err = checkConfigFile() + if err != nil { + t.Fatalf("checkConfigFile() unexpectedly failed: %v", err) + } + + // On first load, checkConfigFile() moves backends into globals.backendsToMount. + aisBackend, ok := globals.backendsToMount["ais"] + if !ok { + t.Fatalf("expected backend \"ais\" to be configured") + } + aisCfg, ok := aisBackend.backendTypeSpecifics.(*backendConfigAIStoreStruct) + if !ok { + t.Fatalf("expected backend \"ais\" backendTypeSpecifics to be *backendConfigAIStoreStruct") + } + if aisCfg.manifestGenBackendName != "s3src" { + t.Errorf("expected manifestGenBackendName \"s3src\", got %q", aisCfg.manifestGenBackendName) + } + if aisCfg.manifestGenBackend == nil { + t.Fatalf("expected manifest_gen_backend to resolve, got nil") + } + if aisCfg.manifestGenBackend.dirName != "s3src" { + t.Errorf("expected resolved manifest_gen_backend dirName \"s3src\", got %q", aisCfg.manifestGenBackend.dirName) + } + if aisCfg.manifestGenBackend.backendType != "S3" { + t.Errorf("expected resolved manifest_gen_backend backendType \"S3\", got %q", aisCfg.manifestGenBackend.backendType) + } +} + +// TestManifestGenBackendMissing verifies that referencing a non-existent backend +// via manifest_gen_backend is rejected. +func TestManifestGenBackendMissing(t *testing.T) { + var ( + err error + ) + + initGlobals(testOsArgs(testGlobals.testConfigFilePathMap[".yaml"])) + + err = os.WriteFile(globals.configFilePath, []byte(` +msfs_version: 1 +backends: [ + { + dir_name: ais, + bucket_container_name: test, + backend_type: AIStore, + AIStore: { + endpoint: "http://10.0.0.1:51080", + provider: s3, + manifest_gen_backend: does_not_exist, + }, + }, +] +`), 0o600) + if err != nil { + t.Fatalf("os.WriteFile() failed: %v", err) + } + + err = checkConfigFile() + if err == nil { + t.Fatalf("checkConfigFile() unexpectedly succeeded with a dangling manifest_gen_backend reference") + } +} + func TestConfigFileBadConfigFileUpdate(t *testing.T) { var ( err error diff --git a/multi-storage-file-system/fission.go b/multi-storage-file-system/fission.go index 8e671500..b0ad2ae2 100644 --- a/multi-storage-file-system/fission.go +++ b/multi-storage-file-system/fission.go @@ -2,8 +2,10 @@ package main import ( "fmt" + "io" "log" "math" + "os" "sync" "syscall" "time" @@ -43,6 +45,18 @@ const ( fission.FOpenResponseDirectIO ) +// computeOpenOutFlags returns the FOPEN flags for the DoOpen response. In the +// disk cache backend (cache_backend: "disk") DirectIO is dropped so the kernel +// page cache can absorb repeated warm reads of the FUSE replies (matching s3fs, +// which sets no FOPEN flags). In memory mode (default) DirectIO stays on so +// every read round-trips to MSFS and the in-process cache is the sole authority. +func computeOpenOutFlags() uint32 { + if globals.config != nil && globals.config.cacheBackend == "disk" { + return uint32(0) + } + return openOutFlags +} + // `performFissionMount` is called to do the single FUSE mount at startup. func performFissionMount() (err error) { var ( @@ -157,7 +171,7 @@ func (*globalsStruct) DoLookup(inHeader *fission.InHeader, lookupIn *fission.Loo defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:160:3:funcLit@158") + globalsLock("fission.go:174:3:funcLit@172") if errno == 0 { globals.fissionMetrics.LookupSuccesses.Inc() globals.fissionMetrics.LookupSuccessLatencies.Observe(latency) @@ -176,7 +190,7 @@ func (*globalsStruct) DoLookup(inHeader *fission.InHeader, lookupIn *fission.Loo globalsUnlock() }() - globalsLock("fission.go:179:2:(*globalsStruct).DoLookup") + globalsLock("fission.go:193:2:(*globalsStruct).DoLookup") parentInode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -298,7 +312,7 @@ func (*globalsStruct) DoGetAttr(inHeader *fission.InHeader, getAttrIn *fission.G defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:301:3:funcLit@299") + globalsLock("fission.go:315:3:funcLit@313") if errno == 0 { globals.fissionMetrics.GetAttrSuccesses.Inc() globals.fissionMetrics.GetAttrSuccessLatencies.Observe(latency) @@ -317,7 +331,7 @@ func (*globalsStruct) DoGetAttr(inHeader *fission.InHeader, getAttrIn *fission.G globalsUnlock() }() - globalsLock("fission.go:320:2:(*globalsStruct).DoGetAttr") + globalsLock("fission.go:334:2:(*globalsStruct).DoGetAttr") thisInode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -438,7 +452,7 @@ func (*globalsStruct) DoMkDir(inHeader *fission.InHeader, mkDirIn *fission.MkDir defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:441:3:funcLit@439") + globalsLock("fission.go:455:3:funcLit@453") if errno == 0 { globals.fissionMetrics.MkDirSuccesses.Inc() globals.fissionMetrics.MkDirSuccessLatencies.Observe(latency) @@ -457,7 +471,7 @@ func (*globalsStruct) DoMkDir(inHeader *fission.InHeader, mkDirIn *fission.MkDir globalsUnlock() }() - globalsLock("fission.go:460:2:(*globalsStruct).DoMkDir") + globalsLock("fission.go:474:2:(*globalsStruct).DoMkDir") parentInode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -561,7 +575,7 @@ func (*globalsStruct) DoUnlink(inHeader *fission.InHeader, unlinkIn *fission.Unl // Record metrics on function exit defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:564:3:funcLit@562") + globalsLock("fission.go:578:3:funcLit@576") if errno == 0 { globals.fissionMetrics.UnlinkSuccesses.Inc() globals.fissionMetrics.UnlinkSuccessLatencies.Observe(latency) @@ -580,7 +594,7 @@ func (*globalsStruct) DoUnlink(inHeader *fission.InHeader, unlinkIn *fission.Unl globalsUnlock() }() - globalsLock("fission.go:583:2:(*globalsStruct).DoUnlink") + globalsLock("fission.go:597:2:(*globalsStruct).DoUnlink") parentInode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -677,7 +691,7 @@ func (*globalsStruct) DoRmDir(inHeader *fission.InHeader, rmDirIn *fission.RmDir defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:680:3:funcLit@678") + globalsLock("fission.go:694:3:funcLit@692") if errno == 0 { globals.fissionMetrics.RmDirSuccesses.Inc() globals.fissionMetrics.RmDirSuccessLatencies.Observe(latency) @@ -696,7 +710,7 @@ func (*globalsStruct) DoRmDir(inHeader *fission.InHeader, rmDirIn *fission.RmDir globalsUnlock() }() - globalsLock("fission.go:699:2:(*globalsStruct).DoRmDir") + globalsLock("fission.go:713:2:(*globalsStruct).DoRmDir") parentInode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -843,7 +857,7 @@ func (*globalsStruct) DoOpen(inHeader *fission.InHeader, openIn *fission.OpenIn) defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:846:3:funcLit@844") + globalsLock("fission.go:860:3:funcLit@858") if errno == 0 { globals.fissionMetrics.OpenSuccesses.Inc() globals.fissionMetrics.OpenSuccessLatencies.Observe(latency) @@ -862,7 +876,7 @@ func (*globalsStruct) DoOpen(inHeader *fission.InHeader, openIn *fission.OpenIn) globalsUnlock() }() - globalsLock("fission.go:865:2:(*globalsStruct).DoOpen") + globalsLock("fission.go:879:2:(*globalsStruct).DoOpen") inode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -938,7 +952,7 @@ func (*globalsStruct) DoOpen(inHeader *fission.InHeader, openIn *fission.OpenIn) openOut = &fission.OpenOut{ FH: fh.nonce, - OpenFlags: openOutFlags, + OpenFlags: computeOpenOutFlags(), Padding: 0, } @@ -961,6 +975,13 @@ func (*globalsStruct) DoRead(inHeader *fission.InHeader, readIn *fission.ReadIn) cacheLineWaiter sync.WaitGroup cacheLineWaits uint64 cacheLinesToPotentiallyPrefetch uint64 + copyDstStart int // len(readOut.Data) snapshot for optimistic-copy rollback + copyGeneration uint64 // dataCacheLineTracker.contentGeneration snapshot + copyLength uint64 + copySrcStart uint64 + copyDiskFile *os.File // [cache_backend == "disk"] snapshot of the line's backing file for unlocked pread + copyDiskOffset int64 // [cache_backend == "disk"] byte offset within copyDiskFile to pread from + diskReadErr error curOffset = readIn.Offset dataCacheLineNumber uint64 dataCacheLineNumbers []uint64 @@ -979,7 +1000,7 @@ func (*globalsStruct) DoRead(inHeader *fission.InHeader, readIn *fission.ReadIn) defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:982:3:funcLit@980") + globalsLock("fission.go:1003:3:funcLit@1001") if errno == 0 { globals.fissionMetrics.ReadSuccesses.Inc() globals.fissionMetrics.ReadSuccessLatencies.Observe(latency) @@ -1043,7 +1064,7 @@ func (*globalsStruct) DoRead(inHeader *fission.InHeader, readIn *fission.ReadIn) } for len(readOut.Data) < cap(readOut.Data) { - globalsLock("fission.go:1046:3:(*globalsStruct).DoRead") + globalsLock("fission.go:1067:3:(*globalsStruct).DoRead") inode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -1134,7 +1155,7 @@ func (*globalsStruct) DoRead(inHeader *fission.InHeader, readIn *fission.ReadIn) dataCacheLineNumbers, _ = allocateDataCacheLines(1 + uint64(len(prefetchCacheLineNumbers))) - globalsLock("fission.go:1137:4:(*globalsStruct).DoRead") + globalsLock("fission.go:1158:4:(*globalsStruct).DoRead") inode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -1287,6 +1308,19 @@ func (*globalsStruct) DoRead(inHeader *fission.InHeader, readIn *fission.ReadIn) globals.logger.Fatalf("[FATAL] dataCacheLineTracker.state(%v) != CacheLineClean(%v)", dataCacheLineTracker.state, CacheLineClean) } + if dataCacheLineTracker.fetchFailed { + // The backend read that was supposed to populate this cache line + // failed. Surface EIO to the caller rather than serving empty/short + // content (which previously produced an inverted slice and panicked), + // and evict the line so a subsequent read re-fetches it. + delete(inode.cacheMap, cacheLineNumber) + globals.dataCacheLineCleanLRU.popThis(dataCacheLineTracker) + dataCacheLineTracker.free() + globalsUnlock() + errno = syscall.EIO + return + } + cacheLineHits++ // Note that this is the fall-thru condition that counts resolved (cacheLine)Misses & (cacheLine)Waits as (subsequent) Hits dataCacheLineTracker.touch() @@ -1301,18 +1335,71 @@ func (*globalsStruct) DoRead(inHeader *fission.InHeader, readIn *fission.ReadIn) cacheLineOffsetLimit = dataCacheLineTracker.contentLength } - if cacheLineOffsetLimit == cacheLineOffsetStart { - // We have reached EOF + if cacheLineOffsetLimit <= cacheLineOffsetStart { + // We have reached EOF (==) or, defensively, a short/empty cache line + // where the limit fell below the start (<). Never slice with lo > hi. globalsUnlock() break } - readOut.Data = append(readOut.Data, globals.dataCacheLinesContent[dataCacheLineTracker.contentStart+cacheLineOffsetStart:dataCacheLineTracker.contentStart+cacheLineOffsetLimit]...) - curOffset += cacheLineOffsetLimit - cacheLineOffsetStart + // Optimistic-lock copy: perform the cache-line -> reply memcpy WITHOUT + // holding the global lock, so concurrent warm reads do not serialize on + // it (the copy-under-lock was the warm-read throughput ceiling at high + // thread counts). globals.dataCacheLinesContent (the cache-line content buffer) and + // globals.dataCacheLinesTracker are fixed allocations for the life of the + // mount, so reading from them after unlock is memory-safe. The only hazard + // is that this line is evicted/refetched mid-copy (yielding wrong bytes); + // .contentGeneration is bumped on every free()/fetch(), so we snapshot it + // under the lock, copy unlocked, then re-validate under the lock and retry + // the same offset on mismatch. + copyGeneration = dataCacheLineTracker.contentGeneration + copyLength = cacheLineOffsetLimit - cacheLineOffsetStart + copyDstStart = len(readOut.Data) + if globals.config.cacheBackend == "disk" { + copyDiskFile = dataCacheLineTracker.diskFile + copyDiskOffset = dataCacheLineTracker.diskOffset + int64(cacheLineOffsetStart) + } else { + copySrcStart = dataCacheLineTracker.contentStart + cacheLineOffsetStart + } globalsUnlock() + + if copyDiskFile != nil { + // Disk backend: pread straight into the reply buffer's backing array + // (no intermediate memcpy), outside the lock. A concurrent eviction + // bumps contentGeneration (re-checked below); and a stable generation + // means this very line is still resident, so its inode file cannot + // have been closed under us. + readOut.Data = readOut.Data[:copyDstStart+int(copyLength)] + _, diskReadErr = copyDiskFile.ReadAt(readOut.Data[copyDstStart:copyDstStart+int(copyLength)], copyDiskOffset) + } else { + readOut.Data = append(readOut.Data, globals.dataCacheLinesContent[copySrcStart:copySrcStart+copyLength]...) + } + + globalsLock("fission.go:1381:3:(*globalsStruct).DoRead") + + if dataCacheLineTracker.contentGeneration != copyGeneration { + // The cache line was evicted/refetched while we read; discard the + // bytes and retry this same offset. Any diskReadErr was a casualty of + // that eviction, so ignore it. + readOut.Data = readOut.Data[:copyDstStart] + globalsUnlock() + continue + } + + if diskReadErr != nil && diskReadErr != io.EOF { + // Generation stable but the pread genuinely failed -> surface EIO. + readOut.Data = readOut.Data[:copyDstStart] + globalsUnlock() + errno = syscall.EIO + return + } + + globalsUnlock() + + curOffset += copyLength } errno = 0 @@ -1328,7 +1415,7 @@ func (*globalsStruct) DoWrite(inHeader *fission.InHeader, writeIn *fission.Write // `DoStatFS` implements the package fission callback to fetch statistics about this FUSE file system. func (*globalsStruct) DoStatFS(inHeader *fission.InHeader) (statFSOut *fission.StatFSOut, errno syscall.Errno) { - globalsLock("fission.go:1331:2:(*globalsStruct).DoStatFS") + globalsLock("fission.go:1418:2:(*globalsStruct).DoStatFS") statFSOut = &fission.StatFSOut{ KStatFS: fission.KStatFS{ @@ -1366,7 +1453,7 @@ func (*globalsStruct) DoRelease(inHeader *fission.InHeader, releaseIn *fission.R defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:1369:3:funcLit@1367") + globalsLock("fission.go:1456:3:funcLit@1454") if errno == 0 { globals.fissionMetrics.ReleaseSuccesses.Inc() globals.fissionMetrics.ReleaseSuccessLatencies.Observe(latency) @@ -1385,7 +1472,7 @@ func (*globalsStruct) DoRelease(inHeader *fission.InHeader, releaseIn *fission.R globalsUnlock() }() - globalsLock("fission.go:1388:2:(*globalsStruct).DoRelease") + globalsLock("fission.go:1475:2:(*globalsStruct).DoRelease") inode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -1525,7 +1612,7 @@ func (*globalsStruct) DoOpenDir(inHeader *fission.InHeader, openDirIn *fission.O defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:1528:3:funcLit@1526") + globalsLock("fission.go:1615:3:funcLit@1613") if errno == 0 { globals.fissionMetrics.OpenDirSuccesses.Inc() globals.fissionMetrics.OpenDirSuccessLatencies.Observe(latency) @@ -1544,7 +1631,7 @@ func (*globalsStruct) DoOpenDir(inHeader *fission.InHeader, openDirIn *fission.O globalsUnlock() }() - globalsLock("fission.go:1547:2:(*globalsStruct).DoOpenDir") + globalsLock("fission.go:1634:2:(*globalsStruct).DoOpenDir") inode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -1685,7 +1772,7 @@ func (*globalsStruct) DoReadDir(inHeader *fission.InHeader, readDirIn *fission.R } latency = time.Since(startTime).Seconds() - globalsLock("fission.go:1688:3:funcLit@1681") + globalsLock("fission.go:1775:3:funcLit@1768") if errno == 0 { globals.fissionMetrics.ReadDirSuccesses.Inc() globals.fissionMetrics.ReadDirSuccessLatencies.Observe(latency) @@ -1723,7 +1810,7 @@ func (*globalsStruct) DoReadDir(inHeader *fission.InHeader, readDirIn *fission.R curReadDirOutSize = 0 curOffset = readDirIn.Offset - globalsLock("fission.go:1726:2:(*globalsStruct).DoReadDir") + globalsLock("fission.go:1813:2:(*globalsStruct).DoReadDir") Restart: @@ -1846,7 +1933,7 @@ Restart: listDirectoryOutput, err = listDirectoryWrapper(backend.context, listDirectoryInput) - globalsLock("fission.go:1849:4:(*globalsStruct).DoReadDir") + globalsLock("fission.go:1936:4:(*globalsStruct).DoReadDir") fh.listDirectoryInProgress = false @@ -1962,7 +2049,7 @@ func (*globalsStruct) DoReleaseDir(inHeader *fission.InHeader, releaseDirIn *fis defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:1965:3:funcLit@1963") + globalsLock("fission.go:2052:3:funcLit@2050") if errno == 0 { globals.fissionMetrics.ReleaseDirSuccesses.Inc() globals.fissionMetrics.ReleaseDirSuccessLatencies.Observe(latency) @@ -1981,7 +2068,7 @@ func (*globalsStruct) DoReleaseDir(inHeader *fission.InHeader, releaseDirIn *fis globalsUnlock() }() - globalsLock("fission.go:1984:2:(*globalsStruct).DoReleaseDir") + globalsLock("fission.go:2071:2:(*globalsStruct).DoReleaseDir") inode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -2086,7 +2173,7 @@ func (*globalsStruct) DoCreate(inHeader *fission.InHeader, createIn *fission.Cre defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:2089:3:funcLit@2087") + globalsLock("fission.go:2176:3:funcLit@2174") if errno == 0 { globals.fissionMetrics.CreateSuccesses.Inc() globals.fissionMetrics.CreateSuccessLatencies.Observe(latency) @@ -2105,7 +2192,7 @@ func (*globalsStruct) DoCreate(inHeader *fission.InHeader, createIn *fission.Cre globalsUnlock() }() - globalsLock("fission.go:2108:2:(*globalsStruct).DoCreate") + globalsLock("fission.go:2195:2:(*globalsStruct).DoCreate") parentInode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { @@ -2307,7 +2394,7 @@ func (*globalsStruct) DoReadDirPlus(inHeader *fission.InHeader, readDirPlusIn *f } latency = time.Since(startTime).Seconds() - globalsLock("fission.go:2310:3:funcLit@2303") + globalsLock("fission.go:2397:3:funcLit@2390") if errno == 0 { globals.fissionMetrics.ReadDirPlusSuccesses.Inc() globals.fissionMetrics.ReadDirPlusSuccessLatencies.Observe(latency) @@ -2347,7 +2434,7 @@ func (*globalsStruct) DoReadDirPlus(inHeader *fission.InHeader, readDirPlusIn *f entryAttrValidSec, entryAttrValidNSec = timeDurationToAttrDuration(globals.config.entryAttrTTL) - globalsLock("fission.go:2350:2:(*globalsStruct).DoReadDirPlus") + globalsLock("fission.go:2437:2:(*globalsStruct).DoReadDirPlus") Restart: @@ -2632,7 +2719,7 @@ Restart: listDirectoryOutput, err = listDirectoryWrapper(backend.context, listDirectoryInput) - globalsLock("fission.go:2635:4:(*globalsStruct).DoReadDirPlus") + globalsLock("fission.go:2722:4:(*globalsStruct).DoReadDirPlus") fh.listDirectoryInProgress = false @@ -2769,7 +2856,7 @@ func (*globalsStruct) DoStatX(inHeader *fission.InHeader, statXIn *fission.StatX defer func() { latency = time.Since(startTime).Seconds() - globalsLock("fission.go:2772:3:funcLit@2770") + globalsLock("fission.go:2859:3:funcLit@2857") if errno == 0 { globals.fissionMetrics.StatXSuccesses.Inc() globals.fissionMetrics.StatXSuccessLatencies.Observe(latency) @@ -2788,7 +2875,7 @@ func (*globalsStruct) DoStatX(inHeader *fission.InHeader, statXIn *fission.StatX globalsUnlock() }() - globalsLock("fission.go:2791:2:(*globalsStruct).DoStatX") + globalsLock("fission.go:2878:2:(*globalsStruct).DoStatX") thisInode, ok = globals.inodeMap.get(inHeader.NodeID) if !ok { diff --git a/multi-storage-file-system/fission_test.go b/multi-storage-file-system/fission_test.go index 1fcc4bf4..14758124 100644 --- a/multi-storage-file-system/fission_test.go +++ b/multi-storage-file-system/fission_test.go @@ -1722,3 +1722,99 @@ func TestFissionConvertPhysicalToVirtual(t *testing.T) { t.Fatalf("DoLookup(ram,Name:\"dir2\") failed after conversion (errno: %v)", errno) } } + +// TestFissionDoReadFetchFailureReturnsEIO is a regression test for the cache-line +// fetch-failure deadlock. When a backend read failed, fetch() used to leave the +// cache line marked Clean with contentLength==0; a subsequent DoRead then computed +// an inverted slice (offsetLimit < offsetStart) and PANICKED while holding the +// global lock. DoRead's deferred metrics func re-acquired that non-reentrant lock +// during panic unwinding and self-deadlocked, wedging the entire filesystem. +// +// After the fix, DoRead must surface EIO for a failed cache line and evict it, +// without panicking and without leaking the global lock. +func TestFissionDoReadFetchFailureReturnsEIO(t *testing.T) { + var ( + errno syscall.Errno + fh uint64 + fileBIno uint64 + inHeader *fission.InHeader + inode *inodeStruct + lookupIn *fission.LookupIn + lookupOut *fission.LookupOut + ok bool + openOut *fission.OpenOut + ramDirIno uint64 + tracker *dataCacheLineTrackerStruct + ) + + fissionTestUp(t) + defer fissionTestDown(t) + + // Navigate to the RAM-backed fileB. + inHeader = &fission.InHeader{NodeID: FUSERootDirInodeNumber} + lookupIn = &fission.LookupIn{Name: []byte("ram")} + lookupOut, errno = globals.DoLookup(inHeader, lookupIn) + if errno != 0 { + t.Fatalf("DoLookup(root,\"ram\") failed (errno: %v)", errno) + } + ramDirIno = lookupOut.EntryOut.NodeID + + inHeader = &fission.InHeader{NodeID: ramDirIno} + lookupIn = &fission.LookupIn{Name: []byte("fileB")} + lookupOut, errno = globals.DoLookup(inHeader, lookupIn) + if errno != 0 { + t.Fatalf("DoLookup(ram,\"fileB\") failed (errno: %v)", errno) + } + fileBIno = lookupOut.EntryOut.NodeID + + // Open fileB read-only to obtain a file handle. + inHeader = &fission.InHeader{NodeID: fileBIno} + openOut, errno = globals.DoOpen(inHeader, &fission.OpenIn{Flags: fission.FOpenRequestRDONLY}) + if errno != 0 { + t.Fatalf("DoOpen(fileB, RDONLY) failed (errno: %v)", errno) + } + fh = openOut.FH + + // Inject a poisoned cache line for line 0: Clean but with a failed fetch + // (contentLength == 0) — exactly the state fetch() leaves on a backend error. + // Setting it up directly keeps the subsequent read on the cache-hit path and + // avoids depending on a flaky backend. + globalsLock("fission_test.go:1782:2:TestFissionDoReadFetchFailureReturnsEIO") + inode, ok = globals.inodeMap.get(fileBIno) + if !ok { + globalsUnlock() + t.Fatalf("inodeMap.get(fileBIno) returned !ok") + } + if inode.cacheMap == nil { + inode.cacheMap = make(map[uint64]uint64) + } + tracker = globals.dataCacheLineFreeLRU.popHead() + if tracker == nil { + globalsUnlock() + t.Fatalf("dataCacheLineFreeLRU.popHead() returned nil (no free cache lines)") + } + tracker.inodeNumber = inode.inodeNumber + tracker.lineNumber = 0 + tracker.contentLength = 0 + tracker.eTag = "" + tracker.fetchFailed = true + inode.cacheMap[0] = tracker.pos + globals.dataCacheLineCleanLRU.pushTail(tracker) + globalsUnlock() + + // Read line 0. Before the fix this panicked while holding the global lock and + // deadlocked the whole filesystem; it must now cleanly return EIO. + inHeader = &fission.InHeader{NodeID: fileBIno} + _, errno = globals.DoRead(inHeader, &fission.ReadIn{FH: fh, Offset: 0, Size: testFissionReadBufSize}) + if errno != syscall.EIO { + t.Fatalf("DoRead over a failed cache line returned errno %v (expected EIO %v)", errno, syscall.EIO) + } + + // The failed line must have been evicted so a later read re-fetches it. + globalsLock("fission_test.go:1814:2:TestFissionDoReadFetchFailureReturnsEIO") + _, ok = inode.cacheMap[0] + globalsUnlock() + if ok { + t.Fatalf("failed cache line was not evicted from inode.cacheMap after EIO") + } +} diff --git a/multi-storage-file-system/fs.go b/multi-storage-file-system/fs.go index 0bb1103e..1b34307b 100644 --- a/multi-storage-file-system/fs.go +++ b/multi-storage-file-system/fs.go @@ -173,10 +173,16 @@ func processToMountList() { for dirName, backend = range globals.backendsToMount { delete(globals.backendsToMount, dirName) - err = backend.setupContext() - if err != nil { - globals.logger.Printf("[WARN] unable to setup backend context: %s (err: %v) [skipping]", dirName, err) - continue + // Skip if already set up. An AIStore backend referencing this one via + // manifest_gen_backend may have set up its context already (lazy setup); + // re-running setupContext would build a duplicate client and replace the + // context the AIStore backend already captured. + if backend.context == nil { + err = backend.setupContext() + if err != nil { + globals.logger.Printf("[WARN] unable to set up a backend context; skipping (check the backend's credentials/endpoint/prefix)") + continue + } } backend.nonce = fetchNonce() @@ -242,7 +248,7 @@ func processToMountList() { // `processToUnmountList` is called to remove each backend subdirectory of the FUSE // file system's root directory found on the globals.backendsToUnmount list. func processToUnmountList() { - globalsLock("fs.go:245:2:processToUnmountList") + globalsLock("fs.go:251:2:processToUnmountList") processToUnmountListAlreadyLocked() globalsUnlock() } @@ -822,7 +828,7 @@ func inodeEvictor() { for { select { case <-ticker.C: - globalsLock("fs.go:825:4:inodeEvictor") + globalsLock("fs.go:831:4:inodeEvictor") // Scan globals.inodeEvictionLRU looking for expired inodes to evict @@ -1101,7 +1107,7 @@ func prefetchDirectory(dirInodeNumber uint64) { startTime = time.Now() ) - globalsLock("fs.go:1104:2:prefetchDirectory") + globalsLock("fs.go:1110:2:prefetchDirectory") dirInode, ok = globals.inodeMap.get(dirInodeNumber) if !ok { @@ -1130,7 +1136,7 @@ func prefetchDirectory(dirInodeNumber uint64) { globals.logger.Printf("[WARN] listDirectoryWrapper(dirInode.backend.context, listDirectoryInput) failed: %v", err) } - globalsLock("fs.go:1133:3:prefetchDirectory") + globalsLock("fs.go:1139:3:prefetchDirectory") dirInode, ok = globals.inodeMap.get(dirInodeNumber) if !ok { @@ -1296,7 +1302,7 @@ func dumpFS(w io.Writer) { rootDirInode *inodeStruct ) - globalsLock("fs.go:1299:2:dumpFS") + globalsLock("fs.go:1305:2:dumpFS") rootDirInode, ok = globals.inodeMap.get(FUSERootDirInodeNumber) if !ok { @@ -1460,7 +1466,7 @@ func (thisInode *inodeStruct) finishPendingDelete() { Restart: - globalsLock("fs.go:1463:2:(*inodeStruct).finishPendingDelete") + globalsLock("fs.go:1469:2:(*inodeStruct).finishPendingDelete") // Let's just drop cache lines that are either "clean" io "dirty" diff --git a/multi-storage-file-system/globals.go b/multi-storage-file-system/globals.go index 6c28d99e..60c4fba2 100644 --- a/multi-storage-file-system/globals.go +++ b/multi-storage-file-system/globals.go @@ -31,6 +31,16 @@ type backendConfigAIStoreStruct struct { authnTokenFile string // JSON/YAML "authn_token_file" default:"${AIS_AUTHN_TOKEN_FILE:=~/.config/ais/cli/auth.token}" provider string // JSON/YAML "provider" default:"s3" timeout time.Duration // JSON/YAML "timeout" default:30000 + // `manifestGenBackendName` (JSON/YAML "manifest_gen_backend", default "") names + // another backend defined in this config (by its dir_name). When set, the AIStore + // backend routes LIST/STAT-DIR operations (manifest generation, readdir) to that + // backend's context (e.g. direct S3/GCS) while object reads (readFile/statFile) + // still flow through AIS so its caching/fast-tier applies. This avoids AIS's slow + // cold cross-cloud listing without re-declaring the source backend's credentials. + manifestGenBackendName string + // `manifestGenBackend` is the resolved pointer to the backend named by + // manifestGenBackendName (set in a post-parse resolution pass). Runtime state. + manifestGenBackend *backendStruct } // `backendConfigGCSStruct` describes a backend's GCS-specific settings. @@ -168,6 +178,7 @@ type configStruct struct { dirtyCacheLinesFlushTrigger uint64 // JSON/YAML "dirty_cache_lines_flush_trigger" default:80 (as a percentage) dirtyCacheLinesMax uint64 // JSON/YAML "dirty_cache_lines_max" default:90 (as a percentage) cacheDirPath string // JSON/YAML "cache_dir_path" default:"" + cacheBackend string // JSON/YAML "cache_backend" ("memory"|"disk") default:"memory" metadataCachePagingMode string // JSON/YAML "metadata_cache_paging_mode" default:"pebble" pebbleCacheSize uint64 // JSON/YAML "pebble_cache_size" default:33554432 (32Mi) pebbleL0CompactionFileThreshold uint64 // JSON/YAML "pebble_l0_compaction_file_threshold" default:4 @@ -332,6 +343,10 @@ type dataCacheLineTrackerStruct struct { inodeNumber uint64 // Reference to an inodeStruct.inodeNumber lineNumber uint64 // Identifies file/object range covered by content as up to [lineNumber * globals.config.cacheLineSize:(lineNumber + 1) * global.config.cacheLineSize) eTag string // If state == CacheLineClean, value of inodeStruct.eTag when when fetched from backend; Otherwise, == "" + fetchFailed bool // Set when the backend read populating this line failed; DoRead surfaces this as EIO and evicts the line instead of serving empty/short content + diskFile *os.File // [cache_backend == "disk"] per-inode backing file this line was written to (== globals.inodeDiskCacheFiles[inodeNumber].file); nil in memory mode + diskOffset int64 // [cache_backend == "disk"] byte offset of this line within diskFile (== lineNumber * cacheLineSize) + diskLength int64 // [cache_backend == "disk"] number of valid bytes written at diskOffset (== contentLength) } // `inodeStruct` contains the state of an inode. @@ -393,6 +408,7 @@ type globalsStruct struct { dataCacheLineOutboundLRU dataCacheLineLRUStruct // LRU-ordered doubly linked list of dataCacheLineTrackerStruct where .state == CacheLineOutbound dataCacheLineDirtyLRU dataCacheLineLRUStruct // LRU-ordered doubly linked list of dataCacheLineTrackerStruct where .state == CacheLineDirty dataCacheActivityWG sync.WaitGroup // + inodeDiskCacheFiles map[uint64]*inodeDiskCacheFileStruct // [cache_backend == "disk"] Key == inodeStruct.inodeNumber; per-inode contiguous backing file + resident-line refcount fhMap map[uint64]*fhStruct // Key == fhStruct.nonce fissionMetrics *fissionMetricsStruct // backendMetrics *backendMetricsStruct // diff --git a/multi-storage-file-system/globals_lock.go b/multi-storage-file-system/globals_lock.go index 36526c8d..cd3a6018 100644 --- a/multi-storage-file-system/globals_lock.go +++ b/multi-storage-file-system/globals_lock.go @@ -18,7 +18,7 @@ import ( // globalsLockSiteCount is the number of distinct lockgen site strings (unique globalsLock("…") call // sites in this module). Maintained by: go generate (tools/lockgen). -const globalsLockSiteCount = 66 +const globalsLockSiteCount = 69 // globalsLockMaxSiteKeyLen is the length in bytes of the longest site string key in globalsLockMaxHoldBySite // (len(s) for that key). Maintained by: go generate (tools/lockgen). @@ -57,56 +57,59 @@ var globalsLockMaxHoldBySite = map[string]globalsLockSiteStats{ "backend.go:610:3:funcLit@609": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "backend.go:671:3:funcLit@670": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "bptree_test.go:60:3:BenchmarkBPTreePageInsertion": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "cache.go:396:3:allocateDataCacheLines": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "cache.go:444:2:(*dataCacheLineTrackerStruct).fetch": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "cache.go:476:2:(*dataCacheLineTrackerStruct).fetch": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1046:3:(*globalsStruct).DoRead": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1137:4:(*globalsStruct).DoRead": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1331:2:(*globalsStruct).DoStatFS": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1369:3:funcLit@1367": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1388:2:(*globalsStruct).DoRelease": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1528:3:funcLit@1526": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1547:2:(*globalsStruct).DoOpenDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:160:3:funcLit@158": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1688:3:funcLit@1681": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1726:2:(*globalsStruct).DoReadDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:179:2:(*globalsStruct).DoLookup": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1849:4:(*globalsStruct).DoReadDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1965:3:funcLit@1963": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:1984:2:(*globalsStruct).DoReleaseDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:2089:3:funcLit@2087": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:2108:2:(*globalsStruct).DoCreate": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:2310:3:funcLit@2303": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:2350:2:(*globalsStruct).DoReadDirPlus": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:2635:4:(*globalsStruct).DoReadDirPlus": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:2772:3:funcLit@2770": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:2791:2:(*globalsStruct).DoStatX": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:301:3:funcLit@299": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:320:2:(*globalsStruct).DoGetAttr": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:441:3:funcLit@439": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:460:2:(*globalsStruct).DoMkDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:564:3:funcLit@562": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:583:2:(*globalsStruct).DoUnlink": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:680:3:funcLit@678": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:699:2:(*globalsStruct).DoRmDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:846:3:funcLit@844": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:865:2:(*globalsStruct).DoOpen": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fission.go:982:3:funcLit@980": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "cache.go:416:3:allocateDataCacheLines": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "cache.go:468:2:(*dataCacheLineTrackerStruct).fetch": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "cache.go:500:2:(*dataCacheLineTrackerStruct).fetch": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1003:3:funcLit@1001": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1067:3:(*globalsStruct).DoRead": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1158:4:(*globalsStruct).DoRead": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1381:3:(*globalsStruct).DoRead": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1418:2:(*globalsStruct).DoStatFS": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1456:3:funcLit@1454": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1475:2:(*globalsStruct).DoRelease": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1615:3:funcLit@1613": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1634:2:(*globalsStruct).DoOpenDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:174:3:funcLit@172": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1775:3:funcLit@1768": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1813:2:(*globalsStruct).DoReadDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:1936:4:(*globalsStruct).DoReadDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:193:2:(*globalsStruct).DoLookup": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:2052:3:funcLit@2050": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:2071:2:(*globalsStruct).DoReleaseDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:2176:3:funcLit@2174": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:2195:2:(*globalsStruct).DoCreate": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:2397:3:funcLit@2390": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:2437:2:(*globalsStruct).DoReadDirPlus": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:2722:4:(*globalsStruct).DoReadDirPlus": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:2859:3:funcLit@2857": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:2878:2:(*globalsStruct).DoStatX": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:315:3:funcLit@313": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:334:2:(*globalsStruct).DoGetAttr": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:455:3:funcLit@453": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:474:2:(*globalsStruct).DoMkDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:578:3:funcLit@576": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:597:2:(*globalsStruct).DoUnlink": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:694:3:funcLit@692": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:713:2:(*globalsStruct).DoRmDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:860:3:funcLit@858": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission.go:879:2:(*globalsStruct).DoOpen": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "fission_test.go:1228:2:TestFissionDoUnlinkRollbackOnBackendFailure": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "fission_test.go:1618:2:TestFissionConvertPhysicalToVirtual": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "fission_test.go:1644:2:TestFissionConvertPhysicalToVirtual": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "fission_test.go:1680:2:TestFissionConvertPhysicalToVirtual": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission_test.go:1782:2:TestFissionDoReadFetchFailureReturnsEIO": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fission_test.go:1814:2:TestFissionDoReadFetchFailureReturnsEIO": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "fission_test.go:461:2:TestFissionDoGetAttrStatX": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "fission_test.go:641:2:TestFissionDoOpenDirReadDirReadDirPlusReleaseDir": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fs.go:1104:2:prefetchDirectory": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fs.go:1133:3:prefetchDirectory": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fs.go:1110:2:prefetchDirectory": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fs.go:1139:3:prefetchDirectory": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "fs.go:123:2:drainFS": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fs.go:1299:2:dumpFS": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fs.go:1463:2:(*inodeStruct).finishPendingDelete": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fs.go:1305:2:dumpFS": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fs.go:1469:2:(*inodeStruct).finishPendingDelete": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "fs.go:169:2:processToMountList": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "fs.go:23:2:initFS": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fs.go:245:2:processToUnmountList": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, - "fs.go:825:4:inodeEvictor": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fs.go:251:2:processToUnmountList": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, + "fs.go:831:4:inodeEvictor": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "http.go:150:4:(*globalsStruct).ServeHTTP": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "http.go:170:4:(*globalsStruct).ServeHTTP": {HoldCnt: 0, HoldSum: 0, HoldMax: 0}, "http.go:184:3:(*globalsStruct).ServeHTTP": {HoldCnt: 0, HoldSum: 0, HoldMax: 0},