Feature request: safer alternate filesystem collection modes for `do-agent`

I would like to request one or more alternate filesystem collection modes for `do-agent` that do not require the agent to walk and interpret every mounted filesystem on the host.

Some environments have unusual, duplicated, virtualized, or bind-mounted filesystem layouts. In those environments, the current filesystem collector can encounter duplicate or misleading mount data. A safer alternative would allow administrators to collect only the basic disk and inode metrics they actually need.

This request is not for full support of every unusual environment. Instead, it is a request for safer fallback collection modes that could help administrators avoid full mount-table discovery when it is not appropriate for their server.

The three possible approaches are:

1. A `df`-based filesystem collector mode;
2. An explicit path-based filesystem collector mode;
3. A customer-provided filesystem metrics file.

Any one of these would help. Supporting more than one would give administrators flexibility.

---

## Proposed option 1: `df`-based filesystem collector mode

Please consider adding a filesystem collector mode that gathers disk and inode metrics using the equivalent of:

```bash
df -P
df -Pi
```

Possible option names:

```bash
--collector.filesystem.mode=df
```

or:

```bash
--collector.filesystem.use-df
```

When enabled, `do-agent` would collect filesystem space and inode metrics using `df`-style output instead of walking and interpreting the full mount table through the current filesystem collector.

This would provide the same type of filesystem information administrators already trust from the terminal:

* total space;
* used space;
* available space;
* percent used;
* total inodes;
* used inodes;
* available inodes;
* inode percent used;
* filesystem/device;
* mountpoint.

Possible advanced options:

```bash
--collector.filesystem.df-path=/usr/bin/df
```

```bash
--collector.filesystem.df-args="-P"
```

```bash
--collector.filesystem.df-inode-args="-Pi"
```

If parsing fails, the agent could disable only the `df` filesystem collector and emit a single warning, rather than repeatedly logging the same failure.
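
To make the idea concrete, here is a minimal Python sketch of parsing POSIX-format `df` output. It is only an illustration, not how `do-agent` works internally. The names `SpaceUsage` and `parse_df_space` are hypothetical, the sample rows are illustrative values derived from the rounded figures in this report, and the sketch assumes the agent runs `df -Pk` so that sizes are reported in 1024-byte units.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpaceUsage:
    device: str
    size_bytes: int
    used_bytes: int
    avail_bytes: int
    used_percent: Optional[int]  # None when df prints "-"
    mountpoint: str


def parse_df_space(output: str, block_size: int = 1024) -> list[SpaceUsage]:
    """Parse `df -Pk` style output; raise ValueError on anything unexpected."""
    lines = output.strip().splitlines()
    if not lines or not lines[0].startswith("Filesystem"):
        raise ValueError("unexpected df header")
    rows = []
    for line in lines[1:]:
        # Split into at most 6 fields so a mountpoint containing spaces stays whole.
        fields = line.split(None, 5)
        if len(fields) != 6:
            raise ValueError(f"unexpected df row: {line!r}")
        device, blocks, used, avail, capacity, mountpoint = fields
        rows.append(SpaceUsage(
            device=device,
            size_bytes=int(blocks) * block_size,
            used_bytes=int(used) * block_size,
            avail_bytes=int(avail) * block_size,
            used_percent=None if capacity == "-" else int(capacity.rstrip("%")),
            mountpoint=mountpoint,
        ))
    return rows


# Illustrative sample only (assumed values, 1024-byte blocks).
SAMPLE = """\
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/vda1        125829120 31457280  94371840      26% /
/dev/vda3           519168   323584    195584      63% /boot
"""

for row in parse_df_space(SAMPLE):
    print(row.mountpoint, row.size_bytes, row.used_percent)
```

Because the parser raises `ValueError` on any row it does not understand, a caller could catch that once, disable the `df` mode, and emit a single warning, as suggested above.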

---

## Why a `df`-based mode may be enough

On the affected server, standard `df` output provides a clean and practical filesystem view without exposing the large CageFS bind-mount layout that caused problems for the normal collector.

Example `df -h` output:

```text
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           1.8G     0  1.8G   0% /dev/shm
tmpfs           732M   74M  658M  11% /run
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/vda1       120G   30G   90G  26% /
/dev/vda3       507M  316M  191M  63% /boot
/dev/vda2       200M  7.5M  193M   4% /boot/efi
/dev/loop0      3.9G  204K  3.7G   1% /tmp
none            1.8G  4.0K  1.8G   1% /var/lve/dbgovernor-shm
```

Example `df -ih` output:

```text
Filesystem     Inodes IUsed IFree IUse% Mounted on
devtmpfs         447K   344  447K    1% /dev
tmpfs            457K     1  457K    1% /dev/shm
tmpfs            800K   924  800K    1% /run
tmpfs            1.0K    18  1006    2% /sys/fs/cgroup
/dev/vda1         60M  663K   60M    2% /
/dev/vda3        256K   327  256K    1% /boot
/dev/vda2           0     0     0     - /boot/efi
/dev/loop0       256K    49  256K    1% /tmp
none             457K     2  457K    1% /var/lve/dbgovernor-shm
tmpfs             92K    22   92K    1% /run/user/1002
```

This suggests that the problem is not that filesystem usage cannot be reported on this host. Rather, the current collector appears to inspect the mount layout in a way that encounters CageFS bind mounts and then emits duplicate filesystem metrics.

A `df`-based fallback mode could collect the basic disk and inode information administrators already use from the terminal, while avoiding deeper mount-table discovery.
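
One detail a `df`-based mode would need to handle is visible in the inode output above: `/boot/efi` reports zero inodes and `-` for the percentage. The following Python sketch shows one possible way to parse `df -Pi` style output while tolerating that case. The function names are hypothetical and the sample rows are illustrative.

```python
from typing import Optional


def parse_percent(text: str) -> Optional[int]:
    """Return the integer percentage, or None when df prints '-'."""
    return None if text == "-" else int(text.rstrip("%"))


def parse_df_inodes(output: str) -> dict[str, dict]:
    """Map mountpoint -> inode counts from `df -Pi` style output."""
    result = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # At most 6 fields, so a mountpoint containing spaces stays whole.
        device, total, used, free, percent, mountpoint = line.split(None, 5)
        result[mountpoint] = {
            "device": device,
            "inode_total": int(total),
            "inode_used": int(used),
            "inode_avail": int(free),
            "inode_used_percent": parse_percent(percent),
        }
    return result


# Illustrative sample only (raw counts rather than the -h form shown above).
SAMPLE = """\
Filesystem      Inodes  IUsed    IFree IUse% Mounted on
/dev/vda1     62914560 663000 62251560    2% /
/dev/vda2            0      0        0     - /boot/efi
"""

inodes = parse_df_inodes(SAMPLE)
print(inodes["/"]["inode_used"])
print(inodes["/boot/efi"]["inode_used_percent"])
```

Reporting `None` (or simply omitting the inode percentage) for such filesystems seems safer than treating `-` as a parse failure.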

For many servers, this would be sufficient. In this example, the useful monitored filesystems would primarily be:

```text
/
/boot
/boot/efi
/tmp
```

and possibly `/var/lve/dbgovernor-shm` only if the administrator chooses to include tmpfs-style filesystems.

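The agent could optionally ignore common virtual filesystem types by default. Plain `df -P` output does not include the filesystem type, so this would need type information from another source, such as GNU `df -T`. Candidate types to ignore: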

```text
devtmpfs
tmpfs
cgroup
cgroup2
proc
sysfs
debugfs
tracefs
overlay
squashfs
```

This would give `do-agent` a safer fallback for unusual mount layouts without requiring full CloudLinux/CageFS support.
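
As a rough illustration, the default ignore list could be applied as a simple filter with an explicit opt-in for mountpoints the administrator still wants. This Python sketch assumes each collected row already carries an `fstype` field. The function name and the `ext4`/`vfat` types in the sample are assumptions for the example, not values taken from the affected server.

```python
# Filesystem types ignored by default (the list proposed above).
DEFAULT_IGNORED_FSTYPES = frozenset({
    "devtmpfs", "tmpfs", "cgroup", "cgroup2", "proc",
    "sysfs", "debugfs", "tracefs", "overlay", "squashfs",
})


def filter_filesystems(rows, ignored=DEFAULT_IGNORED_FSTYPES, include_mountpoints=()):
    """Drop rows whose type is ignored, unless the mountpoint is explicitly included."""
    keep = set(include_mountpoints)
    return [
        row for row in rows
        if row["fstype"] not in ignored or row["mountpoint"] in keep
    ]


# Hypothetical rows; the fstype values are assumed for illustration.
rows = [
    {"fstype": "devtmpfs", "mountpoint": "/dev"},
    {"fstype": "ext4", "mountpoint": "/"},
    {"fstype": "vfat", "mountpoint": "/boot/efi"},
    {"fstype": "tmpfs", "mountpoint": "/var/lve/dbgovernor-shm"},
]

default_view = filter_filesystems(rows)
opt_in_view = filter_filesystems(rows, include_mountpoints=["/var/lve/dbgovernor-shm"])
print([r["mountpoint"] for r in default_view])
print([r["mountpoint"] for r in opt_in_view])
```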

---

## Proposed option 2: explicit path-based filesystem checks

Please consider an option that collects filesystem metrics only for specific administrator-provided paths.

For example:

```bash
--collector.filesystem.paths=/,/boot,/boot/efi,/tmp
```

or:

```bash
--collector.filesystem.paths-file=/etc/do-agent/filesystem-paths.conf
```

Example paths file:

```text
/
/boot
/boot/efi
/tmp
```

When this option is used, `do-agent` would skip full mountpoint discovery and collect filesystem metrics only for the listed paths.
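
A paths file like the one above would be simple to read. The Python sketch below is only a suggestion for the format rules: it assumes one absolute path per line, ignores blank lines, treats lines starting with `#` as comments (an assumption, since no comment syntax is defined in this request), and drops repeated entries. The function name is hypothetical.

```python
def read_paths_file(text: str) -> list[str]:
    """Return the unique absolute paths listed in a paths file, in file order."""
    paths = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        if not line.startswith("/"):
            raise ValueError(f"not an absolute path: {line!r}")
        if line not in paths:
            paths.append(line)  # keep the first occurrence only
    return paths


EXAMPLE = """\
# filesystems to monitor
/
/boot
/boot/efi

/tmp
/boot
"""

print(read_paths_file(EXAMPLE))
```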

The behavior could be similar to running:

```bash
df -P /
df -P /boot
df -P /boot/efi
df -P /tmp

df -Pi /
df -Pi /boot
df -Pi /boot/efi
df -Pi /tmp
```

or using equivalent `statfs` / `statvfs` calls internally.

This would allow administrators to say:

> Only report disk and inode usage for these important paths.

That is often all that is needed for practical alerting.

This would also avoid requiring administrators to craft complex mountpoint exclusion regular expressions for bind-mount-heavy systems.
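
For reference, the `statvfs` variant mentioned above needs very little code. This Python sketch uses the standard `os.statvfs` call to gather space and inode figures for an explicit list of paths, without reading the mount table at all. The function names and the returned field names are hypothetical, and a path that cannot be queried is skipped rather than treated as fatal.

```python
import os


def path_usage(path: str) -> dict:
    """Return space and inode usage for the filesystem containing `path`."""
    st = os.statvfs(path)
    size = st.f_blocks * st.f_frsize
    free = st.f_bfree * st.f_frsize
    return {
        "path": path,
        "size_bytes": size,
        "used_bytes": size - free,
        # f_bavail counts blocks available to unprivileged users.
        "avail_bytes": st.f_bavail * st.f_frsize,
        "inode_total": st.f_files,
        "inode_used": st.f_files - st.f_ffree,
        "inode_avail": st.f_favail,
    }


def collect(paths):
    """Collect usage for each configured path, skipping paths that fail."""
    results = []
    for path in paths:
        try:
            results.append(path_usage(path))
        except OSError as exc:
            print(f"skipping {path}: {exc}")
    return results


for usage in collect(["/", "/boot", "/boot/efi", "/tmp"]):
    print(usage["path"], usage["size_bytes"], usage["inode_total"])
```

The amount of work scales with the number of configured paths rather than the number of mountpoints on the host, which is the property that matters on bind-mount-heavy systems.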

---

## Proposed option 3: customer-provided filesystem metrics file

Please also consider allowing `do-agent` to read filesystem metrics from a local file.

For example:

```bash
--collector.filesystem.file=/var/lib/do-agent/filesystem-metrics.txt
```

or:

```bash
--collector.filesystem.source=file
--collector.filesystem.file=/var/lib/do-agent/filesystem-metrics.txt
```

In this model, the customer could generate the file however they prefer:

* `df`;
* `stat`;
* a shell script;
* a cron job;
* a monitoring tool;
* a custom parser with environment-specific exclusions.

`do-agent` would remain the trusted process that submits metrics to DigitalOcean, but the customer would control how filesystem metrics are gathered.

A file-based approach may be safer than an exec-based plugin because `do-agent` would not need to run arbitrary customer commands. It would only read a documented local file format.
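
One practical concern with a file-based mode is the agent reading a half-written file. On the customer side this can be avoided by writing to a temporary file in the same directory and renaming it into place. The Python sketch below shows that pattern under the assumption that the agent simply re-reads the file on each collection cycle. The function name is hypothetical and the demo writes to a temporary directory rather than a real agent path.

```python
import os
import tempfile


def write_atomically(path: str, content: str) -> None:
    """Write `content` to `path` so readers see the old or new file, never a partial one."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".fsmetrics-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
            handle.flush()
            os.fsync(handle.fileno())
        # Renaming within one directory replaces the target in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "filesystem-metrics.txt")
write_atomically(demo_file, "mountpoint=/ size_bytes=128849018880\n")
with open(demo_file) as handle:
    print(handle.read(), end="")
```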

Example conceptual format:

```text
mountpoint=/ size_bytes=128849018880 used_bytes=32212254720 avail_bytes=96636764160 used_percent=26 inode_total=62914560 inode_used=663000 inode_avail=62251560 inode_used_percent=2
mountpoint=/boot size_bytes=531628032 used_bytes=331350016 avail_bytes=200278016 used_percent=63 inode_total=262144 inode_used=327 inode_avail=261817 inode_used_percent=1
mountpoint=/tmp size_bytes=4187593113 used_bytes=208896 avail_bytes=3972844748 used_percent=1 inode_total=262144 inode_used=49 inode_avail=262095 inode_used_percent=1
```
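
A format like this would be easy to validate strictly. The following Python sketch parses one line of the conceptual format into a record. It is an assumption-laden illustration: the function name is hypothetical, every field other than `mountpoint` is treated as an integer, and mountpoints containing spaces are not supported because fields are separated by whitespace.

```python
def parse_metrics_line(line: str) -> dict:
    """Parse one `key=value ...` line; raise ValueError on malformed input."""
    record = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        # Only the mountpoint is kept as text; all other fields must be integers.
        record[key] = value if key == "mountpoint" else int(value)
    if "mountpoint" not in record:
        raise ValueError("missing mountpoint field")
    return record


LINE = (
    "mountpoint=/boot size_bytes=531628032 used_bytes=331350016 "
    "avail_bytes=200278016 used_percent=63 inode_total=262144 "
    "inode_used=327 inode_avail=261817 inode_used_percent=1"
)

record = parse_metrics_line(LINE)
print(record["mountpoint"], record["size_bytes"], record["inode_avail"])
```

Rejecting a malformed line outright, rather than guessing, keeps the agent's behavior predictable when the customer's generator script has a bug.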

Or, if preferred, the file could use a documented Prometheus-style text format.
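
For the Prometheus-style variant, a customer script could render records like this Python sketch does. The metric names follow the `node_filesystem_*` naming seen in the error message quoted later in this request. Whether `do-agent` would accept exactly these names, or only a `mountpoint` label, is an assumption, and the function names are hypothetical.

```python
# (metric name, record field) pairs; the mapping is an assumption for illustration.
METRICS = [
    ("node_filesystem_size_bytes", "size_bytes"),
    ("node_filesystem_avail_bytes", "avail_bytes"),
    ("node_filesystem_files", "inode_total"),
    ("node_filesystem_files_free", "inode_avail"),
]


def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_prometheus(records) -> str:
    """Render records as Prometheus text exposition format, grouped by metric."""
    lines = []
    for name, field in METRICS:
        lines.append(f"# TYPE {name} gauge")
        for rec in records:
            label = escape_label(rec["mountpoint"])
            lines.append(f'{name}{{mountpoint="{label}"}} {rec[field]}')
    return "\n".join(lines) + "\n"


records = [
    {"mountpoint": "/", "size_bytes": 128849018880, "avail_bytes": 96636764160,
     "inode_total": 62914560, "inode_avail": 62251560},
]
print(render_prometheus(records), end="")
```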

---

## Why this is useful

Some environments have mount tables that are technically valid but difficult for a general-purpose filesystem collector to interpret safely.

Examples include:

* CloudLinux CageFS;
* cPanel/WHM systems;
* chroot-heavy systems;
* bind-mount-heavy systems;
* container-heavy hosts;
* Docker/LXC environments;
* systems with duplicated or virtualized mountpoints.

In these environments, the administrator may not need the agent to understand every mountpoint. They may only need reliable metrics for a few filesystems or paths, such as:

```text
/
/boot
/boot/efi
/tmp
```

A `df`-style mode, explicit path mode, or customer-provided file mode would avoid unnecessary full mount discovery and reduce the risk of duplicate filesystem metrics.
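
Independently of which mode is chosen, a final guard against duplicates could be applied just before metrics are exported. The Python sketch below keeps the first sample for each metric name and label set and counts the rest as dropped. It is a generic illustration of the idea, not a description of `do-agent` internals, and the function name and sample data are hypothetical.

```python
def dedupe_samples(samples):
    """Keep the first sample per (name, labels); return (unique, dropped_count)."""
    seen = set()
    unique = []
    dropped = 0
    for name, labels, value in samples:
        # Sort label items so that label order does not affect identity.
        key = (name, tuple(sorted(labels.items())))
        if key in seen:
            dropped += 1
            continue
        seen.add(key)
        unique.append((name, labels, value))
    return unique, dropped


samples = [
    ("node_filesystem_size_bytes", {"mountpoint": "/"}, 128849018880),
    ("node_filesystem_size_bytes", {"mountpoint": "/boot"}, 531628032),
    # A second sample with identical name and labels, as a bind mount might produce.
    ("node_filesystem_size_bytes", {"mountpoint": "/"}, 128849018880),
]

unique, dropped = dedupe_samples(samples)
print(len(unique), dropped)
```

A single summary line reporting the dropped count would give operators a signal without repeating the same error for every duplicate.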

---

## Example use case: CloudLinux / CageFS / cPanel

I understand that CloudLinux/CageFS is not officially supported by `do-agent`. This feature request is not asking for full CloudLinux support.

However, this environment is a good example of why safer alternate filesystem collection modes would be useful.

Environment:

* DigitalOcean Droplet;
* CloudLinux + cPanel/WHM;
* CageFS enabled;
* `do-agent` upgraded automatically from `3.18.10-1` to `3.18.12-1`;
* Upgrade occurred around `2026-04-24 03:49 UTC`.

After the upgrade, the filesystem collector began repeatedly logging duplicate metric errors related to CageFS bind mounts.

The repeated mountpoints were under:

```text
/usr/share/cagefs-skeleton/
```

The logs repeatedly contained errors similar to:

```text
failed to gather metrics: collected metric "node_filesystem_size_bytes" ... was collected before with the same name and label values
```

The impact was significant:

* sustained high CPU usage, around 75%;
* approximately 55 GB `/var/log/messages`;
* approximately 40 GB rotated messages log;
* disk exhaustion;
* WHM/cPanel service interruption.

Disabling `do-agent` immediately stopped the log flood, and CPU usage returned to normal.

In this case, I did not need the agent to inspect CageFS mountpoints. I only needed basic disk and inode metrics for the main filesystems. Commands such as the following were sufficient to show the information I needed:

```bash
df -P
df -Pi
```

or, for specific paths:

```bash
df -P /
df -P /boot
df -P /boot/efi
df -P /tmp

df -Pi /
df -Pi /boot
df -Pi /boot/efi
df -Pi /tmp
```

---

## Current workaround

The only safe workaround I currently have is to disable the filesystem collector entirely:

```bash
/opt/digitalocean/bin/do-agent --syslog --no-collector.filesystem
```

That prevents the runaway filesystem collector behavior, but it also removes the DigitalOcean filesystem metrics I actually need for this Droplet.

This creates an unfortunate tradeoff:

* leave filesystem collection enabled and risk duplicate metric errors, runaway logging, high CPU usage, and disk exhaustion;
* disable filesystem collection and lose the disk/inode metrics that would help detect or prevent disk exhaustion.

A safer alternate collection mode would avoid this tradeoff by allowing `do-agent` to report basic filesystem usage without walking the full mount layout.

---

## Why mountpoint exclusion rules are not always enough

Mountpoint exclusion rules are useful, but they still require the agent to discover and reason about the host’s mount layout.

In bind-mount-heavy or CageFS-style environments, that discovery process can be fragile. Administrators may also have to write complex regular expressions to exclude paths the agent did not need to inspect in the first place.

A `df`-based, path-based, or file-based mode would be simpler and more predictable:

* do not walk every mountpoint;
* do not inspect CageFS bind mounts unnecessarily;
* do not require complex mountpoint exclusion regular expressions;
* collect only the filesystems or paths the administrator explicitly cares about;
* allow administrators to generate clean filesystem metrics themselves when needed.

---

## Requested features

Please consider adding one or more of the following options.

### `df`-based mode

```bash
--collector.filesystem.mode=df
```

or:

```bash
--collector.filesystem.use-df
```

This would collect filesystem space and inode metrics using the equivalent of `df -P` and `df -Pi`.

### Explicit path mode

```bash
--collector.filesystem.paths=/,/boot,/boot/efi,/tmp
```

or:

```bash
--collector.filesystem.paths-file=/etc/do-agent/filesystem-paths.conf
```

This would collect filesystem metrics only for explicitly configured paths.

### Customer-provided file mode

```bash
--collector.filesystem.file=/var/lib/do-agent/filesystem-metrics.txt
```

or:

```bash
--collector.filesystem.source=file
--collector.filesystem.file=/var/lib/do-agent/filesystem-metrics.txt
```

This would allow customers to generate filesystem metrics themselves and let `do-agent` read and submit them.

---

## Additional defensive behavior

Even when an environment is unsupported, it may also be helpful for the agent to handle repeated filesystem collector failures more defensively.

For example:

* rate-limit repeated duplicate metric errors;
* disable only the affected collector after repeated failures;
* emit one clear warning instead of repeatedly logging the same error;
* avoid filling system logs when the metrics collector is unhealthy.

A metrics issue should not be able to fill `/var/log/messages`, exhaust disk space, and contribute to a production service outage.
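
The defensive behavior described above could look roughly like the following Python sketch: a small wrapper that counts consecutive failures, disables only the affected collector once a threshold is reached, and emits exactly one warning. The class name, the threshold of 5, and the `warn` callback are all hypothetical choices for illustration.

```python
class CollectorGuard:
    """Disable a collector after repeated consecutive failures, warning only once."""

    def __init__(self, name, max_failures=5, warn=print):
        self.name = name
        self.max_failures = max_failures
        self.warn = warn
        self.failures = 0
        self.disabled = False

    def run(self, collect):
        """Run `collect()`; return its result, or None if it failed or is disabled."""
        if self.disabled:
            return None
        try:
            result = collect()
        except Exception as exc:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.disabled = True
                # Single warning at the moment of disabling; nothing is logged afterwards.
                self.warn(
                    f"{self.name} collector disabled after "
                    f"{self.failures} consecutive failures: {exc}"
                )
            return None
        self.failures = 0  # a success resets the consecutive-failure count
        return result


def failing_collector():
    raise RuntimeError("duplicate metric")


guard = CollectorGuard("filesystem", max_failures=3)
for _ in range(10):
    guard.run(failing_collector)
print(guard.disabled)
```

With this shape, ten failing cycles produce one log line instead of ten, and the other collectors keep running.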

---

## Related issues

This request may also help with or relate to other reports involving CloudLinux support, duplicate metric collection, or high CPU from filesystem metric collection:

* digitalocean/do-agent#129 — CloudLinux install/support issue for CloudLinux + WHM/cPanel.
* digitalocean/droplet-agent#131 — request for CloudLinux support.
* digitalocean/do-agent#228 — duplicate metric errors with “was collected before with the same name and label values.”
* digitalocean/do-agent#233 — high CPU and repeated duplicate `node_filesystem_*` metric errors.

This feature request is more specific: provide safer alternate filesystem collection modes, such as `df`-based collection, explicit path collection, or customer-provided filesystem metrics, so that users do not have to choose between unsafe full mount discovery and disabling filesystem metrics entirely.

---

## Trouble ticket

I also opened a DigitalOcean Support ticket for this incident in April. Support confirmed that CloudLinux/CageFS is not officially supported by `do-agent`.

This request is not for full CloudLinux support, but for a safer alternative filesystem collection mode that would let unsupported or unusual mount layouts avoid full mount-table discovery.

[#12093409](https://cloudsupport.digitalocean.com/s/case-detail?recordId=500QP00001QvKFRYA3): do-agent 3.18.12 causes runaway logging and high CPU on CloudLinux CageFS Droplet
