Feature request: safer alternate filesystem collection modes for do-agent #357

@lewismoten

I would like to request one or more alternate filesystem collection modes for do-agent that do not require the agent to walk and interpret every mounted filesystem on the host.

Some environments have unusual, duplicated, virtualized, or bind-mounted filesystem layouts. In those environments, the current filesystem collector can encounter duplicate or misleading mount data. A safer alternative would allow administrators to collect only the basic disk and inode metrics they actually need.

This request is not for full support of every unusual environment. Instead, it is a request for safer fallback collection modes that could help administrators avoid full mount-table discovery when it is not appropriate for their server.

The three possible approaches are:

  1. A df-based filesystem collector mode;
  2. An explicit path-based filesystem collector mode;
  3. A customer-provided filesystem metrics file.

Any one of these would help. Supporting more than one would give administrators flexibility.


Proposed option 1: df-based filesystem collector mode

Please consider adding a filesystem collector mode that gathers disk and inode metrics using the equivalent of:

df -P
df -Pi

Possible option names:

--collector.filesystem.mode=df

or:

--collector.filesystem.use-df

When enabled, do-agent would collect filesystem space and inode metrics using df-style output instead of walking and interpreting the full mount table through the current filesystem collector.

This would provide the same type of filesystem information administrators already trust from the terminal:

  • total space;
  • used space;
  • available space;
  • percent used;
  • total inodes;
  • used inodes;
  • available inodes;
  • inode percent used;
  • filesystem/device;
  • mountpoint.

Possible advanced options:

--collector.filesystem.df-path=/usr/bin/df
--collector.filesystem.df-args="-P"
--collector.filesystem.df-inode-args="-Pi"

If parsing fails, the agent could disable only the df filesystem collector and emit a single warning, rather than repeatedly logging the same failure.
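To make the idea concrete, here is a minimal sketch of what df-style collection might look like. It is not how do-agent is implemented. The names `FsUsage`, `parse_df_posix`, and `run_df` are hypothetical, and the sample rows are illustrative values derived from the df -h example in this issue. The sketch assumes `df -Pk`, so that sizes are always reported in 1024-byte blocks.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional

# Illustrative input only: values derived from the df -h example in this
# issue, expressed in 1024-byte blocks.
SAMPLE_DF_OUTPUT = """\
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/vda1        125829120 31457280  94371840      26% /
/dev/vda3           519168   323584    195584      63% /boot
"""


@dataclass
class FsUsage:
    filesystem: str
    size_bytes: int
    used_bytes: int
    avail_bytes: int
    used_percent: Optional[int]  # None when df prints "-"
    mountpoint: str


def parse_df_posix(output: str) -> List[FsUsage]:
    """Parse `df -Pk` output, where each data row has six fields."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        # Split into at most six fields so a mountpoint containing
        # spaces stays intact in the last field. A malformed row raises
        # ValueError, which the caller can treat as a parse failure.
        fs, blocks, used, avail, pct, mount = line.split(None, 5)
        rows.append(FsUsage(
            filesystem=fs,
            size_bytes=int(blocks) * 1024,
            used_bytes=int(used) * 1024,
            avail_bytes=int(avail) * 1024,
            used_percent=None if pct == "-" else int(pct.rstrip("%")),
            mountpoint=mount,
        ))
    return rows


def run_df(df_path: str = "df") -> List[FsUsage]:
    """Run df in POSIX format with 1024-byte units and parse the result."""
    result = subprocess.run([df_path, "-Pk"], capture_output=True,
                            text=True, check=True)
    return parse_df_posix(result.stdout)
```

Inode metrics could be parsed the same way from `df -Pi`, since that output uses the same six-column layout.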


Why a df-based mode may be enough

On the affected server, standard df output provides a clean and practical filesystem view without exposing the large CageFS bind-mount layout that caused problems for the normal collector.

Example df -h output:

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           1.8G     0  1.8G   0% /dev/shm
tmpfs           732M   74M  658M  11% /run
tmpfs           4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/vda1       120G   30G   90G  26% /
/dev/vda3       507M  316M  191M  63% /boot
/dev/vda2       200M  7.5M  193M   4% /boot/efi
/dev/loop0      3.9G  204K  3.7G   1% /tmp
none            1.8G  4.0K  1.8G   1% /var/lve/dbgovernor-shm

Example df -ih output:

Filesystem     Inodes IUsed IFree IUse% Mounted on
devtmpfs         447K   344  447K    1% /dev
tmpfs            457K     1  457K    1% /dev/shm
tmpfs            800K   924  800K    1% /run
tmpfs            1.0K    18  1006    2% /sys/fs/cgroup
/dev/vda1         60M  663K   60M    2% /
/dev/vda3        256K   327  256K    1% /boot
/dev/vda2           0     0     0     - /boot/efi
/dev/loop0       256K    49  256K    1% /tmp
none             457K     2  457K    1% /var/lve/dbgovernor-shm
tmpfs             92K    22   92K    1% /run/user/1002

This suggests that filesystem usage can be reported on this host. The problem appears to be that the current collector inspects the mount layout in a way that picks up CageFS bind mounts and produces duplicate filesystem metrics.

A df-based fallback mode could collect the basic disk and inode information administrators already use from the terminal, while avoiding deeper mount-table discovery.

For many servers, this would be sufficient. In this example, the useful monitored filesystems would primarily be:

/
/boot
/boot/efi
/tmp

and possibly /var/lve/dbgovernor-shm, if the administrator chooses to include tmpfs-style filesystems.

The agent could optionally ignore common virtual filesystems by default, such as:

devtmpfs
tmpfs
cgroup
cgroup2
proc
sysfs
debugfs
tracefs
overlay
squashfs

This would give do-agent a safer fallback for unusual mount layouts without requiring full CloudLinux/CageFS support.
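As a rough sketch of the default ignore list described above, rows could be filtered by the value in df's Filesystem column. The names `DEFAULT_IGNORED_FILESYSTEMS` and `drop_virtual` are hypothetical. Matching on the Filesystem column is only an approximation: it holds the mount source, which does not always equal the filesystem type, so a real implementation might need type information as well.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical default ignore list, taken from the names in this issue.
DEFAULT_IGNORED_FILESYSTEMS: FrozenSet[str] = frozenset({
    "devtmpfs", "tmpfs", "cgroup", "cgroup2", "proc",
    "sysfs", "debugfs", "tracefs", "overlay", "squashfs",
})


def drop_virtual(
    rows: Iterable[Dict[str, str]],
    ignored: FrozenSet[str] = DEFAULT_IGNORED_FILESYSTEMS,
) -> List[Dict[str, str]]:
    """Keep only rows whose Filesystem column is not in the ignore set."""
    return [row for row in rows if row["filesystem"] not in ignored]


# Rows taken from the df example in this issue.
rows = [
    {"filesystem": "devtmpfs", "mountpoint": "/dev"},
    {"filesystem": "tmpfs", "mountpoint": "/dev/shm"},
    {"filesystem": "/dev/vda1", "mountpoint": "/"},
    {"filesystem": "/dev/loop0", "mountpoint": "/tmp"},
    {"filesystem": "none", "mountpoint": "/var/lve/dbgovernor-shm"},
]
kept = [row["mountpoint"] for row in drop_virtual(rows)]
```

In this example the `none` entry for /var/lve/dbgovernor-shm is not filtered, because its source name is not in the list. That is one reason an administrator-controlled include or exclude option would still be useful.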


Proposed option 2: explicit path-based filesystem checks

Please consider an option that collects filesystem metrics only for specific administrator-provided paths.

For example:

--collector.filesystem.paths=/,/boot,/boot/efi,/tmp

or:

--collector.filesystem.paths-file=/etc/do-agent/filesystem-paths.conf

Example paths file:

/
/boot
/boot/efi
/tmp

When this option is used, do-agent would skip full mountpoint discovery and collect filesystem metrics only for the listed paths.

The behavior could be similar to running:

df -P /
df -P /boot
df -P /boot/efi
df -P /tmp

df -Pi /
df -Pi /boot
df -Pi /boot/efi
df -Pi /tmp

or using equivalent statfs / statvfs calls internally.

This would allow administrators to say:

Only report disk and inode usage for these important paths.

That is often all that is needed for practical alerting.

This would also avoid requiring administrators to craft complex mountpoint exclusion regular expressions for bind-mount-heavy systems.


Proposed option 3: customer-provided filesystem metrics file

Please also consider allowing do-agent to read filesystem metrics from a local file.

For example:

--collector.filesystem.file=/var/lib/do-agent/filesystem-metrics.txt

or:

--collector.filesystem.source=file
--collector.filesystem.file=/var/lib/do-agent/filesystem-metrics.txt

In this model, the customer could generate the file however they prefer:

  • df;
  • stat;
  • a shell script;
  • a cron job;
  • a monitoring tool;
  • a custom parser with environment-specific exclusions.

do-agent would remain the trusted process that submits metrics to DigitalOcean, but the customer would control how filesystem metrics are gathered.

A file-based approach may be safer than an exec-based plugin because do-agent would not need to run arbitrary customer commands. It would only read a documented local file format.

Example conceptual format:

mountpoint=/ size_bytes=128849018880 used_bytes=32212254720 avail_bytes=96636764160 used_percent=26 inode_total=62914560 inode_used=663000 inode_avail=62251560 inode_used_percent=2
mountpoint=/boot size_bytes=531628032 used_bytes=331350016 avail_bytes=200278016 used_percent=63 inode_total=262144 inode_used=327 inode_avail=261817 inode_used_percent=1
mountpoint=/tmp size_bytes=4187593113 used_bytes=208896 avail_bytes=3972844748 used_percent=1 inode_total=262144 inode_used=49 inode_avail=262095 inode_used_percent=1

Or, if preferred, the file could use a documented Prometheus-style text format.
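A reader for the conceptual key=value format above might be sketched as follows. The function name `parse_metrics_line`, the set of required keys, and the sanity check are assumptions for illustration, not a proposed specification. This simple form does not support mountpoints that contain spaces.

```python
from typing import Dict, Tuple

# Hypothetical minimum set of keys a line must provide.
REQUIRED_KEYS = {
    "size_bytes", "used_bytes", "avail_bytes",
    "inode_total", "inode_used", "inode_avail",
}


def parse_metrics_line(line: str) -> Tuple[str, Dict[str, int]]:
    """Parse one 'key=value key=value ...' line into (mountpoint, values)."""
    # A token without "=" makes dict() raise ValueError.
    fields = dict(token.split("=", 1) for token in line.split())
    if "mountpoint" not in fields:
        raise ValueError("missing mountpoint")
    mountpoint = fields.pop("mountpoint")
    values = {key: int(value) for key, value in fields.items()}
    missing = REQUIRED_KEYS - values.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Basic sanity check so obviously inconsistent input is rejected.
    if values["used_bytes"] > values["size_bytes"]:
        raise ValueError("used_bytes exceeds size_bytes")
    return mountpoint, values


example = (
    "mountpoint=/boot size_bytes=531628032 used_bytes=331350016 "
    "avail_bytes=200278016 used_percent=63 inode_total=262144 "
    "inode_used=327 inode_avail=261817 inode_used_percent=1"
)
mountpoint, values = parse_metrics_line(example)
```

Rejecting a malformed line with a single error, rather than submitting partial data, would keep the file-based mode predictable.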


Why this is useful

Some environments have mount tables that are technically valid but difficult for a general-purpose filesystem collector to interpret safely.

Examples include:

  • CloudLinux CageFS;
  • cPanel/WHM systems;
  • chroot-heavy systems;
  • bind-mount-heavy systems;
  • container-heavy hosts;
  • Docker/LXC environments;
  • systems with duplicated or virtualized mountpoints.

In these environments, the administrator may not need the agent to understand every mountpoint. They may only need reliable metrics for a few filesystems or paths, such as:

/
/boot
/boot/efi
/tmp

A df-style mode, explicit path mode, or customer-provided file mode would avoid unnecessary full mount discovery and reduce the risk of duplicate filesystem metrics.


Example use case: CloudLinux / CageFS / cPanel

I understand that CloudLinux/CageFS is not officially supported by do-agent. This feature request is not asking for full CloudLinux support.

However, this environment is a good example of why safer alternate filesystem collection modes would be useful.

Environment:

  • DigitalOcean Droplet;
  • CloudLinux + cPanel/WHM;
  • CageFS enabled;
  • do-agent upgraded automatically from 3.18.10-1 to 3.18.12-1;
  • Upgrade occurred around 2026-04-24 03:49 UTC.

After the upgrade, the filesystem collector began repeatedly logging duplicate metric errors related to CageFS bind mounts.

The repeated mountpoints were under:

/usr/share/cagefs-skeleton/

The logs repeatedly contained errors similar to:

failed to gather metrics: collected metric "node_filesystem_size_bytes" ... was collected before with the same name and label values

The impact was significant:

  • sustained high CPU usage, around 75%;
  • /var/log/messages grew to approximately 55 GB;
  • a rotated messages log grew to approximately 40 GB;
  • disk exhaustion;
  • WHM/cPanel service interruption.

Disabling do-agent immediately stopped the log flood, and CPU usage returned to normal.

In this case, I did not need the agent to inspect CageFS mountpoints. I only needed basic disk and inode metrics for the main filesystems. Commands such as the following were sufficient to show the information I needed:

df -P
df -Pi

or, for specific paths:

df -P /
df -P /boot
df -P /boot/efi
df -P /tmp

df -Pi /
df -Pi /boot
df -Pi /boot/efi
df -Pi /tmp

Current workaround

The only safe workaround I currently have is to disable the filesystem collector entirely:

/opt/digitalocean/bin/do-agent --syslog --no-collector.filesystem

That prevents the runaway filesystem collector behavior, but it also removes the DigitalOcean filesystem metrics I actually need for this Droplet.

This creates an unfortunate tradeoff:

  • leave filesystem collection enabled and risk duplicate metric errors, runaway logging, high CPU usage, and disk exhaustion;
  • disable filesystem collection and lose the disk/inode metrics that would help detect or prevent disk exhaustion.

A safer alternate collection mode would avoid this tradeoff by allowing do-agent to report basic filesystem usage without walking the full mount layout.


Why mountpoint exclusion rules are not always enough

Mountpoint exclusion rules are useful, but they still require the agent to discover and reason about the host’s mount layout.

In bind-mount-heavy or CageFS-style environments, that discovery process can be fragile. Administrators may also have to write complex regular expressions to exclude paths the agent did not need to inspect in the first place.

A df-based, path-based, or file-based mode would be simpler and more predictable:

  • do not walk every mountpoint;
  • do not inspect CageFS bind mounts unnecessarily;
  • do not require complex mountpoint exclusion regular expressions;
  • collect only the filesystems or paths the administrator explicitly cares about;
  • allow administrators to generate clean filesystem metrics themselves when needed.

Requested features

Please consider adding one or more of the following options.

df-based mode

--collector.filesystem.mode=df

or:

--collector.filesystem.use-df

This would collect filesystem space and inode metrics using the equivalent of df -P and df -Pi.

Explicit path mode

--collector.filesystem.paths=/,/boot,/boot/efi,/tmp

or:

--collector.filesystem.paths-file=/etc/do-agent/filesystem-paths.conf

This would collect filesystem metrics only for explicitly configured paths.

Customer-provided file mode

--collector.filesystem.file=/var/lib/do-agent/filesystem-metrics.txt

or:

--collector.filesystem.source=file
--collector.filesystem.file=/var/lib/do-agent/filesystem-metrics.txt

This would allow customers to generate filesystem metrics themselves and let do-agent read and submit them.


Additional defensive behavior

Even when an environment is unsupported, it may also be helpful for the agent to handle repeated filesystem collector failures more defensively.

For example:

  • rate-limit repeated duplicate metric errors;
  • disable only the affected collector after repeated failures;
  • emit one clear warning instead of repeatedly logging the same error;
  • avoid filling system logs when the metrics collector is unhealthy.

A metrics issue should not be able to fill /var/log/messages, exhaust disk space, and contribute to a production service outage.
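The defensive behavior listed above could be sketched as a small wrapper around a collector. This is only an illustration: `CollectorGuard`, `max_failures`, and the log messages are hypothetical and are not taken from do-agent.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("collector-guard")


class CollectorGuard:
    """Wrap a collector so repeated failures cannot flood the logs."""

    def __init__(self, collect: Callable[[], dict], max_failures: int = 5):
        self.collect = collect
        self.max_failures = max_failures
        self.failures = 0
        self.disabled = False

    def run(self) -> Optional[dict]:
        if self.disabled:
            return None  # collector is off; do no work and log nothing
        try:
            result = self.collect()
        except Exception as exc:
            self.failures += 1
            if self.failures == 1:
                # Log only the first failure of a streak.
                log.warning("filesystem collector failed: %s", exc)
            if self.failures >= self.max_failures:
                self.disabled = True
                log.warning(
                    "filesystem collector disabled after %d consecutive "
                    "failures", self.failures)
            return None
        self.failures = 0  # a success ends the failure streak
        return result
```

With this shape, a collector that fails on every scrape produces at most two log lines and is then skipped, while other collectors keep running.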


Related issues

This request may also help with, or relate to, other reports involving CloudLinux support, duplicate metric collection, or high CPU usage from filesystem metric collection.

This feature request is more specific: provide safer alternate filesystem collection modes, such as df-based collection, explicit path collection, or customer-provided filesystem metrics, so that users do not have to choose between unsafe full mount discovery and disabling filesystem metrics entirely.


Trouble Ticket

I also opened a DigitalOcean Support ticket for this incident in April. Support confirmed that CloudLinux/CageFS is not officially supported by do-agent.

This request is not for full CloudLinux support, but for a safer alternative filesystem collection mode that could help unsupported or unusual mount layouts avoid full mount-table discovery.

[#12093409](https://cloudsupport.digitalocean.com/s/case-detail?recordId=500QP00001QvKFRYA3) do-agent 3.18.12 causes runaway logging and high CPU on CloudLinux CageFS Droplet
