
tools: add pg backfill_toofull fix#51

Open
JoshuaGabriel wants to merge 1 commit into
clyso:mainfrom
JoshuaGabriel:toolkit/pg_toofull


@JoshuaGabriel

Collaborator

Reads the failure domain of a pool, then upmaps each backfill_toofull PG to another OSD chosen by % utilization.

@JoshuaGabriel JoshuaGabriel requested a review from sam0044 November 4, 2025 06:59
Signed-off-by: Joshua Blanch <joshua.blanch@clyso.com>
Comment thread docs/backfill-toofull.md
```

Problem:
Usually when a node goes down or when draining capacity, there are some OSDs that become nearfull and eventually can lead to PGs being backfill_toofull warning pops up.

Collaborator


"Usually when a node goes down or when draining capacity" seems broken. Should it be "Usually when a node goes down or when draining with limited capacity"?

@sam0044 sam0044 left a comment

Collaborator


This is a nice script to have. I would rephrase the problem statement to make it clearer. Other than that, the logic looks pretty solid.

@JoshuaGabriel

Collaborator Author

Actually, I don't think this takes the device class of the CRUSH rule into account; I only tried it on an all-NVMe cluster. On a mixed HDD/SSD cluster it could create an upmap to an OSD outside the PG's device class.
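
A minimal sketch of how the device class could be respected, assuming the rule is a dict shaped like one entry of `ceph osd crush rule dump` output (where a class-restricted `take` step carries an `item_name` such as `default~nvme`) and that each OSD record has `id`, `device_class`, and `utilization` keys, as in `ceph osd df -f json`. The helper names `rule_device_class` and `eligible_targets` are hypothetical and not part of this PR.

```python
def rule_device_class(rule):
    """Return the device class a CRUSH rule is restricted to, or None.

    Assumes a class-restricted rule has a 'take' step whose item_name
    looks like '<root>~<class>' (the shadow-tree naming convention).
    """
    for step in rule.get("steps", []):
        if step.get("op") == "take":
            name = step.get("item_name", "")
            if "~" in name:
                return name.split("~", 1)[1]
    return None


def eligible_targets(osds, device_class):
    """Keep OSDs matching device_class (all OSDs if None), least utilized first."""
    matching = [
        o for o in osds
        if device_class is None or o.get("device_class") == device_class
    ]
    return sorted(matching, key=lambda o: o["utilization"])
```

Filtering the upmap candidates through something like `eligible_targets` before picking the least-utilized OSD would keep a PG on an NVMe rule from being mapped to an HDD.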

@dvanders

Collaborator

Tried this today on a Pacific cluster with some 6+3 PGs in backfill_toofull.
It did not output any upmaps.

@dvanders

Collaborator

@bstillwell ^

Comment on lines +187 to +188
if ("nearfull" in st) or ("backfillfull" in st):
    flagged.add(oid)

Collaborator Author


@bstillwell @dvanders
If the OSDs aren't marked as nearfull / backfillfull, they aren't flagged or taken into account. This can happen if the operator manually changes the warning ratios.

It should probably just check for the most utilized OSDs here, but I had assumed that nearfull/backfillfull OSDs were the most utilized,
or
work backwards from the PG that is backfill_toofull to the OSD in it that is 'full'.
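
As a rough sketch of the first option, ranking by utilization instead of relying on the nearfull/backfillfull state could look like the following. It assumes each OSD record is a dict with `id` and `utilization` (percent) keys, as in `ceph osd df -f json`; the function name `most_utilized` and its parameters are hypothetical.

```python
def most_utilized(osds, top_n=5, min_utilization=0.0):
    """Return the ids of the top_n fullest OSDs at or above min_utilization."""
    ranked = sorted(osds, key=lambda o: o["utilization"], reverse=True)
    return [o["id"] for o in ranked[:top_n] if o["utilization"] >= min_utilization]
```

The result could seed `flagged` in place of the state check, so the selection no longer depends on where the operator set the ratios.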

Collaborator Author


We can use the PG state backfill_toofull to determine the 'full' OSD by looking at the up/acting sets.
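
A minimal sketch of that idea, assuming each PG record is a dict with `pgid`, `state`, `up`, and `acting` keys (the shape of the per-PG entries in `ceph pg ls ... -f json`). OSDs that appear in `up` but not in `acting` are typically the backfill targets, so for a backfill_toofull PG they are the likely "full" OSDs. The function name `backfill_target_candidates` is hypothetical.

```python
from collections import Counter

# Placeholder Ceph prints in up/acting sets for a missing shard (CRUSH_ITEM_NONE).
CRUSH_ITEM_NONE = 2147483647


def backfill_target_candidates(pg_stats):
    """Count, per OSD, how many backfill_toofull PGs are trying to backfill onto it."""
    counts = Counter()
    for pg in pg_stats:
        if "backfill_toofull" not in pg["state"]:
            continue
        acting = set(pg["acting"])
        for osd in pg["up"]:
            # In 'up' but not 'acting': the PG still has to move data here.
            if osd != CRUSH_ITEM_NONE and osd not in acting:
                counts[osd] += 1
    return counts
```

This does not depend on the OSD being marked nearfull or backfillfull, so it would still find candidates when the ratios were changed by hand.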


4 participants