tools: add pg backfill_toofull fix #51
Signed-off-by: Joshua Blanch <joshua.blanch@clyso.com>
Force-pushed from 189310e to 9ebe69c.
| ``` | ||
| Problem: | ||
| Usually when a node goes down or when draining capacity, there are some OSDs that become nearfull and eventually can lead to PGs being backfill_toofull warning pops up. |
"Usually when a node goes down or when draining capacity" seems broken. Should it be "Usually when a node goes down or when draining with limited capacity"?
sam0044
left a comment
This is a nice script to have. I would rephrase the problem statement to make it clearer. Other than that, the logic looks pretty solid.
Actually, I don't think this takes the CRUSH rule's device class into account; I have only tried it on an all-NVMe cluster. On a mixed HDD/SSD cluster it could create an upmap to an OSD outside the PG's device class.
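One possible way to handle the device-class concern, as a rough sketch only: restrict upmap candidates to OSDs that share the source OSD's device class. The record layout (`id`, `device_class`, `utilization`) and the helper name `same_class_candidates` are assumptions for illustration, not what the script in this PR does. Matching the source OSD's class is also a simplification; reading the class from the pool's CRUSH rule would be stricter.

```python
def same_class_candidates(osds, source_id):
    """Return OSDs sharing the source OSD's device class, least utilized first.

    `osds` is a hypothetical list of dicts with `id`, `device_class`
    and `utilization` keys (assumed layout, not the script's own).
    """
    by_id = {o["id"]: o for o in osds}
    wanted = by_id[source_id]["device_class"]
    candidates = [
        o for o in osds
        if o["id"] != source_id and o["device_class"] == wanted
    ]
    return sorted(candidates, key=lambda o: o["utilization"])


# Made-up example data for a mixed hdd/ssd cluster.
osds = [
    {"id": 0, "device_class": "hdd", "utilization": 91.0},
    {"id": 1, "device_class": "ssd", "utilization": 20.0},
    {"id": 2, "device_class": "hdd", "utilization": 55.0},
    {"id": 3, "device_class": "hdd", "utilization": 40.0},
]
print([o["id"] for o in same_class_candidates(osds, 0)])
```

With this filter the ssd OSD is never offered as a target for a PG that lives on hdd, even though it is the least utilized OSD overall.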
Tried this today on a Pacific cluster with some 6+3 PGs in backfill_toofull.
| if ("nearfull" in st) or ("backfillfull" in st): | ||
| flagged.add(oid) |
@bstillwell @dvanders
If the OSDs aren't marked as nearfull / backfillfull, they aren't flagged and so aren't taken into account. This can happen if an operator manually changes the warning ratios.
It should probably just check for the most utilized OSDs here, but I had assumed that the nearfull / backfillfull OSDs were the most utilized
or
work backwards from the PG that is backfill_toofull to the OSD in that PG that is 'full'
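A minimal sketch of the first option, assuming each OSD record carries a `state` list and a `utilization` percentage (both field names, plus `flag_osds` and `top_n`, are hypothetical): keep the existing nearfull / backfillfull check, and fall back to the most utilized OSDs when nothing is flagged.

```python
def flag_osds(osds, top_n=2):
    """Flag OSDs reported nearfull/backfillfull, else the top_n most utilized.

    `osds` is an assumed list of dicts with `id`, `state` (list of
    strings) and `utilization` (percent) keys.
    """
    flagged = set()
    for o in osds:
        st = o.get("state", [])
        if ("nearfull" in st) or ("backfillfull" in st):
            flagged.add(o["id"])
    if flagged:
        return flagged
    # Fallback: the warning ratios may have been changed by an operator,
    # so no OSD is marked even though some are close to full.
    ranked = sorted(osds, key=lambda o: o["utilization"], reverse=True)
    return {o["id"] for o in ranked[:top_n]}


# Made-up example data.
marked = [
    {"id": 0, "state": ["up", "nearfull"], "utilization": 86.0},
    {"id": 1, "state": ["up"], "utilization": 70.0},
]
unmarked = [
    {"id": 0, "state": ["up"], "utilization": 60.0},
    {"id": 1, "state": ["up"], "utilization": 82.0},
    {"id": 2, "state": ["up"], "utilization": 75.0},
]
print(flag_osds(marked), flag_osds(unmarked))
```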
We can use the PG state backfill_toofull to determine the 'full' OSD by looking at the up/acting set.
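Roughly what that could look like, as a sketch under assumptions: treat the OSDs that appear in the up set but not in the acting set as the backfill targets, and take the most utilized of those as the likely 'full' OSD. The PG record layout (`pgid`, `state`, `up`, `acting`), the `utilization` mapping, and both helper names are illustrative and not taken from the script.

```python
def backfill_targets(pg):
    """OSDs in the up set that are not yet in the acting set."""
    acting = set(pg["acting"])
    return [osd for osd in pg["up"] if osd not in acting]


def likely_full_osd(pg, utilization):
    """Guess which backfill target is blocking a backfill_toofull PG.

    `utilization` is an assumed {osd_id: percent} mapping.
    Returns None if the PG is not backfill_toofull or has no target.
    """
    if "backfill_toofull" not in pg["state"]:
        return None
    targets = backfill_targets(pg)
    if not targets:
        return None
    return max(targets, key=lambda osd: utilization.get(osd, 0.0))


# Made-up example data.
pg = {
    "pgid": "12.1f",
    "state": "active+remapped+backfill_toofull",
    "up": [4, 7, 9],
    "acting": [4, 7, 2],
}
utilization = {2: 61.0, 4: 70.0, 7: 72.5, 9: 91.3}
print(likely_full_osd(pg, utilization))
```

This does not depend on the nearfull / backfillfull flags at all, so it still works when the warning ratios have been changed.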
Reads the failure domain of a pool, then upmaps the backfill_toofull PG to another OSD based on % utilization.
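For illustration, the selection step might look something like the sketch below. It is not the script's actual code: the `osds` mapping, its `bucket` field (the OSD's failure-domain bucket, e.g. its host), and the helpers `pick_target` and `upmap_command` are assumptions. The only real CLI piece is the `ceph osd pg-upmap-items <pgid> <from> <to>` command form, which the sketch builds as a string and does not run.

```python
def pick_target(pg_up, source, osds):
    """Pick the least utilized OSD that keeps failure domains distinct.

    `osds` is an assumed {osd_id: {"bucket": str, "utilization": float}}
    mapping. Buckets used by the PG's other OSDs are excluded; the
    source's own bucket is allowed because the source is being replaced.
    """
    used = {osds[o]["bucket"] for o in pg_up if o != source}
    candidates = [
        oid for oid, info in osds.items()
        if oid not in pg_up and info["bucket"] not in used
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda oid: osds[oid]["utilization"])


def upmap_command(pgid, source, target):
    """Build (but do not run) the upmap command for one PG."""
    return f"ceph osd pg-upmap-items {pgid} {source} {target}"


# Made-up example data with host as the failure domain.
osds = {
    0: {"bucket": "hostA", "utilization": 90.0},
    1: {"bucket": "hostB", "utilization": 50.0},
    2: {"bucket": "hostC", "utilization": 60.0},
    3: {"bucket": "hostA", "utilization": 30.0},
    4: {"bucket": "hostB", "utilization": 10.0},
    5: {"bucket": "hostD", "utilization": 45.0},
}
target = pick_target([0, 1, 2], 0, osds)
print(upmap_command("3.1a", 0, target))
```

OSD 4 is the least utilized overall but is skipped because hostB already holds another OSD of the PG.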