Gamifying Health Checks #45

@ColonelPanics

A common issue with checks is that, over time, they can become boring or repetitive, which can lead to the person doing them missing details or losing focus.

We ought to look at ways to keep the manual 20-minute health checks interesting, promote dynamic use cases and improve test quality.

Below is a non-exhaustive list of some ideas we can look at for modifying the health check report submission page to support this.

Personality/Lens

Provide a perspective/lens for the staff member to approach their checks from (e.g. asking themselves "how would this person behave and use the cluster?"):

"You're a new user who's never logged in before" (did you get easily lost?)
"You're a grumpy researcher who's had a bad day" (what annoys you about the experience?)
"You're trying to submit a job that uses 10 nodes" (what does that journey look like?)
"You're suspicious something's wrong with the storage" (how do you prove/disprove that to yourself?)
"Pretend it's 3am and you're on-call, half asleep" (what do unnatural paths through the experience show you?)
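One way the submission page could hand out a lens is sketched below. This is only an illustration: the function name `lens_for` and the idea of seeding on the date plus a staff identifier are assumptions, not something the current page does.

```python
import datetime
import random

# Lens prompts copied from the list above.
LENSES = [
    "You're a new user who's never logged in before",
    "You're a grumpy researcher who's had a bad day",
    "You're trying to submit a job that uses 10 nodes",
    "You're suspicious something's wrong with the storage",
    "Pretend it's 3am and you're on-call, half asleep",
]


def lens_for(day: datetime.date, staff_id: str) -> str:
    """Pick a lens for one staff member on one day (hypothetical helper).

    Seeding on the date and staff id keeps the choice stable if the
    page is reloaded, while still varying from day to day.
    """
    rng = random.Random(f"{day.isoformat()}:{staff_id}")
    return rng.choice(LENSES)


if __name__ == "__main__":
    print(lens_for(datetime.date.today(), "example-staff-member"))
```

Seeding per person means two staff members checking on the same day can end up with different lenses.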

Variable Depths

The tests are designed to take roughly 20 minutes in total. Currently the suggested amount of time for each area is fixed. We could vary this day by day to encourage greater focus on different areas (which would also promote discovering new ways of looking at each area).

Basic: 1-3 mins
Moderate: 3-5 mins
Advanced: 7-10 mins

How do we keep the tests roughly within the 20 min window with the variable depths?
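One possible answer, sketched below, is to only offer depth assignments whose combined time range can still contain 20 minutes. The sketch assumes the five areas named in the Restrictions section are the full set of areas, and the "range must contain the budget" rule is an assumption for illustration rather than an agreed policy.

```python
import itertools
import random

# Depth levels and their (min, max) minutes, from the list above.
DEPTHS = {"Basic": (1, 3), "Moderate": (3, 5), "Advanced": (7, 10)}

# Assumed full set of areas, taken from the Restrictions section.
AREAS = [
    "Home Dir, Quota & Filesystem",
    "Scratch/Lustre, Filesystem Experience",
    "Scheduler",
    "GPU/Compute Node Availability",
    "Test Job Submission",
]

BUDGET_MINS = 20


def feasible_plans(areas=AREAS, budget=BUDGET_MINS):
    """Return every depth assignment whose total time range contains the budget."""
    plans = []
    for combo in itertools.product(DEPTHS, repeat=len(areas)):
        low = sum(DEPTHS[depth][0] for depth in combo)
        high = sum(DEPTHS[depth][1] for depth in combo)
        if low <= budget <= high:
            plans.append(dict(zip(areas, combo)))
    return plans


def plan_for(seed: str):
    """Pick one feasible plan; the same seed always gives the same plan."""
    return random.Random(seed).choice(feasible_plans())


if __name__ == "__main__":
    for area, depth in plan_for("2024-01-01").items():
        print(f"{depth:<9} {area}")
```

With these ranges, "all Basic" (5 to 15 minutes) and "all Advanced" (35 to 50 minutes) are filtered out, while mixes such as one Advanced, one Moderate and three Basic areas remain available.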

Restrictions

Many of the methods in the tests rely heavily on a particular command to verify them. These example commands aren't the only way to do it, but they are the more familiar ones. To encourage finding new approaches, we can add restrictions on how a particular day's checks are run, for example:

  • Home Dir, Quota & Filesystem:
    • Not allowed to use dd
    • Not allowed to use /dev/zero
  • Scratch/Lustre, Filesystem Experience:
    • Not allowed to use dd
    • Not allowed to use /dev/zero
    • Not allowed to use find
  • Scheduler:
    • Not allowed to use sinfo
  • GPU/Compute Node Availability:
    • Not allowed to use sinfo
  • Test Job Submission:
    • Not allowed to use sleep, echo, etc. but the job must run for a few minutes

There are likely many more things we can do to block certain testing pathways in order to discover new ones.
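A rough sketch of how a day's restrictions could be drawn from the list above follows. The `restrictions_for` helper and its `max_areas` parameter are hypothetical; limiting the number of restricted areas per day is an assumption made to keep the checks achievable.

```python
import datetime
import random

# Restrictions per area, copied from the list above.
RESTRICTIONS = {
    "Home Dir, Quota & Filesystem": [
        "Not allowed to use dd",
        "Not allowed to use /dev/zero",
    ],
    "Scratch/Lustre, Filesystem Experience": [
        "Not allowed to use dd",
        "Not allowed to use /dev/zero",
        "Not allowed to use find",
    ],
    "Scheduler": ["Not allowed to use sinfo"],
    "GPU/Compute Node Availability": ["Not allowed to use sinfo"],
    "Test Job Submission": [
        "Not allowed to use sleep, echo, etc. but the job must run for a few minutes",
    ],
}


def restrictions_for(day: datetime.date, max_areas: int = 2) -> dict:
    """Pick one restriction for up to max_areas areas (hypothetical helper)."""
    rng = random.Random(day.isoformat())
    # Sort the area names so the sample is drawn from a stable ordering.
    count = min(max_areas, len(RESTRICTIONS))
    chosen = rng.sample(sorted(RESTRICTIONS), k=count)
    return {area: rng.choice(RESTRICTIONS[area]) for area in chosen}


if __name__ == "__main__":
    for area, rule in restrictions_for(datetime.date.today()).items():
        print(f"{area}: {rule}")
```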

Order

Randomise the order in which the areas are presented.
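A minimal sketch of a per-day shuffle, assuming the areas are the five named in the Restrictions section and that the order should stay the same for everyone on a given day (both are assumptions; `order_for` is a hypothetical name):

```python
import datetime
import random

# Assumed full set of areas, taken from the Restrictions section.
AREAS = [
    "Home Dir, Quota & Filesystem",
    "Scratch/Lustre, Filesystem Experience",
    "Scheduler",
    "GPU/Compute Node Availability",
    "Test Job Submission",
]


def order_for(day: datetime.date) -> list:
    """Return the areas in a shuffled order that is stable for the given day."""
    rng = random.Random(day.isoformat())
    # sample() with k=len(AREAS) returns a shuffled copy and leaves AREAS intact.
    return rng.sample(AREAS, k=len(AREAS))


if __name__ == "__main__":
    for position, area in enumerate(order_for(datetime.date.today()), start=1):
        print(f"{position}. {area}")
```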

Collaboration Level

Changing how the staff member collaborates on the check may bring new questions, ideas and evolution to the checks process.

Currently every cluster health check is "solo". We could look at having them be:

  • Passengered: Someone runs the health check while another staff member watches them check it and questions everything
  • Split: Each person only does half a health check then swaps to do the second half of a different one (areas they cover could overlap in these)
  • Verbal: The staff member performing the health check is assigned another staff member who they must explain the cluster status to (without referencing their completed report)
