A common issue with checks is that, over time, they become boring or repetitive, which can lead the person doing them to miss details or lose focus.
We ought to look at ways to keep the manual 20-minute health checks interesting, promote dynamic use cases and improve test quality.
Below is a non-exhaustive list of some ideas we can look at for modifying the health check report submission page to support this.
Personality/Lens
Provide a perspective or lens through which the staff member approaches their checks (e.g. asking themselves "how would this person behave and use the cluster?"):
- "You're a new user who's never logged in before" (did you get easily lost?)
- "You're a grumpy researcher who's had a bad day" (what annoys you about the experience?)
- "You're trying to submit a job that uses 10 nodes" (what does that journey look like?)
- "You're suspicious something's wrong with the storage" (how do you prove/disprove that to yourself?)
- "Pretend it's 3am and you're on-call, half asleep" (what do unnatural paths through the experience show you?)
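One way the submission page could choose a lens is to rotate through the list by date, so that everyone checking on the same day shares a lens and each lens comes up equally often. This is only a minimal sketch: the `lens_for` helper and the `LENSES` list are hypothetical names, not part of any existing page code.

```python
import datetime

# Lens prompts copied from the list above.
LENSES = [
    "You're a new user who's never logged in before",
    "You're a grumpy researcher who's had a bad day",
    "You're trying to submit a job that uses 10 nodes",
    "You're suspicious something's wrong with the storage",
    "Pretend it's 3am and you're on-call, half asleep",
]


def lens_for(day: datetime.date, lenses=LENSES) -> str:
    """Return the lens for a given day by rotating through the list.

    Hypothetical helper: uses the day's ordinal number so that
    consecutive days get consecutive lenses.
    """
    return lenses[day.toordinal() % len(lenses)]


if __name__ == "__main__":
    print(lens_for(datetime.date.today()))
```

A plain rotation is predictable, which may or may not be desirable; a random pick per staff member would be an equally simple alternative.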
Variable Depths
The tests are designed to take roughly 20 mins. Currently the suggested amount of time for each area is fixed. We could vary this day by day to encourage greater focus on different areas (which would also promote discovering new ways of looking at each area):
- Basic: 1-3 mins
- Moderate: 3-5 mins
- Advanced: 7-10 mins
How do we keep the tests roughly within the 20 min window with the variable depths?
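One possible answer, sketched below, is to draw a depth for each area at random and only accept a draw whose combined time range still contains the 20-minute target. The area names are taken from the Restrictions section; the `fits_budget` and `depths_for` helpers, the retry limit and the acceptance rule itself are assumptions made for illustration, not an agreed design.

```python
import datetime
import random

# Depth levels and their (min, max) minutes, from the list above.
DEPTHS = {"Basic": (1, 3), "Moderate": (3, 5), "Advanced": (7, 10)}

# Check areas as named in the Restrictions section.
AREAS = [
    "Home Dir, Quota & Filesystem",
    "Scratch/Lustre, Filesystem Experience",
    "Scheduler",
    "GPU/Compute Node Availability",
    "Test Job Submission",
]

BUDGET_MINS = 20


def fits_budget(assignment, budget=BUDGET_MINS):
    """True if the budget lies between the summed minimum and maximum times."""
    low = sum(DEPTHS[depth][0] for depth in assignment.values())
    high = sum(DEPTHS[depth][1] for depth in assignment.values())
    return low <= budget <= high


def depths_for(day, areas=AREAS, budget=BUDGET_MINS, max_tries=1000):
    """Pick a depth per area for the given day (hypothetical helper).

    Seeding with the date makes the result repeatable for that day.
    Draws that cannot land on the budget are rejected and retried.
    """
    rng = random.Random(day.toordinal())
    levels = list(DEPTHS)
    for _ in range(max_tries):
        assignment = {area: rng.choice(levels) for area in areas}
        if fits_budget(assignment, budget):
            return assignment
    raise ValueError("no depth assignment fits the budget")


if __name__ == "__main__":
    for area, depth in depths_for(datetime.date.today()).items():
        print(f"{area}: {depth} {DEPTHS[depth]} mins")
```

With five areas, an all-Basic day tops out at 15 minutes and is rejected, while any day with exactly one Advanced area fits, so the rule naturally pushes towards one or two deeper areas per day.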
Restrictions
Many of the methods in the tests rely heavily on one particular command for verification. These commands aren't the only way to do it; they are just the most familiar. To encourage new usage and new approaches, we can add restrictions on how a particular day's checks are run, for example:
- Home Dir, Quota & Filesystem:
  - Not allowed to use dd
  - Not allowed to use /dev/zero
- Scratch/Lustre, Filesystem Experience:
  - Not allowed to use dd
  - Not allowed to use /dev/zero
  - Not allowed to use find
- Scheduler:
  - Not allowed to use sinfo
- GPU/Compute Node Availability:
  - Not allowed to use sinfo
- Test Job Submission:
  - Not allowed to use sleep, echo, etc., but the job must run for a few minutes
There are likely many more things we can do to block certain testing pathways in order to discover new ones.
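If restrictions were stored as data, the submission page could flag a pasted command that breaks the day's rules. The sketch below covers a subset of the examples above; the `RESTRICTIONS` mapping and the `violations` helper are hypothetical, and the matching is deliberately naive (it tokenises with `shlex.split` and would miss commands hidden inside scripts or joined with `;` without spaces).

```python
import shlex

# Banned commands and paths per area, taken from the examples above.
RESTRICTIONS = {
    "Home Dir, Quota & Filesystem": ["dd", "/dev/zero"],
    "Scratch/Lustre, Filesystem Experience": ["dd", "/dev/zero", "find"],
    "Test Job Submission": ["sleep", "echo"],
}


def violations(area, command_line):
    """Return the banned terms that appear in a command line.

    Hypothetical helper. Terms starting with "/" are treated as paths and
    matched anywhere inside a token (so "if=/dev/zero" counts); other terms
    are treated as command names and must match a whole token or the final
    component of a path such as "/usr/bin/find".
    """
    tokens = shlex.split(command_line)
    found = []
    for term in RESTRICTIONS.get(area, []):
        if term.startswith("/"):
            hit = any(term in token for token in tokens)
        else:
            hit = any(
                token == term or token.endswith("/" + term) for token in tokens
            )
        if hit:
            found.append(term)
    return found


if __name__ == "__main__":
    print(violations(
        "Home Dir, Quota & Filesystem",
        "dd if=/dev/zero of=testfile bs=1M count=100",
    ))
```

A check like this is best treated as a nudge for the person running the check rather than enforcement, since the goal is to prompt new approaches.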
Order
Randomise the order in which the areas are presented.
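A date-seeded shuffle is one simple way to do this, since it gives the same order to everyone on a given day and a different order from day to day. This is a sketch only: the `area_order` helper is a hypothetical name, and the area list is taken from the Restrictions section.

```python
import datetime
import random

# Check areas as named in the Restrictions section.
AREAS = [
    "Home Dir, Quota & Filesystem",
    "Scratch/Lustre, Filesystem Experience",
    "Scheduler",
    "GPU/Compute Node Availability",
    "Test Job Submission",
]


def area_order(day, areas=AREAS):
    """Return the areas in a shuffled order that is stable for a given day.

    Hypothetical helper: works on a copy so the original list is untouched.
    """
    rng = random.Random(day.toordinal())
    order = list(areas)
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    for position, area in enumerate(area_order(datetime.date.today()), start=1):
        print(f"{position}. {area}")
```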
Collaboration Level
Changing how the staff member collaborates on the check may bring new questions and ideas, and help the checks process evolve.
Currently every cluster health check is "solo". We could look at making them:
- Passengered: One staff member runs the health check while another watches and questions everything
- Split: Each person does only half of a health check, then swaps to do the second half of a different one (the areas they cover could overlap)
- Verbal: The staff member performing the health check is assigned another staff member to whom they must explain the cluster status (without referencing their completed report)