Gamifying Health Checks #45

A common problem with routine checks is that they become boring or repetitive over time, which can lead the person doing them to lose focus or miss details.

We ought to look at ways to keep the manual 20-minute health checks interesting, promote dynamic use cases and improve test quality.

Below is a non-exhaustive list of some ideas we can look at for modifying the health check report submission page to support this.


### Personality/Lens

Provide a perspective/lens through which the staff member approaches their checks (e.g. asking themselves "how would this person behave and use the cluster?"):

- "You're a new user who's never logged in before" (did you get easily lost?)
- "You're a grumpy researcher who's had a bad day" (what annoys you about the experience?)
- "You're trying to submit a job that uses 10 nodes" (what does that journey look like?)
- "You're suspicious something's wrong with the storage" (how do you prove/disprove that to yourself?)
- "Pretend it's 3am and you're on-call, half asleep" (what do unnatural paths through the experience show you?)

### Variable Depths

The tests are designed to take roughly 20 minutes. Currently the suggested amount of time for each area is rigid. We could vary this day by day to encourage greater focus on different areas (which would also promote discovering new ways of looking at each area):

- Basic: 1-3 mins
- Moderate: 3-5 mins
- Advanced: 7-10 mins

How do we keep the tests roughly within the 20-minute window with variable depths?
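One possible answer, as a rough sketch rather than a proposal for the actual implementation: enumerate every assignment of depths to areas and keep only those whose combined time range can contain the 20-minute budget, then pick one per day. The area names are borrowed from the Restrictions section and `feasible_plans` / `plan_for_day` are hypothetical helper names; the real area list and the submission page's language may differ.

```python
import itertools
import random
from datetime import date

# Depth levels and their (min, max) minutes, as listed above.
DEPTHS = {"Basic": (1, 3), "Moderate": (3, 5), "Advanced": (7, 10)}

# Assumption: area names borrowed from the Restrictions section.
AREAS = [
    "Home Dir, Quota & Filesystem",
    "Scratch/Lustre, Filesystem Experience",
    "Scheduler",
    "GPU/Compute Node Availability",
    "Test Job Submission",
]

BUDGET_MINUTES = 20


def feasible_plans(areas, budget=BUDGET_MINUTES):
    """Return every area -> depth mapping whose total time range contains the budget."""
    plans = []
    for combo in itertools.product(DEPTHS, repeat=len(areas)):
        low = sum(DEPTHS[depth][0] for depth in combo)
        high = sum(DEPTHS[depth][1] for depth in combo)
        if low <= budget <= high:
            plans.append(dict(zip(areas, combo)))
    return plans


def plan_for_day(day, areas=AREAS):
    """Pick one feasible plan, seeded by the date so everyone sees the same plan that day."""
    rng = random.Random(day.isoformat())
    return rng.choice(feasible_plans(areas))
```

Calling `plan_for_day(date.today())` returns a mapping from each area to a depth level. Filtering on the combined range excludes plans such as "everything Basic" (at most 15 minutes) or "everything Advanced" (at least 35 minutes), while still leaving plenty of day-to-day variety.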

### Restrictions

Many of the methods in the tests rely heavily on one particular command for verification. The commands in the examples aren't the only ways to do it, just the most familiar ones. To encourage finding new approaches, we can add restrictions on how a particular day's checks are run, for example:

- Home Dir, Quota & Filesystem:
  - Not allowed to use `dd`
  - Not allowed to use `/dev/zero`
- Scratch/Lustre, Filesystem Experience:
  - Not allowed to use `dd`
  - Not allowed to use `/dev/zero`
  - Not allowed to use `find`
- Scheduler: 
  - Not allowed to use `sinfo`
- GPU/Compute Node Availability: 
  - Not allowed to use `sinfo`
- Test Job Submission: 
  - Not allowed to use `sleep`, `echo`, etc., but the job must run for a few minutes
  
There are likely many more ways to block certain testing pathways in order to discover new ones.
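As an illustration of how the submission page might choose a day's restrictions, here is a minimal sketch. It assumes the restrictions are stored as a simple mapping from area to banned commands (the entries below are just the examples above), and `restrictions_for_day` is a hypothetical helper name.

```python
import random
from datetime import date

# The example restrictions from the list above, keyed by area.
RESTRICTIONS = {
    "Home Dir, Quota & Filesystem": ["dd", "/dev/zero"],
    "Scratch/Lustre, Filesystem Experience": ["dd", "/dev/zero", "find"],
    "Scheduler": ["sinfo"],
    "GPU/Compute Node Availability": ["sinfo"],
    "Test Job Submission": ["sleep", "echo"],
}


def restrictions_for_day(day, max_per_area=1):
    """Pick up to max_per_area banned commands per area, seeded by the date."""
    rng = random.Random(f"restrictions-{day.isoformat()}")
    chosen = {}
    for area, banned in RESTRICTIONS.items():
        count = min(max_per_area, len(banned))
        chosen[area] = sorted(rng.sample(banned, k=count))
    return chosen
```

Limiting the number of restrictions per area (here one by default) is a design choice to keep the check achievable within the time window; raising `max_per_area` makes a day's check harder.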

### Order

Randomise the order in which the areas are presented.
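A minimal sketch of what this could look like, assuming the area names from the Restrictions section and a hypothetical `username` argument so that each staff member gets a different but reproducible order on a given day:

```python
import random
from datetime import date

# Assumption: area names borrowed from the Restrictions section.
AREAS = [
    "Home Dir, Quota & Filesystem",
    "Scratch/Lustre, Filesystem Experience",
    "Scheduler",
    "GPU/Compute Node Availability",
    "Test Job Submission",
]


def ordered_areas(day, username, areas=AREAS):
    """Return the areas in a shuffled order that is stable for a given day and user."""
    rng = random.Random(f"order-{day.isoformat()}-{username}")
    shuffled = list(areas)  # copy so the original list is left untouched
    rng.shuffle(shuffled)
    return shuffled
```

Seeding on the date and user means reloading the page does not reshuffle the check halfway through.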

### Collaboration Level

Changing how the staff member collaborates on the check may bring new questions, ideas and evolution to the checks process. 

Currently every cluster health check is "solo". We could look at having them be:
- Passengered: Someone runs the health check while another staff member watches them check it and questions everything
- Split: Each person only does half a health check then swaps to do the second half of a different one (areas they cover could overlap in these)
- Verbal: The staff member performing the health check is assigned another staff member who they must explain the cluster status to (*without referencing their completed report*)


