This is internal tooling developed at MauKerja Malaysia for distributed workload scheduling across teams. It orchestrates large-scale Playwright end-to-end UI automation tests for MauKerja Malaysia, Ricebowl Malaysia, internal dashboards, and similar platforms.
After a round of discussion, we decided to open-source it with some constraints, so not all of the critical functionality is disclosed.
At a high level, we ran into constraints and inertia that left our QA engineering team waiting in a long queue whenever they ran their automation tests, because the resources on our self-hosted runner are shared and limited. The conditions were:
- about a hundred E2E flows
- limited concurrency resources
- competing test execution demand
- multiple teams relying heavily on Playwright
The problem shifted from "How do we automate tests?" to "How do we orchestrate test execution efficiently?"
In essence, our testing infrastructure looks like this (it's pretty straightforward):
flowchart LR
A(GitHub CI)-->B[Shared runners]
B-->C[Playwright Workers]
C-->D[Browser Instances]
D-->E[Target Environment]
The major bottleneck and pain point is not Playwright or our UI automation framework. It lies in runner contention, browser resource exhaustion (which has crashed the website during automation runs), environment instability that leads to flaky retries (which should never be accepted as normal), and a heavily imbalanced test suite distribution.
Imagine that, on a daily basis, we run:
1. 100 E2E flows
2. averaging 1 minute per test suite
3. on shared runners
4. for teams that need to deploy immediately
The whole E2E pipeline takes roughly 1-2 hours, which is consistent with about 100 flows at a minute each running one after another. That is far from ideal when multiple teams test against different platforms (different test scenarios, environments, and so on). So we are running an experiment around "queue-driven allocation".
In practice, it could improve:
- Test distribution fairness
- Test allocation throughput
- Test execution predictability
Instead of the naive scheduling we have right now, which runs everything sequentially, we are tweaking the flow so that we:
- discover the test directories
- enqueue them first
- let workers pull jobs dynamically
- have each worker launch a batch with Playwright
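The steps above can be sketched as a shared queue that a fixed number of workers drain. This is a minimal sketch rather than the project's actual code: `runQueue`, `BatchRunner`, and `RunSummary` are hypothetical names, and the function that actually launches Playwright is injected so the scheduling logic stays visible on its own.

```typescript
// Hypothetical types and names for illustration; not the project's actual API.
type Batch = string[]; // a batch is a list of spec file paths

type BatchRunner = (workerId: number, batch: Batch) => Promise<void>;

interface RunSummary {
  completed: number;
  failed: number;
}

async function runQueue(
  batches: Batch[],
  workerCount: number,
  runBatch: BatchRunner,
): Promise<RunSummary> {
  const queue = [...batches]; // shared FIFO queue; copied so the input is untouched
  const summary: RunSummary = { completed: 0, failed: 0 };

  // Each worker loops: pull the next batch, run it, repeat until the queue is empty.
  async function worker(workerId: number): Promise<void> {
    for (;;) {
      const batch = queue.shift();
      if (batch === undefined) return;
      try {
        await runBatch(workerId, batch);
        summary.completed += 1;
      } catch {
        summary.failed += 1; // a failed batch does not stop the other workers
      }
    }
  }

  // At least one worker is always started, even if workerCount is 0 or negative.
  const workers = Array.from({ length: Math.max(1, workerCount) }, (_, i) => worker(i + 1));
  await Promise.all(workers);
  return summary;
}
```

In a real run, `runBatch` would presumably spawn a Playwright process for the batch's spec files. Because workers pull work only when they are free, a slow batch delays one worker instead of the whole pipeline.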
With that, the testing infrastructure looks somewhat like this:
flowchart LR
A[Test Discovery] --> B[fastqueue-playwright]
B--Worker 1-->C[Playwright]
B--Worker 2-->D[Playwright]
B--Worker 3-->E[Playwright]
The biggest difference: Playwright workers optimize test execution inside a single Playwright test run, while this experiment optimizes system-wide scheduling by coordinating limited resources. The two are somewhat related, but fundamentally they sit on different layers.
When we run Playwright with, say, 5 workers, Playwright internally (1) discovers our test specs, (2) partitions them, (3) spawns worker processes, and (4) executes the tests in parallel. This is excellent for local parallelism and straightforward CI pipelines.
However, once you look at the problem from a higher level, relying on pure Playwright workers hits a hard limit, since Playwright only sees its own execution scope.
It knows nothing about:
- Other teams' pipeline
- Shared runner pressure
- Global resource contention
- Cross-project/team urgency or prioritization
- Different target environments
Once all of these surface together, they become the actual problems we have encountered so far. To put it simply, the typical behavior looks like this:
Without queue-driven allocation
CI starts
↓
Playwright spawns N workers
↓
All test suites start aggressively
With queue-driven allocation
CI starts
↓
Tests enter centralized queue
↓
Workers pull dynamically
↓
Concurrency controlled globally
↓
Test suites are prioritized
Running the orchestrator is pretty straightforward: clone the repo, install the dependencies, and then run the orchestrator with:
npx ts-node src/orchestrator.ts
Assumptions: the test suites to be orchestrated are located under the ./tests directory, and the test files follow the *.spec.ts naming pattern.
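To make the discovery assumption concrete, here is a minimal sketch of how spec files could be collected from a directory tree. `discoverSpecs` is a hypothetical helper written for this README, not the contents of the real discovery.ts, and the sorting step is an added assumption.

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical helper: recursively collect files under `rootDir` whose names
// end with `suffix` (".spec.ts" by default, matching the assumption above).
function discoverSpecs(rootDir: string, suffix: string = ".spec.ts"): string[] {
  if (!fs.existsSync(rootDir)) return []; // nothing to discover
  const found: string[] = [];
  for (const entry of fs.readdirSync(rootDir, { withFileTypes: true })) {
    const fullPath = path.join(rootDir, entry.name);
    if (entry.isDirectory()) {
      found.push(...discoverSpecs(fullPath, suffix)); // descend into subfolders
    } else if (entry.isFile() && entry.name.endsWith(suffix)) {
      found.push(fullPath);
    }
  }
  return found.sort(); // assumption: sorted output keeps batches reproducible
}
```

Calling `discoverSpecs("./tests")` would return the spec paths that get enqueued. A missing directory yields an empty list rather than an error, which is a design choice of this sketch.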
If you want to limit concurrency, set the WORKERS variable, which acts as the global scheduler limit, in the orchestrator.ts file.
Once you run the command, it discovers the test files, and the workers start pulling batches and running the tests. The runtime output looks something like this:
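As an illustration of how such a limit could be read safely, here is a small sketch that parses WORKERS and BATCH_SIZE from an environment-like object. The variable names come from the workflow example in this README, but `loadConfig`, `positiveInt`, and the fallback value of 5 are assumptions, not the actual orchestrator.ts logic.

```typescript
// Hypothetical config shape; the parsing rules and defaults are assumptions.
interface SchedulerConfig {
  workers: number;
  batchSize: number;
}

type Env = Record<string, string | undefined>;

// Accept only positive integers; fall back when the value is missing or invalid.
function positiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

function loadConfig(env: Env): SchedulerConfig {
  return {
    workers: positiveInt(env["WORKERS"], 5),
    batchSize: positiveInt(env["BATCH_SIZE"], 5),
  };
}
```

In a Node process this would be called as `loadConfig(process.env)`. Rejecting zero, negative, and non-numeric values avoids starting a run with no workers at all.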
[Worker 1] is running tests with batches
[Worker 2] is running tests with batches
[Worker 3] is running tests with batches
[Worker 4] failed to run tests: Error: Command failed: npx playwright test ...
All batches were completed
Run the tests
If you want to run the tests locally, you can execute:
npm test
# or for watch mode
npm run test:watch
This runs 34 test cases, comprising unit tests for individual modules, integration tests around discovery.ts and worker.ts, and an orchestrator test.
If you want to use this messed-up codebase in your actual GitHub Actions workflow, just drop a few things into the .yml file of the relevant repo and it's good to go. For example, you could use this setup:
env:
  WORKERS: ${{ github.event.inputs.workers || '5' }}
  BATCH_SIZE: ${{ github.event.inputs.batch_size || '5' }}
  MAX_RETRIES: "3"
  SPEC_PATTERN: ".spec.ts"
  WORKER_TIMEOUT: "600000" # set a reasonable worker timeout
jobs:
  e2e:
    name: Run E2E orchestrator
    runs-on: self-hosted # target your shared self-hosted runner pool
    # add a job-level timeout as a hard ceiling, meaning even if a worker
    # hangs and WORKER_TIMEOUT somehow doesn't fire, the job won't run forever
    timeout-minutes: 60
    ### rest of your workflow ###
      - name: Run orchestrator
        run: npx ts-node src/orchestrator.ts
        env:
          WORKERS: ${{ env.WORKERS }}
          BATCH_SIZE: ${{ env.BATCH_SIZE }}
          MAX_RETRIES: ${{ env.MAX_RETRIES }}
          SPEC_PATTERN: ${{ env.SPEC_PATTERN }}
          WORKER_TIMEOUT: ${{ env.WORKER_TIMEOUT }}
Important
One thing to watch out for: if you don't use a self-hosted runner, you need to carefully order the Playwright browser installation relative to the cache restore step. Otherwise, the browsers may be installed twice. For most self-hosted configurations this is not a big deal, since the cache hit rate will be high.
Without batching, each enqueued item corresponds to a single test spec. The problem is that this causes too many Playwright process startups and too many browser initializations, which leads to inefficient CPU and memory utilization. In short: it's an expensive process.
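Batching addresses this by grouping several specs into one queue item. A minimal sketch, assuming a batch is just a fixed-size slice of the discovered spec list (`toBatches` is a hypothetical name, not the project's function):

```typescript
// Split items into batches of at most `batchSize` entries each.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize)); // the last batch may be smaller
  }
  return batches;
}
```

With 100 specs and a batch size of 5, this yields 20 queue items, so if each batch maps to one Playwright invocation there are 20 process startups instead of 100.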
Future considerations that could significantly improve this experiment and the setup we have right now:
- Priority scheduling based on P0, P1, P2 tagging. The setup is likely much more complex, but we don't know yet
- Result aggregator. Unifies and collects all test execution results, including retries and error traces
- Observability layer. Another complex setup, but one that becomes extremely important later (insya Allah), since it would capture:
  - Queue metrics
  - Worker metrics
  - E2E test metrics
- A more advanced setup? Workers across Kubernetes + VMs + multiple runners
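To give a feel for the priority scheduling idea, here is one possible sketch that orders queued jobs by their P0/P1/P2 tag before workers start pulling. Nothing here exists in the project yet: `QueuedJob`, `orderByPriority`, and the way a suite receives its tag are all assumptions.

```typescript
// Hypothetical job shape: how a suite gets its P0/P1/P2 tag is left open here.
type Priority = "P0" | "P1" | "P2";

interface QueuedJob {
  spec: string;
  priority: Priority;
}

const PRIORITY_RANK: Record<Priority, number> = { P0: 0, P1: 1, P2: 2 };

// Return a new array with P0 jobs first, then P1, then P2.
// Jobs with equal priority keep their original (enqueue) order.
function orderByPriority(jobs: QueuedJob[]): QueuedJob[] {
  return jobs
    .map((job, index) => ({ job, index }))
    .sort(
      (a, b) =>
        PRIORITY_RANK[a.job.priority] - PRIORITY_RANK[b.job.priority] ||
        a.index - b.index,
    )
    .map(({ job }) => job);
}
```

Ordering once before dispatch is the simplest variant. A real scheduler would likely also need to handle jobs that arrive while workers are already running, which this sketch does not cover.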
When we started implementing this orchestration around 2 weeks ago, we began to understand that it comes at a cost:
- testing complexity
- testing infra ownership
- scheduler logic
- observability needs
So our rough guess is that it's only worth it when:
- test suites grow large as more teams come on board
- shared infra becomes bottleneck
- CI costs increase
If you want to avoid this complexity, which could fairly be considered overkill or overengineering, the best option is either to optimize the test suites or to invest in more powerful infrastructure, and BOOM!