This artifact contains the environment, code, and data required to fully replicate the results of our ICSE 2026 paper.
E-Test: E'er-Improving Test Suites
Accepted at the 48th International Conference on Software Engineering (ICSE 2026)
Authors: Ketai Qiu, Luca Di Grazia, Leonardo Mariani, and Mauro Pezzè.
- Paper PDF: Read here
- Source Code: GitHub Repository
- AutonomicTester: A Python application that implements the E-Test techniques.
- DataAnalysis: A set of Jupyter notebooks to analyze results and compute evaluation metrics.
- Archives: A set of tar archives containing the datasets of prompts and LLM responses.
- Create a HuggingFace user access token as described at https://huggingface.co/docs/hub/security-tokens.
- Install Docker.
- Run the following commands from the project root directory using a Unix-compatible shell (e.g., Bash or Zsh). You can build an image from scratch and then switch to another LLM by changing OLLAMA_MODEL to any model available on Ollama. If you use a different LLM, you must also pass -e HUGGING_FACE_API_KEY="hf_xxxxxx" when starting the Docker container.
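To illustrate how the two environment variables fit into the docker run command of Step 2, here is a minimal Python sketch that only assembles and prints the command line. The helper name docker_run_args and its parameters are hypothetical and not part of the artifact; the flags, port mapping, and image name are copied from this README.

```python
import shlex


def docker_run_args(image, model, hf_token=None, host_port=20268, workdir="."):
    """Assemble a `docker run` argument list like the one in Step 2.

    Hypothetical helper for illustration only; it does not start a container.
    """
    args = [
        "docker", "run", "-it", "--rm",
        "-p", f"{host_port}:8888",          # Jupyter port mapping used in this README
        "-e", f"OLLAMA_MODEL={model}",      # model tag served by Ollama
    ]
    if hf_token:
        # Only needed when switching to a different LLM, per the note above.
        args += ["-e", f"HUGGING_FACE_API_KEY={hf_token}"]
    args += ["-v", f"{workdir}:/app", image]
    return args


if __name__ == "__main__":
    # "hf_xxxxxx" and "/path/to/project" are placeholders; replace "llama3.2:1b"
    # with another Ollama tag to switch models.
    cmd = docker_run_args(
        "e-test-llama3-1b", "llama3.2:1b",
        hf_token="hf_xxxxxx", workdir="/path/to/project",
    )
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without Docker; paste the printed line into a shell to start the container.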
Step 1. Prepare the Docker image
# Choice 1: Pull the pre-built image from Docker Hub
docker pull ketaiq/e-test:v1-llama3-1b
docker tag ketaiq/e-test:v1-llama3-1b e-test-llama3-1b
# Choice 2: Load the pre-built image for Linux AMD64 platform
docker load -i e-test-llama3-1b-amd64.tar
docker tag e-test-llama3-1b-amd64 e-test-llama3-1b
# Choice 3: Build the image locally
docker build -f ./E-Test.Dockerfile -t e-test-llama3-1b .
# Build and push to Docker Hub
docker buildx build -f E-Test.Dockerfile --platform linux/amd64,linux/arm64 -t ketaiq/e-test:v1-llama3-1b --push .
Step 2. Run the Docker container
docker run -it --rm \
-p 20268:8888 \
-e OLLAMA_MODEL="llama3.2:1b" \
-v "$(pwd)":/app \
e-test-llama3-1b
To reproduce the evaluation results shown in the paper, run the notebooks in the DataAnalysis folder.
You can open http://localhost:20268 to run and edit notebooks directly.
- Dataset Stats.ipynb and GH Dataset Stats.ipynb compute statistics about the dataset, which correspond to the Dataset paragraph of Section 2.2 and Table 1 in the paper.
- RQ1 Impact of LLMs.ipynb computes evaluation metrics (precision, recall, and F1-score) for each scenario and the average F1-scores, which correspond to Section 3.1, Table 3, Figure 3, and Figure 4 in the paper.
- RQ2 Comparative Evaluation.ipynb computes evaluation metrics of two state-of-the-art approaches (i.e., FAST++ and Field-ready testing), which correspond to Section 3.2 and Table 3 in the paper.
- RQ3 Impact of Queries.ipynb computes evaluation metrics for different combinations of queries, which correspond to Section 3.3 and Figure 5 in the paper.
- RQ4 Efficiency.ipynb measures the efficiency of E-Test in terms of response time and token consumption, which corresponds to Section 3.4 and Figure 6 in the paper.
- RQ5 Test Case Generation.ipynb analyzes the JUnit test cases generated by E-Test, which corresponds to Section 3.5 and Figure 7 in the paper.
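For readers unfamiliar with the metrics named above, the following is a minimal sketch of per-scenario precision, recall, and F1-score plus their unweighted average. It is not the notebooks' code: the function names and the toy labels are assumptions made for illustration, and the notebooks may aggregate differently.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall, and F1 for one scenario treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def average_f1(true_labels, predicted_labels, scenarios):
    """Unweighted mean of the per-scenario F1-scores (an assumed averaging scheme)."""
    scores = [precision_recall_f1(true_labels, predicted_labels, s)[2] for s in scenarios]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels, invented for illustration; not data from the paper.
    scenarios = ["error-prone", "need-test", "already-tested"]
    truth = ["error-prone", "error-prone", "need-test", "already-tested"]
    preds = ["error-prone", "need-test", "need-test", "already-tested"]
    for s in scenarios:
        p, r, f = precision_recall_f1(truth, preds, s)
        print(f"{s}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
    print(f"average F1: {average_f1(truth, preds, scenarios):.2f}")
```

On the toy labels, the error-prone scenario has precision 1.00 and recall 0.50 because one of its two cases is misclassified as need-test.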
In the Docker interactive shell, run the following commands to launch experiments. The results are written to the folder AutonomicTester/experiment_results. Test case generation also writes its output to the folder Defects4jDataset.
# Test Llama3 1B with prompts generated from error-prone scenarios in Defects4J
python AutonomicTester/main.py prompt -v 4 -d Defects4J -m LLama3_2_1B -s BUGGY
# Test Llama3 1B with prompts generated from need-test scenarios in Defects4J
python AutonomicTester/main.py prompt -v 4 -d Defects4J -m LLama3_2_1B -s FIXED
# Test Llama3 1B with prompts generated from already-tested scenarios in Defects4J
python AutonomicTester/main.py prompt -v 4 -d Defects4J -m LLama3_2_1B -s SIMILAR
# Test Llama3 1B with test case generation for error-prone scenarios in Defects4J
python AutonomicTester/main.py prompt -v 4 -d Defects4J -m LLama3_2_1B -s BUGGY -tcg
For other settings mentioned in the paper, please check the help message via python AutonomicTester/main.py -h.
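The commands above differ only in the -s value and the optional -tcg flag, so they can be generated in a loop. The sketch below is a hypothetical convenience script, not part of the artifact: prompt_command and SCENARIOS are names introduced here, while every flag and value is copied from the commands listed in this README. It prints the commands rather than running them.

```python
import shlex

# Scenario flag values and their descriptions, as used in the commands above.
SCENARIOS = {
    "BUGGY": "error-prone",
    "FIXED": "need-test",
    "SIMILAR": "already-tested",
}


def prompt_command(scenario, generate_tests=False,
                   model="LLama3_2_1B", dataset="Defects4J", version=4):
    """Build the argument list for one `main.py prompt` run (hypothetical helper)."""
    cmd = [
        "python", "AutonomicTester/main.py", "prompt",
        "-v", str(version), "-d", dataset, "-m", model, "-s", scenario,
    ]
    if generate_tests:
        cmd.append("-tcg")  # test case generation, shown above for BUGGY only
    return cmd


if __name__ == "__main__":
    for scenario, description in SCENARIOS.items():
        print(f"# {description} scenarios")
        print(shlex.join(prompt_command(scenario)))
    print("# test case generation for error-prone scenarios")
    print(shlex.join(prompt_command("BUGGY", generate_tests=True)))
```

To execute a command instead of printing it, the list returned by prompt_command can be passed to subprocess.run inside the Docker interactive shell.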
Run exit to stop the Docker container.