Fine-tune Qwen2.5-3B-Instruct with QLoRA to automatically transform vague, underspecified prompts into clear, structured, and effective prompts.
| Input (vague) | Output (optimized) |
|---|---|
| write about dogs | Write a 500-word informative essay about working dog breeds, covering their historical roles, modern applications, and training requirements. Use a professional tone and support your claims with specific examples. |
| fix my python code | Debug the following Python code. Identify all bugs and issues, explain the root cause of each, and provide the corrected version with inline comments explaining each fix. Focus on: off-by-one errors, edge cases, and type safety. |
| give me ideas for a mobile app | Generate 10 mobile app ideas for the Finnish market that solve real problems in daily life. For each idea, provide: (1) the problem it solves, (2) target demographic, (3) key differentiator from existing solutions, (4) estimated development complexity (Low/Medium/High). Prioritize ideas that can be built by a solo developer in under 3 months. |
- 1,183 prompt optimization pairs across 10 categories
- Categories: writing, coding, analysis, translation, Q&A, roleplay, summarization, brainstorming, instruction, editing
- Split: 619 train / 76 val / 84 test
- Format: instruction-tuning (instruction, input, output)
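A single record in this format might look like the sketch below. The three field names come from the format line above and the input/output pair is the first row of the example table. The instruction wording and the `parse_jsonl` / `is_valid` helpers are assumptions made for illustration, not the project's actual code.

```python
import json

# Hypothetical record. The three field names come from the format described
# above; the instruction wording is an assumed placeholder, not the project's.
record = {
    "instruction": "Rewrite the vague prompt as a clear, structured prompt.",
    "input": "write about dogs",
    "output": (
        "Write a 500-word informative essay about working dog breeds, "
        "covering their historical roles, modern applications, and training "
        "requirements. Use a professional tone and support your claims with "
        "specific examples."
    ),
}

REQUIRED_FIELDS = ("instruction", "input", "output")


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_valid(rec):
    """True if every required field is present as a non-empty string."""
    return all(
        isinstance(rec.get(key), str) and rec[key].strip()
        for key in REQUIRED_FIELDS
    )


line = json.dumps(record)          # one JSON object per line
rows = parse_jsonl(line + "\n\n")  # trailing blank lines are ignored
print(len(rows), is_valid(rows[0]))
```

A check like `is_valid` is cheap to run over each split before uploading it to Colab.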
| Metric | Value |
|---|---|
| Avg specificity markers | 3.2/7 |
| Avg length ratio (optimized/vague) | 15.0x |
| Has structure | 83% |
| Has format specification | 71% |
| Has audience targeting | 54% |
| Has tone specification | 23% |
- Base model: Qwen2.5-3B-Instruct (via Unsloth)
- Method: QLoRA (4-bit quantization + LoRA adapters)
- Hardware: Google Colab T4 GPU (16GB VRAM)
- LoRA config: r=16, alpha=16, dropout=0
- Training: 5 epochs, batch=2, grad_accum=8, lr=2e-4
- Effective batch size: 16
- Max sequence length: 2048
- Warmup ratio: 0.1
- Weight decay: 0.01
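The hyperparameters above imply a rough optimizer-step budget, which the following sketch works out. It assumes a single GPU and rounds partial batches up; the exact counts depend on how the trainer rounds, so treat the numbers as estimates. The `training_schedule` helper is hypothetical.

```python
import math


def training_schedule(n_examples, batch, grad_accum, epochs, warmup_ratio):
    """Derive approximate step counts from the hyperparameters.

    Assumes one GPU and that a partial final batch still counts as a step.
    """
    effective_batch = batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values taken from the configuration list: 619 training examples,
# batch=2, grad_accum=8, 5 epochs, warmup ratio 0.1.
schedule = training_schedule(
    n_examples=619, batch=2, grad_accum=8, epochs=5, warmup_ratio=0.1
)
print(schedule)
```

Under these assumptions the run takes about 195 optimizer steps, of which roughly the first 20 are warmup.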
- Open `notebooks/finetune_colab.ipynb` in Google Colab
- Change runtime to T4 GPU
- Upload `data/train.jsonl` and `data/val.jsonl`
- Run all cells
Expected training time: ~2-4 hours on T4.
Full 84-sample evaluation with 95% confidence intervals, comparing base model vs. fine-tuned model.
| Metric | Base | Fine-tuned | Change |
|---|---|---|---|
| ROUGE-1 | 0.238 | 0.339 | +0.100 |
| ROUGE-2 | 0.062 | 0.141 | +0.079 |
| ROUGE-L | 0.172 | 0.253 | +0.081 |
| Jaccard vs reference | 0.101 | 0.177 | +0.076 |
| Specificity /7 | 1.30 | 3.33 | +2.04 |
| Length vs reference | 0.558 | 0.975 | +0.417 |
| Avg output words | 28.8 | 49.7 | +20.9 |
| Structure present | 40.5% | 76.2% | +35.7 pts |
| Format guidance | 22.6% | 71.4% | +48.8 pts |
| Audience guidance | 23.8% | 61.9% | +38.1 pts |
| Tone guidance | 6.0% | 20.2% | +14.3 pts |
| Prefix bug rate | 0.0% | 0.0% | 0.0 pts |
Metric definitions:
- ROUGE-1/2/L: F1 scores vs. reference. Higher = better lexical overlap.
- Jaccard: Word-level `|A & B| / |A union B|` vs. reference. Higher = better alignment.
- Specificity /7: Count of optimization dimensions addressed (length, format, audience, tone, specificity, constraints, examples).
- Length vs reference: `output_words / reference_words`. Closer to 1.0 = better length alignment.
- Prefix bug: Unwanted `Optimized Prompt:` prefix. 0% = clean.
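The overlap and length metrics defined above can be sketched in plain Python. This is a minimal illustration, not the contents of `eval/evaluate.py`: the regex tokenizer, the helper names, and the handling of empty inputs are all assumptions, and ROUGE libraries may tokenize or stem differently, so the scores will not match the table exactly.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard(candidate, reference):
    """Word-level |A & B| / |A union B|; 0.0 when both texts are empty."""
    a, b = set(tokenize(candidate)), set(tokenize(reference))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def length_ratio(candidate, reference):
    """output_words / reference_words; 0.0 if the reference is empty."""
    ref_words = len(reference.split())
    return len(candidate.split()) / ref_words if ref_words else 0.0


def has_prefix_bug(output, prefix="Optimized Prompt:"):
    """True if the output starts with the unwanted prefix (case-insensitive)."""
    return output.lstrip().lower().startswith(prefix.lower())


print(jaccard("write about dogs", "write about cats"))
print(has_prefix_bug("Optimized Prompt: Write a 500-word essay."))
```

The specificity score is left out on purpose: the source lists its seven dimensions but not how each one is detected.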
Per-sample ROUGE-L comparison (differences within a 0.01 margin count as ties):
| Outcome | Count | Share |
|---|---|---|
| Fine-tuned wins | 57 | 67.9% |
| Base wins | 20 | 23.8% |
| Ties | 7 | 8.3% |
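A head-to-head tally of this kind can be sketched as follows. The `head_to_head` and `shares` helpers are hypothetical, and treating a difference of exactly 0.01 as a tie is an assumption; the source only states the margin.

```python
from collections import Counter


def head_to_head(base_scores, tuned_scores, margin=0.01):
    """Count per-sample wins; differences within +/- margin are ties."""
    if len(base_scores) != len(tuned_scores):
        raise ValueError("score lists must have the same length")
    tally = Counter({"fine_tuned": 0, "base": 0, "tie": 0})
    for base, tuned in zip(base_scores, tuned_scores):
        diff = tuned - base
        if diff > margin:
            tally["fine_tuned"] += 1
        elif diff < -margin:
            tally["base"] += 1
        else:
            tally["tie"] += 1
    return dict(tally)


def shares(tally):
    """Convert counts to percentages rounded to one decimal place."""
    total = sum(tally.values())
    return {name: round(100 * count / total, 1) for name, count in tally.items()}


# Toy scores for illustration only, not taken from the evaluation output.
demo = head_to_head([0.10, 0.30, 0.20], [0.25, 0.10, 0.205])
print(demo)

# The counts reported in the table reproduce the listed shares.
print(shares({"fine_tuned": 57, "base": 20, "tie": 7}))
```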
| Category | Samples | Base ROUGE-L | Fine-tuned ROUGE-L | Change |
|---|---|---|---|---|
| analysis | 12 | 0.179 | 0.341 | +0.162 |
| brainstorming | 9 | 0.143 | 0.245 | +0.101 |
| coding | 19 | 0.190 | 0.195 | +0.005 |
| editing | 13 | 0.155 | 0.281 | +0.126 |
| instruction | 9 | 0.154 | 0.300 | +0.146 |
| q_and_a | 12 | 0.173 | 0.178 | +0.006 |
| roleplay | 6 | 0.194 | 0.313 | +0.119 |
| translation | 4 | 0.188 | 0.211 | +0.023 |
The strongest gains are in analysis (+0.162), instruction (+0.146), editing (+0.126), and roleplay (+0.119). Coding (+0.005) and Q&A (+0.006) barely moved, suggesting these categories may need more targeted training examples.
The fine-tuned model learned to produce prompts that are:
- More specific and constraint-aware (specificity 1.3 -> 3.3/7)
- Closer to reference prompt length (0.56 -> 0.98 ratio)
- More likely to include structure (+36 pts), formatting (+49 pts), and audience guidance (+38 pts)
- Less likely to return a generic clarification question
- Free of the unwanted `Optimized Prompt:` prefix
Example improvement:
Input: what could I do with portfolio projects
Base output:
What specific portfolio projects could I create to enhance my skills in [specific field or technology]?
Fine-tuned output:
Suggest 12 projects using portfolio projects for a mid-level developer transitioning from Python to JavaScript. Include: quick wins (< 5 hours), team collaborations, and cross-functional opportunities. Format: bullet points with estimated complexity (Beginner/Intermediate/Advanced) and required tools/platforms.
- Some outputs may add details not present in the original prompt
- Coding and Q&A categories showed small ROUGE-L gains
- ROUGE and Jaccard measure overlap with references, not full human preference
- Small categories (translation n=4, roleplay n=6) should be treated as directional
- Open `notebooks/evaluate_colab.ipynb` in Google Colab
- Set runtime to T4 GPU
- Upload the LoRA adapter zip (`prompt-optimizer-lora.zip`)
- Run all cells in order
- Download `eval_results_v2.json` after the final cell
The notebook automatically extracts the LoRA zip, loads the base model and adapter, evaluates all 84 test prompts, and saves per-sample output with confidence intervals.
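The source does not say how the 95% confidence intervals are computed. One common choice is a percentile bootstrap over per-sample scores, sketched below as an assumption rather than a description of the notebook. The `bootstrap_ci` helper, the resample count, and the seed are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Returns (mean, lower, upper). Seeded so repeated runs agree.
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return statistics.fmean(values), lower, upper


# Toy per-sample scores for illustration only.
mean, lower, upper = bootstrap_ci([0.1, 0.2, 0.3, 0.4, 0.5])
print(f"mean={mean:.3f} ci=({lower:.3f}, {upper:.3f})")
```

With only 84 test prompts, and as few as 4 in the smallest category, intervals like these will be wide, which is why the small categories are best read as directional.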
prompt-optimizer/
|-- data/
| |-- raw/ # Generated and seed data
| |-- train.jsonl # Training split (619 examples)
| |-- val.jsonl # Validation split (76 examples)
| |-- test.jsonl # Test split (84 examples)
| `-- cleaned_data.jsonl # Full cleaned dataset
|-- scripts/
| |-- generate_data_v2.py # Template-based data generation
| |-- augment_data.py # Data augmentation (paraphrases)
| |-- prepare_final.py # Cleaning, dedup, splitting
| `-- clean_data_v2.py # Quality filtering
|-- notebooks/
| |-- finetune_colab.ipynb # Colab training notebook
| `-- evaluate_colab.ipynb # Colab evaluation notebook
|-- eval/
| `-- evaluate.py # Evaluation script
|-- eval_results_v2.json # Full evaluation output (84 samples)
|-- training_config.json # Training hyperparameters
|-- docs/
| |-- PLAN.md # Project plan
| `-- CV_ENTRY.md # CV description
`-- README.md
`python eval/evaluate.py`

Metrics: ROUGE-1/2/L, Jaccard similarity, specificity markers /7, length ratios, structure/format/audience/tone percentages, head-to-head comparison, per-category breakdown with 95% confidence intervals.
See docs/REFERENCES.md for the full bibliography. Key papers:
- QLoRA -- Dettmers et al. (2023), our fine-tuning method
- LoRA -- Hu et al. (2022), adapter-based training foundation
- Self-Instruct -- Wang et al. (2023), synthetic data generation methodology
- InstructGPT -- Ouyang et al. (2022), instruction-following paradigm
- Qwen2.5 -- Yang et al. (2024), our base model
- Automatic Prompt Optimization -- Pryzant et al. (2023), prompt optimization research
MIT
Petteri Kosonen -- Built as a portfolio project demonstrating LLM fine-tuning, dataset creation, and prompt engineering expertise.