Prompt Optimizer -- LLM Fine-Tuning Project

Fine-tune Qwen2.5-3B-Instruct with QLoRA to automatically transform vague, underspecified prompts into clear, structured, and effective prompts.

What It Does

Input (vague)	Output (optimized)
write about dogs	Write a 500-word informative essay about working dog breeds, covering their historical roles, modern applications, and training requirements. Use a professional tone and support your claims with specific examples.
fix my python code	Debug the following Python code. Identify all bugs and issues, explain the root cause of each, and provide the corrected version with inline comments explaining each fix. Focus on: off-by-one errors, edge cases, and type safety.
give me ideas for a mobile app	Generate 10 mobile app ideas for the Finnish market that solve real problems in daily life. For each idea, provide: (1) the problem it solves, (2) target demographic, (3) key differentiator from existing solutions, (4) estimated development complexity (Low/Medium/High). Prioritize ideas that can be built by a solo developer in under 3 months.

Dataset

1,183 prompt optimization pairs across 10 categories
Categories: writing, coding, analysis, translation, Q&A, roleplay, summarization, brainstorming, instruction, editing
Split: 619 train / 76 val / 84 test
Format: instruction-tuning (instruction, input, output)
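
To make the record layout concrete, here is a minimal sketch of one example in the instruction/input/output format, written and read back as JSON Lines. The instruction wording and the helper names are assumptions for illustration; the text actually used in data/train.jsonl is not shown in this README.

```python
import json

# Hypothetical instruction string; the real wording in train.jsonl may differ.
INSTRUCTION = "Rewrite the following vague prompt as a clear, specific, well-structured prompt."


def make_record(vague: str, optimized: str) -> dict:
    """Build one instruction-tuning record with the three documented fields."""
    return {"instruction": INSTRUCTION, "input": vague, "output": optimized}


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text: str) -> list:
    """Parse JSON Lines text back into records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# The output string is shortened here for readability.
record = make_record(
    "fix my python code",
    "Debug the following Python code. Identify all bugs and issues.",
)
round_tripped = from_jsonl(to_jsonl([record]))
```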

Data Quality Metrics

Metric	Value
Avg specificity markers	3.2/7
Avg length ratio (optimized/vague)	15.0x
Has structure	83%
Has format specification	71%
Has audience targeting	54%
Has tone specification	23%
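
The length ratio in the table can be read as the word count of the optimized prompt divided by the word count of the vague prompt, averaged over all pairs. The sketch below assumes whitespace tokenization and a per-pair average; the project's scripts may count words or aggregate differently, and the function names are hypothetical.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words (an assumed tokenization)."""
    return len(text.split())


def length_ratio(vague: str, optimized: str) -> float:
    """Words in the optimized prompt per word in the vague prompt."""
    # max(..., 1) avoids division by zero for an empty vague prompt.
    return word_count(optimized) / max(word_count(vague), 1)


def avg_length_ratio(pairs) -> float:
    """Average the per-pair ratio over (vague, optimized) tuples."""
    ratios = [length_ratio(vague, optimized) for vague, optimized in pairs]
    return sum(ratios) / len(ratios)


# Toy pairs, not taken from the dataset.
example_pairs = [
    ("write about dogs", "Write a short essay about working dog breeds"),
    ("fix code", "Debug the following Python code and explain each fix"),
]
example_ratio = avg_length_ratio(example_pairs)
```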

Fine-Tuning

Base model: Qwen2.5-3B-Instruct (via Unsloth)
Method: QLoRA (4-bit quantization + LoRA adapters)
Hardware: Google Colab T4 GPU (16GB VRAM)
LoRA config: r=16, alpha=16, dropout=0
Training: 5 epochs, batch=2, grad_accum=8, lr=2e-4
Effective batch size: 16
Max sequence length: 2048
Warmup ratio: 0.1
Weight decay: 0.01
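
As a rough worked example, the configuration above implies the following optimizer schedule. This is arithmetic only, not the training code: the exact step count depends on how the trainer handles the last partial accumulation window, so treat the numbers as approximate.

```python
import math

# Values copied from the configuration and dataset split listed in this README.
TRAIN_EXAMPLES = 619
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
EPOCHS = 5
WARMUP_RATIO = 0.1

# One optimizer step consumes batch * grad_accum examples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM                # 16
# Assumes the final partial window still counts as a step.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)  # 39
total_steps = steps_per_epoch * EPOCHS                         # 195
# Assumes warmup steps are rounded up from ratio * total steps.
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)           # 20

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

With roughly 195 optimizer steps in total, the learning rate would spend about the first 20 steps warming up to 2e-4.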

Quick Start

Open notebooks/finetune_colab.ipynb in Google Colab
Change runtime to T4 GPU
Upload data/train.jsonl and data/val.jsonl
Run all cells

Expected training time: ~2-4 hours on T4.

Evaluation Results

Evaluation on the full 84-sample test set, with 95% confidence intervals, comparing the base model against the fine-tuned model.

Quantitative Metrics

Metric	Base	Fine-tuned	Change
ROUGE-1	0.238	0.339	+0.100
ROUGE-2	0.062	0.141	+0.079
ROUGE-L	0.172	0.253	+0.081
Jaccard vs reference	0.101	0.177	+0.076
Specificity /7	1.30	3.33	+2.04
Length vs reference	0.558	0.975	+0.417
Avg output words	28.8	49.7	+20.9
Structure present	40.5%	76.2%	+35.7 pts
Format guidance	22.6%	71.4%	+48.8 pts
Audience guidance	23.8%	61.9%	+38.1 pts
Tone guidance	6.0%	20.2%	+14.3 pts
Prefix bug rate	0.0%	0.0%	0.0 pts

Metric definitions:

ROUGE-1/2/L: F1 scores vs. reference. Higher = better lexical overlap.
Jaccard: Word-level |A ∩ B| / |A ∪ B| vs. reference. Higher = better alignment.
Specificity /7: Count of optimization dimensions addressed (length, format, audience, tone, specificity, constraints, examples).
Length vs reference: output_words / reference_words. Closer to 1.0 = better length alignment.
Prefix bug: Share of outputs that start with an unwanted "Optimized Prompt:" prefix. 0% = clean.
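
The overlap metrics defined above can be sketched in plain Python as follows. This is a simplified illustration, assuming lowercased whitespace tokens; it is not the code in eval/evaluate.py, and established ROUGE packages apply their own tokenization and optional stemming, so their scores will differ. All function names are hypothetical.

```python
def tokens(text: str) -> list:
    """Lowercase and split on whitespace (an assumed tokenization)."""
    return text.lower().split()


def jaccard(output: str, reference: str) -> float:
    """Word-level |A ∩ B| / |A ∪ B| between output and reference."""
    a, b = set(tokens(output)), set(tokens(reference))
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-wise DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(output: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    out, ref = tokens(output), tokens(reference)
    lcs = lcs_length(out, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(out), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def length_vs_reference(output: str, reference: str) -> float:
    """output_words / reference_words; closer to 1.0 is better."""
    return len(tokens(output)) / max(len(tokens(reference)), 1)


def has_prefix_bug(output: str, prefix: str = "Optimized Prompt:") -> bool:
    """True if the output starts with the unwanted prefix."""
    return output.lstrip().lower().startswith(prefix.lower())
```

For example, comparing "write a short essay" with "write a long essay" shares three of five distinct words and a three-word common subsequence.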

Head-to-Head Result

ROUGE-L comparison with 0.01 margin:

Outcome	Count	Share
Fine-tuned wins	57	67.9%
Base wins	20	23.8%
Ties	7	8.3%
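
A minimal sketch of how a margin-based head-to-head count could work is shown below. It assumes a sample counts as a tie when the two ROUGE-L scores differ by no more than the margin; the exact tie rule used by the evaluation script is not stated here, and the function names are hypothetical.

```python
from collections import Counter


def head_to_head(base_scores, tuned_scores, margin=0.01):
    """Count per-sample wins, treating differences within the margin as ties."""
    if len(base_scores) != len(tuned_scores):
        raise ValueError("score lists must have the same length")
    outcomes = Counter({"fine_tuned": 0, "base": 0, "tie": 0})
    for base, tuned in zip(base_scores, tuned_scores):
        diff = tuned - base
        if diff > margin:
            outcomes["fine_tuned"] += 1
        elif diff < -margin:
            outcomes["base"] += 1
        else:
            outcomes["tie"] += 1
    return dict(outcomes)


def shares(outcomes):
    """Convert outcome counts to percentages rounded to one decimal."""
    total = sum(outcomes.values())
    return {name: round(100 * count / total, 1) for name, count in outcomes.items()}


# The counts reported in the table reproduce the listed shares.
reported = {"fine_tuned": 57, "base": 20, "tie": 7}
reported_shares = shares(reported)
```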

Per-Category ROUGE-L

Category	Samples	Base	Fine-tuned	Change
analysis	12	0.179	0.341	+0.162
brainstorming	9	0.143	0.245	+0.101
coding	19	0.190	0.195	+0.005
editing	13	0.155	0.281	+0.126
instruction	9	0.154	0.300	+0.146
q_and_a	12	0.173	0.178	+0.006
roleplay	6	0.194	0.313	+0.119
translation	4	0.188	0.211	+0.023

The strongest gains are in analysis (+0.162), instruction (+0.146), editing (+0.126), and roleplay (+0.119). Coding (+0.005) and Q&A (+0.006) improved only marginally, which suggests these categories may need more targeted training examples.
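
As a consistency check, the per-category rows can be combined into a sample-weighted mean and compared with the overall ROUGE-L row. The sketch below uses the rounded values from the table, so small rounding differences are possible; the variable and function names are hypothetical.

```python
# (samples, base ROUGE-L, fine-tuned ROUGE-L), copied from the per-category table.
PER_CATEGORY = {
    "analysis": (12, 0.179, 0.341),
    "brainstorming": (9, 0.143, 0.245),
    "coding": (19, 0.190, 0.195),
    "editing": (13, 0.155, 0.281),
    "instruction": (9, 0.154, 0.300),
    "q_and_a": (12, 0.173, 0.178),
    "roleplay": (6, 0.194, 0.313),
    "translation": (4, 0.188, 0.211),
}


def weighted_mean(rows, column):
    """Sample-weighted mean of one score column (1 = base, 2 = fine-tuned)."""
    total = sum(row[0] for row in rows.values())
    return sum(row[0] * row[column] for row in rows.values()) / total


total_samples = sum(row[0] for row in PER_CATEGORY.values())
base_overall = weighted_mean(PER_CATEGORY, 1)
tuned_overall = weighted_mean(PER_CATEGORY, 2)
print(total_samples, round(base_overall, 3), round(tuned_overall, 3))
```

The category counts sum to 84, and the weighted means round to 0.172 and 0.253, in line with the overall ROUGE-L row in the quantitative metrics table.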

What Improved

The fine-tuned model learned to produce prompts that are:

More specific and constraint-aware (specificity 1.3 -> 3.3/7)
Closer to reference prompt length (0.56 -> 0.98 ratio)
More likely to include structure (+36 pts), formatting (+49 pts), and audience guidance (+38 pts)
Less likely to return a generic clarification question
Free of the unwanted "Optimized Prompt:" prefix (0.0%, the same as the base model)

Example improvement:

Input: what could I do with portfolio projects

Base output:
What specific portfolio projects could I create to enhance my skills in [specific field or technology]?

Fine-tuned output:
Suggest 12 projects using portfolio projects for a mid-level developer transitioning from Python to JavaScript. Include: quick wins (< 5 hours), team collaborations, and cross-functional opportunities. Format: bullet points with estimated complexity (Beginner/Intermediate/Advanced) and required tools/platforms.

Known Limitations

Some outputs may add details not present in the original prompt
Coding and Q&A categories showed only small ROUGE-L gains (+0.005 and +0.006)
ROUGE and Jaccard measure overlap with references, not full human preference
Small categories (translation n=4, roleplay n=6) should be treated as directional

Reproducing Evaluation

Open notebooks/evaluate_colab.ipynb in Google Colab
Set runtime to T4 GPU
Upload the LoRA adapter zip (prompt-optimizer-lora.zip)
Run all cells in order
Download eval_results_v2.json after the final cell

The notebook automatically extracts the LoRA zip, loads the base model and adapter, evaluates all 84 test prompts, and saves per-sample output with confidence intervals.
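
This README does not say how the 95% confidence intervals are computed. One common option for per-sample scores is a percentile bootstrap, sketched below as an assumption rather than a description of the notebook; the function name and defaults are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[min(int((1 - alpha) * n_resamples), n_resamples - 1)]
    return lower, upper


# Toy per-sample scores, not taken from eval_results_v2.json.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5] * 4
toy_interval = bootstrap_ci(toy_scores)
```

With small groups such as translation (n=4) or roleplay (n=6), an interval computed this way will be wide, which is one reason to treat those categories as directional.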

Project Structure

prompt-optimizer/
|-- data/
|  |-- raw/                    # Generated and seed data
|  |-- train.jsonl             # Training split (619 examples)
|  |-- val.jsonl               # Validation split (76 examples)
|  |-- test.jsonl              # Test split (84 examples)
|  `-- cleaned_data.jsonl      # Full cleaned dataset
|-- scripts/
|  |-- generate_data_v2.py     # Template-based data generation
|  |-- augment_data.py         # Data augmentation (paraphrases)
|  |-- prepare_final.py        # Cleaning, dedup, splitting
|  `-- clean_data_v2.py        # Quality filtering
|-- notebooks/
|  |-- finetune_colab.ipynb    # Colab training notebook
|  `-- evaluate_colab.ipynb    # Colab evaluation notebook
|-- eval/
|  `-- evaluate.py             # Evaluation script
|-- eval_results_v2.json        # Full evaluation output (84 samples)
|-- training_config.json        # Training hyperparameters
|-- docs/
|  |-- PLAN.md                 # Project plan
|  `-- CV_ENTRY.md             # CV description
`-- README.md

Evaluation

python eval/evaluate.py

Metrics: ROUGE-1/2/L, Jaccard similarity, specificity markers /7, length ratios, structure/format/audience/tone percentages, head-to-head comparison, per-category breakdown with 95% confidence intervals.

References

See docs/REFERENCES.md for the full bibliography. Key papers:

QLoRA -- Dettmers et al. (2023), our fine-tuning method
LoRA -- Hu et al. (2022), adapter-based training foundation
Self-Instruct -- Wang et al. (2023), synthetic data generation methodology
InstructGPT -- Ouyang et al. (2022), instruction-following paradigm
Qwen2.5 -- Yang et al. (2024), our base model
Automatic Prompt Optimization -- Pryzant et al. (2023), prompt optimization research

License

MIT

Author

Petteri Kosonen -- Built as a portfolio project demonstrating LLM fine-tuning, dataset creation, and prompt engineering expertise.
