Petsku01/prompt-optimizer

Prompt Optimizer -- LLM Fine-Tuning Project

Fine-tune Qwen2.5-3B-Instruct with QLoRA to automatically transform vague, underspecified prompts into clear, structured, and effective prompts.

What It Does

| Input (vague) | Output (optimized) |
|---|---|
| write about dogs | Write a 500-word informative essay about working dog breeds, covering their historical roles, modern applications, and training requirements. Use a professional tone and support your claims with specific examples. |
| fix my python code | Debug the following Python code. Identify all bugs and issues, explain the root cause of each, and provide the corrected version with inline comments explaining each fix. Focus on: off-by-one errors, edge cases, and type safety. |
| give me ideas for a mobile app | Generate 10 mobile app ideas for the Finnish market that solve real problems in daily life. For each idea, provide: (1) the problem it solves, (2) target demographic, (3) key differentiator from existing solutions, (4) estimated development complexity (Low/Medium/High). Prioritize ideas that can be built by a solo developer in under 3 months. |

Dataset

  • 1,183 prompt optimization pairs across 10 categories
  • Categories: writing, coding, analysis, translation, Q&A, roleplay, summarization, brainstorming, instruction, editing
  • Split: 619 train / 76 val / 84 test
  • Format: instruction-tuning (instruction, input, output)
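
As a rough sketch of the record format, the snippet below builds one JSONL line and validates it. The three field names come from the list above, and the input/output pair is the first example from "What It Does"; the instruction wording and the helper name `load_pairs` are assumptions made for illustration and are not taken from the repository.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

# Hypothetical record: field names follow the README, the instruction text is assumed.
EXAMPLE_LINE = json.dumps({
    "instruction": "Rewrite the vague prompt as a clear, structured prompt.",
    "input": "write about dogs",
    "output": (
        "Write a 500-word informative essay about working dog breeds, "
        "covering their historical roles, modern applications, and training "
        "requirements. Use a professional tone and support your claims with "
        "specific examples."
    ),
})


def load_pairs(lines):
    """Parse JSONL lines, raising if a record has a missing or empty field."""
    pairs = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        pairs.append(record)
    return pairs


pairs = load_pairs([EXAMPLE_LINE])
print(pairs[0]["input"])
```

The same helper accepts an open file handle, for example `load_pairs(open("data/train.jsonl"))`, because iterating a text file yields one line at a time.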

Data Quality Metrics

| Metric | Value |
|---|---|
| Avg specificity markers | 3.2/7 |
| Avg length ratio (optimized/vague) | 15.0x |
| Has structure | 83% |
| Has format specification | 71% |
| Has audience targeting | 54% |
| Has tone specification | 23% |

Fine-Tuning

  • Base model: Qwen2.5-3B-Instruct (via Unsloth)
  • Method: QLoRA (4-bit quantization + LoRA adapters)
  • Hardware: Google Colab T4 GPU (16GB VRAM)
  • LoRA config: r=16, alpha=16, dropout=0
  • Training: 5 epochs, batch=2, grad_accum=8, lr=2e-4
  • Effective batch size: 16
  • Max sequence length: 2048
  • Warmup ratio: 0.1
  • Weight decay: 0.01
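
The numbers above imply a fairly short training run. The sketch below works through the arithmetic, assuming a single GPU, one optimizer step per effective batch, and a final partial batch that still counts as a step; the exact step count reported by a given trainer may differ by one or two.

```python
import math

# Values taken from the Dataset and Fine-Tuning sections of this README.
TRAIN_EXAMPLES = 619
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 8
EPOCHS = 5
WARMUP_RATIO = 0.1

# Gradients are accumulated over 8 micro-batches of 2 before each update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

# Assumption: the last, partially filled batch of each epoch still triggers a step.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# prints: 16 39 195 20
```

Under these assumptions the learning rate warms up over roughly the first 20 of about 195 optimizer steps.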

Quick Start

  1. Open notebooks/finetune_colab.ipynb in Google Colab
  2. Change runtime to T4 GPU
  3. Upload data/train.jsonl and data/val.jsonl
  4. Run all cells

Expected training time: ~2-4 hours on T4.

Evaluation Results

Full 84-sample evaluation with 95% confidence intervals, comparing base model vs. fine-tuned model.

Quantitative Metrics

| Metric | Base | Fine-tuned | Change |
|---|---|---|---|
| ROUGE-1 | 0.238 | 0.339 | +0.100 |
| ROUGE-2 | 0.062 | 0.141 | +0.079 |
| ROUGE-L | 0.172 | 0.253 | +0.081 |
| Jaccard vs reference | 0.101 | 0.177 | +0.076 |
| Specificity /7 | 1.30 | 3.33 | +2.04 |
| Length vs reference | 0.558 | 0.975 | +0.417 |
| Avg output words | 28.8 | 49.7 | +20.9 |
| Structure present | 40.5% | 76.2% | +35.7 pts |
| Format guidance | 22.6% | 71.4% | +48.8 pts |
| Audience guidance | 23.8% | 61.9% | +38.1 pts |
| Tone guidance | 6.0% | 20.2% | +14.3 pts |
| Prefix bug rate | 0.0% | 0.0% | 0.0 pts |

Metric definitions:

  • ROUGE-1/2/L: F1 scores vs. reference. Higher = better lexical overlap.
  • Jaccard: Word-level |A ∩ B| / |A ∪ B| vs. reference. Higher = better alignment.
  • Specificity /7: Count of optimization dimensions addressed (length, format, audience, tone, specificity, constraints, examples).
  • Length vs reference: output_words / reference_words. Closer to 1.0 = better length alignment.
  • Prefix bug: Share of outputs that begin with an unwanted "Optimized Prompt:" prefix. 0% = clean.
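
Three of these definitions are simple enough to sketch directly. The code below is an illustration only: it assumes lowercased whitespace tokenization and a case-insensitive prefix check, and the function names are hypothetical. The implementation in eval/evaluate.py may tokenize or match differently.

```python
def tokens(text):
    """Lowercased whitespace tokens (an assumed tokenization)."""
    return text.lower().split()


def jaccard(output, reference):
    """Word-level |A ∩ B| / |A ∪ B| between output and reference."""
    a, b = set(tokens(output)), set(tokens(reference))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def length_ratio(output, reference):
    """output_words / reference_words; 1.0 means equal length."""
    ref_words = len(tokens(reference))
    return len(tokens(output)) / ref_words if ref_words else 0.0


def has_prefix_bug(output):
    """True when the output starts with the unwanted 'Optimized Prompt:' label."""
    return output.strip().lower().startswith("optimized prompt:")


# Made-up strings for illustration, not taken from the test set.
out = "Write a short essay about dogs"
ref = "Write a 500-word essay about working dogs"
print(jaccard(out, ref))                      # 5 shared words / 8 distinct words
print(round(length_ratio(out, ref), 3))       # 6 words / 7 words
print(has_prefix_bug("Optimized Prompt: Write a short essay about dogs"))
```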

Head-to-Head Result

Per-sample ROUGE-L comparison; score differences within a 0.01 margin count as ties:

| Outcome | Count | Share |
|---|---|---|
| Fine-tuned wins | 57 | 67.9% |
| Base wins | 20 | 23.8% |
| Ties | 7 | 8.3% |
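
A per-sample comparison of this kind can be sketched as follows. This is a minimal illustration, assuming ROUGE-L F1 computed from the longest common subsequence of lowercased whitespace tokens and a tie whenever the absolute score difference does not exceed the margin; the function names are hypothetical and the repository's own scorer may differ (for example in tokenization or stemming).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def head_to_head(base_score, tuned_score, margin=0.01):
    """Label one sample as a win for either model, or a tie within the margin."""
    diff = tuned_score - base_score
    if diff > margin:
        return "fine-tuned"
    if diff < -margin:
        return "base"
    return "tie"


# Shares in the table follow directly from the counts.
counts = {"fine-tuned": 57, "base": 20, "tie": 7}
total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(total, shares)
```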

Per-Category ROUGE-L

| Category | Samples | Base | Fine-tuned | Change |
|---|---|---|---|---|
| analysis | 12 | 0.179 | 0.341 | +0.162 |
| brainstorming | 9 | 0.143 | 0.245 | +0.101 |
| coding | 19 | 0.190 | 0.195 | +0.005 |
| editing | 13 | 0.155 | 0.281 | +0.126 |
| instruction | 9 | 0.154 | 0.300 | +0.146 |
| q_and_a | 12 | 0.173 | 0.178 | +0.006 |
| roleplay | 6 | 0.194 | 0.313 | +0.119 |
| translation | 4 | 0.188 | 0.211 | +0.023 |

The strongest gains are in analysis (+0.162), instruction (+0.146), editing (+0.126), and roleplay (+0.119). Coding (+0.005) and Q&A (+0.006) improved only marginally, which suggests these categories may need more targeted training examples.
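
As a consistency check, the sample-weighted mean of the per-category scores should reproduce the overall ROUGE-L row in the Quantitative Metrics table. The sketch below uses only the rounded values printed in the per-category table, so agreement is expected to three decimals at most.

```python
# (category, samples, base ROUGE-L, fine-tuned ROUGE-L), copied from the table above.
ROWS = [
    ("analysis", 12, 0.179, 0.341),
    ("brainstorming", 9, 0.143, 0.245),
    ("coding", 19, 0.190, 0.195),
    ("editing", 13, 0.155, 0.281),
    ("instruction", 9, 0.154, 0.300),
    ("q_and_a", 12, 0.173, 0.178),
    ("roleplay", 6, 0.194, 0.313),
    ("translation", 4, 0.188, 0.211),
]

total = sum(n for _, n, _, _ in ROWS)

# Weight each category mean by its sample count to recover the overall mean.
base_mean = sum(n * base for _, n, base, _ in ROWS) / total
tuned_mean = sum(n * tuned for _, n, _, tuned in ROWS) / total

print(total, round(base_mean, 3), round(tuned_mean, 3))
# prints: 84 0.172 0.253
```

The category sample counts sum to 84, and the weighted means match the overall ROUGE-L values of 0.172 (base) and 0.253 (fine-tuned).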

What Improved

The fine-tuned model learned to produce prompts that are:

  • More specific and constraint-aware (specificity 1.3 -> 3.3/7)
  • Closer to reference prompt length (0.56 -> 0.98 ratio)
  • More likely to include structure (+36 pts), formatting (+49 pts), and audience guidance (+38 pts)
  • Less likely to return a generic clarification question
  • Free of the unwanted "Optimized Prompt:" prefix

Example improvement:

Input: what could I do with portfolio projects

Base output:
What specific portfolio projects could I create to enhance my skills in [specific field or technology]?

Fine-tuned output:
Suggest 12 projects using portfolio projects for a mid-level developer transitioning from Python to JavaScript. Include: quick wins (< 5 hours), team collaborations, and cross-functional opportunities. Format: bullet points with estimated complexity (Beginner/Intermediate/Advanced) and required tools/platforms.

Known Limitations

  • Some outputs may add details not present in the original prompt
  • Coding and Q&A categories showed only small ROUGE-L gains (+0.005 and +0.006)
  • ROUGE and Jaccard measure overlap with references, not full human preference
  • Small categories (translation n=4, roleplay n=6) should be treated as directional

Reproducing Evaluation

  1. Open notebooks/evaluate_colab.ipynb in Google Colab
  2. Set runtime to T4 GPU
  3. Upload the LoRA adapter zip (prompt-optimizer-lora.zip)
  4. Run all cells in order
  5. Download eval_results_v2.json after the final cell

The notebook automatically extracts the LoRA zip, loads the base model and adapter, evaluates all 84 test prompts, and saves per-sample output with confidence intervals.
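
This README does not state how the 95% confidence intervals are computed. One common choice for per-sample scores is a percentile bootstrap over the test samples; the sketch below illustrates that approach under this assumption. The function name, the number of resamples, and the example scores are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-sample scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1.0 - alpha) * n_resamples) - 1]
    return statistics.fmean(scores), lower, upper


# Made-up per-sample ROUGE-L scores, for illustration only.
example_scores = [0.10, 0.25, 0.31, 0.18, 0.40, 0.22, 0.27, 0.35]
mean, lower, upper = bootstrap_ci(example_scores)
print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

With only a handful of samples, as in the smaller categories, intervals of this kind are wide, which is why those results are best read as directional.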

Project Structure

prompt-optimizer/
|-- data/
|  |-- raw/                    # Generated and seed data
|  |-- train.jsonl             # Training split (619 examples)
|  |-- val.jsonl               # Validation split (76 examples)
|  |-- test.jsonl              # Test split (84 examples)
|  `-- cleaned_data.jsonl      # Full cleaned dataset
|-- scripts/
|  |-- generate_data_v2.py     # Template-based data generation
|  |-- augment_data.py         # Data augmentation (paraphrases)
|  |-- prepare_final.py        # Cleaning, dedup, splitting
|  `-- clean_data_v2.py        # Quality filtering
|-- notebooks/
|  |-- finetune_colab.ipynb    # Colab training notebook
|  `-- evaluate_colab.ipynb    # Colab evaluation notebook
|-- eval/
|  `-- evaluate.py             # Evaluation script
|-- eval_results_v2.json        # Full evaluation output (84 samples)
|-- training_config.json        # Training hyperparameters
|-- docs/
|  |-- PLAN.md                 # Project plan
|  `-- CV_ENTRY.md             # CV description
`-- README.md

Evaluation

python eval/evaluate.py

Metrics: ROUGE-1/2/L, Jaccard similarity, specificity markers /7, length ratios, structure/format/audience/tone percentages, head-to-head comparison, per-category breakdown with 95% confidence intervals.

References

See docs/REFERENCES.md for the full bibliography. Key papers:

  • QLoRA -- Dettmers et al. (2023), our fine-tuning method
  • LoRA -- Hu et al. (2022), adapter-based training foundation
  • Self-Instruct -- Wang et al. (2023), synthetic data generation methodology
  • InstructGPT -- Ouyang et al. (2022), instruction-following paradigm
  • Qwen2.5 -- Yang et al. (2024), our base model
  • Automatic Prompt Optimization -- Pryzant et al. (2023), prompt optimization research

License

MIT

Author

Petteri Kosonen -- Built as a portfolio project demonstrating LLM fine-tuning, dataset creation, and prompt engineering expertise.
