Language:
- English: README.md
- Chinese (中文): README_zh.md
Status-oriented workspace for multimodal safety alignment research on Qwen2.5-VL-class models.
Vision-language models can follow harmful requests when the attack is image-conditioned or phrased differently from text-only safety training data. This project tries to transfer safety behavior learned from text supervision into multimodal interactions without claiming that text-only alignment is enough by itself.
The target method is:
- Build text-side safe vs. unsafe representations and extract a safety direction.
- Construct paired multimodal training records from MM-SafetyBench-style data.
- Fine-tune a VLM with native LoRA SFT plus an explicit cross-modal alignment loss so multimodal hidden-state differences track the text-side safety direction.
- Evaluate on image-conditioned attacks from MM-SafetyBench and JailBreakV-28K.
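The first and third steps above can be sketched in a few lines. This is a minimal illustration, not the code in `src/lora_finetune.py`: it assumes the safety direction is a normalized difference of mean hidden states and that the alignment term is one minus a cosine similarity. Both choices, and the function names `safety_direction` and `alignment_loss`, are assumptions made for this example.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def safety_direction(safe_states, unsafe_states):
    """Assumed form: unit vector from the unsafe mean toward the safe mean."""
    safe_mean = mean_vector(safe_states)
    unsafe_mean = mean_vector(unsafe_states)
    return normalize([s - u for s, u in zip(safe_mean, unsafe_mean)])


def alignment_loss(mm_safe_state, mm_unsafe_state, direction):
    """Assumed form: 1 - cos(multimodal hidden-state difference, text direction)."""
    diff = normalize([s - u for s, u in zip(mm_safe_state, mm_unsafe_state)])
    cosine = sum(d * k for d, k in zip(diff, direction))
    return 1.0 - cosine


# Toy 2-d hidden states (illustrative values only).
direction = safety_direction([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
aligned = alignment_loss([2.0, 1.0], [1.0, 1.0], direction)   # difference points along the direction
opposed = alignment_loss([1.0, 1.0], [2.0, 1.0], direction)   # difference points against it
```

Under these assumptions the loss is near 0 when the multimodal safe-minus-unsafe difference points along the text-side direction and near 2 when it points the opposite way.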
The project now distinguishes clearly between the original design target and the current best-performing recipe. The original target was explicit safety-direction transfer: extract a text-side safe/unsafe direction, then align multimodal hidden-state differences to that direction during LoRA finetuning. In practice, the direct harmful-only + explicit alignment route often improved safety by pushing the model into over-refusal. The current mainline therefore uses harmful-only multimodal LoRA finetuning with preventative steering: a small set of benign probes is used only as a behavioral anchor, while an additional over-refusal penalty suppresses drift toward blanket refusal. The resulting objective keeps the safety gains of harmful-only training while preserving general multimodal capability much better than the pure alignment route.
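One way to picture how the three preventative-steering weights could combine is sketched below. The functional form is an assumption for illustration only: a language-modeling loss, plus a weighted anchor term on the benign probes, plus a weighted hinge penalty that activates when a refusal-direction projection exceeds a margin. The function name, the arguments `anchor_drift` and `refusal_projection`, and the hinge itself are hypothetical. Only the three default values come from the recorded mainline settings.

```python
def preventative_objective(
    lm_loss,
    anchor_drift,
    refusal_projection,
    anchor_weight=0.05,        # recorded mainline value
    overrefusal_weight=0.10,   # recorded mainline value
    projection_margin=0.02,    # recorded mainline value
):
    """Hypothetical combined objective for harmful-only training.

    lm_loss:            standard LM loss on the harmful-only records
    anchor_drift:       assumed non-negative measure of behavior change on benign probes
    refusal_projection: assumed projection of benign-probe states onto a refusal direction
    """
    # Assumed hinge: only projections above the margin are penalized.
    overrefusal_penalty = max(0.0, refusal_projection - projection_margin)
    return (
        lm_loss
        + anchor_weight * anchor_drift
        + overrefusal_weight * overrefusal_penalty
    )


# A probe batch that drifts toward refusal raises the objective...
drifting = preventative_objective(lm_loss=1.0, anchor_drift=0.2, refusal_projection=0.52)
# ...while one that stays inside the margin adds nothing beyond the LM loss.
stable = preventative_objective(lm_loss=1.0, anchor_drift=0.0, refusal_projection=0.01)
```

The point of the sketch is the shape of the trade-off: the benign probes contribute no supervised targets, only regularization terms that grow when the model moves toward blanket refusal.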
- Project-local dataset layout is in place under `data/mm-safetybench` and `data/jailbreakv-28k`.
- Native Qwen2.5-VL LoRA fine-tuning is implemented in `src/lora_finetune.py` and exposed through `main.py lora`.
- Image-conditioned evaluation is implemented through `main.py eval` and the current collator/evaluator path.
- The codebase now includes an explicit alignment-training path:
  - alignment config fields in `LoRAConfig`
  - per-record alignment payload derivation from `safe_text`, `unsafe_text`, `multimodal_text`, and `image_path`
  - combined LM loss + pair/global alignment loss in the training loop
  - helper coverage in `tests/test_lora_alignment.py`
- The first explicit alignment-enabled training run has completed successfully:
  - output dir: `output/alignment_run_20260316_r1c`
  - one epoch on `3360` records
  - `global_step=210`
  - `final_loss=0.004323`
  - intermediate checkpoints through `checkpoint-200`
- The current mainline checkpoint is the harmful-only preventative-steering run:
  - output dir: `output/planned_harmful_only_preventative_steering_r1`
  - mainline checkpoint: `checkpoint-300`
  - training mix:
    - `harmful_compliance=3360`
    - `safe_helpful=0`
  - preventative steering:
    - `preventative_anchor_weight=0.05`
    - `preventative_overrefusal_weight=0.10`
    - `preventative_projection_margin=0.02`
- A behavior-mix side-reference run is kept for comparison only:
  - output dir: `output/behavior_mix_mmstar512_r1`
  - side-reference checkpoint: `checkpoint-150`
  - use case: contamination audit and mixed-data comparison, not the main project claim
- A newer-base replication on local `Qwen3.5-9B` is also now recorded:
  - output dir: `output/qwen3.5_preventative_steering_r2_fixedlogic`
  - evaluated checkpoint: `checkpoint-300`
  - same harmful-only preventative recipe:
    - `preventative_anchor_weight=0.05`
    - `preventative_overrefusal_weight=0.10`
    - `preventative_projection_margin=0.02`
  - caveat:
    - its first standalone `MMBench` artifact was invalid because the evaluator left thinking enabled and used right-padding in batched generation
    - the repaired artifact is `results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json` and supersedes `results/qwen3.5_r2_ckpt300_mmbench_full.json`
- MMStar overlap auditing and an independent held-out capability path are now implemented:
  - overlap audit artifacts: `results/ablation/mmstar_overlap_audit.csv` and `results/ablation/mmstar_overlap_audit.md`
  - held-out split builder: `src/mmstar_heldout.py`
  - independent held-out eval entrypoint: `scripts/run_mmstar_heldout_eval.py`
- Multi-agent orchestration support exists for dataset prep, script generation, preflight checks, and lightweight readiness artifacts.
- A dedicated mainline technical summary is available at `docs/preventative_steering_mainline.md`.
- A one-page project briefing is available at `docs/project_briefing_onepage.md`.
- A final handoff-oriented technical scheme is available at `docs/final_technical_scheme.md`.
- A paper-style abstract draft is available at `docs/paper_style_abstract.md`.
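The right-padding caveat noted for the first Qwen3.5 `MMBench` artifact comes from how decoder-only models generate: each new token continues from the last position of the prompt, so in a batch the real tokens must sit at the right edge and the padding on the left. The sketch below shows that layout in plain Python. The helper `pad_batch` is hypothetical and is not the repository's collator. With Hugging Face tokenizers the equivalent switch is the `padding_side="left"` setting.

```python
def pad_batch(token_batches, pad_id, side="left"):
    """Pad variable-length token lists to a common width and build attention masks."""
    width = max(len(tokens) for tokens in token_batches)
    input_ids, attention_mask = [], []
    for tokens in token_batches:
        gap = width - len(tokens)
        pad, zeros, ones = [pad_id] * gap, [0] * gap, [1] * len(tokens)
        if side == "left":
            input_ids.append(pad + list(tokens))
            attention_mask.append(zeros + ones)
        elif side == "right":
            input_ids.append(list(tokens) + pad)
            attention_mask.append(ones + zeros)
        else:
            raise ValueError("side must be 'left' or 'right'")
    return input_ids, attention_mask


prompts = [[5], [1, 2, 3]]  # toy token ids
left_ids, left_mask = pad_batch(prompts, pad_id=0, side="left")
right_ids, right_mask = pad_batch(prompts, pad_id=0, side="right")

# With left padding the final position of every row is a real token,
# so generation continues from the prompt rather than from a pad token.
last_is_real_left = [row[-1] == 1 for row in left_mask]
last_is_real_right = [row[-1] == 1 for row in right_mask]
```

In the toy batch, right padding leaves the short prompt ending on a pad token, which is the condition that can corrupt batched generation.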
Quick image-conditioned evaluations already present in this workspace:
| Benchmark | Samples | Model | ASR | Safety Rate | Artifact |
|---|---|---|---|---|---|
| MM-SafetyBench | 50 | base model | 0.92 | 0.08 | results/mm_safetybench_image_base_50.json |
| MM-SafetyBench | 50 | `output/checkpoint-100` | 0.00 | 1.00 | results/mm_safetybench_image_ckpt100_50.json |
| MM-SafetyBench | 50 | `output/alignment_run_20260316_r1c` | 0.00 | 1.00 | results/mm_safetybench_image_alignment_r1c_50.json |
| JailBreakV-28K | 50 | base model | 0.40 | 0.60 | results/jailbreakv_image_base_50.json |
| JailBreakV-28K | 50 | `output/checkpoint-100` | 0.00 | 1.00 | results/jailbreakv_image_ckpt100_50.json |
| JailBreakV-28K | 50 | `output/alignment_run_20260316_r1c` | 0.00 | 1.00 | results/jailbreakv_image_alignment_r1c_50.json |
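In the table above, ASR and Safety Rate always sum to 1, which is consistent with both being derived from a single per-sample judgment. A minimal sketch of that relationship follows. The record field `attack_success` and the function name are hypothetical and may not match the evaluator's actual output schema.

```python
def attack_metrics(records):
    """Compute ASR and safety rate from per-sample attack judgments.

    Each record is assumed to carry a boolean `attack_success` field
    (hypothetical name) set by whatever judge the evaluator uses.
    """
    total = len(records)
    if total == 0:
        raise ValueError("no evaluated samples")
    successes = sum(1 for r in records if r["attack_success"])
    asr = successes / total
    return {"samples": total, "asr": asr, "safety_rate": 1.0 - asr}


# 46 successful attacks out of 50 reproduces the 0.92 / 0.08 split
# reported for the base model on the MM-SafetyBench quick slice.
toy_records = [{"attack_success": i < 46} for i in range(50)]
metrics = attack_metrics(toy_records)
```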
Capability-side check now completed for the explicit alignment checkpoint:
| Benchmark | Samples | Model | Main Metric | Value | Artifact |
|---|---|---|---|---|---|
| MMStar | 1500 | `output/alignment_run_20260316_r1c` | Accuracy | 0.286 | results/mmstar_alignment_r1c_1500_postfix_bs1_20260317_073600.json |
Mainline comparison: base model vs harmful-only preventative steering:
| Benchmark | Samples | Base Model | Preventative ckpt300 | Interpretation |
|---|---|---|---|---|
| MM-SafetyBench ASR | 972 | 0.920 on 50-sample quick slice | 0.000 | Measured multimodal attack success drops to zero on the larger loaded slice. |
| JailBreakV ASR | 1000 | 0.400 on 50-sample quick slice | 0.000 | Large-slice jailbreak success also drops to zero. |
| MMBench Accuracy | 4329 | 0.879 | 0.878 | General multimodal capability remains effectively flat. |
| Held-out MMStar Accuracy | 980 | 0.625510 | 0.627551 | Capability stays near base on a train/eval-disjoint slice. |
Cross-model replication on local LLaVA-1.5-7B: base model vs harmful-only preventative steering checkpoint-300:
| Benchmark | Samples | LLaVA Base | LLaVA ckpt300 | Delta | Interpretation |
|---|---|---|---|---|---|
| MM-SafetyBench ASR | 972 | 0.996914 | 0.000000 | -0.996914 | The preventative-steering checkpoint removes almost all measured image-conditioned attack success on the larger slice. |
| JailBreakV ASR | 1000 | 0.822000 | 0.000000 | -0.822000 | Large-slice jailbreak success also drops to zero for the tuned LLaVA checkpoint. |
| MMBench Accuracy | 4329 | 0.733426 | 0.732502 | -0.000924 | General multimodal capability stays effectively flat relative to the same LLaVA base model. |
| Held-out MMStar Accuracy | 980 | 0.379592 | 0.373469 | -0.006122 | Held-out capability drops only slightly relative to the same LLaVA base model. |
Artifacts:
- `results/llava_base_mmsafety972.json`
- `results/llava_ckpt300_mmsafety972.json`
- `results/llava_base_jailbreakv1000.json`
- `results/llava_ckpt300_jailbreakv1000.json`
- `results/llava_base_mmbench_full.json`
- `results/llava_ckpt300_mmbench_full.json`
- `results/llava_base_mmstar_heldout980.json`
- `results/llava_ckpt300_mmstar_heldout980.json`
- `results/ablation/llava_base_vs_ckpt300_table.md`
Newer-base replication on local Qwen3.5-9B with repaired MMBench evaluation:
| Benchmark | Samples | Model | Main Metric | Value | Artifact |
|---|---|---|---|---|---|
| MM-SafetyBench | 972 | `output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300` | ASR | 0.000000 | results/qwen3.5_r2_ckpt300_mmsafety972.json |
| JailBreakV | 1000 | `output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300` | ASR | 0.000000 | results/qwen3.5_r2_ckpt300_jailbreakv1000.json |
| MMStar held-out | 980 | `output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300` | Accuracy | 0.667347 | results/qwen3.5_r2_ckpt300_mmstar_heldout980.json |
| MMBench | 4329 | `output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300` | Accuracy | 0.852391 | results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json |
Interpretation update for the Qwen3.5 replication:
- the repaired `MMBench` artifact shows that the earlier Qwen3.5 `0.0231` result was an evaluator bug, not a model collapse
- the superseded artifact was `results/qwen3.5_r2_ckpt300_mmbench_full.json`
- the valid replacement is `results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json`
- this gives the repository a third positive harmful-only preventative result family:
  - Qwen2.5-VL mainline
  - LLaVA-1.5-7B replication
  - Qwen3.5-9B replication
- strict within-model comparison is now complete against the same local Qwen3.5 base: held-out MMStar improves by 0.012245, MMBench changes by -0.023100, and MM-SafetyBench972/JailBreakV1000 ASR fall from 0.727366/0.266000 to 0.000000
Strict within-model comparison on local Qwen3.5-9B:
| Benchmark | Qwen3.5 Base | Qwen3.5 ckpt300 | Delta | Interpretation |
|---|---|---|---|---|
| MM-SafetyBench ASR | 0.727366 | 0.000000 | -0.727366 | Preventative steering removes the measured image-conditioned attack success on the larger MM-SafetyBench slice. |
| JailBreakV ASR | 0.266000 | 0.000000 | -0.266000 | Large-slice jailbreak success also drops to zero for the tuned Qwen3.5 checkpoint. |
| MMBench Accuracy | 0.875491 | 0.852391 | -0.023100 | General multimodal capability stays high, although the tuned checkpoint is lower than the same Qwen3.5 base on full MMBench. |
| Held-out MMStar Accuracy | 0.655102 | 0.667347 | 0.012245 | Held-out MMStar improves slightly relative to the same Qwen3.5 base model. |
Artifacts:
- `results/qwen3.5_base_mmsafety972.json`
- `results/qwen3.5_r2_ckpt300_mmsafety972.json`
- `results/qwen3.5_base_jailbreakv1000.json`
- `results/qwen3.5_r2_ckpt300_jailbreakv1000.json`
- `results/qwen3.5_base_mmbench_full.json`
- `results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json`
- `results/qwen3.5_base_mmstar_heldout980.json`
- `results/qwen3.5_r2_ckpt300_mmstar_heldout980.json`
- `results/ablation/qwen3.5_base_vs_ckpt300_table.md`
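The Delta column in the Qwen3.5 within-model table is simply tuned minus base. The short sketch below recomputes it from the reported values as a sanity check. It does not read the JSON artifacts, whose schema is not documented here, and the helper name `deltas` is hypothetical.

```python
# (base, tuned) pairs copied from the strict within-model Qwen3.5 table.
reported = {
    "MM-SafetyBench ASR": (0.727366, 0.000000),
    "JailBreakV ASR": (0.266000, 0.000000),
    "MMBench Accuracy": (0.875491, 0.852391),
    "Held-out MMStar Accuracy": (0.655102, 0.667347),
}


def deltas(pairs, digits=6):
    """Return tuned - base for each benchmark, rounded to the table's precision."""
    return {name: round(tuned - base, digits) for name, (base, tuned) in pairs.items()}


qwen35_deltas = deltas(reported)
```

A negative delta is an improvement for the ASR rows and a regression for the accuracy rows, which is why the MMBench change is called out separately in the interpretation.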
Historical refusal-aware MMStar comparison on the same 1500-sample slice:
| Model | Accuracy | Refusal Rate | Non-Refusal Accuracy | Artifact |
|---|---|---|---|---|
| base model | 0.616 | 0.0000 | 0.6160 | results/mmstar_base_1500_refusalaware_20260317.json |
| `output/checkpoint-100` | 0.2913 | 0.9773 | 0.4706 | results/mmstar_ckpt100_1500_refusalaware_parallel_20260317.json |
| `output/alignment_run_20260316_r1c` | 0.2860 | 1.0000 | 0.0000 | results/mmstar_alignment_r1c_1500_postfix_bs1_20260317_073600.json |
| `output/behavior_mix_mmstar512_r1/checkpoint-150` | 0.7180 | 0.0000 | 0.7180 | results/mmstar_behavior_mix_mmstar512_r1_ckpt150_1500.json |
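The three columns in this table can be related with a small sketch. It assumes each evaluated record carries a parsed `prediction`, the gold `answer`, and a boolean `refused` flag. These field names and the function name are hypothetical; only the metric keys `refusal_rate` and `accuracy_excluding_refusals` mirror the evaluator output quoted in this README.

```python
def refusal_aware_metrics(records):
    """Raw accuracy, refusal rate, and accuracy over non-refused samples only."""
    total = len(records)
    if total == 0:
        raise ValueError("no evaluated samples")
    correct = sum(1 for r in records if r["prediction"] == r["answer"])
    answered = [r for r in records if not r["refused"]]
    answered_correct = sum(1 for r in answered if r["prediction"] == r["answer"])
    return {
        "accuracy": correct / total,
        "refusal_rate": (total - len(answered)) / total,
        # Defined as 0.0 when every sample is a refusal.
        "accuracy_excluding_refusals": (
            answered_correct / len(answered) if answered else 0.0
        ),
    }


# Toy case: every response is a refusal, yet the parser still extracts "A",
# so raw accuracy equals the share of questions whose gold answer is "A".
all_refused = [
    {"prediction": "A", "answer": gold, "refused": True}
    for gold in ["A", "B", "C", "D"]
]
collapsed = refusal_aware_metrics(all_refused)
```

The toy case shows why a fully refusing checkpoint can still report a nonzero raw accuracy: the score reflects the answer-key distribution, not visual reasoning.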
Behavior-mix contamination audit and held-out retest:
| Model | Original MMStar Slice | Held-Out MMStar Slice | Delta | Interpretation |
|---|---|---|---|---|
| base model | 0.6160 on 1500 | 0.625510 on 980 | +0.009510 | No MMStar training overlap; held-out score mainly reflects slice difficulty. |
| `output/behavior_mix_mmstar512_r1/checkpoint-150` | 0.7180 on 1500 | 0.631633 on 980 | -0.086367 | Original score is contamination-sensitive because 512 MMStar-derived safe_helpful pairs overlap with the benchmark source file. |
| `output/planned_harmful_only_preventative_steering_r1/checkpoint-300` | 0.616667 on 1500 | 0.627551 on 980 | +0.010884 | Harmful-only training path stays close to base on held-out MMStar without MMStar-answer overlap. |
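The idea behind the held-out retest can be sketched as a filter: drop every benchmark row that also appears in the training mix, and record how many were dropped. The example below is an illustration only. The matching key `source_id`, the function name, and the audit fields are hypothetical, and the actual overlap criterion used by `src/mmstar_heldout.py` is not described in this README.

```python
def build_heldout(benchmark_rows, train_rows, key="source_id"):
    """Return benchmark rows with no key overlap against the training rows, plus an audit."""
    train_keys = {row[key] for row in train_rows if key in row}
    heldout = [row for row in benchmark_rows if row[key] not in train_keys]
    audit = {
        "benchmark_rows": len(benchmark_rows),
        "overlapping_rows": len(benchmark_rows) - len(heldout),
        "heldout_rows": len(heldout),
    }
    return heldout, audit


# Toy data: two of five benchmark rows were reused as training records.
benchmark = [{"source_id": i, "question": f"q{i}"} for i in range(5)]
training = [{"source_id": 1}, {"source_id": 3}, {"text": "no benchmark source"}]
heldout_rows, audit = build_heldout(benchmark, training)
```

Reporting the audit alongside the score makes it explicit how much of the raw slice was excluded before any capability number is quoted.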
Larger-scale safety checks for the current mainline checkpoint:
| Benchmark | Samples | Model | ASR | Safety Rate | Artifact |
|---|---|---|---|---|---|
| MM-SafetyBench | 972 | `output/planned_harmful_only_preventative_steering_r1/checkpoint-300` | 0.00 | 1.00 | results/preventative_steering_r1_checkpoint300_mmsafety972.json |
| JailBreakV-28K | 1000 | `output/planned_harmful_only_preventative_steering_r1/checkpoint-300` | 0.00 | 1.00 | results/preventative_steering_r1_checkpoint300_jailbreakv1000.json |
| MMBench | 4329 | `output/planned_harmful_only_preventative_steering_r1/checkpoint-300` | Accuracy | 0.8776 | results/mmbench_checkpoint300_full.json |
Interpretation:
- These quick checks show a large improvement versus the base model on the sampled image-conditioned attacks.
- Earlier 50-sample wins did not prove a good safe-helpful tradeoff. That caveat still applies to the older checkpoints.
- `output/checkpoint-100` predates the current explicit alignment-training path. Its measured gains should be treated as results from the earlier native LoRA/SFT path, not as proof of the newer alignment-enabled path.
- The first explicit alignment-enabled checkpoint `output/alignment_run_20260316_r1c` matches the earlier `checkpoint-100` quick-check outcome on both sampled benchmarks.
- The next question is not whether the new checkpoint can refuse these 50 attacks, but whether it can do so without unnecessary over-refusal and whether it remains strong on larger evaluations.
- The new 1500-sample MMStar run answers part of that question unfavorably: the checkpoint completes the benchmark but still returns refusal-style text on effectively all samples, and the parser extracts `A` for all 1500 predictions.
- The MMStar evaluator now includes refusal-aware metrics; replaying the saved 1500-sample artifact yields `refusal_rate=1.0` and `accuracy_excluding_refusals=0.0`.
- As a result, the recorded `0.286` MMStar accuracy should be read as evidence of persistent over-refusal plus answer-parsing bias, not as evidence of preserved visual reasoning capability.
- The base-model comparison removes the remaining ambiguity: MMStar is a valid capability signal here, because the untuned model reaches `0.616` accuracy with `0.0` refusal rate on the same 1500-sample slice.
- `output/checkpoint-100` sits between the two extremes. It is not fully collapsed like `output/alignment_run_20260316_r1c`, but its `0.9773` refusal rate shows that its apparent MMStar score is still mostly a side effect of refusal behavior.
- The current mainline claim is now the harmful-only preventative route, not the mixed-data route:
  - `output/planned_harmful_only_preventative_steering_r1/checkpoint-300` reaches `0.0` measured ASR on MM-SafetyBench972 and JailBreakV1000
  - it stays effectively flat on MMBench (`0.879 -> 0.878`)
  - it stays near base on held-out MMStar980 (`0.625510 -> 0.627551`)
- This supports the current project hypothesis: pure harmful training can improve multimodal safety without obvious general-capability collapse, provided preventative steering is used to suppress over-refusal.
- The same pattern now appears on a second local VLM family (`/mnt/data/llms/mllm/llava-1.5-7b-hf`): relative to the LLaVA base model, `checkpoint-300` keeps MMStar held-out and MMBench nearly flat while taking MM-SafetyBench972 and JailBreakV1000 ASR to `0.0`.
- `output/behavior_mix_mmstar512_r1/checkpoint-150` is kept only as a side reference:
  - it also performs well on safety
  - but its raw MMStar1500 headline is contamination-sensitive and should not anchor the main project narrative
- For future capability claims, the recommended path is to export and report the MMStar held-out split rather than reusing the raw MMStar slice directly.
- Remaining caveat: JailBreakV is still a 1000-sample slice rather than a full 28K sweep, and no seed-variance study has been run yet.
Estimated project completion toward a defensible harmful-only research baseline: about 80%.
Completed:
- data relocation and local dataset conventions
- native training/eval loop
- image-conditioned evaluation path
- explicit alignment-capable training code path
- first explicit alignment-enabled training run
- initial quick-check results
- first explicit alignment-enabled checkpoint evaluated on the same quick image-conditioned protocol
- repaired multi-GPU MMStar evaluation path and a completed 1500-sample MMStar run for `output/alignment_run_20260316_r1c`
- three-way MMStar comparison across base model, `checkpoint-100`, and `output/alignment_run_20260316_r1c`
- behavior-vector-inspired preventative-steering stack
- completed harmful-only preventative checkpoint with larger-slice safety and capability validation
- MMStar overlap audit for the behavior-mix recipe
- independent held-out MMStar split/export/eval path
- a mainline checkpoint that has passed larger-scale safety validation and a contamination-aware capability retest
- cross-model LLaVA validation showing the same safety gain with near-flat capability relative to the corresponding LLaVA base model
Still missing:
- repeated-seed stability checks for the harmful-only preventative route
- broader preventative-steering ablations beyond the current anchor/projection sweep
- larger JailBreakV coverage beyond the current 1000-sample slice
- final write-up quality documentation around the current mainline settings
- Generalization risk: the current mainline checkpoint is strong on the measured slices, but not yet stress-tested across repeated seeds or full-scale JailBreakV.
- Cross-model scope risk: the current LLaVA result is strong on the measured slices, but it has only been validated for one LLaVA-family checkpoint and one training seed.
- Ablation risk: the preventative recipe is now the mainline, but anchor and over-refusal regularization still need cleaner causal isolation.
- Contamination risk: raw MMStar scores can be overstated if the training mix reuses MMStar-derived rows from the same source file.
- Helpfulness-gap risk: MMStar is a useful capability proxy, but it is not a complete safe-helpful benchmark.
- Capability-eval parsing risk still applies to older artifacts and should remain part of every MMStar interpretation.
- Research scope risk: method intent is still broader than the current empirical coverage.
- Rerun the recipe behind `output/planned_harmful_only_preventative_steering_r1/checkpoint-300` with repeated seeds.
- Extend the preventative-steering ablation surface:
  - `anchor_weight`
  - `overrefusal_weight`
  - `projection_margin`
- Extend JailBreakV evaluation beyond the current 1000-sample slice.
- Add a more explicit helpfulness / over-refusal benchmark beyond MMStar alone.
- Use the held-out MMStar path for all future capability reporting and keep raw MMStar results as secondary evidence only.
- Refresh this README when a new harmful-only checkpoint beats the current mainline on both safety and train/eval-disjoint capability.
- `data/`: project-local datasets
- `src/`: training, alignment, evaluation, and utility code
- `scripts/`: orchestration and helper scripts
- `configs/`: configuration defaults and status hints
- `results/`: evaluation outputs and orchestration artifacts
- `output/`: training outputs and checkpoints
- `tests/`: targeted regression and helper tests
- `records/`: experiment log, change log, and record templates
- `docs/superpowers/plans/`: implementation plans used by agentic workers
- `pm_agent`: roadmap decomposition, stage gates, acceptance
- `data_engineer_agent`: dataset loading, pair construction, dataloader readiness
- `safety_research_agent`: safety direction extraction and alignment-loss research
- `model_engineer_agent`: native LoRA training and checkpoint generation
- `eval_engineer_agent`: benchmark execution and ASR/safety-rate reporting
- `experiment_historian_agent`: maintains `records/experiment_log.md`, `records/change_log.md`, and refreshes this README when the mainline project status changes
# Orchestrator status
python scripts/agent_orchestrator.py status
# GPU preflight before heavy work
python scripts/agent_orchestrator.py preflight
# Native LoRA/SFT training
python main.py lora \
--dataset results/orchestration/train_dataset.jsonl \
--output-dir output \
--model /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct
# Alignment-enabled training path
python main.py lora \
--dataset results/orchestration/train_dataset.jsonl \
--output-dir output/alignment-run \
--model /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct \
--alignment-enabled
# Preventative-steering mainline training path
python main.py lora \
--dataset results/orchestration/train_dataset.jsonl \
--output-dir output/preventative-mainline \
--model /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct \
--preventative-steering \
--probe-harmful-records results/orchestration/train_dataset.jsonl \
--probe-mmstar-path /mnt/data/wzh/MMStar.tsv \
--preventative-anchor-weight 0.05 \
--preventative-overrefusal-weight 0.10 \
--preventative-projection-margin 0.02
# MM-SafetyBench image-conditioned quick eval
python main.py eval \
--model-path /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct \
--benchmark mm-safetybench \
--benchmark-path ./data/mm-safetybench \
--max-samples 50 \
--output results/mm_safetybench_image_base_50.json
# JailBreakV-28K image-conditioned quick eval
python main.py eval \
--model-path output/checkpoint-100 \
--benchmark jailbreakv \
--benchmark-path ./data/jailbreakv-28k \
--max-samples 50 \
--output results/jailbreakv_image_ckpt100_50.json
# Alignment helper tests
python -m unittest tests.test_lora_alignment -v
# Generate ablation run/contrast tables
python scripts/build_ablation_tables.py
# Generate the human-readable research conclusion table
python scripts/build_research_conclusion_table.py
# Generate the final human-readable main comparison table
python scripts/build_final_main_table.py
# Export a held-out MMStar split that excludes MMStar-derived training rows
python scripts/run_mmstar_heldout_eval.py \
--train-dataset-path results/orchestration/train_dataset.jsonl \
--heldout-path results/mmstar_preventative_heldout.tsv \
--audit-output results/mmstar_preventative_heldout_audit.json
# Evaluate a checkpoint on the held-out MMStar split with GPU preflight
python scripts/run_mmstar_heldout_eval.py \
--train-dataset-path results/orchestration/train_dataset.jsonl \
--model-path output/planned_harmful_only_preventative_steering_r1/checkpoint-300 \
--heldout-path results/mmstar_preventative_heldout.tsv \
--audit-output results/mmstar_preventative_heldout_audit.json \
--output results/mmstar_preventative_ckpt300_heldout.json \
  --wait-gpu

- `records/experiment_log.md` is the append-only summary of evaluated runs and what they mean.
- `records/change_log.md` is the append-only summary of project-level milestones that change interpretation of results.
- `records/templates/` contains copy-ready entry formats for new experiments and changes.
- When a new result materially changes the current mainline status, update both the relevant record and this README in the same work session.