MM-SafetyResearch

Language:

English: README.md
中文: README_zh.md

Status-oriented workspace for multimodal safety alignment research on Qwen2.5-VL-class models.

Problem

Vision-language models can follow harmful requests when the attack is image-conditioned or phrased differently from text-only safety training data. This project tries to transfer safety behavior learned from text supervision into multimodal interactions without claiming that text-only alignment is enough by itself.

Intended Method

The target method is:

Build text-side safe vs. unsafe representations and extract a safety direction.
Construct paired multimodal training records from MM-SafetyBench-style data.
Fine-tune a VLM with native LoRA SFT plus an explicit cross-modal alignment loss so multimodal hidden-state differences track the text-side safety direction.
Evaluate on image-conditioned attacks from MM-SafetyBench and JailBreakV-28K.
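
The first and third steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the direction is taken as a normalized difference of mean hidden states, and the alignment term as one minus cosine similarity. Both choices, and the function names `safety_direction` and `alignment_loss`, are hypothetical and may differ from what src/lora_finetune.py actually does.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def normalize(vec):
    """Scale a vector to unit length (assumes a non-zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def safety_direction(safe_hidden, unsafe_hidden):
    """Assumed definition: unit vector from the unsafe mean toward the safe mean."""
    safe_mean = mean_vector(safe_hidden)
    unsafe_mean = mean_vector(unsafe_hidden)
    return normalize([s - u for s, u in zip(safe_mean, unsafe_mean)])

def alignment_loss(mm_safe_hidden, mm_unsafe_hidden, direction):
    """Assumed loss: mean (1 - cosine) between each multimodal
    safe-minus-unsafe difference and the text-side direction."""
    losses = []
    for safe_row, unsafe_row in zip(mm_safe_hidden, mm_unsafe_hidden):
        delta = normalize([s - u for s, u in zip(safe_row, unsafe_row)])
        cosine = sum(d * w for d, w in zip(delta, direction))
        losses.append(1.0 - cosine)
    return sum(losses) / len(losses)

# Toy 2-d hidden states, purely illustrative.
text_safe = [[1.0, 0.0], [3.0, 0.0]]
text_unsafe = [[0.0, 0.0], [0.0, 0.0]]
direction = safety_direction(text_safe, text_unsafe)

# A multimodal difference parallel to the direction gives zero loss;
# an orthogonal one gives a loss of one.
loss_aligned = alignment_loss([[2.0, 1.0]], [[0.0, 1.0]], direction)
loss_orthogonal = alignment_loss([[0.0, 2.0]], [[0.0, 0.0]], direction)
```

In a real training loop the same quantity would be computed on model hidden states and added to the LM loss with a weight; the toy vectors here only show the geometry.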

Method Summary

The project now distinguishes clearly between the original design target and the current best-performing recipe. The original target was explicit safety-direction transfer: extract a text-side safe/unsafe direction, then align multimodal hidden-state differences to that direction during LoRA finetuning. In practice, the direct harmful-only + explicit alignment route often improved safety by pushing the model into over-refusal. The current mainline therefore uses harmful-only multimodal LoRA finetuning with preventative steering: a small set of benign probes is used only as a behavioral anchor, while an additional over-refusal penalty suppresses drift toward blanket refusal. The resulting objective keeps the safety gains of harmful-only training while preserving general multimodal capability much better than the pure alignment route.
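
One way to read the preventative-steering objective is as an LM loss plus two weighted regularizers. The sketch below is an assumption about the functional form, not the repository's implementation: `anchor_drift` stands for some measure of how far benign-probe behavior has moved from the base model, and `refusal_projections` for per-probe projections onto a refusal-like direction, penalized only above a margin. The default weights match the values recorded for the mainline run.

```python
def hinge(value, margin):
    """Penalize only the part of `value` that exceeds `margin`."""
    return max(0.0, value - margin)

def preventative_objective(
    lm_loss,
    anchor_drift,
    refusal_projections,
    anchor_weight=0.05,
    overrefusal_weight=0.10,
    projection_margin=0.02,
):
    """Hypothetical combined objective:
    LM loss + anchor term + hinged over-refusal penalty on benign probes."""
    overrefusal = sum(
        hinge(p, projection_margin) for p in refusal_projections
    ) / len(refusal_projections)
    return lm_loss + anchor_weight * anchor_drift + overrefusal_weight * overrefusal

# Illustrative numbers only: one probe is inside the margin, one is 0.10 above it.
total = preventative_objective(
    lm_loss=1.0,
    anchor_drift=0.2,
    refusal_projections=[0.01, 0.12],
)
```

With this form, probes whose projection stays under the margin contribute nothing, so the penalty only activates when benign inputs start drifting toward refusal.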

What Is Implemented Now

Project-local dataset layout is in place under data/mm-safetybench and data/jailbreakv-28k.
Native Qwen2.5-VL LoRA fine-tuning is implemented in src/lora_finetune.py and exposed through main.py lora.
Image-conditioned evaluation is implemented through main.py eval and the current collator/evaluator path.
The codebase now includes an explicit alignment-training path:
- alignment config fields in LoRAConfig
- per-record alignment payload derivation from safe_text, unsafe_text, multimodal_text, and image_path
- combined LM loss + pair/global alignment loss in the training loop
- helper coverage in tests/test_lora_alignment.py
The first explicit alignment-enabled training run has completed successfully:
- output dir: output/alignment_run_20260316_r1c
- one epoch on 3360 records
- global_step=210
- final_loss=0.004323
- intermediate checkpoints through checkpoint-200
The current mainline checkpoint is the harmful-only preventative-steering run:
- output dir: output/planned_harmful_only_preventative_steering_r1
- mainline checkpoint: checkpoint-300
- training mix:
  - harmful_compliance=3360
  - safe_helpful=0
- preventative steering:
  - preventative_anchor_weight=0.05
  - preventative_overrefusal_weight=0.10
  - preventative_projection_margin=0.02
A behavior-mix side-reference run is kept for comparison only:
- output dir: output/behavior_mix_mmstar512_r1
- side-reference checkpoint: checkpoint-150
- use case: contamination audit and mixed-data comparison, not the main project claim
A newer-base replication on local Qwen3.5-9B is also now recorded:
- output dir: output/qwen3.5_preventative_steering_r2_fixedlogic
- evaluated checkpoint: checkpoint-300
- same harmful-only preventative recipe:
  - preventative_anchor_weight=0.05
  - preventative_overrefusal_weight=0.10
  - preventative_projection_margin=0.02
- caveat:
  - its first standalone MMBench artifact was invalid because the evaluator left thinking enabled and used right-padding in batched generation
  - the repaired artifact is results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json and supersedes results/qwen3.5_r2_ckpt300_mmbench_full.json
MMStar overlap auditing and an independent held-out capability path are now implemented:
- overlap audit artifacts: results/ablation/mmstar_overlap_audit.csv and results/ablation/mmstar_overlap_audit.md
- held-out split builder: src/mmstar_heldout.py
- independent held-out eval entrypoint: scripts/run_mmstar_heldout_eval.py
Multi-agent orchestration support exists for dataset prep, script generation, preflight checks, and lightweight readiness artifacts.
A dedicated mainline technical summary is available at docs/preventative_steering_mainline.md.
A one-page project briefing is available at docs/project_briefing_onepage.md.
A final handoff-oriented technical scheme is available at docs/final_technical_scheme.md.
A paper-style abstract draft is available at docs/paper_style_abstract.md.

Verified Quantitative Results

Quick image-conditioned evaluations already present in this workspace:

Benchmark	Samples	Model	ASR	Safety Rate	Artifact
MM-SafetyBench	50	base model	0.92	0.08	`results/mm_safetybench_image_base_50.json`
MM-SafetyBench	50	`output/checkpoint-100`	0.00	1.00	`results/mm_safetybench_image_ckpt100_50.json`
MM-SafetyBench	50	`output/alignment_run_20260316_r1c`	0.00	1.00	`results/mm_safetybench_image_alignment_r1c_50.json`
JailBreakV-28K	50	base model	0.40	0.60	`results/jailbreakv_image_base_50.json`
JailBreakV-28K	50	`output/checkpoint-100`	0.00	1.00	`results/jailbreakv_image_ckpt100_50.json`
JailBreakV-28K	50	`output/alignment_run_20260316_r1c`	0.00	1.00	`results/jailbreakv_image_alignment_r1c_50.json`
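
In the table above, Safety Rate is the complement of ASR. A minimal sketch of that bookkeeping, assuming each evaluated sample has already been judged as a successful or failed attack (the function name and the per-sample boolean representation are illustrative, not taken from the evaluator):

```python
def summarize_attacks(attack_succeeded):
    """Compute ASR and safety rate from per-sample attack outcomes.

    attack_succeeded: list of booleans, True when the model complied
    with the harmful request.
    """
    total = len(attack_succeeded)
    asr = sum(attack_succeeded) / total
    return {"samples": total, "asr": asr, "safety_rate": 1.0 - asr}

# 46 successful attacks out of 50 is consistent with the base-model
# MM-SafetyBench row (ASR 0.92, Safety Rate 0.08).
base_flags = [True] * 46 + [False] * 4
summary = summarize_attacks(base_flags)
```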

Capability-side check now completed for the explicit alignment checkpoint:

Benchmark	Samples	Model	Main Metric	Value	Artifact
MMStar	1500	`output/alignment_run_20260316_r1c`	Accuracy	0.286	`results/mmstar_alignment_r1c_1500_postfix_bs1_20260317_073600.json`

Mainline comparison: base model vs harmful-only preventative steering:

Benchmark	Samples	Base Model	Preventative ckpt300	Interpretation
MM-SafetyBench ASR	972	0.920 on 50-sample quick slice	0.000	Measured multimodal attack success drops to zero on the larger loaded slice.
JailBreakV ASR	1000	0.400 on 50-sample quick slice	0.000	Large-slice jailbreak success also drops to zero.
MMBench Accuracy	4329	0.879	0.878	General multimodal capability remains effectively flat.
Held-out MMStar Accuracy	980	0.625510	0.627551	Capability stays near base on a train/eval-disjoint slice.

Cross-model replication on local LLaVA-1.5-7B: base model vs harmful-only preventative steering checkpoint-300:

Benchmark	Samples	LLaVA Base	LLaVA ckpt300	Delta	Interpretation
MM-SafetyBench ASR	972	0.996914	0.000000	-0.996914	The preventative-steering checkpoint removes all measured image-conditioned attack success on the larger slice.
JailBreakV ASR	1000	0.822000	0.000000	-0.822000	Large-slice jailbreak success also drops to zero for the tuned LLaVA checkpoint.
MMBench Accuracy	4329	0.733426	0.732502	-0.000924	General multimodal capability stays effectively flat relative to the same LLaVA base model.
Held-out MMStar Accuracy	980	0.379592	0.373469	-0.006122	Held-out capability drops only slightly relative to the same LLaVA base model.

Artifacts:

results/llava_base_mmsafety972.json
results/llava_ckpt300_mmsafety972.json
results/llava_base_jailbreakv1000.json
results/llava_ckpt300_jailbreakv1000.json
results/llava_base_mmbench_full.json
results/llava_ckpt300_mmbench_full.json
results/llava_base_mmstar_heldout980.json
results/llava_ckpt300_mmstar_heldout980.json
results/ablation/llava_base_vs_ckpt300_table.md

Newer-base replication on local Qwen3.5-9B with repaired MMBench evaluation:

Benchmark	Samples	Model	Main Metric	Value	Artifact
MM-SafetyBench	972	`output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300`	ASR	0.000000	`results/qwen3.5_r2_ckpt300_mmsafety972.json`
JailBreakV	1000	`output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300`	ASR	0.000000	`results/qwen3.5_r2_ckpt300_jailbreakv1000.json`
MMStar held-out	980	`output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300`	Accuracy	0.667347	`results/qwen3.5_r2_ckpt300_mmstar_heldout980.json`
MMBench	4329	`output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300`	Accuracy	0.852391	`results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json`

Interpretation update for the Qwen3.5 replication:

the repaired MMBench artifact shows that the earlier Qwen3.5 0.0231 result was an evaluator bug, not a model collapse
the superseded artifact was results/qwen3.5_r2_ckpt300_mmbench_full.json
the valid replacement is results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json
this gives the repository a third positive harmful-only preventative result family:
- Qwen2.5-VL mainline
- LLaVA-1.5-7B replication
- Qwen3.5-9B replication
strict within-model comparison is now complete against the same local Qwen3.5 base: held-out MMStar improves by 0.012245, MMBench changes by -0.023100, and MM-SafetyBench972 and JailBreakV1000 ASR fall from 0.727366 and 0.266000, respectively, to 0.000000
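
The right-padding part of the evaluator bug is easy to see in isolation. Decoder-only models continue generation from the last position of each row, so in a right-padded batch the shorter prompts end in pad tokens rather than prompt tokens. The sketch below uses a hypothetical `pad_batch` helper and made-up token ids; in Hugging Face tokenizers the corresponding setting is `padding_side`, but how the repaired evaluator sets it is not shown here.

```python
def pad_batch(sequences, pad_id, side):
    """Pad token-id lists to a common width on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + seq if side == "left" else seq + fill)
    return padded

prompts = [[11, 12, 13], [21]]  # made-up token ids
right = pad_batch(prompts, pad_id=0, side="right")
left = pad_batch(prompts, pad_id=0, side="left")

# The last position is where generation continues from.
last_right = [row[-1] for row in right]  # second row ends in a pad token
last_left = [row[-1] for row in left]    # every row ends in a real prompt token
```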

Strict within-model comparison on local Qwen3.5-9B:

Benchmark	Qwen3.5 Base	Qwen3.5 ckpt300	Delta	Interpretation
MM-SafetyBench ASR	0.727366	0.000000	-0.727366	Preventative steering removes the measured image-conditioned attack success on the larger MM-SafetyBench slice.
JailBreakV ASR	0.266000	0.000000	-0.266000	Large-slice jailbreak success also drops to zero for the tuned Qwen3.5 checkpoint.
MMBench Accuracy	0.875491	0.852391	-0.023100	General multimodal capability stays high, although the tuned checkpoint is lower than the same Qwen3.5 base on full MMBench.
Held-out MMStar Accuracy	0.655102	0.667347	0.012245	Held-out MMStar improves slightly relative to the same Qwen3.5 base model.
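
The Delta column is tuned minus base. A short check that recomputes it from the table values (the dictionary layout is just for illustration):

```python
# (base, tuned) pairs copied from the Qwen3.5 comparison table.
rows = {
    "MM-SafetyBench ASR": (0.727366, 0.000000),
    "JailBreakV ASR": (0.266000, 0.000000),
    "MMBench Accuracy": (0.875491, 0.852391),
    "Held-out MMStar Accuracy": (0.655102, 0.667347),
}

# Delta = tuned - base, rounded to the six decimals used in the table.
deltas = {name: round(tuned - base, 6) for name, (base, tuned) in rows.items()}
```

Negative deltas are improvements for the ASR rows and regressions for the accuracy rows.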

Artifacts:

results/qwen3.5_base_mmsafety972.json
results/qwen3.5_r2_ckpt300_mmsafety972.json
results/qwen3.5_base_jailbreakv1000.json
results/qwen3.5_r2_ckpt300_jailbreakv1000.json
results/qwen3.5_base_mmbench_full.json
results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json
results/qwen3.5_base_mmstar_heldout980.json
results/qwen3.5_r2_ckpt300_mmstar_heldout980.json
results/ablation/qwen3.5_base_vs_ckpt300_table.md

Historical refusal-aware MMStar comparison on the same 1500-sample slice:

Model	Accuracy	Refusal Rate	Non-Refusal Accuracy	Artifact
base model	0.616	0.0000	0.6160	`results/mmstar_base_1500_refusalaware_20260317.json`
`output/checkpoint-100`	0.2913	0.9773	0.4706	`results/mmstar_ckpt100_1500_refusalaware_parallel_20260317.json`
`output/alignment_run_20260316_r1c`	0.2860	1.0000	0.0000	`results/mmstar_alignment_r1c_1500_postfix_bs1_20260317_073600.json`
`output/behavior_mix_mmstar512_r1/checkpoint-150`	0.7180	0.0000	0.7180	`results/mmstar_behavior_mix_mmstar512_r1_ckpt150_1500.json`
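
The three columns above can be derived from per-sample records. The following is a sketch under assumptions: the record fields `predicted`, `answer`, and `is_refusal` are hypothetical, and the refusal detector itself is out of scope. The output keys `refusal_rate` and `accuracy_excluding_refusals` follow the metric names used elsewhere in this README.

```python
def refusal_aware_metrics(records):
    """Accuracy over all samples, refusal rate, and accuracy over
    the non-refused samples only."""
    total = len(records)
    answered = [r for r in records if not r["is_refusal"]]
    correct = sum(r["predicted"] == r["answer"] for r in records)
    correct_answered = sum(r["predicted"] == r["answer"] for r in answered)
    return {
        "accuracy": correct / total,
        "refusal_rate": (total - len(answered)) / total,
        # Defined as 0.0 when every sample is a refusal.
        "accuracy_excluding_refusals": (
            correct_answered / len(answered) if answered else 0.0
        ),
    }

# Toy case: every response is a refusal and the parser falls back to "A".
records = [
    {"predicted": "A", "answer": gold, "is_refusal": True}
    for gold in ["A", "B", "C", "D"]
]
metrics = refusal_aware_metrics(records)
```

The toy case shows how a fully refusing model can still post a non-zero headline accuracy when the parser extracts a default letter, while its non-refusal accuracy is zero.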

Behavior-mix contamination audit and held-out retest:

Model	Original MMStar Slice	Held-Out MMStar Slice	Delta	Interpretation
base model	0.6160 on 1500	0.625510 on 980	+0.009510	No MMStar training overlap; held-out score mainly reflects slice difficulty.
`output/behavior_mix_mmstar512_r1/checkpoint-150`	0.7180 on 1500	0.631633 on 980	-0.086367	Original score is contamination-sensitive because 512 MMStar-derived `safe_helpful` pairs overlap with the benchmark source file.
`output/planned_harmful_only_preventative_steering_r1/checkpoint-300`	0.616667 on 1500	0.627551 on 980	+0.010884	Harmful-only training path stays close to base on held-out MMStar without MMStar-answer overlap.
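
A held-out split of this kind amounts to removing every benchmark row that also appears in the training data. The sketch below assumes rows can be matched on a normalized `question` field; the actual matching key and logic in src/mmstar_heldout.py may differ, and `build_heldout` and `normalize_text` are illustrative names.

```python
def normalize_text(text):
    """Lowercase and collapse whitespace so trivial edits still match."""
    return " ".join(text.lower().split())

def build_heldout(benchmark_rows, train_records, key="question"):
    """Drop benchmark rows whose key also occurs in the training records.

    Returns the held-out rows and a small audit summary.
    """
    seen = {normalize_text(rec[key]) for rec in train_records if key in rec}
    heldout = [
        row for row in benchmark_rows if normalize_text(row[key]) not in seen
    ]
    audit = {
        "benchmark_rows": len(benchmark_rows),
        "overlap_rows": len(benchmark_rows) - len(heldout),
        "heldout_rows": len(heldout),
    }
    return heldout, audit

# Toy data: one of three benchmark questions was reused for training.
benchmark = [
    {"question": "What color is the car?"},
    {"question": "How many dogs are shown?"},
    {"question": "Which shape is largest?"},
]
train = [{"question": "how many  dogs are shown?"}, {"prompt": "unrelated"}]
heldout, audit = build_heldout(benchmark, train)
```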

Larger-scale safety and capability checks for the current mainline checkpoint:

Benchmark	Samples	Model	Main Metric	Value	Artifact
MM-SafetyBench	972	`output/planned_harmful_only_preventative_steering_r1/checkpoint-300`	ASR / Safety Rate	0.00 / 1.00	`results/preventative_steering_r1_checkpoint300_mmsafety972.json`
JailBreakV-28K	1000	`output/planned_harmful_only_preventative_steering_r1/checkpoint-300`	ASR / Safety Rate	0.00 / 1.00	`results/preventative_steering_r1_checkpoint300_jailbreakv1000.json`
MMBench	4329	`output/planned_harmful_only_preventative_steering_r1/checkpoint-300`	Accuracy	0.8776	`results/mmbench_checkpoint300_full.json`

Interpretation:

These quick checks show a large improvement versus the base model on the sampled image-conditioned attacks.
Earlier 50-sample wins did not prove a good safe-helpful tradeoff. That caveat still applies to the older checkpoints.
output/checkpoint-100 predates the current explicit alignment-training path. Its measured gains should be treated as results from the earlier native LoRA/SFT path, not as proof of the newer alignment-enabled path.
The first explicit alignment-enabled checkpoint output/alignment_run_20260316_r1c matches the earlier checkpoint-100 quick-check outcome on both sampled benchmarks.
The next question is not whether the new checkpoint can refuse these 50 attacks, but whether it can do so without unnecessary over-refusal and whether it remains strong on larger evaluations.
The new 1500-sample MMStar run answers part of that question unfavorably: the checkpoint completes the benchmark but still returns refusal-style text on effectively all samples, and the parser extracts A for all 1500 predictions.
The MMStar evaluator now includes refusal-aware metrics; replaying the saved 1500-sample artifact yields refusal_rate=1.0 and accuracy_excluding_refusals=0.0.
As a result, the recorded 0.286 MMStar accuracy should be read as evidence of persistent over-refusal plus answer-parsing bias, not as evidence of preserved visual reasoning capability.
The base-model comparison removes the remaining ambiguity: MMStar is a valid capability signal here, because the untuned model reaches 0.616 accuracy with 0.0 refusal rate on the same 1500-sample slice.
output/checkpoint-100 sits between the two extremes. It is not fully collapsed like output/alignment_run_20260316_r1c, but its 0.9773 refusal rate shows that its apparent MMStar score is still mostly a side effect of refusal behavior.
The current mainline claim is now the harmful-only preventative route, not the mixed-data route:
- output/planned_harmful_only_preventative_steering_r1/checkpoint-300 reaches 0.0 measured ASR on MM-SafetyBench972 and JailBreakV1000
- it stays effectively flat on MMBench (0.879 -> 0.878)
- it stays near base on held-out MMStar980 (0.625510 -> 0.627551)
This supports the current project hypothesis: harmful-only training can improve multimodal safety without obvious general-capability collapse, provided preventative steering is used to suppress over-refusal.
The same pattern now appears on a second local VLM family (/mnt/data/llms/mllm/llava-1.5-7b-hf): relative to the LLaVA base model, checkpoint-300 keeps MMStar held-out and MMBench nearly flat while taking MM-SafetyBench972 and JailBreakV1000 ASR to 0.0.
output/behavior_mix_mmstar512_r1/checkpoint-150 is kept only as a side reference:
- it also performs well on safety
- but its raw MMStar1500 headline is contamination-sensitive and should not anchor the main project narrative
For future capability claims, the recommended path is to export and report the MMStar held-out split rather than reusing the raw MMStar slice directly.
Remaining caveat: JailBreakV is still a 1000-sample slice rather than a full 28K sweep, and no seed-variance study has been run yet.
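
A measured ASR of 0.0 on a finite slice still leaves room for a small true attack success rate. As a rough guide, and assuming the samples are independent draws from the benchmark, the exact one-sided binomial upper bound for zero observed successes in n trials is 1 - alpha^(1/n). The helper name below is illustrative.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """One-sided upper bound on the true rate when 0 of n trials succeed.

    Solves (1 - p)^n = alpha for p, with alpha = 1 - confidence.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)

# Slice sizes used for the mainline safety checks.
bound_jailbreakv = zero_event_upper_bound(1000)
bound_mmsafety = zero_event_upper_bound(972)
```

For both slice sizes the 95% bound comes out at roughly 0.3%, which is small but not zero; it does not account for seed variance or for differences between the sampled slice and the full benchmark.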

Completion Estimate

Estimated project completion toward a defensible harmful-only research baseline: about 80%.

Completed:

data relocation and local dataset conventions
native training/eval loop
image-conditioned evaluation path
explicit alignment-capable training code path
first explicit alignment-enabled training run
initial quick-check results
first explicit alignment-enabled checkpoint evaluated on the same quick image-conditioned protocol
repaired multi-GPU MMStar evaluation path and a completed 1500-sample MMStar run for output/alignment_run_20260316_r1c
three-way MMStar comparison across base model, checkpoint-100, and output/alignment_run_20260316_r1c
behavior-vector-inspired preventative-steering stack
completed harmful-only preventative checkpoint with larger-slice safety and capability validation
MMStar overlap audit for the behavior-mix recipe
independent held-out MMStar split/export/eval path
a mainline checkpoint that has passed larger-scale safety validation and a contamination-aware capability retest
cross-model LLaVA validation showing the same safety gain with near-flat capability relative to the corresponding LLaVA base model

Still missing:

repeated-seed stability checks for the harmful-only preventative route
broader preventative-steering ablations beyond the current anchor/projection sweep
larger JailBreakV coverage beyond the current 1000-sample slice
write-up-quality documentation of the current mainline settings

Risks

Generalization risk: the current mainline checkpoint is strong on the measured slices, but not yet stress-tested across repeated seeds or full-scale JailBreakV.
Cross-model scope risk: the current LLaVA result is strong on the measured slices, but it has only been validated for one LLaVA-family checkpoint and one training seed.
Ablation risk: the preventative recipe is now the mainline, but anchor and over-refusal regularization still need cleaner causal isolation.
Contamination risk: raw MMStar scores can be overstated if the training mix reuses MMStar-derived rows from the same source file.
Helpfulness-gap risk: MMStar is a useful capability proxy, but it is not a complete safe-helpful benchmark.
Capability-eval parsing risk: answer-parsing bias still applies to older artifacts and should remain part of every MMStar interpretation.
Research scope risk: method intent is still broader than the current empirical coverage.

Next Steps

Run repeated-seed reruns for output/planned_harmful_only_preventative_steering_r1/checkpoint-300.
Extend the preventative-steering ablation surface:
- anchor_weight
- overrefusal_weight
- projection_margin
Extend JailBreakV evaluation beyond the current 1000-sample slice.
Add a more explicit helpfulness / over-refusal benchmark beyond MMStar alone.
Use the held-out MMStar path for all future capability reporting and keep raw MMStar results as secondary evidence only.
Refresh this README when a new harmful-only checkpoint beats the current mainline on both safety and train/eval-disjoint capability.
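
The ablation surface above can be enumerated as a grid of training commands. The sketch below only builds command strings from the flags already shown in this README; the grid values and the `output/ablation_*` naming scheme are hypothetical placeholders, not settings that have been run.

```python
import itertools
import shlex

# Flags copied from the preventative-steering mainline command.
BASE_COMMAND = [
    "python", "main.py", "lora",
    "--dataset", "results/orchestration/train_dataset.jsonl",
    "--model", "/mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct",
    "--preventative-steering",
    "--probe-harmful-records", "results/orchestration/train_dataset.jsonl",
    "--probe-mmstar-path", "/mnt/data/wzh/MMStar.tsv",
]

def ablation_commands(anchor_weights, overrefusal_weights, projection_margins):
    """Return one shell-quoted training command per grid point."""
    commands = []
    grid = itertools.product(anchor_weights, overrefusal_weights, projection_margins)
    for anchor, overrefusal, margin in grid:
        # Hypothetical run-name scheme for illustration.
        run_name = f"ablation_a{anchor}_o{overrefusal}_m{margin}"
        args = BASE_COMMAND + [
            "--output-dir", f"output/{run_name}",
            "--preventative-anchor-weight", str(anchor),
            "--preventative-overrefusal-weight", str(overrefusal),
            "--preventative-projection-margin", str(margin),
        ]
        commands.append(shlex.join(args))
    return commands

# Illustrative 2 x 2 x 2 grid around the mainline values.
commands = ablation_commands([0.05, 0.1], [0.1, 0.2], [0.02, 0.05])
```

Printing `commands` gives lines that can be pasted into a job script; each grid point would also need its own seed handling once repeated-seed runs are in place.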

Key Directories

data/: project-local datasets
src/: training, alignment, evaluation, and utility code
scripts/: orchestration and helper scripts
configs/: configuration defaults and status hints
results/: evaluation outputs and orchestration artifacts
output/: training outputs and checkpoints
tests/: targeted regression and helper tests
records/: experiment log, change log, and record templates
docs/superpowers/plans/: implementation plans used by agentic workers

Multi-Agent Ownership

pm_agent: roadmap decomposition, stage gates, acceptance
data_engineer_agent: dataset loading, pair construction, dataloader readiness
safety_research_agent: safety direction extraction and alignment-loss research
model_engineer_agent: native LoRA training and checkpoint generation
eval_engineer_agent: benchmark execution and ASR/safety-rate reporting
experiment_historian_agent: maintains records/experiment_log.md, records/change_log.md, and refreshes this README when the mainline project status changes

Common Commands

# Orchestrator status
python scripts/agent_orchestrator.py status

# GPU preflight before heavy work
python scripts/agent_orchestrator.py preflight

# Native LoRA/SFT training
python main.py lora \
  --dataset results/orchestration/train_dataset.jsonl \
  --output-dir output \
  --model /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct

# Alignment-enabled training path
python main.py lora \
  --dataset results/orchestration/train_dataset.jsonl \
  --output-dir output/alignment-run \
  --model /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct \
  --alignment-enabled

# Preventative-steering mainline training path
python main.py lora \
  --dataset results/orchestration/train_dataset.jsonl \
  --output-dir output/preventative-mainline \
  --model /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct \
  --preventative-steering \
  --probe-harmful-records results/orchestration/train_dataset.jsonl \
  --probe-mmstar-path /mnt/data/wzh/MMStar.tsv \
  --preventative-anchor-weight 0.05 \
  --preventative-overrefusal-weight 0.10 \
  --preventative-projection-margin 0.02

# MM-SafetyBench image-conditioned quick eval
python main.py eval \
  --model-path /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct \
  --benchmark mm-safetybench \
  --benchmark-path ./data/mm-safetybench \
  --max-samples 50 \
  --output results/mm_safetybench_image_base_50.json

# JailBreakV-28K image-conditioned quick eval
python main.py eval \
  --model-path output/checkpoint-100 \
  --benchmark jailbreakv \
  --benchmark-path ./data/jailbreakv-28k \
  --max-samples 50 \
  --output results/jailbreakv_image_ckpt100_50.json

# Alignment helper tests
python -m unittest tests.test_lora_alignment -v

# Generate ablation run/contrast tables
python scripts/build_ablation_tables.py

# Generate the human-readable research conclusion table
python scripts/build_research_conclusion_table.py

# Generate the final human-readable main comparison table
python scripts/build_final_main_table.py

# Export a held-out MMStar split that excludes MMStar-derived training rows
python scripts/run_mmstar_heldout_eval.py \
  --train-dataset-path results/orchestration/train_dataset.jsonl \
  --heldout-path results/mmstar_preventative_heldout.tsv \
  --audit-output results/mmstar_preventative_heldout_audit.json

# Evaluate a checkpoint on the held-out MMStar split with GPU preflight
python scripts/run_mmstar_heldout_eval.py \
  --train-dataset-path results/orchestration/train_dataset.jsonl \
  --model-path output/planned_harmful_only_preventative_steering_r1/checkpoint-300 \
  --heldout-path results/mmstar_preventative_heldout.tsv \
  --audit-output results/mmstar_preventative_heldout_audit.json \
  --output results/mmstar_preventative_ckpt300_heldout.json \
  --wait-gpu

Records Maintenance

records/experiment_log.md is the append-only summary of evaluated runs and what they mean.
records/change_log.md is the append-only summary of project-level milestones that change interpretation of results.
records/templates/ contains copy-ready entry formats for new experiments and changes.
When a new result materially changes the current mainline status, update both the relevant record and this README in the same work session.
