waterdrop26651/MMSteer

MM-SafetyResearch

Language:

  • English: README.md
  • 中文: README_zh.md

Status-oriented workspace for multimodal safety alignment research on Qwen2.5-VL-class models.

Problem

Vision-language models can follow harmful requests when the attack is image-conditioned or phrased differently from text-only safety training data. This project tries to transfer safety behavior learned from text supervision into multimodal interactions without claiming that text-only alignment is enough by itself.

Intended Method

The target method is:

  1. Build text-side safe vs. unsafe representations and extract a safety direction.
  2. Construct paired multimodal training records from MM-SafetyBench-style data.
  3. Fine-tune a VLM with native LoRA SFT plus an explicit cross-modal alignment loss so multimodal hidden-state differences track the text-side safety direction.
  4. Evaluate on image-conditioned attacks from MM-SafetyBench and JailBreakV-28K.
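Steps 1 and 3 above can be illustrated with a minimal sketch. It assumes the safety direction is a normalized difference of mean hidden states and that the alignment term is a cosine distance; the README does not state either choice, and every function name here is hypothetical rather than taken from src/lora_finetune.py.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in vector]


def safety_direction(safe_states, unsafe_states):
    """Assumed form: unit vector from the unsafe mean to the safe mean."""
    safe_mean = mean_vector(safe_states)
    unsafe_mean = mean_vector(unsafe_states)
    return normalize([s - u for s, u in zip(safe_mean, unsafe_mean)])


def alignment_loss(mm_safe_state, mm_unsafe_state, direction):
    """Assumed form: 1 - cosine between the multimodal difference and the direction."""
    diff = normalize([s - u for s, u in zip(mm_safe_state, mm_unsafe_state)])
    cosine = sum(d * r for d, r in zip(diff, direction))
    return 1.0 - cosine


# Toy 2-dimensional hidden states (illustrative values only).
text_safe = [[1.0, 0.0], [3.0, 0.0]]
text_unsafe = [[0.0, 0.0], [0.0, 0.0]]
direction = safety_direction(text_safe, text_unsafe)

# A multimodal pair whose difference points along the direction has zero loss;
# a pair whose difference is orthogonal to it has loss 1.
aligned = alignment_loss([2.0, 1.0], [1.0, 1.0], direction)
orthogonal = alignment_loss([1.0, 2.0], [1.0, 1.0], direction)
```

In a real training loop the states would be model activations and the loss would be added to the LM loss; the toy vectors only show the geometry.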

Method Summary

The project now distinguishes clearly between the original design target and the current best-performing recipe. The original target was explicit safety-direction transfer: extract a text-side safe/unsafe direction, then align multimodal hidden-state differences to that direction during LoRA finetuning. In practice, the direct harmful-only + explicit alignment route often improved safety by pushing the model into over-refusal. The current mainline therefore uses harmful-only multimodal LoRA finetuning with preventative steering: a small set of benign probes is used only as a behavioral anchor, while an additional over-refusal penalty suppresses drift toward blanket refusal. The resulting objective keeps the safety gains of harmful-only training while preserving general multimodal capability much better than the pure alignment route.
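One way to read the objective described above is as an LM loss plus two weighted regularizers. The sketch below is an assumption about the shape of that objective, not the repository's implementation: the hinge on a refusal projection and the names `anchor_drift` and `refusal_projection` are hypothetical, and only the three default values come from the mainline settings recorded in this README.

```python
def preventative_objective(
    lm_loss,
    anchor_drift,
    refusal_projection,
    anchor_weight=0.05,
    overrefusal_weight=0.10,
    projection_margin=0.02,
):
    """Assumed combination of LM loss, benign-anchor term, and over-refusal hinge.

    anchor_drift: hypothetical scalar measuring how far behavior on benign
        probes has moved from the base model.
    refusal_projection: hypothetical scalar projection of benign-probe
        activations onto a refusal direction.
    """
    # Penalize only the part of the projection that exceeds the margin.
    overrefusal_penalty = max(0.0, refusal_projection - projection_margin)
    return (
        lm_loss
        + anchor_weight * anchor_drift
        + overrefusal_weight * overrefusal_penalty
    )


# Projection above the margin: the hinge contributes to the total.
drifting = preventative_objective(lm_loss=1.0, anchor_drift=0.2, refusal_projection=0.52)

# Projection inside the margin: only the LM and anchor terms remain.
stable = preventative_objective(lm_loss=1.0, anchor_drift=0.2, refusal_projection=0.01)
```

The hinge form is chosen here because a margin parameter suggests that small projections should go unpenalized; the actual loss may differ.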

What Is Implemented Now

  • Project-local dataset layout is in place under data/mm-safetybench and data/jailbreakv-28k.
  • Native Qwen2.5-VL LoRA fine-tuning is implemented in src/lora_finetune.py and exposed through main.py lora.
  • Image-conditioned evaluation is implemented through main.py eval and the current collator/evaluator path.
  • The codebase now includes an explicit alignment-training path:
    • alignment config fields in LoRAConfig
    • per-record alignment payload derivation from safe_text, unsafe_text, multimodal_text, and image_path
    • combined LM loss + pair/global alignment loss in the training loop
    • helper coverage in tests/test_lora_alignment.py
  • The first explicit alignment-enabled training run has completed successfully:
    • output dir: output/alignment_run_20260316_r1c
    • one epoch on 3360 records
    • global_step=210
    • final_loss=0.004323
    • intermediate checkpoints through checkpoint-200
  • The current mainline checkpoint is the harmful-only preventative-steering run:
    • output dir: output/planned_harmful_only_preventative_steering_r1
    • mainline checkpoint: checkpoint-300
    • training mix:
      • harmful_compliance=3360
      • safe_helpful=0
    • preventative steering:
      • preventative_anchor_weight=0.05
      • preventative_overrefusal_weight=0.10
      • preventative_projection_margin=0.02
  • A behavior-mix side-reference run is kept for comparison only:
    • output dir: output/behavior_mix_mmstar512_r1
    • side-reference checkpoint: checkpoint-150
    • use case: contamination audit and mixed-data comparison, not the main project claim
  • A newer-base replication on local Qwen3.5-9B is also now recorded:
    • output dir: output/qwen3.5_preventative_steering_r2_fixedlogic
    • evaluated checkpoint: checkpoint-300
    • same harmful-only preventative recipe:
      • preventative_anchor_weight=0.05
      • preventative_overrefusal_weight=0.10
      • preventative_projection_margin=0.02
    • caveat:
      • its first standalone MMBench artifact was invalid because the evaluator left thinking enabled and used right-padding in batched generation
      • the repaired artifact is results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json and supersedes results/qwen3.5_r2_ckpt300_mmbench_full.json
  • MMStar overlap auditing and an independent held-out capability path are now implemented:
    • overlap audit artifacts: results/ablation/mmstar_overlap_audit.csv and results/ablation/mmstar_overlap_audit.md
    • held-out split builder: src/mmstar_heldout.py
    • independent held-out eval entrypoint: scripts/run_mmstar_heldout_eval.py
  • Multi-agent orchestration support exists for dataset prep, script generation, preflight checks, and lightweight readiness artifacts.
  • A dedicated mainline technical summary is available at docs/preventative_steering_mainline.md.
  • A one-page project briefing is available at docs/project_briefing_onepage.md.
  • A final handoff-oriented technical scheme is available at docs/final_technical_scheme.md.
  • A paper-style abstract draft is available at docs/paper_style_abstract.md.

Verified Quantitative Results

Quick image-conditioned evaluations already present in this workspace:

| Benchmark | Samples | Model | ASR | Safety Rate | Artifact |
| --- | --- | --- | --- | --- | --- |
| MM-SafetyBench | 50 | base model | 0.92 | 0.08 | results/mm_safetybench_image_base_50.json |
| MM-SafetyBench | 50 | output/checkpoint-100 | 0.00 | 1.00 | results/mm_safetybench_image_ckpt100_50.json |
| MM-SafetyBench | 50 | output/alignment_run_20260316_r1c | 0.00 | 1.00 | results/mm_safetybench_image_alignment_r1c_50.json |
| JailBreakV-28K | 50 | base model | 0.40 | 0.60 | results/jailbreakv_image_base_50.json |
| JailBreakV-28K | 50 | output/checkpoint-100 | 0.00 | 1.00 | results/jailbreakv_image_ckpt100_50.json |
| JailBreakV-28K | 50 | output/alignment_run_20260316_r1c | 0.00 | 1.00 | results/jailbreakv_image_alignment_r1c_50.json |
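Every row in the quick-evaluation results above is consistent with safety rate = 1 − ASR. The following sketch shows that arithmetic under the assumption that each sample receives a single boolean "attack succeeded" judgment; the function name is hypothetical and the evaluator's actual judging logic is not shown in this README.

```python
def attack_success_rate(attack_succeeded):
    """Fraction of samples on which the attack was judged successful."""
    if not attack_succeeded:
        raise ValueError("need at least one judged sample")
    return sum(1 for flag in attack_succeeded if flag) / len(attack_succeeded)


# 46 successes out of 50 is the count that reproduces the base-model
# MM-SafetyBench figure of 0.92 on the 50-sample quick slice.
flags = [True] * 46 + [False] * 4
asr = attack_success_rate(flags)
safety_rate = 1.0 - asr
```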

Capability-side check now completed for the explicit alignment checkpoint:

| Benchmark | Samples | Model | Main Metric | Value | Artifact |
| --- | --- | --- | --- | --- | --- |
| MMStar | 1500 | output/alignment_run_20260316_r1c | Accuracy | 0.286 | results/mmstar_alignment_r1c_1500_postfix_bs1_20260317_073600.json |

Mainline comparison: base model vs harmful-only preventative steering:

| Benchmark | Samples | Base Model | Preventative ckpt300 | Interpretation |
| --- | --- | --- | --- | --- |
| MM-SafetyBench ASR | 972 | 0.920 on 50-sample quick slice | 0.000 | Measured multimodal attack success drops to zero on the larger loaded slice. |
| JailBreakV ASR | 1000 | 0.400 on 50-sample quick slice | 0.000 | Large-slice jailbreak success also drops to zero. |
| MMBench Accuracy | 4329 | 0.879 | 0.878 | General multimodal capability remains effectively flat. |
| Held-out MMStar Accuracy | 980 | 0.625510 | 0.627551 | Capability stays near base on a train/eval-disjoint slice. |

Cross-model replication on local LLaVA-1.5-7B: base model vs harmful-only preventative steering checkpoint-300:

| Benchmark | Samples | LLaVA Base | LLaVA ckpt300 | Delta | Interpretation |
| --- | --- | --- | --- | --- | --- |
| MM-SafetyBench ASR | 972 | 0.996914 | 0.000000 | -0.996914 | The preventative-steering checkpoint removes all measured image-conditioned attack success on the larger slice. |
| JailBreakV ASR | 1000 | 0.822000 | 0.000000 | -0.822000 | Large-slice jailbreak success also drops to zero for the tuned LLaVA checkpoint. |
| MMBench Accuracy | 4329 | 0.733426 | 0.732502 | -0.000924 | General multimodal capability stays effectively flat relative to the same LLaVA base model. |
| Held-out MMStar Accuracy | 980 | 0.379592 | 0.373469 | -0.006122 | Held-out capability drops only slightly relative to the same LLaVA base model. |

Artifacts:

  • results/llava_base_mmsafety972.json
  • results/llava_ckpt300_mmsafety972.json
  • results/llava_base_jailbreakv1000.json
  • results/llava_ckpt300_jailbreakv1000.json
  • results/llava_base_mmbench_full.json
  • results/llava_ckpt300_mmbench_full.json
  • results/llava_base_mmstar_heldout980.json
  • results/llava_ckpt300_mmstar_heldout980.json
  • results/ablation/llava_base_vs_ckpt300_table.md

Newer-base replication on local Qwen3.5-9B with repaired MMBench evaluation:

| Benchmark | Samples | Model | Main Metric | Value | Artifact |
| --- | --- | --- | --- | --- | --- |
| MM-SafetyBench | 972 | output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300 | ASR | 0.000000 | results/qwen3.5_r2_ckpt300_mmsafety972.json |
| JailBreakV | 1000 | output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300 | ASR | 0.000000 | results/qwen3.5_r2_ckpt300_jailbreakv1000.json |
| MMStar held-out | 980 | output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300 | Accuracy | 0.667347 | results/qwen3.5_r2_ckpt300_mmstar_heldout980.json |
| MMBench | 4329 | output/qwen3.5_preventative_steering_r2_fixedlogic/checkpoint-300 | Accuracy | 0.852391 | results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json |

Interpretation update for the Qwen3.5 replication:

  • the repaired MMBench artifact shows that the earlier Qwen3.5 0.0231 result was an evaluator bug, not a model collapse
  • the superseded artifact was results/qwen3.5_r2_ckpt300_mmbench_full.json
  • the valid replacement is results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json
  • this gives the repository a third positive harmful-only preventative result family:
    • Qwen2.5-VL mainline
    • LLaVA-1.5-7B replication
    • Qwen3.5-9B replication
  • strict within-model comparison is now complete against the same local Qwen3.5 base: held-out MMStar improves by 0.012245, MMBench changes by -0.023100, and ASR falls to 0.000000 on both safety benchmarks (from 0.727366 on MM-SafetyBench972 and from 0.266000 on JailBreakV1000)

Strict within-model comparison on local Qwen3.5-9B:

| Benchmark | Qwen3.5 Base | Qwen3.5 ckpt300 | Delta | Interpretation |
| --- | --- | --- | --- | --- |
| MM-SafetyBench ASR | 0.727366 | 0.000000 | -0.727366 | Preventative steering removes the measured image-conditioned attack success on the larger MM-SafetyBench slice. |
| JailBreakV ASR | 0.266000 | 0.000000 | -0.266000 | Large-slice jailbreak success also drops to zero for the tuned Qwen3.5 checkpoint. |
| MMBench Accuracy | 0.875491 | 0.852391 | -0.023100 | General multimodal capability stays high, although the tuned checkpoint is lower than the same Qwen3.5 base on full MMBench. |
| Held-out MMStar Accuracy | 0.655102 | 0.667347 | 0.012245 | Held-out MMStar improves slightly relative to the same Qwen3.5 base model. |

Artifacts:

  • results/qwen3.5_base_mmsafety972.json
  • results/qwen3.5_r2_ckpt300_mmsafety972.json
  • results/qwen3.5_base_jailbreakv1000.json
  • results/qwen3.5_r2_ckpt300_jailbreakv1000.json
  • results/qwen3.5_base_mmbench_full.json
  • results/qwen3.5_r2_ckpt300_mmbench_full_fixedeval.json
  • results/qwen3.5_base_mmstar_heldout980.json
  • results/qwen3.5_r2_ckpt300_mmstar_heldout980.json
  • results/ablation/qwen3.5_base_vs_ckpt300_table.md

Historical refusal-aware MMStar comparison on the same 1500-sample slice:

| Model | Accuracy | Refusal Rate | Non-Refusal Accuracy | Artifact |
| --- | --- | --- | --- | --- |
| base model | 0.616 | 0.0000 | 0.6160 | results/mmstar_base_1500_refusalaware_20260317.json |
| output/checkpoint-100 | 0.2913 | 0.9773 | 0.4706 | results/mmstar_ckpt100_1500_refusalaware_parallel_20260317.json |
| output/alignment_run_20260316_r1c | 0.2860 | 1.0000 | 0.0000 | results/mmstar_alignment_r1c_1500_postfix_bs1_20260317_073600.json |
| output/behavior_mix_mmstar512_r1/checkpoint-150 | 0.7180 | 0.0000 | 0.7180 | results/mmstar_behavior_mix_mmstar512_r1_ckpt150_1500.json |
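The three columns in the comparison above can be derived from per-sample predictions as in the sketch below. It is an illustration only: the keyword list, the `Prediction` record, and the function names are hypothetical, and the repository's evaluator may detect refusals differently. The metric keys `refusal_rate` and `accuracy_excluding_refusals` follow the names used in this README.

```python
from dataclasses import dataclass

# Hypothetical refusal markers; a real evaluator would likely use a richer detector.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry")


@dataclass
class Prediction:
    response: str       # raw model output
    parsed_choice: str  # option letter extracted by the answer parser
    gold_choice: str    # reference option letter


def is_refusal(response):
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_aware_metrics(predictions):
    total = len(predictions)
    if total == 0:
        raise ValueError("need at least one prediction")
    answered = [p for p in predictions if not is_refusal(p.response)]
    correct_all = sum(p.parsed_choice == p.gold_choice for p in predictions)
    correct_answered = sum(p.parsed_choice == p.gold_choice for p in answered)
    return {
        # Raw accuracy still credits refusals whose parsed letter happens to match.
        "accuracy": correct_all / total,
        "refusal_rate": (total - len(answered)) / total,
        # Reported as 0.0 when every sample is a refusal.
        "accuracy_excluding_refusals": (
            correct_answered / len(answered) if answered else 0.0
        ),
    }


predictions = [
    Prediction("I'm sorry, I can't help with that.", "A", "A"),
    Prediction("I'm sorry, I can't help with that.", "A", "B"),
    Prediction("I cannot assist with this request.", "A", "A"),
    Prediction("The answer is C.", "C", "C"),
    Prediction("The answer is D.", "D", "B"),
]
metrics = refusal_aware_metrics(predictions)
```

In the toy data, two of the three refusals are scored as correct only because the parser fell back to A, which is the same effect that inflates raw accuracy for a checkpoint that refuses everything.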

Behavior-mix contamination audit and held-out retest:

| Model | Original MMStar Slice | Held-Out MMStar Slice | Delta | Interpretation |
| --- | --- | --- | --- | --- |
| base model | 0.6160 on 1500 | 0.625510 on 980 | +0.009510 | No MMStar training overlap; held-out score mainly reflects slice difficulty. |
| output/behavior_mix_mmstar512_r1/checkpoint-150 | 0.7180 on 1500 | 0.631633 on 980 | -0.086367 | Original score is contamination-sensitive because 512 MMStar-derived safe_helpful pairs overlap with the benchmark source file. |
| output/planned_harmful_only_preventative_steering_r1/checkpoint-300 | 0.616667 on 1500 | 0.627551 on 980 | +0.010884 | Harmful-only training path stays close to base on held-out MMStar without MMStar-answer overlap. |
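The held-out retest above depends on removing benchmark rows that also appear in the training mix. A minimal sketch of that filtering step follows, assuming each training record that was derived from MMStar carries the source row's index; the field names `index` and `mmstar_index` and the function name are hypothetical and may not match src/mmstar_heldout.py.

```python
def build_heldout_split(benchmark_rows, train_records,
                        row_key="index", source_key="mmstar_index"):
    """Drop benchmark rows whose index is referenced by any training record."""
    used = {
        record[source_key]
        for record in train_records
        if record.get(source_key) is not None
    }
    heldout = [row for row in benchmark_rows if row[row_key] not in used]
    audit = {
        "benchmark_rows": len(benchmark_rows),
        "overlap_rows": len(benchmark_rows) - len(heldout),
        "heldout_rows": len(heldout),
    }
    return heldout, audit


# Toy benchmark with five rows; two are reused by the training mix.
benchmark = [{"index": i, "question": f"q{i}"} for i in range(5)]
train = [
    {"mmstar_index": 1},
    {"mmstar_index": 3},
    {"category": "harmful_compliance"},  # not derived from the benchmark
]
heldout, audit = build_heldout_split(benchmark, train)
```

Matching on a stable row identifier is the simplest option; matching on question text or image hash would catch reuse that loses the index, at the cost of more false positives.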

Larger-scale safety checks for the current mainline checkpoint:

| Benchmark | Samples | Model | ASR | Safety Rate | Accuracy | Artifact |
| --- | --- | --- | --- | --- | --- | --- |
| MM-SafetyBench | 972 | output/planned_harmful_only_preventative_steering_r1/checkpoint-300 | 0.00 | 1.00 | n/a | results/preventative_steering_r1_checkpoint300_mmsafety972.json |
| JailBreakV-28K | 1000 | output/planned_harmful_only_preventative_steering_r1/checkpoint-300 | 0.00 | 1.00 | n/a | results/preventative_steering_r1_checkpoint300_jailbreakv1000.json |
| MMBench | 4329 | output/planned_harmful_only_preventative_steering_r1/checkpoint-300 | n/a | n/a | 0.8776 | results/mmbench_checkpoint300_full.json |

Interpretation:

  • These quick checks show a large improvement versus the base model on the sampled image-conditioned attacks.
  • Earlier 50-sample wins did not prove a good safe-helpful tradeoff. That caveat still applies to the older checkpoints.
  • output/checkpoint-100 predates the current explicit alignment-training path. Its measured gains should be treated as results from the earlier native LoRA/SFT path, not as proof of the newer alignment-enabled path.
  • The first explicit alignment-enabled checkpoint output/alignment_run_20260316_r1c matches the earlier checkpoint-100 quick-check outcome on both sampled benchmarks.
  • The next question is not whether the new checkpoint can refuse these 50 attacks, but whether it can do so without unnecessary over-refusal and whether it remains strong on larger evaluations.
  • The new 1500-sample MMStar run answers part of that question unfavorably: the checkpoint completes the benchmark but still returns refusal-style text on effectively all samples, and the parser extracts A for all 1500 predictions.
  • The MMStar evaluator now includes refusal-aware metrics; replaying the saved 1500-sample artifact yields refusal_rate=1.0 and accuracy_excluding_refusals=0.0.
  • As a result, the recorded 0.286 MMStar accuracy should be read as evidence of persistent over-refusal plus answer-parsing bias, not as evidence of preserved visual reasoning capability.
  • The base-model comparison removes the remaining ambiguity: MMStar is a valid capability signal here, because the untuned model reaches 0.616 accuracy with 0.0 refusal rate on the same 1500-sample slice.
  • output/checkpoint-100 sits between the two extremes. It is not fully collapsed like output/alignment_run_20260316_r1c, but its 0.9773 refusal rate shows that its apparent MMStar score is still mostly a side effect of refusal behavior.
  • The current mainline claim is now the harmful-only preventative route, not the mixed-data route:
    • output/planned_harmful_only_preventative_steering_r1/checkpoint-300 reaches 0.0 measured ASR on MM-SafetyBench972 and JailBreakV1000
    • it stays effectively flat on MMBench (0.879 -> 0.878)
    • it stays near base on held-out MMStar980 (0.625510 -> 0.627551)
  • This supports the current project hypothesis: pure harmful training can improve multimodal safety without obvious general-capability collapse, provided preventative steering is used to suppress over-refusal.
  • The same pattern now appears on a second local VLM family (/mnt/data/llms/mllm/llava-1.5-7b-hf): relative to the LLaVA base model, checkpoint-300 keeps MMStar held-out and MMBench nearly flat while taking MM-SafetyBench972 and JailBreakV1000 ASR to 0.0.
  • output/behavior_mix_mmstar512_r1/checkpoint-150 is kept only as a side reference:
    • it also performs well on safety
    • but its raw MMStar1500 headline is contamination-sensitive and should not anchor the main project narrative
  • For future capability claims, the recommended path is to export and report the MMStar held-out split rather than reusing the raw MMStar slice directly.
  • Remaining caveat: JailBreakV is still a 1000-sample slice rather than a full 28K sweep, and no seed-variance study has been run yet.
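Because the safety results above are measured on finite slices, a measured ASR of 0.0 still leaves room for a small nonzero true rate. The sketch below computes the exact one-sided upper confidence bound for zero observed successes, assuming samples are independent and the success judgments are correct; both assumptions are simplifications, and the function name is hypothetical.

```python
def zero_event_upper_bound(n_samples, confidence=0.95):
    """One-sided upper bound on the true rate when 0 of n_samples succeeded.

    Solves (1 - p) ** n_samples = 1 - confidence for p.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_samples)


# Slice sizes taken from the larger-scale safety checks.
bound_mmsafety = zero_event_upper_bound(972)
bound_jailbreakv = zero_event_upper_bound(1000)
```

Both bounds come out at roughly 0.003, close to the "rule of three" approximation 3/n, so the slices support a claim of ASR below about 0.3% rather than exactly zero.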

Completion Estimate

Estimated project completion toward a defensible harmful-only research baseline: about 80%.

Completed:

  • data relocation and local dataset conventions
  • native training/eval loop
  • image-conditioned evaluation path
  • explicit alignment-capable training code path
  • first explicit alignment-enabled training run
  • initial quick-check results
  • first explicit alignment-enabled checkpoint evaluated on the same quick image-conditioned protocol
  • repaired multi-GPU MMStar evaluation path and a completed 1500-sample MMStar run for output/alignment_run_20260316_r1c
  • three-way MMStar comparison across base model, checkpoint-100, and output/alignment_run_20260316_r1c
  • behavior-vector-inspired preventative-steering stack
  • completed harmful-only preventative checkpoint with larger-slice safety and capability validation
  • MMStar overlap audit for the behavior-mix recipe
  • independent held-out MMStar split/export/eval path
  • a mainline checkpoint that has passed larger-scale safety validation and a contamination-aware capability retest
  • cross-model LLaVA validation showing the same safety gain with near-flat capability relative to the corresponding LLaVA base model

Still missing:

  • repeated-seed stability checks for the harmful-only preventative route
  • broader preventative-steering ablations beyond the current anchor/projection sweep
  • larger JailBreakV coverage beyond the current 1000-sample slice
  • write-up-quality documentation of the current mainline settings

Risks

  • Generalization risk: the current mainline checkpoint is strong on the measured slices, but not yet stress-tested across repeated seeds or full-scale JailBreakV.
  • Cross-model scope risk: the current LLaVA result is strong on the measured slices, but it has only been validated for one LLaVA-family checkpoint and one training seed.
  • Ablation risk: the preventative recipe is now the mainline, but anchor and over-refusal regularization still need cleaner causal isolation.
  • Contamination risk: raw MMStar scores can be overstated if the training mix reuses MMStar-derived rows from the same source file.
  • Helpfulness-gap risk: MMStar is a useful capability proxy, but it is not a complete safe-helpful benchmark.
  • Capability-eval parsing risk still applies to older artifacts and should remain part of every MMStar interpretation.
  • Research scope risk: method intent is still broader than the current empirical coverage.

Next Steps

  1. Run repeated-seed reruns for output/planned_harmful_only_preventative_steering_r1/checkpoint-300.
  2. Extend the preventative-steering ablation surface:
    • anchor_weight
    • overrefusal_weight
    • projection_margin
  3. Extend JailBreakV evaluation beyond the current 1000-sample slice.
  4. Add a more explicit helpfulness / over-refusal benchmark beyond MMStar alone.
  5. Use the held-out MMStar path for all future capability reporting and keep raw MMStar results as secondary evidence only.
  6. Refresh this README when a new harmful-only checkpoint beats the current mainline on both safety and train/eval-disjoint capability.
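The ablation surface in step 2 could be enumerated with a small script such as the one below. The command-line flags are the ones shown for the preventative-steering training path under Common Commands; the grid values other than the mainline setting (0.05, 0.10, 0.02), the output-directory naming scheme, and the function name are hypothetical.

```python
import itertools
import shlex

# Hypothetical grid; only 0.05 / 0.10 / 0.02 is the recorded mainline setting.
ANCHOR_WEIGHTS = [0.025, 0.05, 0.10]
OVERREFUSAL_WEIGHTS = [0.05, 0.10, 0.20]
PROJECTION_MARGINS = [0.02]

DATASET = "results/orchestration/train_dataset.jsonl"
MODEL = "/mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct"
MMSTAR = "/mnt/data/wzh/MMStar.tsv"


def ablation_commands():
    """Return one shell command string per grid point."""
    commands = []
    grid = itertools.product(ANCHOR_WEIGHTS, OVERREFUSAL_WEIGHTS, PROJECTION_MARGINS)
    for anchor, overrefusal, margin in grid:
        # Hypothetical naming scheme for per-run output directories.
        output_dir = f"output/ablation_a{anchor:g}_o{overrefusal:g}_m{margin:g}"
        argv = [
            "python", "main.py", "lora",
            "--dataset", DATASET,
            "--output-dir", output_dir,
            "--model", MODEL,
            "--preventative-steering",
            "--probe-harmful-records", DATASET,
            "--probe-mmstar-path", MMSTAR,
            "--preventative-anchor-weight", f"{anchor:g}",
            "--preventative-overrefusal-weight", f"{overrefusal:g}",
            "--preventative-projection-margin", f"{margin:g}",
        ]
        commands.append(shlex.join(argv))
    return commands


commands = ablation_commands()
for command in commands:
    print(command)
```

The script only prints commands, so the grid can be reviewed before any GPU time is spent.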

Key Directories

  • data/: project-local datasets
  • src/: training, alignment, evaluation, and utility code
  • scripts/: orchestration and helper scripts
  • configs/: configuration defaults and status hints
  • results/: evaluation outputs and orchestration artifacts
  • output/: training outputs and checkpoints
  • tests/: targeted regression and helper tests
  • records/: experiment log, change log, and record templates
  • docs/superpowers/plans/: implementation plans used by agentic workers

Multi-Agent Ownership

  • pm_agent: roadmap decomposition, stage gates, acceptance
  • data_engineer_agent: dataset loading, pair construction, dataloader readiness
  • safety_research_agent: safety direction extraction and alignment-loss research
  • model_engineer_agent: native LoRA training and checkpoint generation
  • eval_engineer_agent: benchmark execution and ASR/safety-rate reporting
  • experiment_historian_agent: maintains records/experiment_log.md, records/change_log.md, and refreshes this README when the mainline project status changes

Common Commands

# Orchestrator status
python scripts/agent_orchestrator.py status

# GPU preflight before heavy work
python scripts/agent_orchestrator.py preflight

# Native LoRA/SFT training
python main.py lora \
  --dataset results/orchestration/train_dataset.jsonl \
  --output-dir output \
  --model /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct

# Alignment-enabled training path
python main.py lora \
  --dataset results/orchestration/train_dataset.jsonl \
  --output-dir output/alignment-run \
  --model /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct \
  --alignment-enabled

# Preventative-steering mainline training path
python main.py lora \
  --dataset results/orchestration/train_dataset.jsonl \
  --output-dir output/preventative-mainline \
  --model /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct \
  --preventative-steering \
  --probe-harmful-records results/orchestration/train_dataset.jsonl \
  --probe-mmstar-path /mnt/data/wzh/MMStar.tsv \
  --preventative-anchor-weight 0.05 \
  --preventative-overrefusal-weight 0.10 \
  --preventative-projection-margin 0.02

# MM-SafetyBench image-conditioned quick eval
python main.py eval \
  --model-path /mnt/data/llms/mllm/Qwen2.5-VL-7B-Instruct \
  --benchmark mm-safetybench \
  --benchmark-path ./data/mm-safetybench \
  --max-samples 50 \
  --output results/mm_safetybench_image_base_50.json

# JailBreakV-28K image-conditioned quick eval
python main.py eval \
  --model-path output/checkpoint-100 \
  --benchmark jailbreakv \
  --benchmark-path ./data/jailbreakv-28k \
  --max-samples 50 \
  --output results/jailbreakv_image_ckpt100_50.json

# Alignment helper tests
python -m unittest tests.test_lora_alignment -v

# Generate ablation run/contrast tables
python scripts/build_ablation_tables.py

# Generate the human-readable research conclusion table
python scripts/build_research_conclusion_table.py

# Generate the final human-readable main comparison table
python scripts/build_final_main_table.py

# Export a held-out MMStar split that excludes MMStar-derived training rows
python scripts/run_mmstar_heldout_eval.py \
  --train-dataset-path results/orchestration/train_dataset.jsonl \
  --heldout-path results/mmstar_preventative_heldout.tsv \
  --audit-output results/mmstar_preventative_heldout_audit.json

# Evaluate a checkpoint on the held-out MMStar split with GPU preflight
python scripts/run_mmstar_heldout_eval.py \
  --train-dataset-path results/orchestration/train_dataset.jsonl \
  --model-path output/planned_harmful_only_preventative_steering_r1/checkpoint-300 \
  --heldout-path results/mmstar_preventative_heldout.tsv \
  --audit-output results/mmstar_preventative_heldout_audit.json \
  --output results/mmstar_preventative_ckpt300_heldout.json \
  --wait-gpu

Records Maintenance

  • records/experiment_log.md is the append-only summary of evaluated runs and what they mean.
  • records/change_log.md is the append-only summary of project-level milestones that change interpretation of results.
  • records/templates/ contains copy-ready entry formats for new experiments and changes.
  • When a new result materially changes the current mainline status, update both the relevant record and this README in the same work session.
