This repository contains the code and resources for our proposed method of performing adversarial attacks on black-box Neural Ranking Models (NRMs). Our approach manipulates sentence-level embeddings to enhance the ranking positions of target documents by aligning them with the query context while maintaining semantic integrity. This results in coherent adversarial documents that seamlessly incorporate manipulated content and remain undetectable under both automatic and human evaluations.
EMPRA/
├── code/ # Main codebase
│ ├── attack/ # EMPRA attack implementation
│ ├── evaluation/ # Attack Performance and Content Fidelity Evaluation metrics and scripts
│ ├── preprocessing/ # Data preprocessing utilities
│ ├── scripts/ # Executable CLI scripts
│ └── utils/ # Utility functions
├── human_evaluation/ # Human evaluation study materials
│ ├── guideline/ # Annotation guidelines
│ ├── annotations/ # Human annotations
│ ├── results/ # Analysis results
│ ├── analyze_annotations_by_method.py # Annotation analysis script
│ └── README.md # Human evaluation documentation
├── run_empra_pipeline.sh # End-to-end bash script for EMPRA pipeline
└── README.md # Project documentation
Note: Step-by-step usage instructions are provided in this README below.
- `code/attack/`: Core EMPRA attack implementation, including embedding manipulation, sentence generation, and attack logic
- `code/scripts/`: Command-line scripts for running the complete EMPRA pipeline
- `code/evaluation/`: Evaluation scripts for attack performance, content fidelity, and linguistic acceptability
- `code/preprocessing/`: Utilities for data preparation and target document selection
- `run_empra_pipeline.sh`: End-to-end bash script that automates adversarial text generation and adversarial document construction
- `human_evaluation/`: Materials and results from the human evaluation study
To assess the attack performance of our surrogate-agnostic attack method, EMPRA, we compare it with the best state-of-the-art baselines from each category: Query+, PRADA (word-level), PAT (trigger-level), Brittle-BERT (trigger-level), IDEM, LLM-Prompt (GPT-4), and AttackChain (listed as AttChain in the tables). The table below compares attack performance in terms of attack success rate (ASR), boosted top-10 (%r≤10), boosted top-50 (%r≤50), average boost rank (Boost), perplexity (PPL), and readability across target documents randomly selected from positions 51-100 (Easy-5). The comparison covers the best-performing surrogate model, MS(best), and the best-performing generic model, MG(best), targeting the victim NRM cross-encoder/ms-marco-MiniLM-L-12-v2.
| Model | Method | ASR | %r≤10 | %r≤50 | Boost | PPL | Readability |
|---|---|---|---|---|---|---|---|
| - | Original | - | - | - | - | 37.3 | 9.8 |
| - | Query+ | 100.0 | 86.9 | 99.2 | 70.3 | 45.4 | 9.6 |
| - | LLM-Prompt | 94.1 | 65.0 | 90.1 | 49.9 | 49.0 | 11.0 |
| MS(best) | PRADA | 77.9 | 3.52 | 46.2 | 23.2 | 94.4 | 9.9 |
| MS(best) | Brittle-BERT | 98.7 | 81.3 | 96.7 | 67.3 | 107.9 | 10.7 |
| MS(best) | PAT | 89.6 | 30.6 | 73.8 | 41.9 | 50.9 | 9.9 |
| MS(best) | IDEM | 99.7 | 87.4 | 99.0 | 70.3 | 36.4 | 9.4 |
| MS(best) | AttChain | 99.8 | 78.2 | 98.8 | 67.8 | 37.4 | 9.8 |
| MS(best) | EMPRA | 99.9 | 95.6 | 99.8 | 72.5 | 34.4 | 9.2 |
| MG(best) | PRADA | 71.5 | 1.9 | 37.5 | 19.1 | 91.5 | 9.8 |
| MG(best) | Brittle-BERT | 90.0 | 43.4 | 80.1 | 46.2 | 117.7 | 11.0 |
| MG(best) | PAT | 51.1 | 2.7 | 22.9 | 2.0 | 46.8 | 9.8 |
| MG(best) | IDEM | 98.8 | 65.3 | 93.8 | 61.9 | 37.7 | 9.4 |
| MG(best) | AttChain | 87.6 | 44.2 | 76.6 | 44.5 | 38.7 | 9.8 |
| MG(best) | EMPRA | 99.7 | 74.3 | 97.6 | 66.2 | 36.3 | 9.2 |
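
The rank-based columns above (ASR, %r≤10, %r≤50, Boost) can be illustrated with a minimal sketch. It assumes that ASR is the percentage of target documents whose rank improves, that %r≤k is the percentage ending up in the top k, and that Boost is the mean number of positions gained. These definitions are a reading of the column names, and `attack_metrics` is a hypothetical helper, not the code in `code/evaluation/attack_result_calculator.py`.

```python
from statistics import mean


def attack_metrics(ranks_before, ranks_after):
    """Summarise rank promotion for a set of target documents.

    ranks_before / ranks_after: parallel lists of 1-based ranks of each
    target document before and after the attack (lower is better).
    """
    if len(ranks_before) != len(ranks_after) or not ranks_before:
        raise ValueError("need two non-empty lists of equal length")
    n = len(ranks_before)
    # Positive boost means the document moved up the ranking.
    boosts = [before - after for before, after in zip(ranks_before, ranks_after)]
    return {
        "ASR": 100.0 * sum(b > 0 for b in boosts) / n,
        "%r<=10": 100.0 * sum(r <= 10 for r in ranks_after) / n,
        "%r<=50": 100.0 * sum(r <= 50 for r in ranks_after) / n,
        "Boost": mean(boosts),
    }


# Toy example with four target documents (made-up ranks).
metrics = attack_metrics([91, 75, 60, 55], [1, 12, 60, 70])
print(metrics)
```

Under these assumed definitions, a document that is demoted counts against ASR and lowers the average boost, so Boost can be negative for an unsuccessful attack.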
| Method | Query: "can anyone take prenatal vitamins?" | Ranking Position |
|---|---|---|
| Original | Always let your health care provider know what nutritional supplements you are taking. Prenatal vitamins consist of a variety of vitamins and minerals. During pregnancy, a woman’s daily intake requirements for certain nutrients, such as folic acid (folate), calcium, and iron, will increase. | 91 |
| Query+ | can anyone take prenatal vitamins? Always let your health care provider know what nutritional supplements you are taking. Prenatal vitamins consist of a variety of vitamins and minerals. During pregnancy, a woman’s daily intake requirements for certain nutrients, such as folic acid (folate), calcium, and iron, will increase. | 1 |
| LLM-Prompt (GPT-4) | Always consult your caregiver or medical specialist before starting any nutritional supplements, including those designed for prenatal care. Can individuals not expecting to conceive also consider prenatal vitamins? It's a common inquiry. These vitamins and minerals blends, typically referred to as prenatal vitamins, are critical during gestation. During such a crucial timeline, a female body's daily intake necessities for pivotal nutrients, such as folic acid (more commonly known by its synthetic form, folate), calcium, and iron, will see a notable escalation. | 33 |
| PRADA | Always let your health care purveyor know what nutritional supplements you are took. prenatal vitamins consist of a variety of vitamins and metallurgical. during pregnancy, a womanas daily admitting requirements for certain vitamin, such as folic acid ( folate ), calcium, and iron, will increased. | 49 |
| Brittle-BERT | aanatnat anyone can⋅va taking 167 עogan whether Always let your health care provider know what nutritional supplements you are taking. Prenatal vitamins consist of a variety of vitamins and minerals. During pregnancy, a woman’s daily intake requirements for certain nutrients, such as folic acid (folate), calcium, and iron, will increase. | 1 |
| PAT | no, if anyone could even take preca Always let your health care provider know what nutritional supplements you are taking. Prenatal vitamins consist of a variety of vitamins and minerals. During pregnancy, a woman’s daily intake requirements for certain nutrients, such as folic acid (folate), calcium, and iron, will increase. | 1 |
| IDEM | Children, not pregnant mothers, cannot take prenatal vitamins. Always let your health care provider know what nutritional supplements you are taking. Prenatal vitamins consist of a variety of vitamins and minerals. During pregnancy, a woman’s daily intake requirements for certain nutrients, such as folic acid (folate), calcium, and iron, will increase. | 2 |
| AttChain | Always inform your healthcare provider about the nutritional supplements you are taking. Prenatal vitamins, including folic acid (folate), calcium, and iron, play a crucial role during pregnancy. Can anyone take prenatal vitamins? | 2 |
| EMPRA | During pregnancy, anyone can take a prenatal vitamin (folic acid, iron, and calcium) to increase their daily requirements for these nutrients. Always let your health care provider know what nutritional supplements you are taking. Prenatal vitamins consist of a variety of vitamins and minerals. During pregnancy, a woman’s daily intake requirements for certain nutrients, such as folic acid (folate), calcium, and iron, will increase. | 1 |
🚀 Usage
This section provides step-by-step instructions to reproduce the EMPRA attack pipeline and evaluation results.
Re-rank the top-1000 retrieved BM25 documents for sampled queries using a black-box neural ranking model. This step uses code/preprocessing/rerank.py to re-rank documents. For our experiments, we use cross-encoder/ms-marco-MiniLM-L-12-v2 as the black-box model.
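
As a rough illustration of what this step produces, the sketch below scores each candidate with a pluggable function, sorts by score, and formats the result in the standard six-column TREC run format. `rerank_to_trec` and `overlap_score` are hypothetical names used only here. In practice the scorer would wrap the black-box ranking model, and `code/preprocessing/rerank.py` may differ in its details.

```python
def rerank_to_trec(qid, query, candidates, score_fn, tag="rerank"):
    """Re-rank candidates for one query and format them as TREC run lines.

    candidates: list of (doc_id, doc_text) pairs from first-stage retrieval.
    score_fn:   callable (query, doc_text) -> float; higher is more relevant.
    """
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in candidates]
    # Highest score first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [
        f"{qid} Q0 {doc_id} {rank} {score:.4f} {tag}"
        for rank, (doc_id, score) in enumerate(scored, start=1)
    ]


def overlap_score(query, text):
    # Toy stand-in scorer (assumption): number of query terms found in the text.
    doc_terms = set(text.lower().split())
    return float(sum(term in doc_terms for term in query.lower().split()))


lines = rerank_to_trec(
    "q1",
    "prenatal vitamins",
    [("d1", "iron and calcium"), ("d2", "prenatal vitamins and minerals")],
    overlap_score,
)
for line in lines:
    print(line)
```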
python code/preprocessing/rerank.py \
--model cross-encoder/ms-marco-MiniLM-L-12-v2 \
--collection path/to/MS_MARCO/collection.tsv \
--queries path/to/queries.tsv \
--run path/to/BM25_run.trec \
--res path/to/output/reranked_run.trec

For each query, select one document from each of the five 10-document segments in positions 51-100 (i.e., 51-60, 61-70, etc.) along with the last 5 ranked documents (996-1000), resulting in ten targeted documents per query. This step uses code/preprocessing/target_doc_selector.py.
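
The selection rule can be sketched as follows. This is a minimal illustration that assumes the segments are the consecutive 10-rank windows 51-60 through 91-100 and that one document is drawn uniformly at random from each. `select_targets` is a hypothetical function, not the interface of `target_doc_selector.py`.

```python
import random


def select_targets(ranked_doc_ids, seed=0):
    """Pick ten target documents from one query's re-ranked list.

    ranked_doc_ids: list of doc ids ordered from rank 1 to rank N.
    Returns five documents sampled from ranks 51-100 (one per 10-rank
    window) followed by the five lowest-ranked documents.
    """
    if len(ranked_doc_ids) < 105:
        raise ValueError("need at least 105 ranked documents")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    sampled = []
    for start in range(50, 100, 10):  # 0-based offsets of ranks 51, 61, ..., 91
        window = ranked_doc_ids[start:start + 10]
        sampled.append(rng.choice(window))
    return sampled + ranked_doc_ids[-5:]


ranking = [f"d{r}" for r in range(1, 1001)]  # d1 is rank 1, d1000 is rank 1000
targets = select_targets(ranking)
print(targets)
```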
python code/preprocessing/target_doc_selector.py \
--run path/to/reranked_run.trec \
--output path/to/output/target_documents.tsv

Generate adversarial sentences using the EMPRA attack method. This step uses code/scripts/adversarial_text_generator.py to create perturbed sentences that will be merged into target documents.
python code/scripts/adversarial_text_generator.py \
--data-dir path/to/data/directory \
--collection-dir path/to/collection/directory \
--dataset-name trecdl2020 \
--target-type easy \
--output-dir path/to/output/directory \
--max-iterations 25 \
--epsilon 0.01 \
--alpha 0.1 \
--embedding-batch-size 100 \
--num-workers 3

Parameters:

| Parameter | Description |
|---|---|
| `--data-dir` | Directory containing queries and target documents |
| `--collection-dir` | Directory containing collection.tsv |
| `--dataset-name` | Dataset name (e.g., trecdl2020) |
| `--target-type` | Target document type (easy or hard) |
| `--output-dir` | Output directory for generated adversarial sentences |
| `--max-iterations` | Maximum number of attack iterations (default: 25) |
| `--epsilon` | Epsilon constraint for perturbations (default: 0.01) |
| `--alpha` | Step size for gradient updates (default: 0.1) |
| `--embedding-batch-size` | Batch size for embedding API calls (default: 100) |
| `--num-workers` | Number of parallel workers (default: 3) |
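
To give an intuition for how an iteration cap, a step size, and an epsilon constraint typically interact, here is a generic projected iterative update on a toy embedding. It is only a sketch under stated assumptions: each step follows the negative gradient of the squared distance to the query embedding, and the perturbation is clipped per coordinate to an epsilon box around the original embedding. `perturb_embedding` is a hypothetical function, and the README does not confirm that the EMPRA implementation works this way.

```python
def perturb_embedding(sentence_emb, query_emb, epsilon=0.01, alpha=0.1,
                      max_iterations=25):
    """Shift an embedding towards a query embedding under a box constraint.

    Assumed objective: 0.5 * ||x - q||^2, whose gradient is (x - q).
    Each iteration takes a step of size alpha against that gradient and
    then clips every coordinate to [original - epsilon, original + epsilon].
    """
    original = list(sentence_emb)
    current = list(sentence_emb)
    for _ in range(max_iterations):
        # Gradient step towards the query embedding.
        stepped = [c + alpha * (q - c) for c, q in zip(current, query_emb)]
        # Projection back into the epsilon box around the original embedding.
        projected = [
            min(max(s, o - epsilon), o + epsilon)
            for s, o in zip(stepped, original)
        ]
        if projected == current:  # no further movement is possible
            break
        current = projected
    return current


# Toy 2-dimensional example: only the first coordinate differs from the query.
adv = perturb_embedding([0.0, 0.5], [1.0, 0.5])
print(adv)
```

In this toy run the first coordinate hits the epsilon bound after one step, so the loop stops well before the iteration cap.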
Merge the generated adversarial sentences with original documents and select the best sentence based on coherence and relevance scores. This step uses code/scripts/construct_adversarial_documents.py.
python code/scripts/construct_adversarial_documents.py \
--relevance-model path/to/bert/model \
--model-tag S1 \
--connect-sent-file path/to/generated/adversarial_sentences.tsv \
--target-file path/to/target_documents.tsv \
--query-collection path/to/queries.tsv \
--doc-collection path/to/collection.tsv \
--coh-weight 0.5 \
--rel-weight 0.5 \
--batch-size 32 \
--num-labels 1

Parameters:

| Parameter | Description |
|---|---|
| `--relevance-model` | Path to neural ranking model directory for relevance scoring |
| `--model-tag` | Tag for model (used in output filenames) |
| `--connect-sent-file` | Path to file containing generated connection sentences |
| `--target-file` | Path to target documents file |
| `--query-collection` | Path to query collection TSV file |
| `--doc-collection` | Path to document collection TSV file |
| `--coh-weight` | Weight for coherence score (default: 0.5) |
| `--rel-weight` | Weight for relevance score (default: 0.5) |
| `--batch-size` | Batch size for BERT model inference (default: 32) |
| `--num-labels` | Number of labels for relevance model (1 for regression, 2 for classification) |
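
The role of `--coh-weight` and `--rel-weight` can be pictured with a small sketch that assumes the best sentence is the one maximising a weighted sum of its coherence and relevance scores, with both scores on comparable scales. `pick_best_sentence` is a hypothetical helper, and the actual combination in `construct_adversarial_documents.py` may differ.

```python
def pick_best_sentence(candidates, coh_weight=0.5, rel_weight=0.5):
    """Return the candidate with the highest weighted score.

    candidates: list of (sentence, coherence, relevance) tuples, where the
    two scores are assumed to lie on comparable scales (e.g. both in [0, 1]).
    """
    if not candidates:
        raise ValueError("no candidate sentences given")
    return max(
        candidates,
        key=lambda c: coh_weight * c[1] + rel_weight * c[2],
    )


# Made-up scores for three candidate sentences.
candidates = [("s1", 0.9, 0.2), ("s2", 0.6, 0.7), ("s3", 0.1, 0.8)]
best = pick_best_sentence(candidates)
coherence_only = pick_best_sentence(candidates, coh_weight=1.0, rel_weight=0.0)
print(best[0], coherence_only[0])
```

With equal weights the balanced candidate wins. Shifting all weight to coherence selects the most fluent sentence even though it is the least relevant.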
For convenience, you can run both Step 3 and Step 4 together using the provided bash script run_empra_pipeline.sh. This script automates the complete pipeline from adversarial sentence generation to document construction.
./run_empra_pipeline.sh \
--data-dir path/to/data/directory \
--collection-dir path/to/collection/directory \
--dataset-name trecdl2020 \
--target-type easy \
--output-dir path/to/output/directory \
--relevance-model path/to/bert/model \
--query-collection path/to/queries.tsv \
--doc-collection path/to/collection.tsv \
--max-iterations 25 \
--epsilon 0.01 \
--alpha 0.1 \
--embedding-batch-size 100 \
--model-tag S1 \
--coh-weight 0.5 \
--rel-weight 0.5 \
--batch-size 32 \
--num-labels 1 \
--device auto

Key Features:
- Runs both adversarial sentence generation and document construction in sequence
- Validates all inputs and file paths before execution
- Provides colored logging output for better visibility
- Supports skipping individual steps with `--skip-step1` or `--skip-step2`
- Automatically handles file paths between steps
Required Arguments:
- `--data-dir`: Directory containing queries and target documents
- `--collection-dir`: Directory containing collection.tsv
- `--dataset-name`: Dataset name (e.g., trecdl2020)
- `--target-type`: Target document type (`easy` or `hard`)
- `--output-dir`: Output directory for all results
- `--relevance-model`: Path to neural ranking model directory for relevance scoring
- `--query-collection`: Path to query collection TSV file
- `--doc-collection`: Path to document collection TSV file
Optional Arguments:
- `--max-iterations`: Maximum attack iterations (default: 25)
- `--epsilon`: Epsilon constraint for perturbations (default: 0.01)
- `--alpha`: Step size for gradient updates (default: 0.1)
- `--embedding-batch-size`: Batch size for embedding (default: 100)
- `--model-tag`: Tag for model used in output filenames (default: S1)
- `--coh-weight`: Weight for coherence score (default: 0.5)
- `--rel-weight`: Weight for relevance score (default: 0.5)
- `--batch-size`: Batch size for BERT model inference (default: 32)
- `--num-labels`: Number of labels for relevance model (default: 1)
- `--device`: Device for BERT model: `cuda`, `cpu`, or `auto` (default: auto)
- `--skip-step1`: Skip adversarial sentence generation
- `--skip-step2`: Skip adversarial document construction
For detailed help, run: ./run_empra_pipeline.sh --help
Evaluate the attack performance by computing rank promotion metrics, perplexity, and readability scores. This step uses code/scripts/process_adv_perturbations_pipeline.py which processes adversarial documents and calls code/evaluation/attack_result_calculator.py to compute metrics.
python code/scripts/process_adv_perturbations_pipeline.py \
--dataset trecdl2020 \
--tag S1 \
--input_dir path/to/adversarial/documents/ \
--data_dir path/to/data/directory \
--output_base_dir path/to/output/directory \
--model cross-encoder/ms-marco-MiniLM-L-12-v2 \
--batch_size 64 \
--device auto \
--gpu_id 0 \
--rank_list_len 1000

Parameters:

| Parameter | Description |
|---|---|
| `--dataset` | Dataset name (e.g., trecdl2019, trecdl2020) |
| `--tag` | Tag identifier for the NRM |
| `--input_dir` | Directory containing adversarial document TSV files |
| `--data_dir` | Base data directory containing target and rank score files |
| `--output_base_dir` | Base output directory for results |
| `--model` | CrossEncoder model name for scoring (default: cross-encoder/ms-marco-MiniLM-L-12-v2) |
| `--batch_size` | Batch size for CrossEncoder scoring (default: 64) |
| `--device` | Device to run CrossEncoder on (auto/cpu/cuda) |
| `--gpu_id` | GPU device ID for attack_result_calculator |
| `--rank_list_len` | Length of rank list (1000 or 100) |
Output Files:
- `{tag}-{method}_adv_docs.tsv`: Adversarial documents
- `{tag}-{method}_adv_scores.tsv`: Relevance scores
- `{tag}-{method}_metrics.json`: Detailed metrics (attack success rate, rank promotion, perplexity, readability)
- `metrics_summary.tsv`: Aggregated metrics summary
Evaluate content fidelity by computing ROUGE-L recall, backward NLI entailment, and BERTScore F1. This step uses code/evaluation/content_fidelity.py.
python code/evaluation/content_fidelity.py \
--collection_file path/to/collection.tsv \
--input_folder path/to/adversarial/documents/ \
--output_csv path/to/output/content_fidelity_results.csv \
--nli_model microsoft/deberta-large-mnli \
--bertscore_model roberta-large \
--batch_size 32 \
--device auto \
--gpu_id 0

Parameters:

| Parameter | Description |
|---|---|
| `--collection_file` | Path to collection TSV file (doc_id, doc_text) |
| `--input_folder` | Path to folder containing adversarial TSV files (or use `--input_file` for a single file) |
| `--output_csv` | Path to output CSV file |
| `--nli_model` | NLI model name (default: microsoft/deberta-large-mnli) |
| `--bertscore_model` | BERTScore model type (default: roberta-large) |
| `--batch_size` | Batch size for processing (default: 32) |
| `--device` | Device to run models on (auto/cpu/cuda) |
| `--gpu_id` | CUDA device ID |
Output CSV Columns:
- `filename`: Name of the adversarial document file
- `num_docs`: Number of document pairs evaluated
- `rougeL_recall`: Average ROUGE-L recall score
- `backward_entailment`: Average backward NLI entailment score
- `bertscore_f1`: Average BERTScore F1 score
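
For readers unfamiliar with the `rougeL_recall` column, ROUGE-L recall is the length of the longest common subsequence (LCS) between reference and candidate, divided by the reference length. The sketch below assumes the original document is the reference and uses plain lowercase whitespace tokenisation. Both are simplifying assumptions, and `content_fidelity.py` may rely on a library with its own tokenisation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    """Fraction of reference tokens recovered, in order, by the candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens:
        return 0.0
    return lcs_length(ref_tokens, cand_tokens) / len(ref_tokens)


recall = rouge_l_recall(
    "prenatal vitamins contain folic acid",
    "anyone can take prenatal vitamins with folic acid",
)
print(recall)
```

A recall close to 1.0 indicates that nearly all of the original wording survives, in order, in the adversarial document.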
Evaluate linguistic acceptability using a model fine-tuned on CoLA (the Corpus of Linguistic Acceptability). This step uses code/evaluation/linguistic_acceptability.py.
python code/evaluation/linguistic_acceptability.py \
--input_file path/to/adversarial/documents.tsv \
--output_file path/to/output/acceptability_results.tsv \
--document_column document \
--model_name textattack/roberta-base-CoLA \
--batch_size 32 \
--device auto \
--gpu_id 0 \
--threshold 0.5

Parameters:

| Parameter | Description |
|---|---|
| `--input_file` | Path to input TSV file containing documents |
| `--output_file` | Path to output TSV file (default: input_file with _cola suffix) |
| `--document_column` | Name of column containing documents (default: document) |
| `--model_name` | HuggingFace model name for CoLA (default: textattack/roberta-base-CoLA) |
| `--batch_size` | Batch size for processing documents (default: 32) |
| `--device` | Device to run model on (auto/cpu/cuda) |
| `--gpu_id` | CUDA device ID |
| `--threshold` | Acceptability threshold for counting acceptable documents (default: 0.5) |
Output:
- TSV file with added `acceptability_score` column (0.0 to 1.0, where 1.0 is most acceptable)
- Console statistics: total documents, acceptable documents (score ≥ threshold), average/min/max scores
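
The console statistics listed above amount to a simple aggregation over per-document scores. The sketch below shows one way to compute them from scores that are already available. `acceptability_summary` is a hypothetical helper, and the printed layout of `linguistic_acceptability.py` is not reproduced here.

```python
from statistics import mean


def acceptability_summary(scores, threshold=0.5):
    """Aggregate per-document acceptability scores in [0.0, 1.0]."""
    if not scores:
        raise ValueError("no scores given")
    return {
        "total": len(scores),
        # A document counts as acceptable when its score is >= threshold.
        "acceptable": sum(s >= threshold for s in scores),
        "average": mean(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Made-up scores for four documents.
summary = acceptability_summary([0.9, 0.5, 0.2, 0.8])
print(summary)
```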
For human evaluation study materials and analysis, see the human_evaluation/ directory. This includes:
- Annotation guidelines (`human_evaluation/guideline/`)
- Human annotations (`human_evaluation/annotations/`)
- Analysis scripts (`human_evaluation/analyze_annotations_by_method.py`)
See human_evaluation/README.md for detailed information about the human evaluation study.
