SpellXpert

Model

You can download the SpellXpert model from our Hugging Face repository.

Note

The Hugging Face repository is private for now because of the double-blind review policy.

Dataset

All datasets can be found at this Google Drive link.

Note

The Google Drive link is private for now because of the double-blind review policy.

SpellXpert Pipeline

1. Make your own Llama-Factory compatible dataset

1.1. Create a config file under `configs/datasets/` directory.

Example config file:

{
  "name": "cscd-ns",
  "root": "datasets/original/cscd-ns",
  "files": {
    "train": "train.tsv",
    "valid": "dev.tsv",
    "test": "test.tsv",
    "all": "all.tsv"
  }
}
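
As a rough illustration of how such a config might be consumed, the sketch below parses the JSON and resolves each split to a path under `root`. The function name `resolve_split_paths` is hypothetical and not part of the SpellXpert codebase; the actual loader under `csc/data/datasets/` may work differently.

```python
import json
from pathlib import Path


def resolve_split_paths(config_text: str) -> dict[str, Path]:
    """Map each split name in a dataset config to its file path under `root`."""
    config = json.loads(config_text)
    root = Path(config["root"])
    return {split: root / filename for split, filename in config["files"].items()}


# Same content as the example config above, inlined so the sketch is self-contained.
example_config = """
{
  "name": "cscd-ns",
  "root": "datasets/original/cscd-ns",
  "files": {"train": "train.tsv", "valid": "dev.tsv", "test": "test.tsv", "all": "all.tsv"}
}
"""

paths = resolve_split_paths(example_config)
print(paths["valid"].as_posix())
```

Keeping the file names relative to `root` means a dataset can be moved by editing a single field.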

1.2. Create the dataset processing module under `csc/data/datasets/` directory.

1.3. Create the dataset processing script under `scripts/datasets/` directory.

The dataset creation script is located in the scripts/datasets directory.

Example dataset processing script:

# --template=3: Template 3 is the best performing input template for SpellXpert
# --variant=reasoning: allow SpellXpert to reason about the input
python make-dataset.py \
   --dataset-config=../../configs/datasets/cscd-ns.json \
   --template=3 \
   --variant=reasoning \
   --input-root=...

1.4. Add the dataset to `dataset_info.json`, which is used by Llama-Factory to load the dataset.

Example entry in dataset_info.json:

{
  "cscd_ns": {
    "file_name": "cscd-ns/template-3/reasoning-test.jsonl",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}
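
The `columns` mapping above implies that every record in the JSONL file carries `instruction`, `input`, and `output` fields. A minimal sketch of a sanity check before handing the file to Llama-Factory follows; `missing_fields` is a hypothetical helper, and the sample records are made up for illustration.

```python
import json

# Column mapping copied from the example entry above.
COLUMNS = {"prompt": "instruction", "query": "input", "response": "output"}


def missing_fields(jsonl_text: str, columns: dict[str, str]) -> list[tuple[int, str]]:
    """Return (line_number, field) pairs for every mapped field a record lacks."""
    problems = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        for field in columns.values():
            if field not in record:
                problems.append((line_number, field))
    return problems


# Made-up records for illustration only; not taken from the real dataset.
sample = "\n".join([
    json.dumps({"instruction": "Correct the sentence.", "input": "...", "output": "..."}),
    json.dumps({"instruction": "Correct the sentence.", "input": "..."}),
])
print(missing_fields(sample, COLUMNS))
```

An empty list means every record has the fields named in the mapping.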

2. Run vLLM inference

The inference script is located in the scripts/inference directory. The vllm-infer.py script is adapted from Llama-Factory's vllm_infer.py script to support the SpellXpert dataset format and cascade verification. You must install Llama-Factory and vLLM before running the inference script.

Example command to run vLLM inference:

model_name_or_path=...
dataset_dir=...  # Path to the directory where the `dataset_info.json` file is located
template=deepseek3
cutoff_len=13312
max_new_tokens=4096
dataset=cscd_ns  # Replace with your dataset name (defined in `dataset_info.json`)

# --save_name: change this to your desired output file path
# --n: number of output candidates to generate for each input
python vllm-infer.py \
   --model_name_or_path=${model_name_or_path} \
   --dataset=${dataset} \
   --dataset_dir=${dataset_dir} \
   --template=${template} \
   --cutoff_len=${cutoff_len} \
   --max_new_tokens=${max_new_tokens} \
   --save_name=generated_predictions.jsonl \
   --n 8

3. Collect and filter results using the cascade verification module

3.1. Build your own vocabulary/dictionary

Make your own vocabulary/dictionary by creating a text file with one word/phrase per line. Put the file in the scripts/evaluation/dictionaries directory.
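
For reference, a one-entry-per-line file like this can be read into a set as sketched below. This is only an illustration of the file format: `load_dictionary` is a hypothetical helper, the sample entries are made up, and how `evaluate.py` actually applies the whitelist is not shown here.

```python
def load_dictionary(text: str) -> set[str]:
    """Parse one word/phrase per line, dropping blank lines and surrounding whitespace."""
    return {line.strip() for line in text.splitlines() if line.strip()}


# Made-up dictionary content; in practice you would read the file, for example
# Path("dictionaries/whitelist.txt").read_text(encoding="utf-8").
sample_text = "因为\n所以\n\n  人工智能  \n"

whitelist = load_dictionary(sample_text)
print(sorted(whitelist))
```

Stripping whitespace and skipping blank lines keeps stray formatting in the file from producing empty or padded entries.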

3.2. Run the cascade verification script (stage 1)

In scripts/evaluation directory, run:

# --path: path to the generated predictions file from the inference step
# --template=1: Template 1 is the best performing output template for SpellXpert
# --run_name: name of the run, used for saving results
# --filter_output_label_whitelist_path, --filter_output_predict_whitelist_path: path(s) to the dictionary file(s)
python evaluate.py \
   --path=generated_predictions.jsonl \
   --template=1 \
   --run_name=your_run_name \
   --filter_output_label_whitelist_path='["dictionaries/whitelist.txt"]' \
   --filter_output_predict_whitelist_path='["dictionaries/whitelist.txt"]' \
   --filter_output_context_path=../../datasets/context/cscd-ns/reasoning-context.pkl

Note that the context file is optional and is created in step 1.2. Usually, it is the article from which the input sentence was extracted.

The stage 1 output is written to the <project root>/reports/evaluation/<run name>/ directory.

3.3. Extract the verification dataset for verification stage 2

In scripts/verification directory, run:

# --template=0: Template 0 is the best performing output template for SpellXpert
python make-dataset.py \
   --path=../../reports/evaluation/<run_name>/extract-output-FP-TP.cleaned.jsonl \
   --template=0

The verification dataset will be saved in the datasets/run folder. An entry will be automatically added to the datasets/run/dataset_info.json file for the verification dataset.

3.4. Use vLLM to run inference on the verification dataset

The inference script is located in the scripts/inference directory.

Example command to run vLLM inference:

model_name_or_path=...  # Path to any open-source LLM model, e.g., `deepseek3`
dataset_dir=...  # Path to the directory where the `dataset_info.json` file is located
template=deepseek3  # Change this according to your model
cutoff_len=13312
max_new_tokens=4096
dataset=...  # Replace with your dataset name (defined in `dataset_info.json`)

# --save_name: change this to your desired output file path
# --n: use 1 for verification stage 2
python vllm-infer.py \
   --model_name_or_path=${model_name_or_path} \
   --dataset=${dataset} \
   --dataset_dir=${dataset_dir} \
   --template=${template} \
   --cutoff_len=${cutoff_len} \
   --max_new_tokens=${max_new_tokens} \
   --save_name=generated_predictions.jsonl \
   --n 1

3.5. Run the cascade verification script (stage 2)

In scripts/verification directory, run:

# --csc-output-path: path to the generated predictions file from inference step 2
# --csc-output-template=1: Template 1 is the best performing output template for SpellXpert
# --verification-input-path: path to the verification dataset created in step 3.3
# --verification-output-path: path to the generated predictions file from inference step 3.4
# --verification-output-template=0: Template 0 is the best performing output template for SpellXpert
# --run_name: name of the run, used for saving results
python verify.py \
   --csc-output-path=generated_predictions.jsonl \
   --csc-output-template=1 \
   --verification-input-path=../../datasets/run/....jsonl \
   --verification-output-path=generated_predictions.jsonl \
   --verification-output-template=0 \
   --run_name=your_run_name

The stage 2 output is written to the <project root>/reports/verification/<run name>/ directory.

This is the final output of the SpellXpert pipeline. All detected errors are marked with <csc></csc> tags in the output text.
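
If you want to post-process the tagged text, the `<csc></csc>` spans can be pulled out with a regular expression. The sketch below assumes the tags are not nested and simply wrap the flagged text; the helper names and the sample sentence are hypothetical, and the exact contents of a tagged span depend on the pipeline output.

```python
import re

# Non-greedy match so that several tagged spans in one line are found separately.
CSC_TAG = re.compile(r"<csc>(.*?)</csc>", re.DOTALL)


def extract_marked_spans(text: str) -> list[str]:
    """Return the text inside every <csc>...</csc> pair, in order of appearance."""
    return CSC_TAG.findall(text)


def strip_csc_tags(text: str) -> str:
    """Remove the tags but keep the text they wrap."""
    return CSC_TAG.sub(r"\1", text)


# Made-up sentence for illustration only.
sample_output = "今天<csc>天汽</csc>很好，我们去<csc>工园</csc>吧。"

print(extract_marked_spans(sample_output))
print(strip_csc_tags(sample_output))
```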
