
SpokenCRS

SpokenCRS is a lightweight Python API for normalizing conversational recommendation datasets into one consistent interface. It currently supports INSPIRED, ReDial, and ConvApparel, exposing each dataset as split-level CRSDataFrame objects with normalized user/system turns, item context, ground-truth items, recommendation labels, rejection labels, and dataset-specific metadata.

Quickstart

Install the package in editable mode:

git clone https://github.com/DB825/spokencrs.git
cd spokencrs
python -m venv .venv
# Windows PowerShell; on macOS/Linux run: source .venv/bin/activate
.\.venv\Scripts\Activate.ps1
python -m pip install -e .

Load each supported dataset through the same entry point:

from spokencrs import load_dataset

inspired = load_dataset("inspired", data_dir="data/inspired")
redial = load_dataset("redial", data_dir="data/redial")
convapparel = load_dataset("convapparel", data_dir="data/convapparel")

train = redial["train"]
conversation = train[train.index[0]]

print(conversation.get_ground_truth_items())
print(conversation.system_turn[0].recommended_items())
print(conversation.system_turn[0].rejected_items())

Use list_datasets() to see registered loader names.

Expected Data Layout

Raw datasets are not included in this repository. Download them from the original sources, extract them locally, and point load_dataset at the matching directory.

data/
  inspired/
    train.tsv
    dev.tsv
    test.tsv

  redial/
    train_data.jsonl
    test_data.jsonl

  convapparel/
    ConvApparel.json

ConvApparel is distributed as a zipped JSON file on Hugging Face; unzip it so ConvApparel.json is directly inside data/convapparel.
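Before calling load_dataset, it can help to confirm that the files sit where the layout above expects them. The following is a minimal sketch using only the standard library; EXPECTED_FILES and missing_files are hypothetical names for illustration and are not part of the spokencrs API.

```python
from pathlib import Path

# File names copied from the layout above; this mapping and the helper
# below are illustrative and are not part of the spokencrs API.
EXPECTED_FILES = {
    "inspired": ["train.tsv", "dev.tsv", "test.tsv"],
    "redial": ["train_data.jsonl", "test_data.jsonl"],
    "convapparel": ["ConvApparel.json"],
}


def missing_files(data_root="data"):
    """Return the expected dataset files that are absent under data_root."""
    root = Path(data_root)
    missing = []
    for dataset, names in EXPECTED_FILES.items():
        for name in names:
            path = root / dataset / name
            if not path.is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    # Report anything that still needs to be downloaded or unzipped.
    for path in missing_files("data"):
        print(f"missing: {path}")
```

An empty result means every file named in the layout was found; it does not check file contents.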

Supported Datasets

| Dataset | Domain | Splits | Notes |
| --- | --- | --- | --- |
| INSPIRED | Movie recommendation | train, dev, test | Uses movie_id as the ground-truth token. The dataset does not expose reliable per-turn accept/reject labels, so recommendation/rejection methods fall back to item context. |
| ReDial | Movie recommendation | train, test | Resolves @movieId tokens to titles. Recommendation and rejection labels come from respondentQuestions and are exposed on system turns. |
| ConvApparel | Apparel shopping | train | Splits each raw user/assistant turn pair into separate user and system turns. Uses per-turn recommendations, purchase-likelihood ratings, and matched which_product answers for recommendation/rejection semantics. |
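To make the ReDial note concrete, here is a minimal sketch of resolving @movieId tokens to titles. It is not the spokencrs implementation: resolve_movie_tokens is a hypothetical helper, the mapping is assumed to use id strings as keys and titles as values, and the example data is made up.

```python
import re

# Matches movie tokens of the form "@<digits>", such as "@101".
MOVIE_TOKEN = re.compile(r"@(\d+)")


def resolve_movie_tokens(text, movie_mentions):
    """Replace @movieId tokens with titles; leave unknown ids untouched."""

    def _swap(match):
        # group(1) is the numeric id, group(0) is the full "@id" token.
        return movie_mentions.get(match.group(1), match.group(0))

    return MOVIE_TOKEN.sub(_swap, text)


# Hypothetical example data, not taken from the dataset.
mentions = {"101": "Example Movie (1999)"}
print(resolve_movie_tokens("Have you seen @101 or @202?", mentions))
# prints: Have you seen Example Movie (1999) or @202?
```

Leaving unknown ids in place, rather than dropping them, keeps unresolved tokens visible during spot checks.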

Paper Links

Normalized Schema

load_dataset(...) returns a dictionary mapping split names to CRSDataFrame objects. Each CRSDataFrame row represents one conversation and contains:

  • conversation_id: stable dataset-specific conversation key
  • user_turn: list of user/seeker TurnWrapper objects
  • system_turn: list of system/recommender TurnWrapper objects
  • ground_truth_items: item identifiers used for evaluation
  • raw record fields when useful, such as raw_record or raw_conversation_df

Each TurnWrapper is domain neutral:

  • items: ordered item titles or labels referenced at the turn
  • item_dict: {item_title: index} lookup for the turn
  • recommended_item_titles: items explicitly recommended at the turn when the dataset exposes that signal
  • rejected_item_titles: recommended items inferred as not accepted when the dataset exposes that signal
  • has_recommendation_signal and has_rejection_signal: flags that distinguish an explicit empty label from a dataset with no available signal
  • metadata: dataset-specific extras such as movie genre dictionaries, ConvApparel item descriptions, image URLs, feature tags, and turn ratings

item_context() returns the full structured context for a turn. When a loader has reliable recommendation or rejection signals, recommended_items() and rejected_items() return only the corresponding subsets. When a dataset does not expose a signal, the corresponding method falls back to item_context() as a placeholder, so check has_recommendation_signal or has_rejection_signal before treating the result as a label.
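The difference between an explicit empty label and a missing signal is easy to miss, so here is a minimal sketch of the fallback logic described above. SketchTurn is an illustrative stand-in, not the real TurnWrapper, and its item_context() is assumed to return a plain list of titles where the real method returns richer structured context.

```python
from dataclasses import dataclass, field


@dataclass
class SketchTurn:
    """Illustrative stand-in for TurnWrapper; not the spokencrs class."""

    items: list = field(default_factory=list)
    recommended_item_titles: list = field(default_factory=list)
    has_recommendation_signal: bool = False

    def item_context(self):
        # Assumption: the real method returns richer structured context;
        # this sketch just returns the ordered item titles.
        return list(self.items)

    def recommended_items(self):
        # With a signal, an empty list is a real label ("nothing recommended").
        if self.has_recommendation_signal:
            return list(self.recommended_item_titles)
        # Without a signal, fall back to everything mentioned at the turn.
        return self.item_context()


labeled_empty = SketchTurn(items=["A", "B"], has_recommendation_signal=True)
unlabeled = SketchTurn(items=["A", "B"])

print(labeled_empty.recommended_items())  # []
print(unlabeled.recommended_items())      # ['A', 'B']
```

The same pattern would apply to rejected_items() with has_rejection_signal.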

Validation

The package includes reusable validation helpers:

from spokencrs import (
    load_dataset,
    summarize_loaded_datasets,
    validate_loaded_datasets,
)

loaded = {
    "INSPIRED": load_dataset("inspired", data_dir="data/inspired"),
    "ReDial": load_dataset("redial", data_dir="data/redial"),
    "ConvApparel": load_dataset("convapparel", data_dir="data/convapparel"),
}

summary = summarize_loaded_datasets(loaded)
issues = validate_loaded_datasets(loaded)

print(summary)
print(issues)

The companion notebook spokencrs_validation.ipynb mirrors this workflow and includes dataset-specific spot checks.

Adding A New Dataset

Future loaders should:

  1. Convert raw records into CRSDataFrame rows with the shared columns above.
  2. Normalize speakers to user and system.
  3. Keep core item fields domain neutral: items, item_dict, recommended_item_titles, and rejected_item_titles.
  4. Set recommendation/rejection signal flags when those labels are available.
  5. Put dataset-specific annotations in TurnWrapper.metadata.
  6. Document ground-truth, recommendation, and rejection assumptions in the loader docstring.
  7. Register the loader in spokencrs/loader.py.
  8. Add the dataset to the validation notebook and run validate_loaded_datasets.
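Steps 2 through 5 can be sketched as below. This is a rough illustration under stated assumptions, not the package's loader code: it builds plain dicts in place of CRSDataFrame rows and TurnWrapper objects, and the raw turn format, the SPEAKER_ROLES mapping, and normalize_conversation are all hypothetical.

```python
# Hypothetical raw speaker labels mapped onto the two normalized roles.
SPEAKER_ROLES = {
    "seeker": "user",
    "user": "user",
    "recommender": "system",
    "assistant": "system",
}

CORE_KEYS = {"speaker", "items", "recommended", "rejected"}


def normalize_conversation(conversation_id, raw_turns):
    """Build one normalized row (a plain dict) from raw turn dicts.

    Each raw turn is assumed to look like
    {"speaker": str, "items": [titles], ...dataset-specific extras}.
    """
    row = {"conversation_id": conversation_id, "user_turn": [], "system_turn": []}
    for raw in raw_turns:
        role = SPEAKER_ROLES[raw["speaker"].lower()]
        items = list(raw.get("items", []))
        turn = {
            "items": items,
            "item_dict": {title: index for index, title in enumerate(items)},
            "recommended_item_titles": list(raw.get("recommended", [])),
            "rejected_item_titles": list(raw.get("rejected", [])),
            # Flags record whether a label was present, even if it is empty.
            "has_recommendation_signal": "recommended" in raw,
            "has_rejection_signal": "rejected" in raw,
            # Everything else is kept as dataset-specific metadata.
            "metadata": {k: v for k, v in raw.items() if k not in CORE_KEYS},
        }
        row[f"{role}_turn"].append(turn)
    return row


# Made-up conversation for illustration.
row = normalize_conversation(
    "c1",
    [
        {"speaker": "Seeker", "items": []},
        {"speaker": "Recommender", "items": ["A", "B"], "recommended": ["A"], "rating": 4},
    ],
)
print(row["system_turn"][0]["metadata"])  # {'rating': 4}
```

A real loader would also populate ground_truth_items and wrap the result in the package's own types before registering it.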
