DSAIL-SKKU/LEDE

💡 LEDE : A large-scale benchmark for AI-generated news detection


[Figure: AI-generated news construction pipeline]


LEDE is a large-scale benchmark dataset for AI-generated news detection, comprising over 337K articles and approximately 4.3M sentences. It addresses the limitations of existing benchmarks by providing broader generator diversity and news-specific coverage across 21 state-of-the-art LLMs, two languages, and 17 news categories. LEDE serves as a valuable resource for advancing research on AI-generated text detection, cross-model generalization, multilingual robustness, and domain-aware evaluation.

The dataset repository includes AI-generated news articles spanning multiple prompting strategies and news categories. For access to the full dataset, please refer to the Hugging Face repository: https://huggingface.co/datasets/NeurIPS-2026-LEDE/LEDE-dataset

💡 Quantitative comparison of LEDE and existing AI-Gen News datasets

Dataset Venue Including News # News # LLMs # Category # Language
M4 [paper] EACL 2024 ✓ (N%) 12,000 2 3
MAGE [paper] ACL 2024 ✓ (N%) 58,391 27 1
M4GT-Bench [paper] ACL 2024 ✓ (N%) 19,100 4 6
RAID [paper] ACL 2024 ✓ (N%) 726,240 11 5 1
DetectRL [paper] NeurIPS 2024 D&B ✓ (N%) 33,600 4 1
Beemo [paper] NAACL 2025 -- -- -- --
M-DAIGT [paper] RANLP 2025 Shared Task ✓ (N%) 7,000 6 2
LEDE -- ✓ (100%) 337,322 21 17 2


💡 Data Description

LEDE is a large-scale multilingual benchmark for AI-generated news detection, designed to support robust evaluation across diverse LLMs, news categories, generation strategies, and languages.

📈 LEDE Dataset Statistics

AI-generated News

  • # of LLMs : 21
  • # of Languages : 2 (Eng, Kor)
  • # of Articles : 337,322
  • # of Sentences : 4,309,153
  • # of News Categories : 17
  • # of Generation Strategies : 4 (sc, ib, ng, we)
  • # English Sentences : 2,393,518
  • # Korean Sentences : 1,915,635
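
The sentence counts above are internally consistent: the English and Korean totals add up to the overall sentence count. A minimal Python check, using only the figures listed above (the constant and function names are illustrative), also derives the mean number of sentences per article:

```python
# Figures copied from the statistics list above.
NUM_ARTICLES = 337_322
NUM_SENTENCES = 4_309_153
NUM_ENGLISH_SENTENCES = 2_393_518
NUM_KOREAN_SENTENCES = 1_915_635


def language_split_is_consistent() -> bool:
    """Return True if the per-language counts sum to the overall total."""
    return NUM_ENGLISH_SENTENCES + NUM_KOREAN_SENTENCES == NUM_SENTENCES


def mean_sentences_per_article() -> float:
    """Average number of sentences per generated article."""
    return NUM_SENTENCES / NUM_ARTICLES


if __name__ == "__main__":
    print(language_split_is_consistent())
    print(round(mean_sentences_per_article(), 2))
```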

📑 Configuration of LEDE Metadata

Field Description
human_rid Identifier for the original human-written article.
• AIHub datasets: uses the official AIHub dataset ID
• English datasets: constructed as {first 4 words}-{last 4 words} from the original article
human_fid Identifier for the corresponding fake/generated counterpart.
• AIHub datasets: uses the official AIHub dataset ID
• English datasets: constructed as {first 4 words}-{last 4 words} from the original article
title Title of the AI-generated news article
summary Summary of the AI-generated news article
ai_article Full text of the AI-generated news article
category News category/domain of the article (17 categories in total; e.g., politics, health, law, economy, sports)
model Large Language Model (LLM) used for article generation (21 models in total)
strategy Generation strategy used for article creation (sc, ib, ng, we)
language Language of the generated article (Kor or Eng)
num_sentences Number of sentences in the generated article
num_words Number of words in the generated article
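
As a rough illustration of how these fields might be consumed, the sketch below reads rows with Python's standard `csv` module and counts articles per field value. It assumes the metadata is available as a CSV file whose header uses the field names above; the sample rows, the model names, and the helper names `load_rows` and `count_by` are hypothetical and are not part of the LEDE code.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample using the metadata field names listed above.
# All values are placeholders, not real LEDE records.
SAMPLE_CSV = """human_rid,human_fid,title,summary,ai_article,category,model,strategy,language,num_sentences,num_words
id-001,id-001,Title A,Summary A,Body A.,politics,model-x,sc,Eng,1,2
id-002,id-002,Title B,Summary B,Body B.,sports,model-y,ib,Kor,1,2
"""


def load_rows(handle):
    """Read metadata rows and convert the two count fields to integers."""
    rows = []
    for row in csv.DictReader(handle):
        row["num_sentences"] = int(row["num_sentences"])
        row["num_words"] = int(row["num_words"])
        rows.append(row)
    return rows


def count_by(rows, field):
    """Count rows per distinct value of a metadata field."""
    return Counter(row[field] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(count_by(rows, "language"))
print(count_by(rows, "strategy"))
```

For a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="", encoding="utf-8")`.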

💡 Evaluation

1. Data preparation

1.1. Download LEDE Datasets

To access the LEDE dataset, please visit the Hugging Face repository: https://huggingface.co/datasets/NeurIPS-2026-LEDE/LEDE-dataset

The LEDE dataset is available under the Creative Commons Attribution-NonCommercial 4.0 International Public License. Any violation of this license agreement may result in legal action. By downloading LEDE, the user agrees to the terms of the CC BY-NC 4.0 license.

1.2. Download Human-written News Datasets

Please download all of the following datasets and store them in the human-written/ directory.

1.3. Mapping Human-written News

Each human-written article is aligned with its corresponding AI-generated article using the human_rid field.

  • AI-Hub datasets: The original dataset ID is used directly.
  • English datasets: IDs are constructed in the format {first 4 words}-{last 4 words} from the original article.

This mapping enables direct and consistent comparison between human-written and AI-generated texts during evaluation.
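
The English ID rule can be sketched as follows. This is only an illustration under stated assumptions: words are split on whitespace and the four words on each side are joined with single spaces. The README does not specify tokenization or the joining character inside each half, so the output of the hypothetical helper `make_english_rid` may not match the released IDs exactly.

```python
def make_english_rid(article_text: str, n_words: int = 4, sep: str = "-") -> str:
    """Build an ID of the form {first 4 words}-{last 4 words}.

    Assumption: whitespace tokenization, words within each half joined by spaces.
    """
    words = article_text.split()
    first = " ".join(words[:n_words])
    last = " ".join(words[-n_words:])
    return f"{first}{sep}{last}"


def align(human_articles, ai_rows):
    """Pair each AI-generated row with the human article that shares its human_rid."""
    by_rid = {make_english_rid(text): text for text in human_articles}
    return [
        (by_rid[row["human_rid"]], row)
        for row in ai_rows
        if row["human_rid"] in by_rid
    ]


human = ["The quick brown fox jumps over the lazy dog"]
ai_rows = [{"human_rid": make_english_rid(human[0]), "ai_article": "placeholder"}]
print(make_english_rid(human[0]))
print(len(align(human, ai_rows)))
```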


2. Baseline Evaluation

Run baseline model evaluation using either a single CSV file or a CSV directory. Below are sample commands for running zero-shot baseline evaluations.

$ git clone https://github.com/DSAIL-SKKU/LEDE.git

2-1. Fast-DetectGPT

Installation

  • You can follow the official Fast-DetectGPT GitHub repository for installation details.
  • Python 3.8
  • PyTorch 1.10.0

Evaluate a CSV Directory

$ cd src/baselines/fast-detect-gpt
$ bash scripts/eval.sh --csv_dir /path/to/csv_dir

Each file prints metrics in the following format:

n_pairs: XXXX
ROC AUC (criterion): 0.XXXX
PR AUC (criterion): 0.XXXX

The aggregated per-file metrics are saved to ./outputs/batch_eval/roc/ by default.
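
For readers unfamiliar with the reported ROC AUC, the sketch below shows what the number measures: the probability that a randomly chosen AI-generated text receives a higher detection score than a randomly chosen human-written text, with ties counted as one half. It assumes that a higher criterion value means "more likely AI-generated". It is a plain-Python illustration, not the evaluation code used by the script, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(human_scores, ai_scores):
    """ROC AUC as a pairwise win rate of AI scores over human scores.

    Assumption: higher score means more likely AI-generated.
    Quadratic in the number of samples, so suitable only for small illustrations.
    """
    if not human_scores or not ai_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for ai in ai_scores:
        for human in human_scores:
            if ai > human:
                wins += 1.0
            elif ai == human:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


# Perfectly separated scores give an AUC of 1.0.
print(roc_auc([0.1, 0.2], [0.8, 0.9]))
```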

2-2. Binoculars

Installation

  • You can follow the official Binoculars GitHub repository for installation details.
  • Python 3.8
  • PyTorch 1.10.0

Evaluate a Single CSV File

$ cd src/baselines/Binoculars/
$ bash eval.sh --csv_path /path/to/file.csv

Evaluate a CSV Directory

$ cd src/baselines/Binoculars/
$ bash eval.sh --csv_dir /path/to/csv_dir

Each file prints metrics in the following format:

[OK] <file>.csv | n=<rows> (eval=<evaluated_rows>) | ACC=0.XXXX ROC_AUC=0.XXXX PR_AUC=0.XXXX

The aggregated per-file metrics are saved to binoculars_csv_folder_metrics.csv by default.
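
If the console output is captured to a log, the per-file lines can be parsed back into records. The sketch below is one possible approach based only on the line format shown above. The regular expression, the helper name `parse_metrics_line`, and the sample values are assumptions, and file names containing spaces are not handled.

```python
import re

# Pattern derived from the "[OK] <file>.csv | n=... (eval=...) | ACC=... ..." format.
LINE_RE = re.compile(
    r"^\[OK\]\s+(?P<file>\S+\.csv)\s+\|\s+n=(?P<n>\d+)\s+\(eval=(?P<n_eval>\d+)\)\s+\|\s+"
    r"ACC=(?P<acc>\d*\.?\d+)\s+ROC_AUC=(?P<roc_auc>\d*\.?\d+)\s+PR_AUC=(?P<pr_auc>\d*\.?\d+)\s*$"
)


def parse_metrics_line(line):
    """Parse one per-file metrics line; return None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "file": fields["file"],
        "n": int(fields["n"]),
        "n_eval": int(fields["n_eval"]),
        "acc": float(fields["acc"]),
        "roc_auc": float(fields["roc_auc"]),
        "pr_auc": float(fields["pr_auc"]),
    }


# Hypothetical line with placeholder values, for illustration only.
sample = "[OK] sample.csv | n=100 (eval=98) | ACC=0.9000 ROC_AUC=0.9500 PR_AUC=0.9400"
print(parse_metrics_line(sample))
```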

2-3. Additional Models

In addition to the two baseline models described above, other AI-generated text detection models can be explored through their official GitHub repositories.

Zero-shot Models

Supervised Models

💡 License

The LEDE dataset is available under the Creative Commons Attribution-NonCommercial 4.0 International Public License: https://creativecommons.org/licenses/by-nc/4.0/. The code is released under the MIT license.

About

[NeurIPS 2026 E&D - under review] A large-scale benchmark for AI-generated news detection, covering 21 LLMs, 4 generation strategies, 17 news categories, and 2 languages (English, Korean).
