ENIMD: European National Implementing Measures Dataset

Pairing EU directives and their national implementing measures: A dataset for semantic search

Roger Ferrod, Denys Amore Bondarenko, Davide Audrito, and Giovanni Siragusa

Published in Computer Law & Security Review, Volume 51, 2023.

📌 Overview

European Directives (EUDs) are binding upon Member States regarding the results to be achieved, but they leave the choice of form and methods to national authorities. Member States adopt ad hoc National Implementing Measures (NIMs) to transpose EUDs into domestic legislation, a process known as "legal harmonization".

ENIMD is a multilingual dataset designed for legal semantic search and harmonization analysis. It pairs EUDs with their corresponding NIMs across five Member States.

The dataset supports training models that automatically identify the national laws implementing an EU directive and distinguish them from domestic legislation that does not.

⚖️ Legal Harmonization Task

The primary task enabled by ENIMD is Semantic Search / Retrieval:

- Query: An article from an EU Directive.
- Target: The specific article(s) in national law that implement that directive.
- Challenge: The model must retrieve the correct implementation from a pool of roughly 940k national articles (939,160 in total), most of which are unrelated.
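The retrieval setup can be illustrated with a minimal lexical baseline. The sketch below ranks a toy pool by cosine similarity over bag-of-words counts. It is not the method used in the paper, and the texts and the `retrieve` helper are invented for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, pool, k=2):
    """Return the indices of the k pool articles most similar to the query."""
    q = Counter(tokenize(query))
    scores = [(cosine(q, Counter(tokenize(doc))), i) for i, doc in enumerate(pool)]
    # Highest score first; ties are broken by original position.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]


# Toy, invented texts (not taken from ENIMD).
query = "Member States shall ensure the protection of personal data of consumers"
pool = [
    "The budget for road maintenance is approved annually",
    "Personal data of consumers shall be protected by the competent authority",
    "Fishing licences are issued by the regional office",
]
top = retrieve(query, pool, k=1)
print(top)
```

A word-overlap baseline like this is only a reference point. Legal texts that implement a directive often paraphrase it, which is why the task is framed as semantic rather than lexical search.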

💾 Dataset Structure

The dataset is organized into three components, catering to different research needs:

1. `ML-dataset`

A shuffled, machine-learning-ready collection of articles split into Train and Test sets.

- Content: Pairs of EUD articles (Queries) and National Law articles (Documents).
- Labels: Includes positive examples (NIMs) and negative examples (irrelevant national laws).
- Metadata: Articles are labeled with the CELEX number, country code, and transposition hash.
- Preprocessing: Filtered using an IDF-based method to remove boilerplate text (e.g., entry into force dates, financial clauses).
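To give an intuition for IDF-based filtering, here is a minimal sketch under stated assumptions. The actual method is described in the paper and is not reproduced here. The scoring rule (mean token IDF per article), the `threshold` parameter, and the toy texts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def idf_table(articles):
    """Map each token to log(N / df), where df counts the articles containing it."""
    n = len(articles)
    df = Counter()
    for text in articles:
        df.update(set(tokenize(text)))
    return {tok: math.log(n / count) for tok, count in df.items()}


def mean_idf(text, idf):
    """Average IDF of the tokens in one article (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(idf.get(t, 0.0) for t in tokens) / len(tokens)


def filter_boilerplate(articles, threshold):
    """Keep articles whose mean IDF reaches the (hypothetical) threshold."""
    idf = idf_table(articles)
    return [a for a in articles if mean_idf(a, idf) >= threshold]


# Toy, invented corpus: a clause repeated across laws plus one substantive article.
articles = [
    "This decree shall enter into force on the day of its publication",
    "This decree shall enter into force on the day of its publication",
    "This decree shall enter into force on the day of its publication",
    "Operators must notify the authority of any data breach within three days",
]
kept = filter_boilerplate(articles, threshold=0.5)
print(kept)
```

The idea is that wording repeated across many laws, such as entry-into-force clauses, is built from tokens with high document frequency, so it scores low and is dropped.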

2. `filtered`

The parsed collection of articles from which irrelevant or boilerplate provisions have been removed using the method described in the paper. This component is useful for analysis without the noise of administrative clauses.

3. `raw`

The full parsed collection of Directives and National Laws in their original structure (articles/paragraphs), without any filtering.

📊 Statistics

The dataset covers legislation from five EU Member States:

| Country | Language | EUD Articles (Queries) | National Corpus Articles |
|---|---|---|---|
| Italy | Italian | 11,514 | 135,221 |
| France | French | 11,386 | 236,762 |
| Spain | Spanish | 11,249 | 209,795 |
| Ireland | English | 11,344 | 157,601 |
| Austria | German | 11,837 | 199,781 |
| Total | Multilingual | 57,330 | 939,160 |

- Total Directives: 906
- Total National Documents: 9,016
- Challenge Ratio: ~88% of the national corpus consists of "irrelevant" laws (negative examples), providing a realistic and robust retrieval challenge.
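As a quick sanity check, the per-country figures can be summed to reproduce the totals. The sketch below also derives a rough count of negative articles from the rounded ~88% share, which works out to roughly 826,000. That last number is an approximation computed here, not a figure reported by the authors.

```python
# Per-country counts copied from the statistics table:
# (EUD articles used as queries, national corpus articles).
counts = {
    "Italy": (11_514, 135_221),
    "France": (11_386, 236_762),
    "Spain": (11_249, 209_795),
    "Ireland": (11_344, 157_601),
    "Austria": (11_837, 199_781),
}

total_queries = sum(q for q, _ in counts.values())
total_corpus = sum(c for _, c in counts.values())

# Rough estimate only: 88% is the rounded share of negative examples.
approx_negatives = round(0.88 * total_corpus)

print(total_queries, total_corpus, approx_negatives)
```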

💻 Usage

Load the dataset with the Hugging Face `datasets` library:

from datasets import load_dataset

# Load the ML-ready dataset
dataset = load_dataset("rogerferrod/ENIMD", data_dir="ML-dataset")

# Example: Inspect the first training example
print(dataset['train'][0])
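Once loaded, each row is a dictionary, so simple summaries need only the standard library. The sketch below counts records per value of a metadata field. The column names `country` and `label` are assumptions made for illustration; check `dataset['train'].column_names` for the real ones.

```python
from collections import Counter


def count_by(records, field):
    """Count how many records share each value of the given field."""
    return Counter(rec[field] for rec in records)


# Toy records with hypothetical column names; the real columns may differ.
records = [
    {"country": "IT", "label": 1},
    {"country": "IT", "label": 0},
    {"country": "FR", "label": 0},
]
by_country = count_by(records, "country")
print(by_country)
```

The same helper should work on a loaded split, for example `count_by(dataset['train'], "country")`, provided a column with that name exists.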

Alternatively, you can download the raw files directly from Mendeley Data.

📖 Citation

If you use this dataset in your research, please cite the original paper:

@article{FERROD2023105862,
  title = {Pairing EU directives and their national implementing measures: A dataset for semantic search},
  journal = {Computer Law \& Security Review},
  volume = {51},
  pages = {105862},
  year = {2023},
  issn = {2212-473X},
  author = {Roger Ferrod and Denys Amore Bondarenko and Davide Audrito and Giovanni Siragusa}
}
