Pairing EU directives and their national implementing measures: A dataset for semantic search
Roger Ferrod, Denys Amore Bondarenko, Davide Audrito, and Giovanni Siragusa
Published in Computer Law & Security Review, Volume 51, 2023.
European Directives (EUDs) are binding upon Member States regarding the results to be achieved, but they leave the choice of form and methods to national authorities. Member States adopt ad hoc National Implementing Measures (NIMs) to transpose EUDs into domestic legislation, a process known as "legal harmonization".
ENIMD is a multilingual dataset designed for legal semantic search and harmonization analysis. It pairs EUDs with their corresponding NIMs across five Member States.
The dataset provides an essential foundation for training models to automatically identify national laws that implement EU directives, distinguishing them from domestic legislation that does not.
The primary task enabled by ENIMD is Semantic Search / Retrieval:
- Query: An article from an EU Directive.
- Target: The specific article(s) in National Law that implement that directive.
- Challenge: The model must retrieve the correct implementation from a pool of ~900k national articles, most of which are unrelated.
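To make the retrieval setup concrete, here is a minimal sketch of how a system could be scored on this task with recall@k over cosine similarity. The function names, the toy vectors, and the choice of recall@k as the metric are illustrative assumptions, not the evaluation protocol of the paper. In practice the vectors would come from a text encoder applied to directive articles (queries) and national articles (documents).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_at_k(query_vecs, doc_vecs, relevant, k):
    """Fraction of queries with at least one relevant document in the top k.

    relevant[i] is the set of document indices that implement query i.
    """
    hits = 0
    for qi, q in enumerate(query_vecs):
        # Rank all documents by similarity to the query, best first.
        ranked = sorted(
            range(len(doc_vecs)),
            key=lambda di: cosine(q, doc_vecs[di]),
            reverse=True,
        )
        if relevant[qi] & set(ranked[:k]):
            hits += 1
    return hits / len(query_vecs)

# Toy data: 2 directive articles, 4 national articles (hypothetical vectors).
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_docs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7], [-1.0, 0.0]]
toy_relevant = [{0}, {3}]  # query 0 is implemented by doc 0, query 1 by doc 3

r_at_1 = recall_at_k(toy_queries, toy_docs, toy_relevant, k=1)
r_at_4 = recall_at_k(toy_queries, toy_docs, toy_relevant, k=4)
```

With a corpus dominated by unrelated articles, a small k makes the metric strict, so reporting several values of k gives a fuller picture.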
The dataset is organized into three components, catering to different research needs:
A shuffled, machine-learning-ready collection of articles split into Train and Test sets.
- Content: Pairs of EUD articles (Queries) and National Law articles (Documents).
- Labels: Includes positive examples (NIMs) and negative examples (irrelevant national laws).
- Metadata: Articles are labeled with the CELEX number, country code, and transposition hash.
- Preprocessing: Filtered using an IDF-based method to remove boilerplate text (e.g., entry into force dates, financial clauses).
The parsed collection of articles where irrelevant/boilerplate provisions have been removed using the method described in the paper. This split is useful for analysis without the noise of administrative clauses.
The full parsed collection of Directives and National Laws in their original structure (articles/paragraphs), without any filtering.
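The IDF-based filtering mentioned above can be illustrated with a simplified sketch. This is not the method from the paper, whose details are not reproduced here. It assumes whitespace tokenization, scores each article by the mean inverse document frequency of its tokens, and drops articles below a hypothetical threshold. The intuition is that boilerplate clauses reuse wording found in many articles, so their tokens have low IDF.

```python
import math
from collections import Counter

def idf_table(articles):
    """Map each token to log(N / document_frequency)."""
    n = len(articles)
    df = Counter()
    for text in articles:
        df.update(set(text.lower().split()))
    return {tok: math.log(n / count) for tok, count in df.items()}

def mean_idf(text, idf):
    """Average IDF of the tokens in one article (0.0 for empty text)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(idf.get(tok, 0.0) for tok in tokens) / len(tokens)

def filter_boilerplate(articles, threshold):
    """Keep articles whose mean IDF reaches the (assumed) threshold."""
    idf = idf_table(articles)
    return [a for a in articles if mean_idf(a, idf) >= threshold]

# Toy corpus: three near-identical entry-into-force clauses and one
# substantive provision. All texts are invented for illustration.
toy_articles = [
    "this law enters into force on publication",
    "this decree enters into force on publication",
    "this act enters into force on publication",
    "operators must label organic products clearly",
]
kept = filter_boilerplate(toy_articles, threshold=1.0)
```

On a real corpus the threshold would need tuning, and a per-language IDF table is likely preferable because the five national corpora do not share a vocabulary.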
The dataset covers legislation from five EU Member States:
| Country | Language | EUD Articles (Queries) | National Corpus Articles |
|---|---|---|---|
| Italy | Italian | 11,514 | 135,221 |
| France | French | 11,386 | 236,762 |
| Spain | Spanish | 11,249 | 209,795 |
| Ireland | English | 11,344 | 157,601 |
| Austria | German | 11,837 | 199,781 |
| Total | Multilingual | 57,330 | 939,160 |
- Total Directives: 906
- Total National Documents: 9,016
- Challenge Ratio: ~88% of the national corpus consists of "irrelevant" laws (negative examples), providing a realistic and robust retrieval challenge.
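As a quick sanity check, the totals in the table can be recomputed from the per-country rows, and the ~88% figure can be turned into an approximate count of negative articles. The variable names are illustrative, and the negative count is only an estimate because the ratio itself is rounded.

```python
# Per-country figures copied from the table above.
eud_articles = {
    "Italy": 11_514, "France": 11_386, "Spain": 11_249,
    "Ireland": 11_344, "Austria": 11_837,
}
national_articles = {
    "Italy": 135_221, "France": 236_762, "Spain": 209_795,
    "Ireland": 157_601, "Austria": 199_781,
}

total_queries = sum(eud_articles.values())
total_national = sum(national_articles.values())

# Rough number of irrelevant national articles implied by the ~88% ratio.
approx_negatives = round(0.88 * total_national)

print(total_queries, total_national, approx_negatives)
```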
You can easily load the dataset using the Hugging Face datasets library:
```python
from datasets import load_dataset

# Load the ML-ready dataset
dataset = load_dataset("rogerferrod/ENIMD", data_dir="ML-dataset")

# Example: Inspect the first training example
print(dataset["train"][0])
```

Alternatively, you can download the raw files directly from Mendeley Data.
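Once a split is loaded, a common first step is to check how examples are distributed across countries and labels. The helper below is a small sketch that works on any iterable of dict-like rows, which includes a loaded `datasets` split. The field names `country` and `label` are hypothetical placeholders; inspect `dataset["train"].column_names` to find the actual column names before using it.

```python
from collections import Counter

def count_by(records, *fields):
    """Count records by the tuple of values found in the given fields."""
    return Counter(tuple(rec[f] for f in fields) for rec in records)

# Toy rows standing in for a loaded split. The field names and values
# are assumptions made for this example, not the dataset's real schema.
toy_rows = [
    {"country": "IT", "label": "positive"},
    {"country": "IT", "label": "negative"},
    {"country": "IT", "label": "negative"},
    {"country": "FR", "label": "negative"},
]

counts = count_by(toy_rows, "country", "label")
for key, n in sorted(counts.items()):
    print(key, n)
```

With the real data, the call would look like `count_by(dataset["train"], ...)` with the actual column names substituted.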
If you use this dataset in your research, please cite the original paper:
```bibtex
@article{FERROD2023105862,
  title   = {Pairing EU directives and their national implementing measures: A dataset for semantic search},
  journal = {Computer Law \& Security Review},
  volume  = {51},
  pages   = {105862},
  year    = {2023},
  issn    = {2212-473X},
  author  = {Roger Ferrod and Denys Amore Bondarenko and Davide Audrito and Giovanni Siragusa}
}
```