A pipeline for recognizing nested, discontinuous, and multi-type entities
Our method consists of two steps, which can be run separately or chained in a pipeline to improve results:
- A deep learning model that fine-tunes BERT to recognize named entities using a QA-based NER approach;
- A multilabel CRF, where a threshold is defined to classify each token into a class.
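The thresholding idea in the second step can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes per-token, per-label scores are already available, and the function name `labels_above_threshold`, the 0.5 threshold, and the sample scores are all hypothetical.

```python
from typing import Dict, List


def labels_above_threshold(
    token_scores: List[Dict[str, float]], threshold: float = 0.5
) -> List[List[str]]:
    """Keep, for every token, each label whose score reaches the threshold.

    Tokens with no label above the threshold fall back to the outside tag "O".
    """
    result = []
    for scores in token_scores:
        kept = sorted(label for label, score in scores.items() if score >= threshold)
        result.append(kept if kept else ["O"])
    return result


# Hypothetical scores for a two-token input (assumed, not real model output).
scores = [
    {"Disease": 0.91, "Symptom": 0.62},
    {"Disease": 0.20, "Symptom": 0.10},
]
predicted = labels_above_threshold(scores, threshold=0.5)
print(predicted)
```

A token may keep more than one label here, which is how a multilabel setup could represent multi-type entities.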
Module 1 of BioNestedNER adopts a QA-based approach to Named Entity Recognition (NER), where a BERT model is fine-tuned to identify entity boundaries using tagging schemes like IOB2 or IOBES. Instead of returning entity spans as start–end indices (as in traditional QA), the model predicts structured labels for each token in the input.
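To illustrate how the two tagging schemes relate, the sketch below converts an IOB2 sequence to IOBES. It assumes untyped, well-formed tags (`B`, `I`, `O`), since the entity type comes from the query; `iob2_to_iobes` is a hypothetical helper, not a function from this repository.

```python
from typing import List


def iob2_to_iobes(tags: List[str]) -> List[str]:
    """Convert untyped IOB2 tags to IOBES.

    A one-token entity becomes "S", and the last token of a longer
    entity becomes "E"; all other tags are left unchanged.
    """
    iobes = []
    for i, tag in enumerate(tags):
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        entity_continues = next_tag == "I"
        if tag == "B":
            iobes.append("B" if entity_continues else "S")
        elif tag == "I":
            iobes.append("I" if entity_continues else "E")
        else:
            iobes.append(tag)
    return iobes


converted = iob2_to_iobes(["B", "I", "I", "O", "B", "O"])
print(converted)
```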
Each training instance includes a query indicating the entity type, followed by the sentence, separated by special tokens. The training process involves three steps: preprocessing the corpus into QA-NER format, generating contextual word representations using a pre-trained language model, and fine-tuning with a linear classification layer for token-level predictions.
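The instance layout described above can be sketched like this. It is a simplified illustration that assumes BERT-style `[CLS]` and `[SEP]` tokens and whitespace tokenization; the helper `build_qa_ner_instance`, the `IGNORE` marker, and the example sentence are hypothetical and do not reproduce the repository's preprocessing.

```python
from typing import List, Tuple

IGNORE = "IGNORE"  # hypothetical marker for positions excluded from the loss


def build_qa_ner_instance(
    query: str, sentence_tokens: List[str], tags: List[str]
) -> Tuple[List[str], List[str]]:
    """Build one instance: [CLS] query [SEP] sentence [SEP].

    Only sentence tokens carry tags; the query and the special tokens
    are marked so they can be skipped during training.
    """
    if len(sentence_tokens) != len(tags):
        raise ValueError("each sentence token needs exactly one tag")
    query_tokens = query.split()
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + sentence_tokens + ["[SEP]"]
    labels = [IGNORE] * (len(query_tokens) + 2) + list(tags) + [IGNORE]
    return tokens, labels


# Hypothetical example: the query names the entity type being asked for.
tokens, labels = build_qa_ner_instance(
    "protein", ["IL-2", "gene", "expression"], ["B", "O", "O"]
)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```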
Entity type is not predicted directly but is implied by the query. This setup allows flexible handling of multiple entity types by treating NER as a question-answering task.
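One consequence of querying each type separately is that overlapping spans of different types can be recovered by decoding each query's output on its own and merging the results. The sketch below shows that idea under stated assumptions: tags are untyped IOB2, the predictions are invented for illustration, and `iob2_to_spans` and `collect_entities` are hypothetical helpers rather than the repository's decoding code.

```python
from typing import Dict, List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, from untyped IOB2 tags."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:  # tolerate an "I" without a preceding "B"
                start = i
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def collect_entities(predictions: Dict[str, List[str]]) -> List[Tuple[str, int, int]]:
    """Merge the spans decoded for each queried type into one sorted list."""
    entities = []
    for entity_type, tags in predictions.items():
        for start, end in iob2_to_spans(tags):
            entities.append((entity_type, start, end))
    # Sort by start position, longer (outer) spans first.
    return sorted(entities, key=lambda e: (e[1], -e[2], e[0]))


# Hypothetical per-query outputs for the sentence "IL-2 gene expression".
predictions = {
    "DNA": ["B", "I", "O"],
    "protein": ["B", "O", "O"],
}
entities = collect_entities(predictions)
print(entities)
```

In this invented example the `protein` span sits inside the `DNA` span, which a single flat tag sequence could not express.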
We performed experiments on the following corpora:
- NestedClinBr, a novel corpus containing nested and discontinuous entities in Brazilian Portuguese clinical texts;
- SemClinBr, a corpus containing multi-type entities in Brazilian Portuguese clinical texts;
- Genia, a corpus containing nested and discontinuous entities in English biomedical texts;
- RareDisease, a corpus containing nested and discontinuous entities in English clinical texts.
1 - Finding nested entities
# using NestedClinBr corpus
python src/run_ner.py
2 - Finding nested and discontinuous entities
# using a NestedClinBr version with discontinuous entities
python src/run_ner_desc.py
3 - Finding discontinuous and/or nested entities of the same type (an end-to-end model)
# using a Genia version with discontinuous entities
python src/run_ner_desc.py
Open multilabel-CRF.ipynb and run it in Jupyter Notebook or in Google Colab.
- Claudia Moro
- Elisa Terumi Rubel Schneider
- Emerson Cabrera Paraiso
- Paloma Martínez
- Yohan Bonescki Gumiel
*** soon ***

