HAILab-PUCPR/BioNestedNER

BioNestedNER

A pipeline for recognizing nested, discontinuous, and multi-type entities

Logo BioNestedNER

Our method consists of two steps (which can be executed separately, or in a pipeline to improve results):

  1. A BERT-based deep learning model that recognizes named entities using a QA-based NER approach.
  2. A multilabel CRF, in which a threshold is defined to assign each token to a class.
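
The thresholding idea in step 2 can be illustrated with a minimal sketch. It assumes the CRF exposes a per-token probability for each class. The function name, the class names, and the 0.5 threshold are hypothetical and not taken from the repository; the notebook may apply the threshold differently.

```python
from typing import Dict, List


def labels_above_threshold(
    token_marginals: List[Dict[str, float]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """For each token, keep every class whose probability reaches the threshold."""
    result = []
    for marginals in token_marginals:
        # A token may receive several classes, which is what makes the output multilabel.
        chosen = sorted(label for label, p in marginals.items() if p >= threshold)
        result.append(chosen)
    return result


# Hypothetical per-token probabilities (illustrative values only).
example = [
    {"Disorder": 0.8, "Finding": 0.6, "O": 0.1},
    {"Disorder": 0.2, "Finding": 0.1, "O": 0.9},
]
multilabels = labels_above_threshold(example, threshold=0.5)
```

With these values the first token receives both `Disorder` and `Finding`, while the second receives only `O`.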

Method

Module 1 of BioNestedNER adopts a QA-based approach to Named Entity Recognition (NER), where a BERT model is fine-tuned to identify entity boundaries using tagging schemes like IOB2 or IOBES. Instead of returning entity spans as start–end indices (as in traditional QA), the model predicts structured labels for each token in the input.
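
To make the two tagging schemes concrete, here is a small sketch that converts an IOB2 sequence to IOBES. It is an illustration only. The helper name `iob2_to_iobes` and the tag `Entity` are assumptions, not code from this repository.

```python
from typing import List


def iob2_to_iobes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags (B-X, I-X, O) to IOBES tags (B-X, I-X, E-X, S-X, O)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, _, etype = tag.partition("-")
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag of the same type.
        continues = next_tag == f"I-{etype}"
        if prefix == "B":
            converted.append(f"B-{etype}" if continues else f"S-{etype}")
        else:  # prefix == "I"
            converted.append(f"I-{etype}" if continues else f"E-{etype}")
    return converted


tags = ["B-Entity", "I-Entity", "O", "B-Entity"]
iobes = iob2_to_iobes(tags)
```

IOBES marks the last token of a multi-token entity with `E-` and a single-token entity with `S-`, so entity boundaries are explicit on both sides.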

Each training instance includes a query indicating the entity type, followed by the sentence, separated by special tokens. The training process involves three steps: preprocessing the corpus into QA-NER format, generating contextual word representations using a pre-trained language model, and fine-tuning with a linear classification layer for token-level predictions.
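
The preprocessing step can be sketched as follows, assuming BERT-style `[CLS]` and `[SEP]` special tokens and one instance per entity type. The function name, the query wording, the entity types, and the choice to label query positions as `O` are all assumptions for illustration; the scripts in `src/` may format instances differently.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def build_qa_instance(
    tokens: List[str],
    spans: List[Span],
    entity_type: str,
    query: str,
) -> Tuple[List[str], List[str]]:
    """Build one QA-NER instance: query, sentence, and IOB2 labels for one type."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        if etype != entity_type:
            continue  # entities of other types are handled by their own query
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    query_tokens = query.split()
    input_tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + tokens + ["[SEP]"]
    # Assumption: query and special-token positions are labeled "O".
    input_labels = ["O"] * (len(query_tokens) + 2) + labels + ["O"]
    return input_tokens, input_labels


# Hypothetical sentence with a nested entity: "left arm" inside "pain in left arm".
tokens = ["pain", "in", "left", "arm"]
spans = [(0, 4, "Sign"), (2, 4, "BodyPart")]
qa_tokens, qa_labels = build_qa_instance(tokens, spans, "BodyPart", "body part")
```

Because each entity type gets its own query, nested entities of different types never compete for the same label position.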

Entity type is not predicted directly but is implied by the query. This setup allows flexible handling of multiple entity types by treating NER as a question-answering task.
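
As a sketch of how the type is recovered at inference time, the snippet below decodes type-free IOB2 predictions and attaches the type of the query that produced them. The function names and the entity types are hypothetical, and the predictions are hand-written rather than model output.

```python
from typing import Dict, List, Tuple


def decode_iob2(labels: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end_exclusive) spans from a sequence of B/I/O labels."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "I":
            if start is None:  # tolerate an I without a preceding B
                start = i
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def collect_entities(predictions: Dict[str, List[str]]) -> List[Tuple[int, int, str]]:
    """Merge per-query predictions; the entity type comes from the query key."""
    entities = []
    for etype, labels in predictions.items():
        for start, end in decode_iob2(labels):
            entities.append((start, end, etype))
    return sorted(entities)


# One label sequence per queried type, over the same four-token sentence.
predictions = {
    "Sign": ["B", "I", "I", "I"],
    "BodyPart": ["O", "O", "B", "I"],
}
entities = collect_entities(predictions)
```

Running one query per type and merging the results yields overlapping spans, which is how nested entities of different types can be returned together.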

BioNestedNER Method

Corpora

We performed experiments on the following corpora:

  • NestedClinBr, a novel corpus containing nested and discontinuous entities in Brazilian Portuguese clinical texts;
  • SemClinBr, a corpus containing multi-type entities in Brazilian Portuguese clinical texts;
  • Genia, a corpus containing nested and discontinuous entities in English biomedical texts;
  • RareDisease, a corpus containing nested and discontinuous entities in English clinical texts.

How to execute

QA-based approach

1 - Finding nested entities

# using NestedClinBr corpus

python src/run_ner.py

2 - Finding nested and discontiguous entities

# using a NestedClinBr version with discontiguous entities

python src/run_ner_desc.py

3 - Finding discontiguous and/or nested entities of the same type (an end-to-end model)

# using a Genia version with discontiguous entities

python src/run_ner_desc.py

CRF multilabel

Open multilabel-CRF.ipynb and run it in Jupyter Notebook or Google Colab.

Contributors

  • Claudia Moro
  • Elisa Terumi Rubel Schneider
  • Emerson Cabrera Paraiso
  • Paloma Martínez
  • Yohan Bonescki Gumiel

How to cite

*** soon ***
