Hugging Face token-classification pipeline for NER #13

@justinmadison

Summary

Implement a Hugging Face token-classification pipeline to perform NER on normalized news articles, with both a Celery task and a CLI wrapper calling a shared core library.

Motivation

  • Transformer-based NER models (e.g. dslim/bert-base-NER) deliver higher accuracy on diverse news text.
  • Organizes code into reusable functions, Celery tasks, and CLI commands.
  • Enables both scheduled/asynchronous processing and on-demand debugging.

Scope

None

Acceptance Criteria

  • Core functions load and run without errors
  • Celery task enqueues and processes an article end-to-end
  • CLI runs successfully and outputs entity count
  • Unit tests cover core, task, and CLI layers and pass in CI
  • README clearly shows both execution paths (Celery & CLI)

Additional Context

  • Dependencies:
    • Articles normalized and stored in the database
    • Database connection module (nlp/db.py) available
    • Redis (or other) broker configured for Celery

Architecture

  1. Core logic (nlp/core.py)
    • run_ner_hf(text: str) → List[Entity]
    • process_article(article_id: str, db) → List[Entity]
  2. Celery task (nlp/tasks.py)
    • @app.task def ner_task(article_id: str), which calls process_article(article_id, db)
    • Hook into the Celery Beat schedule for periodic batch runs, or enqueue on demand via ner_task.delay(article_id)
  3. CLI wrapper (nlp/cli.py)
    • Thin click or argparse command that calls process_article() and prints summary
    • Executable via python -m nlp.cli --article-id=<id>
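
A minimal sketch of how the core layer could look, so that the Celery task and the CLI both stay thin. Several things here are assumptions and not part of the issue: the fields of Entity, the optional ner argument (added so tests can pass a stub instead of loading the model), and the db methods get_article_text() and save_entities(), which are hypothetical placeholders for whatever nlp/db.py actually exposes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class Entity:
    # Assumed fields; the issue names the type but does not define it.
    text: str
    label: str
    score: float
    start: int
    end: int


# Any callable mapping text to a list of dicts, as a Hugging Face pipeline does.
NerPipeline = Callable[[str], List[Dict[str, Any]]]

_pipeline: Optional[NerPipeline] = None


def get_pipeline() -> NerPipeline:
    """Build the Hugging Face pipeline on first use and cache it."""
    global _pipeline
    if _pipeline is None:
        # Imported lazily because importing transformers and loading the model is slow.
        from transformers import pipeline

        _pipeline = pipeline(
            "token-classification",
            model="dslim/bert-base-NER",
            aggregation_strategy="simple",  # merge word pieces into whole spans
        )
    return _pipeline


def run_ner_hf(text: str, ner: Optional[NerPipeline] = None) -> List[Entity]:
    """Run NER on text; `ner` is an optional stand-in pipeline for tests."""
    if not text.strip():
        return []
    if ner is None:
        ner = get_pipeline()
    return [
        Entity(
            text=r["word"],
            label=r["entity_group"],
            score=float(r["score"]),
            start=int(r["start"]),
            end=int(r["end"]),
        )
        for r in ner(text)
    ]


def process_article(article_id: str, db, ner: Optional[NerPipeline] = None) -> List[Entity]:
    """Load one article, extract entities, and store them."""
    text = db.get_article_text(article_id)  # hypothetical db method
    entities = run_ner_hf(text, ner)
    db.save_entities(article_id, entities)  # hypothetical db method
    return entities
```

With this shape, ner_task would only need to call process_article(article_id, db), and the tests for the core and task layers could inject a fake pipeline and a fake db without downloading the model.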

Tasks

  1. Add dependencies
    • Install transformers, torch, celery, click (update /nlp/requirements.txt)
  2. Core module
    • Create /nlp/core.py with run_ner_hf() and process_article() as described
  3. Celery integration
    • Create /nlp/tasks.py with a ner_task Celery task
    • Configure broker URL and include a sample beat schedule entry in project docs
  4. CLI command
    • Create /nlp/cli.py with a ner_cli command (using click or argparse)
    • Add entry to /nlp/README.md showing both Celery and CLI usage
  5. Unit tests
    • In /nlp/tests/test_core.py, test run_ner_hf() on sample text
    • In /nlp/tests/test_tasks.py, mock db to verify ner_task calls process_article()
    • In /nlp/tests/test_cli.py, invoke CLI with a dummy --article-id and assert exit code 0
  6. Documentation
    • Update /nlp/README.md with:
      • Installation steps
      • How to run ner_task via Celery
      • How to invoke python -m nlp.cli
      • Sample Celery Beat schedule snippet
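
Tasks 4 and 5 could be approached along these lines. This is a sketch using argparse, not a settled design: the process parameter is an assumption added so a test can pass a stub, and the default wiring (importing process_article from nlp.core and a db object from nlp.db) is hypothetical until those modules exist.

```python
import argparse
from typing import Callable, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="python -m nlp.cli",
        description="Run NER on one stored article and print the entity count.",
    )
    parser.add_argument(
        "--article-id",
        required=True,
        help="ID of the normalized article to process",
    )
    return parser


def main(
    argv: Optional[Sequence[str]] = None,
    process: Optional[Callable[[str], list]] = None,
) -> int:
    """Parse arguments, process one article, print a summary, return an exit code."""
    args = build_parser().parse_args(argv)
    if process is None:
        # Assumed wiring: neither nlp.core nor a `db` object in nlp.db is
        # confirmed by the issue; adjust to the real modules.
        from nlp.core import process_article
        from nlp.db import db

        def process(article_id: str) -> list:
            return process_article(article_id, db)

    entities = process(args.article_id)
    print(f"Article {args.article_id}: {len(entities)} entities found")
    return 0
```

In nlp/cli.py the module would end with an `if __name__ == "__main__":` guard that calls `sys.exit(main())`, so that python -m nlp.cli --article-id=<id> works. A test in test_cli.py could call main(["--article-id", "dummy"], process=stub) and assert that it returns 0.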

Projects

Status

In progress
