AILAB-CEFET-RJ/nerdd

NERDD

Named entity recognition (NER) pipeline for the Disque Denúncia context, organized into three subpipelines: training, calibration, and pseudolabelling.

Current Structure

  • src/: main source code.
  • src/base_model_training/: base model training and evaluation with nested cross-validation (CV).
  • src/pseudolabelling/: pseudolabel generation, score-based split, and refit.
  • src/calibration/: fits and applies reusable probability calibrators to the base model's scores.
  • src/tools/: auxiliary utilities.
  • docs/: operational and architectural documentation.
  • data/: training, test, and calibration datasets.
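
The score-based split listed under src/pseudolabelling/ can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the Prediction schema, the split_by_score helper, the label names, and the 0.9 threshold are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One predicted entity with a confidence score (hypothetical schema)."""
    text: str
    label: str
    score: float


def split_by_score(predictions, threshold=0.9):
    """Partition predictions into (accepted, rejected) by confidence score.

    Predictions scoring at or above `threshold` are kept as pseudolabels;
    the rest are set aside, for example for manual review.
    """
    accepted = [p for p in predictions if p.score >= threshold]
    rejected = [p for p in predictions if p.score < threshold]
    return accepted, rejected


# Illustrative predictions; labels and scores are made up.
preds = [
    Prediction("Rua A", "LOCATION", 0.97),
    Prediction("João", "PERSON", 0.55),
    Prediction("Rio", "LOCATION", 0.90),
]
accepted, rejected = split_by_score(preds, threshold=0.9)
```

In this sketch only the accepted group would be passed on to the refit step. The actual split criteria used by the project are documented in docs/PIPELINE_OVERVIEW.md.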

Prerequisites

  • Git
  • Python 3.11+
  • pip

Quick Setup

git clone https://github.com/MLRG-CEFET-RJ/nerdd.git
cd nerdd
cd src
pip install -r requirements.txt

Next Steps

  • Detailed installation: docs/INSTALL.md
  • Runbook: docs/RUNBOOK.md
  • Pipeline overview: docs/PIPELINE_OVERVIEW.md
  • Architecture: docs/ARCHITECTURE.md
  • Architectural decisions: docs/ARCHITECTURAL_DECISIONS.md
  • Migration: docs/MIGRATION.md

Canonical Flow

  1. Train the base model in src/base_model_training/.
  2. Build a labeled calibration subset and fit a reusable calibrator artifact in src/calibration/.
  3. Run large-corpus prediction in src/pseudolabelling/, optionally applying the calibrator during inference.
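
Steps 2 and 3 can be sketched as fitting a calibrator on a small labeled subset, storing it as an artifact, and applying it to raw scores at inference time. The README does not say which calibration method the project uses. Histogram binning is swapped in here because it fits in a few lines, and the function names, the JSON artifact format, and the data are all hypothetical.

```python
import json


def fit_binning_calibrator(scores, labels, n_bins=5):
    """Fit a histogram-binning calibrator on scores assumed to lie in [0, 1].

    Each bin stores the fraction of correct predictions among the
    calibration examples whose raw score fell into that bin.
    Empty bins fall back to the bin midpoint.
    """
    hits = [0] * n_bins
    counts = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        counts[b] += 1
        hits[b] += y
    values = [
        hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]
    return {"n_bins": n_bins, "values": values}


def apply_calibrator(calibrator, score):
    """Map a raw score to the calibrated value stored for its bin."""
    n_bins = calibrator["n_bins"]
    b = min(int(score * n_bins), n_bins - 1)
    return calibrator["values"][b]


# Labeled calibration subset: raw score and whether the prediction was correct.
cal_scores = [0.95, 0.92, 0.85, 0.81, 0.45, 0.41, 0.15]
cal_labels = [1, 1, 1, 0, 0, 1, 0]

calibrator = fit_binning_calibrator(cal_scores, cal_labels, n_bins=5)

# Round-trip through JSON to mimic saving and reloading a reusable artifact.
calibrator = json.loads(json.dumps(calibrator))

# During large-corpus prediction, raw scores are replaced by calibrated ones.
calibrated = apply_calibrator(calibrator, 0.88)
```

Keeping the calibrator as a separate artifact means it can be refit on a new labeled subset without retraining the base model, which matches the README's description of the calibrator as reusable.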

Contributing

To contribute fixes or improvements, open an issue or a pull request.
