NER pipeline for the Disque Denúncia context, organized into training, calibration, and pseudolabelling subpipelines.
- src/: main source code.
- src/base_model_training/: base training and evaluation with nested CV.
- src/pseudolabelling/: pseudolabel generation, score-based split, and refit.
- src/calibration/: fit and apply reusable probability calibrators for the base model scores.
- src/tools/: auxiliary utilities.
- docs/: operational and architectural documentation.
- data/: training, test, and calibration datasets.
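For readers unfamiliar with nested CV, the idea behind the evaluation in src/base_model_training/ can be sketched as follows. This is a minimal, standard-library illustration, not the repository's code: the function names `k_fold_indices` and `nested_cv_splits`, the contiguous (unshuffled) folds, and the fold counts are all assumptions made for the example.

```python
from collections.abc import Iterator


def k_fold_indices(indices: list[int], k: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train, held_out) index lists for k contiguous folds."""
    n = len(indices)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


def nested_cv_splits(n_samples: int, outer_k: int, inner_k: int):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    The inner splits are drawn only from outer_train, so the outer test
    fold is never seen while hyperparameters are being selected.
    """
    all_indices = list(range(n_samples))
    for outer_train, outer_test in k_fold_indices(all_indices, outer_k):
        inner_splits = list(k_fold_indices(outer_train, inner_k))
        yield outer_train, outer_test, inner_splits


# Hypothetical sizes, chosen only to keep the example small.
splits = list(nested_cv_splits(n_samples=10, outer_k=5, inner_k=2))
```

The inner loop would be used for model selection and the outer loop for an unbiased performance estimate; a real run would normally shuffle or stratify the indices first.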
- Git
- Python 3.11+
- pip
git clone https://github.com/MLRG-CEFET-RJ/nerdd.git
cd nerdd
cd src
pip install -r requirements.txt
- Detailed installation: docs/INSTALL.md
- Runbook: docs/RUNBOOK.md
- Pipeline overview: docs/PIPELINE_OVERVIEW.md
- Architecture: docs/ARCHITECTURE.md
- Architectural decisions: docs/ARCHITECTURAL_DECISIONS.md
- Migration: docs/MIGRATION.md
- Train the base model in src/base_model_training/.
- Build a labeled calibration subset and fit a reusable calibrator artifact in src/calibration/.
- Run large-corpus prediction in src/pseudolabelling/, optionally applying the calibrator during inference.
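The calibrate-then-split part of this workflow can be illustrated with a small standard-library sketch. It is not the repository's implementation: histogram binning is used here only as one simple calibration technique, and `BinningCalibrator`, `fit_calibrator`, `split_by_score`, the example scores, and the 0.6 threshold are all hypothetical. Raw scores are assumed to lie in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinningCalibrator:
    """Maps a raw score in [0, 1] to the empirical accuracy of its bin."""
    bin_accuracy: tuple[float, ...]

    def apply(self, score: float) -> float:
        n_bins = len(self.bin_accuracy)
        # Clamp so that score == 1.0 falls into the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        return self.bin_accuracy[index]


def fit_calibrator(scores: list[float], correct: list[bool], n_bins: int = 5) -> BinningCalibrator:
    """Fit on a labeled calibration subset: per-bin fraction of correct predictions."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for score, is_correct in zip(scores, correct):
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
        hits[index] += int(is_correct)
    # Empty bins fall back to the bin midpoint, leaving scores roughly unchanged.
    accuracy = tuple(
        hits[i] / counts[i] if counts[i] else (i + 0.5) / n_bins
        for i in range(n_bins)
    )
    return BinningCalibrator(accuracy)


def split_by_score(predictions, calibrator, threshold):
    """Split (item_id, raw_score) pairs by calibrated score against a threshold."""
    accepted, rejected = [], []
    for item_id, raw_score in predictions:
        calibrated = calibrator.apply(raw_score)
        target = accepted if calibrated >= threshold else rejected
        target.append((item_id, calibrated))
    return accepted, rejected


# Hypothetical labeled calibration subset: base model scores and correctness.
calibration_scores = [0.95, 0.92, 0.90, 0.55, 0.50, 0.45, 0.15, 0.10]
calibration_correct = [True, True, False, True, False, False, False, False]
calibrator = fit_calibrator(calibration_scores, calibration_correct)

# Hypothetical large-corpus predictions, split with the fitted calibrator.
corpus_predictions = [("doc-a", 0.93), ("doc-b", 0.52), ("doc-c", 0.12)]
accepted, rejected = split_by_score(corpus_predictions, calibrator, threshold=0.6)
```

Because the calibrator is a small immutable object, it can be fitted once and reused across inference runs. In this sketch the accepted items would feed the refit step, which is not shown.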
Open an issue or PR with fixes and improvements.