A production-ready AI backend and beautiful single-page dashboard built to instantly extract, classify, and intelligently structure data from unstructured files (Images, PDFs, Word Documents).
Crucially, the system features a dedicated Medical Intelligence Engine that maps parsed clinical data directly into HL7 FHIR R4-compliant JSON (Patient, Condition, MedicationStatement, and Observation resources).
| Feature | Detail |
|---|---|
| 🗂️ Multi-Format Processing | Seamlessly handles JPG, PNG, PDF, and DOCX files. |
| 👁️ Robust OCR Engine | Powered by high-speed Tesseract OCR with OpenCV noise reduction. Native pypdf text extraction fallback for digitally crafted PDFs. |
| 🏷️ Smart Classification | Automatically categorises files (Invoices, Medical Reports, Prescriptions). |
| 🔍 Hybrid NLP Extraction | Leverages deterministic Regex (phones, dates, currency) combined with pre-cached spaCy NER models for entities like Organizations and People. |
| ⚕️ FHIR R4 Healthcare Standard | Translates unstructured medical text straight into interoperable FHIR bundles via specialized mapping rules. |
| 🔐 Enterprise Security | JWT-based user authentication, role-based access, and secure bcrypt password hashing. |
| 📊 Premium Dark UI Dashboard | Interactive SPA featuring drag-and-drop uploads, real-time polling, status indicators, and a clean side-by-side JSON data explorer. |
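
The deterministic half of the hybrid extraction can be pictured as a small table of compiled patterns. The sketch below is only an illustration: the expressions and the `extract_with_regex` helper are assumptions, since the project's actual patterns are not shown here.

```python
import re

# Illustrative patterns only; the project's real expressions may differ.
PATTERNS = {
    # Country code followed by a 3-3-4 digit grouping, e.g. +1 555 123 4567
    "phone": re.compile(r"\+?\d{1,3}[\s-]\d{3}[\s-]\d{3}[\s-]\d{4}"),
    # ISO-style dates, e.g. 2025-01-15
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # Currency symbol, optional thousands separators, optional cents
    "money": re.compile(r"[$€£]\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
}


def extract_with_regex(text: str) -> dict[str, list[str]]:
    """Return every match for each pattern, keyed by entity label."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


sample = (
    "Invoice Number: 9942. ABC Corp. Date: 2025-01-15. "
    "Total Amount: $1,500.00. Call +1 555 123 4567."
)
print(extract_with_regex(sample))
```

Entities that regular expressions handle poorly, such as organization and person names, are left to the spaCy NER stage.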
- Python 3.10+
- Tesseract OCR Framework:
  - Requires a local installation of the Tesseract C++ binary.
  - Download for Windows: UB-Mannheim Tesseract Installers.
  - Note your exact installation directory (default is usually `C:\Program Files\Tesseract-OCR\tesseract.exe`).
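
It can help to confirm that the Tesseract binary is actually reachable before going further. A minimal sketch, assuming the path is supplied through the `TESSERACT_CMD` variable that this project's .env file uses, with a fallback to the system PATH; `resolve_tesseract_cmd` is a hypothetical helper, not part of the project:

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_tesseract_cmd(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a usable Tesseract path, or None if none can be found."""
    # Strip whitespace and surrounding quotes, which .env values often carry.
    configured = env.get("TESSERACT_CMD", "").strip().strip('"')
    if configured and os.path.isfile(configured):
        return configured
    # Fall back to whatever `tesseract` resolves to on PATH (None if absent).
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = resolve_tesseract_cmd()
    print(path or "Tesseract not found; check TESSERACT_CMD or your PATH.")
```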
This project includes a one-click setup script that builds your virtual environment, installs all Python requirements, and initialises the SQLite database locally.

    # Clone the repository and navigate into it
    # Run the automated bootstrap script:
    python setup.py

The setup script will generate a local .env file for you based on the template.
Open .env and make sure the Tesseract path is accurate for your machine:

    TESSERACT_CMD="C:\Program Files\Tesseract-OCR\tesseract.exe"

To allow the AI to extract complex named entities (organizations, locations, specific names), download the English language model inside your new virtual environment:

    # Windows
    venv\Scripts\python -m spacy download en_core_web_sm
    # Mac / Linux
    source venv/bin/activate
    python -m spacy download en_core_web_sm

Boot up the Uvicorn standard ASGI web server:

    # Ensure your virtual environment is active
    python run.py

🎉 Open http://localhost:8000 in a web browser to view your shiny new DataExtractor dashboard.
A fully interactive Swagger UI is available at /api/docs.
| Method | Route | Purpose |
|---|---|---|
| POST | `/api/upload` | Process a document asynchronously. Returns a tracking ID. |
| GET | `/api/result/{id}` | Long-poll endpoint to retrieve parsed JSON data. |
| GET | `/api/documents` | Fetch a list of all documents belonging to a user. |
| POST | `/api/fhir` | Instantly convert a raw string of text into a FHIR R4 Bundle. |
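
Because uploads are processed asynchronously, a client typically polls `/api/result/{id}` until the document is ready. The sketch below shows one way to do that with the standard library. It rests on assumptions not stated in this README: that the JWT is sent as an `Authorization: Bearer` header, and that the result payload carries a `status` field whose value is `"processing"` while work is pending. `http_get_json` and `wait_for_result` are hypothetical helper names.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # address used in the setup steps


def http_get_json(path: str, token: str) -> dict:
    """GET a JSON payload from the API (bearer-token header is an assumption)."""
    request = urllib.request.Request(
        BASE_URL + path, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_result(doc_id, fetch, interval=1.0, max_attempts=30, sleep=time.sleep):
    """Poll /api/result/{id} until the assumed "status" field leaves "processing"."""
    for _ in range(max_attempts):
        payload = fetch(f"/api/result/{doc_id}")
        if payload.get("status") != "processing":
            return payload
        sleep(interval)
    raise TimeoutError(f"Document {doc_id} not ready after {max_attempts} attempts")
```

Against a running server this could be called as `wait_for_result(doc_id, lambda path: http_get_json(path, token))`. Passing the fetch function in keeps the polling loop independent of the HTTP layer, so it can be exercised without a server.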

    {
      "raw_text": "Invoice Number: 9942. ABC Corp. Total Amount: $1,500.00...",
      "entities": {
        "organization": "ABC Corp",
        "date": "2025-01-15",
        "money": "$1,500.00"
      },
      "doc_class": "invoice",
      "confidence": 0.92
    }

Directly generated from text mapping: "Patient: John Doe. Diagnosis: Diabetes."

    {
      "resourceType": "Bundle",
      "type": "collection",
      "entry": [
        {
          "resource": {
            "resourceType": "Patient",
            "name": [{"family": "Doe", "given": ["John"]}]
          }
        },
        {
          "resource": {
            "resourceType": "Condition",
            "code": {"text": "Diabetes"},
            "subject": {"reference": "Patient/John"}
          }
        }
      ]
    }

The system is modularized under the /app/pipeline directory so that individual components can be changed independently, for example which FHIR resources are emitted or how strongly the cv2 preprocessing step blurs an image:
- `preprocessor.py`: Intercepts images to run CLAHE contrast enhancement, skew correction, and noise reduction via `cv2` before OCR.
- `ocr_engine.py`: Interacts directly with Tesseract and `pypdf`. Uses global singletons (`_get_tesseract()`) to avoid reload latency on every request.
- `extractor.py`: Uses cached spaCy `en_core` pipelines combined with deterministic regex patterns.
- `medical_extractor.py` & `fhir_mapper.py`: Clinical keyword heuristics mapped to standard FHIR R4 JSON in plain Python.
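
To make the keyword-heuristic idea concrete, here is a minimal sketch that reproduces the sample Bundle shown earlier from the text "Patient: John Doe. Diagnosis: Diabetes." It is not the project's `fhir_mapper.py`: the `text_to_bundle` function, the two keyword patterns, and the way the subject reference is built are all assumptions chosen to mirror that sample output.

```python
import re


def text_to_bundle(text: str) -> dict:
    """Map "Patient: ..." and "Diagnosis: ..." phrases to a FHIR R4 collection Bundle."""
    entries = []
    subject = None

    # Keyword heuristic: everything after "Patient:" up to the next period or newline.
    patient = re.search(r"Patient:\s*([^.\n]+)", text)
    name_parts = patient.group(1).split() if patient else []
    if name_parts:
        *given, family = name_parts  # last token is treated as the family name
        entries.append({
            "resource": {
                "resourceType": "Patient",
                "name": [{"family": family, "given": given}],
            }
        })
        # Mirrors the sample output: the reference is built from the first name token.
        subject = {"reference": f"Patient/{name_parts[0]}"}

    # One Condition resource per "Diagnosis:" phrase.
    for match in re.finditer(r"Diagnosis:\s*([^.\n]+)", text):
        condition = {
            "resourceType": "Condition",
            "code": {"text": match.group(1).strip()},
        }
        if subject:
            condition["subject"] = subject
        entries.append({"resource": condition})

    return {"resourceType": "Bundle", "type": "collection", "entry": entries}
```

Text without any recognised keyword yields a Bundle with an empty `entry` list rather than an error, which keeps the mapper safe to call on arbitrary OCR output.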
Authored for the Next Generation of Medical Data Parsing.