🧠 Universal AI Document Extractor & FHIR Generator

A production-ready AI backend and beautiful single-page dashboard built to instantly extract, classify, and intelligently structure data from unstructured files (Images, PDFs, Word Documents).

The system also includes a dedicated Medical Intelligence Engine that maps parsed clinical data into HL7 FHIR R4 compliant JSON (Patient, Condition, MedicationStatement, and Observation resources).

✨ Core Capabilities

Feature	Detail
🗂️ Multi-Format Processing	Seamlessly handles `JPG`, `PNG`, `PDF`, and `DOCX` files.
👁️ Robust OCR Engine	Powered by high-speed Tesseract OCR with OpenCV noise reduction. Native `pypdf` text extraction fallback for digitally crafted PDFs.
🏷️ Smart Classification	Automatically categorises files (Invoices, Medical Reports, Prescriptions).
🔍 Hybrid NLP Extraction	Leverages deterministic Regex (phones, dates, currency) combined with pre-cached spaCy NER models for entities like Organizations and People.
⚕️ FHIR R4 Healthcare Standard	Translates unstructured medical text straight into interoperable FHIR bundles via specialized mapping rules.
🔐 Enterprise Security	JWT-based user authentication, role-based access, and secure bcrypt password hashing.
📊 Premium Dark UI Dashboard	Interactive SPA featuring drag-and-drop uploads, real-time polling, status indicators, and a clean side-by-side JSON data explorer.

🚀 Quick Start Guide

1. Prerequisites

Python 3.10+
Tesseract OCR Framework:
- Requires a local installation of the Tesseract C++ binary.
- Download for Windows: UB-Mannheim Tesseract Installers.
- Note your exact installation directory (default is usually C:\Program Files\Tesseract-OCR\tesseract.exe).

2. Automated Setup

This project includes a setup script that creates your virtual environment, installs all Python requirements, and initialises the SQLite database locally.

# Clone the repository and navigate into it
# Run the automated bootstrap script:
python setup.py

3. Provide Your Custom Configurations

The setup script will generate a local .env file for you based on the template. Open .env and make sure the Tesseract Path is accurate for your machine:

TESSERACT_CMD="C:\Program Files\Tesseract-OCR\tesseract.exe"
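
As an illustration of how the configured path could be checked at startup, here is a minimal sketch. The helper name `resolve_tesseract_cmd` is hypothetical and is not taken from this project's code; it only assumes that the `TESSERACT_CMD` value has been loaded into a mapping such as `os.environ`.

```python
import os
import shutil
from pathlib import Path


def resolve_tesseract_cmd(env=None):
    """Return a usable Tesseract binary path, or None if none is found.

    Hypothetical helper: prefers TESSERACT_CMD, then falls back to PATH.
    """
    env = os.environ if env is None else env
    # Strip whitespace and the surrounding quotes used in the .env example.
    configured = env.get("TESSERACT_CMD", "").strip().strip('"')
    if configured and Path(configured).is_file():
        return configured
    # Fall back to whatever `tesseract` resolves to on PATH (may be None).
    return shutil.which("tesseract")
```

If the function returns `None`, the path in .env most likely needs correcting before OCR can run.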

4. Enable the Natural Language Processor (NER)

To allow the AI to extract complex named entities (organizations, locations, personal names), download the English language model inside your new virtual environment:

# Windows
venv\Scripts\python -m spacy download en_core_web_sm

# Mac / Linux
source venv/bin/activate
python -m spacy download en_core_web_sm

5. Start the Application

Start the Uvicorn ASGI web server:

# Ensure your virtual environment is active
python run.py

🎉 Open http://localhost:8000 in a web browser to view your shiny new DataExtractor dashboard.

📡 Essential API Routes

A fully interactive Swagger UI is available at /api/docs.

Method	Route	Purpose
`POST`	`/api/upload`	Process a document asynchronously. Returns a tracking `id`.
`GET`	`/api/result/{id}`	Long-poll endpoint to retrieve parsed JSON data.
`GET`	`/api/documents`	Fetch a list of all documents belonging to a user.
`POST`	`/api/fhir`	Instantly convert a raw string of text into a FHIR R4 Bundle.
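
A client typically uploads a document, then polls `/api/result/{id}` until the parsed data is ready. The sketch below shows only the polling half. The response format and authentication headers are not documented here, so the HTTP call is left to a caller-supplied `fetch` function; `result_url`, `poll_result`, and the convention that `fetch` returns `None` while processing are all assumptions made for illustration.

```python
import time

BASE_URL = "http://localhost:8000"  # default address from the Quick Start


def result_url(doc_id, base_url=BASE_URL):
    """Build the URL for GET /api/result/{id}."""
    return f"{base_url.rstrip('/')}/api/result/{doc_id}"


def poll_result(fetch, doc_id, interval=1.0, max_attempts=30, sleep=time.sleep):
    """Call fetch(url) until it returns a payload or attempts run out.

    `fetch` is hypothetical: it should return the parsed JSON dict once the
    result is ready and None while the document is still being processed.
    """
    url = result_url(doc_id)
    for _ in range(max_attempts):
        payload = fetch(url)
        if payload is not None:
            return payload
        sleep(interval)
    raise TimeoutError(f"no result for document {doc_id} after {max_attempts} attempts")
```

Passing `fetch` in keeps the loop independent of any particular HTTP library and of how the JWT is attached to each request.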

🧾 Sample Extractions

Standard Extraction Result

{
  "raw_text": "Invoice Number: 9942. ABC Corp. Total Amount: $1,500.00...",
  "entities": {
    "organization": "ABC Corp",
    "date": "2025-01-15",
    "money": "$1,500.00"
  },
  "doc_class": "invoice",
  "confidence": 0.92
}
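
The `date` and `money` values in a result like this are the kind of fields the deterministic regex layer is described as handling. The following is a rough sketch of that idea only; the patterns and the name `extract_regex_entities` are illustrative assumptions, not the project's actual rules, and they cover just ISO dates, dollar amounts, and one US-style phone format.

```python
import re

# Illustrative patterns only; real documents need broader coverage.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def extract_regex_entities(text):
    """Return every match for each pattern, keyed by entity type."""
    return {
        name: [match.group(0) for match in pattern.finditer(text)]
        for name, pattern in PATTERNS.items()
    }


sample = "Invoice Number: 9942. Date: 2025-01-15. Total Amount: $1,500.00. Call 555-123-4567."
print(extract_regex_entities(sample))
```

Entities such as organization names have no fixed shape, which is why they are left to the spaCy NER model rather than to patterns like these.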

FHIR R4 Healthcare Interoperability Payload

Directly generated from text mapping: "Patient: John Doe. Diagnosis: Diabetes."

{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {
      "resource": {
        "resourceType": "Patient",
        "name": [{"family": "Doe", "given": ["John"]}]
      }
    },
    {
      "resource": {
        "resourceType": "Condition",
        "code": {"text": "Diabetes"},
        "subject": {"reference": "Patient/John"}
      }
    }
  ]
}
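
To make the mapping step concrete, here is a minimal sketch of how "Patient:" and "Diagnosis:" phrases could be turned into a Bundle of this shape. It is not the project's `fhir_mapper.py`: the function name `text_to_bundle` and the two regexes are hypothetical, and unlike the sample above it assigns the Patient an explicit generated `id` so that the Condition's `subject` reference points at it. The output is not validated against the FHIR specification.

```python
import re
import uuid


def text_to_bundle(text):
    """Map 'Patient: <name>' and 'Diagnosis: <condition>' phrases to a Bundle dict."""
    entries = []
    patient_ref = None

    patient = re.search(r"Patient:\s*([^.\n]+)", text)
    parts = patient.group(1).split() if patient else []
    if parts:
        patient_id = str(uuid.uuid4())  # assumption: generated id, not from the source text
        patient_ref = f"Patient/{patient_id}"
        entries.append({
            "resource": {
                "resourceType": "Patient",
                "id": patient_id,
                # Naive split: last token is the family name, the rest are given names.
                "name": [{"family": parts[-1], "given": parts[:-1]}],
            }
        })

    for match in re.finditer(r"Diagnosis:\s*([^.\n]+)", text):
        condition = {
            "resourceType": "Condition",
            "code": {"text": match.group(1).strip()},
        }
        if patient_ref:
            condition["subject"] = {"reference": patient_ref}
        entries.append({"resource": condition})

    return {"resourceType": "Bundle", "type": "collection", "entry": entries}
```

Calling `text_to_bundle("Patient: John Doe. Diagnosis: Diabetes.")` yields one Patient entry followed by one Condition entry.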

🏗️ Technical Architecture Details

The pipeline is split into modules under the /app/pipeline directory so that individual stages (for example, the set of FHIR resources produced or the cv2 blur intensity used in preprocessing) can be changed or swapped independently:

preprocessor.py: Runs CLAHE contrast enhancement, skew correction, and noise reduction on images via cv2 before OCR.
ocr_engine.py: Interfaces with Tesseract and pypdf. Uses a global singleton (_get_tesseract()) so the engine is not reloaded on every request.
extractor.py: Combines a cached spaCy en_core_web_sm pipeline with deterministic regex patterns.
medical_extractor.py & fhir_mapper.py: Apply clinical keyword heuristics and map the results to FHIR R4 JSON in plain Python.

Authored for the Next Generation of Medical Data Parsing.
