codedbyasim/DataExtractor-AI

🧠 Universal AI Document Extractor & FHIR Generator

FastAPI · Tesseract OCR · spaCy · SQLite · FHIR R4

A production-ready AI backend and single-page dashboard that extracts, classifies, and structures data from unstructured files (images, PDFs, Word documents).

The system also includes a dedicated Medical Intelligence Engine that maps parsed clinical data into HL7 FHIR R4 compliant JSON (Patient, Condition, MedicationStatement, and Observation resources).


✨ Core Capabilities

| Feature | Detail |
| --- | --- |
| 🗂️ Multi-Format Processing | Seamlessly handles JPG, PNG, PDF, and DOCX files. |
| 👁️ Robust OCR Engine | Powered by high-speed Tesseract OCR with OpenCV noise reduction. Native pypdf text extraction fallback for digitally crafted PDFs. |
| 🏷️ Smart Classification | Automatically categorises files (Invoices, Medical Reports, Prescriptions). |
| 🔍 Hybrid NLP Extraction | Leverages deterministic Regex (phones, dates, currency) combined with pre-cached spaCy NER models for entities like Organizations and People. |
| ⚕️ FHIR R4 Healthcare Standard | Translates unstructured medical text straight into interoperable FHIR bundles via specialized mapping rules. |
| 🔐 Enterprise Security | JWT-based user authentication, role-based access, and secure bcrypt password hashing. |
| 📊 Premium Dark UI Dashboard | Interactive SPA featuring drag-and-drop uploads, real-time polling, status indicators, and a clean side-by-side JSON data explorer. |
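
The deterministic half of the hybrid extraction can be pictured with a few regular expressions. This is a minimal sketch, not the project's code: the patterns, the `PATTERNS` table, and the `extract_with_regex` helper are all illustrative assumptions, and the sample text is made up for the example.

```python
import re

# Illustrative patterns only; the project's real expressions are not shown in this README.
PATTERNS = {
    # e.g. "+1 555-010-9999" (country code, then 3-3-4 digit groups)
    "phone": re.compile(r"\+?\d{1,3}[\s-]\d{3}[\s-]\d{3}[\s-]\d{4}"),
    # ISO-style dates such as "2025-01-15"
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # Currency symbol, optional thousands separators, optional cents
    "money": re.compile(r"[$€£]\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
}


def extract_with_regex(text: str) -> dict[str, list[str]]:
    """Return every match for each pattern, keyed by entity label."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


sample = (
    "Invoice Number: 9942. ABC Corp. Date: 2025-01-15. "
    "Total Amount: $1,500.00. Call +1 555-010-9999."
)
result = extract_with_regex(sample)
print(result)
```

Entities that have no fixed shape, such as the organization name "ABC Corp", are the part the README assigns to spaCy NER.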

🚀 Quick Start Guide

1. Prerequisites

  • Python 3.10+
  • Tesseract OCR Framework:
    • Requires a local installation of the Tesseract C++ binary.
    • Download for Windows: UB-Mannheim Tesseract Installers.
    • Note your exact installation directory (default is usually C:\Program Files\Tesseract-OCR\tesseract.exe).

2. Automated Setup

This project includes a one-step setup script that creates your virtual environment, installs all Python requirements, and initialises the SQLite database locally.

# Clone the repository and navigate into it
# Run the automated bootstrap script:
python setup.py

3. Provide Your Custom Configurations

The setup script generates a local .env file from the template. Open .env and check that the Tesseract path matches your installation:

TESSERACT_CMD="C:\Program Files\Tesseract-OCR\tesseract.exe"

4. Enable the Natural Language Processor (NER)

To extract named entities (organizations, locations, personal names), download the spaCy English language model inside your new virtual environment:

# Windows
venv\Scripts\python -m spacy download en_core_web_sm

# Mac / Linux
source venv/bin/activate
python -m spacy download en_core_web_sm

5. Start the Application

Start the Uvicorn ASGI web server:

# Ensure your virtual environment is active
python run.py

🎉 Open http://localhost:8000 in a web browser to view your shiny new DataExtractor dashboard.


📡 Essential API Routes

A fully interactive Swagger UI is available at /api/docs.

| Method | Route | Purpose |
| --- | --- | --- |
| POST | /api/upload | Process a document asynchronously. Returns a tracking ID. |
| GET | /api/result/{id} | Long-poll endpoint to retrieve parsed JSON data. |
| GET | /api/documents | Fetch a list of all documents belonging to a user. |
| POST | /api/fhir | Instantly convert a raw string of text into a FHIR R4 Bundle. |
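
Because uploads are processed asynchronously, a client has to call `/api/result/{id}` until the document is ready. The loop below is a minimal sketch of that client side. The `wait_for_result` helper and its parameters are hypothetical, and the HTTP call is passed in as a `fetch` callable so that no response shape is assumed inside the helper itself.

```python
import time
from typing import Any, Callable


def wait_for_result(
    fetch: Callable[[], dict[str, Any]],
    is_done: Callable[[dict[str, Any]], bool],
    interval: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch() until is_done(payload) is true or the attempts run out."""
    for attempt in range(max_attempts):
        payload = fetch()
        if is_done(payload):
            return payload
        # Do not sleep after the final attempt; fail straight away instead.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"no finished result after {max_attempts} attempts")


# Usage with canned responses; the "status" field is an assumed response key.
responses = iter([{"status": "processing"}, {"status": "done", "doc_class": "invoice"}])
final = wait_for_result(
    fetch=lambda: next(responses),
    is_done=lambda payload: payload.get("status") == "done",
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(final)
```

In a real client, `fetch` would wrap an authenticated GET request to `/api/result/{id}` and decode the JSON body.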

🧾 Sample Extractions

Standard Extraction Result

{
  "raw_text": "Invoice Number: 9942. ABC Corp. Total Amount: $1,500.00...",
  "entities": {
    "organization": "ABC Corp",
    "date": "2025-01-15",
    "money": "$1,500.00"
  },
  "doc_class": "invoice",
  "confidence": 0.92
}

FHIR R4 Healthcare Interoperability Payload

Directly generated from text mapping: "Patient: John Doe. Diagnosis: Diabetes."

{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {
      "resource": {
        "resourceType": "Patient",
        "name": [{"family": "Doe", "given": ["John"]}]
      }
    },
    {
      "resource": {
        "resourceType": "Condition",
        "code": {"text": "Diabetes"},
        "subject": {"reference": "Patient/John"}
      }
    }
  ]
}
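
To make the text-to-bundle mapping concrete, here is a minimal sketch that produces a bundle of this shape from the sample sentence. It is not the project's `fhir_mapper.py`: the `build_bundle` function, the two regular expressions, and the `patient-1` id are assumptions made for illustration. Unlike the sample above, the sketch gives the Patient an explicit id so that the Condition's `subject` reference matches it.

```python
import json
import re


def build_bundle(text: str) -> dict:
    """Map 'Patient: <name>.' and 'Diagnosis: <condition>.' to a FHIR R4 collection Bundle."""
    entries = []
    patient_id = None

    patient = re.search(r"Patient:\s*([^.]+)", text)
    parts = patient.group(1).split() if patient else []
    if parts:
        # Treat the last token as the family name and the rest as given names.
        *given, family = parts
        name = {"family": family}
        if given:
            name["given"] = given
        patient_id = "patient-1"  # hypothetical id, fixed for the example
        entries.append(
            {"resource": {"resourceType": "Patient", "id": patient_id, "name": [name]}}
        )

    diagnosis = re.search(r"Diagnosis:\s*([^.]+)", text)
    if diagnosis and diagnosis.group(1).strip():
        condition = {
            "resourceType": "Condition",
            "code": {"text": diagnosis.group(1).strip()},
        }
        if patient_id:
            condition["subject"] = {"reference": f"Patient/{patient_id}"}
        entries.append({"resource": condition})

    return {"resourceType": "Bundle", "type": "collection", "entry": entries}


bundle = build_bundle("Patient: John Doe. Diagnosis: Diabetes.")
print(json.dumps(bundle, indent=2))
```

A real mapper would also need coded values (for example SNOMED CT or ICD-10 codes) and validation against the FHIR R4 profiles, which this sketch leaves out.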

🏗️ Technical Architecture Details

The pipeline is split into modules under the /app/pipeline directory so that individual stages (for example, which FHIR resources are emitted, or the cv2 blur intensity used in preprocessing) can be changed independently:

  1. preprocessor.py: Runs CLAHE contrast enhancement, skew correction, and noise reduction on images via cv2 before OCR.
  2. ocr_engine.py: Interacts directly with Tesseract and pypdf. A global singleton accessor (_get_tesseract()) keeps the engine loaded so that it is not reinitialised on every request.
  3. extractor.py: Combines a cached spaCy en_core pipeline with deterministic regex patterns.
  4. medical_extractor.py & fhir_mapper.py: Clinical keyword heuristics that map extracted findings to FHIR R4 JSON resources in plain Python.
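
The singleton approach mentioned for ocr_engine.py can be sketched with `functools.lru_cache`. This shows one common way to get load-once behaviour and is not necessarily how `_get_tesseract()` is written. `FakeEngine` and `get_engine` are hypothetical stand-ins for an expensive resource and its accessor.

```python
from functools import lru_cache


class FakeEngine:
    """Stand-in for an expensive resource such as an OCR handle or a spaCy pipeline."""

    instances = 0  # counts how many times the constructor has run

    def __init__(self) -> None:
        FakeEngine.instances += 1


@lru_cache(maxsize=1)
def get_engine() -> FakeEngine:
    """Create the engine on the first call; later calls return the cached object."""
    return FakeEngine()


first = get_engine()
second = get_engine()
print(first is second, FakeEngine.instances)
```

Every request handler that calls `get_engine()` shares the same object, so the loading cost is paid once per process rather than once per request.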

Authored for the Next Generation of Medical Data Parsing.

About

End-to-end AI document extraction system with multi-format support, instant OCR, NLP entity recognition, and FHIR healthcare standard generation.
