Skip to content

PrialaRadu/VisionDoc-AI

Repository files navigation

VISIONDOC AI

A GenAI system that interprets and semantically links embedded images within narrative documents for visual question answering and retrieval.

  1. Accepts input formats: PDF, DOCX.
  2. Uses a vision-language model (GEMMA3) to extract semantic meaning from images.
  3. Implements chunking logic that binds pre-image and post-image text with image metadata.
  4. Stores enriched chunks in a vector store (FAISS).
  5. Enables semantic search and retrieval using vector similarity.
  6. Builds a chatbot using retrieval-augmented generation (RAG) over vector data.
  7. Ensures chunk provenance (page, image position) is preserved.
  8. Implements access control and role-based user permissions.

PURPOSE

Let's take for example an auto brochure for Porsche Cayenne Turbo 2006. Project Purpose Image 1

It identifies each image and relevant text near it. Project Purpose Image 2

A description for each image is also created. Whenever the user asks for an image, he receives the most relevant picture. Project Purpose Image 3

The projects supports Q&A retrieval on multiple docx and pdf files. Project Purpose Image 4

LOGIC:

Workflow Image

USAGE:

Step 1: Add your desired documents into VisionDOC-AI/documents directory (remove any existing documents if you don't intend to use them)

Step 2: Run extraction/documents_extraction.py (extracts information about documents)

Step 3: Run db_build.py (storing information into vectorstore)

Step 4: Run app.py for web chatbot interface

  • use user: 'admin' and password: 'adminpwd' for admin permissions
  • now, you can ask the model about any image that exists in documents

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages