AI Role Validator: XML to PDF Job Role Comparison

Workflow

A smart tool to compare structured job role definitions (from XML) with roles found in unstructured PDF job descriptions using AI, RAG, and fuzzy logic.

Overview

The AI Role Validator is an intelligent Python application designed to automate and streamline the validation of job roles.

It compares a definitive list of roles specified in a structured XML file against roles extracted from unstructured PDF documents. Using Generative AI (Google Gemini), Retrieval-Augmented Generation (RAG), and fuzzy string matching, it ensures consistency, highlights discrepancies, and eliminates hours of manual effort.

Features

XML Role Extraction – Parses structured XML files for the authoritative list of job roles.
PDF Content Extraction – Reads text and tables from PDFs using PyMuPDF.
LLM-Powered Role Extraction – Uses Google Gemini AI to extract roles from messy, unstructured text.
RAG-Based Enhancement – Integrates Pinecone vector DB for chunk-level retrieval from PDFs.
Fuzzy Matching – Accounts for typos, abbreviations, and formatting inconsistencies.
Validation Report – Clearly classifies:
- Matched Roles (exact & fuzzy)
- Unmatched/Incorrect Roles
Configurable Thresholds – Tune fuzzy matching sensitivity.

How Fuzzy Matching Works

Used to catch typos and minor word-level errors.

1️⃣ Levenshtein Distance – `fuzz.ratio()`

Example: XML: Tester PDF: Tester → Edit distance = 1 substitution → Similarity ≈ 83.33%

2️⃣ Ratcliff-Obershelp – `fuzz.partial_ratio()`

Used to catch abbreviations or substring matches.

Example: XML: Software Engineer PDF: Software Eng. → Partial ratio = 100% (as it's a near-perfect subset)

The intelligent combination of both methods ensures robust matching, even across formatting variations and abbreviations.

🛠️ Technologies Used

Python 3.9+
Google Gemini API
Pinecone Vector DB
PyMuPDF (fitz)
thefuzz (fuzzywuzzy)
lxml
langchain-text-splitters
python-dotenv

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
config		config
data		data
src		src
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
ui.py		ui.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI Role Validator: XML to PDF Job Role Comparison

Workflow

Overview

Features

How Fuzzy Matching Works

1️⃣ Levenshtein Distance – `fuzz.ratio()`

2️⃣ Ratcliff-Obershelp – `fuzz.partial_ratio()`

🛠️ Technologies Used

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AI Role Validator: XML to PDF Job Role Comparison

Workflow

Overview

Features

How Fuzzy Matching Works

1️⃣ Levenshtein Distance – fuzz.ratio()

2️⃣ Ratcliff-Obershelp – fuzz.partial_ratio()

🛠️ Technologies Used

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1️⃣ Levenshtein Distance – `fuzz.ratio()`

2️⃣ Ratcliff-Obershelp – `fuzz.partial_ratio()`

Packages