An intelligent invoice parsing pipeline using Databricks AI capabilities to automatically extract structured data from raw invoice documents.
Invoices come in all shapes, sizes, and layouts. Whether from vendors, partners, or internal systems, each document may have a unique format, making traditional rule-based parsing fragile and expensive to maintain. This project demonstrates how to use Databricks' `ai_parse_document` capability combined with Spark to build a scalable and intelligent document processing pipeline.
- Accept raw invoices (PDFs/images) from diverse sources
- Understand layout and content using AI-powered document parsing
- Extract key business fields such as:
  - Invoice Number
  - Invoice Date
  - Due Date
  - Line Items (description, quantity, unit price, amount)
  - Subtotal, Tax, and Total Due
- Convert unstructured documents into structured, analytics-ready tables
Raw invoice PDFs are stored in Databricks Volumes and loaded as binary content to preserve their structure:

```python
base_path = "/Volumes/workspace/gs_invoices/raw_invoices"

# Read every PDF in the volume as raw bytes, one row per file.
raw_df = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.pdf")
    .load(base_path)
)
```

Each PDF is then processed with Databricks' `ai_parse_document` function, which converts the raw binary content into a structured document representation:
```python
from pyspark.sql.functions import expr

# ai_parse_document runs layout-aware AI parsing over the PDF bytes.
parsed_df = raw_df.select(
    "path",
    expr("ai_parse_document(content) AS parsed_document"),
)
```

The parsed output contains nested elements, including:
- Text blocks
- Table cells
- Headers and footers
- Page-level metadata
These elements are extracted and flattened for downstream processing.
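As a minimal sketch of that flattening step, assuming the elements live under `document.elements` with `type` and `content` fields (the exact output shape of `ai_parse_document` varies by runtime version, so inspect `parsed_df.printSchema()` and adjust), something like this yields one row per element:

```python
from pyspark.sql.functions import col, explode, from_json, to_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Declare only the parts of the parsed output we need. The field
# names here are assumptions; match them to your runtime's schema.
parsed_schema = StructType([
    StructField("document", StructType([
        StructField("elements", ArrayType(StructType([
            StructField("type", StringType()),     # e.g. text, table, header
            StructField("content", StringType()),  # the element's text
        ]))),
    ])),
])

elements_df = (
    parsed_df
    # to_json covers a struct-typed result; if your runtime returns a
    # VARIANT column, cast it to a JSON string with
    # expr("parsed_document::string") instead.
    .withColumn("doc", from_json(to_json(col("parsed_document")), parsed_schema))
    .select("path", explode(col("doc.document.elements")).alias("element"))
    .select(
        "path",
        col("element.type").alias("element_type"),
        col("element.content").alias("content"),
    )
)
```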
Key invoice fields are extracted from the parsed elements into a clean, analytics-ready schema suitable for reporting and analysis.
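The notebook holds the actual extraction logic; as a simplified, hypothetical illustration, label-based regular expressions over the flattened text (using `elements_df` from the sketch above) can recover the header fields, with patterns tuned to your vendors' layouts:

```python
from pyspark.sql.functions import collect_list, concat_ws, regexp_extract

# Rebuild one text blob per document from the flattened elements,
# then pull labeled fields with illustrative regex patterns.
invoice_text_df = (
    elements_df
    .groupBy("path")
    .agg(concat_ws("\n", collect_list("content")).alias("full_text"))
)

invoices_df = invoice_text_df.select(
    "path",
    regexp_extract("full_text", r"Invoice\s*(?:Number|#)[:\s]*([A-Z]+-\d+)", 1).alias("invoice_number"),
    regexp_extract("full_text", r"Invoice\s*Date[:\s]*(\d{4}-\d{2}-\d{2})", 1).alias("invoice_date"),
    regexp_extract("full_text", r"Due\s*Date[:\s]*(\d{4}-\d{2}-\d{2})", 1).alias("due_date"),
    regexp_extract("full_text", r"Total\s*Due[:\s]*\$?([\d,]+\.\d{2})", 1).alias("total_due"),
)
```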
The pipeline extracts structured data like:
| Field | Example Value |
|---|---|
| Invoice Number | INV-30016 |
| Invoice Date | 2025-10-19 |
| Due Date | 2025-11-02 |
| Subtotal | $2,115.00 |
| Tax (5.5%) | $116.33 |
| Total Due | $2,231.33 |
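To make the results analytics-ready, the extracted rows can be persisted as a Unity Catalog table. The catalog and schema names below mirror the volume path used earlier and are assumptions; substitute your own:

```python
# Write the structured invoice fields to a managed Delta table.
(
    invoices_df.write
    .mode("overwrite")
    .saveAsTable("workspace.gs_invoices.parsed_invoices")
)
```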
- Databricks Runtime with AI Functions enabled
- Access to Unity Catalog Volumes for document storage
- PySpark for distributed processing
- Upload your invoice PDFs to a Databricks Volume
- Update the `base_path` variable in the notebook to point to your documents
- Run the `invoice_parser.ipynb` notebook cells sequentially
- View the extracted structured data in the output tables

- `invoice_parser.ipynb` - Main notebook with the complete parsing pipeline
- `idp_architectural_flow.png` - Architecture diagram showing the data flow
This project is for demonstration purposes.
