Blog: Getting structured data out of images with Granite Vision 4.1 #48
planetf1 wants to merge 6 commits into
Blog post covering m.instruct() + format= + ImageBlock for typed receipt extraction, building up through requirements= and IVR validation_fn. Includes a synthetic receipt image generated with PIL.
Assisted-by: Claude Code
- Add `text` language tag to output fence (fixes MD040 lint failure)
- Wrap `check_line_totals` with `simple_validate()`: `validation_fn` expects `Callable[[Context], ValidationResult]`, not `str` directly
- pip install → uv add (consistent with other Mellea blogs)
- Add conclusion section with recap and cross-references to docs.mellea.ai
Assisted-by: Claude Code
…on blog
- Replace line-item arithmetic check with subtotal+tax=total verification; the old check failed because granite3.2-vision reads discounts as positive
- Rewrite 'What we covered' as narrative 'From narration to data' section
Assisted-by: Claude Code
- New receipt image: 6 line items with smudged subtotal digit
- Expanded editorial note: marks as draft, notes scenario still being iterated, clarifies Ollama not yet available but expected soon
- Sync blog body to new receipt values ($79.86 total, no discounts)
- IVR section references smudged subtotal as the failure trigger
Assisted-by: Claude Code
Don't have a strong opinion here, but assuming it takes a while to get the vision model into Ollama, should we consider using vLLM in the blog instead?
Switch from RejectionSamplingStrategy to RepairTemplateStrategy in both the requirements= and IVR sections. RejectionSamplingStrategy just retries with the same prompt; RepairTemplateStrategy injects the validation failure reason into the repair prompt, which is what the surrounding prose already describes. Also promote "Going further" from bold text to a ## heading, and add a paragraph to the conclusion making detection vs. repair guarantees explicit.
Assisted-by: Claude Code
ajbozarth left a comment
I'll walk through the blog and try it out myself when I have bandwidth, but to start, here's a small review from Claude:
Code checks out against current mellea source — APIs, imports, signatures, and the RepairTemplateStrategy switch all verify. Front matter and asset are good. Snippet syntax checks pass; live execution skipped (model not in Ollama yet, per the editorial note). de-llmify score 1.
A few inline notes below. Pre-publish blockers (editorial note removal, Ollama availability) are already tracked in the PR description.
order #2231. It lists three cold brew coffees at $4.75 each, two grain bowls
at $12.95 each, four granola bars at $2.95 each, three oat milk add-ons at
$0.75 each, one avocado toast at $11.50, and two blueberry muffins at $3.95
each. The subtotal is $73.60, tax at 8.5% is $6.26, for a total of $79.86."
Consider a one-line caption after this block flagging it as a representative example, not a verbatim response — so a reader running the snippet and getting different prose doesn't think they've broken something.
validation layer will always surface a mismatch — if the arithmetic is wrong, you'll know.
Repair success depends on the model's capacity. A 4b model working from a partially obscured
image will not always correct itself in three tries; a larger model usually will. The value of
wiring the check programmatically isn't that repair always succeeds — it's that a silent wrong
Light de-llmify nit — the "isn't X — it's Y" construction is a Tier 1 phrase tell. Doing real semantic work here, so optional, but if you want to drop it: commit to the stronger half.
wiring the check programmatically isn't that repair always succeeds — it's that a silent wrong
The point of wiring the check programmatically is that a silent wrong answer is no longer possible. Repair success is a separate question.
The gap this closes is a real one. Vision models are already good at reading documents —
they just default to telling you about them rather than handing you the data. Mellea's
`format=` parameter shifts that: the return type becomes a contract, constrained decoding
enforces it, and you get a typed Python object the rest of your code can actually use.

`requirements=` and `validation_fn` extend that contract beyond structure. Plain-English
requirements catch semantic problems the type system can't — negative totals, badly
formatted dates, values that are plausible individually but wrong together. A `validation_fn`
pushes further still, running the kind of check you'd write in post-processing anyway and
folding it directly into the generation loop rather than bolting it on after.

One thing worth being clear about: detection and repair are separate guarantees. The
validation layer will always surface a mismatch — if the arithmetic is wrong, you'll know.
Repair success depends on the model's capacity. A 4b model working from a partially obscured
image will not always correct itself in three tries; a larger model usually will. The value of
wiring the check programmatically isn't that repair always succeeds — it's that a silent wrong
answer is no longer possible.

All of this composes with any backend. Swap from a local model to a cloud endpoint, or to a
different local runtime, and the extraction logic doesn't change — only the session setup does.
The detection-vs-repair point in paragraph 3 is the genuinely new framing here; paragraphs 1, 2, and 4 recap ground already covered (and the backend portability paragraph overlaps with "Swapping backends" above). Consider tightening to lead with the detection-vs-repair point and dropping the recaps. Take or leave.
What this is showing off
Vision models return prose. The point of this post is that they don't have to.
The blog demonstrates Mellea's extraction pattern: pass a `format=` Pydantic model to
`m.instruct()`, get a typed Python object back instead of a string. No JSON prompt
engineering, no `json.JSONDecodeError` handlers, no post-processing regex. The return type
is the contract, and constrained decoding enforces it.
It then builds up two layers of validation on top:
- `requirements=`: plain-English semantic constraints (date format, positive totals). The model retries with the failed requirement injected into the repair prompt.
- `validation_fn`: programmatic arithmetic check (line items × quantities = subtotal). The failure reason gets fed back into the repair prompt verbatim.
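As a rough illustration of the second layer, an arithmetic check can be written as an ordinary function that returns a pass/fail flag plus a reason string. This is a minimal standard-library sketch, not the blog's actual code: the function name `check_receipt_arithmetic`, the JSON field names, and the `(bool, str)` return shape are all assumptions made for illustration, and the step of wrapping the function with `simple_validate()` is left out. It checks both the line-item sum and subtotal + tax = total, using the receipt values quoted in the review thread.

```python
import json
from decimal import Decimal

# Receipt values taken from the narration quoted in the review thread.
# The field names are hypothetical; adapt them to the real schema.
SAMPLE_RECEIPT = """{
  "line_items": [
    {"quantity": 3, "unit_price": 4.75},
    {"quantity": 2, "unit_price": 12.95},
    {"quantity": 4, "unit_price": 2.95},
    {"quantity": 3, "unit_price": 0.75},
    {"quantity": 1, "unit_price": 11.50},
    {"quantity": 2, "unit_price": 3.95}
  ],
  "subtotal": 73.60,
  "tax": 6.26,
  "total": 79.86
}"""


def check_receipt_arithmetic(raw: str) -> tuple[bool, str]:
    """Return (passed, reason) for a JSON-encoded receipt extraction."""
    # Parse money as Decimal so cent-level comparisons are exact.
    data = json.loads(raw, parse_float=Decimal)

    items_sum = sum(
        item["quantity"] * item["unit_price"] for item in data["line_items"]
    )
    if items_sum != data["subtotal"]:
        return False, (
            f"line items sum to {items_sum} but subtotal is {data['subtotal']}"
        )

    expected_total = data["subtotal"] + data["tax"]
    if expected_total != data["total"]:
        return False, (
            f"subtotal + tax is {expected_total} but total is {data['total']}"
        )

    return True, "arithmetic is consistent"
```

Returning the reason as a specific sentence matters more than the check itself: a repair prompt can only help if it says which number disagreed with which.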
The receipt image is synthetic (PIL-generated) with a thermal-printer smudge over part of
the subtotal, giving the validation layers something realistic to catch.
Strategy fix (latest commit): Both validation sections now use `RepairTemplateStrategy`
instead of `RejectionSamplingStrategy`. `RejectionSamplingStrategy.repair()` returns the
unchanged action/context (same prompt, no feedback). `RepairTemplateStrategy` builds a
repair prompt with the failed requirement or `ValidationResult.reason` injected, which is
what the surrounding blog prose describes.
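The difference between the two strategies can be pictured with a toy retry loop. The sketch below is not Mellea's implementation and uses none of its classes: `sample_until_valid`, `fake_model`, and `is_iso_date` are hypothetical names, the model is a stub, and the YYYY-MM-DD target format is an assumption. It only contrasts retrying with the same prompt against retrying with the failure reason appended.

```python
from typing import Callable, Optional

Validator = Callable[[str], tuple[bool, str]]


def sample_until_valid(
    generate: Callable[[str], str],
    validate: Validator,
    prompt: str,
    budget: int = 3,
    inject_reason: bool = True,
) -> tuple[Optional[str], list[str]]:
    """Toy sampling loop; returns (accepted output or None, prompts sent)."""
    current = prompt
    prompts_sent: list[str] = []
    for _ in range(budget):
        prompts_sent.append(current)
        output = generate(current)
        ok, reason = validate(output)
        if ok:
            return output, prompts_sent
        if inject_reason:
            # Repair-style: the next prompt carries the failure reason.
            current = f"{prompt}\n\nPrevious attempt failed validation: {reason}"
        # Rejection-style: `current` stays unchanged, so the same prompt is resent.
    return None, prompts_sent


def fake_model(prompt: str) -> str:
    """Stub model that only fixes the date once told what went wrong."""
    if "failed validation" in prompt:
        return "2026-03-22"
    return "22/03/2026"


def is_iso_date(text: str) -> tuple[bool, str]:
    """Very rough shape check for YYYY-MM-DD (assumed target format)."""
    ok = len(text) == 10 and text[4] == "-" and text[7] == "-"
    reason = "" if ok else f"date {text!r} is not in YYYY-MM-DD format"
    return ok, reason
```

With `inject_reason=False` the stub exhausts its budget on identical prompts; with `inject_reason=True` it succeeds on the second attempt. A real model gives no such guarantee, which is why the PR treats detection and repair as separate properties.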
Status: Draft — scenario still being refined
The Mellea API usage and code structure are stable. The receipt scenario is still being
iterated. Detection is reliable: the date format requirement consistently detects
`22/03/2026`, and the arithmetic check confirms extractions are correct. Repair of the date
format issue at 4b scale is not guaranteed (the blog's conclusion now says this explicitly).
Receipt values may change before publication.
Model availability — why this is blocked
The blog is written for Ollama, which is the right default for a local-first post. The
problem: Ollama requires GGUF format, and Granite Vision 4.1 is only available as full
bfloat16 safetensors on Hugging Face right now (~8 GB download, not a 4-bit quantized GGUF
like you'd get from `ollama pull`).
Ollama cannot load safetensors directly; it needs IBM to publish a GGUF to the Ollama
library (or a community conversion to appear). Until then, the testing path is mlx-vlm on
Apple Silicon or vLLM, both of which can serve safetensors directly.
Watch https://ollama.com/library for `granite-vision-4.1`. When it lands: remove the
editorial note, verify `ollama pull granite-vision-4.1` works, flip to ready.
Reviewing now
Follow the editorial note at the top of the post. Short version:
Model downloads ~8 GB on first run (full bfloat16 safetensors — larger than an Ollama pull).
Serves at http://localhost:8080/v1.
to:
Test plan
`npm run dev`: confirm post renders at `/blogs/granite-vision-structured-extraction`
🤖 Generated with Claude Code