Week 1: Document Understanding Layer by Abhishek-Kumar-Rai5 · Pull Request #2 · PecanProject/sage

Abhishek-Kumar-Rai5 · 2026-06-17T16:01:29Z

Overview

This PR implements the Week 1 milestone of the BETYdb extraction pipeline: the complete Document Understanding Layer.

The pipeline converts Marker's raw document representation into the project's immutable Document Object while preserving structural provenance, deterministic identifiers, and document hierarchy.

This work establishes the foundation required for the retrieval and extraction stages in later milestones.

What this PR includes

Raw Marker Model

Immutable Pydantic models representing Marker JSON
Lossless mapping from Marker output
Validation and serialization support
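
As a rough illustration of what "immutable and lossless" can mean for a raw model, the sketch below uses stdlib frozen dataclasses as a stand-in for the Pydantic models in this PR. The class name `RawBlock` and the field names (`block_type`, `children`, `id`, `html`) are assumptions made for the example, not the PR's actual schema; unknown keys are kept verbatim so that serializing returns the original JSON.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class RawBlock:
    """Hypothetical raw block: known fields are typed, all others are kept verbatim."""

    block_type: str
    children: tuple["RawBlock", ...] = ()
    # Read-only view of every key the model does not interpret (shallow only).
    extra: Mapping[str, Any] = field(default_factory=dict)

    @classmethod
    def from_json(cls, data: Mapping[str, Any]) -> "RawBlock":
        # This sketch requires both "block_type" and "children" to be present.
        known = {"block_type", "children"}
        return cls(
            block_type=data["block_type"],
            children=tuple(cls.from_json(child) for child in data["children"]),
            extra=MappingProxyType({k: v for k, v in data.items() if k not in known}),
        )

    def to_json(self) -> dict[str, Any]:
        # Re-emit uninterpreted keys first, then the typed fields.
        out = dict(self.extra)
        out["block_type"] = self.block_type
        out["children"] = [child.to_json() for child in self.children]
        return out
```

Round-tripping a sample through `RawBlock.from_json(...).to_json()` and comparing it with the input is one simple way to check the lossless property.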

Document Object

Immutable document schema
Deterministic object identifiers
Provenance tracking
Metadata and statistics models
Structural invariants
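
One way to obtain deterministic identifiers is to hash a canonical path together with the source document identifier, so the same input always yields the same ID. The sketch below is only an assumption about how that could look: the helper `make_object_id`, the `Provenance` fields, the path format, and the choice of a truncated SHA-256 digest are all hypothetical and not taken from the PR.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record pointing back at the source Marker block."""

    source_pdf_identifier: str
    marker_block_id: str
    page_index: int


def make_object_id(source_pdf_identifier: str, canonical_path: str) -> str:
    """Hash the document identifier and canonical path into a stable 16-character ID."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = f"{source_pdf_identifier}\x00{canonical_path}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# Example usage with an invented canonical path.
paragraph_id = make_object_id("pecan", "section[2]/paragraph[0]")
paragraph_provenance = Provenance(
    source_pdf_identifier="pecan",
    marker_block_id="/page/3/Text/5",
    page_index=3,
)
```

Because the ID depends only on the inputs, re-running normalization on the same Marker output would reproduce it, which is what makes downstream references stable.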

Normalizer

Implemented the complete staged normalization pipeline:

| Stage | Responsibility |
| --- | --- |
| 0 | Front matter detection |
| 1 | Page shell construction |
| 1.5 | Wrapper unwrapping |
| 2 | Block classification |
| 3 | Builder construction |
| 4 | Caption resolution |
| 5 | Table structure construction |
| 6 | Footnote attachment |
| 7 | Section tree assembly |
| 8 | Reading order assignment |
| 9 | Canonical path generation |
| 10 | Materialization into Document objects |
| 11 | Validation |
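
The staged design listed above can be pictured as an ordered list of small functions that each transform a shared working state. The sketch below is a minimal illustration under that assumption; `PipelineState`, the two stand-in stages, and the dictionary-based blocks are hypothetical and much simpler than the real stages.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineState:
    """Hypothetical mutable working state shared by the stages."""

    blocks: list[dict] = field(default_factory=list)
    trace: list[str] = field(default_factory=list)


Stage = Callable[[PipelineState], None]


def unwrap_wrappers(state: PipelineState) -> None:
    """Stage 1.5 stand-in: replace wrapper blocks with their children."""
    flat: list[dict] = []
    for block in state.blocks:
        if block.get("type") == "Wrapper":
            flat.extend(block.get("children", []))
        else:
            flat.append(block)
    state.blocks = flat


def assign_reading_order(state: PipelineState) -> None:
    """Stage 8 stand-in: number blocks in their current sequence (mutates the block dicts)."""
    for index, block in enumerate(state.blocks):
        block["reading_order"] = index


# Stages run strictly in this order; each does one transformation.
STAGES: list[tuple[str, Stage]] = [
    ("1.5", unwrap_wrappers),
    ("8", assign_reading_order),
]


def run_pipeline(state: PipelineState) -> PipelineState:
    for label, stage in STAGES:
        stage(state)
        state.trace.append(label)  # record which stages have completed
    return state
```

Keeping each stage as a separate callable makes the execution order explicit and lets an individual stage be tested in isolation.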

Public API

The Document Understanding Layer exposes a single public entry point through the normalize() function. Given a validated MarkerDocument together with processing metadata, it executes the complete normalization pipeline and returns the project's immutable Document representation.

document = normalize(
    marker_document=marker_document,
    source_pdf_identifier="pecan",
    processing_context=context,
)

Repository Architecture

sage/
├── src/
│   ├── betydb_extraction/
│   │   ├── marker_adapter/        # Immutable Marker raw data model
│   │   │   ├── raw_model.py
│   │   │   └── __init__.py
│   │   │
│   │   ├── document/             # Immutable Document Object schema
│   │   │   ├── document.py
│   │   │   ├── page.py
│   │   │   ├── section.py
│   │   │   ├── paragraph.py
│   │   │   ├── table.py
│   │   │   ├── figure.py
│   │   │   ├── equation.py
│   │   │   ├── footnote.py
│   │   │   ├── reference.py
│   │   │   ├── caption.py
│   │   │   ├── metadata.py
│   │   │   ├── statistics.py
│   │   │   ├── provenance.py
│   │   │   └── ...
│   │   │
│   │   └── normalizer/           # Marker → Document transformation
│   │       ├── api.py            # Public normalize() entry point
│   │       ├── context.py
│   │       ├── builders/         # Temporary mutable builders
│   │       ├── internal/         # Internal normalization stages
│   │       └── logging_util.py
│   │
│   ├── ir/                       # Intermediate Representation (Phase 2)
│   ├── validation/               # Validation pipeline (future)
│   ├── review/                   # Human review workflow (future)
│   └── document_processing/      # Shared processing utilities
│
├── tests/
│   ├── document/
│   ├── marker_adapter/
│   └── normalizer/
│
├── docs/
│   ├── document_schema_specification_v1.1.md
│   ├── normalizer.md
│   ├── marker_empirical_findings_paper1.md
│   └── ...
│
├── scripts/                      # Smoke tests and helper scripts
│
├── data/
│   ├── raw_papers/
│   ├── marker_output/
│   └── normalized_output/
│
├── requirements.txt
├── pyproject.toml
└── README.md

Design decisions

The Raw Marker Model remains a lossless representation of Marker output.
Document objects are immutable.
Builder objects exist only during normalization.
Deterministic IDs are derived from canonical paths.
Structural provenance is preserved for every generated object.
Normalization is implemented as an explicit staged pipeline, with each stage responsible for a single transformation.
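
The split between temporary builders and immutable Document objects can be sketched as follows. This is an illustration only: `ParagraphBuilder`, `Paragraph`, and their fields are hypothetical names, and stdlib dataclasses stand in for whatever model classes the PR actually uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Paragraph:
    """Hypothetical immutable result object."""

    object_id: str
    text: str
    footnote_ids: tuple[str, ...]


@dataclass
class ParagraphBuilder:
    """Hypothetical mutable builder that lives only while normalization runs."""

    object_id: str
    text_parts: list[str] = field(default_factory=list)
    footnote_ids: list[str] = field(default_factory=list)

    def build(self) -> Paragraph:
        # Copy mutable lists into tuples so later builder changes cannot leak out.
        return Paragraph(
            object_id=self.object_id,
            text=" ".join(self.text_parts),
            footnote_ids=tuple(self.footnote_ids),
        )


builder = ParagraphBuilder(object_id="p-0")
builder.text_parts.extend(["Leaf nitrogen", "was measured."])
builder.footnote_ids.append("fn-1")  # e.g. attached by a later stage
paragraph = builder.build()
```

Earlier stages are free to mutate the builder; once `build()` has run, the resulting object can no longer change, which is the property the design decisions above rely on.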

Current scope

This PR covers the Document Understanding Layer only; the retrieval and extraction stages are left to later milestones.

Validation

The complete normalization pipeline was exercised end-to-end using the smoke test (scripts/run_normalize_smoke_test.py) against three real Marker outputs:

pecan.pdf
Nutrient-cycling.pdf
culti-mixtures.pdf

For each document, the pipeline successfully:

constructed the Document object,
preserved structural provenance,
generated deterministic identifiers,
materialized the section hierarchy,
serialized the final document for inspection.

Screenshots

Video recording of the normalizer smoke test:
https://github.com/user-attachments/assets/84322b80-5ad3-4f67-aa9b-10220b7f23a6

Documentation

Added:

Document Schema Specification
Normalizer Design Specification

Architecture flow chart

Repository Navigation Guide

This repository is organized around the major stages of the extraction pipeline. The recommended order below provides enough context to understand how information flows from a PDF to a structured scientific representation.

1. Project Documentation (`docs/`)

Start here before reading the implementation.

document_schema_specification_v1.1.md
- Defines every immutable Document Object used throughout the project.
normalizer.md
- Describes the normalization pipeline, stage responsibilities, design decisions, and implementation constraints.
marker_empirical_findings_paper1.md
- Documents observations from evaluating Marker output that influenced the document model and normalizer design.

2. Marker Adapter (`src/betydb_extraction/marker_adapter/`)

The Marker Adapter is responsible for representing Marker output as validated Python objects.

Key entry points:

raw_model.py
- Immutable representation of Marker JSON.

Purpose:

Marker JSON
        │
        ▼
MarkerDocument

3. Document Object (`src/betydb_extraction/document/`)

Defines the project's canonical immutable document representation.

Major components include:

Document
Page
Section
Paragraph
Table
Figure
Equation
Footnote
Reference
Caption
Metadata
Statistics
Provenance

These models are the shared representation used by every downstream stage.

4. Normalizer (`src/betydb_extraction/normalizer/`)

Responsible for transforming Marker documents into immutable Document Objects.

Important files:

api.py
- Public normalization entry point.
builders/
- Temporary mutable builders used during normalization.
internal/
- Internal transformation stages implementing the normalization pipeline.
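
To give a flavour of what one internal stage might do, the sketch below assembles a section tree from a flat list of headings using a stack of open sections. It is an assumed approach for illustration, not the PR's implementation; `SectionNode`, `assemble_section_tree`, and the `(level, title)` input format are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SectionNode:
    """Hypothetical mutable tree node used while the hierarchy is assembled."""

    title: str
    level: int
    children: list["SectionNode"] = field(default_factory=list)


def assemble_section_tree(headings: list[tuple[int, str]]) -> SectionNode:
    """Nest (level, title) headings, given in document order, under a synthetic root."""
    root = SectionNode(title="<root>", level=0)
    stack = [root]
    for level, title in headings:
        if level < 1:
            raise ValueError(f"heading level must be >= 1, got {level}")
        # Close every open section that is not a strict ancestor of this heading.
        while stack[-1].level >= level:
            stack.pop()
        node = SectionNode(title=title, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


tree = assemble_section_tree(
    [(1, "Introduction"), (2, "Study site"), (2, "Sampling"), (1, "Results")]
)
```

Here "Study site" and "Sampling" end up as children of "Introduction", while "Results" becomes a second top-level section.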

5. Tests (`tests/`)

Organized alongside the major project components.

marker_adapter/
document/
normalizer/

These verify correctness, invariants, serialization, and deterministic behaviour.
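
Checks for serialization and deterministic behaviour often reduce to two properties: normalizing the same input twice gives identical output, and a serialized document can be read back unchanged. The sketch below shows that pattern on a toy model; `TinyDocument`, `toy_normalize`, `serialize`, and `deserialize` are hypothetical stand-ins and not the functions under test in this repository.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TinyDocument:
    """Toy stand-in for the immutable Document object."""

    source_pdf_identifier: str
    paragraphs: tuple[str, ...]


def toy_normalize(raw: dict) -> TinyDocument:
    """Toy stand-in for normalization: trims whitespace from each block's text."""
    return TinyDocument(
        source_pdf_identifier=raw["source_pdf_identifier"],
        paragraphs=tuple(block["text"].strip() for block in raw["blocks"]),
    )


def serialize(document: TinyDocument) -> str:
    # sort_keys makes the byte-level output independent of key insertion order.
    return json.dumps(asdict(document), sort_keys=True)


def deserialize(text: str) -> TinyDocument:
    data = json.loads(text)
    return TinyDocument(
        source_pdf_identifier=data["source_pdf_identifier"],
        paragraphs=tuple(data["paragraphs"]),
    )


def test_normalization_is_deterministic_and_round_trips() -> None:
    raw = {"source_pdf_identifier": "pecan", "blocks": [{"text": " First. "}, {"text": "Second."}]}
    first, second = toy_normalize(raw), toy_normalize(raw)
    assert serialize(first) == serialize(second)
    assert deserialize(serialize(first)) == first
```

A function named like this would be collected automatically by a runner such as pytest, though it can also be called directly.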

6. Scripts (`scripts/`)

Contains helper scripts for running smoke tests and validating the complete normalization pipeline on real Marker outputs.

7. Data (`data/`)

Development resources used throughout the project.

raw_papers/
- Original PDFs.
marker_output/
- Marker-generated JSON.
normalized_output/
- Serialized Document Objects generated by the normalization pipeline.

Abhishek-Kumar-Rai5 wants to merge 2 commits into PecanProject:main from Abhishek-Kumar-Rai5:gsoc/week1-document-layer
