Week 1: Document Understanding Layer #2

Open
Abhishek-Kumar-Rai5 wants to merge 2 commits into PecanProject:main from Abhishek-Kumar-Rai5:gsoc/week1-document-layer

Abhishek-Kumar-Rai5 commented Jun 17, 2026

Collaborator

Overview

This PR implements the Week 1 milestone of the BETYdb extraction pipeline: the complete Document Understanding Layer.

The pipeline converts Marker's raw document representation into the project's immutable Document Object, preserving structural provenance and the document hierarchy and assigning deterministic identifiers.

This work establishes the foundation required for the retrieval and extraction stages in later milestones.


What this PR includes

Raw Marker Model

  • Immutable Pydantic models representing Marker JSON
  • Lossless mapping from Marker output
  • Validation and serialization support
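
As a rough illustration of what "immutable" and "lossless" mean here, the sketch below uses standard-library frozen dataclasses as a stand-in for the PR's Pydantic models. The class `RawBlock`, its fields, and the `from_json`/`to_json` helpers are hypothetical; they are not taken from raw_model.py or from Marker's actual schema.

```python
from __future__ import annotations

import dataclasses
from typing import Any


@dataclasses.dataclass(frozen=True)
class RawBlock:
    """Hypothetical stand-in for one node of a Marker-style JSON tree."""

    block_type: str
    html: str
    children: tuple[RawBlock, ...] = ()

    @classmethod
    def from_json(cls, data: dict[str, Any]) -> RawBlock:
        # Convert nested dicts recursively; tuples keep the whole tree immutable.
        return cls(
            block_type=data["block_type"],
            html=data.get("html", ""),
            children=tuple(cls.from_json(c) for c in data.get("children", [])),
        )

    def to_json(self) -> dict[str, Any]:
        # Serialize back to plain data so a round trip can be compared.
        return {
            "block_type": self.block_type,
            "html": self.html,
            "children": [c.to_json() for c in self.children],
        }
```

In this sketch, a lossless mapping means that `RawBlock.from_json(raw).to_json()` gives back `raw` for any input that carries all three keys, and assigning to a field after construction raises `dataclasses.FrozenInstanceError`.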

Document Object

  • Immutable document schema
  • Deterministic object identifiers
  • Provenance tracking
  • Metadata and statistics models
  • Structural invariants

Normalizer

Implemented the complete staged normalization pipeline:

| Stage | Responsibility |
|-------|----------------|
| 0 | Front matter detection |
| 1 | Page shell construction |
| 1.5 | Wrapper unwrapping |
| 2 | Block classification |
| 3 | Builder construction |
| 4 | Caption resolution |
| 5 | Table structure construction |
| 6 | Footnote attachment |
| 7 | Section tree assembly |
| 8 | Reading order assignment |
| 9 | Canonical path generation |
| 10 | Materialization into Document objects |
| 11 | Validation |
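
One way to picture the staged design is as an ordered list of single-purpose functions that each receive and return a shared working state. The sketch below is illustrative only: `Stage`, `run_pipeline`, and the two toy stage functions are hypothetical and do not mirror the code in normalizer/internal/.

```python
from __future__ import annotations

from typing import Callable

# A stage takes the working state and returns the updated state.
Stage = Callable[[dict], dict]


def detect_front_matter(state: dict) -> dict:
    # Toy stand-in for stage 0: treat every block on page 0 as front matter.
    state["front_matter"] = [b for b in state["blocks"] if b["page"] == 0]
    return state


def assign_reading_order(state: dict) -> dict:
    # Toy stand-in for stage 8: order blocks by page, then vertical position.
    ordered = sorted(state["blocks"], key=lambda b: (b["page"], b["y"]))
    state["reading_order"] = [b["id"] for b in ordered]
    return state


def run_pipeline(state: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run the stages strictly in order, recording which ones completed."""
    state.setdefault("completed", [])
    for label, stage in stages:
        state = stage(state)
        state["completed"].append(label)
    return state
```

Keeping each stage behind the same signature makes the execution order explicit in one place and lets a single stage be tested in isolation.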

Public API

The Document Understanding Layer exposes a single public entry point, normalize(). Given a validated MarkerDocument and processing metadata, it runs the complete normalization pipeline and returns the project's immutable Document representation.

document = normalize(
    marker_document=marker_document,
    source_pdf_identifier="pecan",
    processing_context=context,
)

Repository Architecture

sage/
├── src/
│   ├── betydb_extraction/
│   │   ├── marker_adapter/        # Immutable Marker raw data model
│   │   │   ├── raw_model.py
│   │   │   └── __init__.py
│   │   │
│   │   ├── document/             # Immutable Document Object schema
│   │   │   ├── document.py
│   │   │   ├── page.py
│   │   │   ├── section.py
│   │   │   ├── paragraph.py
│   │   │   ├── table.py
│   │   │   ├── figure.py
│   │   │   ├── equation.py
│   │   │   ├── footnote.py
│   │   │   ├── reference.py
│   │   │   ├── caption.py
│   │   │   ├── metadata.py
│   │   │   ├── statistics.py
│   │   │   ├── provenance.py
│   │   │   └── ...
│   │   │
│   │   └── normalizer/           # Marker → Document transformation
│   │       ├── api.py            # Public normalize() entry point
│   │       ├── context.py
│   │       ├── builders/         # Temporary mutable builders
│   │       ├── internal/         # Internal normalization stages
│   │       └── logging_util.py
│   │
│   ├── ir/                       # Intermediate Representation (Phase 2)
│   ├── validation/               # Validation pipeline (future)
│   ├── review/                   # Human review workflow (future)
│   └── document_processing/      # Shared processing utilities
│
├── tests/
│   ├── document/
│   ├── marker_adapter/
│   └── normalizer/
│
├── docs/
│   ├── document_schema_specification_v1.1.md
│   ├── normalizer.md
│   ├── marker_empirical_findings_paper1.md
│   └── ...
│
├── scripts/                      # Smoke tests and helper scripts
│
├── data/
│   ├── raw_papers/
│   ├── marker_output/
│   └── normalized_output/
│
├── requirements.txt
├── pyproject.toml
└── README.md

Design decisions

  • The Raw Marker Model remains a lossless representation of Marker output.
  • Document objects are immutable.
  • Builder objects exist only during normalization.
  • Deterministic IDs are derived from canonical paths.
  • Structural provenance is preserved for every generated object.
  • Normalization is implemented as an explicit staged pipeline, with each stage responsible for a single transformation.
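
The relationship between builders, canonical paths, and deterministic IDs can be sketched as follows. Everything here is an assumption made for illustration: the PR does not say how IDs are computed, so the truncated SHA-256 hash, the path format, and the `Paragraph`/`ParagraphBuilder` classes are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def stable_id(canonical_path: str) -> str:
    """Derive an ID that depends only on the object's canonical path."""
    digest = hashlib.sha256(canonical_path.encode("utf-8")).hexdigest()
    return digest[:16]  # assumed length; any fixed truncation works the same way


@dataclass(frozen=True)
class Paragraph:
    """Immutable result object (hypothetical fields)."""

    object_id: str
    canonical_path: str
    text: str
    source_block_id: str  # provenance: the raw block this object came from


@dataclass
class ParagraphBuilder:
    """Mutable scratch object that exists only while normalizing."""

    source_block_id: str
    text_parts: list = field(default_factory=list)
    canonical_path: str = ""

    def materialize(self) -> Paragraph:
        # Refuse to freeze an object whose path has not been assigned yet.
        if not self.canonical_path:
            raise ValueError("canonical path must be assigned before materializing")
        return Paragraph(
            object_id=stable_id(self.canonical_path),
            canonical_path=self.canonical_path,
            text=" ".join(self.text_parts),
            source_block_id=self.source_block_id,
        )
```

Because the ID is a pure function of the path, normalizing the same input twice yields the same IDs, which is why path generation (stage 9) has to run before materialization (stage 10).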

Current scope

This PR focuses on the document understanding layer only.

Validation

The complete normalization pipeline was exercised end-to-end using the smoke test (scripts/run_normalize_smoke_test.py) against three real Marker outputs:

  • pecan.pdf
  • Nutrient-cycling.pdf
  • culti-mixtures.pdf

For each document, the pipeline successfully:

  • constructed the Document object,
  • preserved structural provenance,
  • generated deterministic identifiers,
  • materialized the section hierarchy,
  • serialized the final document for inspection.
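
A determinism check of the kind described above could be sketched as below. The helper `is_deterministic` is hypothetical and is not part of scripts/run_normalize_smoke_test.py; it assumes the pipeline result can be converted to plain JSON-compatible data before comparison.

```python
import json
from typing import Any, Callable


def is_deterministic(run: Callable[[], Any], repeats: int = 3) -> bool:
    """Return True if every run serializes to the same JSON text."""
    # sort_keys removes dict-ordering noise so only real differences count.
    outputs = {json.dumps(run(), sort_keys=True) for _ in range(repeats)}
    return len(outputs) == 1
```

Here `run` would wrap one full normalization of a fixed Marker output and return the serialized document, so any identifier that changes between runs makes the check fail.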

Screenshots

[Screenshot: 2026-07-01 185706]

Video recording of the normalizer smoke test:
https://github.com/user-attachments/assets/84322b80-5ad3-4f67-aa9b-10220b7f23a6

Documentation

Added:

  • Document Schema Specification
  • Normalizer Design Specification

Architecture flow chart

[Image: architecture flow chart]

Repository Navigation Guide

This repository is organized around the major stages of the extraction pipeline. The recommended order below provides enough context to understand how information flows from a PDF to a structured scientific representation.

1. Project Documentation (docs/)

Start here before reading the implementation.

  • document_schema_specification_v1.1.md

    • Defines every immutable Document Object used throughout the project.
  • normalizer.md

    • Describes the normalization pipeline, stage responsibilities, design decisions, and implementation constraints.
  • marker_empirical_findings_paper1.md

    • Documents observations from evaluating Marker output that influenced the document model and normalizer design.

2. Marker Adapter (src/betydb_extraction/marker_adapter/)

The Marker Adapter is responsible for representing Marker output as validated Python objects.

Key entry points:

  • raw_model.py
    • Immutable representation of Marker JSON.

Purpose:

Marker JSON
        │
        ▼
MarkerDocument

3. Document Object (src/betydb_extraction/document/)

Defines the project's canonical immutable document representation.

Major components include:

  • Document
  • Page
  • Section
  • Paragraph
  • Table
  • Figure
  • Equation
  • Footnote
  • Reference
  • Caption
  • Metadata
  • Statistics
  • Provenance

These models are the shared representation used by every downstream stage.


4. Normalizer (src/betydb_extraction/normalizer/)

Responsible for transforming Marker documents into immutable Document Objects.

Important files:

  • api.py

    • Public normalization entry point.
  • builders/

    • Temporary mutable builders used during normalization.
  • internal/

    • Internal transformation stages implementing the normalization pipeline.

5. Tests (tests/)

Organized alongside the major project components.

  • marker_adapter/
  • document/
  • normalizer/

These verify correctness, invariants, serialization, and deterministic behaviour.


6. Scripts (scripts/)

Contains helper scripts for running smoke tests and validating the complete normalization pipeline on real Marker outputs.


7. Data (data/)

Development resources used throughout the project.

  • raw_papers/

    • Original PDFs.
  • marker_output/

    • Marker-generated JSON.
  • normalized_output/

    • Serialized Document Objects generated by the normalization pipeline.

Recommended Reading Order

For new contributors:

  1. README
  2. docs/document_schema_specification_v1.1.md
  3. docs/normalizer.md
  4. marker_adapter/
  5. document/
  6. normalizer/api.py
  7. normalizer/internal/
  8. tests/

Abhishek-Kumar-Rai5 force-pushed the gsoc/week1-document-layer branch from ff3279f to ba43846 on July 1, 2026, 12:14