feat: add native SQLBot-powered data extraction workflow #11

Merged
asd765973346 merged 1 commit into OpenDCAI:main from 2212-spc:feat/intelligent-data-extract
Mar 23, 2026

@2212-spc
Contributor

PR Description

Summary

This PR introduces a native structured-data extraction workflow into Open-NotebookLM and embeds the SQLBot core as an internal backend module.

The goal is to let notebook users work with structured CSV data in a multi-turn workflow:

  • select notebook data sources
  • ask natural-language data questions
  • get SQL + result table + CSV export
  • reuse generated outputs in later turns
  • import generated outputs back into notebook sources

What’s included

1. Intelligent data extraction workflow

  • Add a dedicated data_extract workflow for notebook CSV sources
  • Support natural-language querying over structured data
  • Return:
    • answer summary
    • generated SQL
    • tabular preview
    • CSV export
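
As a rough illustration of the four outputs listed above, a single turn's result could be modelled as one object that carries the summary, the SQL, and the rows, and renders the preview and the CSV export on demand. This is a minimal sketch, not the PR's implementation: `ExtractionResult`, its fields, and both methods are hypothetical names.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    """Hypothetical container for the output of one extraction turn."""

    summary: str                  # answer summary shown to the user
    sql: str                      # generated SQL
    columns: list[str]            # column names of the result table
    rows: list[tuple] = field(default_factory=list)

    def preview(self, limit: int = 5) -> list[dict]:
        """Return the first `limit` rows as dicts for a tabular preview."""
        return [dict(zip(self.columns, row)) for row in self.rows[:limit]]

    def to_csv(self) -> str:
        """Serialize the header and all rows as CSV text."""
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(self.columns)
        writer.writerows(self.rows)
        return buf.getvalue()
```

Keeping the rows in one place and deriving the preview and the export from them means both views cannot drift apart.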

2. Multi-datasource querying

  • Support selecting multiple structured data sources
  • Distinguish between:
    • primary datasource
    • linked datasources
  • Forward linked datasource context into the extraction flow
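
The primary/linked distinction above could be represented as follows. The PR does not say how the primary datasource is chosen, so this sketch assumes the first selected source is primary and the rest are linked; `DatasourceSelection` and `build_selection` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasourceSelection:
    """Hypothetical shape of a multi-datasource selection."""

    primary_id: str
    linked_ids: list[str] = field(default_factory=list)


def build_selection(selected_ids: list[str]) -> DatasourceSelection:
    """Split selected source ids into one primary and zero or more linked.

    Assumption: the first selected id is the primary datasource.
    Duplicate ids are dropped while preserving selection order.
    """
    if not selected_ids:
        raise ValueError("at least one datasource must be selected")
    unique = list(dict.fromkeys(selected_ids))
    return DatasourceSelection(primary_id=unique[0], linked_ids=unique[1:])
```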

3. Session / turn / artifact persistence

  • Persist notebook-scoped extraction sessions
  • Persist per-turn messages and results
  • Persist reusable output artifacts
  • Support restoring historical extraction sessions
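
One way to picture the session / turn / artifact hierarchy above is as nested records that round-trip through JSON, which is what restoring a historical session needs. The class and field names here are assumptions for illustration; the PR does not describe its storage schema or backend.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Artifact:
    """Hypothetical reusable output, e.g. a generated CSV file."""

    artifact_id: str
    filename: str


@dataclass
class Turn:
    """Hypothetical record of one question and its generated SQL."""

    question: str
    sql: str
    artifacts: list[Artifact] = field(default_factory=list)


@dataclass
class ExtractSession:
    """Hypothetical notebook-scoped extraction session."""

    notebook_id: str
    session_id: str
    turns: list[Turn] = field(default_factory=list)

    def dumps(self) -> str:
        """Serialize the whole session, including turns and artifacts."""
        return json.dumps(asdict(self))

    @classmethod
    def loads(cls, raw: str) -> "ExtractSession":
        """Rebuild a session from the JSON produced by `dumps`."""
        data = json.loads(raw)
        turns = [
            Turn(t["question"], t["sql"], [Artifact(**a) for a in t["artifacts"]])
            for t in data["turns"]
        ]
        return cls(data["notebook_id"], data["session_id"], turns)
```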

4. Artifact reuse and source import

  • Generated CSV outputs can be reused as input in later turns
  • Generated artifacts can be imported back into notebook sources

5. Embedded SQLBot backend

  • Vendor SQLBot core under sqlbot_backend/
  • Rewrite imports into an isolated namespace
  • Copy required SQLBot resources into the repository:
    • templates
    • few-shot examples
    • terminology data
  • Add external / embedded SQLBot adapter modes
  • Default extraction mode is now embedded
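
To make the import-rewriting step above concrete, here is a minimal regex-based sketch of prefixing a vendored project's top-level imports with `sqlbot_backend.`. It is not the script used in this PR. The function name and the package names in the usage note are hypothetical, and a real rewrite would also need to handle relative imports, multi-line imports, and string-based imports.

```python
import re


def rewrite_imports(source: str, packages: list[str],
                    namespace: str = "sqlbot_backend") -> str:
    """Prefix absolute imports of the given top-level packages.

    Only lines that start with `import <pkg>` or `from <pkg>` are touched.
    Note that `import pkg` becomes `import <namespace>.pkg`, which binds
    the namespace name instead of `pkg`, so call sites may need updating.
    """
    names = "|".join(re.escape(p) for p in packages)
    pattern = re.compile(
        rf"^([ \t]*)(from|import)[ \t]+({names})(?=[.\s]|$)",
        re.MULTILINE,
    )
    return pattern.sub(rf"\1\2 {namespace}.\3", source)
```

For example, with `packages=["apps", "common"]` (assumed names), `from apps.chat import run` becomes `from sqlbot_backend.apps.chat import run`, while `from application import x` is left alone because the match requires a full package name.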

Main implementation areas

Backend

  • Added notebook-scoped data_extract routes and service layer
  • Added adapter layer for SQLBot execution
  • Added embedded SQLBot execution path
  • Added artifact/session persistence and export handling

Frontend

  • Added intelligent data extraction UI to notebook workspace
  • Added multi-datasource workflow interaction
  • Added extraction history and reusable output artifact strip
  • Added artifact import-back-to-sources flow

Embedded engine

  • Added sqlbot_backend/ as a vendored structured-data engine
  • Reduced import-time side effects for embedded usage
  • Added configuration/adapter groundwork for in-process execution

Why this change

Open-NotebookLM currently works well for document-centric workflows, but structured data analysis needs a different interaction model.

This PR adds a notebook-native workflow for structured data so users can:

  • query CSV data without writing SQL
  • iterate across multiple turns
  • reuse generated outputs as workflow inputs
  • keep structured-data work inside the notebook experience

Current behavior / scope

This PR is focused on the data extraction workflow.

It does not attempt to expose the full SQLBot system as a general-purpose backend replacement for all Open-NotebookLM workflows.

Current scope:

  • structured data extraction
  • CSV-oriented multi-turn workflow
  • artifact reuse/import
  • embedded SQLBot path for extraction flow

Compatibility notes

  • SQLBOT_MODE supports both:
    • external
    • embedded
  • This PR defaults the extraction workflow to embedded
  • Existing notebook features outside the data extraction workflow are intended to remain unaffected
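
The mode switch described above can be sketched as a small factory keyed on `SQLBOT_MODE`, defaulting to `embedded`. Only the variable name and its two values come from this PR. The adapter classes, the `run` method, and the `SQLBOT_BASE_URL` variable are hypothetical stand-ins.

```python
import os
from typing import Mapping, Optional, Protocol


class SQLBotAdapter(Protocol):
    """Hypothetical interface shared by both execution modes."""

    def run(self, question: str) -> str: ...


class EmbeddedAdapter:
    """Stand-in for in-process execution via the vendored engine."""

    def run(self, question: str) -> str:
        return f"embedded:{question}"


class ExternalAdapter:
    """Stand-in for calling a separately deployed SQLBot service."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def run(self, question: str) -> str:
        return f"external({self.base_url}):{question}"


def get_adapter(env: Optional[Mapping[str, str]] = None) -> SQLBotAdapter:
    """Pick an adapter from SQLBOT_MODE, defaulting to embedded."""
    env = os.environ if env is None else env
    mode = env.get("SQLBOT_MODE", "embedded").strip().lower()
    if mode == "embedded":
        return EmbeddedAdapter()
    if mode == "external":
        # SQLBOT_BASE_URL is an assumed setting, not taken from the PR.
        return ExternalAdapter(env.get("SQLBOT_BASE_URL", "http://localhost:8000"))
    raise ValueError(f"unsupported SQLBOT_MODE: {mode!r}")
```

Rejecting unknown values instead of silently falling back makes a mistyped mode visible at startup.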

Notes for reviewers

This is a large PR because it combines:

  1. user-facing workflow additions
  2. notebook-scoped persistence for extraction sessions/artifacts
  3. vendoring and embedding of SQLBot core

Recommended review order:

  1. fastapi_app/routers/data_extract.py
  2. fastapi_app/services/data_extract_service.py
  3. fastapi_app/workflow_adapters/*
  4. frontend_zh/src/pages/NotebookView.tsx
  5. sqlbot_backend/

Follow-up work

Potential follow-ups after this PR:

  • continue reducing embedded-mode dependency surface
  • refine SQL generation quality for count/order/top-N queries
  • clean up the display of legacy artifacts and history
  • split the large frontend panel into smaller components

asd765973346 merged commit c3fe261 into OpenDCAI:main on Mar 23, 2026