
Agent Builder: NL-to-SQL Pipeline + ReAct Agent

Two class notebooks plus a homework extension. Run them in order. The first builds a fixed pipeline that translates English to SQL. The second turns it into a real ReAct agent that can choose between two tools; the homework notebook extends that agent with two more tools.

Architecture diagrams

Class 1 - the pipeline (diagrams/pipeline_architecture.drawio.png)


Class 2 - the agent (diagrams/agent_architecture.drawio.png)


Homework runs summary (diagrams/agent_runs_infographic.png)


Open the .drawio files in https://app.diagrams.net or the draw.io desktop app to edit them.

Notebooks

Order  Notebook               What it teaches
1      nl_sql_pipeline.ipynb  Fixed pipeline: question -> SQL -> answer. Three prompt
                              versions measured against a test suite. The hard lesson
                              on domain terms.
2      agent.ipynb            Same data + a second source. Build a multi-tool ReAct
                              agent by hand. The LLM picks the tool.
2+     agent_final_hw.ipynb   Homework extension: adds compare() and random_number()
                              tools to test tool selection and decoy resistance.

What's in this repo

nl_sql_pipeline.ipynb    Class 1 - the pipeline
agent.ipynb              Class 2 - the agent
spacex_launches.db       SQLite database (18 SpaceX missions) - used by both
seed.sql                 SQL to recreate the database
schema.md                schema description for the LLM
                         (column meanings + allowed values + domain terms + few-shot)
vehicle_specs.json       rocket specs (thrust, height, reusability...)
                         used as the agent's second data source
diagrams/                draw.io source + PNG exports for both architectures
pyproject.toml           uv project file (so `uv add ...` works)
README.md                this file

Setup

You need:

  1. uv to manage Python and packages. Install once:

    curl -LsSf https://astral.sh/uv/install.sh | sh

    Windows: see https://docs.astral.sh/uv/getting-started/installation/.

  2. An OpenAI API key with at least a few cents of credit. Create one at https://platform.openai.com/api-keys. The notebooks prompt for it on first run (the input is hidden as you type).

Run

git clone https://github.com/fnusatvik07/agentbuilder-class4-nlsql.git
cd agentbuilder-class4-nlsql

# install jupyter and the LangChain packages (one-time)
uv add jupyter langchain-openai langchain-core pandas

# class 1 - the pipeline
uv run jupyter lab nl_sql_pipeline.ipynb

# class 2 - the agent (run this AFTER class 1)
uv run jupyter lab agent.ipynb

What you will learn

From the pipeline notebook

  • How an LLM uses a schema description to write SQL.
  • Why schema.md (for the LLM) is separate from seed.sql (for SQLite).
  • How to validate LLM output before executing it.
  • How to build a test suite of (question, expected SQL, expected answer) tuples and measure prompt changes against it (sketched just after this list).
  • The hard lesson: when business vocabulary collides with column values (heavy launch vs the literal 'Falcon Heavy' enum), the LLM will confidently produce the wrong answer unless you write the term down.
  • The limitations of a fixed-order pipeline.
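
As a concrete illustration of the validation and test-suite bullets, here is a minimal sketch of a pre-execution guard plus a tiny suite. This is not the notebook's code: the table and column names, the sample question, and the expected value are placeholders, and the real suite in nl_sql_pipeline.ipynb works against the actual schema described in schema.md.

import sqlite3

def validate_sql(sql: str) -> str:
    """Allow only a single read-only SELECT before anything reaches SQLite."""
    cleaned = sql.strip().rstrip(";")
    if not cleaned.lower().startswith("select"):
        raise ValueError(f"refusing non-SELECT statement: {sql!r}")
    if ";" in cleaned:
        raise ValueError("refusing multi-statement SQL")
    return cleaned

def execute_sql(sql: str):
    with sqlite3.connect("spacex_launches.db") as conn:
        return conn.execute(sql).fetchall()

# Hypothetical (question, expected SQL, expected answer) tuples.
TEST_CASES = [
    ("Which rocket flew the most missions?",
     "SELECT rocket, COUNT(*) AS n FROM launches GROUP BY rocket ORDER BY n DESC LIMIT 1",
     "Falcon 9"),
]

def run_suite(generate_sql):
    passed = 0
    for question, expected_sql, expected_answer in TEST_CASES:
        sql = validate_sql(generate_sql(question))   # LLM output is checked first
        if sql.lower() != expected_sql.lower():
            print(f"SQL drift on: {question}")
        answer = execute_sql(sql)                    # only validated SQL is executed
        passed += int(expected_answer in str(answer))
    print(f"{passed}/{len(TEST_CASES)} passed")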

From the agent notebook

  • Why a fixed pipeline cannot answer multi-source questions.
  • How LangChain's @tool decorator and bind_tools() work.
  • How to write the ReAct loop by hand: Thought -> Action -> Observation -> repeat -> Final Answer (sketched just after this list).
  • How to read an agent's trace and spot bad tool choices.
  • The new failure modes that come with agents (running in circles, premature termination, hallucinated facts).
  • How adding a compare() helper tool triggers interesting behavior: the agent uses it correctly but echoes its raw output as the final answer rather than synthesizing prose around it.
  • How a decoy tool (random_number()) is ignored when the system prompt clearly labels it as irrelevant — the LLM reads docstrings and acts on them.
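
A minimal sketch of how those pieces fit together, assuming the two tools the notebook describes (query_launches over SQLite and vehicle_specs over the JSON file). The tool bodies, the model name, and the JSON layout are assumptions here; agent.ipynb's actual loop may differ in detail.

import json
import sqlite3

from langchain_core.messages import HumanMessage, SystemMessage, ToolMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def query_launches(sql: str) -> str:
    """Run a read-only SELECT against spacex_launches.db and return the rows."""
    with sqlite3.connect("spacex_launches.db") as conn:
        return str(conn.execute(sql).fetchall())

@tool
def vehicle_specs(rocket: str) -> str:
    """Look up a rocket's specs (thrust, height, reusability...) in vehicle_specs.json."""
    with open("vehicle_specs.json") as f:
        return json.dumps(json.load(f).get(rocket, {}))  # assumes the file is keyed by rocket name

TOOLS = {t.name: t for t in (query_launches, vehicle_specs)}
llm = ChatOpenAI(model="gpt-4o-mini").bind_tools(list(TOOLS.values()))  # model choice is an assumption

def react(question: str, max_iters: int = 6) -> str:
    messages = [SystemMessage("Answer questions about SpaceX launches using the tools."),
                HumanMessage(question)]
    for _ in range(max_iters):
        ai = llm.invoke(messages)        # Thought, possibly with an Action attached
        messages.append(ai)
        if not ai.tool_calls:            # no Action requested -> this is the Final Answer
            return ai.content
        for call in ai.tool_calls:       # run each Action, feed the Observation back
            result = TOOLS[call["name"]].invoke(call["args"])
            messages.append(ToolMessage(content=str(result), tool_call_id=call["id"]))
    return "Stopped after max_iters without a final answer."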

Optional: rebuild the database

rm -f spacex_launches.db
sqlite3 spacex_launches.db < seed.sql

What I learned from this homework

  1. Pipeline vs agent: when does each win?

    The agent wins when a question spans two data sources. In Run 7 I asked whether the rocket with the most missions is reusable — the agent used SQL to find Falcon 9 (10 missions), then immediately called vehicle_specs to check the reusable field. A pipeline has no mechanism to chain across sources like that; it would have stopped at the SQL result with no way to look up reusability. The pipeline would have been the safer choice in Run 6 — "How many missions did NASA fly?" is a single WHERE customer = 'NASA' query with one right answer. The agent answered it correctly but added unnecessary overhead and a chance to pick the wrong tool, with zero benefit over a fixed SQL call.
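
    For contrast, the Run 6 shape needs no agent at all; a fixed pipeline boils down to a single query like the sketch below (table and column names assumed from schema.md).

    import sqlite3

    with sqlite3.connect("spacex_launches.db") as conn:
        (nasa_missions,) = conn.execute(
            "SELECT COUNT(*) FROM launches WHERE customer = 'NASA'"
        ).fetchone()
    print(f"NASA flew {nasa_missions} missions.")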

  2. Where did the agent surprise you? Either something it got right that you did not expect, or something it got wrong in an interesting way.

    In Run 10 I asked "What was the heaviest payload we ever flew, and is that rocket still active?". The SQL came back clean — Falcon 9, 13,620 kg, Starlink Group 4-7. But in the same step the agent also called vehicle_specs for Falcon Heavy, not Falcon 9. It pulled the wrong rocket entirely for the follow-up lookup. The final answer got the payload fact right but then pivoted to talking about Falcon Heavy's capabilities, which had nothing to do with the question. The interesting part is that it didn't hallucinate a number — it hallucinated a lookup target, which is a subtler failure and harder to catch without reading the full trace.

    In Run 12 I asked which rocket is designed for Mars, has been on test missions, and when its first flight was. The question has two criteria — "designed for Mars" points to Starship, but "test missions" in the launches DB are actually Falcon 1 flights. I expected the agent to query both sources and potentially get confused. Instead it called vehicle_specs('Starship') in a single step and answered correctly, never touching the SQL tool at all. It silently resolved the ambiguity using only the Mars clue and got lucky — if the two criteria had genuinely pointed to different rockets it would have given a confident wrong answer with no error signal.

    Run 20 showed the same pattern at the model level. When I swapped in gpt-3.5-turbo for the question "What is the thrust of the rocket we use for crewed missions?" — which requires query_launches then vehicle_specs — the weaker model skipped SQL entirely and called vehicle_specs('Falcon 9') directly from training knowledge. The answer was coincidentally correct because Falcon 9 is the crewed rocket, but the data layer was bypassed. A stronger model chains the tools in order; a weaker model collapses the chain into a single knowledge lookup and gives no signal that it did so.

  3. What did you change from the workbook defaults, and why? Prompt tweaks, tool descriptions, model choice, retry logic, anything.

    Added crew_capacity to vehicle_specs.json — in Run 13 the agent picked it up through a multi-tool chain without any prompt changes. Also iterated the system prompt across two versions after Run 14's failure. Run 15 added explicit strict comparison rules — enumeration improved but the > vs >= boundary still failed. Run 16 replaced that with a chain-of-thought structure (identify candidates → fetch specs → apply as strict predicate → answer) and the boundary condition finally held — the agent correctly answered "none" for "over 10" when max capacity is 10. CoT fixed what explicit rules could not.
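
    A hedged paraphrase of the Run 16 prompt structure (the notebook's exact wording differs):

    SYSTEM_PROMPT_V3 = """You answer questions about SpaceX launches and rockets using the tools.
    For comparison questions, reason step by step before answering:
    1. Identify the candidate rockets (use query_launches if needed).
    2. Fetch the relevant spec for each candidate with vehicle_specs.
    3. Apply the condition as a strict predicate: 'over 10' means strictly greater than 10.
    4. Only then answer. If no candidate satisfies the predicate, say so plainly.
    """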

    Added two extra tools for Runs 17 and 18: compare(a, b, label) to format numbers side by side, and random_number() as a decoy with no relation to SpaceX data. The random_number() docstring originally contained an explicit prohibition ("do not call for mission or rocket questions"). That line was removed and Run 18 was re-run — the agent still ignored the tool. Decoy resistance is semantic, not rule-based: the LLM infers irrelevance from the tool name and description alone without needing an explicit guard.
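
    Roughly how the two extra tools look (argument names taken from the write-up, bodies illustrative; the random_number docstring shown is the post-experiment version with the prohibition removed):

    import random
    from langchain_core.tools import tool

    @tool
    def compare(a: float, b: float, label: str) -> str:
        """Format two numeric values side by side under a shared label."""
        return f"{label}: {a} vs {b} (difference: {a - b})"

    @tool
    def random_number() -> int:
        """Return a random integer between 1 and 100."""
        return random.randint(1, 100)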

    Raised temperature to 0.7 and ran the same comparison question three times (Runs 19a/b/c). Tool selection was identical across all three — same sequence, same numeric values. The only thing that varied was the label string passed to compare: 'thrust (kN)' / 'thrust' / 'Thrust (kN)'. Temperature affects free-text slots like string arguments and prose, but has no influence on which tools are called or what values are extracted from observations.

  4. What is one thing you would do differently next time?

    Run 14 showed that the agent handles numbers fine when it can push logic to SQL — COUNT, ORDER BY, MAX all worked correctly across runs because the database engine does the comparison. It failed when data came back as JSON from vehicle_specs and the LLM had to apply the > 10 filter itself in its reasoning step with no SQL engine to offload to. Next time I would either express the condition as SQL inside the tool call, or build the filter into the tool itself, rather than asking the LLM to compare JSON values in its head.
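
    One hypothetical way to build the filter into the tool, so the strict comparison runs in Python rather than in the LLM's reasoning step (tool name and JSON layout are assumptions):

    import json
    from langchain_core.tools import tool

    @tool
    def rockets_with_capacity_over(min_capacity: int) -> str:
        """Return rockets whose crew_capacity is strictly greater than min_capacity."""
        with open("vehicle_specs.json") as f:
            specs = json.load(f)  # assumes a dict keyed by rocket name
        hits = [name for name, s in specs.items() if s.get("crew_capacity", 0) > min_capacity]
        return json.dumps(hits) if hits else "none match"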

    The docstring experiment in Run 18 changed how I would write tool descriptions next time. I had added an explicit prohibition to random_number() — "do not call for mission or rocket questions" — then removed it and the agent still ignored the tool. That means I was writing defensive docstrings that weren't doing anything. Next time I would keep docstrings focused purely on what the tool does and trust the LLM to infer relevance from context, rather than cluttering descriptions with negative instructions.

    Run 21 showed that the messages list — the agent's only memory — grows by two items per tool call and one for the final answer. A 4-step chain that used SQL, two spec lookups, and a compare call ended at 10 messages, with the full system prompt re-sent on every API call. Next time I would set max_iters deliberately based on expected chain length, not just as a safety valve, and consider which tool calls can be parallelised (Run 21 showed the agent firing query_launches and vehicle_specs('Starship') simultaneously in step 1, saving a round trip) versus which must be sequential.
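
    The arithmetic behind that 10-message trace, assuming the usual layout where parallel tool calls share a single assistant message:

    def message_count(tool_calls: int, assistant_turns_with_tools: int) -> int:
        # [system, human] to start, one AI message per assistant turn that requested tools,
        # one ToolMessage per tool call, and one final AI message carrying the answer
        return 2 + assistant_turns_with_tools + tool_calls + 1

    # Run 21: four tool calls across three assistant turns (two calls fired in parallel in step 1)
    print(message_count(tool_calls=4, assistant_turns_with_tools=3))  # -> 10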

About

Workbook 2 homework for the Agent Builder course — re-implementing an NL-to-SQL pipeline as a real ReAct agent with tools, a model loop, and failure analysis across 5 live test runs.
