feat: python tools requirement by akihikokuroda · Pull Request #1040 · generative-computing/mellea

akihikokuroda · 2026-05-07T21:22:36Z

Requirement PR

Use this template when adding or modifying requirements in mellea/stdlib/requirements/.

Description

Link to Issue: fix: Requirements library for the Python tool #1023

Add requirements for Python code generation.

Implementation Checklist

Base Class

Extends appropriate base class:
- Requirement - standard requirement
- ALoraRequirement - uses specialized Intrinsic/Adapter for generation-based validation

Validation Logic

validation_fn defined (if using Python-based validation)
- re-usable functionality within the validation_fn should be separated out into mellea/stdlib/tools/
validate returns a ValidationResult with
- a thunk and context if using a backend to generate
- a specific reason and score when possible

Integration

Requirement exported in mellea/stdlib/requirements/__init__.py or, if you are adding a library of requirements, from your sub-module

Testing

Tests added to tests/requirements/
New code has 100% coverage
Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

AI coding assistants used: claude

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

github-actions · 2026-05-07T21:22:56Z

The PR description has been updated. Please fill out the template for your PR to be reviewed.

AngeloDanducci · 2026-05-08T20:26:34Z

+                result=False,
+                reason="Your code creates plots with pyplot but never calls `plt.savefig()` to save them.\n\n"
+                "Add this before your plotting code or at the end:\n"
+                "  plt.savefig('{output_path}')\n"


I think this should match the approach in _make_output_artifacts_validator in the way it handles path/output_path.

I assume this was intended to be a f" string instead of a " string.

AngeloDanducci · 2026-05-08T20:26:49Z

+                "Fix this by adding to the top of your code:\n"
+                "  import matplotlib\n"
+                "  matplotlib.use('Agg')\n\n"
+                "Then replace `plt.show()` with `plt.savefig('{output_path}'); plt.close()`",


Same as below ie

I think this should match the approach in _make_output_artifacts_validator in the way it handles path/output_path.

I assume this was intended to be a f" string instead of a " string.

Think this and the the other f" string still need to be changed - the strings starting on line 287 and 311.

My other review comments have been address 👍

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

jakelorocco

I think these new requirements also feel like they aren't building on pre-existing things in our library: Built on top of the existing uses_tool, tool_arg_validator, and PythonExecutionReq scaffolding. If they can't be, please detail why / improve the underlying implementations.

jakelorocco · 2026-05-11T17:27:25Z

@@ -0,0 +1,174 @@
+# pytest: ollama, e2e, qualitative
+"""Granite 4.1 repairs the three canonical plotting failures with Python tool.


Since this is an example that lives in our main repo, the underlying model might change (especially since you are using start_session). Can you instead refer to the model or find some other wording to keep maintenance cost lower?

jakelorocco · 2026-05-11T17:29:01Z

+Requirements:
+- Use the python tool to execute your code
+- Import numpy and matplotlib
+- Generate x values from 0 to 2π
+- Plot sin(x) against x
+- Save the plot to the specified file path
+
+Use the python tool with your complete code."""
+        instruction = Instruction(description=description)


Is there a reason that these requirements aren't appended as actual "requirements" to the instruction but instead embedded in the description? If possible, I think the preferred way of writing this code would be:

m.instruct(<description>, <requirements>, <sampling_strategy>

We tend not to directly invoke the sampling strategy. Are you doing so just so that you can utilize the bundled python reqs from above?

jakelorocco · 2026-05-11T17:31:56Z

+def get_unauthorized_imports(
+    code: str, allowed_imports: list[str] | None = None
+) -> list[str]:
+    r"""Extract unauthorized imports from Python code.


Does this need to be a raw string? It's just a docstring.

jakelorocco · 2026-05-11T17:33:53Z

+FAILURE MATRIX — How each requirement catches the canonical plotting failures:
+
+Scenario: Model generates plotting code with matplotlib
+
+Attempt 1: No tool call
+  → MustInvokePythonTool fails
+  → Repair: "Call the `python` tool with your code"
+
+Attempt 2: Tool called but no 'code' arg
+  → PythonToolHasCodeArg fails
+  → Repair: "The python tool requires a 'code' argument"
+
+Attempt 3: Code has syntax error
+  → PythonCodeParses fails
+  → Repair: "Your code has a syntax error at line X: {error}"
+
+Attempt 4: Code imports matplotlib (not in allowed_imports)
+  → PythonImportsAllowed fails
+  → Repair: "matplotlib is not allowed. Use only: {allowed_list}"
+
+Attempt 5: Code uses plt.show() without headless backend
+  → MatplotlibHeadless fails
+  → Repair: "Add matplotlib.use('Agg') and replace plt.show() with plt.savefig(...)"
+
+Attempt 6: Code has plt.plot() but no plt.savefig()
+  → PlotsAreSaved fails
+  → Repair: "Add plt.savefig('{output_path}') to save the plot"
+
+Attempt 7: Code runs, but output file not created
+  → OutputArtifactsExist fails
+  → Repair: "File '{output_path}' was not created. Check plt.savefig() call"
+
+Attempt 8: Success
+  → All requirements pass
+  → Result: plot file exists and is non-empty


I think you can drop this part of the docstring. The sampling strategies typically determine what actually happens.

jakelorocco · 2026-05-11T17:42:50Z

+def _sets_headless_backend(code: str) -> bool:
+    """Check if code sets matplotlib to use a headless backend."""
+    clean_code = _strip_comments(code)
+    headless_backends = ("Agg", "Svg", "Cairo", "PDF", "PS", "WebAgg", "nbAgg")
+    for backend in headless_backends:
+        if (
+            f"matplotlib.use('{backend}')" in clean_code
+            or f'matplotlib.use("{backend}")' in clean_code
+        ):
+            return True
+    return False


Same as above. If we want some matplotlib specific requirements, then we should introduce structure into our requirements folder so that the specificity is easy to find and makes sense.

jakelorocco · 2026-05-11T17:43:02Z

+def _uses_pyplot_plot(code: str) -> bool:
+    """Check if code calls pyplot plotting functions."""
+    plot_functions = (
+        "plt.plot",
+        "plt.bar",
+        "plt.scatter",
+        "plt.hist",
+        "plt.imshow",
+        "plt.figure",
+        "plt.subplot",
+        ".plot(",
+        ".bar(",
+        ".scatter(",
+        ".hist(",
+    )
+    return any(func in code for func in plot_functions)
+
+
+def _calls_savefig(code: str) -> bool:
+    """Check if code calls plt.savefig() or fig.savefig()."""
+    return "savefig" in code


Same as above.

jakelorocco · 2026-05-11T17:45:16Z

+class PythonToolRequirements:
+    """Pre-composed bundle of requirements for Python code generation via the tool.
+
+    This bundle validates the complete Python code generation flow: tool invocation,
+    syntax, imports, execution, and output. It's designed to work with repair loops
+    (SOFAI, MultiTurnStrategy) to iteratively fix common plotting failures.
+
+    Markers:
+    - **Deterministic** (unit-testable): tool invocation, syntax, imports, headless backend,
+      savefig presence, file existence, output limits
+    - **Qualitative** (needs model to evaluate): execution without error (captured via stderr)
+
+    Args:
+        output_path (str | None): Path where plots should be saved. If specified, enables
+            output artifact validation. Defaults to None.
+        allowed_imports (list[str] | None): Allowlist of importable top-level modules.
+            None (default) allows any import. Set to list like ["numpy", "matplotlib"]
+            to restrict imports.
+        output_limit_bytes (int): Maximum bytes of stdout/stderr allowed. Defaults to 50000.
+        check_output_artifacts (bool): If True, validate that output file exists and is
+            non-empty after execution. Defaults to True if output_path is specified.
+
+    Attributes:
+        requirements (list[Requirement]): The composed list of requirements, suitable
+            for use with sampling strategies.
+    """
+
+    def __init__(
+        self,
+        output_path: str | None = None,
+        allowed_imports: list[str] | None = None,
+        output_limit_bytes: int = 50_000,
+        check_output_artifacts: bool | None = None,
+    ):
+        """Initialize the Python tool requirements bundle."""
+        self.output_path = output_path
+        self.allowed_imports = allowed_imports
+        self.output_limit_bytes = output_limit_bytes
+
+        # Auto-enable output artifact checking if output_path is specified
+        if check_output_artifacts is None:
+            check_output_artifacts = output_path is not None
+
+        self._check_output_artifacts = check_output_artifacts
+
+        self.requirements = self._build_requirements()
+
+    def _build_requirements(self) -> list[Requirement]:
+        """Build the list of requirements for this bundle."""
+        reqs: list[Requirement] = []
+
+        # Tool invocation requirements (deterministic)
+        reqs.append(
+            Requirement(
+                description="Use the python tool to execute code.",
+                validation_fn=_validate_python_tool_invoked,
+                check_only=False,
+            )
+        )
+
+        reqs.append(
+            Requirement(
+                description="The python tool call must include a code argument.",
+                validation_fn=_validate_python_tool_has_code_arg,
+                check_only=False,
+            )
+        )
+
+        # Code quality requirements (deterministic)
+        reqs.append(
+            Requirement(
+                description="The Python code must parse correctly.",
+                validation_fn=_make_code_parses_validator(),
+                check_only=False,
+            )
+        )
+
+        # Import validation (deterministic)
+        if self.allowed_imports is not None:
+            reqs.append(
+                Requirement(
+                    description=f"Imports must be from allowed list: {', '.join(self.allowed_imports)}",
+                    validation_fn=_make_imports_allowed_validator(self.allowed_imports),
+                    check_only=False,
+                )
+            )
+
+        # Matplotlib-specific requirements (deterministic)
+        reqs.append(
+            Requirement(
+                description=(
+                    "If using pyplot, must set headless backend and use savefig."
+                ),
+                validation_fn=_make_matplotlib_headless_validator(),
+                check_only=False,
+            )
+        )
+
+        reqs.append(
+            Requirement(
+                description="If creating plots, must call savefig to save them.",
+                validation_fn=_make_plots_saved_validator(),
+                check_only=False,
+            )
+        )
+
+        # Output artifact validation (deterministic, post-execution)
+        if self._check_output_artifacts and self.output_path:
+            reqs.append(
+                Requirement(
+                    description=f"Output file must be created at {self.output_path}",
+                    validation_fn=_make_output_artifacts_validator(self.output_path),
+                    check_only=False,
+                )
+            )
+
+        # Output limiting (deterministic)
+        reqs.append(
+            Requirement(
+                description=f"Output must not exceed {self.output_limit_bytes} bytes.",
+                validation_fn=_make_output_limit_validator(self.output_limit_bytes),
+                check_only=False,
+            )
+        )
+
+        return reqs
+
+    def __repr__(self) -> str:
+        """Return a developer-readable representation."""
+        return (
+            f"PythonToolRequirements("
+            f"output_path={self.output_path!r}, "
+            f"allowed_imports={self.allowed_imports!r}, "
+            f"output_limit_bytes={self.output_limit_bytes}, "
+            f"requirements={len(self.requirements)} items"
+            f")"
+        )


This is a completely new class / notation. I'm not sure we want to introduce something bundled like this. If anything, it should likely just be exported as a list of requirements. I don't know if it makes sense to introduce this grouping though. A lot of these python requirements are specific to matplotlib, not python in general.

jakelorocco · 2026-05-11T17:46:20Z

I'm not sure this is the correct place for this file. Lets move it closer to python_tools or in the stdlib.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda · 2026-05-12T01:06:40Z

I believe I addressed all comments so far.

jakelorocco

A few more comments. I like how you handled the requirement bundling; I think that's a nice touch.

jakelorocco · 2026-05-12T14:53:19Z

+    Handles various matplotlib import styles and fallback to string matching.
+    """
+    if _find_function_calls(code, ["matplotlib.use"]):
+        headless_backends = {"Agg", "Svg", "Cairo", "PDF", "PS", "WebAgg", "nbAgg"}


Can this be extracted as a constant? I imagine that we would always want to use the same set of headless backends no matter the context?

jakelorocco · 2026-05-12T14:54:15Z

+        except (SyntaxError, ValueError):
+            return _code_contains_strings(
+                code, [f"matplotlib.use('{b}')" for b in headless_backends]
+            )


Can you change the control flow so that this error is caught elsewhere, and we only have the code_contains_strings call at the bottom of the function?

jakelorocco · 2026-05-12T14:55:57Z

+def _uses_pyplot_show(code: str) -> bool:
+    """Check if code calls plt.show() or similar show() methods.
+
+    Uses AST analysis to robustly detect show() calls regardless of import
+    aliases (e.g., `import matplotlib.pyplot as mpl`). AST approach detects
+    actual method calls, avoiding false positives from string literals.
+    Falls back to string matching only if code doesn't parse.
+    """
+    if _find_attribute_calls(code, ["show"]):
+        return True
+    try:
+        ast.parse(code)
+    except (SyntaxError, ValueError):
+        return _code_contains_strings(code, ["plt.show", ".show()"])
+    return False


Are we okay with this having false-positives? Doesn't this check for any invocation of .show even if it's not plt?

I think there's still a large risk of false positives here. Also, it looks like it just calls ast.parse(code) without doing any additional checks on that parsed code?

This has still not been addressed.

jakelorocco · 2026-05-12T14:56:57Z

+def _uses_pyplot_plot(code: str) -> bool:
+    """Check if code calls pyplot plotting functions.
+
+    Uses AST analysis to detect plot-related method calls. Handles import
+    aliases and detects actual method calls, avoiding false positives from
+    string literals or method references. Falls back to string matching
+    only if code doesn't parse.
+    """
+    plot_methods = {"plot", "bar", "scatter", "hist", "imshow", "figure", "subplot"}
+    if _find_attribute_calls(code, list(plot_methods)):
+        return True
+    try:
+        ast.parse(code)
+    except (SyntaxError, ValueError):
+        return _code_contains_strings(
+            code, [f".{m}(" for m in plot_methods] + [f"plt.{m}" for m in plot_methods]
+        )
+    return False


Same as above, there's a risk that false-positives are detected. Can this be combined with one of your other checks that checks for matplotlib being imported as well?

Still the same as my above comment. Seems like false positives and the ast.parsing isn't doing anything here.

This has still not been addressed.

jakelorocco · 2026-05-12T14:57:13Z

+def _calls_savefig(code: str) -> bool:
+    """Check if code calls plt.savefig() or fig.savefig().
+
+    Uses AST analysis to robustly detect savefig() calls regardless of
+    how matplotlib was imported. Detects actual method calls, avoiding
+    false positives from string literals. Falls back to string matching
+    only if code doesn't parse.
+    """
+    if _find_attribute_calls(code, ["savefig"]):
+        return True
+    try:
+        ast.parse(code)
+    except (SyntaxError, ValueError):
+        return _code_contains_strings(code, ["savefig"])
+    return False


Same as above.

This has still not been addressed.

jakelorocco · 2026-05-12T15:03:03Z

+def get_unauthorized_imports(
+    code: str, allowed_imports: list[str] | None = None
+) -> list[str]:
+    """Extract unauthorized top-level imports from Python code."""
+    if allowed_imports is None:
+        return []
+
+    unauthorized: set[str] = set()
+    try:
+        tree = ast.parse(code)
+    except (SyntaxError, ValueError):
+        # Syntax errors are validated separately by dedicated validators.
+        return []
+
+    for node in ast.walk(tree):
+        if isinstance(node, ast.Import):
+            for alias in node.names:
+                module = alias.name.split(".")[0]
+                if module not in allowed_imports:
+                    unauthorized.add(module)
+        elif isinstance(node, ast.ImportFrom) and node.module:
+            module = node.module.split(".")[0]
+            if module not in allowed_imports:
+                unauthorized.add(module)
+
+    return sorted(unauthorized)


I believe I mentioned this previously, but is there a reason not to utilize our pre-existing allowed imports checker function?

Yes, I can remove this file and use the get_unauthorized_imports in interpreter.py by making it public.

jakelorocco · 2026-05-12T15:04:50Z

    return score


+def extract_python_code(ctx: Context) -> ValidationResult:


This looks very similar to the function below it. Can you please describe why it's needed / why the two functions can't be combined?

Why Both Are Needed:

extract_python_code - For validators that check code properties (syntax, imports, plot requirements) which can be done on either source:
- If user called the python tool directly → use that code
- If user wrote markdown → use that code
- These validators run before or after tool invocation

_has_python_code_listing - For validators that actually execute code, which should only work with markdown:
- Execution validators must extract from markdown (the actual code text)
- Tool calls are metadata/structured, not plain text to execute
- This validator runs in the execution path (_python_executes_without_error)

Refactor to eliminate duplication

jakelorocco · 2026-05-12T15:05:58Z

+"""Requirement factories for Python tool invocation and code validation.
+
+This module provides generic requirements for Python-tool usage and code
+correctness. Plotting-specific checks are exposed separately through
+``plotting.python_plotting_requirements(...)`` so they are not implied to be
+universal Python-tool requirements.
+"""


Please modify this docstring. I don't think this encompasses the plotting tools anymore.

This has still not been addressed.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda · 2026-05-14T16:20:18Z

I believe all comment addressed. Thanks!

AngeloDanducci

LGTM I'll let Jake give the final thumbs up given the changes there were more in depth.

markstur

issues with example

markstur

suggested change in example

markstur · 2026-05-14T20:25:40Z

+    if _find_attribute_calls(code, list(PYPLOT_PLOT_METHODS)):
+        return True
+    try:
+        ast.parse(code)


I think this is parsing code that has already been parsed, but I didn't look around much to see if it was worth reuse

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

I think my requested changes are covered

markstur · 2026-05-15T00:04:45Z

+
+Prerequisites:
+    matplotlib must be installed for code execution to succeed:
+    $ uv pip install matplotlib


I just run it like this:

uv run --with matplotlib python docs/examples/tools/python_plotting_repair.py

And I think it is really cool that temporary deps work like that but I'm not sure how many people have bought into using --with

I believe this is actually how we document our other examples. Please confirm @akihikokuroda and change if so.

Our examples do use uv run <example>. I think adding the --with there makes sense instead of this uv pip install matplotlib approach.

markstur

I think my feedback was resolved.

I added a new inline about running with --with which I happen to like but not sure how accepted it is (try it :) )

I don't know if Jake's feedback was covered. I didn't track it enough to know.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda · 2026-05-15T00:35:05Z

@jakelorocco I tried to address all you comments. I might miss some but please review, again. Thanks!

jakelorocco · 2026-05-18T19:21:49Z

+if __name__ == "__main__":
+    main()
+
+# Made with Bob


This is the only one I found, but please search and remove any references.

jakelorocco · 2026-05-18T19:23:40Z

+def _uses_pyplot_show(code: str) -> bool:
+    """Check if code calls plt.show() or similar show() methods.
+
+    Uses AST analysis to robustly detect show() calls regardless of import
+    aliases (e.g., `import matplotlib.pyplot as mpl`). AST approach detects
+    actual method calls, avoiding false positives from string literals.
+    Falls back to string matching only if code doesn't parse.
+    """
+    if _find_attribute_calls(code, ["show"]):
+        return True
+    try:
+        ast.parse(code)
+    except (SyntaxError, ValueError):
+        return _code_contains_strings(code, ["plt.show", ".show()"])
+    return False


I think there's still a large risk of false positives here. Also, it looks like it just calls ast.parse(code) without doing any additional checks on that parsed code?

jakelorocco · 2026-05-18T19:24:21Z

+def _uses_pyplot_plot(code: str) -> bool:
+    """Check if code calls pyplot plotting functions.
+
+    Uses AST analysis to detect plot-related method calls. Handles import
+    aliases and detects actual method calls, avoiding false positives from
+    string literals or method references. Falls back to string matching
+    only if code doesn't parse.
+    """
+    plot_methods = {"plot", "bar", "scatter", "hist", "imshow", "figure", "subplot"}
+    if _find_attribute_calls(code, list(plot_methods)):
+        return True
+    try:
+        ast.parse(code)
+    except (SyntaxError, ValueError):
+        return _code_contains_strings(
+            code, [f".{m}(" for m in plot_methods] + [f"plt.{m}" for m in plot_methods]
+        )
+    return False


Still the same as my above comment. Seems like false positives and the ast.parsing isn't doing anything here.

jakelorocco · 2026-05-18T19:24:51Z

+def _calls_savefig(code: str) -> bool:
+    """Check if code calls plt.savefig() or fig.savefig().
+
+    Uses AST analysis to robustly detect savefig() calls regardless of
+    how matplotlib was imported. Detects actual method calls, avoiding
+    false positives from string literals. Falls back to string matching
+    only if code doesn't parse.
+
+    Note: May have false positives if code calls savefig() on non-matplotlib objects.
+    Combine with matplotlib import checks in validators to eliminate false positives.
+    """
+    if _find_attribute_calls(code, ["savefig"]):
+        return True
+    return _code_contains_strings(code, ["savefig"])


Same as the other false positives.

jakelorocco · 2026-05-18T19:26:12Z

+        show_patterns: Deprecated; ignored in favor of AST-based detection
+        backend_patterns: Deprecated; ignored in favor of AST-based detection


If these aren't used, let's just remove them. We don't need to deprecate since this hasn't been released yet.

This has still not been addressed.

jakelorocco · 2026-05-18T19:32:54Z

+    def test_output_within_limit_passes(self):
+        """Output under limit should pass."""
+        output = ModelOutputThunk(value="response")
+        setattr(output, "stdout", "x" * 500)
+        setattr(output, "stderr", "")
+        ctx = ChatContext().add(output)
+
+        limit_reqs = requirements_matching("exceed", output_limit_bytes=1000)
+        limit_req = limit_reqs[0]
+
+        result = validation_fn(limit_req)(ctx)
+        assert result.as_bool() is True
+
+    def test_stdout_exceeds_limit_fails(self):
+        """Stdout exceeding limit should fail."""
+        output = ModelOutputThunk(value="response")
+        setattr(output, "stdout", "x" * 1500)
+        setattr(output, "stderr", "")
+        ctx = ChatContext().add(output)
+
+        limit_reqs = requirements_matching("exceed", output_limit_bytes=1000)
+        limit_req = limit_reqs[0]
+
+        result = validation_fn(limit_req)(ctx)
+        assert result.as_bool() is False
+        assert "exceeding" in validation_reason(result).lower()
+        assert "1500" in validation_reason(result)
+
+    def test_stderr_exceeds_limit_fails(self):
+        """Stderr exceeding limit should fail."""
+        output = ModelOutputThunk(value="response")
+        setattr(output, "stdout", "")
+        setattr(output, "stderr", "e" * 1500)
+        ctx = ChatContext().add(output)
+
+        limit_reqs = requirements_matching("exceed", output_limit_bytes=1000)
+        limit_req = limit_reqs[0]
+
+        result = validation_fn(limit_req)(ctx)
+        assert result.as_bool() is False
+        assert "exceeding" in validation_reason(result).lower()
+
+    def test_combined_output_exceeds_limit_fails(self):
+        """Combined stdout+stderr exceeding limit should fail."""
+        output = ModelOutputThunk(value="response")
+        setattr(output, "stdout", "x" * 600)
+        setattr(output, "stderr", "e" * 600)  # Combined: 1200 bytes
+        ctx = ChatContext().add(output)
+
+        limit_reqs = requirements_matching("exceed", output_limit_bytes=1000)
+        limit_req = limit_reqs[0]
+
+        result = validation_fn(limit_req)(ctx)
+        assert result.as_bool() is False
+        assert "1200" in validation_reason(result)
+
+    def test_utf8_multibyte_characters_counted_correctly(self):
+        """Multibyte UTF-8 characters should be counted in bytes, not chars."""
+        output = ModelOutputThunk(value="response")
+        setattr(output, "stdout", "🎉" * 100)  # 4 bytes per emoji = 400 bytes
+        setattr(output, "stderr", "")
+        ctx = ChatContext().add(output)
+
+        limit_reqs = requirements_matching("exceed", output_limit_bytes=300)
+        limit_req = limit_reqs[0]
+
+        result = validation_fn(limit_req)(ctx)
+        assert result.as_bool() is False  # 400 > 300
+        assert "exceeding" in validation_reason(result).lower()
+
+    def test_limit_at_boundary(self):
+        """Output exactly at limit should pass."""
+        output = ModelOutputThunk(value="response")
+        setattr(output, "stdout", "x" * 1000)  # Exactly 1000 bytes
+        setattr(output, "stderr", "")
+        ctx = ChatContext().add(output)
+
+        limit_reqs = requirements_matching("exceed", output_limit_bytes=1000)
+        limit_req = limit_reqs[0]
+
+        result = validation_fn(limit_req)(ctx)
+        assert result.as_bool() is True
+
+    def test_limit_one_byte_over_boundary_fails(self):
+        """Output one byte over limit should fail."""
+        output = ModelOutputThunk(value="response")
+        setattr(output, "stdout", "x" * 1001)  # 1001 bytes
+        setattr(output, "stderr", "")
+        ctx = ChatContext().add(output)
+
+        limit_reqs = requirements_matching("exceed", output_limit_bytes=1000)
+        limit_req = limit_reqs[0]
+
+        result = validation_fn(limit_req)(ctx)
+        assert result.as_bool() is False


These tests don't make sense. When would we ever expect a ModelOutputThunk to have these attributes?

jakelorocco · 2026-05-18T19:33:13Z

+        if isinstance(stdout, str):
+            total_output += stdout
+        if isinstance(stderr, str):
+            total_output += stderr


How does this work? Won't output be a ModelOutputThunk in most (/all) cases? So it won't have these attributes? See my comments on the test cases.

jakelorocco · 2026-05-18T19:35:20Z

+
+Prerequisites:
+    matplotlib must be installed for code execution to succeed:
+    $ uv pip install matplotlib


I believe this is actually how we document our other examples. Please confirm @akihikokuroda and change if so.

jakelorocco · 2026-05-18T19:37:09Z

+                strategy=sampling_strategy,
+                return_sampling_results=True,
+                tool_calls=True,
+                model_options={ModelOption.TOOLS: [MelleaTool.from_callable(python)]},


I had previously made a comment about when a python tool would exist. I understand now how you are using it. Can you please clarify if this is expected? ie when models generate python code is it common practice for it to be done through a tool like this? I would've assumed they just generate it through their normal generation mode without a structured output.

Additionally, to support the current workflow where we only look for python in the python tool calls, we should have a much more built out python tool that gets used for this then. Otherwise, we should just default to looking at the model output and asking it to generate python.

Here is the response by claudecode. It seems correct and explains well.

Yes, this is expected behavior and follows standard LLM patterns. Here's what's happening:

Why use a tool?
When tool_calls=True is passed to instruct() (line 83), the model generates structured tool calls as part of its normal generation. This is different from free-form code generation because:

Sampling validation comes first — The code inside the tool call is validated by requirements (lines 55-58) before execution

Explicit execution control — Tools are only invoked after passing validation (line 108: _call_tools())

Is this common practice?
Yes. Most production LLM systems that generate code use this pattern:

Model generates code wrapped in tool calls

The tool call args are validated structurally

Tool is explicitly invoked by the caller

Results are inspected/handled by application logic

Without a tool, you'd have to parse code from free-form text and trust it immediately, with no structured validation step.

This example shows the full lifecycle — if you just wanted direct code generation without the tool wrapper, you'd skip tool_calls=True and _call_tools(), but you'd lose the safety validation layer.

The crux of my comment is still unaddressed, that we don't have a standard Mellea python tool. If we want to go this route, we ought to have that.

Why use a tool?
When tool_calls=True is passed to instruct() (line 83), the model generates structured tool calls as part of its normal generation. This is different from free-form code generation because:

Sampling validation comes first — The code inside the tool call is validated by requirements (lines 55-58) before execution

Sampling strategies / Mellea doesn't call tools by default. So this is true of either approach.

Explicit execution control — Tools are only invoked after passing validation (line 108: _call_tools())

Same as above.

Is this common practice?
Yes. Most production LLM systems that generate code use this pattern:

Model generates code wrapped in tool calls
The tool call args are validated structurally
Tool is explicitly invoked by the caller
Results are inspected/handled by application logic
Without a tool, you'd have to parse code from free-form text and trust it immediately, with no structured validation step.

But the output is functionally free-form. Your code tool only requests a string which could be any text. We are still parsing it. We aren't accepting some json schema that defines the total grammar of the python language.

jakelorocco · 2026-05-18T19:40:34Z

+def _strip_comments(code: str) -> str:
+    """Remove Python comments from code while preserving strings.
+
+    Splits code by lines and removes comments (text after # that's not in a string).
+    Handles both single and double quoted strings.
+    """
+    lines = code.split("\n")
+    result = []
+    for line in lines:
+        in_string = False
+        string_char = None
+        for i, char in enumerate(line):
+            if char in ('"', "'") and (i == 0 or line[i - 1] != "\\"):
+                if not in_string:
+                    in_string = True
+                    string_char = char
+                elif char == string_char:
+                    in_string = False
+                    string_char = None
+            elif char == "#" and not in_string:
+                result.append(line[:i])
+                break
+        else:
+            result.append(line)
+    return "\n".join(result)


I did farm out my thinking to Claude when looking at this function. It seems risky to roll our own comment / code detector. It appears there are a few potential errors with this:

The hand-rolled version gets all of these wrong: - "a \" # not a comment" — the current code sees \" and toggles in_string off, then treats # as a comment start. - '''multi\nline # still a string''' — the current code is line-by-line, so the # on line 2 looks like it's outside a string. - f"x = {1 # comment in expr}" — f-string expression comments need tokenizer awareness.

Claude suggests using a built in python function or substring searching:

def _strip_comments(code: str) -> str: """Remove Python comments from code, preserving strings and structure. Uses the stdlib tokenizer so triple-quoted strings, escaped quotes, and f-strings are handled correctly. Falls back to the original code if it can't be tokenized (e.g., contains a syntax-level lex error). """ try: tokens = list(tokenize.generate_tokens(io.StringIO(code).readline)) except tokenize.TokenizeError: return code kept = [tok for tok in tokens if tok.type != tokenize.COMMENT] return tokenize.untokenize(kept)

or

def _code_contains_strings(code: str, patterns: list[str]) -> bool: try: tokens = tokenize.generate_tokens(io.StringIO(code).readline) non_comment_text = " ".join( tok.string for tok in tokens if tok.type != tokenize.COMMENT ) except tokenize.TokenizeError: non_comment_text = code return any(pattern in non_comment_text for pattern in patterns)

Otherwise, there must be some standardized way of doing this that has all the edge cases already thought out.

The issue confirmed and fixed.

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

jakelorocco

I believe many of my comments haven't been addressed.

jakelorocco · 2026-05-20T17:43:21Z

+
+Prerequisites:
+    matplotlib must be installed for code execution to succeed:
+    $ uv pip install matplotlib


Our examples do use uv run <example>. I think adding the --with there makes sense instead of this uv pip install matplotlib approach.

jakelorocco · 2026-05-20T17:46:58Z

+                strategy=sampling_strategy,
+                return_sampling_results=True,
+                tool_calls=True,
+                model_options={ModelOption.TOOLS: [MelleaTool.from_callable(python)]},


The crux of my comment is still unaddressed, that we don't have a standard Mellea python tool. If we want to go this route, we ought to have that.

Why use a tool?
When tool_calls=True is passed to instruct() (line 83), the model generates structured tool calls as part of its normal generation. This is different from free-form code generation because:

Sampling validation comes first — The code inside the tool call is validated by requirements (lines 55-58) before execution

Sampling strategies / Mellea doesn't call tools by default. So this is true of either approach.

Explicit execution control — Tools are only invoked after passing validation (line 108: _call_tools())

Same as above.

Is this common practice?
Yes. Most production LLM systems that generate code use this pattern:

Model generates code wrapped in tool calls
The tool call args are validated structurally
Tool is explicitly invoked by the caller
Results are inspected/handled by application logic
Without a tool, you'd have to parse code from free-form text and trust it immediately, with no structured validation step.

But the output is functionally free-form. Your code tool only requests a string which could be any text. We are still parsing it. We aren't accepting some json schema that defines the total grammar of the python language.

jakelorocco · 2026-05-20T17:47:40Z

+def _uses_pyplot_show(code: str) -> bool:
+    """Check if code calls plt.show() or similar show() methods.
+
+    Uses AST analysis to robustly detect show() calls regardless of import
+    aliases (e.g., `import matplotlib.pyplot as mpl`). AST approach detects
+    actual method calls, avoiding false positives from string literals.
+    Falls back to string matching only if code doesn't parse.
+    """
+    if _find_attribute_calls(code, ["show"]):
+        return True
+    try:
+        ast.parse(code)
+    except (SyntaxError, ValueError):
+        return _code_contains_strings(code, ["plt.show", ".show()"])
+    return False


This has still not been addressed.

jakelorocco · 2026-05-20T17:47:56Z

+def _uses_pyplot_plot(code: str) -> bool:
+    """Check if code calls pyplot plotting functions.
+
+    Uses AST analysis to detect plot-related method calls. Handles import
+    aliases and detects actual method calls, avoiding false positives from
+    string literals or method references. Falls back to string matching
+    only if code doesn't parse.
+    """
+    plot_methods = {"plot", "bar", "scatter", "hist", "imshow", "figure", "subplot"}
+    if _find_attribute_calls(code, list(plot_methods)):
+        return True
+    try:
+        ast.parse(code)
+    except (SyntaxError, ValueError):
+        return _code_contains_strings(
+            code, [f".{m}(" for m in plot_methods] + [f"plt.{m}" for m in plot_methods]
+        )
+    return False


This has still not been addressed.

jakelorocco · 2026-05-20T17:48:06Z

+def _calls_savefig(code: str) -> bool:
+    """Check if code calls plt.savefig() or fig.savefig().
+
+    Uses AST analysis to robustly detect savefig() calls regardless of
+    how matplotlib was imported. Detects actual method calls, avoiding
+    false positives from string literals. Falls back to string matching
+    only if code doesn't parse.
+    """
+    if _find_attribute_calls(code, ["savefig"]):
+        return True
+    try:
+        ast.parse(code)
+    except (SyntaxError, ValueError):
+        return _code_contains_strings(code, ["savefig"])
+    return False


This has still not been addressed.

jakelorocco · 2026-05-20T17:48:16Z

+        show_patterns: Deprecated; ignored in favor of AST-based detection
+        backend_patterns: Deprecated; ignored in favor of AST-based detection


This has still not been addressed.

jakelorocco · 2026-05-20T17:49:02Z

+"""Requirement factories for Python tool invocation and code validation.
+
+This module provides generic requirements for Python-tool usage and code
+correctness. Plotting-specific checks are exposed separately through
+``plotting.python_plotting_requirements(...)`` so they are not implied to be
+universal Python-tool requirements.
+"""


This has still not been addressed.

jakelorocco · 2026-05-20T17:52:55Z

I'm going to close this PR. You can re-open if you want, but I feel like there are fundamental questions about how python code generation should work and about how we should analyze that code that have yet to be addressed. I think we might be better off discussing those in the issue / through a discussion so that we can better understand the pros / cons. I feel like a lot of the nuance / discussion is getting lost in this PR format.

akihikokuroda · 2026-05-20T18:39:51Z

@jakelorocco OK, let's start from scratch. I added the design doc in #1023 (comment).

python tools requirement

dbef526

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda requested a review from a team as a code owner May 7, 2026 21:22

akihikokuroda requested review from jakelorocco and markstur May 7, 2026 21:22

github-actions Bot added the enhancement New feature or request label May 7, 2026

AngeloDanducci requested changes May 8, 2026

View reviewed changes

AngeloDanducci reviewed May 8, 2026

View reviewed changes

Comment thread mellea/stdlib/requirements/python_tools.py Outdated

Comment thread mellea/stdlib/requirements/python_tools.py Outdated

markstur reviewed May 8, 2026

View reviewed changes

review comments

119699e

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda requested a review from AngeloDanducci May 9, 2026 12:47

review comments

9055101

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda requested a review from markstur May 9, 2026 13:28

jakelorocco reviewed May 11, 2026

View reviewed changes

akihikokuroda added 6 commits May 11, 2026 15:40

review comments

036cfee

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

review comments

e3d00f3

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

review comments

577eb00

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

matplotlib restructure

41f09cc

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

review cmments

85c5c67

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

fix doc strings

8f391da

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda requested a review from jakelorocco May 12, 2026 01:06

jakelorocco reviewed May 12, 2026

View reviewed changes

review comments

8e24e97

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda requested a review from jakelorocco May 13, 2026 00:10

review comment

b7261ea

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

AngeloDanducci approved these changes May 14, 2026

View reviewed changes

markstur requested changes May 14, 2026

View reviewed changes

Comment thread docs/examples/tools/python_plotting_repair.py

Comment thread docs/examples/tools/python_plotting_repair.py Outdated

markstur previously requested changes May 14, 2026

View reviewed changes

markstur reviewed May 14, 2026

View reviewed changes

Comment thread mellea/stdlib/requirements/plotting/matplotlib.py Outdated

markstur reviewed May 14, 2026

View reviewed changes

review comments

94051c9

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

markstur reviewed May 15, 2026

View reviewed changes

review comments

acbef75

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

jakelorocco requested changes May 18, 2026

View reviewed changes

akihikokuroda added 4 commits May 18, 2026 20:17

review comment

7b3f561

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

review comment

7437c97

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

review comments

dece270

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

review comments

b094918

Signed-off-by: Akihiko Kuroda <akihikokuroda2020@gmail.com>

akihikokuroda requested a review from jakelorocco May 19, 2026 15:58

jakelorocco requested changes May 20, 2026

View reviewed changes

jakelorocco closed this May 20, 2026

akihikokuroda mentioned this pull request May 20, 2026

Requirements library for the Python tool #1023

Open

		@@ -0,0 +1,174 @@
		# pytest: ollama, e2e, qualitative
		"""Granite 4.1 repairs the three canonical plotting failures with Python tool.

		return score


		def extract_python_code(ctx: Context) -> ValidationResult:

		show_patterns: Deprecated; ignored in favor of AST-based detection
		backend_patterns: Deprecated; ignored in favor of AST-based detection

Conversation

akihikokuroda commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Requirement PR

Description

Implementation Checklist

Base Class

Validation Logic

Integration

Testing

Attribution

Uh oh!

github-actions Bot commented May 7, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jakelorocco left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akihikokuroda commented May 12, 2026

Uh oh!

jakelorocco left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akihikokuroda May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

akihikokuroda commented May 7, 2026 •

edited

Loading

akihikokuroda May 12, 2026 •

edited

Loading