F062: Built-in Regex Transform Operation #209

@pocky

User Stories

US1: Match and Extract Text with Capture Groups (P1 - Must Have)

As a workflow author,
I want to extract structured data from step outputs using regular expressions with named capture groups,
So that I can parse unstructured text (logs, command output, API responses) into discrete values for downstream steps.

Acceptance Scenarios:

  • Given a step output containing "version: 3.14.2", when I run regex.match with pattern version:\s+(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+), then the outputs contain major=3, minor=14, patch=2, and matched=true
  • Given a step output containing "no version info", when I run regex.match with the same pattern, then matched=false and capture group outputs are empty strings
  • Given a pattern with unnamed capture groups (\d+)-(\d+) applied to "42-99", then outputs contain group_1=42, group_2=99

Independent Test: Create a workflow with a command step that echoes structured text, followed by a regex.match operation step that extracts values, followed by a terminal step that interpolates {{states.extract.outputs.major}}
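The US1 scenarios can be sketched with Go's regexp package as follows. This is a minimal illustration, not the project's implementation: the helper names matchOutputs and groupKey, the flat map of string outputs, and the choice to number unnamed groups by their position in the pattern are all assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// matchOutputs is a hypothetical helper (not the project's API). It applies
// pattern to text and flattens the result into string outputs: "matched",
// "full_match", one key per named group, and group_N for unnamed groups.
func matchOutputs(pattern, text string) (map[string]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}

	out := map[string]string{"matched": "false", "full_match": ""}
	names := re.SubexpNames() // names[0] is always "" (the whole match)

	// Pre-populate every group key with "" so a non-match exposes the
	// same output names as a match.
	for i := 1; i < len(names); i++ {
		out[groupKey(names[i], i)] = ""
	}

	m := re.FindStringSubmatch(text)
	if m == nil {
		return out, nil
	}
	out["matched"] = "true"
	out["full_match"] = m[0]
	for i := 1; i < len(m); i++ {
		out[groupKey(names[i], i)] = m[i]
	}
	return out, nil
}

// groupKey returns the group's name, or group_<index> for unnamed groups
// (assumption: the index is the group's position in the pattern).
func groupKey(name string, index int) string {
	if name != "" {
		return name
	}
	return "group_" + strconv.Itoa(index)
}

func main() {
	pattern := `version:\s+(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)`
	out, err := matchOutputs(pattern, "version: 3.14.2")
	if err != nil {
		panic(err)
	}
	fmt.Println(out["matched"], out["major"], out["minor"], out["patch"])
}
```

Pre-populating the group keys is one way to satisfy the "empty strings on no match" scenario without making downstream interpolation fail on a missing key.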

US2: Replace Text Using Patterns (P1 - Must Have)

As a workflow author,
I want to replace matched text in a string using a regex pattern and replacement template,
So that I can transform step outputs (sanitize data, reformat strings, redact sensitive content) before passing them downstream.

Acceptance Scenarios:

  • Given input text "Hello World 2026", when I run regex.replace with pattern \d+ and replacement YEAR, then output contains "Hello World YEAR"
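  • Given input text "foo-bar-baz", when I run regex.replace with pattern (\w+)-(\w+)-(\w+) and replacement ${3}_${2}_${1}, then output contains "baz_bar_foo" (braces are required here: Go parses $3_ as a reference to a group named 3_, which does not exist and expands to an empty string)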
  • Given input text with no matches for the pattern, when I run regex.replace, then the output equals the original input unchanged

Independent Test: Create a workflow with a regex.replace step that redacts email addresses from command output, verify the output contains [REDACTED] instead of email strings
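A minimal sketch of the replace behaviour, assuming the operation is a thin wrapper over Go's ReplaceAllString. The helper name replaceAll and the way the match count is computed are illustrative assumptions. The last call shows how Go's template parser treats a group reference that is immediately followed by an underscore.

```go
package main

import (
	"fmt"
	"regexp"
)

// replaceAll is a hypothetical helper (not the project's API). It returns the
// transformed text and the number of matches that were replaced.
func replaceAll(pattern, text, replacement string) (string, int, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", 0, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	count := len(re.FindAllStringIndex(text, -1))
	return re.ReplaceAllString(text, replacement), count, nil
}

func main() {
	out, n, err := replaceAll(`\d+`, "Hello World 2026", "YEAR")
	if err != nil {
		panic(err)
	}
	fmt.Println(out, n) // Hello World YEAR 1

	// Braces delimit the group number when it is followed by a letter,
	// digit, or underscore.
	out, _, _ = replaceAll(`(\w+)-(\w+)-(\w+)`, "foo-bar-baz", "${3}_${2}_${1}")
	fmt.Println(out) // baz_bar_foo

	// Without braces, $3_ and $2_ are read as groups named "3_" and "2_",
	// which do not exist and expand to "".
	out, _, _ = replaceAll(`(\w+)-(\w+)-(\w+)`, "foo-bar-baz", "$3_$2_$1")
	fmt.Println(out) // foo
}
```

When the pattern does not match, ReplaceAllString returns the input unchanged and the count is 0, which covers the third scenario.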

US3: Find All Matches in Text (P2 - Should Have)

As a workflow author,
I want to find all occurrences of a pattern in text and access them as a list,
So that I can iterate over extracted values using loop constructs (for-each) in subsequent steps.

Acceptance Scenarios:

  • Given text "error at line 10, error at line 25, error at line 42", when I run regex.find_all with pattern line (\d+), then matches output is a JSON array ["line 10","line 25","line 42"] and groups output is [["10"],["25"],["42"]]
  • Given text with zero matches, when I run regex.find_all, then matches is an empty array [] and count is 0

Independent Test: Create a workflow that extracts all URLs from a text block using regex.find_all, then loops over them with a for-each step
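The find_all scenarios map closely onto Go's FindAllStringSubmatch. The sketch below is illustrative only: the helper name findAll and the convention that a limit of zero or less means "unlimited" are assumptions, since the spec states only that the default is unlimited.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// findAll is a hypothetical helper (not the project's API). It returns the
// full matches and, per match, the capture group values.
// Assumption: limit <= 0 means "no limit".
func findAll(pattern, text string, limit int) ([]string, [][]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, nil, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	n := -1
	if limit > 0 {
		n = limit
	}
	// Start from empty, non-nil slices so zero matches marshal as []
	// rather than null.
	matches := []string{}
	groups := [][]string{}
	for _, m := range re.FindAllStringSubmatch(text, n) {
		matches = append(matches, m[0])
		groups = append(groups, m[1:])
	}
	return matches, groups, nil
}

func main() {
	text := "error at line 10, error at line 25, error at line 42"
	matches, groups, err := findAll(`line (\d+)`, text, 0)
	if err != nil {
		panic(err)
	}
	mj, _ := json.Marshal(matches)
	gj, _ := json.Marshal(groups)
	fmt.Println(string(mj), string(gj), len(matches))
}
```

Initialising the slices explicitly matters for the zero-match scenario: a nil Go slice marshals to null, not [].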

US4: Split Text by Pattern (P3 - Nice to Have)

As a workflow author,
I want to split a string using a regex delimiter pattern,
So that I can break multi-value outputs into individual items for parallel or sequential processing.

Acceptance Scenarios:

  • Given text "one::two:::three", when I run regex.split with pattern :+, then parts output is ["one","two","three"] and count is 3
  • Given a limit input of 2, when splitting "a-b-c-d" by -, then parts is ["a","b-c-d"]

Independent Test: Create a workflow that splits CSV-like command output by a regex separator and verifies the resulting array length
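Both split scenarios match the semantics of Go's Regexp.Split, where the second argument caps the number of returned substrings and the last one holds the unsplit remainder. In this sketch the helper name splitParts and the mapping of a non-positive limit to "unlimited" are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// splitParts is a hypothetical helper (not the project's API).
// Assumption: limit <= 0 means "no limit". Go's Split returns nil for n == 0,
// so the helper maps that case to -1 instead.
func splitParts(pattern, text string, limit int) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	n := -1
	if limit > 0 {
		n = limit
	}
	return re.Split(text, n), nil
}

func main() {
	parts, err := splitParts(`:+`, "one::two:::three", 0)
	if err != nil {
		panic(err)
	}
	fmt.Println(parts, len(parts)) // [one two three] 3

	parts, _ = splitParts(`-`, "a-b-c-d", 2)
	fmt.Println(parts, len(parts)) // [a b-c-d] 2
}
```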


Requirements

Functional Requirements

  • FR-001: The system shall provide a regex.match operation that applies a Go regexp pattern to input text and returns a boolean matched flag, the full match string, and all named/unnamed capture group values as individual outputs
  • FR-002: The system shall provide a regex.replace operation that applies a Go regexp pattern to input text, replaces matches using a replacement template supporting $1/${name} backreferences, and returns the transformed text
  • FR-003: The system shall provide a regex.find_all operation that returns all non-overlapping matches of a pattern in the input text as a JSON array, with a configurable limit (default: unlimited) to cap the number of matches
  • FR-004: The system shall provide a regex.split operation that splits input text by a regex delimiter pattern and returns the resulting parts as a JSON array, with a configurable limit on the maximum number of parts returned (when the limit is reached, the last part holds the unsplit remainder)
  • FR-005: All regex operations shall validate the pattern input at execution time and return a structured error with code EXEC.PLUGIN.OPERATION if the pattern is invalid, including the regexp compilation error message
  • FR-006: All regex operations shall accept a text input (required, string) and a pattern input (required, string) at minimum
  • FR-007: The regex.match operation shall support both named groups (?P<name>...) and positional groups (...), exposing named groups as named outputs and positional groups as group_1, group_2, etc.
  • FR-008: All regex operations shall register with the CompositeOperationProvider under the regex namespace, following the established F054/F056 built-in provider pattern
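The validation requirement in FR-005 could look roughly like the following. The OperationError type, its fields, and the compilePattern helper are hypothetical stand-ins for whatever structured error type the project already defines. Only the error code string and the use of Go's regexp.Compile come from the requirements.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// OperationError is a hypothetical structured error; the project's real
// error type and field names may differ.
type OperationError struct {
	Code    string
	Message string
	Cause   error
}

func (e *OperationError) Error() string { return e.Code + ": " + e.Message }
func (e *OperationError) Unwrap() error { return e.Cause }

// Error code taken from FR-005.
const codeOperation = "EXEC.PLUGIN.OPERATION"

// compilePattern validates the pattern at execution time. On failure the
// message includes both the original pattern and the compiler's error text.
func compilePattern(pattern string) (*regexp.Regexp, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, &OperationError{
			Code:    codeOperation,
			Message: fmt.Sprintf("invalid regex pattern %q: %v", pattern, err),
			Cause:   err,
		}
	}
	return re, nil
}

func main() {
	_, err := compilePattern(`(unclosed`)
	var opErr *OperationError
	if errors.As(err, &opErr) {
		fmt.Println(opErr.Code)
		fmt.Println(opErr.Message)
	}
}
```

Sharing one compile helper across all four operations would keep the error shape identical regardless of which operation received the bad pattern.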

Non-Functional Requirements

  • NFR-001: Regex compilation and execution shall complete in < 10ms for patterns up to 1KB and input text up to 1MB
  • NFR-002: All regex types shall remain internal to internal/infrastructure/regex/ with no new domain entities (following F054/F056 no-domain-pollution principle)
  • NFR-003: Invalid regex patterns shall produce actionable error messages including the original pattern and the Go regexp compilation error
  • NFR-004: The implementation shall use Go's regexp package (RE2 syntax) — no PCRE or backtracking engines — guaranteeing linear-time execution and preventing ReDoS

Success Criteria

  • All P1 user stories implemented and tested
  • All P2 user stories implemented and tested
  • Unit test coverage >= 80%
  • No lint errors
  • Documentation updated
  • go-arch-lint check passes with new infra-regex component
  • Integration tests validate cross-step interpolation of regex outputs

Key Entities

| Entity | Description | Attributes |
| --- | --- | --- |
| RegexOperationProvider | Built-in OperationProvider for regex namespace | operations map, Execute dispatch |
| MatchResult | Internal result of a regex.match execution | matched bool, full_match string, groups map |
| ReplaceResult | Internal result of a regex.replace execution | result string, count int |
| FindAllResult | Internal result of a regex.find_all execution | matches []string, groups [][]string, count int |
| SplitResult | Internal result of a regex.split execution | parts []string, count int |
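For orientation, the result entities could be declared along these lines. Field names follow the listed attributes. The map[string]string type for groups and the JSON tags are assumptions, and the provider type is omitted because its interface is not described here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical declarations: the concrete Go type behind "groups map" and
// the JSON tags are assumptions, not taken from the project's source.
type MatchResult struct {
	Matched   bool              `json:"matched"`
	FullMatch string            `json:"full_match"`
	Groups    map[string]string `json:"groups"`
}

type ReplaceResult struct {
	Result string `json:"result"`
	Count  int    `json:"count"`
}

type FindAllResult struct {
	Matches []string   `json:"matches"`
	Groups  [][]string `json:"groups"`
	Count   int        `json:"count"`
}

type SplitResult struct {
	Parts []string `json:"parts"`
	Count int      `json:"count"`
}

func main() {
	r := FindAllResult{
		Matches: []string{"line 10", "line 25"},
		Groups:  [][]string{{"10"}, {"25"}},
		Count:   2,
	}
	b, err := json.Marshal(r)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```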

Metadata

  • Status: backlog
  • Version: v0.4.0
  • Priority: medium
  • Estimation: M

Dependencies

  • Blocked by: F057
  • Unblocks: none

Clarifications

Section populated during clarify step with resolved ambiguities.

Notes

  • Uses Go regexp package exclusively (RE2 syntax). This guarantees linear-time matching and eliminates ReDoS risk but means lookaheads/lookbehinds and backreferences in patterns are not supported. This is a deliberate trade-off: safety over power.
  • The regex namespace joins github and notify in the CompositeOperationProvider. Adding a third namespace requires only map registration — no API changes.
  • Capture group outputs are strings. Workflow authors needing numeric values can use expression syntax (int(states.step.outputs.group_1)) for type conversion.
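  • The regex.replace replacement template uses the template syntax of Go's Regexp.Expand: $1, ${name}, and $$ for a literal dollar sign. A reference followed by a letter, digit, or underscore needs braces (${1}_x), because $1_x is parsed as a reference to a group named 1_x.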
  • F057 (file operations) is a prerequisite because regex transforms are most valuable when operating on file content read by file.read. Without F057, regex operations are limited to command step outputs and hardcoded strings.
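The RE2 trade-off and the $$ escape noted above can be checked directly against Go's regexp package. In the sketch below, the helper names compiles and prefixDollar are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

var amountRe = regexp.MustCompile(`(?P<amount>\d+)`)

// prefixDollar puts a literal dollar sign in front of every number:
// "$$" emits "$", then "${amount}" expands the named group.
func prefixDollar(text string) string {
	return amountRe.ReplaceAllString(text, "$$${amount}")
}

func main() {
	// Lookahead, lookbehind, and backreferences are rejected at compile
	// time; an ordinary capture-group pattern is accepted.
	for _, p := range []string{`foo(?=bar)`, `(?<!x)y`, `(a)\1`, `(a)+b`} {
		fmt.Printf("%-12s compiles: %v\n", p, compiles(p))
	}

	fmt.Println(prefixDollar("cost 42")) // cost $42
}
```

Because unsupported constructs fail at compile time, workflow authors see them through the same invalid-pattern error path as any other syntax mistake.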
