From 853a0020795bdd4c8d9daaa6d8abffdc9eeced29 Mon Sep 17 00:00:00 2001 From: Ubuntu Date: Tue, 19 May 2026 08:06:19 +0000 Subject: [PATCH 1/3] =?UTF-8?q?fix:=20strengthen=20orchestrator=20discipli?= =?UTF-8?q?ne=20=E2=80=94=20dispatch=20enforcement,=20branch=20verificatio?= =?UTF-8?q?n,=20todo=20discipline,=20property=20patterns=20for=20BDD=20exa?= =?UTF-8?q?mples?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Post-session analysis of cex-mm project revealed three systemic failures: 1. Orchestrator routinely bypasses owner dispatch and does work directly. The todo template had no Dispatch step — it jumped from Preparation to Load Skills, so the orchestrator never saw the instruction to dispatch. Fix: added explicit Dispatch step (#2) in todo template with MUST NOT do the work itself constraint and owner mapping table. 2. Branch discipline not enforced at state entry. Agents entered states declaring git:dev while on feature branches and vice versa. Fix: Preparation step now verifies branch matches attrs.git. Golden rule 7 now says 'Verify before starting'. New golden rule 8: feature branches must be merged back to dev before new work starts. 3. Todo list goes stale or disappears mid-state as agents focus on work. Fix: added Todo discipline paragraph requiring update after every step and regeneration if missing. 4. Review-gate skill loaded smell-catalogue at #key-takeaways but detecting violations needs the full document (per progressive knowledge loading rules in AGENTS.md). Fix: step 5 now loads full docs for detection, #key-takeaways only for recall. 5. No guidance for choosing Example vs Scenario Outline during BDD example creation. Agents either over-used Scenario Outlines or under-used them. Fix: added property-patterns knowledge file (Wlaschin, 2014) with seven patterns and a decision tree. Updated write-bdd-features skill step 4 to apply patterns systematically. Added research reference. --- .../requirements/property-patterns.md | 107 ++++++++++++++++++ .opencode/skills/review-gate/SKILL.md | 2 +- .opencode/skills/write-bdd-features/SKILL.md | 16 +-- AGENTS.md | 19 ++-- .../quality/wlaschin_2014.md | 50 ++++++++ 5 files changed, 178 insertions(+), 16 deletions(-) create mode 100644 .opencode/knowledge/requirements/property-patterns.md create mode 100644 docs/research/software-engineering/quality/wlaschin_2014.md diff --git a/.opencode/knowledge/requirements/property-patterns.md b/.opencode/knowledge/requirements/property-patterns.md new file mode 100644 index 0000000..a9d9865 --- /dev/null +++ b/.opencode/knowledge/requirements/property-patterns.md @@ -0,0 +1,107 @@ +--- +domain: requirements +tags: [property-based-testing, examples, scenario-outline, test-design, bdd, hypothesis] +last-updated: 2026-05-19 +--- + +# Property Patterns for BDD Example Selection + +## Key Takeaways + +- When writing BDD Examples, use these seven property patterns (Wlaschin, 2014) to decide whether an Example should be a simple `Example:` or a `Scenario Outline:` with multiple input combinations. +- **Simple `Example:`** is appropriate when the behaviour is a single observable outcome with fixed inputs — no interesting property to generalise. +- **`Scenario Outline:`** is appropriate when the same behavioural outcome holds across multiple input/output combinations — the property pattern reveals which combinations matter. +- The seven patterns also surface missing Examples: if a property pattern applies but has no corresponding Example, the specification is incomplete. + +## Concepts + +**Seven Property Patterns** (Wlaschin, 2014). When choosing what to verify in a specification, these patterns help discover what properties (invariants, relationships) the system should satisfy: + +| Pattern | Core Idea | When to use Scenario Outline | +|---------|-----------|------------------------------| +| Different paths, same destination | Two operation sequences produce the same result | When multiple paths exist to the same outcome (e.g., different orderings, different constructors) | +| There and back again | An operation and its inverse return to the starting state | When serialise/deserialise, encode/decode, add/remove pairs exist | +| Some things never change | An invariant is preserved after a transformation | When a transform should preserve size, membership, ordering, or other invariants | +| The more things change, the more they stay the same | Applying an operation twice is the same as applying it once (idempotence) | When operations should be idempotent (e.g., deduplicate, round, normalise) | +| Solve a smaller problem first | A property true for a small case implies truth for a composed case (structural induction) | When recursive or composable structures are involved (lists, trees, nested objects) | +| Hard to prove, easy to verify | Finding the answer is complex, but checking it is simple | When output can be verified by a simpler check (e.g., sort result is a permutation, parse result concatenates to original) | +| The test oracle | An alternate implementation exists to verify results | When a brute-force or simplified reference implementation can validate the optimised version | + +**Using Patterns to Choose Example vs Scenario Outline**: During feature example creation (write-bdd-features skill), apply these patterns to each Rule: + +1. For each Rule, ask: "Does any of the seven patterns apply to this behaviour?" +2. If **no pattern applies** — the behaviour is a single discrete outcome with fixed inputs — write a simple `Example:`. +3. If a pattern applies — the behaviour holds across a range of inputs — write a `Scenario Outline:` with an `Examples:` table covering the significant input combinations surfaced by the pattern. +4. If a pattern reveals an edge case not covered by existing Examples — add the missing Example. + +**Pattern-to-Example Decision Tree**: + +``` +Does the Rule describe an invariant that holds across inputs? +├─ Yes → Scenario Outline with inputs that exercise the invariant +│ + Hypothesis property test per [[software-craft/test-design#concepts]] +└─ No → Does the Rule have "easy to verify" checkable output? + ├─ Yes → Can multiple inputs produce different valid outputs? + │ ├─ Yes → Scenario Outline with representative input/output pairs + │ └─ No → Simple Example with the key input + └─ No → Simple Example (single observable outcome) +``` + +**Pre-mortem Integration**: During the behavior-level pre-mortem per [[requirements/pre-mortem#concepts]], apply property patterns adversarially: "Given this pattern applies to this Rule, what inputs would break it?" Surface failure modes as additional Examples. + +## Content + +### Pattern Application Examples + +**Different paths, same destination**: A sort function produces the same result regardless of input order. Use Scenario Outline with different input orderings asserting identical sorted output. This also applies to commutative operations: `a + b == b + a`. + +**There and back again**: JSON serialisation round-trips: `decode(encode(obj)) == obj`. Use Scenario Outline with different object shapes. HTTP encode/decode, compression/decompress, and format conversions all fit this pattern. + +**Some things never change**: A `map` operation preserves list length. A `sort` preserves the multiset of elements. Use Scenario Outline with different input sizes and element values, asserting the invariant holds. + +**Idempotence**: Calling `distinct()` twice produces the same result as calling it once. Use Scenario Outline with different input sets, some already distinct, some with duplicates. REST PUT operations are another common case. + +**Structural induction**: If a property holds for a base case (empty list) and for appending one element, it holds for all lists. Use Scenario Outline with list sizes 0, 1, 2, N to cover induction steps. + +**Hard to prove, easy to verify**: Finding a prime factorisation is hard, but multiplying the factors back is trivial. Tokenising a string is hard, but concatenating tokens should equal the original. Use Scenario Outline with different input strings or numbers, asserting the verification check. + +**Test oracle**: A fast sorting algorithm can be verified against a naive bubble sort. A parallel computation can be verified against a sequential version. Use Scenario Outline where each row exercises a different input against both implementations. + +### Integration with BDD Workflow + +When the PO (or SE) writes Examples during `write-bdd-features`: + +1. Write the Rule's declarative behaviour first (Given/When/Then). +2. Check each of the seven patterns against the Rule. +3. For each matching pattern, determine the input combinations that exercise the property. +4. If 1-2 combinations → simple `Example:` per combination. +5. If 3+ combinations with the same step structure → `Scenario Outline:` with `Examples:` table. +6. For invariant/structural Rules → also generate a Hypothesis property test per [[software-craft/test-design#concepts]]. + +### Hypothesis Property Tests from Patterns + +Each invariant/structural Rule should produce both BDD Examples AND a Hypothesis property test. The property pattern guides the Hypothesis strategy: + +| Pattern | Hypothesis Strategy | +|---------|-------------------| +| Different paths, same destination | `@given(inputs, order=strategies.permutations)` | +| There and back again | `@given(arbitrary_input)` then round-trip assert | +| Some things never change | `@given(transform_input)` then assert invariant | +| Idempotence | `@given(input)` then `assert f(f(x)) == f(x)` | +| Structural induction | `@given(recursive_strategy)` with base + step | +| Hard to prove, easy to verify | `@given(input)` then verify output with simple check | +| Test oracle | `@given(input)` then `assert fast(input) == oracle(input)` | + +## Related + +- [[requirements/gherkin]] +- [[requirements/pre-mortem]] +- [[software-craft/test-design]] +- [[software-craft/tdd]] + +## Related + +- [[software-craft/test-design]] +- [[software-craft/tdd]] +- [[requirements/gherkin]] +- [[requirements/pre-mortem]] diff --git a/.opencode/skills/review-gate/SKILL.md b/.opencode/skills/review-gate/SKILL.md index c86e28d..8b112e2 100644 --- a/.opencode/skills/review-gate/SKILL.md +++ b/.opencode/skills/review-gate/SKILL.md @@ -15,7 +15,7 @@ Available knowledge: [[software-craft/code-review]], [[software-craft/test-desig 2. Verify implementation aligns with architectural decisions per [[software-craft/code-review#concepts]]: ADR compliance, quality attributes met. 3. Verify all `# Constraints:` in the .feature file are met in the implementation. For technology constraints, read domain_spec.md `### Technology Requirements` table and execute the Verification instruction for each row (grep imports, check file existence, inspect config). Zero evidence → FAIL. For quality attribute constraints, verify thresholds are enforced. 4. Verify implementation aligns with feature specification: all Examples have corresponding test implementations, behavior matches Gherkin steps. -5. Verify design principles adversarially per the priority order in [[software-craft/tdd#content]], loading ObjCal per [[software-craft/object-calisthenics#key-takeaways]], smells per [[software-craft/smell-catalogue#key-takeaways]], and SOLID per [[software-craft/solid#key-takeaways]]. +5. Verify design principles adversarially per the priority order in [[software-craft/tdd#content]], loading the full documents for detection: ObjCal per [[software-craft/object-calisthenics]], smells per [[software-craft/smell-catalogue]], and SOLID per [[software-craft/solid]]. Use `#key-takeaways` only when recalling principles, not when detecting violations. 6. **FAIL-FAST**: If any design violations found → exit `fail` with specific citations (file:line). Do NOT proceed to structure review. ## Tier 2: Structure Review diff --git a/.opencode/skills/write-bdd-features/SKILL.md b/.opencode/skills/write-bdd-features/SKILL.md index 4c20d06..f3523c1 100644 --- a/.opencode/skills/write-bdd-features/SKILL.md +++ b/.opencode/skills/write-bdd-features/SKILL.md @@ -5,22 +5,22 @@ description: "Write concrete Given/When/Then Example blocks for each Rule in the # Write BDD Features -Available knowledge: [[requirements/gherkin]], [[requirements/moscow]], [[requirements/pre-mortem]], [[requirements/decomposition]]. `in` artifacts: read all before starting work. +Available knowledge: [[requirements/gherkin]], [[requirements/moscow]], [[requirements/pre-mortem]], [[requirements/decomposition]], [[requirements/property-patterns]]. `in` artifacts: read all before starting work. 1. Discover and read the feature file, product definition, domain spec, and glossary from `in`. 2. Run a pre-mortem per [[requirements/pre-mortem]] for each Rule before writing any Examples. All Rules must have their pre-mortems completed before any Examples are written. 3. IF hidden failure modes surface from the pre-mortem → plan Examples to cover them per [[requirements/gherkin#key-takeaways]]. -4. For each Rule, write Example or Scenario Outline blocks directly from the Rule description and domain spec knowledge per [[requirements/gherkin#concepts]]. Do NOT use behavior hints — they have been removed from the flow. Derive Example behavior directly from: - - The Rule's behavioral description paragraph - - The domain spec's External Contracts, Data Shapes, and Invariants - - The feature's `# Constraints:` comments - - Quality attributes from product_definition.md - Write Examples per format rules in [[requirements/gherkin#concepts]]. +4. For each Rule, apply property patterns per [[requirements/property-patterns#concepts]] to determine Example structure: + a) Check each of the seven patterns against the Rule's behaviour. + b) If no pattern applies → write a simple `Example:` with fixed inputs. + c) If a pattern applies and reveals 3+ input combinations with the same step structure → write a `Scenario Outline:` with an `Examples:` table covering the significant combinations surfaced by the pattern. + d) If a pattern applies but only reveals 1-2 combinations → write simple `Example:` per combination. + Write Examples per format rules in [[requirements/gherkin#concepts]], deriving behavior from the Rule's description, domain spec External Contracts/Data Shapes/Invariants, the feature's `# Constraints:` comments, and quality attributes from product_definition.md. 5. For each Rule, verify Examples cover distinct behaviours per [[requirements/gherkin#concepts]]: a) Group Examples by `Then` outcome. Same outcome = same behaviour. Keep one representative per outcome. Discard duplicates. Exception: Scenario Outline rows are parameterized variants of the same behaviour — they are NOT duplicates. b) For each distinct outcome, run the behavior-level pre-mortem per [[requirements/pre-mortem#concepts]]. c) Add Examples targeting the failure modes surfaced. - d) Structural (invariant) rules: one representative Example suffices. Defer full coverage to a Hypothesis property test per [[software-craft/test-design#concepts]]. + d) Structural (invariant) rules: one representative Example suffices. Defer full coverage to a Hypothesis property test per [[software-craft/test-design#concepts]], using the pattern-to-strategy mapping in [[requirements/property-patterns#content]]. 6. Classify each Example per [[requirements/moscow#concepts]]; MoSCoW classification is for internal triage only: do NOT add Must/Should/Could tags to Examples in the .feature file. 7. IF a Rule has more than 8 Must behaviors (after grouping by Then-outcome and collapsing Scenario Outlines) → this is a soft flag for PO review. Do NOT split or modify the Rule — Rule structure is frozen after define-flow. Decomposition was applied during refine-features; this check catches edge cases that slipped through. A Rule with 9+ Must behaviors is acceptable if the behaviour genuinely requires that many distinct cases. 8. Evaluate each Rule's Examples for quality, checking every criterion per [[requirements/gherkin#concepts]]: diff --git a/AGENTS.md b/AGENTS.md index 1539731..b36fc40 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -8,7 +8,8 @@ Post-mortem analysis shows these practices prevent most project failures. Violat 4. **Never decompose a feature without stakeholder approval.** If a feature is too large for INVEST, propose the split to the stakeholder with rationale. They decide what's core vs. deferred. 5. **Verify inputs exist before entering a state.** Every state's `in` artifacts must be readable on disk. If they're missing, stop and reconstruct them. Don't proceed with assumed knowledge. 6. **A feature is not done until every interview requirement is traced.** Every stakeholder Q&A must map to either a passing @id test or an explicit stakeholder deferral. Untraced requirements = incomplete delivery. -7. **Respect git branch discipline.** Every state declares `git: dev`, `git: feature`, or `git: main` in its attrs. Work on the branch the state declares. Never switch branches mid-state. Before exiting a project-phase flow (discovery, architecture, branding, setup), set `committed-to-dev-locally: ==verified` evidence. Changes must be committed to dev before advancing. +7. **Respect git branch discipline.** Every state declares `git: dev`, `git: feature`, or `git: main` in its attrs. **Verify the current branch matches `attrs.git` before starting any work.** If the branch is wrong, checkout or create the correct branch before proceeding. Never switch branches mid-state. Before exiting a project-phase flow (discovery, architecture, branding, setup), set `committed-to-dev-locally: ==verified` evidence. Changes must be committed to dev before advancing. +8. **Every feature branch must be merged back to dev.** A feature is not delivered until its commits are squash-merged into local dev and `task test-fast` passes on dev. The develop-flow exits to deliver-flow which handles the merge, but the orchestrator must never leave a feature branch dangling — if the session ends mid-feature, resume and complete the merge before starting new work. ## Project Structure - `.flowr/flows/`: YAML state machine definitions (source of truth for routing) @@ -163,16 +164,20 @@ Exception: The polish-code skill explicitly runs convention commands (`task conv ### Todo-Driven State Execution -At state entry, generate a procedural todo list from the state's metadata using the todowrite tool. Format: `[X]` completed, `[ ]` pending, `[~]` anchor (always last). +At state entry, generate a procedural todo list using the todowrite tool. Format: `[X]` completed, `[ ]` pending, `[~]` anchor (always last). -1. **Preparation** (`[ ]`): list available `in` artifacts -2. **Dispatch** (`[ ]`): call the state's owner agent with skills loaded -3. **Output** (`[ ]`): one per `out` artifact -4. **Verification** (`[ ]`): check constraints, run tests/lint if applicable -5. **Anchor** (`[~]`, always last): flowr next → pick transition → flowr transition → rewrite todo +1. **Preparation** (`[ ]`): verify current branch matches `attrs.git` (checkout or create if wrong). List available `in` artifacts. +2. **Dispatch** (`[ ]`): dispatch to the owner agent listed in `attrs.owner` as a subagent with skills loaded. The orchestrator MUST NOT do the work itself — only route. Owner mapping: `PO` → product-owner, `DE` → domain-expert, `SE` → software-engineer, `SA` → system-architect, `R` → reviewer, `Design Agent` → design-agent, `Setup Agent` → setup-agent. +3. **Load skills** (`[ ]`): read every skill file listed in `attrs.skills` from `.opencode/skills//SKILL.md`. This step is MANDATORY — never skip it. +4. **Skill-derived work items** (`[ ]`): one todo item per numbered step in the skill, using the skill's own language verbatim. These are the substantive work items. Self-generated items are only permitted for infrastructure (read artifacts, commit) — never for the core procedure. +5. **Output** (`[ ]`): one per `out` artifact +6. **Verification** (`[ ]`): check constraints, run tests/lint if applicable +7. **Anchor** (`[~]`, always last): flowr next → pick transition → flowr transition → rewrite todo The todo is the execution contract. Every item must be marked `[X]` before the anchor fires. One state per todo; never span multiple states or collapse loop iterations. Full protocol: [[workflow/todo-anchor-protocol]]. +**Todo discipline**: After completing ANY step, update the todowrite tool to mark it `[X]` and set the next step `[ ]` to `in_progress`. If the todo list is empty or missing, regenerate it immediately — working without a todo means working without a contract. Never let the todo go stale between steps. + ### Session Init Before starting a flow, create a session to track progress: diff --git a/docs/research/software-engineering/quality/wlaschin_2014.md b/docs/research/software-engineering/quality/wlaschin_2014.md new file mode 100644 index 0000000..f24b55e --- /dev/null +++ b/docs/research/software-engineering/quality/wlaschin_2014.md @@ -0,0 +1,50 @@ +# Choosing Properties for Property-Based Testing (Wlaschin, 2014) + +## Citation + +Wlaschin, S. (2014). "Choosing properties for property-based testing" *F# for Fun and Profit*. https://fsharpforfunandprofit.com/posts/property-based-testing-2/ + +## Source Type + +Blog/Article + +## Method + +Theoretical with practical examples + +## Verification Status + +Verified + +## Confidence + +High + +## Key Insight + +Seven recurring patterns help developers discover testable properties when they cannot think of any: "Different paths, same destination" (commutative diagram), "There and back again" (inverse function), "Some things never change" (invariant under transformation), "The more things change, the more they stay the same" (idempotence), "Solve a smaller problem first" (structural induction), "Hard to prove, easy to verify", and "The test oracle". + +## Core Findings + +1. The universal problem with property-based testing is not tooling but discovering what properties to test — developers stare at a blank screen unable to think of properties +2. Seven patterns cover most common cases: commutative operations, inverse pairs, invariants, idempotence, structural induction, easy-verification, and test oracles +3. "Different paths, same destination" applies when two operation sequences should produce the same result (e.g., addition is commutative) +4. "There and back again" applies when an operation and its inverse return to the starting state (e.g., serialize/deserialize) +5. "Some things never change" applies when a transformation preserves an invariant (e.g., sort preserves multiset of elements) +6. "Hard to prove, easy to verify" applies when finding the answer is complex but checking it is simple (e.g., prime factorisation — hard to find, easy to multiply back) +7. "The test oracle" applies when an alternate (simpler, slower) implementation exists to verify the production implementation +8. Model-based testing is a variant of the test oracle pattern: a simplified model runs in parallel with the system under test, and states are compared after each operation + +## Mechanism + +Rather than trying to enumerate all possible properties of a system, developers apply each of the seven patterns as lenses to examine the system's behaviour. Each pattern asks a different question: "Are there two ways to get the same result?" "Is there an inverse?" "What stays the same?" "What happens if I do it twice?" "Can I break it into smaller parts?" "Is the answer easy to check?" "Is there a reference implementation?" This structured approach overcomes the "blank screen" problem by providing concrete starting points for property discovery. + +## Relevance + +Essential for BDD example creation: the seven patterns provide a systematic method for deciding whether a Rule should use simple Examples or Scenario Outlines with parameterised inputs. When a pattern applies, it reveals which input combinations matter and whether the behaviour holds across a range of inputs. This directly informs the Example-vs-Scenario-Outline decision in the write-bdd-features skill. The patterns also surface missing Examples: if a pattern applies to a Rule but no Example covers the revealed combination, the specification is incomplete. + +## Related Research + +- (Claessen & Hughes, 2000) - QuickCheck: automatic testing of Haskell programs +- (MacIver, 2016) - Hypothesis library extending property-based testing to Python +- (Tillmann & Schulte, 2005) - PEX team at Microsoft compiled a complementary list of property patterns From f8c074457eace900c9f0350d2654ec9826e07363 Mon Sep 17 00:00:00 2001 From: Ubuntu Date: Tue, 19 May 2026 10:31:56 +0000 Subject: [PATCH 2/3] =?UTF-8?q?fix:=20prevent=20beehave=20noise=20patterns?= =?UTF-8?q?=20=E2=80=94=20title=20rules,=20literal=20guidance,=20output=20?= =?UTF-8?q?column=20patterns?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Post-mortem analysis of two features (rate limit buckets, cache history) revealed recurring patterns where agents add noise to satisfy beehave checks instead of writing natural test code: 1. Title special characters: POs write titles with hyphens, periods, underscores which break slug generation. Added explicit 'ONLY Unicode letters, digits, and spaces' rule to Key Takeaways, Concepts, Title Conventions, and Common Mistakes in gherkin.md. 2. Literal awareness: POs write display-oriented literals (e.g. 'United States of America') that are awkward for tests. Added guidance in Key Takeaways and Concepts: choose short, code-friendly literals. 3. Output columns and Hypothesis: Scenario Outline output columns (like expected=min(a,b)) are generated as @given strategies. Agents either remove them (breaking example-mismatch) or add noise (_=expected). Documented the reassignment pattern in gherkin.md Concepts and test-stubs.md Concepts. 4. String literal helpers: Documented the _pair_from('FOO/BAR') helper pattern in test-stubs.md. Agents were stuffing literals into assert messages or assigning to _. 5. write-test skill step 3: Added explicit instructions for output column reassignment and string literal helpers. Added 'NEVER stuff literals into assert messages' prohibition. 6. write-bdd-features skill step 9: Added beehave check at write time to catch title/placeholder issues before downstream stub generation. --- .opencode/knowledge/requirements/gherkin.md | 18 +++++++----- .../knowledge/software-craft/test-stubs.md | 28 ++++++++++++++++++- .opencode/skills/write-bdd-features/SKILL.md | 3 +- .opencode/skills/write-test/SKILL.md | 5 +++- 4 files changed, 44 insertions(+), 10 deletions(-) diff --git a/.opencode/knowledge/requirements/gherkin.md b/.opencode/knowledge/requirements/gherkin.md index 189ad54..1d71631 100644 --- a/.opencode/knowledge/requirements/gherkin.md +++ b/.opencode/knowledge/requirements/gherkin.md @@ -1,7 +1,7 @@ --- domain: requirements tags: [gherkin, acceptance-criteria, specification, examples, bdd, scenario-outline, hypothesis] -last-updated: 2026-05-14 +last-updated: 2026-05-19 --- # Gherkin Specification Format @@ -10,10 +10,9 @@ last-updated: 2026-05-14 - Write declarative Examples that describe behaviour, not UI steps; use `Example:` not `Scenario:` for single-case examples (BDD, North, 2006). - Use `Scenario Outline:` with `` syntax and an `Examples:` table when the same behavioural outcome must be verified across 3+ input/output value combinations. -- Feature, Rule, and Example/Scenario Outline titles must be 2–6 words and unique within the feature file — pytest-beehave uses title-based mapping (title → `test_` function name) for traceability. -- `Then` must be a single, observable, measurable outcome; no "and" combining multiple behaviours in one `Then`. -- Quoted strings (`"value"`) and bare numbers (`42`, `-3`) in steps are extracted by beehave as literals and verified present in test function bodies via `beehave check`. -- `` names in steps become Python function parameters and Hypothesis `@given` strategies in generated stubs. Names must be valid Python identifiers (not keywords, not builtins). +- Feature, Rule, and Example/Scenario Outline titles must be 2–6 words and unique within the feature file — pytest-beehave uses title-based mapping (title → `test_` function name) for traceability. Titles must contain ONLY Unicode letters, digits, and spaces — no hyphens, periods, underscores, or special characters (they break slug generation). +- Quoted strings (`"value"`) and bare numbers (`42`, `-3`) in steps are extracted by beehave as literals and verified present in test function bodies via `beehave check`. Choose literals the test will naturally consume — for example, `"US"` is more useful than `"United States of America"` because the test can use it directly as a parameter. Avoid literals that only make sense as display text. +- `` names in steps become Python function parameters and Hypothesis `@given` strategies in generated stubs. Names must be valid Python identifiers (not keywords, not builtins). All columns in the Examples table — including output columns like `` — become Hypothesis parameters. Output columns that are deterministic functions of inputs (e.g., `expected = min(a, b)`) are reassigned in the test body. This is by design: Hypothesis requires `@given` and `@example` to share the same parameter set. - Bug Examples use `@bug` and require both a specific feature test and a Hypothesis property test. - After criteria commit, Examples are frozen; changes require `@deprecated` on the old Example and a new Example with a new unique title. - Two Examples with the same `Then` outcome but different input values test the same behaviour; partition by behaviour outcome, not by input value (Wynne, 2015; Adzic, 2011). @@ -24,14 +23,16 @@ last-updated: 2026-05-14 **Example vs Scenario Outline**: Use `Example:` for single-case examples. Use `Scenario Outline:` when the same behavioural outcome must be verified across 3+ different input/output value combinations. Scenario Outline uses `` syntax in Given/When/Then steps and an `Examples:` table with concrete data rows. This avoids repeating identical step structures with different values. -**Title Length Constraint**: Feature, Rule, and Example/Scenario Outline titles must be 2–6 words. Titles become `test_` function names — too short produces ambiguous identifiers (e.g. `test_stuff`), too long produces unwieldy ones (e.g. `test_when_the_user_submits_a_form_with_invalid_email_the_system_displays_an_error_message`). Count words by splitting on whitespace. +**Title Length and Character Constraint**: Feature, Rule, and Example/Scenario Outline titles must be 2–6 words and unique within the feature file. Titles must contain ONLY Unicode letters, digits, and spaces — no hyphens (`-`), periods (`.`), underscores (`_`), or other special characters. The title is slugified to produce the test function name (`test_`), and special characters either break slug generation or produce ambiguous identifiers. Too short produces ambiguous identifiers (e.g. `test_stuff`), too long produces unwieldy ones. Count words by splitting on whitespace. **Placeholder Syntax**: `` in Given/When/Then steps. Beehave extracts these and generates Hypothesis `@given(var_name=strategy)` decorators in test stubs. Placeholder names must be valid Python identifiers, not keywords (`for`, `class`), and not builtins (`list`, `str`). When used with Scenario Outline, the Examples table column headers must match the placeholder names. -**Literal Extraction**: Quoted strings (`"value"`, `'value'`) and bare numbers (`42`, `-3`, `3.14`) in Given/When/Then steps are extracted by beehave as literals. `beehave check` verifies these literals appear in the test function body. This provides structural traceability beyond title mapping — tests must use the exact literal values from the spec. +**Literal Extraction**: Quoted strings (`"value"`, `'value'`) and bare numbers (`42`, `-3`, `3.14`) in Given/When/Then steps are extracted by beehave as literals. `beehave check` verifies these literals appear in the test function body. This provides structural traceability beyond title mapping — tests must use the exact literal values from the spec. **Choose literals the test will naturally consume.** If a step says `When an order is placed for "Widget"`, the test will construct an object using `"Widget"` — that's natural. If a step says `When the user uploads "Annual Report Q4 2024 Final Draft"`, the test must contain that entire string, which is awkward. Prefer short, code-friendly literals. For entity identifiers (like trading pairs or product codes), use the format the code actually uses (e.g., `"US"` not `"United States"`) so the test can pass the literal directly to constructors. **Hypothesis Integration**: Scenario Outline generates `@given` decorated stubs with inferred Hypothesis strategies (`st.integers()`, `st.floats()`, `st.booleans()`, `st.text()`) plus `@example` decorators for each Examples table row. Plain Examples generate bare function stubs. For tests hitting external services, use `@settings(max_examples=N)` to control load. For unit/domain tests, Hypothesis defaults are fine. +**Output Columns and Hypothesis Constraint**: Every column in the Examples table becomes both an `@example` argument and a `@given` strategy — including output columns like ``. Hypothesis enforces that `@given` and `@example` share the same parameter set. When an output column is a deterministic function of inputs (e.g., `expected = min(size, count)`), the random `@given` value is meaningless. The SE reassigns it in the test body: `expected = min(size, count)`. The `@example` rows provide the correct concrete values for regression. This is the intended pattern — do NOT remove output columns from the Examples table or from `@given`. The output columns serve as documentation and regression anchors even though Hypothesis fuzzes a meaningless random value for them. + **Example Format and Title-Based Mapping**: Each Example uses the `Example:` keyword (not `Scenario:`), includes `Given/When/Then` in plain English. pytest-beehave maps Examples to test functions by title: the function name is `test_`. Titles must be unique within the feature file. Descriptive titles serve as the traceability link between feature specification and test code — no `@id` tags are needed. **Single Observable Outcome per Then**: `Then` must be a single, observable, measurable outcome. No "and" combining multiple behaviours in one `Then`. Split into separate Examples instead. Observable means observable by the end user, not by a test harness. @@ -58,6 +59,7 @@ last-updated: 2026-05-14 - Feature, Rule, and Example/Scenario Outline titles must be 2–6 words - Titles must be unique within the feature file +- Titles must contain ONLY Unicode letters, digits, and spaces — no hyphens, periods, underscores, or special characters - Title becomes the test function name: `test_` - Titles should be descriptive enough to serve as the test identifier - No `@id` tags — the title is the traceability link @@ -175,6 +177,8 @@ Implement both: - Using `Scenario:` keyword: use `Example:` for single cases or `Scenario Outline:` for parameterized cases - Placeholder names that are Python keywords or builtins: beehave rejects these at parse time - Paraphrasing literal values in test code instead of using exact values from the spec: fails `beehave check` +- Titles containing hyphens, periods, underscores, or special characters: only Unicode letters, digits, and spaces are allowed +- Using long or display-oriented literals in steps (e.g., `"Annual Report Q4 2024 Final Draft"`): the entire string must appear in the test body — prefer short, code-friendly literals ### Feature File Path Convention diff --git a/.opencode/knowledge/software-craft/test-stubs.md b/.opencode/knowledge/software-craft/test-stubs.md index 6d6590d..15a071c 100644 --- a/.opencode/knowledge/software-craft/test-stubs.md +++ b/.opencode/knowledge/software-craft/test-stubs.md @@ -1,7 +1,7 @@ --- domain: software-craft tags: [test-stubs, traceability, pytest-beehave, scenario-outline, hypothesis] -last-updated: 2026-05-14 +last-updated: 2026-05-19 --- # Test Stubs @@ -68,6 +68,32 @@ Stubs (functions with `...` body) are exempt from placeholder and literal checks **Test File Layout**. pytest-beehave organizes tests as: Feature title → directory, Rule → test file, Example/Scenario Outline → function name. Test files are placed in `tests/features//_test.py`. +**Derived Output Columns**. When a Scenario Outline Examples table includes output columns (e.g., `expected`) that are deterministic functions of input columns, the test body must reassign them. Hypothesis generates random values for ALL columns — including outputs — because it requires `@given` and `@example` to share the same parameter set. The reassignment pattern: + +```python +@given(size=st.integers(min_value=1, max_value=50), count=st.integers(min_value=0, max_value=100), expected=st.integers()) +@example(size=3, count=5, expected=3) +@example(size=1, count=3, expected=1) +def test_bounded_history(size, count, expected): + expected = min(size, count) + assert len(cache.history(pair, "data", count=size)) == expected +``` + +The `@example` rows provide correct concrete values for regression. The `@given` value for `expected` is random noise — the reassignment overrides it. This is the intended pattern. + +**String Literal Helpers**. When a Gherkin step references a compound identifier like `"FOO/BAR"`, the test needs this literal in the function body for `beehave check`. Use a helper that naturally consumes the literal: + +```python +def _pair_from(symbol: str) -> Pair: + base, quote = symbol.split("/") + return Pair(base=Token(base), quote=Token(quote)) + +def test_example(): + pair = _pair_from("FOO/BAR") +``` + +The literal `"FOO/BAR"` appears naturally in the function body — no noise assertions or unused variables needed. NEVER satisfy literal checks by stuffing strings into assert messages (`assert x == y, "FOO/BAR"`) or assigning to underscore (`_ = "FOO/BAR"`). + ## Related - [[requirements/gherkin]]: Example format, title conventions, Scenario Outline syntax, placeholder and literal rules diff --git a/.opencode/skills/write-bdd-features/SKILL.md b/.opencode/skills/write-bdd-features/SKILL.md index f3523c1..987b842 100644 --- a/.opencode/skills/write-bdd-features/SKILL.md +++ b/.opencode/skills/write-bdd-features/SKILL.md @@ -24,4 +24,5 @@ Available knowledge: [[requirements/gherkin]], [[requirements/moscow]], [[requir 6. Classify each Example per [[requirements/moscow#concepts]]; MoSCoW classification is for internal triage only: do NOT add Must/Should/Could tags to Examples in the .feature file. 7. IF a Rule has more than 8 Must behaviors (after grouping by Then-outcome and collapsing Scenario Outlines) → this is a soft flag for PO review. Do NOT split or modify the Rule — Rule structure is frozen after define-flow. Decomposition was applied during refine-features; this check catches edge cases that slipped through. A Rule with 9+ Must behaviors is acceptable if the behaviour genuinely requires that many distinct cases. 8. Evaluate each Rule's Examples for quality, checking every criterion per [[requirements/gherkin#concepts]]: - Evaluate Example quality per criteria in [[requirements/gherkin#concepts]]. Every criterion that fails is a hard blocker: fix before advancing. \ No newline at end of file + Evaluate Example quality per criteria in [[requirements/gherkin#concepts]]. Every criterion that fails is a hard blocker: fix before advancing. +9. Run `beehave check ` to verify structural traceability catches issues at write time — title character violations, placeholder name problems, literal format issues. Fix any errors before committing. This prevents downstream rework when the SE generates stubs. \ No newline at end of file diff --git a/.opencode/skills/write-test/SKILL.md b/.opencode/skills/write-test/SKILL.md index ed39648..501fc01 100644 --- a/.opencode/skills/write-test/SKILL.md +++ b/.opencode/skills/write-test/SKILL.md @@ -9,6 +9,9 @@ Available knowledge: [[software-craft/tdd]], [[software-craft/test-design]], [[s 1. Pick the next test function with an `...` (Ellipsis) body from the auto-generated test stubs: prefer tests within the same Rule file before moving to the next Rule, and order by fewest dependencies first per [[software-craft/tdd#concepts]]. pytest-beehave creates stubs with `test_` naming and `...` bodies (auto-skipped during collection). IF the Example belongs to a structural (invariant) Rule → also generate a Hypothesis property test in `tests/unit/` per [[software-craft/test-design#concepts]], using the counterexamples surfaced by the behavior pre-mortem per [[requirements/pre-mortem#concepts]]. 2. **External fixture gate**: IF the Example/Scenario Outline references an external adapter or API (check domain spec External Contracts section), verify the real fixture exists in `tests/fixtures//` per [[software-craft/external-fixtures#key-takeaways]]. IF the fixture is missing → flag as a blocker. Do NOT write the test against imagined data. Request fixture capture before proceeding. -3. Write a failing test that specifies the expected behavior per [[software-craft/tdd#key-takeaways]]. Replace the `...` body with the test implementation. The test function name (derived from the Example title) is immutable — it is the traceability link maintained by pytest-beehave. For Scenario Outline stubs with `@given`/`@example` decorators, use the placeholder parameters in the test body. Use the exact literal values from the spec steps — `beehave check` verifies their presence. +3. Write a failing test that specifies the expected behavior per [[software-craft/tdd#key-takeaways]]. Replace the `...` body with the test implementation. The test function name (derived from the Example title) is immutable — it is the traceability link maintained by pytest-beehave. For Scenario Outline stubs with `@given`/`@example` decorators, use the placeholder parameters in the test body: + - **Output columns**: When the Examples table has an output column that is a deterministic function of inputs (e.g., `expected = min(a, b)`), reassign it in the test body. The `@given` value is random noise — the reassignment computes the correct value. + - **String literals**: When steps contain compound identifiers (e.g., `"FOO/BAR"`), use a helper function that consumes the literal naturally (e.g., `base, quote = "FOO/BAR".split("/")`). NEVER stuff literals into assert messages or assign to `_` — these are noise. + - **All placeholders and literals** from the spec steps must appear in the test body — `beehave check` verifies their presence. 4. IF a spec gap or inconsistency is discovered → do NOT modify specification documents (domain_spec.md, glossary.md, product_definition.md, ADRs, feature files). Flag it in output notes. The SE may ONLY modify production code and test code. 5. Run `task test-fast` to confirm the test fails for the right reason (RED) per [[software-craft/tdd#key-takeaways]]. From e1702d0a6a1ea30b611cc4cf4c66fd7ecb60acfe Mon Sep 17 00:00:00 2001 From: Ubuntu Date: Tue, 19 May 2026 11:00:26 +0000 Subject: [PATCH 3/3] fix: replace beehave noise patterns with Spec Value Fidelity principle MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Anti-patterns (output column reassignment, helper functions for literals, code-friendly literal guidance) treated symptoms — the test working around awkward spec values. Root cause: every value in a spec should carry domain meaning, and both PO and SE must preserve that meaning. Introduces Spec Value Fidelity as a first-class concept in test-design: values exist with domain intent; tests must reflect that intent; noise patterns (assigning to _, assert stuffing, helpers for traceability) violate this fidelity. Cascades to gherkin.md (PO: meaningful literals and columns), test-stubs.md (SE: use values for their domain purpose), and both skills (write-bdd-features adds semantic value check, write-test replaces workaround bullets). --- .opencode/knowledge/requirements/gherkin.md | 10 +++---- .../knowledge/software-craft/test-design.md | 3 +++ .../knowledge/software-craft/test-stubs.md | 26 +------------------ .opencode/skills/write-bdd-features/SKILL.md | 6 ++++- .opencode/skills/write-test/SKILL.md | 4 +-- 5 files changed, 15 insertions(+), 34 deletions(-) diff --git a/.opencode/knowledge/requirements/gherkin.md b/.opencode/knowledge/requirements/gherkin.md index 1d71631..20a6deb 100644 --- a/.opencode/knowledge/requirements/gherkin.md +++ b/.opencode/knowledge/requirements/gherkin.md @@ -11,8 +11,8 @@ last-updated: 2026-05-19 - Write declarative Examples that describe behaviour, not UI steps; use `Example:` not `Scenario:` for single-case examples (BDD, North, 2006). - Use `Scenario Outline:` with `` syntax and an `Examples:` table when the same behavioural outcome must be verified across 3+ input/output value combinations. - Feature, Rule, and Example/Scenario Outline titles must be 2–6 words and unique within the feature file — pytest-beehave uses title-based mapping (title → `test_` function name) for traceability. Titles must contain ONLY Unicode letters, digits, and spaces — no hyphens, periods, underscores, or special characters (they break slug generation). -- Quoted strings (`"value"`) and bare numbers (`42`, `-3`) in steps are extracted by beehave as literals and verified present in test function bodies via `beehave check`. Choose literals the test will naturally consume — for example, `"US"` is more useful than `"United States of America"` because the test can use it directly as a parameter. Avoid literals that only make sense as display text. -- `` names in steps become Python function parameters and Hypothesis `@given` strategies in generated stubs. Names must be valid Python identifiers (not keywords, not builtins). All columns in the Examples table — including output columns like `` — become Hypothesis parameters. Output columns that are deterministic functions of inputs (e.g., `expected = min(a, b)`) are reassigned in the test body. This is by design: Hypothesis requires `@given` and `@example` to share the same parameter set. +- Quoted strings (`"value"`) and bare numbers (`42`, `-3`) in steps are extracted by beehave as literals and verified present in test function bodies via `beehave check`. Every literal carries domain meaning in the test — use literals that identify entities, boundaries, configurations, or error cases the reader needs to see. +- `` names in steps become Python function parameters in generated test stubs. Names must be valid Python identifiers (not keywords, not builtins). Every column in an Examples table must be referenced in the scenario steps — columns not referenced in steps are test data, not specification. If the expected outcome varies independently across examples, an output column is appropriate. If it is a deterministic function of inputs, express the relationship in the `Then` step (e.g., `Then the result is the minimum of and `) rather than adding a computed column. - Bug Examples use `@bug` and require both a specific feature test and a Hypothesis property test. - After criteria commit, Examples are frozen; changes require `@deprecated` on the old Example and a new Example with a new unique title. - Two Examples with the same `Then` outcome but different input values test the same behaviour; partition by behaviour outcome, not by input value (Wynne, 2015; Adzic, 2011). @@ -27,11 +27,11 @@ last-updated: 2026-05-19 **Placeholder Syntax**: `` in Given/When/Then steps. Beehave extracts these and generates Hypothesis `@given(var_name=strategy)` decorators in test stubs. Placeholder names must be valid Python identifiers, not keywords (`for`, `class`), and not builtins (`list`, `str`). When used with Scenario Outline, the Examples table column headers must match the placeholder names. -**Literal Extraction**: Quoted strings (`"value"`, `'value'`) and bare numbers (`42`, `-3`, `3.14`) in Given/When/Then steps are extracted by beehave as literals. `beehave check` verifies these literals appear in the test function body. This provides structural traceability beyond title mapping — tests must use the exact literal values from the spec. **Choose literals the test will naturally consume.** If a step says `When an order is placed for "Widget"`, the test will construct an object using `"Widget"` — that's natural. If a step says `When the user uploads "Annual Report Q4 2024 Final Draft"`, the test must contain that entire string, which is awkward. Prefer short, code-friendly literals. For entity identifiers (like trading pairs or product codes), use the format the code actually uses (e.g., `"US"` not `"United States"`) so the test can pass the literal directly to constructors. +**Literal Extraction**: Quoted strings (`"value"`, `'value'`) and bare numbers (`42`, `-3`, `3.14`) in Given/When/Then steps are extracted by beehave as literals. `beehave check` verifies these literals appear in the test function body, providing structural traceability. Per Spec Value Fidelity ([[software-craft/test-design#concepts]]), every literal in a step must carry domain meaning — it should identify an entity, boundary, configuration, or error case that matters to the reader. **Hypothesis Integration**: Scenario Outline generates `@given` decorated stubs with inferred Hypothesis strategies (`st.integers()`, `st.floats()`, `st.booleans()`, `st.text()`) plus `@example` decorators for each Examples table row. Plain Examples generate bare function stubs. For tests hitting external services, use `@settings(max_examples=N)` to control load. For unit/domain tests, Hypothesis defaults are fine. -**Output Columns and Hypothesis Constraint**: Every column in the Examples table becomes both an `@example` argument and a `@given` strategy — including output columns like ``. Hypothesis enforces that `@given` and `@example` share the same parameter set. When an output column is a deterministic function of inputs (e.g., `expected = min(size, count)`), the random `@given` value is meaningless. The SE reassigns it in the test body: `expected = min(size, count)`. The `@example` rows provide the correct concrete values for regression. This is the intended pattern — do NOT remove output columns from the Examples table or from `@given`. The output columns serve as documentation and regression anchors even though Hypothesis fuzzes a meaningless random value for them. +**Meaningful Examples Tables**. Every column in an Examples table must be referenced in at least one step — unreferenced columns are test data, not specification. Output columns (e.g., ``) are appropriate when the expected value varies independently across rows (e.g., different tax rates per country, different error codes per input). When the expected outcome is a deterministic function of inputs, do not add an output column; express the relationship in the `Then` step. Per Spec Value Fidelity ([[software-craft/test-design#concepts]]), every value in the table exists to be used meaningfully in the test. **Example Format and Title-Based Mapping**: Each Example uses the `Example:` keyword (not `Scenario:`), includes `Given/When/Then` in plain English. pytest-beehave maps Examples to test functions by title: the function name is `test_`. Titles must be unique within the feature file. Descriptive titles serve as the traceability link between feature specification and test code — no `@id` tags are needed. @@ -178,7 +178,7 @@ Implement both: - Placeholder names that are Python keywords or builtins: beehave rejects these at parse time - Paraphrasing literal values in test code instead of using exact values from the spec: fails `beehave check` - Titles containing hyphens, periods, underscores, or special characters: only Unicode letters, digits, and spaces are allowed -- Using long or display-oriented literals in steps (e.g., `"Annual Report Q4 2024 Final Draft"`): the entire string must appear in the test body — prefer short, code-friendly literals +- Adding literals to steps that lack domain meaning: every literal must identify an entity, boundary, configuration, or error case that justifies its presence in both the specification and the test ### Feature File Path Convention diff --git a/.opencode/knowledge/software-craft/test-design.md b/.opencode/knowledge/software-craft/test-design.md index 2c1471f..210dc2a 100644 --- a/.opencode/knowledge/software-craft/test-design.md +++ b/.opencode/knowledge/software-craft/test-design.md @@ -13,6 +13,7 @@ last-updated: 2026-05-14 - Test coupling exists on a spectrum: feature tests (most resilient) > unit contract tests > property-based tests > white-box tests (most brittle, avoid). - One observable behaviour per test: each test should fail for exactly one reason and pass for exactly one reason. - Hard-coded values are acceptable when the test only requires that value; parameterising prematurely couples the test to assumptions about future needs. +- Spec Value Fidelity: every value in the specification carries domain intent; the test must use it in a way that reflects that intent — no noise patterns to satisfy traceability. - Property tests: all invariant/structural rules, not just @bug Examples. Examples alone cannot prove an invariant (MacIver, 2016). ## Concepts @@ -27,6 +28,8 @@ last-updated: 2026-05-14 **Semantic Depth**. A test that exists for an Example but exercises domain logic directly instead of through the entry point described in the acceptance criterion has correct structural traceability but wrong semantic depth. Every feature test must exercise the entry point the AC describes: if the AC specifies a command-line invocation, the test must invoke the command handler; if the AC specifies an API call, the test must call the API endpoint. Structural traceability (every Example has a test function) without semantic depth (every test exercises the right entry point) creates a false sense of coverage. +**Spec Value Fidelity**. Every literal, placeholder, and Examples table column in a specification exists because the PO judged it domain-meaningful — an entity identifier, a boundary value, a configuration, a concrete expected outcome. The test must use each value in a way that reflects that domain purpose. Noise patterns that satisfy structural traceability without domain meaning — assigning to `_`, stuffing strings into assert messages, helper functions whose sole purpose is consuming a literal — violate this fidelity. If a spec value does not fit naturally in the test, the mismatch signals a spec clarity issue, not a test workaround opportunity. + **Invariant Property Tests**. Structural (invariant) rules describe properties that must hold across all inputs, not specific behaviours. Examples alone cannot prove an invariant — they only confirm it holds for the selected cases (MacIver, 2016). When a Rule asserts an invariant (e.g., "total must equal sum of parts," "output must be sorted," "balance must never go negative"), the specification pre-mortem and behavior pre-mortem surface candidate counterexamples. These counterexamples become assertions in a Hypothesis property test (`tests/unit/`) that verifies the invariant across a generated range of inputs, catching failure modes that no finite set of hand-picked Examples could have found. ## Content diff --git a/.opencode/knowledge/software-craft/test-stubs.md b/.opencode/knowledge/software-craft/test-stubs.md index 15a071c..f4e6e29 100644 --- a/.opencode/knowledge/software-craft/test-stubs.md +++ b/.opencode/knowledge/software-craft/test-stubs.md @@ -68,31 +68,7 @@ Stubs (functions with `...` body) are exempt from placeholder and literal checks **Test File Layout**. pytest-beehave organizes tests as: Feature title → directory, Rule → test file, Example/Scenario Outline → function name. Test files are placed in `tests/features//_test.py`. -**Derived Output Columns**. When a Scenario Outline Examples table includes output columns (e.g., `expected`) that are deterministic functions of input columns, the test body must reassign them. Hypothesis generates random values for ALL columns — including outputs — because it requires `@given` and `@example` to share the same parameter set. The reassignment pattern: - -```python -@given(size=st.integers(min_value=1, max_value=50), count=st.integers(min_value=0, max_value=100), expected=st.integers()) -@example(size=3, count=5, expected=3) -@example(size=1, count=3, expected=1) -def test_bounded_history(size, count, expected): - expected = min(size, count) - assert len(cache.history(pair, "data", count=size)) == expected -``` - -The `@example` rows provide correct concrete values for regression. The `@given` value for `expected` is random noise — the reassignment overrides it. This is the intended pattern. - -**String Literal Helpers**. When a Gherkin step references a compound identifier like `"FOO/BAR"`, the test needs this literal in the function body for `beehave check`. Use a helper that naturally consumes the literal: - -```python -def _pair_from(symbol: str) -> Pair: - base, quote = symbol.split("/") - return Pair(base=Token(base), quote=Token(quote)) - -def test_example(): - pair = _pair_from("FOO/BAR") -``` - -The literal `"FOO/BAR"` appears naturally in the function body — no noise assertions or unused variables needed. NEVER satisfy literal checks by stuffing strings into assert messages (`assert x == y, "FOO/BAR"`) or assigning to underscore (`_ = "FOO/BAR"`). +**Spec Value Fidelity in Tests**. Every literal and placeholder from the spec must appear in the test body — `beehave check` verifies this. Per Spec Value Fidelity ([[software-craft/test-design#concepts]]), the test must use each value in a way that reflects its domain purpose. If `"BTC/USD"` represents a trading pair, use it to construct or identify one. If `42` is a boundary value, use it at the boundary. Never satisfy traceability with noise: assigning to `_`, stuffing strings into assert messages, or writing helper functions whose sole purpose is consuming a literal. These mask the real issue — the value's domain purpose is not reflected in the test. ## Related diff --git a/.opencode/skills/write-bdd-features/SKILL.md b/.opencode/skills/write-bdd-features/SKILL.md index 987b842..c7a4740 100644 --- a/.opencode/skills/write-bdd-features/SKILL.md +++ b/.opencode/skills/write-bdd-features/SKILL.md @@ -25,4 +25,8 @@ Available knowledge: [[requirements/gherkin]], [[requirements/moscow]], [[requir 7. IF a Rule has more than 8 Must behaviors (after grouping by Then-outcome and collapsing Scenario Outlines) → this is a soft flag for PO review. Do NOT split or modify the Rule — Rule structure is frozen after define-flow. Decomposition was applied during refine-features; this check catches edge cases that slipped through. A Rule with 9+ Must behaviors is acceptable if the behaviour genuinely requires that many distinct cases. 8. Evaluate each Rule's Examples for quality, checking every criterion per [[requirements/gherkin#concepts]]: Evaluate Example quality per criteria in [[requirements/gherkin#concepts]]. Every criterion that fails is a hard blocker: fix before advancing. -9. Run `beehave check ` to verify structural traceability catches issues at write time — title character violations, placeholder name problems, literal format issues. Fix any errors before committing. This prevents downstream rework when the SE generates stubs. \ No newline at end of file +9. For each literal and `` in every Example, verify it carries domain meaning per Spec Value Fidelity ([[software-craft/test-design#concepts]]): + a) Every quoted string or bare number must identify an entity, boundary, configuration, or error case — not serve as display text. + b) Every column in a Scenario Outline Examples table must be referenced in at least one step. If the expected outcome is a deterministic function of inputs, express the relationship in the `Then` step rather than adding a computed output column. + Remove or reword any value that fails this check before advancing. +10. Run `beehave check ` to verify structural traceability catches issues at write time. Fix any errors before committing. \ No newline at end of file diff --git a/.opencode/skills/write-test/SKILL.md b/.opencode/skills/write-test/SKILL.md index 501fc01..908c39f 100644 --- a/.opencode/skills/write-test/SKILL.md +++ b/.opencode/skills/write-test/SKILL.md @@ -10,8 +10,6 @@ Available knowledge: [[software-craft/tdd]], [[software-craft/test-design]], [[s 1. Pick the next test function with an `...` (Ellipsis) body from the auto-generated test stubs: prefer tests within the same Rule file before moving to the next Rule, and order by fewest dependencies first per [[software-craft/tdd#concepts]]. pytest-beehave creates stubs with `test_` naming and `...` bodies (auto-skipped during collection). IF the Example belongs to a structural (invariant) Rule → also generate a Hypothesis property test in `tests/unit/` per [[software-craft/test-design#concepts]], using the counterexamples surfaced by the behavior pre-mortem per [[requirements/pre-mortem#concepts]]. 2. **External fixture gate**: IF the Example/Scenario Outline references an external adapter or API (check domain spec External Contracts section), verify the real fixture exists in `tests/fixtures//` per [[software-craft/external-fixtures#key-takeaways]]. IF the fixture is missing → flag as a blocker. Do NOT write the test against imagined data. Request fixture capture before proceeding. 3. Write a failing test that specifies the expected behavior per [[software-craft/tdd#key-takeaways]]. Replace the `...` body with the test implementation. The test function name (derived from the Example title) is immutable — it is the traceability link maintained by pytest-beehave. For Scenario Outline stubs with `@given`/`@example` decorators, use the placeholder parameters in the test body: - - **Output columns**: When the Examples table has an output column that is a deterministic function of inputs (e.g., `expected = min(a, b)`), reassign it in the test body. The `@given` value is random noise — the reassignment computes the correct value. - - **String literals**: When steps contain compound identifiers (e.g., `"FOO/BAR"`), use a helper function that consumes the literal naturally (e.g., `base, quote = "FOO/BAR".split("/")`). NEVER stuff literals into assert messages or assign to `_` — these are noise. - - **All placeholders and literals** from the spec steps must appear in the test body — `beehave check` verifies their presence. + - All placeholders and literals from the spec steps must appear in the test body per Spec Value Fidelity ([[software-craft/test-design#concepts]]) — `beehave check` verifies their presence. Use each value in a way that reflects its domain purpose: identifiers identify, boundaries bound, configurations configure. Never add noise code to satisfy traceability. 4. IF a spec gap or inconsistency is discovered → do NOT modify specification documents (domain_spec.md, glossary.md, product_definition.md, ADRs, feature files). Flag it in output notes. The SE may ONLY modify production code and test code. 5. Run `task test-fast` to confirm the test fails for the right reason (RED) per [[software-craft/tdd#key-takeaways]].