@davidturnbull davidturnbull commented Sep 22, 2025

Summary

This PR enhances the glob pattern matching system to support more flexible patterns with [locale] placeholders and glob wildcards. The implementation uses recursive backtracking algorithms to bidirectionally transform between concrete file paths and placeholder patterns.

Key Innovation: Recursive Pattern Matching

The core challenge solved by this PR is reconstructing template patterns from matched file paths when templates contain a mix of placeholders, globs, and literals. This requires sophisticated recursive algorithms because:

  1. Glob patterns are ambiguous - *.json could match app.json or config-prod.json
  2. Placeholders must be identified - We need to know which part of src/en/app.json corresponds to [locale]
  3. Multiple solutions may exist - The algorithm must pick one of them consistently

How expandPlaceholderedGlob Works

Step 1: Forward Expansion (Finding Files)

Input:  "src/[locale]/**/*.json" with locale "en"
Replace: "src/en/**/*.json"
Glob:   ["src/en/config/app.json", "src/en/data/users.json"]
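
The substitution half of this step can be sketched as follows. This is a minimal illustration, not the PR's code: `substituteLocale` is a hypothetical helper name, and the resulting pattern would still need to be passed to a glob library to list files.

```typescript
const LOCALE_PLACEHOLDER = "[locale]";

// Hypothetical helper: replace every [locale] placeholder with a concrete
// locale code, producing a plain glob pattern.
function substituteLocale(pattern: string, locale: string): string {
  return pattern.split(LOCALE_PLACEHOLDER).join(locale);
}

const globPattern = substituteLocale("src/[locale]/**/*.json", "en");
console.log(globPattern); // "src/en/**/*.json"
```

Splitting and joining on the placeholder avoids having to escape the square brackets, which a regular expression would treat as a character class.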

Step 2: Reverse Reconstruction (Restoring Placeholders via Recursion)

This is where the recursion happens. The algorithm uses depth-first search with memoization to match template segments against source path segments.

The Recursive Algorithms

1. matchTemplateToSource (lines 191-252) - Path-Level Recursion

This function recursively matches template segments to source path segments:

function matchTemplateToSource(
  template: TemplateSegment[],      // e.g., ["src", "[locale]", "**", "*.json"]
  sourceSegments: string[],          // e.g., ["src", "en", "config", "app.json"]
  locale: string                     // e.g., "en"
): string[] | null

Recursive Strategy:

  • Base case: If we've matched all template segments and consumed all source segments, success!
  • Recursive case: Try to match the current template segment against source segment(s), then recurse on the rest

Key recursive behaviors:

For regular segments:

// Match current segment, recurse on remainder
if (segmentMatchesSource(segment, sourceSegments[sourceIndex], locale)) {
  const rest = dfs(templateIndex + 1, sourceIndex + 1);  // RECURSION
  if (rest) {
    const current = buildOutputSegment(segment, sourceSegments[sourceIndex], locale);
    return [current, ...rest];
  }
}

For globstar (**) - This is particularly clever:

// Try consuming 0, 1, 2, ... N segments
for (let consume = 0; consume <= sourceSegments.length - sourceIndex; consume += 1) {
  const rest = dfs(templateIndex + 1, sourceIndex + consume);  // RECURSION
  if (rest) {
    const consumed = sourceSegments.slice(sourceIndex, sourceIndex + consume);
    return [...consumed, ...rest];
  }
}

The globstar recursion tries every possible number of segments the ** could match, using backtracking to find the correct solution.
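
Putting the two excerpts together, a self-contained sketch of the path-level search might look like the block below. It is a simplified assumption of how the pieces fit, not the PR's implementation: `matchTemplate` and `segmentMatches` are hypothetical names, segments are plain strings, and a segment containing `*` is checked with a small regular expression instead of the real segment matcher.

```typescript
// Simplified single-segment check (assumption): "[locale]" must equal the
// locale, a segment containing "*" is a wildcard, anything else is a literal.
function segmentMatches(segment: string, source: string, locale: string): boolean {
  if (segment === "[locale]") return source === locale;
  if (segment.includes("*")) {
    const body = segment
      .split("*")
      .map((s) => s.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
      .join(".*");
    return new RegExp(`^${body}$`).test(source);
  }
  return segment === source;
}

function matchTemplate(template: string[], source: string[], locale: string): string[] | null {
  // Cache keyed on "templateIndex|sourceIndex"; failures (null) are cached too.
  const memo = new Map<string, string[] | null>();

  function dfs(ti: number, si: number): string[] | null {
    const key = `${ti}|${si}`;
    if (memo.has(key)) return memo.get(key) ?? null;
    let result: string[] | null = null;
    if (ti === template.length) {
      // Base case: succeed only if the source is fully consumed as well.
      result = si === source.length ? [] : null;
    } else if (template[ti] === "**") {
      // Globstar: try consuming 0, 1, 2, ... segments, fewest first.
      for (let consume = 0; consume <= source.length - si; consume += 1) {
        const rest = dfs(ti + 1, si + consume);
        if (rest) {
          result = [...source.slice(si, si + consume), ...rest];
          break;
        }
      }
    } else if (si < source.length && segmentMatches(template[ti], source[si], locale)) {
      const rest = dfs(ti + 1, si + 1);
      if (rest) {
        // Placeholders are restored; every other segment keeps its matched text.
        const current = template[ti] === "[locale]" ? "[locale]" : source[si];
        result = [current, ...rest];
      }
    }
    memo.set(key, result);
    return result;
  }

  return dfs(0, 0);
}

const restored = matchTemplate(
  ["src", "[locale]", "**", "app-*.json"],
  ["src", "en", "config", "prod", "app-main.json"],
  "en",
);
console.log(restored); // ["src", "[locale]", "config", "prod", "app-main.json"]
```

Segments that mix a placeholder with a wildcard (for example `app-[locale]-*.json`) are deliberately out of scope here; those belong to the character-level matcher.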

2. renderSegment (lines 111-175) - Character-Level Recursion

When a segment contains placeholders or globs within a single path component (e.g., app-[locale]-*.json), this function recursively matches character by character:

function renderSegment(
  segment: TemplateSegment,  // e.g., "app-[locale]-*.json"
  source: string,            // e.g., "app-en-config.json"
  locale: string             // e.g., "en"
): string | null

Recursive Strategy:

  • Base case: If we've matched all parts and consumed the entire source string, success!
  • Recursive case: Match the current part (literal/placeholder/glob), then recurse on the rest

For literals:

if (part.kind === "literal") {
  if (source.startsWith(part.value, position)) {
    const rest = dfs(partIndex + 1, position + part.value.length);  // RECURSION
    if (rest !== null) {
      return part.value + rest;
    }
  }
}

For placeholders:

if (part.kind === "placeholder") {
  if (source.startsWith(locale, position)) {
    const rest = dfs(partIndex + 1, position + locale.length);  // RECURSION
    if (rest !== null) {
      return `${LOCALE_PLACEHOLDER}${rest}`;  // Restore [locale]
    }
  }
}

For globs - Most complex, tries every possible match length:

// Try matching 0, 1, 2, ... N characters
for (let length = 0; position + length <= source.length; length += 1) {
  const fragment = source.slice(position, position + length);
  if (minimatch(fragment, part.value)) {
    const rest = dfs(partIndex + 1, position + length);  // RECURSION
    if (rest !== null) {
      return fragment + rest;  // Use actual matched value
    }
  }
}
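
The three branches above can be combined into one runnable sketch. The names `parseParts` and `restoreSegment` are hypothetical, and the sketch assumes that the only glob token inside a segment is `*`, matching any run of characters; the PR itself delegates that check to minimatch.

```typescript
type Part =
  | { kind: "literal"; value: string }
  | { kind: "placeholder" }
  | { kind: "glob" };

const LOCALE_PLACEHOLDER = "[locale]";

// Split a segment such as "app-[locale]-*.json" into literal, placeholder and "*" parts.
function parseParts(segment: string): Part[] {
  const tokens = segment.split(/(\[locale\]|\*)/).filter((t) => t.length > 0);
  return tokens.map((token): Part => {
    if (token === LOCALE_PLACEHOLDER) return { kind: "placeholder" };
    if (token === "*") return { kind: "glob" };
    return { kind: "literal", value: token };
  });
}

function restoreSegment(segment: string, source: string, locale: string): string | null {
  const parts = parseParts(segment);
  // Cache keyed on "partIndex|position"; null (failure) is cached as well.
  const memo = new Map<string, string | null>();

  function dfs(partIndex: number, position: number): string | null {
    const key = `${partIndex}|${position}`;
    const cached = memo.get(key);
    if (cached !== undefined) return cached;
    let result: string | null = null;
    if (partIndex === parts.length) {
      // Base case: all parts matched and the whole source consumed.
      result = position === source.length ? "" : null;
    } else {
      const part = parts[partIndex];
      if (part.kind === "literal") {
        if (source.startsWith(part.value, position)) {
          const rest = dfs(partIndex + 1, position + part.value.length);
          if (rest !== null) result = part.value + rest;
        }
      } else if (part.kind === "placeholder") {
        if (source.startsWith(locale, position)) {
          const rest = dfs(partIndex + 1, position + locale.length);
          if (rest !== null) result = LOCALE_PLACEHOLDER + rest;
        }
      } else {
        // "*": try fragments of length 0, 1, 2, ... and keep the first that works.
        for (let length = 0; position + length <= source.length; length += 1) {
          const rest = dfs(partIndex + 1, position + length);
          if (rest !== null) {
            result = source.slice(position, position + length) + rest;
            break;
          }
        }
      }
    }
    memo.set(key, result);
    return result;
  }

  return dfs(0, 0);
}

console.log(restoreSegment("app-[locale]-*.json", "app-en-config.json", "en"));
// "app-[locale]-config.json"
```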

Why Recursion + Memoization?

The Problem:
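Backtracking has to consider every way of splitting a path between wildcards, and a naive search can revisit the same subproblems so often that the work grows exponentially. For example, matching **/*.json against a/b/c/d/file.json has several candidate positions for where ** ends and *.json begins, and only one of them (** = a/b/c/d) leaves exactly file.json for *.json.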

The Solution:

  • Recursion provides elegant backtracking to explore all possibilities
  • Memoization (lines 196, 116) caches results to avoid recomputing the same subproblems
  • Together, they turn an exponential problem into polynomial time
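
To make the effect of the cache concrete, here is a small experiment, not taken from the PR, that counts how many states a simplified path-level search computes with and without a cache keyed on `templateIndex|sourceIndex`. The function name `countVisits` and the test data are assumptions chosen so that the match fails and the search has to exhaust every split.

```typescript
// Counts how many states are actually computed (cache hits are not counted).
function countVisits(template: string[], source: string[], useMemo: boolean): number {
  let visits = 0;
  const memo = new Map<string, boolean>();

  function dfs(ti: number, si: number): boolean {
    const key = `${ti}|${si}`;
    if (useMemo) {
      const cached = memo.get(key);
      if (cached !== undefined) return cached;
    }
    visits += 1;
    let ok = false;
    if (ti === template.length) {
      ok = si === source.length;
    } else if (template[ti] === "**") {
      // Try every consume count until one succeeds.
      for (let consume = 0; consume <= source.length - si && !ok; consume += 1) {
        ok = dfs(ti + 1, si + consume);
      }
    } else if (si < source.length && template[ti] === source[si]) {
      ok = dfs(ti + 1, si + 1);
    }
    if (useMemo) memo.set(key, ok);
    return ok;
  }

  dfs(0, 0);
  return visits;
}

// Four globstars followed by a literal that never appears in the source.
const template = ["**", "**", "**", "**", "missing.json"];
const source = Array.from({ length: 12 }, (_, i) => `dir${i}`);
const withMemo = countVisits(template, source, true);
const withoutMemo = countVisits(template, source, false);
console.log({ withMemo, withoutMemo });
```

With the cache, the number of computed states cannot exceed (template length + 1) × (source length + 1), whereas the uncached search recomputes the same (templateIndex, sourceIndex) pairs once per path that reaches them.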

Complete Example

Template: ["src", "[locale]", "**", "app-*.json"]
Source:   ["src", "en", "config", "prod", "app-main.json"]
Locale:   "en"

Recursion trace:
1. Match "src" → "src" ✓ (literal)
2. Match "[locale]" → "en" ✓ (placeholder, output "[locale]")
3. Match "**" → try consuming 0, 1, 2, ... segments, matching the rest of the template after each:
   - Consume 0: "app-*.json" vs "config" → fails
   - Consume 1 (["config"]): "app-*.json" vs "prod" → fails
   - Consume 2 (["config", "prod"]): "app-*.json" vs "app-main.json" ✓ (glob, output "app-main.json")
   - Template and source both exhausted → SUCCESS!
   
Result: ["src", "[locale]", "config", "prod", "app-main.json"]

Caveats and Trade-offs

While the algorithm is functionally correct for practical use cases, there are important behavioral characteristics and edge cases to understand:

1. Lazy Matching Strategy

The algorithm uses lazy (shortest-first) matching for globs and globstars. When multiple valid matches exist, it returns the first solution found.

Example:

Pattern: "**/config/**/*.json"
Source:  "a/b/config/c/d/config/e/file.json"

Possible matches:
1. ** = "a/b", config, ** = "c/d/config/e"   ← Algorithm returns this
2. ** = "a/b/config/c/d", config, ** = "e"

Both are valid, but the algorithm prefers (1) because it tries 
consuming fewer segments first.

Impact: Results are deterministic but may not match user intuition about which path segments correspond to which pattern parts.

2. Ambiguous Patterns Can Fail

Patterns where placeholders and globs could match the same content may fail to restore correctly:

Problematic pattern:

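Pattern: "[locale]-*-[locale].json"
Source:  "en-config-en.json"  Restores to "[locale]-config-[locale].json"
Source:  "en-en-test.json"    No restoration: the name does not end in "-en.json"

Here the "-" separators pin both placeholders, so at most one restoration exists.
Real ambiguity appears when a wildcard can absorb text equal to the locale:

Pattern: "*-[locale]-*.json"
Source:  "a-en-en-b.json"     Ambiguous!

Possible restorations:
- * = "a", [locale] = first "en", * = "en-b"    → "a-[locale]-en-b.json"
- * = "a-en", [locale] = second "en", * = "b"   → "a-en-[locale]-b.json"

The algorithm tries the shortest match first and returns the first restoration, which may not be the intended one.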

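Mitigation: If the algorithm cannot restore the placeholders at all, it throws a CLIError rather than returning an incorrect path. This does not cover cases where several restorations are valid; those resolve to the first one found. The error reads: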

Pattern "config/[locale]-*-[locale].json" does not map cleanly to 
matched path "config/en-en-test.json". Adjust the glob so placeholder 
segments can be restored without ambiguity.
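
A minimal sketch of how such a guard could be wired up is shown below. The `CLIError` class here is a hypothetical stand-in (the PR names the class but does not show its definition), and `restoreOrThrow` with its `restore` callback is an illustrative name rather than the PR's API.

```typescript
// Hypothetical stand-in for the CLIError class named in the PR.
class CLIError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CLIError";
  }
}

// Wraps any restore function and fails loudly when it returns null.
function restoreOrThrow(
  pattern: string,
  matchedPath: string,
  restore: (pattern: string, matchedPath: string) => string | null,
): string {
  const restored = restore(pattern, matchedPath);
  if (restored === null) {
    throw new CLIError(
      `Pattern "${pattern}" does not map cleanly to matched path "${matchedPath}". ` +
        "Adjust the glob so placeholder segments can be restored without ambiguity.",
    );
  }
  return restored;
}
```

Failing at this point keeps a path that could not be mapped back to its template from flowing further into the pipeline.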

3. Empty Glob Matches

Globs can match empty strings, which may be unexpected:

Pattern: "config-*.json"
Matches: "config-.json"  // * matches empty string ""
Result:  "config-.json"  (the matched text is kept; it is not turned back into "config-*.json")

Impact: Patterns may match more liberally than expected. This follows standard glob behavior where * means "zero or more characters."

4. Performance with Complex Patterns

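Patterns with many wildcards enlarge the search space. Memoization bounds the path-level search to at most (template segments × source segments) distinct states, so the worst case is polynomial rather than exponential, but every globstar state still loops over each possible consume count:

Pattern: "**/a/**/b/**/*.json"
Source:  "x/a/y/a/z/b/w/b/file.json"

Each ** can end at several positions, creating many candidate splits to explore.
Memoization removes repeated work, but cost still grows with pattern and path length.

Impact: For most i18n projects (< 10,000 files, simple patterns), this is negligible. Very complex patterns on large codebases may be slow.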

5. Lazy Matching with Backtracking within Segments

Within a single path segment, glob matching is lazy (shortest-first) and relies on backtracking to reach a full match:

Pattern: "app-*.json"
Source:  "app-config-prod.json"

The * first tries "", "c", "co", ... and ends up matching "config-prod"
(not just "config"), because every shorter candidate leaves text that ".json" cannot match.

Impact: Generally correct, but understanding the backtracking behavior helps when debugging unexpected matches.

6. Search-Order-Dependent Choice

When multiple valid interpretations exist, the search order (lazy matching, left-to-right) determines which is returned:

Pattern: "**/[locale]/**/*.json"
Locale:  "en"
Source:  "a/en/b/en/file.json"

Valid outputs:
- "a/[locale]/b/en/file.json"   ← Returned (first ** consumes only "a")
- "a/en/b/[locale]/file.json"

Impact: Both outputs are technically correct, but users might expect the other one. The algorithm is deterministic, but the choice is somewhat arbitrary.
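
One way to see which restoration wins is to enumerate every valid one in search order. The sketch below is a debugging aid under simplifying assumptions, not part of the PR: `allRestorations` is a hypothetical name, and it only understands `**`, a whole-segment `*`, `[locale]`, and literal segments.

```typescript
// Lists every valid restoration in the order a lazy, left-to-right search finds them.
function allRestorations(template: string[], source: string[], locale: string): string[] {
  const results: string[] = [];

  function dfs(ti: number, si: number, output: string[]): void {
    if (ti === template.length) {
      if (si === source.length) results.push(output.join("/"));
      return;
    }
    const segment = template[ti];
    if (segment === "**") {
      // Fewest consumed segments first, but keep going after a success.
      for (let consume = 0; consume <= source.length - si; consume += 1) {
        dfs(ti + 1, si + consume, [...output, ...source.slice(si, si + consume)]);
      }
    } else if (si < source.length) {
      if (segment === "[locale]" && source[si] === locale) {
        dfs(ti + 1, si + 1, [...output, "[locale]"]);
      } else if (segment === "*" || segment === source[si]) {
        dfs(ti + 1, si + 1, [...output, source[si]]);
      }
    }
  }

  dfs(0, 0, []);
  return results;
}

const found = allRestorations(
  ["**", "[locale]", "**", "*"],
  ["a", "en", "b", "en", "file.json"],
  "en",
);
console.log(found);
// ["a/[locale]/b/en/file.json", "a/en/b/[locale]/file.json"]
```

The first entry is what a first-match search would return; a result list longer than one signals that the pattern is ambiguous for that path.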

Recommended Pattern Guidelines

Based on these caveats, here are recommended patterns:

Safe Patterns

// Literal paths with single placeholder
"src/[locale]/messages.json"

// Single glob with placeholder
"src/[locale]/*.json"
"[locale]/translations/**/*.json"

// Globstar for deep traversal
"**/[locale]/**/*.json"

// Unique separators around placeholders
"config.[locale].json"
"app_[locale]_*.json"

⚠️ Avoid These Patterns

// Multiple placeholders with globs between
"[locale]-*-[locale].json"

// Wildcards that could match locale value
"*/[locale]/*"  // if path is "en/en/file.json"

// Adjacent wildcards
"**/**/[locale]/*.json"

// Multiple wildcards in one segment
"*-[locale]-*-*.json"

Technical Details

  • Two-level recursion: Path-level (segments) and character-level (within segments)
  • Backtracking: Algorithm explores multiple possibilities and backtracks on failure
  • Memoization: Caches results using templateIndex|sourceIndex or partIndex|position as keys
  • Greedy matching avoided: Tries the shortest candidate first and backtracks on failure instead of committing to the longest match
  • Windows compatibility: Path normalization throughout
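
The PR does not show how paths are normalized. As an assumption, one common approach is to convert backslashes to forward slashes before splitting a path into segments; `toPosixPath` and `toSegments` below are hypothetical helper names.

```typescript
// Convert Windows-style separators to forward slashes.
function toPosixPath(filePath: string): string {
  return filePath.replace(/\\/g, "/");
}

// Split a normalized path into non-empty segments for the path-level matcher.
function toSegments(filePath: string): string[] {
  return toPosixPath(filePath)
    .split("/")
    .filter((segment) => segment.length > 0);
}

console.log(toSegments("src\\en\\config\\app.json"));
// ["src", "en", "config", "app.json"]
```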

Test Plan

  • ✅ Test patterns with placeholders only: src/[locale]/messages.json
  • ✅ Test patterns with globs only: src/**/*.json
  • ✅ Test mixed patterns: src/[locale]/**/*.json
  • ✅ Test complex patterns: config/[locale]/app-*.json
  • ✅ Test globstar consuming multiple levels: **/[locale]/**/*.json
  • ✅ Verify placeholder restoration accuracy
  • ✅ Test ambiguous pattern detection (should throw errors)
  • ✅ Test empty glob matches behavior
  • ✅ Ensure Windows path compatibility

🤖 Generated with Claude Code

@davidturnbull davidturnbull marked this pull request as ready for review September 22, 2025 02:36

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
