fix(common): use lazy quantifiers in removeDynamicContent to avoid over-matching #6065

Open
ebrasha wants to merge 1 commit into sqlmapproject:master from ebrasha:bugfix/remove-dynamic-content-lazy-regex

ebrasha commented May 30, 2026

What changed

removeDynamicContent() was using greedy .+ between dynamic-marking anchors.
On any page where the same prefix/suffix shows up more than once (SPAs, templated lists, repeated card components), that single greedy match would swallow everything from the first prefix to the last suffix and replace it with one collapsed block.

That over-cleanup hides real differences between True/False responses and feeds straight into false positives (and false negatives) downstream in the comparison logic.
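
A minimal repro of that failure mode, as a sketch only: the marker strings and page bodies below are made up for illustration, and clean_greedy is a hypothetical stand-in for the old greedy pattern, not the sqlmap function itself. The True and False pages differ in the static text between two dynamic blocks, and the greedy pattern erases that difference.

```python
import re

# Hypothetical dynamic markers; real ones are derived from the target page.
PREFIX, SUFFIX = '<span class="ts">', '</span>'


def clean_greedy(page):
    """Collapse everything from the first PREFIX to the last SUFFIX."""
    pattern = r"(?s)%s.+%s" % (re.escape(PREFIX), re.escape(SUFFIX))
    # A callable replacement avoids backslash-escape handling in the template.
    return re.sub(pattern, lambda m: PREFIX + SUFFIX, page)


# The static middle part is the only thing that tells True from False.
true_page = '<span class="ts">10:01</span><p>1 row</p><span class="ts">10:02</span>'
false_page = '<span class="ts">10:03</span><p>0 rows</p><span class="ts">10:04</span>'

greedy_true = clean_greedy(true_page)
greedy_false = clean_greedy(false_page)

print(greedy_true)                  # <span class="ts"></span>
print(greedy_true == greedy_false)  # True: the real difference is gone
```

Both pages reduce to the same empty marker pair, so any later comparison has nothing left to distinguish them.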

I switched the two affected patterns from greedy to lazy:

  • ^.+suffix → ^.+?suffix for the "suffix-only" case, so we stop at the first suffix instead of the last
  • prefix.+suffix → prefix.+?suffix for the "both anchors" case, so each dynamic block gets cleaned individually instead of all of them merging into one
  • the prefix.+$ case stayed as-is — it's anchored at end-of-string, so greedy vs lazy doesn't change the result there
  • added a short note in the docstring so nobody accidentally reverts this back to greedy later
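
The three cases above can be sketched as a standalone function. This is an illustration under assumptions, not the code in lib/core/common.py: the name remove_dynamic_content, the explicit markings parameter, and the lazy switch are hypothetical and exist only so both behaviours can be compared side by side.

```python
import re


def remove_dynamic_content(page, markings, lazy=True):
    """Strip dynamic regions described by (prefix, suffix) pairs.

    Standalone sketch: `markings` and `lazy` are illustrative parameters.
    """
    quantifier = ".+?" if lazy else ".+"
    for prefix, suffix in markings:
        if prefix is None and suffix is None:
            continue
        elif prefix is None:
            # Suffix-only: lazy stops at the first suffix, greedy at the last.
            pattern = r"(?s)^%s%s" % (quantifier, re.escape(suffix))
            replacement = suffix
        elif suffix is None:
            # Prefix-only: left greedy, it runs to the end of the page.
            pattern = r"(?s)%s.+$" % re.escape(prefix)
            replacement = prefix
        else:
            # Both anchors: lazy cleans each block on its own.
            pattern = r"(?s)%s%s%s" % (re.escape(prefix), quantifier, re.escape(suffix))
            replacement = prefix + suffix
        page = re.sub(pattern, lambda m, r=replacement: r, page)
    return page


both = [("<i>", "</i>")]
print(remove_dynamic_content("<i>1</i>x<i>2</i>", both))              # <i></i>x<i></i>
print(remove_dynamic_content("<i>1</i>x<i>2</i>", both, lazy=False))  # <i></i>

suffix_only = [(None, "</b>")]
print(remove_dynamic_content("aaa</b>bbb</b>ccc", suffix_only))              # </b>bbb</b>ccc
print(remove_dynamic_content("aaa</b>bbb</b>ccc", suffix_only, lazy=False))  # </b>ccc
```

With lazy matching the static text between repeated anchors (the "x" and "bbb" above) survives the cleanup; with greedy matching it is swallowed.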

Why this is better

The dynamic markings represent a single dynamic block, not a region spanning the entire page. Greedy was breaking that contract whenever the anchors repeated.

Some specifics worth calling out:

  • with lazy matching, re.sub (which is global by default) actually walks the page and cleans each dynamic region one by one, which is what the markings were designed for
  • this lines up the cleanup logic with how findDynamicContent builds the markings in the first place
  • pages that don't have repeating anchors behave identically — same single match, same replacement
  • combined with the matchRatio calibration fix from the earlier PR, the comparison engine now sees genuine True/False deltas instead of artificially flattened pages
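
A quick way to check the first and third points is re.subn, which returns the substitution count alongside the result. The marker strings and sample pages here are hypothetical and chosen only for the demonstration.

```python
import re

# Hypothetical markers, for illustration only.
prefix, suffix = "<!--dyn-->", "<!--/dyn-->"

greedy = re.compile(r"(?s)%s.+%s" % (re.escape(prefix), re.escape(suffix)))
lazy = re.compile(r"(?s)%s.+?%s" % (re.escape(prefix), re.escape(suffix)))


def clean(regex, page):
    """Return (cleaned_page, number_of_regions_replaced)."""
    return regex.subn(lambda m: prefix + suffix, page)


single = "head<!--dyn-->42<!--/dyn-->tail"
repeated = "<!--dyn-->1<!--/dyn-->A<!--dyn-->2<!--/dyn-->B"

# One pair of anchors: both quantifiers produce the same single match.
print(clean(greedy, single) == clean(lazy, single))  # True

# Repeating anchors: greedy makes one merged match, lazy makes two.
print(clean(greedy, repeated))  # ('<!--dyn--><!--/dyn-->B', 1)
print(clean(lazy, repeated))    # ('<!--dyn--><!--/dyn-->A<!--dyn--><!--/dyn-->B', 2)
```

The substitution count of 2 in the lazy case is the "one by one" walk described above, and the static "A" between the two regions is preserved.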

Scope

Only lib/core/common.py is touched.
Change is limited to removeDynamicContent() — two regex quantifiers and a docstring note.
No public API changes, no behavior change on pages without repeating anchors.
