[BUG] split_words() silently drops adjacent @@ variable tokens → zero predictions #14

@hwu71

Description

When processing Ghidra decompiled code where variables appear adjacent without spaces (e.g., func(a,b,c) instead of func(a, b, c)), VarBERT silently produces zero predictions for the entire function.

This is common: Ghidra frequently omits spaces after commas in function calls and other comma-separated expressions.

Root Cause

The text preprocessing pipeline in _process_code_with_text() (text_processor.py:L146-153) replaces variable names with @@varname@@random_id@@ placeholders using prefix/suffix matching. When two variables are adjacent without whitespace, the placeholders merge into a single whitespace-delimited word. For example, this Ghidra code:

FUN_0000abcd(local_18,param_3,pcVar2);

becomes:

FUN_0000abcd(@@local_18@@varid_abc@@,@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);

The split_words() function in model.py:L122-140 splits the text on spaces and then uses re.search() to extract @@ patterns from each word. Since re.search() returns only the first match, any additional @@ tokens in the same word are silently dropped.
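The first-match behavior can be seen in isolation. A minimal sketch (the placeholder IDs below are illustrative, not the real random IDs):

```python
import re

# A merged word as produced by the preprocessing step when two variables
# are adjacent without whitespace (illustrative placeholder IDs).
word = ",@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);"
pattern = r"@@[^\s@]+@@[^\s@]+@@"

# re.search() stops at the first match -- the second placeholder is lost.
print(re.search(pattern, word).group())
# @@param_3@@varid_def@@

# re.findall() shows that both placeholders are actually present in the word.
print(re.findall(pattern, word))
# ['@@param_3@@varid_def@@', '@@pcVar2@@varid_ghi@@']
```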

This causes a count mismatch in generate_popular_names() (text_processor.py:L202-204):

if len(all_holders) != len(names):
    return {}, ""  # all predictions discarded
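The effect of that guard can be sketched standalone (hypothetical holder/name lists; in the real code they come from the processed text and the model output respectively):

```python
# 25 placeholders appear in the processed code, but split_words() only
# surfaced 24 of them to the model, so only 24 names come back.
all_holders = [f"@@v{i}@@id{i}@@" for i in range(25)]  # placeholders in the text
names = [f"name_{i}" for i in range(24)]               # one dropped by split_words()

if len(all_holders) != len(names):
    predictions, renamed_code = {}, ""  # every prediction is thrown away
else:
    predictions = dict(zip(all_holders, names))
    renamed_code = "..."

print(len(predictions))
# 0
```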

Steps to Reproduce

from varbert import VariableRenamingAPI
from libbs.artifacts import Function, FunctionArgument, FunctionHeader, StackVariable

# Minimal Ghidra function with adjacent variables (no spaces after commas)
code = """\
int FUN_00012345(long param_1,int param_2,undefined8 param_3)
{
  int iVar1;
  char *pcVar2;
  int local_18;
  int local_14;

  local_18 = 0;
  local_14 = *(int *)(param_1 + 4);
  FUN_0000abcd(local_18,param_3,pcVar2);
  if (param_2 == 0) {
    iVar1 = atoi((char *)pcVar2);
    local_14 = iVar1;
  }
  for (local_18 = 0; local_18 < local_14; local_18 = local_18 + 1) {
    *(int *)(param_1 + (long)local_18 * 4) = local_14;
  }
  return local_18;
}
"""

func = Function(0x12345, 0x100,
                header=FunctionHeader("FUN_00012345", 0x12345, args={}),
                stack_vars={})
for i, n in enumerate(["param_1", "param_2", "param_3"]):
    func.args[i] = FunctionArgument(i, n, None, 8)
for i, n in enumerate(["iVar1", "pcVar2", "local_18", "local_14"]):
    func.stack_vars[i] = StackVariable(i, n, None, 8, func.addr)

api = VariableRenamingAPI(use_decompiler=False, decompiler_name="ghidra")
names, _ = api.predict_variable_names(
    func, decompilation_text=code, use_decompiler=False, remove_bad_names=False)
print(f"Predictions: {len(names)}")
# Expected: 7 predictions
# Actual:   0 predictions

Diagnosis

The following script traces the preprocessing pipeline without loading the model to show exactly where the token is lost:

import re
import random
from varbert.text_processor import DecompilationTextProcessor
from varbert.model import VarBERTInterface
from libbs.artifacts import Function, FunctionArgument, FunctionHeader, StackVariable

code = """\
int FUN_00012345(long param_1,int param_2,undefined8 param_3)
{
  int iVar1;
  char *pcVar2;
  int local_18;
  int local_14;

  local_18 = 0;
  local_14 = *(int *)(param_1 + 4);
  FUN_0000abcd(local_18,param_3,pcVar2);
  if (param_2 == 0) {
    iVar1 = atoi((char *)pcVar2);
    local_14 = iVar1;
  }
  for (local_18 = 0; local_18 < local_14; local_18 = local_18 + 1) {
    *(int *)(param_1 + (long)local_18 * 4) = local_14;
  }
  return local_18;
}
"""

func = Function(0x12345, 0x100,
                header=FunctionHeader("FUN_00012345", 0x12345, args={}),
                stack_vars={})
for i, n in enumerate(["param_1", "param_2", "param_3"]):
    func.args[i] = FunctionArgument(i, n, None, 8)
for i, n in enumerate(["iVar1", "pcVar2", "local_18", "local_14"]):
    func.stack_vars[i] = StackVariable(i, n, None, 8, func.addr)

random.seed(42)
preprocessor = DecompilationTextProcessor(code, func=func, decompiler=None)
processed_code = preprocessor.processed_code

# Count @@ holders in processed code (what generate_popular_names sees)
all_holders = re.findall(r"@@[^\s@]+@@[^\s@]+@@", processed_code)
print(f"@@ holders in processed code: {len(all_holders)}")

# Count @@ words after split_words (what the model tokenizer sees)
words = VarBERTInterface.split_words(processed_code)
at_words = [w for w in words if "@@" in w]
print(f"@@ words after split_words:   {len(at_words)}")

# Show merged words where multiple @@ patterns are stuck together
for w in words:
    count = len(re.findall(r"@@[^\s@]+@@[^\s@]+@@", w))
    if count > 1:
        print(f"\nMerged word with {count} @@ patterns:")
        print(f"   {w}")

Output without fix (bug present):

@@ holders in processed code: 25
@@ words after split_words:   24

Merged word with 2 @@ patterns:
   ,@@param_3@@varid_9p452b@@,@@pcVar2@@varid_vhs1k3@@);

The processed code has 25 @@ placeholder tokens, but split_words() only produces 24 <mask> tokens — the word ,@@param_3@@...@@,@@pcVar2@@...@@); contains two @@ patterns merged together, but re.search() only extracts the first one. This 25 vs 24 mismatch causes generate_popular_names() to discard all predictions.

Output with fix (bug resolved):

@@ holders in processed code: 25
@@ words after split_words:   25

Expected vs Actual Behavior

| Input | Expected | Actual |
| --- | --- | --- |
| `func(a, b, c)` (spaces) | 7 predictions | ✅ 7 predictions |
| `func(a,b,c)` (no spaces) | 7 predictions | ❌ 0 predictions |

Log warnings produced:

WARNING | varbert.text_processor | Unexpected number of variable name holders versus variable names.
WARNING | varbert.api | Unable to predict any names for function ...

Proposed Fix

Replace re.search() with re.finditer() in split_words() to extract all @@ patterns from each word, not just the first:

 @staticmethod
 def split_words(text: str):
     words = text.replace("\n", " ").split(" ")
     r = []
     for w in words:
-        m = re.search(r"@@[^\s@]+@@[^\s@]+@@", w)
-        if m is not None:
-            if m.start() > 0:
-                r.append(w[: m.start()])
-            r.append(w[m.start(): m.end()])
-            if m.end() < len(w):
-                r.append(w[m.end():])
+        matches = list(re.finditer(r"@@[^\s@]+@@[^\s@]+@@", w))
+        if matches:
+            pos = 0
+            for m in matches:
+                if m.start() > pos:
+                    r.append(w[pos: m.start()])
+                r.append(w[m.start(): m.end()])
+                pos = m.end()
+            if pos < len(w):
+                r.append(w[pos:])
         else:
             r.append(w)
     r = [w for w in r if len(w) > 0]
     return r
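As a sanity check, here is a standalone copy of the patched logic applied to a merged word; the real method lives on VarBERTInterface in model.py, and the placeholder IDs are illustrative:

```python
import re

def split_words(text: str):
    """Patched split_words(): extracts ALL @@ patterns from each word."""
    words = text.replace("\n", " ").split(" ")
    r = []
    for w in words:
        matches = list(re.finditer(r"@@[^\s@]+@@[^\s@]+@@", w))
        if matches:
            pos = 0
            for m in matches:
                if m.start() > pos:
                    r.append(w[pos: m.start()])
                r.append(w[m.start(): m.end()])
                pos = m.end()
            if pos < len(w):
                r.append(w[pos:])
        else:
            r.append(w)
    return [w for w in r if len(w) > 0]

# The problematic line from the reproduction, with illustrative IDs.
line = "FUN_0000abcd(@@local_18@@varid_abc@@,@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);"
print([t for t in split_words(line) if "@@" in t])
# ['@@local_18@@varid_abc@@', '@@param_3@@varid_def@@', '@@pcVar2@@varid_ghi@@']
```

All three placeholders survive, including the two adjacent ones that the old re.search() version merged into one.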

Impact

Any Ghidra-decompiled function containing adjacent variables without whitespace separators (very common in Ghidra output) returns zero variable name predictions. The failure is silent: no exception is raised, only a warning is logged.
