[BUG] split_words() silently drops adjacent @@ variable tokens → zero predictions #14

@hwu71

Description

When processing Ghidra decompiled code where variables appear adjacent without spaces (e.g., func(a,b,c) instead of func(a, b, c)), VarBERT silently produces zero predictions for the entire function.

This is common: Ghidra frequently omits spaces after commas in function calls and other comma-separated expressions.

Root Cause

The text preprocessing pipeline in _process_code_with_text() (text_processor.py:L146-153) replaces variable names with @@varname@@random_id@@ placeholders using prefix/suffix matching. When two variables are adjacent without whitespace, the placeholders merge into a single whitespace-delimited word. For example, this Ghidra code:

FUN_0000abcd(local_18,param_3,pcVar2);

becomes:

FUN_0000abcd(@@local_18@@varid_abc@@,@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);

The split_words() function in model.py:L122-140 splits the text on spaces and then uses re.search() to extract @@ patterns from each word. Since re.search() returns only the first match, any additional @@ tokens in the same word are silently dropped.
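The first-match behavior can be seen in isolation. A minimal sketch (the placeholder IDs below are illustrative, not the real random IDs):

```python
import re

# A merged word as produced by the preprocessing step when two variables
# are adjacent without whitespace (illustrative placeholder IDs).
word = ",@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);"
pattern = r"@@[^\s@]+@@[^\s@]+@@"

# re.search() stops at the first match -- the second placeholder is lost.
print(re.search(pattern, word).group())
# @@param_3@@varid_def@@

# re.findall() shows that both placeholders are actually present in the word.
print(re.findall(pattern, word))
# ['@@param_3@@varid_def@@', '@@pcVar2@@varid_ghi@@']
```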

This causes a count mismatch in generate_popular_names() (text_processor.py:L202-204):

if len(all_holders) != len(names):
    return {}, ""  # all predictions discarded
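The effect of that guard can be sketched standalone (hypothetical holder/name lists; in the real code they come from the processed text and the model output respectively):

```python
# 25 placeholders appear in the processed code, but split_words() only
# surfaced 24 of them to the model, so only 24 names come back.
all_holders = [f"@@v{i}@@id{i}@@" for i in range(25)]  # placeholders in the text
names = [f"name_{i}" for i in range(24)]               # one dropped by split_words()

if len(all_holders) != len(names):
    predictions, renamed_code = {}, ""  # every prediction is thrown away
else:
    predictions = dict(zip(all_holders, names))
    renamed_code = "..."

print(len(predictions))
# 0
```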

Steps to Reproduce

from varbert import VariableRenamingAPI
from libbs.artifacts import Function, FunctionArgument, FunctionHeader, StackVariable

# Minimal Ghidra function with adjacent variables (no spaces after commas)
code = """\
int FUN_00012345(long param_1,int param_2,undefined8 param_3)
{
  int iVar1;
  char *pcVar2;
  int local_18;
  int local_14;

  local_18 = 0;
  local_14 = *(int *)(param_1 + 4);
  FUN_0000abcd(local_18,param_3,pcVar2);
  if (param_2 == 0) {
    iVar1 = atoi((char *)pcVar2);
    local_14 = iVar1;
  }
  for (local_18 = 0; local_18 < local_14; local_18 = local_18 + 1) {
    *(int *)(param_1 + (long)local_18 * 4) = local_14;
  }
  return local_18;
}
"""

func = Function(0x12345, 0x100,
                header=FunctionHeader("FUN_00012345", 0x12345, args={}),
                stack_vars={})
for i, n in enumerate(["param_1", "param_2", "param_3"]):
    func.args[i] = FunctionArgument(i, n, None, 8)
for i, n in enumerate(["iVar1", "pcVar2", "local_18", "local_14"]):
    func.stack_vars[i] = StackVariable(i, n, None, 8, func.addr)

api = VariableRenamingAPI(use_decompiler=False, decompiler_name="ghidra")
names, _ = api.predict_variable_names(
    func, decompilation_text=code, use_decompiler=False, remove_bad_names=False)
print(f"Predictions: {len(names)}")
# Expected: 7 predictions
# Actual:   0 predictions

Diagnosis

The following script traces the preprocessing pipeline without loading the model to show exactly where the token is lost:

import re
import random
from varbert.text_processor import DecompilationTextProcessor
from varbert.model import VarBERTInterface
from libbs.artifacts import Function, FunctionArgument, FunctionHeader, StackVariable

code = """\
int FUN_00012345(long param_1,int param_2,undefined8 param_3)
{
  int iVar1;
  char *pcVar2;
  int local_18;
  int local_14;

  local_18 = 0;
  local_14 = *(int *)(param_1 + 4);
  FUN_0000abcd(local_18,param_3,pcVar2);
  if (param_2 == 0) {
    iVar1 = atoi((char *)pcVar2);
    local_14 = iVar1;
  }
  for (local_18 = 0; local_18 < local_14; local_18 = local_18 + 1) {
    *(int *)(param_1 + (long)local_18 * 4) = local_14;
  }
  return local_18;
}
"""

func = Function(0x12345, 0x100,
                header=FunctionHeader("FUN_00012345", 0x12345, args={}),
                stack_vars={})
for i, n in enumerate(["param_1", "param_2", "param_3"]):
    func.args[i] = FunctionArgument(i, n, None, 8)
for i, n in enumerate(["iVar1", "pcVar2", "local_18", "local_14"]):
    func.stack_vars[i] = StackVariable(i, n, None, 8, func.addr)

random.seed(42)
preprocessor = DecompilationTextProcessor(code, func=func, decompiler=None)
processed_code = preprocessor.processed_code

# Count @@ holders in processed code (what generate_popular_names sees)
all_holders = re.findall(r"@@[^\s@]+@@[^\s@]+@@", processed_code)
print(f"@@ holders in processed code: {len(all_holders)}")

# Count @@ words after split_words (what the model tokenizer sees)
words = VarBERTInterface.split_words(processed_code)
at_words = [w for w in words if "@@" in w]
print(f"@@ words after split_words:   {len(at_words)}")

# Show merged words where multiple @@ patterns are stuck together
for w in words:
    count = len(re.findall(r"@@[^\s@]+@@[^\s@]+@@", w))
    if count > 1:
        print(f"\nMerged word with {count} @@ patterns:")
        print(f"   {w}")

Output without fix (bug present):

@@ holders in processed code: 25
@@ words after split_words:   24

Merged word with 2 @@ patterns:
   ,@@param_3@@varid_9p452b@@,@@pcVar2@@varid_vhs1k3@@);

The processed code has 25 @@ placeholder tokens, but split_words() only produces 24 <mask> tokens — the word ,@@param_3@@...@@,@@pcVar2@@...@@); contains two @@ patterns merged together, but re.search() only extracts the first one. This 25 vs 24 mismatch causes generate_popular_names() to discard all predictions.

Output with fix (bug resolved):

@@ holders in processed code: 25
@@ words after split_words:   25

Expected vs Actual Behavior

| Input | Expected | Actual |
| --- | --- | --- |
| `func(a, b, c)` (spaces) | 7 predictions | ✅ 7 predictions |
| `func(a,b,c)` (no spaces) | 7 predictions | ❌ 0 predictions |

Log warnings produced:

WARNING | varbert.text_processor | Unexpected number of variable name holders versus variable names.
WARNING | varbert.api | Unable to predict any names for function ...

Proposed Fix

Replace re.search() with re.finditer() in split_words() to extract all @@ patterns from each word, not just the first:

 @staticmethod
 def split_words(text: str):
     words = text.replace("\n", " ").split(" ")
     r = []
     for w in words:
-        m = re.search(r"@@[^\s@]+@@[^\s@]+@@", w)
-        if m is not None:
-            if m.start() > 0:
-                r.append(w[: m.start()])
-            r.append(w[m.start(): m.end()])
-            if m.end() < len(w):
-                r.append(w[m.end():])
+        matches = list(re.finditer(r"@@[^\s@]+@@[^\s@]+@@", w))
+        if matches:
+            pos = 0
+            for m in matches:
+                if m.start() > pos:
+                    r.append(w[pos: m.start()])
+                r.append(w[m.start(): m.end()])
+                pos = m.end()
+            if pos < len(w):
+                r.append(w[pos:])
         else:
             r.append(w)
     r = [w for w in r if len(w) > 0]
     return r
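As a sanity check, here is a standalone copy of the patched logic applied to a merged word; the real method lives on VarBERTInterface in model.py, and the placeholder IDs are illustrative:

```python
import re

def split_words(text: str):
    """Patched split_words(): extracts ALL @@ patterns from each word."""
    words = text.replace("\n", " ").split(" ")
    r = []
    for w in words:
        matches = list(re.finditer(r"@@[^\s@]+@@[^\s@]+@@", w))
        if matches:
            pos = 0
            for m in matches:
                if m.start() > pos:
                    r.append(w[pos: m.start()])
                r.append(w[m.start(): m.end()])
                pos = m.end()
            if pos < len(w):
                r.append(w[pos:])
        else:
            r.append(w)
    return [w for w in r if len(w) > 0]

# The problematic line from the reproduction, with illustrative IDs.
line = "FUN_0000abcd(@@local_18@@varid_abc@@,@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);"
print([t for t in split_words(line) if "@@" in t])
# ['@@local_18@@varid_abc@@', '@@param_3@@varid_def@@', '@@pcVar2@@varid_ghi@@']
```

All three placeholders survive, including the two adjacent ones that the old re.search() version merged into one.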

Impact

Any Ghidra-decompiled function containing adjacent variables without whitespace separators (very common in Ghidra output) returns zero variable name predictions. The failure is silent: no exception is raised, only a warning is logged.
