Description
When processing Ghidra decompiled code where variables appear adjacent without spaces (e.g., func(a,b,c) instead of func(a, b, c)), VarBERT silently produces zero predictions for the entire function.
This is common in Ghidra output — Ghidra often omits spaces after commas in function calls and comma-separated expressions.
Root Cause
The text preprocessing pipeline in _process_code_with_text() (text_processor.py:L146-153) replaces variable names with @@varname@@random_id@@ placeholders using prefix/suffix matching. When two variables are adjacent without whitespace, the placeholders merge into a single whitespace-delimited word. For example, this Ghidra code:
```c
FUN_0000abcd(local_18,param_3,pcVar2);
```

becomes:

```c
FUN_0000abcd(@@local_18@@varid_abc@@,@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);
```
The split_words() function in model.py:L122-140 splits by spaces and then uses re.search() to extract @@ patterns from each word. Since re.search() only returns the first match, the subsequent adjacent @@ tokens are silently lost.
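The first-match behavior can be reproduced in isolation (the varid suffixes below are made up for illustration):

```python
import re

# A merged token like the one above, with two adjacent placeholders
word = ",@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);"
pattern = r"@@[^\s@]+@@[^\s@]+@@"

# re.search() stops at the first placeholder...
print(re.search(pattern, word).group(0))   # @@param_3@@varid_def@@

# ...while the word actually contains two, so the second is silently dropped
print(len(re.findall(pattern, word)))      # 2
```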
This causes a count mismatch in generate_popular_names() (text_processor.py:L202-204):
```python
if len(all_holders) != len(names):
    return {}, ""  # all predictions discarded
```

Steps to Reproduce
```python
from varbert import VariableRenamingAPI
from libbs.artifacts import Function, FunctionArgument, FunctionHeader, StackVariable

# Minimal Ghidra function with adjacent variables (no spaces after commas)
code = """\
int FUN_00012345(long param_1,int param_2,undefined8 param_3)
{
  int iVar1;
  char *pcVar2;
  int local_18;
  int local_14;
  local_18 = 0;
  local_14 = *(int *)(param_1 + 4);
  FUN_0000abcd(local_18,param_3,pcVar2);
  if (param_2 == 0) {
    iVar1 = atoi((char *)pcVar2);
    local_14 = iVar1;
  }
  for (local_18 = 0; local_18 < local_14; local_18 = local_18 + 1) {
    *(int *)(param_1 + (long)local_18 * 4) = local_14;
  }
  return local_18;
}
"""

func = Function(0x12345, 0x100,
                header=FunctionHeader("FUN_00012345", 0x12345, args={}),
                stack_vars={})
for i, n in enumerate(["param_1", "param_2", "param_3"]):
    func.args[i] = FunctionArgument(i, n, None, 8)
for i, n in enumerate(["iVar1", "pcVar2", "local_18", "local_14"]):
    func.stack_vars[i] = StackVariable(i, n, None, 8, func.addr)

api = VariableRenamingAPI(use_decompiler=False, decompiler_name="ghidra")
names, _ = api.predict_variable_names(
    func, decompilation_text=code, use_decompiler=False, remove_bad_names=False)
print(f"Predictions: {len(names)}")
# Expected: 7 predictions
# Actual:   0 predictions
```

Diagnosis
The following script traces the preprocessing pipeline without loading the model to show exactly where the token is lost:
```python
import re
import random

from varbert.text_processor import DecompilationTextProcessor
from varbert.model import VarBERTInterface
from libbs.artifacts import Function, FunctionArgument, FunctionHeader, StackVariable

code = """\
int FUN_00012345(long param_1,int param_2,undefined8 param_3)
{
  int iVar1;
  char *pcVar2;
  int local_18;
  int local_14;
  local_18 = 0;
  local_14 = *(int *)(param_1 + 4);
  FUN_0000abcd(local_18,param_3,pcVar2);
  if (param_2 == 0) {
    iVar1 = atoi((char *)pcVar2);
    local_14 = iVar1;
  }
  for (local_18 = 0; local_18 < local_14; local_18 = local_18 + 1) {
    *(int *)(param_1 + (long)local_18 * 4) = local_14;
  }
  return local_18;
}
"""

func = Function(0x12345, 0x100,
                header=FunctionHeader("FUN_00012345", 0x12345, args={}),
                stack_vars={})
for i, n in enumerate(["param_1", "param_2", "param_3"]):
    func.args[i] = FunctionArgument(i, n, None, 8)
for i, n in enumerate(["iVar1", "pcVar2", "local_18", "local_14"]):
    func.stack_vars[i] = StackVariable(i, n, None, 8, func.addr)

random.seed(42)
preprocessor = DecompilationTextProcessor(code, func=func, decompiler=None)
processed_code = preprocessor.processed_code

# Count @@ holders in processed code (what generate_popular_names sees)
all_holders = re.findall(r"@@[^\s@]+@@[^\s@]+@@", processed_code)
print(f"@@ holders in processed code: {len(all_holders)}")

# Count @@ words after split_words (what the model tokenizer sees)
words = VarBERTInterface.split_words(processed_code)
at_words = [w for w in words if "@@" in w]
print(f"@@ words after split_words: {len(at_words)}")

# Show merged words where multiple @@ patterns are stuck together
for w in words:
    count = len(re.findall(r"@@[^\s@]+@@[^\s@]+@@", w))
    if count > 1:
        print(f"\nMerged word with {count} @@ patterns:")
        print(f"  {w}")
```

Output without fix (bug present):
```
@@ holders in processed code: 25
@@ words after split_words: 24

Merged word with 2 @@ patterns:
  ,@@param_3@@varid_9p452b@@,@@pcVar2@@varid_vhs1k3@@);
```
The processed code contains 25 @@ placeholder tokens, but split_words() only produces 24 placeholder words: the word `,@@param_3@@...@@,@@pcVar2@@...@@);` contains two @@ patterns merged together, and re.search() extracts only the first. The resulting 25 vs 24 count mismatch causes generate_popular_names() to discard all predictions.
Output with fix (bug resolved):

```
@@ holders in processed code: 25
@@ words after split_words: 25
```
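The discard path itself can be sketched without loading the model. The function below mimics the guard shown earlier in generate_popular_names(); it is an illustrative stand-in, not the upstream implementation:

```python
# Illustrative sketch of the count-mismatch guard: when the number of @@
# holders differs from the number of predicted names, everything is dropped.
def merge_predictions(all_holders, names):
    if len(all_holders) != len(names):
        return {}, ""  # all predictions discarded
    return dict(zip(all_holders, names)), "ok"

# 25 holders vs 24 model outputs -> everything is thrown away
holders = [f"@@v{i}@@id{i}@@" for i in range(25)]
preds = [f"name{i}" for i in range(24)]
print(merge_predictions(holders, preds))   # ({}, '')
```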
Expected vs Actual Behavior

| Input | Expected | Actual |
|---|---|---|
| `func(a, b, c)` (spaces) | 7 predictions | ✅ 7 predictions |
| `func(a,b,c)` (no spaces) | 7 predictions | ❌ 0 predictions |
Log warnings produced:

```
WARNING | varbert.text_processor | Unexpected number of variable name holders versus variable names.
WARNING | varbert.api | Unable to predict any names for function ...
```
Proposed Fix
Replace re.search() with re.finditer() in split_words() to extract all @@ patterns from each word, not just the first:
```diff
 @staticmethod
 def split_words(text: str):
     words = text.replace("\n", " ").split(" ")
     r = []
     for w in words:
-        m = re.search(r"@@[^\s@]+@@[^\s@]+@@", w)
-        if m is not None:
-            if m.start() > 0:
-                r.append(w[: m.start()])
-            r.append(w[m.start(): m.end()])
-            if m.end() < len(w):
-                r.append(w[m.end():])
+        matches = list(re.finditer(r"@@[^\s@]+@@[^\s@]+@@", w))
+        if matches:
+            pos = 0
+            for m in matches:
+                if m.start() > pos:
+                    r.append(w[pos: m.start()])
+                r.append(w[m.start(): m.end()])
+                pos = m.end()
+            if pos < len(w):
+                r.append(w[pos:])
         else:
             r.append(w)
     r = [w for w in r if len(w) > 0]
     return r
```

Impact
Any Ghidra-decompiled function containing adjacent variables without whitespace separators (very common in Ghidra output) silently returns zero variable name predictions: no exception is raised, only a warning is logged.
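For reference, the proposed split_words() logic can be exercised standalone, without importing VarBERT (the varid suffixes are illustrative):

```python
import re

def split_words(text: str):
    # Proposed implementation: re.finditer() extracts every @@ placeholder
    # from a word, keeping surrounding punctuation as separate words.
    words = text.replace("\n", " ").split(" ")
    r = []
    for w in words:
        matches = list(re.finditer(r"@@[^\s@]+@@[^\s@]+@@", w))
        if matches:
            pos = 0
            for m in matches:
                if m.start() > pos:
                    r.append(w[pos: m.start()])
                r.append(w[m.start(): m.end()])
                pos = m.end()
            if pos < len(w):
                r.append(w[pos:])
        else:
            r.append(w)
    return [w for w in r if len(w) > 0]

merged = "FUN_0000abcd(@@local_18@@varid_abc@@,@@param_3@@varid_def@@,@@pcVar2@@varid_ghi@@);"
at_words = [w for w in split_words(merged) if "@@" in w]
print(len(at_words))  # 3 -- all three placeholders survive
```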