This repository contains the evaluation framework for assessing Code Pre-trained Models (Code PTMs) on the vulnerability detection task, developed as part of master's thesis research.

The framework provides a comprehensive pipeline for evaluating various encoder-only and encoder-decoder Code PTMs on C/C++ vulnerability detection datasets (PrimeVul, DiverseVul, and Devign). The evaluation proceeds in four stages: data processing, feature extraction, classification, and evaluation.
The evaluation framework consists of the following components:
- Data Processing: Load and preprocess function-level C/C++ code snippets with vulnerability labels
- Feature Extraction: Extract fixed-size embeddings from Code PTMs using various pooling strategies
- Classification: Train lightweight neural classifiers on the extracted embeddings
- Evaluation: Assess performance using macro F1 score with cross-dataset evaluation support
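Macro F1 averages the per-class F1 scores, so the minority (vulnerable) class counts as much as the majority class. As a minimal sketch of the metric (equivalent in spirit to scikit-learn's `f1_score(..., average='macro')`, which the framework presumably uses):

```python
def macro_f1(y_true, y_pred):
    """Average of per-class F1 scores over the observed label set."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Perfect on class 0's predictions, one miss on class 1:
print(macro_f1([0, 0, 1, 1], [0, 0, 1, 0]))  # ≈ 0.733
```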
The following Code PTMs are supported:

- CodeBERT (base)
- GraphCodeBERT (base)
- UniXcoder variants (base, base-unimodal, base-nine)
- CodeSage variants (small-v2, base-v2, large-v2)
- ModernBERT (base, large)
- CodeT5 variants (small, base, large)
- CodeT5+ variants (220m, 220m-bimodal, 770m)
- CoditT5 (base)
- AST-T5 (base)
- DivoT5 variants (60m, 220m)
- Random Embedding Model for statistical significance testing
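The random-embedding baseline replaces the PTM with fixed, seeded random vectors: if a classifier trained on such embeddings approaches a PTM's score, the PTM's representations contribute little beyond chance. A minimal sketch of the idea (the embedding dimension and hashing scheme below are illustrative, not the repository's actual implementation):

```python
import hashlib
import random

def random_embedding(code: str, dim: int = 768, seed: int = 73) -> list:
    """Deterministic pseudo-random embedding: the same snippet always
    maps to the same vector, but the vector carries no code semantics."""
    digest = hashlib.sha256(f"{seed}:{code}".encode()).hexdigest()
    rng = random.Random(digest)  # seed the RNG from the snippet's hash
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Identical inputs yield identical vectors; different inputs do not.
a = random_embedding("int main() { return 0; }")
b = random_embedding("int main() { return 0; }")
assert a == b
```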
```bash
# Install required dependencies
pip install torch transformers scikit-learn pandas numpy tqdm

# Optional: install wandb for experiment tracking
pip install wandb
```

Extract embeddings from Code PTMs for all dataset splits:
```bash
python extractor.py \
    --model codebert-base \
    --task vul \
    --dataset primevul \
    --batch_size 32 \
    --seed 42
```

Train and evaluate classifiers on the extracted features:
```bash
python classifier.py \
    --task vul \
    --model codebert-base \
    --dataset primevul \
    --test_dataset diversevul \
    --batch_size 16 \
    --num_epochs 10 \
    --learning_rate 5e-5 \
    --method CLS \
    --wandb
```

Run a comprehensive random-baseline evaluation:
```bash
python experiments_random.py \
    --task vul \
    --dataset primevul \
    --test_dataset diversevul \
    --n_embed_seeds 100 \
    --embed_seed_start 73 \
    --extract \
    --batch_size 16 \
    --num_epochs 10 \
    --method CLS
```

Four pooling strategies are available for producing fixed-size embeddings:

- `CLS`: Extract the embedding from the special `[CLS]` token position
- `EOS`: Extract the embedding from the special (end-of-sequence) token position
- `AVG`: Compute the average of all token embeddings, excluding padding tokens
- `MAX`: Take the element-wise maximum across all token embeddings, excluding padding tokens
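These pooling strategies can be sketched in plain Python over one sequence's token embeddings and its attention mask (a stand-in for the actual tensor operations; the shapes, `cls_index`, and the convention that `EOS` is the last non-padding token are assumptions):

```python
def pool(token_embeds, mask, method="AVG", cls_index=0):
    """token_embeds: list of per-token vectors for one sequence;
    mask: 1 for real tokens, 0 for padding (parallel to token_embeds)."""
    real = [v for v, m in zip(token_embeds, mask) if m]  # drop padding
    if method == "CLS":
        return token_embeds[cls_index]  # embedding at the [CLS] position
    if method == "EOS":
        return real[-1]                 # last non-padding token's embedding
    if method == "AVG":
        n = len(real)
        return [sum(col) / n for col in zip(*real)]   # per-dimension mean
    if method == "MAX":
        return [max(col) for col in zip(*real)]       # per-dimension max
    raise ValueError(f"unknown pooling method: {method}")

embeds = [[1.0, 4.0], [3.0, 2.0], [0.0, 0.0]]  # third "token" is padding
mask = [1, 1, 0]
print(pool(embeds, mask, "AVG"))  # [2.0, 3.0]
print(pool(embeds, mask, "MAX"))  # [3.0, 4.0]
```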
- `--model`: Specify which Code PTM to use
- `--task`: Task type (currently supports `vul` for vulnerability detection)
- `--dataset`: Training/validation dataset
- `--test_dataset`: Test dataset (optional; defaults to the training dataset)
- `--batch_size`: Batch size for training
- `--num_epochs`: Number of training epochs
- `--learning_rate`: Learning rate for the AdamW optimizer
- `--seed`: Random seed for reproducibility
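The classifiers are optimized with AdamW, which decouples weight decay from the adaptive gradient step. One update for a single scalar parameter can be sketched as follows (the weight-decay value is illustrative; in practice `torch.optim.AdamW` does this per tensor):

```python
import math

def adamw_step(theta, grad, state, lr=5e-5, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter theta.
    state carries the running moments and step count across calls."""
    state["t"] += 1
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * grad          # first moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - b1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    # Decoupled weight decay: applied to the parameter, not folded into grad.
    return theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)

state = {"m": 0.0, "v": 0.0, "t": 0}
theta = adamw_step(1.0, 0.5, state)  # parameter moves against the gradient
```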
The framework supports cross-dataset evaluation to assess model generalization:
```bash
# Train on PrimeVul, test on DiverseVul
python classifier.py \
    --dataset primevul \
    --test_dataset diversevul \
    --model codebert-base \
    --method CLS
```

All model and dataset configurations are centralized in `configs.py`:
- `MODEL_CONFIGS`: Model-specific parameters (tokenizer, max length, pooling options)
- `TASK_CONFIGS`: Dataset-specific file paths and configurations
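A hypothetical excerpt showing the shape such dictionaries might take; every key and value below is an assumption for illustration, not the repository's actual contents:

```python
# Illustrative structure only; field names and values are assumptions.
MODEL_CONFIGS = {
    "codebert-base": {
        "checkpoint": "microsoft/codebert-base",  # Hugging Face model id
        "max_length": 512,                        # tokenizer truncation limit
        "pooling": ["CLS", "AVG", "MAX"],         # supported strategies
    },
}

TASK_CONFIGS = {
    "vul": {
        "num_labels": 2,                  # vulnerable / not vulnerable
        "datasets": {
            "primevul": "data/primevul",  # placeholder path
        },
    },
}
```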