A repository for finetuning a ModernBERT-based model to detect vulnerabilities in code. This project adapts answerdotai/ModernBERT-base using LoRA techniques to classify code segments into vulnerability categories.
ThreatDetect-code-vulnerability-detection is designed to automatically analyze code and detect potential vulnerabilities. By finetuning ModernBERT with a dedicated dataset of code samples, the model can classify code into multiple vulnerability categories (e.g., various CWE weaknesses) as well as mark code as safe.
Key features:
- Finetuning with LoRA applied to the query (Q) and value (V) projection matrices for efficient training.
- Classification into 7 labels: six CWE-based vulnerability classes and one safe label.
- Training scripts designed for high-performance computing environments using SLURM.
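As a rough illustration of the LoRA setup described above, here is how such an adapter might be configured with the `peft` library. The hyperparameters and the `target_modules` names are placeholders, not taken from this repository; the real module names depend on how the loaded ModernBERT checkpoint organizes its attention projections, so inspect the model before reusing them.

```python
# Hedged sketch: a LoRA adapter configuration with the peft library.
# r, lora_alpha, lora_dropout, and target_modules are illustrative
# assumptions -- check the repository's training script for the real values.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update (assumed)
    lora_alpha=16,                       # LoRA scaling factor (assumed)
    lora_dropout=0.1,
    target_modules=["query", "value"],   # placeholder names for the Q and V projections
    task_type="SEQ_CLS",                 # sequence classification (7 labels)
)
```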
| Label | Description |
|---|---|
| CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer |
| CWE-125 | Out-of-bounds Read |
| CWE-20 | Improper Input Validation |
| CWE-416 | Use After Free |
| CWE-703 | Improper Check or Handling of Exceptional Conditions |
| CWE-787 | Out-of-bounds Write |
| safe | Safe code |
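For illustration, the seven labels above can be wired into the `id2label`/`label2id` mappings that a Hugging Face classification head expects. The index order used here is an assumption for the sake of the example; the repository and model card define the actual mapping.

```python
# Illustrative label mappings for the 7-way classifier.
# The index order is an assumption -- the trained model's config defines the real one.
LABELS = ["CWE-119", "CWE-125", "CWE-20", "CWE-416", "CWE-703", "CWE-787", "safe"]
id2label = {i: name for i, name in enumerate(LABELS)}
label2id = {name: i for i, name in enumerate(LABELS)}

def predict_label(logits):
    """Map a list of 7 raw logits to the most likely label name."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]

print(predict_label([0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 2.5]))  # highest logit -> "safe"
```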
```
.
├── data
│   ├── data_cleaning.ipynb                      # Notebook for cleaning and preparing the dataset
│   └── minified-diverseful-multilabels.parquet  # Processed dataset for training
├── scripts
│   ├── torch_accelerate_lora.py                 # Finetuning script using torch & accelerate
│   └── run_finetuning.sh                        # SLURM batch script to run training via sbatch
├── environment.yml                              # Environment configuration for micromamba/conda users
├── requirements.txt                             # Python dependencies for venv users
└── LICENSE                                      # MIT License
```
You can set up your environment using one of the following methods:
- Ensure you have micromamba or conda installed.
- Create and activate the environment:
```bash
micromamba env create -f environment.yml
micromamba activate ThreatDetect-env
```

(Alternatively, use `conda env create -f environment.yml` and `conda activate ThreatDetect-env`.)
- Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
The main finetuning script lives in the `scripts` folder. It uses the torch and accelerate frameworks with LoRA modifications.
A SLURM batch file (`run_finetuning.sh`) is provided to run training on a cluster:
- Submit the job with:
```bash
sbatch scripts/run_finetuning.sh
```
- Monitor the job logs for progress and accuracy metrics.
If you wish to run training locally (without SLURM), execute:
```bash
python scripts/torch_accelerate_lora.py
```
Ensure your environment is properly configured to use the appropriate GPU/CPU settings.
- Base Model: Finetuned from answerdotai/ModernBERT-base
- Training Method: LoRA applied to Q and V matrices
- Classification: Detects code vulnerabilities across 7 labels (six CWE-based classes and 'safe')
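To make the training method concrete: LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, so the effective projection becomes y = Wx + (alpha/r)·B(Ax), where A and B are small trainable matrices of rank r. A minimal pure-Python sketch with toy dimensions (not the actual training code):

```python
# Toy illustration of the LoRA update: y = W x + (alpha/r) * B (A x).
# W is frozen; only the small matrices A (r x d) and B (d x r) are trained.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

d, r, alpha = 4, 2, 8          # toy sizes: hidden dim 4, LoRA rank 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weight (identity here)
A = [[0.5] * d for _ in range(r)]   # trainable down-projection, r x d
B = [[0.1] * r for _ in range(d)]   # trainable up-projection, d x r

def lora_forward(x):
    base = matvec(W, x)              # frozen path: W x
    delta = matvec(B, matvec(A, x))  # low-rank path: B (A x)
    scale = alpha / r
    return [b + scale * dl for b, dl in zip(base, delta)]

print(lora_forward([1.0, 0.0, 0.0, 0.0]))  # -> [1.4, 0.4, 0.4, 0.4]
```

Because A and B together have only 2·r·d parameters instead of d², only a small fraction of the weights needs gradients, which is what makes this finetuning approach efficient.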
For further details and to explore the model, check out its Hugging Face Model Card.
- Source: A minified, clean, and deduplicated version of the DiverseVul dataset.
- HF Dataset: `lemon42-ai/minified-diverseful-multilabels`
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Feel free to open issues or submit pull requests for improvements or bug fixes.
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and test them.
- Submit a pull request with a detailed description of your changes.
- Thanks to the developers of ModernBERT and the maintainers of the DiverseVul dataset.
- Acknowledgement to the paper: DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection by Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David A. Wagner.
Developed by Abdellah Oumida and Mohammed Sbaihi.
Happy coding and safe programming!