A repository for finetuning a ModernBERT-based model to detect vulnerabilities in code. This project adapts answerdotai/ModernBERT-base using LoRA techniques to classify code segments into vulnerability categories.
ThreatDetect-code-vulnerability-detection is designed to automatically analyze code and detect potential vulnerabilities. By finetuning ModernBERT with a dedicated dataset of code samples, the model can classify code into multiple vulnerability categories (e.g., various CWE weaknesses) as well as mark code as safe.
Key features:
- Finetuning with LoRA applied to the query (Q) and value (V) projection matrices for efficient training.
- Classification into 7 labels: six CWE-based vulnerability classes and one safe label.
- Training scripts designed for high-performance computing environments using SLURM.
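As a rough illustration of the LoRA setup described above, here is how such an adapter might be configured with the `peft` library. The hyperparameters and the `target_modules` names are placeholders, not taken from this repository; the real module names depend on how the loaded ModernBERT checkpoint organizes its attention projections, so inspect the model before reusing them.

```python
# Hedged sketch: a LoRA adapter configuration with the peft library.
# r, lora_alpha, lora_dropout, and target_modules are illustrative
# assumptions -- check the repository's training script for the real values.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update (assumed)
    lora_alpha=16,                       # LoRA scaling factor (assumed)
    lora_dropout=0.1,
    target_modules=["query", "value"],   # placeholder names for the Q and V projections
    task_type="SEQ_CLS",                 # sequence classification (7 labels)
)
```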
| Label | Description |
|---|---|
| CWE-119 | Improper Restriction of Operations within the Bounds of a Memory Buffer |
| CWE-125 | Out-of-bounds Read |
| CWE-20 | Improper Input Validation |
| CWE-416 | Use After Free |
| CWE-703 | Improper Check or Handling of Exceptional Conditions |
| CWE-787 | Out-of-bounds Write |
| safe | Safe code |
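For illustration, the seven labels above can be wired into the `id2label`/`label2id` mappings that a Hugging Face classification head expects. The index order used here is an assumption for the sake of the example; the repository and model card define the actual mapping.

```python
# Illustrative label mappings for the 7-way classifier.
# The index order is an assumption -- the trained model's config defines the real one.
LABELS = ["CWE-119", "CWE-125", "CWE-20", "CWE-416", "CWE-703", "CWE-787", "safe"]
id2label = {i: name for i, name in enumerate(LABELS)}
label2id = {name: i for i, name in enumerate(LABELS)}

def predict_label(logits):
    """Map a list of 7 raw logits to the most likely label name."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]

print(predict_label([0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 2.5]))  # highest logit -> "safe"
```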
```
.
├── data
│   ├── data_cleaning.ipynb                      # Notebook for cleaning and preparing the dataset
│   └── minified-diverseful-multilabels.parquet  # Processed dataset for training
├── scripts
│   ├── torch_accelerate_lora.py                 # Finetuning script using torch & accelerate
│   └── run_finetuning.sh                        # SLURM batch script to run training via sbatch
├── environment.yml                              # Environment configuration for micromamba/conda users
├── requirements.txt                             # Python dependencies for venv users
└── LICENSE                                      # MIT License
```
You can set up your environment using one of the following methods:
- Ensure you have micromamba or conda installed.
- Create and activate the environment:
```bash
micromamba env create -f environment.yml
micromamba activate ThreatDetect-env
```

(Alternatively, use `conda env create -f environment.yml` and `conda activate ThreatDetect-env`.)
- Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
The main finetuning script lives in the `scripts` folder. It uses the torch and accelerate frameworks with LoRA modifications.
A SLURM batch file (`run_finetuning.sh`) is provided to run training on a cluster:
- Submit the job with:
```bash
sbatch scripts/run_finetuning.sh
```
- Monitor the job logs for progress and accuracy metrics.
If you wish to run training locally (without SLURM), execute:
```bash
python scripts/torch_accelerate_lora.py
```
Ensure your environment is properly configured to use the appropriate GPU/CPU settings.
- Base Model: Finetuned from answerdotai/ModernBERT-base
- Training Method: LoRA applied to Q and V matrices
- Classification: Detects code vulnerabilities across 7 labels (six CWE-based classes and 'safe')
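To make the training method concrete: LoRA keeps a pretrained weight matrix W frozen and learns a low-rank update, so the effective projection becomes y = Wx + (alpha/r)·B(Ax), where A and B are small trainable matrices of rank r. A minimal pure-Python sketch with toy dimensions (not the actual training code):

```python
# Toy illustration of the LoRA update: y = W x + (alpha/r) * B (A x).
# W is frozen; only the small matrices A (r x d) and B (d x r) are trained.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

d, r, alpha = 4, 2, 8          # toy sizes: hidden dim 4, LoRA rank 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weight (identity here)
A = [[0.5] * d for _ in range(r)]   # trainable down-projection, r x d
B = [[0.1] * r for _ in range(d)]   # trainable up-projection, d x r

def lora_forward(x):
    base = matvec(W, x)              # frozen path: W x
    delta = matvec(B, matvec(A, x))  # low-rank path: B (A x)
    scale = alpha / r
    return [b + scale * dl for b, dl in zip(base, delta)]

print(lora_forward([1.0, 0.0, 0.0, 0.0]))  # -> [1.4, 0.4, 0.4, 0.4]
```

Because A and B together have only 2·r·d parameters instead of d², only a small fraction of the weights needs gradients, which is what makes this finetuning approach efficient.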
For further details and to explore the model, check out its Hugging Face Model Card.
- Source: A minified, clean, and deduplicated version of the DiverseVul dataset.
- HF Dataset: `lemon42-ai/minified-diverseful-multilabels`
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Feel free to open issues or submit pull requests for improvements or bug fixes.
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and test them.
- Submit a pull request with a detailed description of your changes.
- Thanks to the developers of ModernBERT and the maintainers of the DiverseVul dataset.
- Acknowledgement to the paper: DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection by Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David A. Wagner.
Developed by Abdellah Oumida and Mohammed Sbaihi.
Happy coding and safe programming!