NVIDIA-Digital-Bio/megalodon

Megalodon: Applications of Modular Co-Design for De Novo 3D Molecule Generation

Danny Reidenbach*·Filipp Nikitin*†·Olexandr Isayev·Saee Paliwal
*NVIDIA   ·   Department of Computational Biology, Carnegie Mellon University
Department of Chemistry, Carnegie Mellon University

📄 Paper·📖 Citation·⚙️ Setup

*Equal contributions · †Work performed during internship at NVIDIA

Overview

Figure: Megalodon architecture

Abstract

De novo 3D molecule generation is a pivotal task in drug discovery. However, many recent geometric generative models struggle to produce high-quality 3D structures, even if they maintain 2D validity and topological stability. To tackle this issue and enhance the learning of effective molecular generation dynamics, we present Megalodon—a family of scalable transformer models. These models are enhanced with basic equivariant layers and trained using a joint continuous and discrete denoising co-design objective.

We assess Megalodon's performance on established molecule generation benchmarks and introduce new 3D structure benchmarks that evaluate a model's capability to generate realistic molecular structures, particularly focusing on energetics. We show that Megalodon achieves state-of-the-art results in 3D molecule generation, conditional structure generation, and structure energy benchmarks using diffusion and flow matching. Furthermore, doubling the number of parameters in Megalodon to 40M significantly enhances its performance, generating up to 49x more valid large molecules and achieving energy levels that are 2-10x lower than those of the best prior generative models.


Setup

Prerequisites

  • Python 3.10+
  • CUDA-compatible GPU (recommended for training)
  • Conda or Mamba (recommended)
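The prerequisites above can be checked before creating the environment. The following is a minimal sketch using only the standard library; the helper name check_prerequisites is hypothetical and not part of the repository, and GPU detection is left out because it depends on the installed deep learning stack.

```python
import shutil
import sys


def check_prerequisites(min_version=(3, 10)):
    """Return a dict describing which prerequisites are met.

    Hypothetical helper, not part of the Megalodon repository.
    """
    return {
        # True when the running interpreter is at least `min_version`.
        "python_ok": sys.version_info[:2] >= min_version,
        # Path to conda (or mamba if conda is absent) on PATH, else None.
        "env_manager": shutil.which("conda") or shutil.which("mamba"),
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

Conda and Mamba are only recommended, so a missing env_manager is a hint rather than a hard failure.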

Environment Setup

# Clone the repository
git clone [repository-url]
cd megalodon

# Create and activate conda environment
conda create -n megalodon python=3.10 -y
conda activate megalodon

# Install dependencies
pip install -e .
pip install -r requirements.txt

Data Setup

Training and evaluation require the GEOM-Drugs and QM9 datasets.

For data downloading and preprocessing instructions, please refer to the data_processing directory.


Usage

If megalodon is not installed locally (pip install -e .), make sure the contents of src are on your PYTHONPATH (e.g., export PYTHONPATH="./src:$PYTHONPATH").
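For scripts or notebooks where setting the environment variable is inconvenient, the same fallback can be sketched in Python. This assumes the package is importable as megalodon and lives directly under ./src; the helper name ensure_importable is hypothetical and not part of the repository.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def ensure_importable(package="megalodon", src_dir="./src"):
    """Make `package` importable, falling back to adding `src_dir` to sys.path.

    Returns True if the package can be located afterwards.
    Assumption: the package sits directly under `src_dir`.
    """
    if importlib.util.find_spec(package) is not None:
        return True  # already installed or already on the path
    src = str(Path(src_dir).resolve())
    if src not in sys.path:
        sys.path.insert(0, src)
    # Clear cached finder state so the new path entry is searched.
    importlib.invalidate_caches()
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("megalodon importable:", ensure_importable())
```

Unlike the export command, this only affects the current Python process.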

Model Training

QM9 Dataset:

# Megalodon diffusion model
python scripts/train.py --config-name=megalodon_diffusion train.gpus=2 data.dataset_root="./qm9_data"

# Megalodon flow matching model  
python scripts/train.py --config-name=megalodon_fm train.gpus=2 data.dataset_root="./qm9_data"

# Quick diffusion model (reduced timesteps)
python scripts/train.py --config-name=megalodon_quick_diffusion train.gpus=2 data.dataset_root="./qm9_data"

GEOM-Drugs Dataset:

# Megalodon diffusion model
python scripts/train.py --config-path=scripts/conf/drugs --config-name=megalodon_diffusion train.gpus=2 data.dataset_root="./drugs_data"

# Megalodon flow matching model
python scripts/train.py --config-path=scripts/conf/drugs --config-name=megalodon_fm train.gpus=2 data.dataset_root="./drugs_data"

# Quick diffusion model
python scripts/train.py --config-path=scripts/conf/drugs --config-name=megalodon_quick_diffusion train.gpus=2 data.dataset_root="./drugs_data"
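The six commands above differ only in the dataset, the config name, and the dataset root. The sketch below assembles them programmatically, which may help when launching several runs from one script. The helper build_train_command is hypothetical; the flags and config names are copied from the commands above.

```python
import shlex

# Config names taken from the commands above.
CONFIG_NAMES = ("megalodon_diffusion", "megalodon_fm", "megalodon_quick_diffusion")


def build_train_command(dataset, config_name, dataset_root, gpus=2):
    """Assemble the training command for `dataset` ("qm9" or "drugs") as a list."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name}")
    cmd = ["python", "scripts/train.py"]
    if dataset == "drugs":
        # The GEOM-Drugs commands above pass an explicit config directory.
        cmd.append("--config-path=scripts/conf/drugs")
    elif dataset != "qm9":
        raise ValueError(f"unknown dataset: {dataset}")
    cmd += [
        f"--config-name={config_name}",
        f"train.gpus={gpus}",
        f"data.dataset_root={dataset_root}",
    ]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to actually launch training.
    print(shlex.join(build_train_command("drugs", "megalodon_fm", "./drugs_data")))
```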

Training Configuration

Any configuration parameter can be overridden on the command line:

# Customize training parameters
python scripts/train.py --config-name=megalodon_diffusion \
    train.gpus=4 \
    train.n_epochs=300 \
    train.seed=42 \
    data.batch_size=64 \
    optimizer.lr=0.0005
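The dotted key=value arguments above resemble Hydra-style overrides, although this README does not name the configuration library. As an illustration only, and not the library's implementation, the sketch below shows how such overrides map onto a nested configuration. The function apply_overrides and the base values are hypothetical placeholders.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            # Interpret numbers, booleans, and quoted strings as Python literals.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value  # keep anything else as a plain string
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Placeholder defaults for illustration; not the repository's actual values.
base = {"train": {"gpus": 2, "n_epochs": 100}, "optimizer": {"lr": 0.001}}
updated = apply_overrides(
    base, ["train.gpus=4", "optimizer.lr=0.0005", "data.batch_size=64"]
)
print(updated)
```

Each dotted key walks one level deeper into the configuration, so train.gpus=4 replaces only the gpus entry of the train section and leaves its siblings untouched.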

Model Inference and Sampling

Available Model Configurations

QM9 Models:

  • megalodon_diffusion.yaml - Megalodon with diffusion objective - download weights
  • megalodon_fm.yaml - Megalodon with flow matching objective - download weights
  • megalodon_quick_diffusion.yaml - Megalodon with lighter architecture - download weights

GEOM-Drugs Models:

  • megalodon_diffusion.yaml - Megalodon diffusion for drug-like molecules - download weights
  • megalodon_quick_diffusion.yaml - Megalodon with lighter architecture - download weights
  • megalodon_fm.yaml - Megalodon flow matching for drug-like molecules - download weights

Sampling and Evaluation Commands

Make sure that data.dataset_root points to a processed dataset directory, because sampling and evaluation rely on statistics computed from the data.

# Generate molecules using trained model
python scripts/sample.py --config_path scripts/conf/drugs/megalodon_diffusion.yaml --ckpt_path ckpts/drugs/megalodon_large_diffusion.ckpt --timesteps 500 --n_graphs 10

Note: The MegalodonFlow model for GEOM-Drugs was originally trained in the Semla codebase and later transferred to this framework. A special configuration megalodon_fm_inference is provided specifically for the drugs/megalodon_fm.ckpt checkpoint to ensure compatibility.

# Example: Using the special inference config for transferred MegalodonFlow model
python scripts/sample.py --config_path scripts/conf/drugs/megalodon_fm_inference.yaml --ckpt_path ckpts/drugs/megalodon_fm.ckpt --timesteps 100 --n_graphs 10
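To compare several sampling budgets, the command above can be generated for a list of timestep values. This is a minimal sketch: the helpers build_sample_command and timestep_sweep are hypothetical, while the flags, config path, and checkpoint path are copied from the commands above.

```python
import shlex


def build_sample_command(config_path, ckpt_path, timesteps, n_graphs):
    """Assemble a scripts/sample.py invocation using the flags shown above."""
    return [
        "python", "scripts/sample.py",
        "--config_path", config_path,
        "--ckpt_path", ckpt_path,
        "--timesteps", str(timesteps),
        "--n_graphs", str(n_graphs),
    ]


def timestep_sweep(config_path, ckpt_path, timesteps_list, n_graphs=10):
    """Return one sampling command per entry in `timesteps_list`."""
    return [
        build_sample_command(config_path, ckpt_path, t, n_graphs)
        for t in timesteps_list
    ]


if __name__ == "__main__":
    commands = timestep_sweep(
        "scripts/conf/drugs/megalodon_fm_inference.yaml",
        "ckpts/drugs/megalodon_fm.ckpt",
        timesteps_list=[100, 500],
    )
    for cmd in commands:
        # Printed only; use subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```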

When sampling completes, evaluation metrics are computed automatically and printed to the terminal.

For advanced energy evaluation using the GFN2-xTB energy benchmark, please refer to the dedicated evaluation repository geom-drugs-3dgen-evaluation.


Citation

If you use Megalodon in your research, please cite our paper:

@article{reidenbach2025applications,
  title={Applications of Modular Co-Design for De Novo 3D Molecule Generation},
  author={Reidenbach, Danny and Nikitin, Filipp and Isayev, Olexandr and Paliwal, Saee},
  journal={arXiv preprint arXiv:2505.18392},
  year={2025}
}
