MolLangBench is a comprehensive benchmark designed to evaluate the fundamental capabilities of AI models in language-prompted molecular structure recognition, editing, and generation.
This repository provides:
- Code and examples to load and use the dataset directly from the Hugging Face Hub
- Evaluation scripts and prompt templates to test OpenAI models (e.g., o1, o3, o4-mini) using either molecular images or SMILES strings as inputs
It is straightforward to extend this repository to evaluate other language or multimodal models by adapting the provided input formatting and evaluation templates.
You can easily set up the required environment by following these steps:

1. Clone the repository:

```bash
git clone https://github.com/TheLuoFengLab/MolLangBench.git
cd MolLangBench
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```
Load the dataset in Python:

```python
from datasets import load_dataset

# Recognition (train + test)
rec_train = load_dataset("ChemFM/MolLangBench", name="recognition", split="train")
rec_test = load_dataset("ChemFM/MolLangBench", name="recognition", split="test")

# Filter one specific subtask
subtask = "one_hop_neighbors"
subset = rec_test.filter(lambda x: x["task"] == subtask)

# Editing (test only)
edit = load_dataset("ChemFM/MolLangBench", name="edit", split="test")

# Generation (test only)
gen = load_dataset("ChemFM/MolLangBench", name="generation", split="test")
```
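After loading, it helps to sanity-check a split before generating prompts. A minimal sketch (apart from the `task` field used above, the exact record fields depend on the dataset schema):

```python
# Quick sanity check of the filtered subtask split.
print(f"{len(subset)} examples in subtask '{subtask}'")

# Inspect one record; apart from "task", the available fields
# depend on the dataset schema.
print(subset[0])
```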
We provide end-to-end scripts for:

- Generating prompt files (`.jsonl`)
- Creating batch job inputs for the OpenAI API
- Submitting jobs and retrieving outputs
- Computing evaluation metrics
Below is a step-by-step example workflow for the “one hop neighbors” recognition subtask.
Prompt templates for all tasks and modalities (SMILES and image) are located in the `prompts` folder. To generate a `.jsonl` prompt file, run:
```bash
python scripts/create_prompts.py \
    --task_type <recognition|editing|generation> \
    --recognition_subtask <recognition_subtask_name_if_applicable> \
    --modality <smiles|image> \
    --split <train|test> \
    --output_file <output_jsonl_path>
```

Example: for the "one hop neighbors" recognition subtask (SMILES modality, test split):
```bash
python scripts/create_prompts.py \
    --task_type recognition \
    --recognition_subtask one_hop_neighbors \
    --modality smiles \
    --split test \
    --output_file exps/one_hop_neighbors/prompts.jsonl
```

For the image modality, the image is included as a Base64-encoded string in the `.jsonl` file.
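For reference, here is a minimal sketch of how an image can be embedded as a Base64 string in a `.jsonl` line; the file name and field names are illustrative assumptions, and `scripts/create_prompts.py` defines the actual format:

```python
import base64
import json

# Encode a molecule image as Base64 (hypothetical file name).
with open("molecule.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

# One prompt line; "custom_id" and "image" are illustrative field names.
line = {"custom_id": "example_0", "image": f"data:image/png;base64,{b64_image}"}
print(json.dumps(line))
```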
Generate a batch job file for your prompts and desired model:

```bash
python scripts/create_openai_jobs.py \
    --prompt_file <prompts_jsonl_path> \
    --output_file <batch_job_jsonl_path> \
    --model <model_id> \
    --custom_id_prefix <optional_prefix>
```

Example: using the o4-mini model for "one hop neighbors":
```bash
python scripts/create_openai_jobs.py \
    --prompt_file exps/one_hop_neighbors/prompts.jsonl \
    --output_file exps/one_hop_neighbors/o4-mini/batch_input.jsonl \
    --model o4-mini \
    --custom_id_prefix o4_mini
```

Submit the batch job to the OpenAI API:
```bash
python scripts/submit_openai_jobs.py submit \
    --jobs_file <batch_job_jsonl_path> \
    [--api_key YOUR_API_KEY] \
    [--organization YOUR_ORG_ID]
```

You can also set your API key as an environment variable. The organization ID is optional.
Example:

```bash
python scripts/submit_openai_jobs.py submit \
    --jobs_file exps/one_hop_neighbors/o4-mini/batch_input.jsonl
```

This command will print a `batch_id` for your job.
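Under the hood, submitting a batch corresponds roughly to the following OpenAI Python SDK calls (a sketch only; see `scripts/submit_openai_jobs.py` for the script's actual logic):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the batch input file, then create the batch job.
batch_file = client.files.create(
    file=open("exps/one_hop_neighbors/o4-mini/batch_input.jsonl", "rb"),
    purpose="batch",
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # the batch_id used for retrieval
```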
To retrieve the results (the script periodically checks until the job is complete):

```bash
python scripts/submit_openai_jobs.py retrieve \
    --batch_id <BATCH_ID> \
    --output_file <results_jsonl_path> \
    [--api_key YOUR_API_KEY] \
    [--organization YOUR_ORG_ID] \
    [--check_interval 60]
```

Example:
```bash
python scripts/submit_openai_jobs.py retrieve \
    --batch_id BATCH_ID \
    --output_file exps/one_hop_neighbors/o4-mini/results.jsonl
```
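Retrieval amounts to polling the batch status and downloading the output file once the job finishes. Roughly, with the OpenAI SDK (a sketch; the script's `--check_interval` flag controls the polling period):

```python
import time

from openai import OpenAI

client = OpenAI()
batch_id = "BATCH_ID"  # printed by the submit step

# Poll until the batch reaches a terminal status.
while True:
    batch = client.batches.retrieve(batch_id)
    if batch.status in ("completed", "failed", "cancelled", "expired"):
        break
    time.sleep(60)  # corresponds to --check_interval

# Download the results file if the job completed successfully.
if batch.status == "completed":
    content = client.files.content(batch.output_file_id)
    with open("results.jsonl", "wb") as f:
        f.write(content.read())
```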
Evaluate the model outputs with:

```bash
python scripts/evaluate_results.py \
    --results_file <results_jsonl_path> \
    --task_type <recognition|editing|generation> \
    --subtask <subtask_name> \
    --modality <smiles|image>
```

Example:
```bash
python scripts/evaluate_results.py \
    --results_file exps/one_hop_neighbors/o4-mini/results.jsonl \
    --task_type recognition \
    --subtask one_hop_neighbors \
    --modality smiles
```

This will print the evaluation metrics for your selected task and model.
The default result tags are `<count>` and `<atom_indices>`. For certain tasks, you may need to specify custom result tags using the `--result_1_tag <result_1_tag>` argument.
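For illustration, tagged results can be pulled out of a model response with a simple regular expression (a sketch; `scripts/evaluate_results.py` implements the actual parsing):

```python
import re

# A hypothetical model response containing the default result tags.
response = "There are <count>3</count> matches at <atom_indices>[1, 4, 7]</atom_indices>."

count = re.search(r"<count>(.*?)</count>", response, re.DOTALL)
indices = re.search(r"<atom_indices>(.*?)</atom_indices>", response, re.DOTALL)
print(count.group(1))    # "3"
print(indices.group(1))  # "[1, 4, 7]"
```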
⚠️ Warning:
For molecule editing and generation tasks using the `gpt-image-1` model, the OpenAI API does not support batch job submissions.

- You must submit each image generation or editing request individually using the OpenAI Images API.
- For automatic evaluation, you will also need a Mathpix account and API key to convert the generated molecular images back to SMILES strings.

Example scripts for these tasks are provided in the `Miscellaneous` folder.
We plan to provide a more streamlined workflow in the future.
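For reference, an individual `gpt-image-1` request with the OpenAI Python SDK looks roughly like this (a minimal sketch with an illustrative prompt; the actual scripts live in the Miscellaneous folder):

```python
import base64

from openai import OpenAI

client = OpenAI()

# One image-generation request; batch submission is not supported for gpt-image-1.
resp = client.images.generate(
    model="gpt-image-1",
    prompt="A skeletal structure drawing of aspirin",  # illustrative prompt
)

# gpt-image-1 returns the image as Base64; decode and save it.
with open("generated_molecule.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))
```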
The Miscellaneous folder contains helpful scripts and utilities, including:

- **Ground Truth Collection**: Scripts for collecting ground truth information for each recognition task using RDKit (see the sketch after this list).
- **Image-to-SMILES Conversion**: Scripts to call the Mathpix API for converting molecule images to SMILES strings for automated evaluation.
- **Per-Image OpenAI API Submission**: Scripts to submit image generation and editing requests (for the image modality) to the OpenAI `gpt-image-1` API one by one, as batch jobs are not currently supported.
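As an example of the ground-truth collection approach, one-hop neighbors can be read directly off an RDKit molecule (a minimal sketch; the actual scripts cover all recognition subtasks):

```python
from rdkit import Chem

# Ground truth for the one-hop-neighbors task on a small example molecule.
mol = Chem.MolFromSmiles("CC(=O)O")  # acetic acid
atom_idx = 1  # the carbonyl carbon

# Atom indices directly bonded to the query atom.
neighbors = sorted(a.GetIdx() for a in mol.GetAtomWithIdx(atom_idx).GetNeighbors())
print(neighbors)  # [0, 2, 3]
```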
More utilities and improvements will be added in the future.
Below are the complete evaluation results for all molecular structure recognition subtasks across a wide range of large language models and vision-language models.
Complete Results for Molecular Structure Recognition Tasks (click to expand)
| Task | GPT‑4o | GPT‑4.5-preview | GPT‑4.1 | o1‑mini | o1 | o3-mini | DeepSeek‑R1 | R1‑70B | o3 | o4‑mini | o3 (image) | o4-mini (image) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| One‑hop neighbors | 0.355/0.140 | 0.600/0.425 | 0.570/0.330 | 0.735/0.640 | 0.825/0.720 | 0.870/0.820 | 0.825/0.710 | 0.585/0.430 | 0.935/0.895 | 0.880/0.845 | 0.890/0.855 | 0.840/0.780 |
| Two‑hop neighbors | 0.215/0.055 | 0.280/0.100 | 0.400/0.210 | 0.465/0.350 | 0.745/0.560 | 0.820/0.740 | 0.610/0.475 | 0.305/0.135 | 0.935/0.825 | 0.870/0.790 | 0.770/0.705 | 0.775/0.690 |
| Three‑hop neighbors | 0.165/0.015 | 0.355/0.165 | 0.265/0.140 | 0.400/0.265 | 0.560/0.400 | 0.825/0.705 | 0.550/0.385 | 0.300/0.130 | 0.925/0.830 | 0.775/0.710 | 0.695/0.600 | 0.660/0.575 |
| Quaternary carbons | 0.530/0.290 | 0.690/0.435 | 0.740/0.440 | 0.615/0.470 | 0.865/0.665 | 0.835/0.740 | 0.780/0.680 | 0.440/0.330 | 0.935/0.865 | 0.845/0.750 | 0.670/0.600 | 0.720/0.665 |
| Ring junctions | 0.285/0.080 | 0.495/0.185 | 0.485/0.210 | 0.325/0.175 | 0.575/0.470 | 0.580/0.520 | 0.535/0.420 | 0.255/0.160 | 0.685/0.650 | 0.590/0.570 | 0.660/0.595 | 0.615/0.555 |
| Bond connection | 0.448 | 0.472 | 0.336 | 0.698 | 0.758 | 0.832 | 0.802 | 0.564 | 0.950 | 0.880 | 0.626 | 0.706 |
| Halogen atoms | 0.845/0.290 | 0.905/0.420 | 0.900/0.355 | 0.920/0.570 | 0.975/0.740 | 0.955/0.710 | 0.970/0.735 | 0.740/0.375 | 0.965/0.860 | 0.965/0.820 | 0.855/0.815 | 0.920/0.860 |
| Aldehyde | 0.855/0.570 | 0.965/0.610 | 0.945/0.730 | 0.855/0.725 | 0.970/0.825 | 0.985/0.920 | 0.960/0.835 | 0.715/0.585 | 0.990/0.960 | 0.985/0.945 | 0.925/0.925 | 0.975/0.965 |
| Amide | 0.505/0.180 | 0.570/0.205 | 0.635/0.315 | 0.585/0.340 | 0.715/0.440 | 0.685/0.510 | 0.635/0.415 | 0.495/0.205 | 0.765/0.650 | 0.755/0.610 | 0.565/0.500 | 0.735/0.665 |
| Carboxyl | 0.760/0.260 | 0.885/0.235 | 0.900/0.485 | 0.840/0.580 | 0.965/0.675 | 0.955/0.760 | 0.900/0.660 | 0.820/0.495 | 0.985/0.845 | 0.950/0.725 | 0.785/0.750 | 0.870/0.820 |
| Ester | 0.600/0.145 | 0.760/0.285 | 0.780/0.330 | 0.675/0.325 | 0.935/0.500 | 0.895/0.645 | 0.680/0.400 | 0.615/0.270 | 0.955/0.780 | 0.950/0.640 | 0.720/0.505 | 0.840/0.595 |
| Ketone | 0.530/0.155 | 0.750/0.260 | 0.870/0.435 | 0.750/0.465 | 0.925/0.600 | 0.985/0.745 | 0.880/0.600 | 0.770/0.370 | 0.985/0.865 | 0.985/0.795 | 0.765/0.675 | 0.850/0.775 |
| Benzene | 0.490/0.145 | 0.540/0.105 | 0.660/0.155 | 0.530/0.235 | 0.720/0.360 | 0.725/0.565 | 0.595/0.385 | 0.500/0.190 | 0.880/0.695 | 0.730/0.550 | 0.675/0.405 | 0.680/0.485 |
| Furan | 0.295/0.265 | 0.820/0.325 | 0.905/0.515 | 0.780/0.500 | 0.920/0.660 | 0.865/0.745 | 0.895/0.710 | 0.850/0.490 | 0.975/0.845 | 0.940/0.790 | 0.890/0.820 | 0.870/0.815 |
| Pyridine | 0.555/0.225 | 0.525/0.250 | 0.730/0.365 | 0.685/0.375 | 0.765/0.555 | 0.860/0.740 | 0.685/0.520 | 0.630/0.340 | 0.925/0.825 | 0.835/0.750 | 0.715/0.585 | 0.790/0.665 |
| Thiophene | 0.860/0.385 | 0.840/0.325 | 0.880/0.480 | 0.840/0.605 | 0.915/0.690 | 0.940/0.795 | 0.920/0.705 | 0.850/0.565 | 0.970/0.890 | 0.925/0.820 | 0.960/0.855 | 0.920/0.855 |
| Bond stereo | 0.390 | 0.395 | 0.670 | 0.425 | 0.330 | 0.310 | 0.310 | 0.345 | 0.480 | 0.325 | 0.575 | 0.640 |
| Chiral stereo | 0.440 | 0.395 | 0.530 | 0.465 | 0.510 | 0.435 | 0.440 | 0.495 | 0.545 | 0.520 | 0.510 | 0.495 |
| Average | 0.507/0.249 | 0.625/0.311 | 0.678/0.391 | 0.644/0.456 | 0.776/0.581 | 0.798/0.680 | 0.721/0.566 | 0.571/0.360 | 0.877/0.792 | 0.817/0.713 | 0.736/0.661 | 0.772/0.700 |
- Each entry reports recognition accuracy / localization accuracy where applicable.
- Tasks with only recognition evaluation show a single recognition accuracy value.
- Bold values indicate the best performance among all evaluated language models.
- "o3 (image)" and "o4-mini (image)" indicate vision-language models evaluated on molecular images.
Below are the complete evaluation results for molecule editing and generation tasks across all evaluated language and vision-language models.
Complete Results for Molecule Editing and Generation Tasks (click to expand)
| Task | GPT‑4o | GPT‑4.5-preview | GPT‑4.1 | o1‑mini | o1 | o3-mini | DeepSeek‑R1 | R1‑70B | o3 | o4‑mini | GPT Image 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Molecule editing | 0.725/0.400 | 0.950/0.570 | 0.835/0.465 | 0.710/0.385 | 0.845/0.635 | 0.805/0.650 | 0.720/0.485 | 0.675/0.375 | 0.945/0.785 | 0.920/0.690 | 0.510/0.080 |
| Molecule generation | 0.525/0.005 | 0.800/0.055 | 0.710/0.035 | 0.335/0.035 | 0.385/0.100 | 0.450/0.175 | 0.400/0.045 | 0.205/0.010 | 0.670/0.290 | 0.600/0.260 | 0.130/0.000 |
- Each entry reports SMILES validity / accuracy.
- Bold entries highlight the best performance.
Main Developer: Feiyang Cai - [email protected]
Project Supervisor: Feng Luo - [email protected]
Join our community on Discord to stay updated or ask questions.
If you find our work valuable, please consider giving the project a star and citing it in your research:
```bibtex
@article{MolLangBench,
  title={MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation},
  author={Feiyang Cai and Jiahui Bai and Tao Tang and Joshua Luo and Tianyu Zhu and Ling Liu and Feng Luo},
  journal={arXiv preprint arXiv:2505.15054},
  year={2025},
}
```
Thank you for your support!
This project is licensed under the MIT License. You are free to use, modify, and distribute this codebase under the terms of the MIT license.