MolLangBench is a comprehensive benchmark designed to evaluate the fundamental capabilities of AI models in language-prompted molecular structure recognition, editing, and generation.
This repository provides:
- Code and examples to load and use the dataset directly from the Hugging Face Hub
- Evaluation scripts and prompt templates to test OpenAI models (e.g., o1, o3, o4-mini) using either molecular images or SMILES strings as inputs
It is straightforward to extend this repository to evaluate other language or multimodal models by adapting the provided input formatting and evaluation templates.
You can easily set up the required environment by following these steps:

1. Clone the repository:

```bash
git clone https://github.com/TheLuoFengLab/MolLangBench.git
cd MolLangBench
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```
Load the dataset in Python:

```python
from datasets import load_dataset

# Recognition (train + test)
rec_train = load_dataset("ChemFM/MolLangBench", name="recognition", split="train")
rec_test = load_dataset("ChemFM/MolLangBench", name="recognition", split="test")

# Filter one specific subtask
subtask = "one_hop_neighbors"
subset = rec_test.filter(lambda x: x["task"] == subtask)

# Editing (test only)
edit = load_dataset("ChemFM/MolLangBench", name="edit", split="test")

# Generation (test only)
gen = load_dataset("ChemFM/MolLangBench", name="generation", split="test")
```
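After loading, it helps to sanity-check a split before generating prompts. A minimal sketch (apart from the `task` field used above, the exact record fields depend on the dataset schema):

```python
# Quick sanity check of the filtered subtask split.
print(f"{len(subset)} examples in subtask '{subtask}'")

# Inspect one record; apart from "task", the available fields
# depend on the dataset schema.
print(subset[0])
```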
We provide end-to-end scripts for:

- Generating prompt files (`.jsonl`)
- Creating batch job inputs for the OpenAI API
- Submitting jobs and retrieving outputs
- Computing evaluation metrics
Below is a step-by-step example workflow for the “one hop neighbors” recognition subtask.
Prompt templates for all tasks and modalities (SMILES and image) are located in the `prompts` folder. To generate a `.jsonl` prompt file, run:
```bash
python scripts/create_prompts.py \
    --task_type <recognition|editing|generation> \
    --recognition_subtask <recognition_subtask_name_if_applicable> \
    --modality <smiles|image> \
    --split <train|test> \
    --output_file <output_jsonl_path>
```

Example: for the "one hop neighbors" recognition subtask (SMILES modality, test split):
```bash
python scripts/create_prompts.py \
    --task_type recognition \
    --recognition_subtask one_hop_neighbors \
    --modality smiles \
    --split test \
    --output_file exps/one_hop_neighbors/prompts.jsonl
```

For the image modality, the image is included as a Base64-encoded string in the `.jsonl` file.
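For reference, here is a minimal sketch of how an image can be embedded as a Base64 string in a `.jsonl` line; the file name and field names are illustrative assumptions, and `scripts/create_prompts.py` defines the actual format:

```python
import base64
import json

# Encode a molecule image as Base64 (hypothetical file name).
with open("molecule.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

# One prompt line; "custom_id" and "image" are illustrative field names.
line = {"custom_id": "example_0", "image": f"data:image/png;base64,{b64_image}"}
print(json.dumps(line))
```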
Generate a batch job file for your prompts and desired model:

```bash
python scripts/create_openai_jobs.py \
    --prompt_file <prompts_jsonl_path> \
    --output_file <batch_job_jsonl_path> \
    --model <model_id> \
    --custom_id_prefix <optional_prefix>
```

Example: using the o4-mini model for "one hop neighbors":
```bash
python scripts/create_openai_jobs.py \
    --prompt_file exps/one_hop_neighbors/prompts.jsonl \
    --output_file exps/one_hop_neighbors/o4-mini/batch_input.jsonl \
    --model o4-mini \
    --custom_id_prefix o4_mini
```

Submit the batch job to the OpenAI API:
```bash
python scripts/submit_openai_jobs.py submit \
    --jobs_file <batch_job_jsonl_path> \
    [--api_key YOUR_API_KEY] \
    [--organization YOUR_ORG_ID]
```

You can also set your API key as an environment variable. The organization ID is optional.
Example:

```bash
python scripts/submit_openai_jobs.py submit \
    --jobs_file exps/one_hop_neighbors/o4-mini/batch_input.jsonl
```

This command will print a `batch_id` for your job.
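Under the hood, submitting a batch corresponds roughly to the following OpenAI Python SDK calls (a sketch only; see `scripts/submit_openai_jobs.py` for the script's actual logic):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the batch input file, then create the batch job.
batch_file = client.files.create(
    file=open("exps/one_hop_neighbors/o4-mini/batch_input.jsonl", "rb"),
    purpose="batch",
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # the batch_id used for retrieval
```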
To retrieve the results (the script periodically checks until the job is complete):

```bash
python scripts/submit_openai_jobs.py retrieve \
    --batch_id <BATCH_ID> \
    --output_file <results_jsonl_path> \
    [--api_key YOUR_API_KEY] \
    [--organization YOUR_ORG_ID] \
    [--check_interval 60]
```

Example:
```bash
python scripts/submit_openai_jobs.py retrieve \
    --batch_id BATCH_ID \
    --output_file exps/one_hop_neighbors/o4-mini/results.jsonl
```
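Retrieval amounts to polling the batch status and downloading the output file once the job finishes. Roughly, with the OpenAI SDK (a sketch; the script's `--check_interval` flag controls the polling period):

```python
import time

from openai import OpenAI

client = OpenAI()
batch_id = "BATCH_ID"  # printed by the submit step

# Poll until the batch reaches a terminal status.
while True:
    batch = client.batches.retrieve(batch_id)
    if batch.status in ("completed", "failed", "cancelled", "expired"):
        break
    time.sleep(60)  # corresponds to --check_interval

# Download the results file if the job completed successfully.
if batch.status == "completed":
    content = client.files.content(batch.output_file_id)
    with open("results.jsonl", "wb") as f:
        f.write(content.read())
```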
Evaluate the model outputs with:

```bash
python scripts/evaluate_results.py \
    --results_file <results_jsonl_path> \
    --task_type <recognition|editing|generation> \
    --subtask <subtask_name> \
    --modality <smiles|image>
```

Example:
```bash
python scripts/evaluate_results.py \
    --results_file exps/one_hop_neighbors/o4-mini/results.jsonl \
    --task_type recognition \
    --subtask one_hop_neighbors \
    --modality smiles
```

This will print the evaluation metrics for your selected task and model.
The default result tags are `<count>` and `<atom_indices>`. For certain tasks, you may need to specify custom result tags using the `--result_1_tag <result_1_tag>` argument.
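For illustration, tagged results can be pulled out of a model response with a simple regular expression (a sketch; `scripts/evaluate_results.py` implements the actual parsing):

```python
import re

# A hypothetical model response containing the default result tags.
response = "There are <count>3</count> matches at <atom_indices>[1, 4, 7]</atom_indices>."

count = re.search(r"<count>(.*?)</count>", response, re.DOTALL)
indices = re.search(r"<atom_indices>(.*?)</atom_indices>", response, re.DOTALL)
print(count.group(1))    # "3"
print(indices.group(1))  # "[1, 4, 7]"
```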
⚠️ Warning:
For molecule editing and generation tasks using the `gpt-image-1` model, the OpenAI API does not support batch job submissions.

- You must submit each image generation or editing request individually using the OpenAI Images API.
- For automatic evaluation, you will also need a Mathpix account and API key to convert the generated molecular images back to SMILES strings.

Example scripts for these tasks are provided in the `Miscellaneous` folder.
We plan to provide a more streamlined workflow in the future.
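For reference, an individual `gpt-image-1` request with the OpenAI Python SDK looks roughly like this (a minimal sketch with an illustrative prompt; the actual scripts live in the Miscellaneous folder):

```python
import base64

from openai import OpenAI

client = OpenAI()

# One image-generation request; batch submission is not supported for gpt-image-1.
resp = client.images.generate(
    model="gpt-image-1",
    prompt="A skeletal structure drawing of aspirin",  # illustrative prompt
)

# gpt-image-1 returns the image as Base64; decode and save it.
with open("generated_molecule.png", "wb") as f:
    f.write(base64.b64decode(resp.data[0].b64_json))
```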
The Miscellaneous folder contains helpful scripts and utilities, including:

- **Ground Truth Collection**: Scripts for collecting ground truth information for each recognition task using RDKit (see the sketch after this list).
- **Image-to-SMILES Conversion**: Scripts to call the Mathpix API for converting molecule images to SMILES strings for automated evaluation.
- **Per-Image OpenAI API Submission**: Scripts to submit image generation and editing requests (for the image modality) to the OpenAI `gpt-image-1` API one by one, as batch jobs are not currently supported.
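As an example of the ground-truth collection approach, one-hop neighbors can be read directly off an RDKit molecule (a minimal sketch; the actual scripts cover all recognition subtasks):

```python
from rdkit import Chem

# Ground truth for the one-hop-neighbors task on a small example molecule.
mol = Chem.MolFromSmiles("CC(=O)O")  # acetic acid
atom_idx = 1  # the carbonyl carbon

# Atom indices directly bonded to the query atom.
neighbors = sorted(a.GetIdx() for a in mol.GetAtomWithIdx(atom_idx).GetNeighbors())
print(neighbors)  # [0, 2, 3]
```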
More utilities and improvements will be added in the future.
Below are the complete evaluation results for all molecular structure recognition subtasks across a wide range of large language models and vision-language models.
Complete Results for Molecular Structure Recognition Tasks (click to expand)
| Task | GPT‑4o | GPT‑4.5-preview | GPT‑4.1 | o1‑mini | o1 | o3-mini | DeepSeek‑R1 | R1‑70B | o3 | o4‑mini | o3 (image) | o4-mini (image) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| One‑hop neighbors | 0.355/0.140 | 0.600/0.425 | 0.570/0.330 | 0.735/0.640 | 0.825/0.720 | 0.870/0.820 | 0.825/0.710 | 0.585/0.430 | 0.935/0.895 | 0.880/0.845 | 0.890/0.855 | 0.840/0.780 |
| Two‑hop neighbors | 0.215/0.055 | 0.280/0.100 | 0.400/0.210 | 0.465/0.350 | 0.745/0.560 | 0.820/0.740 | 0.610/0.475 | 0.305/0.135 | 0.935/0.825 | 0.870/0.790 | 0.770/0.705 | 0.775/0.690 |
| Three‑hop neighbors | 0.165/0.015 | 0.355/0.165 | 0.265/0.140 | 0.400/0.265 | 0.560/0.400 | 0.825/0.705 | 0.550/0.385 | 0.300/0.130 | 0.925/0.830 | 0.775/0.710 | 0.695/0.600 | 0.660/0.575 |
| Quaternary carbons | 0.530/0.290 | 0.690/0.435 | 0.740/0.440 | 0.615/0.470 | 0.865/0.665 | 0.835/0.740 | 0.780/0.680 | 0.440/0.330 | 0.935/0.865 | 0.845/0.750 | 0.670/0.600 | 0.720/0.665 |
| Ring junctions | 0.285/0.080 | 0.495/0.185 | 0.485/0.210 | 0.325/0.175 | 0.575/0.470 | 0.580/0.520 | 0.535/0.420 | 0.255/0.160 | 0.685/0.650 | 0.590/0.570 | 0.660/0.595 | 0.615/0.555 |
| Bond connection | 0.448 | 0.472 | 0.336 | 0.698 | 0.758 | 0.832 | 0.802 | 0.564 | 0.950 | 0.880 | 0.626 | 0.706 |
| Halogen atoms | 0.845/0.290 | 0.905/0.420 | 0.900/0.355 | 0.920/0.570 | 0.975/0.740 | 0.955/0.710 | 0.970/0.735 | 0.740/0.375 | 0.965/0.860 | 0.965/0.820 | 0.855/0.815 | 0.920/0.860 |
| Aldehyde | 0.855/0.570 | 0.965/0.610 | 0.945/0.730 | 0.855/0.725 | 0.970/0.825 | 0.985/0.920 | 0.960/0.835 | 0.715/0.585 | 0.990/0.960 | 0.985/0.945 | 0.925/0.925 | 0.975/0.965 |
| Amide | 0.505/0.180 | 0.570/0.205 | 0.635/0.315 | 0.585/0.340 | 0.715/0.440 | 0.685/0.510 | 0.635/0.415 | 0.495/0.205 | 0.765/0.650 | 0.755/0.610 | 0.565/0.500 | 0.735/0.665 |
| Carboxyl | 0.760/0.260 | 0.885/0.235 | 0.900/0.485 | 0.840/0.580 | 0.965/0.675 | 0.955/0.760 | 0.900/0.660 | 0.820/0.495 | 0.985/0.845 | 0.950/0.725 | 0.785/0.750 | 0.870/0.820 |
| Ester | 0.600/0.145 | 0.760/0.285 | 0.780/0.330 | 0.675/0.325 | 0.935/0.500 | 0.895/0.645 | 0.680/0.400 | 0.615/0.270 | 0.955/0.780 | 0.950/0.640 | 0.720/0.505 | 0.840/0.595 |
| Ketone | 0.530/0.155 | 0.750/0.260 | 0.870/0.435 | 0.750/0.465 | 0.925/0.600 | 0.985/0.745 | 0.880/0.600 | 0.770/0.370 | 0.985/0.865 | 0.985/0.795 | 0.765/0.675 | 0.850/0.775 |
| Benzene | 0.490/0.145 | 0.540/0.105 | 0.660/0.155 | 0.530/0.235 | 0.720/0.360 | 0.725/0.565 | 0.595/0.385 | 0.500/0.190 | 0.880/0.695 | 0.730/0.550 | 0.675/0.405 | 0.680/0.485 |
| Furan | 0.295/0.265 | 0.820/0.325 | 0.905/0.515 | 0.780/0.500 | 0.920/0.660 | 0.865/0.745 | 0.895/0.710 | 0.850/0.490 | 0.975/0.845 | 0.940/0.790 | 0.890/0.820 | 0.870/0.815 |
| Pyridine | 0.555/0.225 | 0.525/0.250 | 0.730/0.365 | 0.685/0.375 | 0.765/0.555 | 0.860/0.740 | 0.685/0.520 | 0.630/0.340 | 0.925/0.825 | 0.835/0.750 | 0.715/0.585 | 0.790/0.665 |
| Thiophene | 0.860/0.385 | 0.840/0.325 | 0.880/0.480 | 0.840/0.605 | 0.915/0.690 | 0.940/0.795 | 0.920/0.705 | 0.850/0.565 | 0.970/0.890 | 0.925/0.820 | 0.960/0.855 | 0.920/0.855 |
| Bond stereo | 0.390 | 0.395 | 0.670 | 0.425 | 0.330 | 0.310 | 0.310 | 0.345 | 0.480 | 0.325 | 0.575 | 0.640 |
| Chiral stereo | 0.440 | 0.395 | 0.530 | 0.465 | 0.510 | 0.435 | 0.440 | 0.495 | 0.545 | 0.520 | 0.510 | 0.495 |
| Average | 0.507/0.249 | 0.625/0.311 | 0.678/0.391 | 0.644/0.456 | 0.776/0.581 | 0.798/0.680 | 0.721/0.566 | 0.571/0.360 | 0.877/0.792 | 0.817/0.713 | 0.736/0.661 | 0.772/0.700 |
- Each entry reports recognition accuracy / localization accuracy where applicable.
- Tasks with only recognition evaluation show a single recognition accuracy value.
- Bold values indicate the best performance among all evaluated language models.
- "o3 (image)" and "o4-mini (image)" indicate vision-language models evaluated on molecular images.
Below are the complete evaluation results for molecule editing and generation tasks across all evaluated language and vision-language models.
Complete Results for Molecule Editing and Generation Tasks (click to expand)
| Task | GPT‑4o | GPT‑4.5-preview | GPT‑4.1 | o1‑mini | o1 | o3-mini | DeepSeek‑R1 | R1‑70B | o3 | o4‑mini | GPT Image 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Molecule editing | 0.725/0.400 | 0.950/0.570 | 0.835/0.465 | 0.710/0.385 | 0.845/0.635 | 0.805/0.650 | 0.720/0.485 | 0.675/0.375 | 0.945/0.785 | 0.920/0.690 | 0.510/0.080 |
| Molecule generation | 0.525/0.005 | 0.800/0.055 | 0.710/0.035 | 0.335/0.035 | 0.385/0.100 | 0.450/0.175 | 0.400/0.045 | 0.205/0.010 | 0.670/0.290 | 0.600/0.260 | 0.130/0.000 |
- Each entry reports SMILES validity / accuracy.
- Bold entries highlight the best performance.
Main Developer: Feiyang Cai - [email protected]
Project Supervisor: Feng Luo - [email protected]
Join our community on Discord to stay updated or ask questions.
If you find our work valuable, please consider giving the project a star and citing it in your research:
```bibtex
@article{MolLangBench,
  title={MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation},
  author={Feiyang Cai and Jiahui Bai and Tao Tang and Joshua Luo and Tianyu Zhu and Ling Liu and Feng Luo},
  journal={arXiv preprint arXiv:2505.15054},
  year={2025},
}
```
Thank you for your support!
This project is licensed under the MIT License. You are free to use, modify, and distribute this codebase under the terms of the MIT license.