README.md (19 changes: 10 additions & 9 deletions)
@@ -1,5 +1,5 @@
# OpenFF QCArchive Dataset Submission

## Dataset Lifecycle

All datasets submitted to QCArchive via this repository conform to the [Dataset Lifecycle](./LIFECYCLE.md).
@@ -16,7 +16,7 @@ Datasets must be submitted as pull requests.
export GIT_LFS_SKIP_SMUDGE=1
git clone [email protected]:openforcefield/qca-dataset-submission.git
```

This will clone the repo, but avoid downloading existing LFS objects.
If you wish to download all LFS objects, leave off the `export GIT_LFS_SKIP_SMUDGE=1`.

@@ -85,16 +85,16 @@ Datasets must be submitted as pull requests.

### Creating a compute expansion
If you have already computed a dataset but want to re-compute it with a new `QCSpec` (e.g. new level of theory), you can do so using a compute expansion. This is faster than creating a new dataset, and explicitly links datasets with the same molecules and purpose.
A compute expansion involves adding a file called `compute.json` to your original submission, which contains the dataset metadata (identical to the original dataset) and the new compute spec.
This can be done manually or programmatically.
The programmatic approach is described below, with an example of the notebook and of the file; a code sketch of the workflow follows the numbered steps.

1. Create a new branch as described above, and navigate to the submission directory of the dataset you want to expand.
2. Create a new Jupyter notebook called `generate-compute.ipynb` ([example here](https://github.com/openforcefield/qca-dataset-submission/blob/master/submissions/2024-09-18-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.1/generate-compute.ipynb)).
3. In the notebook, either download the original dataset and remove the molecules and _original_ `QCSpec`, or re-create the dataset with the same name as the original and skip the molecule addition step.
* See below for details about how changes to the dataset are propagated; note that the dataset name must be the same, and changes to any metadata except `compute-tag` and the `QCSpec` will be ignored when submitting the compute expansion.
* Please note that the default `compute_tag` is `openff`; if you need a different one, add it explicitly to the dataset at this step, as the `compute.json` file overrides any compute tag added manually to the PR. If you do need to change the compute tag after submission, update the label on the PR; the change will take effect the next time the error cycling action runs.
4. Add the _new_ `QCSpec` to the dataset, and save the dataset to `compute.json`, example [here](https://github.com/openforcefield/qca-dataset-submission/blob/add-ddx-to-nagl-benchmark/submissions/2024-09-18-OpenFF-NAGL2-ESP-Timing-Benchmark-v1.1/compute.json).
5. Add the additional compute spec to the submission's `README.md` file.
6. Add the `generate-compute.ipynb` and `compute.json` files to the submission's `QCSubmit Manifest` entry in the `README.md` file.
7. Proof the submission and open a PR. Dataset validation will run automatically.
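
The core of such a notebook can be sketched as follows. This is a minimal sketch, assuming the openff-qcsubmit Python API (`BasicDataset`, `clear_qcspecs`, `add_qc_spec`, `export_dataset`); the dataset name, tagline, and the wB97X-D/def2-TZVPP spec are purely illustrative, and you should use the dataset class (`BasicDataset`, `OptimizationDataset`, or `TorsiondriveDataset`) and metadata that match your original submission and the example notebook linked above.

```python
# Hedged sketch of a generate-compute.ipynb: build a compute.json that re-uses the
# original dataset's name and metadata but carries only the new QCSpec.
# All names and spec values below are illustrative, not taken from a real submission.
from openff.qcsubmit.datasets import BasicDataset

dataset = BasicDataset(
    dataset_name="OpenFF Example Dataset v1.0",  # must exactly match the original dataset name
    dataset_tagline="Compute expansion with an additional QCSpec.",
    description="Same molecules as the original submission; only the new level of theory is added.",
    compute_tag="openff",  # set explicitly here if you need something other than the default
)

# Drop the default spec and register only the *new* level of theory.
dataset.clear_qcspecs()
dataset.add_qc_spec(
    method="wb97x-d",
    basis="def2-tzvpp",
    program="psi4",
    spec_name="wb97x-d/def2-tzvpp",
    spec_description="New level of theory for the compute expansion.",
)

# No molecules are added, matching step 3 above: the expansion only attaches the
# new spec to the existing dataset entries when the submission automation runs.
dataset.export_dataset("compute.json")
```
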
@@ -120,7 +120,7 @@ All Open Force Field datasets submitted to QCArchive undergo well-defined *lifec

![Dataset Lifecycle](assets/lifecycle-diagram.png)

Each labeled rectangle in the lifecycle represents a *state*.
A submission PR changes state according to the arrows.
Changes in state may be performed by automation or manually by a human when certain criteria are met.

@@ -145,7 +145,7 @@ The lifecycle process is described below, with [bracketed] items indicating the
- [GHA] [`lifecycle-submission`](.github/workflows/lifecycle-submission.yml) will attempt to submit the dataset
- if successful, will move card to state ["Error Cycling"](https://github.com/openforcefield/qca-dataset-submission/projects/1#column-9577365); add comment to PR
- if failed, will keep card queued; add comment to PR; attempt again next execution

- [Human] Submit worker jobs on a server to begin compute. If using Nautilus, carefully monitor utilization and scale down resources as jobs finish.

4. COMPLETE, INCOMPLETE, ERROR numbers reported for `Optimizations`, `TorsionDrives`
@@ -184,7 +184,7 @@ In addition to the states given above, there are additional touchpoints availabl
4. The order of a submission PR in a [Dataset Tracking](https://github.com/openforcefield/qca-dataset-submission/projects/1) column matters.
Submissions higher in a column will be operated on first by all GitHub Actions automation.
For example, if you want to error cycle a submission before any others so it has a higher chance of being pulled by idle manager workers, place it at the top of the Error Cycling column.


# Dude where's my Dataset?

@@ -246,6 +246,7 @@ These are currently used to compute properties of a minimum energy conformation
|`OpenFF Sulfur Hessian Training Coverage Supplement v1.1` | [2024-11-08-OpenFF-Sulfur-Hessian-Training-Coverage-Supplement-v1.1](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-08-OpenFF-Sulfur-Hessian-Training-Coverage-Supplement-v1.1) | Additional Hessian training data for Sage sulfur and phosphorus parameters (from ['OpenFF Sulfur Optimization Training Coverage Supplement v1.0'](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-09-11-OpenFF-Sulfur-Optimization-Training-Coverage-Supplement-v1.0)) | O, S, C, Cl, P, N, F, Br, H | |
| `OpenFF Aniline Para Hessian v1.0` | [2024-10-07-OpenFF-Aniline-Para-Hessian-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-07-OpenFF-Aniline-Para-Hessian-v1.0) | Hessian single points for the final molecules in the `OpenFF Aniline Para Opt v1.0` [dataset](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-04-02-OpenFF-Aniline-Para-Opt-v1.0) | 'O', 'Cl', 'S', 'Br', 'H', 'F', 'N', 'C' ||
|`OpenFF Gen2 Hessian Dataset Protomers v1.0` | [2024-10-07-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-07-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.0/) | Hessian single points for the final molecules in the `OpenFF Gen2 Optimization Dataset Protomers v1.0` [dataset](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-12-21-OpenFF-Gen2-Optimization-Set-Protomers) | 'H', 'C', 'Cl', 'P', 'F', 'Br', 'O', 'N', 'S'||
|`OpenFF Gen2 Hessian Dataset Protomers v1.1` | [2024-11-12-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.1](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-12-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.1/) | Hessian single points for the final molecules in the `OpenFF Gen2 Optimization Dataset Protomers v1.0` [dataset](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-12-21-OpenFF-Gen2-Optimization-Set-Protomers), re-generated to preserve molecule IDs between opt and basic datasets. | 'H', 'C', 'Cl', 'P', 'F', 'Br', 'O', 'N', 'S'||
| `MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0` | [2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-10-11-MLPepper-RECAP-Optimized-Fragments-Add-Iodines-v1.0) | Set of diverse iodine containing molecules with a number of calculated electrostatic properties. | Br, Cl, S, B, O, Si, C, N, I, P, H, F| |
| `OpenFF Iodine Chemistry Hessian Dataset v1.0` | [2024-11-11-OpenFF-Iodine-Chemistry-Hessian-Dataset-v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2024-11-11-OpenFF-Iodine-Chemistry-Hessian-Dataset-v1.0) | Hessian single points for the final molecules in the `OpenFF Iodine Chemistry Optimization Dataset v1.0` [dataset](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2022-07-27-OpenFF-iodine-optimization-set) | I, F, Br, C, Cl, O, S, N, H ||

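To pull any of these datasets down for inspection, a minimal sketch along the following lines should work, assuming the current qcportal `PortalClient` API and that basic/Hessian datasets are exposed as the `singlepoint` dataset type; the dataset name is taken from the table above.

```python
# Minimal sketch: fetch one of the Hessian (basic/"singlepoint") datasets above by name.
# Assumes the qcportal PortalClient API; dataset type and name may need adjusting.
from qcportal import PortalClient

client = PortalClient("https://api.qcarchive.molssi.org:443")
ds = client.get_dataset("singlepoint", "OpenFF Gen2 Hessian Dataset Protomers v1.1")

print(ds.status())  # per-specification record counts by status (complete, error, ...)
```
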
submissions/2024-11-12-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.1/README.md (new file, 53 additions)
@@ -0,0 +1,53 @@
# OpenFF Gen2 Hessian Dataset Protomers v1.1

## Description

Hessian single points for the final molecules in the [OpenFF Gen2 Optimization Dataset Protomers v1.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-12-21-OpenFF-Gen2-Optimization-Set-Protomers) at the B3LYP-D3BJ/DZVP level of theory.

## General information

* Date: 2024-11-12
* Class: OpenFF Basic Dataset
* Purpose: Hessian dataset generation for MSM parameters
* Name: OpenFF Gen2 Hessian Dataset Protomers v1.1
* Number of unique molecules: 108
* Number of conformers: 597
* Number of conformers (min, mean, max): 1, 5.53, 10
* Molecular weight (min, mean, max): 82.06, 282.05, 542.59
* Charges: -3.0, -2.0, -1.0, 0.0, 1.0, 2.0
* Dataset submitter: Alexandra McIsaac
* Dataset generator: Alexandra McIsaac

## Changelog
Compared to OpenFF Gen2 Hessian Dataset Protomers v1.0, this dataset has been re-generated with an updated `create_basic_dataset` from QCSubmit 0.54 so as to preserve the molecule IDs between optimization and Hessian datasets.
The molecules and results should be nearly identical, but preserving molecule IDs improves our ability to process the dataset in our force field fitting pipeline.

## QCSubmit generation pipeline

* `generate-dataset.ipynb`: This notebook shows how the dataset was prepared from the input files.
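
For orientation, a minimal sketch of that workflow is given below, assuming the openff-qcsubmit and qcportal Python APIs (`OptimizationResultCollection.from_server`, `create_basic_dataset`, `QCSpec`); the exact filters, metadata, and keyword settings are those recorded in `generate-dataset.ipynb`, not in this sketch.

```python
# Hedged sketch of the generate-dataset.ipynb workflow: pull the completed
# optimizations and build a Hessian (basic) dataset from their final geometries.
# Filters, metadata, and keywords are omitted here; see the notebook for the real calls.
from qcportal import PortalClient
from openff.qcsubmit.common_structures import QCSpec
from openff.qcsubmit.results import OptimizationResultCollection

client = PortalClient("https://api.qcarchive.molssi.org:443")

opt_results = OptimizationResultCollection.from_server(
    client=client,
    datasets=["OpenFF Gen2 Optimization Dataset Protomers v1.0"],
    spec_name="default",
)

# create_basic_dataset from QCSubmit 0.54 carries the optimization record IDs
# through to the new Hessian entries, which is the point of this v1.1 regeneration.
dataset = opt_results.create_basic_dataset(
    dataset_name="OpenFF Gen2 Hessian Dataset Protomers v1.1",
    tagline="Hessian single points for the Gen2 protomer optimization set.",
    description="B3LYP-D3BJ/DZVP Hessians for MSM parameter fitting.",
    driver="hessian",
    qc_specifications=[
        QCSpec(
            method="B3LYP-D3BJ",
            basis="DZVP",
            program="psi4",
            spec_name="default",
            spec_description="Standard OpenFF method and basis.",
        )
    ],
)

dataset.export_dataset("dataset.json.bz2")
```
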


## QCSubmit Manifest

* `dataset.json.bz2`: Compressed dataset ready for submission
* `dataset.pdf`: Visualization of dataset molecules
* `dataset.smi`: SMILES strings for dataset molecules
* `generate-dataset.ipynb`: Notebook describing dataset generation and submission
* `input-environment.yaml`: Environment file used to create Python environment for the notebook
* `full-environment.yaml`: Fully-resolved environment used to execute the notebook.


## Metadata

* elements: {'H', 'C', 'Cl', 'P', 'F', 'Br', 'O', 'N', 'S'}
* unique molecules: 108
* Spec: B3LYP-D3BJ/DZVP
* SCF properties:
* dipole
* quadrupole
* wiberg_lowdin_indices
* mayer_indices
* QC Spec:
* name: default
* method: B3LYP-D3BJ
* basis: DZVP
Git LFS file not shown
Binary file not shown.
submissions/2024-11-12-OpenFF-Gen2-Hessian-Dataset-Protomers-v1.1/dataset.smi (new file, 108 additions)
@@ -0,0 +1,108 @@
C[S@@](=N)(=O)NC1CC1
C1[C@@H]2[C@@H](C2O)C[NH2+]1
CSC(=S)NN
C=C(C(=O)[O-])OP(=O)([O-])[O-]
c1cnc(cc1Cl)S(=O)[O-]
c1c(c(c(nc1Cl)Cl)O)C(=O)[O-]
C1=C(N=C(S1)C[NH3+])Br
c1cc(c(cc1Br)[S-])C(=O)[O-]
C1=C(C(=C(S1)C[NH3+])Br)Br
CN1C(=NN=N1)C[NH3+]
c1c(cc2c(c1C(=O)[O-])OCO2)Br
CC1=NC(=NN1)C[NH3+]
C[C@@H](C1=NNC(=O)N1)NC#N
C[C@]1(CC(=O)N1)C2=NC=CS2
c1c(cnc(n1)C2(CC2)C(=O)[O-])Br
CN1CC(=O)NC1=[NH+]P(=O)(Cl)Cl
c1cc(nc(c1)Cl)C2(CC2)C(=O)[O-]
CC1(COP(=S)(OC1)S)C
c1cc(cc2c1cc(cc2)O)C(=O)[O-]
c1c(c(c(c(n1)Cl)Cl)[C@H]2CC[NH2+]2)Cl
C[C@H]1C[N@@](S(=O)(=O)C1)NC
c1cc(c(cc1S(=O)(=O)N)S(=O)(=O)N)[O-]
c1cc(c(cc1[C@H]2CC[NH2+]2)F)F
c1cc(c(cc1C(F)F)CC(=O)[O-])C(F)F
c1cc(c(c(c1)Cl)NC2=[NH+]CCO2)Cl
CC(C)c1ccc(cc1C(=O)[O-])O
C[C@@H]1C(=[NH+]c2cc(ccc2O1)C#N)N
COC(=O)CC1=CSC(=N1)NS(=O)(=O)C
CCCOc1ccc(cc1[S-])C
c1cc(c(cc1F)C(F)(F)F)[C@@H]2CC[NH2+]2
c1ccc(c(c1)[O-])S(=O)(=O)c2ccccc2[O-]
c1cc2c(cc1F)C=C3N2CC[NH2+]C3
C1CC1NC(=O)C[N@]2C[C@H](CO2)O
CCOC(=O)[C@@H]1Cc2cccc(c2C1=O)[O-]
c1ccc2c(c1)[NH+]=C([C@@H](O2)CCO)N
CC1=C2c3c(ccc(c3CC[C@@H]2C(=O)[O-])F)N1
c1ccc2c(c1)[C@@H](CO2)NC3=[NH+]CCS3
CC1=CC(=C(C(=O)O1)CN=Nc2cccc(c2)Cl)[O-]
C[C@@H](c1ccccc1Cl)NC2=[NH+]CCO2
C[C@@H](c1ccc2cc(ccc2c1)OC)C(=O)[O-]
COc1cccc(c1)N2CC[C@@H](C2=O)[NH3+]
C[NH+]1CCN(CC1)c2c(c[nH+]cc2Cl)Cl
Cc1cccc(c1)/N=N/C2=C(C(=C(NC2=O)[O-])C#N)C
CC(=O)N1C=C(c2c1cccc2)[C@H]3CC[NH2+]3
C[C@@H](C1=CCCc2c1ccc(c2)OC)C(=O)[O-]
C[C@@H](Cc1ccc(cc1)OCCO)[NH3+]
CC1(CC1)COC(=O)[C@@H](c2ccccc2)C(=O)[O-]
C[C@@H]1C(=[NH+]c2cc3c(cc2O1)CCCC3)N
CCc1cc(cc(c1CC(=O)[O-])CC)C#CC
CC1(CC1)c2c(cc(cn2)Br)C[NH2+]C3CC3
C=C1C[C@@H]2COC(=O)[C@@]2(C1)Cc3cccc(c3)C(=O)[O-]
CCOc1ccc(cc1OC)C[C@@H](C)[NH3+]
c1ccc(cc1)CC[C@@H]2C(=[NH+]c3ccccc3O2)N
Cc1cc(c(nc1)C2(CC2)C)C[NH2+]C3CC3
Cc1cc(cc(c1[O-])C)S(=O)(=O)c2cc(c(c(c2)C)[O-])C
c1c2c(cc(c1C=C3CCCCC3)Cl)[C@@H](CC2)C(=O)[O-]
CC(=O)N1CCC[C@]12CCCC[N@@H+](C2)C
C[NH+]1CCN(CC1)C(=O)[C@@H]2CCCC[NH2+]2
c1ccc(cc1)[C@@H](C(=O)[O-])C(=O)OCC2CCCCC2
CC1(C[C@@H](c2ccccc2O1)C[NH2+]C3CC3)C
CCS(=O)(=O)c1ccc(c(c1OCC2CCCC2)Cl)C(=O)[O-]
c1cc2c(c(c1)F)NC(=CC2=O)C[NH+]3CC(C3)c4cccnc4
C[N@@](CCC[NH+]1CCCC1)c2c3c(ccn2)OC=C3
C1C[C@@H]([NH2+]C1)CCCS(=O)(=O)CCCCC(F)(F)F
C[N@]1C[NH2+]CC2=C1N=C(N2C)N3CC[NH+](CC3)C
C[C@@](C[NH3+])(c1ccc(c(c1)OC2CCCC2)OC)O
c1cc2c(cc1F)C(=O)C=C(N2)CN3CC(C3)[NH+]4CCOCC4
CS(=O)(=O)c1ccc(c(c1OCCN2C=NN=N2)Cl)C(=C3C(=O)CCCC3=O)[O-]
CS(=O)(=O)[N@@]1CC[C@]2(C1)CCC[N@@H+](C2)Cc3ccccn3
c1cc(cc(c1)Cl)C2(CC2)C(=O)Nc3ccc(c(c3)C4=N[N-]N=N4)c5cc(cnc5)F
CCc1cc(cc(c1C2=C([C@@H]3CC(C[C@@H]3C2=O)OC)[O-])CC)C
C[N@H+]1C[C@@H]2CC[C@H](C1)[N@@](C2)CC3=CC=C(O3)c4ccccc4C#N
Cc1cc(ccn1)c2ccc(cc2C3=N[N-]N=N3)NC(=O)C4(CC4)c5cccc(c5)Cl
C1C[C@@]2(CC[N@@H+](C2)CC3=NC=CS3)C(=O)N(C1)C4CCOCC4
C[S@@](=O)Cl
c1ccc2c(c1)ccc(n2)COc3ccc(cc3)[C@@](C4CCCC4)(C(=O)[O-])O
CC(C)[NH+](CCNC(=O)CN1[C@@H](CCC1=O)C(F)(F)F)C(C)C
c1cc2c(cc1CC(=O)C3CC3)OC(=C2)C(=O)N[C@@H]4C[NH+]5CCC4CC5
c1ccc2c(c1)C(=C(N2Cc3ccc(cc3)F)C(=O)[O-])c4cccc(c4)OCC5CC5
CC1(C=C(c2ccccc2O1)C[NH+]3CCN(CC3)c4ccccc4)C
CS(=O)(=O)c1ccc(c(c1COC2CCCCC2)Cl)C(=C3C(=O)CCCC3=O)[O-]
CCC[N@@H+]1C[C@@H](C[C@@H]2[C@@H]1CC3=C(Nc4c3c2ccc4)COC)CC#N
c1ccc(cc1)CN(Cc2ccccc2)c3ccc(cc3)C4(CCC4)CC(=O)[O-]
CC[NH+](CC)CCOc1ccc(cc1)CNc2ccc(c3c2cccc3)Cl
COc1cccc(c1)CN2c3ccccc3C(=C2C(=O)[O-])c4ccc(cc4)OCC5CC5
Cc1cc(cc(c1Cc2ccc(c(c2)S(=O)(=O)c3ccc(cc3)F)[O-])C)OC(C)(C)C(=O)[O-]
CCC[N@@H+]1C[C@@H](C[C@@H]2[C@@H]1CC3=C(Nc4c3c2ccc4)COC)CSC
CCC1=COC(=N1)c2ccc(cc2)OCCCOc3ccc4c(c3)C=CN4[C@@H](C)C(=O)[O-]
CC1=C(SC2=C1C(=O)N(C(=O)N2CCCF)CC(C)(C)OC)C[NH2+]CCN
COc1ccc2ccc(cc2c1)OCCCOc3ccc-4c(c3)N(c5c4cccc5)CC(=O)[O-]
CC[C@@H]([C@@H]1CCc2c1ccc(c2)OCCC3=C(OC(=N3)c4cccc(c4)OC)C)C(=O)[O-]
Cc1ccc(cc1N(C)C)C2=NC(=C(O2)C)CCOc3ccc4c(c3)CC[C@@H]4CC(=O)[O-]
C[C@@]1(O[C@@H](CS1)C[N@H+]2CCCC(C2)(C)C)C(C)(C)COc3ccc(cc3)Cl
CCN(CC)c1cccc(c1)C2=NC(=C(O2)C)CCOc3ccc4c(c3)CC[C@@H]4CC(=O)[O-]
CC1(C[N@@](c2c1cc(cc2)c3ccc(cc3)C(F)(F)F)S(=O)(=O)c4ccc5c(c4)[C@@H](CCC5)CC(=O)[O-])C
CC(C)c1c(c(c(c(n1)C(C)C)CO)c2ccc(cc2)F)C[NH2+]C3CCCCC3
CCN(CC)C(=O)N[C@@H]1C[C@@H]2c3cc(cc4c3C(=C(N4)C)C[C@@H]2[N@@H+](C1)C)C5SCCS5
CCC[N@@H+]1C[C@@H](C[C@@H]2[C@@H]1CC3=C(Nc4c3c2cc(c4)C)SC)NC(=O)N(CC)CC
CCCc1cc(ccc1OCCCOc2ccc3c(c2)CC[C@@H]3CC(=O)[O-])C4=NC(=C(O4)C(=O)C)C
C[C@@H](CC[NH+](Cc1ccccc1)Cc2ccccc2)OC(=O)N3C(COC3(C)C)(C)C
Cc1ccc2c(c1)C(=O)N(N=N2)C[C@@H]3CC[C@@H]([C@@H]3C(=O)[O-])C(=O)c4ccc(cc4)OCC5CCCCC5
CCCC(=O)N1CC[NH+](CC1)CCCCN2C(=S)N(C(=O)C2(C)C)c3ccc(c(c3)C(F)(F)F)C#N
C(C#N)N=N#N
CC[NH+](CC)CCC[C@@]1(C(C(=O)c2ccccc2O1)(CCS(=O)(=O)C)CCS(=O)(=O)C)C
C[C@@H]1C(=O)N(c2ccc(nc2[N@@]1C3CC[NH+](CC3)C)Nc4cccc(c4)S(=O)(=O)N5CCOCC5)C
CC(C)S(=O)(=O)Br
CCS(=O)(=O)N(C)C
CNC(=O)[S-]