
Conversation

@AntonioMirarchi
Contributor

This PR ensures consistency in the HDF5 dataset class. Specifically, when self.cached=True, the _preload_data(self) function correctly handles the case where a 1D embed array is shared across all samples. When self.cached=False, the dataset instead builds tensors on the fly, assuming:

tensor_input = [[d[i]]] if d.ndim == 1 else d[i]

However, this assumption caused errors when embed was shared across samples, since it incorrectly indexed a single element of the shared 1D embed array instead of returning the per-node embeddings for the whole sample.

This PR fixes the issue and adds a pytest case that verifies consistency with and without caching.
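
For reference, a minimal sketch of the behaviour the fix aims for in the non-cached path. The names here (get_embed, embed_dset, sample_idx) are illustrative, not the actual code in the HDF5 class:

```python
import numpy as np
import torch

def get_embed(embed_dset, sample_idx):
    # Hypothetical helper, not the actual HDF5 dataset code.
    d = np.asarray(embed_dset)
    if d.ndim == 1:
        # Shared case: one 1D array of atom types used by every sample,
        # so return it whole instead of indexing element `sample_idx`.
        return torch.tensor(d, dtype=torch.long)
    # Per-sample case: one row of atom types per sample.
    return torch.tensor(d[sample_idx], dtype=torch.long)
```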

@AntonioMirarchi
Contributor Author

@stefdoerr @sef43 can you review this?

Only one consideration regarding the need to obtain a tensor with torch.Size([1, 1]) if d.ndim == 1: previously, [[d[i]]] was used, whereas now, to maintain consistency with the _preload_data() function, I return a tensor of torch.Size([1]) using [d[i]]. I think it's fine because module.py takes care of it here.
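
For reference, the shape difference mentioned above, with toy values:

```python
import torch

d = [1.5, 2.0, 3.1]  # toy stand-in for a 1D per-sample scalar field
i = 0

old = torch.tensor([[d[i]]])  # previous behaviour: shape torch.Size([1, 1])
new = torch.tensor([d[i]])    # current behaviour:  shape torch.Size([1])
```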

@stefdoerr
Collaborator

Can't we change the dataloader to tile the atom types instead of duplicating information in the file (and wasting space on disk)?
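
For illustration, one way such tiling could look; this is a sketch of the suggestion, not the actual dataloader code, and the names are hypothetical:

```python
import numpy as np

# Store the atom types once ...
types = np.array([1, 6, 8, 1])          # one entry per atom, written to disk once
n_samples = 100

# ... and expand them to a per-sample view at load time instead of
# duplicating the same row n_samples times inside the HDF5 file.
tiled = np.tile(types, (n_samples, 1))  # shape (n_samples, n_atoms), built in memory
```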

@AntonioMirarchi
Contributor Author

If you are referring to the write_as_hdf5() function, it is only used in tests/test_datasets.py.

@stefdoerr merged commit d616c8a into torchmd:main Feb 6, 2025
6 checks passed
@AntonioMirarchi deleted the hdf5 branch February 6, 2025 14:19