Skip to content

Conversation

@eleftherioszisis
Copy link
Collaborator

@eleftherioszisis eleftherioszisis commented Nov 7, 2025

Allow specifying a local asset store when instantiating a Client. The mount location will be used to symlink the files from the assets instead of downloading, if they exist.

Example:

from entitysdk import Client, LocalAssetStore, models
client = Client(..., local_store=LocalAssetStore(prefix="/data"))
client.download_file(...)

If the asset.full_path exists in /data/{full_path} then the file is not downloaded via the api but it is instead symlinked. This works for both client.download_content, client.download_file and all methods that use them.

@eleftherioszisis eleftherioszisis marked this pull request as ready for review November 7, 2025 14:02
@codecov
Copy link

codecov bot commented Nov 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
pytest 100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/entitysdk/__init__.py 100.00% <100.00%> (ø)
src/entitysdk/client.py 100.00% <100.00%> (ø)
src/entitysdk/core.py 100.00% <100.00%> (ø)
src/entitysdk/store.py 100.00% <100.00%> (ø)

... and 9 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@james-isbister
Copy link
Contributor

james-isbister commented Nov 7, 2025

Seems like a nice simple way of specifying the data should be mounted.

Also it seems quite handy that it only creates symlinks if the file exists in the mounted directory. I remember there was some discussion that in some cases we might want to raise an error if the mount doesn't exist? Maybe this behaviour could be specified as an optional bool in the DataMount constructor, if this behaviour is favourable in some cases? If you guys recommend this, it would be better for the DataMount object to do this, then having to do this check manually in every obi-one endpoint I think.

Also do we need some custom logic for the staging functions? At the moment would the staging functions create symlinks when they call download functions internally, and then they'd try to edit the files at those symlinks? Rather we probably want that symlinks are first created through the download functions, but then in the editing step (i.e. for the circuit_config.json), the symlink is deleted and the files to be edited are copied to the specified download locations, and then edited?

@eleftherioszisis
Copy link
Collaborator Author

eleftherioszisis commented Nov 8, 2025

Also it seems quite handy that it only creates symlinks if the file exists in the mounted directory. I remember there was some discussion that in some cases we might want to raise an error if the mount doesn't exist?

When a DataMount is created it checks if the prefix (/data) passed to it exists, otherwise it raises. Is that what you mean? Or do you mean it should raise if a file does not exist in the mount?

Also do we need some custom logic for the staging functions? At the moment would the staging functions create symlinks when they call download functions internally, and then they'd try to edit the files at those symlinks? Rather we probably want that symlinks are first created through the download functions, but then in the editing step (i.e. for the circuit_config.json), the symlink is deleted and the files to be edited are copied to the specified download locations, and then edited?

If there is such a case, the staging functions should not work directly on input data (links) but rather separate inputs/outputs. iirc for circuit_config.json specifically I think it is downloaded with client.download_content, in which case the file from the DataMount is used to read the bytes of the file but there is no symlink made.

In any case, given that the data/ mount point will be read only, if the staging trying to modify these files, it will raise an error.

@james-isbister
Copy link
Contributor

When a DataMount is created it checks if the prefix (/data) passed to it exists, otherwise it raises

Yes that's what I meant. Seems good if that's always behaviour then

If there is such a case, the staging functions should not work directly on input data (links) but rather separate inputs/outputs. iirc for circuit_config.json specifically I think it is downloaded with client.download_content, in which case the file from the DataMount is used to read the bytes of the file but there is no symlink made.

Yes, I didn't mean that the data in the mount would be edited, but rather a file would be created at the path of the symlink (not at the path the symlink points to). But I see now the distinction that the files which are edited during staging use download_content rather than download_file. At least stage_simulation uses download_content for the simulation_config: https://github.com/openbraininstitute/entitysdk/blob/04f80af09e9e93c2ad9a4b09c4f8407737fe4978/src/entitysdk/staging/simulation.py#L55C4-L55C5

This seems like a good pattern, thanks for clarrifying!

Copy link
Contributor

@mgeplf mgeplf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks fine; my concern is that the word Mount doesn't express the intent of what is going on - or at least it confused me at first.

To me, mounting is the act of attaching a filesystem. What is happening here can happen regardless of where the data originates - it's more of a AssetStore or AssetDepot, or perhaps a cache or an indirection?

I hate to nitpick about this, as it's quite an invasive change. Perhaps better a better description would help to make it more clear?

@eleftherioszisis
Copy link
Collaborator Author

The implementation looks fine; my concern is that the word Mount doesn't express the intent of what is going on - or at least it confused me at first.

To me, mounting is the act of attaching a filesystem. What is happening here can happen regardless of where the data originates - it's more of a AssetStore or AssetDepot, or perhaps a cache or an indirection?

I hate to nitpick about this, as it's quite an invasive change. Perhaps better a better description would help to make it more clear?

I couldn't come up with a good enough name that describes the role of that object. I think DataCache feels closer to its true meaning.

@mgeplf
Copy link
Contributor

mgeplf commented Nov 12, 2025

I couldn't come up with a good enough name that describes the role of that object. I think DataCache feels closer to its true meaning.

Ok. I'd still be tempted to name it something with Asset in the name, since it's only assets being saved, not general data.

I'm willing to do the change.

@eleftherioszisis
Copy link
Collaborator Author

After some discussion we converged to LocalAssetStore
If no objections we'll go with that.

@eleftherioszisis eleftherioszisis merged commit ab29917 into main Nov 13, 2025
12 checks passed
@eleftherioszisis eleftherioszisis deleted the mount branch November 13, 2025 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants