@John-P
Contributor

@John-P John-P commented Aug 19, 2021

Add classes which act as a mutable mapping (dictionary-like) structure and enable efficient management of annotations. An annotation here is defined as a geometry and some associated JSON data. Currently supported features are:

  • A common interface for multiple backends:
    • SQLite (with an rtree index for fast spatial queries)
    • Pure Python dict (stored as geoJSON on disk)
  • Serialisation and deserialisation to and from disk
  • Compression (SQLite only)
  • Conversion to and from other formats:
    • geoJSON
    • JSON Lines / ndjson
    • Pandas dataframe
  • Spatial queries using a bounding box or shapely Polygon
  • Customisable binary shape predicate (defaults to intersection)
  • Query predicates on properties using:
    • A subset of python (fastest, compatible across backends with some limitations)
    • A pickled function
    • A python callable (slowest but easiest and can be any Python code)
  • Custom indexes (SQLite only) for accelerated queries

Example (SQLiteStore)

from shapely.geometry import Polygon
from tiatoolbox.annotation.storage import SQLiteStore, Annotation

store = SQLiteStore("polygons.db")
# Create a test geometry (polygon, point, or line string)
triangle = Polygon([(0, 0), (1, 1), (0, 2)])

# Store an annotation geometry with a class label
key = store.append(Annotation(triangle, {"class": 1}))
# The value returned is a unique key (UUID4 by default)

# Get the stored annotation (tuple of geometry and properties dict returned)
print(store[key])

# Store an annotation geometry with a custom key
store.append(Annotation(triangle), key="foo")
print(store["foo"])
# or use __setitem__ syntax:
store[key] = Annotation(triangle)

# Change the properties of an annotation in the store
store.patch(key, {"class": 2})

# Query in a bounding box
results = store.query([0, 0, 128, 128])

# Query using any polygon
results = store.query(Polygon.from_bounds(0, 0, 128, 128))

# Query in a bounding box but return only the indexes (rather than the polygons and properties).
# This can be much faster than returning geometries or doing a geometry query,
# as the indexes are small and neither the geometries nor the properties have to
# be decoded for the query to run.
results = store.query_index([0, 0, 128, 128])

# Query with a predicate statement
results = store.query([0, 0, 128, 128], where="props['class']==4")

# Create an index (SQLite only)
# Can give significant speedup (100x in some tests) even for simple property access e.g.
key = store.create_index("example_index", "props['class']==4")
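
To illustrate the difference between the string and callable forms of a property predicate listed above, here is a minimal sketch. It does not use the store classes from this PR: `matches` is a hypothetical helper, and `eval` with emptied builtins only stands in for the restricted Python subset (it is not a safe sandbox and should not be used on untrusted input).

```python
def matches(where, props):
    """Evaluate a property predicate given as a string or as a callable.

    Hypothetical helper for illustration; not part of the PR's API.
    """
    if isinstance(where, str):
        # String form: a Python expression that can only see `props`.
        return bool(eval(where, {"__builtins__": {}}, {"props": props}))
    # Callable form: any Python function taking the properties dict.
    return bool(where(props))


records = [{"class": 4}, {"class": 1}, {"class": 4, "score": 0.9}]

by_string = [p for p in records if matches("props['class']==4", p)]
by_callable = [p for p in records if matches(lambda props: props["class"] == 4, p)]
```

Both forms select the same records. The string form has the advantage that a backend could inspect or translate the expression, whereas a callable can only be run as-is.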

To-Dos

  • Swap to using strings for keys.
  • Default to UUID4 if no key given.
  • Allow custom string keys.
  • Change query shape predicate (via string kwarg).
  • Deepsource passing.
  • Query predicates.
  • Custom indexes.
  • Annotation compression (optional, defaults to zlib for sqlite).
  • Metadata storage (e.g. compression method, tiatoolbox version)
  • Test Coverage > 99%.
  • Grouping during query.
  • Use dataclass instead of tuple for annotations?
  • Conform to / test Python MutableMapping interface
    • ABC
      • __getitem__
      • __setitem__
      • __delitem__
      • __iter__
      • __len__
    • Mixins
      • __contains__
      • keys
      • items
      • values
      • get
      • __eq__
      • __ne__
      • pop
      • popitem
      • clear
      • update
      • setdefault
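
As a rough sketch of why the list above is split into ABC methods and mixins: subclassing `collections.abc.MutableMapping` and implementing only the five abstract methods provides the mixin methods for free. `DictStore` and its `append` method below are hypothetical stand-ins written for this illustration, not the classes added in this PR.

```python
import uuid
from collections.abc import MutableMapping


class DictStore(MutableMapping):
    """Toy in-memory store; only the five abstract methods are written by hand."""

    def __init__(self):
        self._rows = {}

    def __getitem__(self, key):
        return self._rows[key]

    def __setitem__(self, key, value):
        self._rows[key] = value

    def __delitem__(self, key):
        del self._rows[key]

    def __iter__(self):
        return iter(self._rows)

    def __len__(self):
        return len(self._rows)

    def append(self, value, key=None):
        """Add a value, generating a UUID4 string key if none is given."""
        key = key or str(uuid.uuid4())
        self[key] = value
        return key


store = DictStore()
key = store.append({"class": 1})
store.update({"foo": {"class": 2}})  # mixin method, built on __setitem__
popped = store.pop("foo")            # mixin method, built on __getitem__/__delitem__
```

A backend with a faster native implementation (for example, a single SQL statement for `__contains__` or `__len__`) can still override individual mixins.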

@John-P John-P marked this pull request as draft August 19, 2021 23:00
@shaneahmed shaneahmed added the enhancement New feature or request label Aug 20, 2021
@John-P John-P added this to the Release v1.0.0 milestone Aug 25, 2021
@John-P
Contributor Author

John-P commented Aug 27, 2021

At the moment, export to various formats is done via sub-classes with different load and dump methods. I am considering changing this to a single class with to_format and from_format functions, as with Pandas DataFrames. This would make it easier to read in one format and output to another. This can currently be done by changing the class after reading (casting), but this is a pain to do in Python.
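
A minimal sketch of the single-class design being considered, using only the standard library. The class name `AnnotationTable`, the method names, and the record layout (`key` / `properties`) are all assumptions made for illustration, and geometry is omitted to keep the example short.

```python
import json


class AnnotationTable:
    """Toy container with Pandas-style from_*/to_* converters (hypothetical)."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})

    @classmethod
    def from_ndjson(cls, text):
        """Build from JSON Lines text, one record per line."""
        rows = {}
        for line in text.splitlines():
            if line.strip():
                record = json.loads(line)
                rows[record["key"]] = record["properties"]
        return cls(rows)

    def to_ndjson(self):
        """Serialise to JSON Lines text, one record per line."""
        return "\n".join(
            json.dumps({"key": key, "properties": props})
            for key, props in self.rows.items()
        )

    @classmethod
    def from_json(cls, text):
        """Build from a single JSON object mapping keys to properties."""
        return cls(json.loads(text))

    def to_json(self):
        """Serialise to a single JSON object mapping keys to properties."""
        return json.dumps(self.rows)


ndjson = '{"key": "a", "properties": {"class": 1}}\n{"key": "b", "properties": {"class": 2}}'
table = AnnotationTable.from_ndjson(ndjson)          # read one format...
as_json = table.to_json()                            # ...write another
round_trip = AnnotationTable.from_json(as_json).to_ndjson()
```

With this layout, converting between formats is a read followed by a write on the same object, with no casting between sub-classes.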

@vqdang
Contributor

vqdang commented Aug 27, 2021

Escalate to here for tracking. Basically, I want to do this but I haven't found a way to via the current API.

inst_dict = {
  UUID: {
     'box' : number[],
     'contour' : number[],
  }
}
store  = SQLite3RTreeStore('dumb.db')
store.remove(list_of_uuid)
store.append(list_of_uuid)

Given the sample API in the OP. I guess you would like users to do this?

box_store  = SQLite3RTreeStore('dumb.db')
box_store.append([v['box'] for v in inst_dict])

contour_store  = SQLite3RTreeStore('dumb2.db')
contour_store.append([v['contour'] for v in inst_dict])

But removal will require UUID to sync. Also, may need to write to the store based on UUID rather than simply appending any geometries.

@John-P
Contributor Author

John-P commented Aug 27, 2021

Escalate to here for tracking. Basically, I want to do this but I haven't found a way to via the current API.

inst_dict = {
  UUID: {
     'box' : number[],
     'contour' : number[],
  }
}
store  = SQLite3RTreeStore('dumb.db')
store.remove(list_of_uuid)
store.append(list_of_uuid)

...

You can currently add a list of polygons. However, you cannot add a list of just indexes/UUIDs because the rtree data structure requires the geometry. You can do something equivalent like this:

store  = SQLite3RTreeStore('dumb.db')
list_of_ids = store.append(list_of_polygons)
store.remove(list_of_ids)

I haven't exposed a way to let you specify the ID at the moment as this is generated as a hash of the geometry when appending to avoid duplicate geometries. This could be changed to be UUIDs instead or a manual ID could be allowed but would require some extra error handling etc.

In your above sample, the geometry would not be serialised to disk. Therefore it would not be known when loaded again later and it would not be possible to spatially query the data. Additionally, storing the geometry using the class (rather than just using a bounding box) allows for optimised polygon intersection queries.
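
To make the trade-off between hash-derived IDs and UUIDs concrete, here is a small standard-library sketch. The hashing scheme (SHA-1 over a text encoding of the vertices) and the name `geometry_key` are assumptions for illustration only; the hash actually used by the store is not shown in this thread.

```python
import hashlib
import uuid


def geometry_key(coords):
    """Derive a deterministic key from a sequence of (x, y) vertices."""
    payload = ",".join(f"{float(x)!r} {float(y)!r}" for x, y in coords)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


triangle = [(0, 0), (1, 1), (0, 2)]

store = {}
store[geometry_key(triangle)] = triangle
# Appending the same geometry again maps to the same key, so no duplicate is stored.
store[geometry_key([(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)])] = triangle

# A UUID4 key is unrelated to the geometry, so duplicates would be kept.
random_key = str(uuid.uuid4())
```

A content-derived key deduplicates automatically but cannot distinguish two annotations that share a geometry, while a UUID or user-supplied key allows that at the cost of extra error handling.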

@John-P
Contributor Author

John-P commented Aug 27, 2021

...
Given the sample API in the OP. I guess you would like users to do this?

box_store  = SQLite3RTreeStore('dumb.db')
box_store.append([v['box'] for v in inst_dict])

contour_store  = SQLite3RTreeStore('dumb2.db')
contour_store.append([v['contour'] for v in inst_dict])

...

Yes, the way it is currently implemented you could either make two stores (one for boxes and one for polygons), or you could store them as separate annotations in the same store. However, I am unsure why you need to store both. You can get the bounding box from the polygon via polygon.bounds. The bounding box is already stored for the rtree indexing (it is found at append time via polygon.bounds). This could be exposed if it is useful. However, it would add complexity, and I cannot currently see why it is required.

@John-P
Contributor Author

John-P commented Aug 27, 2021

...
But removal will require UUID to sync. Also, may need to write to the store based on UUID rather than simply appending any geometries.

It sounds like you want something to act as an rtree index without storing the geometry, with you handling the annotations in memory in a separate structure such as a list or dictionary. For this I would suggest simply using an RTree class (as in shapely or the rtree package). I could add a class to do this in memory with sqlite if you like (essentially the same as the current class but with no storage of geometry or properties to disk, just the rtree and an ID string). However, you would be losing the benefits of optimised storage of geometry on disk, fast queries on large numbers of annotations (more than could fit in memory), etc.

This PR is aimed more at creating a way to read and write a large number of annotations to and from disk efficiently (fast and with low memory usage, so that you can work with more annotations than would fit in memory at once). Here the store class effectively is your dict of geometries and properties. It handles the generation of IDs, spatial indexing, and reading and writing from disk for you. You would only keep another list or dict in memory as a working set, e.g. for performing operations on a subset of the annotations before updating the store class.
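
The working-set pattern described above can be sketched with the standard `sqlite3` module. This is only an illustration: a plain table stands in for the store, there is no rtree index or geometry, and the table and column names are made up for the example.

```python
import json
import sqlite3

# ":memory:" keeps the example self-contained; a file path would persist to disk.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE annotations (key TEXT PRIMARY KEY, properties TEXT)")
con.executemany(
    "INSERT INTO annotations VALUES (?, ?)",
    [(f"cell-{i}", json.dumps({"class": i % 3})) for i in range(1000)],
)

# Pull only a small subset into memory as the working set.
working = {
    key: json.loads(props)
    for key, props in con.execute(
        "SELECT key, properties FROM annotations WHERE key IN (?, ?)",
        ("cell-1", "cell-2"),
    )
}

# Operate on the working set in memory.
for props in working.values():
    props["reviewed"] = True

# Write the changes back to the store.
con.executemany(
    "UPDATE annotations SET properties = ? WHERE key = ?",
    [(json.dumps(props), key) for key, props in working.items()],
)
con.commit()

reviewed = con.execute(
    "SELECT COUNT(*) FROM annotations WHERE properties LIKE '%reviewed%'"
).fetchone()[0]
```

Only the two selected rows are ever decoded into Python objects; the remaining rows stay in the database untouched.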

Member

@shaneahmed shaneahmed left a comment

Thanks @John-P
Please make the requested changes. We can merge then.

@John-P John-P requested a review from shaneahmed November 23, 2021 15:55
New line at the end of docstring.
Member

@shaneahmed shaneahmed left a comment

Thanks @john

@shaneahmed shaneahmed merged commit 87c83a2 into develop Nov 24, 2021
@shaneahmed shaneahmed deleted the feature-annotation-store branch November 24, 2021 09:58
@shaneahmed shaneahmed mentioned this pull request Dec 22, 2021
shaneahmed added a commit that referenced this pull request Dec 23, 2021
### Major Updates and Feature Improvements
- Adds nucleus instance segmentation base class
  - Adds  [HoVerNet](https://www.sciencedirect.com/science/article/abs/pii/S1361841519301045) architecture
- Adds multi-task segmentor [HoVerNet+](https://arxiv.org/abs/2108.13904) model
- Adds <a href="https://www.thelancet.com/journals/landig/article/PIIS2589-7500(2100180-1/fulltext">IDaRS</a> pipeline
- Adds [SlideGraph](https://arxiv.org/abs/2110.06042) pipeline
- Adds PCam patch classification models
- Adds support for stain augmentation feature
- Adds classes and functions under `tiatoolbox.tools.graph` to enable construction of graphs in a format which can be used with PyG (PyTorch Geometric).
- Add classes which act as a mutable mapping (dictionary like) structure and enables efficient management of annotations. (#135)
- Adds example notebook for adding advanced models
- Adds classes which can generate zoomify tiles from a WSIReader object.
- Adds WSI viewer using Zoomify/WSIReader API (#212)
- Adds README to example page for clarity
- Adds support to override or specify mpp and power

### Changes to API
- Replaces `models.controller` API with `models.engine`
- Replaces `CNNPatchPredictor` with `PatchPredictor`

### Bug Fixes and Other Changes
- Fixes `filter_coordinates` reading wrong resolutions for patch extraction
- For `PatchPredictor`
  - `ioconfig` will supersede everything
  - if `ioconfig` is not provided
    - If `model` is pretrained (defined in `pretrained_model.yaml` )
      - Use the yaml ioconfig
      - Any other input patch reading arguments will overwrite the yaml ioconfig (at the same keyword).
    - If `model` is not defined, all input patch reading arguments must be provided, otherwise an exception will be thrown.
- Improves performance of mask based patch extraction

### Development related changes
- Improve tests performance for Travis runs
- Adds feature detection mechanism to detect the platform and installed packages etc.
- On demand imports for some libraries for performance
- Improves performance of mask based patch extraction

Co-authored-by: Shan Raza <[email protected]>
@shaneahmed shaneahmed mentioned this pull request Dec 23, 2021
shaneahmed added a commit that referenced this pull request Dec 23, 2021
- Bump version: 0.8.0 → 1.0.0

Co-authored-by: @tialab