When designing the in-memory model for the Iceberg spec (metadata, snapshots, manifests, etc.), there are some problems to consider:
API safety.
For a field in the spec, there are several cases to consider:
- Exists in both v1 and v2, and is required in both.
- Exists in both v1 and v2, and is optional in both.
- Exists in both v1 and v2, but is optional in one and required in the other.
- Exists in v1, but is deprecated in v2.
- Exists in v2, and has a default value in v1.
Let's use the `summary` field in the snapshot spec as an example (it is optional in v1 and required in v2). There are some potential solutions to this:
- Just like in https://github.com/JanKaul/iceberg-rust, we have different models for the v1 and v2 specs, e.g. `SnapshotV1` and `SnapshotV2`. This way we can guarantee the API safety of the model, but it introduces extra maintenance burden when the spec evolves, since most fields are common to both versions.
- Use one struct for both versions (https://github.com/icelake-io/icelake/), and let users do the check at runtime. This way we don't have extra maintenance effort for different spec versions, but we can't guarantee API safety.
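To make the trade-off concrete, here is a minimal sketch of both options for the `summary` field. It is an illustration only, not the actual code of either repo: the field set is trimmed down, and `format_version` and `validate` are hypothetical names introduced for the example.

```rust
use std::collections::HashMap;

/// Option 1: one struct per spec version.
/// `summary` is optional in v1, so the type is `Option<...>`.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotV1 {
    pub snapshot_id: i64,
    pub summary: Option<HashMap<String, String>>,
}

/// `summary` is required in v2, so the compiler enforces its presence.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotV2 {
    pub snapshot_id: i64,
    pub summary: HashMap<String, String>,
}

/// Option 2: a single struct shared by both versions.
/// `format_version` is a hypothetical field used here to drive the runtime check.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub format_version: u8,
    pub snapshot_id: i64,
    pub summary: Option<HashMap<String, String>>,
}

impl Snapshot {
    /// Hypothetical runtime check that the caller must remember to invoke.
    pub fn validate(&self) -> Result<(), String> {
        if self.format_version == 2 && self.summary.is_none() {
            return Err("summary is required in v2".to_string());
        }
        Ok(())
    }
}
```

With the first option a v2 snapshot without a summary cannot be constructed at all; with the second it can, and the error only surfaces if `validate` is called.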
Field access
Currently, in both repos, all fields are accessed directly as public fields. Personally, I'm not in favor of this approach since it may cause some problems:
- Misuse of fields. For example, for `TableMetadata`, when we want to append a new snapshot, we should append to both the snapshots and the snapshot log, but exposing `Vec<Snapshot>` may lead to incorrect usage.
- In-memory data structures should be designed for high-performance access. For example, `TableMetadata` should contain a map from snapshot id to snapshot, rather than a vector of snapshots, so that we can look up a snapshot by id quickly.
- The fields of an in-memory data structure should not map one-to-one to the spec, but should offer friendlier access methods to the user. For example, there should be no `ManifestList` data structure; users should access manifest files through methods on `Snapshot`.
So for in-memory data structures, I would propose hiding all public fields and providing public methods to access the necessary fields.
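A rough sketch of what this could look like for `TableMetadata` follows. The names `SnapshotLogEntry`, `append_snapshot`, `snapshot`, and `snapshot_log` are assumptions made for illustration, and the structs carry only the fields needed to show the idea.

```rust
use std::collections::HashMap;

/// Trimmed-down snapshot with private fields and read-only accessors.
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    snapshot_id: i64,
    timestamp_ms: i64,
}

impl Snapshot {
    pub fn new(snapshot_id: i64, timestamp_ms: i64) -> Self {
        Self { snapshot_id, timestamp_ms }
    }
    pub fn snapshot_id(&self) -> i64 {
        self.snapshot_id
    }
    pub fn timestamp_ms(&self) -> i64 {
        self.timestamp_ms
    }
}

/// Hypothetical snapshot log entry.
#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotLogEntry {
    snapshot_id: i64,
    timestamp_ms: i64,
}

impl SnapshotLogEntry {
    pub fn snapshot_id(&self) -> i64 {
        self.snapshot_id
    }
}

/// Fields are private, so callers cannot update one collection and forget the other.
#[derive(Debug, Default)]
pub struct TableMetadata {
    // Keyed by snapshot id for fast lookup.
    snapshots: HashMap<i64, Snapshot>,
    snapshot_log: Vec<SnapshotLogEntry>,
}

impl TableMetadata {
    pub fn new() -> Self {
        Self::default()
    }

    /// Records the snapshot and its log entry in a single step.
    pub fn append_snapshot(&mut self, snapshot: Snapshot) {
        self.snapshot_log.push(SnapshotLogEntry {
            snapshot_id: snapshot.snapshot_id,
            timestamp_ms: snapshot.timestamp_ms,
        });
        self.snapshots.insert(snapshot.snapshot_id, snapshot);
    }

    /// Looks up a snapshot by id.
    pub fn snapshot(&self, snapshot_id: i64) -> Option<&Snapshot> {
        self.snapshots.get(&snapshot_id)
    }

    /// Read-only view of the snapshot log, in append order.
    pub fn snapshot_log(&self) -> &[SnapshotLogEntry] {
        &self.snapshot_log
    }
}
```

Because `append_snapshot` is the only way to add a snapshot, the map and the log cannot drift apart, and the internal representation can change later without breaking callers.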