Skip to content

Discussion: The design of in memory model of iceberg spec. #3

@liurenjie1024

Description

@liurenjie1024

When designing in memory model for iceberg spce(metadata, snapshots, manifests etc), there are some problems to consider:

API safety.

For a field in spec, there are several cases to consider:

  1. Exists in both v1 and v2, and both are required
  2. Exists in both v1 and v2, and both are optional
  3. Exists in both v1 and v2, but optional in one and require in another.
  4. Exists in v1, but deprecated v2.
  5. Exists in v2, and have default value in v1.

Let's use the summary field in snapshot spec as an example. There are some potential solutions to this:

  1. Just like in https://github.com/JanKaul/iceberg-rust, we have different models for v1 and v2 spec, e.g. SnapshotV1 and SnapshotV2. This way we can guarantee the api safety of model, but it introduces extra maintaince burden when spec evolves, since most fields are common in both spec.
  2. Use one struct for both version(https://github.com/icelake-io/icelake/), and let users to do the check at runtime. This way we don't have extra maintaince effort for different specs, but we can't guarantee the api safety.

Field access

Currently, in both two repos, all fields are accessed through the public field. Personally, I'm not in favor of this approach since it may cause some problems:

  1. Misuse of fields. For example, for TableMetadata, when we want to append a new snapshot, we should both append snapshot and snapshot log, but exposing Vec<Snapshot> may lead to wrong usage.
  2. In-memory data structure should be designed for high-performance access. For example, TableMetadata should contain a map of snapshot id to snapshot, rather than a vector of snapshots, so that we can access snapshot by id fast.
  3. The in-memory data struct fields should not have a one-one mapping as spec, but should provide more friendly access methods to user. For example, there should no ManifestList data structure, and user should access manifest fiels from Snapshot's method.

So for in-memory data structure, I would propose hiding all public fields and provide public methods to access necessary fields.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions