This repository was archived by the owner on Jun 29, 2022. It is now read-only.

Conversation

@rvagg
Member

@rvagg rvagg commented Apr 9, 2019

This is a WIP and I'm mainly interested in feedback from folks who have been thinking about this stuff longer than me.

I've done a few things here:

  • Removed "HAMT" from the README and collapsed UnixFS v2 and MFS down a level—the HAMT for MFS is an implementation detail that I don't think is interesting for the purpose of this document (and there's a chance the HAMT used for MFS will be an MFS-specific thing rather than something to encourage end-users to adopt; I think).
  • Removed HAMT.md for now. It could come back later, but it's a distraction in the high-level overview. I'd argue that if it does come back it should have a name like "HashMap.md" (see next point). I'd love to see that directory fill up with a bunch of different data structures, both general utility collections and specific use-case-focused ones (geo, etc.).
  • Introduced "Utility Collections" as a component on top of "Complex Data Structures". The intent here is that these are data structures that are primarily focused on end-user application stacks. Of course a HAMT is likely going to be in there but that's an implementation detail that we really don't need to list front-and-center. IPLD data structures should have use-focused names where possible, not algorithm names. We want people to actually use these things, not just admire how clever it all is.
  • Added an overview of "Utility Collections" with some high-level thoughts on what it's aiming for, what types of data structures should be in there, what kinds of operations they'd expose, and some prior art from other standard libraries. At some point I'd also like to list the algorithms that those standard libraries use because some of them may be portable to IPLD, like HAMT. There are also other places to dig for algorithms of course, like database storage engines, and I'd like to get to that too, but that information may not fit here.

/cc @mikeal @vmx @warpfork - who else?

@ghost ghost assigned rvagg Apr 9, 2019
@rvagg rvagg force-pushed the rvagg/collections-docs branch from 807fb09 to 4585b77 Compare April 9, 2019 11:28
Member

@vmx vmx left a comment

Great write-up.

A general note (not really specific to this PR). Could we use lowercase directories/filenames? That makes things on the terminal on case-sensitive systems way easier :)


Collections are a fundamental primitive in every programming language. Being able to organize data into collections that allow for convenient and efficient access and modification is a core activity in programming.

While IPLD is not a programming languagelement, indext represents enormous potential for sharing data and providing access to diverse and very large data sets. With sufficient data organization primitives, IPLD can replace many functions traditionally provided by a centralized database system. Client applications should be able to access and even manipulate data structures stored across many peers, from trivial lists to massive and complex data sets that are exposed with efficient query and search operations.
Member

First sentence has typos and I can't really make sense of it.

* `Remove(element)`
* `Iterate()`
* `Contains(element)?`
* `Size()?`
Contributor

I’m always concerned when we encode information like this into the root node, because you could just lie ;)

Without traversing the entire structure you don’t actually know the real size. For instance, we encode the length of files into unixfs-v2 but it can’t really be trusted. It’s useful in interfaces for showing “this file/directory is {size}” without actually requesting all the blocks, but I’m constantly worried people will trust it when they shouldn’t.

If we put a size property in the base interface I’m concerned that it will lead people to trust it, and it’s actually very easy to lie.

Member Author

I don't think that's built into this document though, I'm leaving that open to the implementation and just listing this as a possible operation. In fact, I was just assuming here that it would be implemented as an actual traversal calculation if it's implemented at all. It's what I did in iamap but the user has to know the cost of calling it!
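
To make the trade-off in this thread concrete, here is a minimal Python sketch. It is not taken from iamap or any IPLD implementation: the block layout (`claimed_size`, `entries`, `children`) and the in-memory `store` dictionary are assumptions invented for illustration. It contrasts a size recorded in the root node, which a writer can set to anything, with a size computed by traversing every block.

```python
import hashlib
import json

store = {}  # hypothetical block store: content hash -> encoded block


def put(block):
    """Encode a block, store it under its SHA-256 hash, and return the hash."""
    data = json.dumps(block, sort_keys=True).encode()
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key


def get(key):
    """Load and decode a block by its hash."""
    return json.loads(store[key])


def claimed_size(root_key):
    """Cheap: reads one block, but trusts whatever the writer recorded."""
    return get(root_key)["claimed_size"]


def traversed_size(root_key):
    """Expensive: loads every block, but counts what is really there."""
    node = get(root_key)
    return len(node["entries"]) + sum(traversed_size(c) for c in node["children"])


leaf_a = put({"claimed_size": 2, "entries": ["a", "b"], "children": []})
leaf_b = put({"claimed_size": 1, "entries": ["c"], "children": []})
# The root advertises 1000 elements although only 3 exist below it.
root = put({"claimed_size": 1000, "entries": [], "children": [leaf_a, leaf_b]})

print(claimed_size(root), traversed_size(root))
```

The hash of the root block still verifies, because the false claim is part of the hashed content. Content addressing proves the block is the one that was asked for, not that its metadata is truthful.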

README.md Outdated
│ MFS │ │ End-User Applications │
│ │ │ │
├───────────┤ ├─────────────────┬──────────────────┤ ┌───────────┐
│ │ │ Structured Data │ Utility │ │ │
Contributor

@mikeal mikeal Apr 9, 2019

“Utility Collections” should actually just replace “Complex Data Structures.” unixfs relies on hamt which is a collection, and there’s no way you can make SQL and GeoSpatial interfaces without these as well.

Member Author

@rvagg rvagg Apr 9, 2019

I think I need to understand the geo etc. stuff better. I'm thinking the name "Specialized Data Structures" might be more suited and maybe "End-User Applications" should wrap around it so it's still also sitting on "Utility Collections".

So, I get GeoSpatial, but I don't know what "VR" is referring to in this context and I'm not sure "SQL" makes sense under that heading because it's an interfacing method rather than a collection type. @mikeal can you expand on the thoughts behind this category a bit?

Member

I agree with @rvagg's version here. That's also my view on things. The old version was wrong anyway: a Geo index is not a sorted index.

@rvagg
Member Author

rvagg commented Apr 10, 2019

Oh, I've just worked out why some of the words are jumbled up. Before pushing this I replaced the variable names I used in the methods. They looked like Add(e, i), so I did a global replace of "e, i" with "element, index" but obviously didn't bound some of those with parens! I'll clean that up. It also looks like some of the methods are missing arguments.
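
The failure mode described here is easy to reproduce. The sketch below is illustrative Python, not the tooling actually used on this PR, and the example sentence is the presumed original wording of the garbled line. It shows the unbounded replace next to one bounded by the parentheses.

```python
import re

sentence = "While IPLD is not a programming language, it represents enormous potential"
signature = "Add(e, i)"

# Unbounded replace: also matches the "e, i" hidden inside "language, it".
broken = sentence.replace("e, i", "element, index")

# Bounded replace: only matches the argument list, parentheses included.
pattern = re.compile(r"\(e, i\)")
safe_sentence = pattern.sub("(element, index)", sentence)
safe_signature = pattern.sub("(element, index)", signature)

print(broken)
print(safe_sentence)
print(safe_signature)
```

Only the first result contains "languagelement, indext". The bounded pattern leaves the prose untouched and rewrites the method signature as intended.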

@rvagg rvagg force-pushed the rvagg/collections-docs branch from 42a715d to ce6cb1c Compare April 10, 2019 01:21
@rvagg
Member Author

rvagg commented Apr 10, 2019

@mikeal

Re the floor/ceil thing: it was supposed to be implicit and it's really up to the collection implementor to decide on the reasons for exposing things in different ways. I've added this paragraph as well as some clarifying notes in the sorted collections:

Operations exposed by collections may depend on user ergonomics and the practicalities of the underlying algorithms. For example, an ordered collection may expose Floor() and Ceiling() convenience operations or only expose iterators with floor/ceiling modifiers that can serve the same purpose. The nature of collections algorithms is such that there may be efficiency reasons whereby apparent convenience methods provide significant performance gains over their long-hand versions.
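
As a rough illustration of the two styles named in that paragraph, the following Python sketch exposes both an iterator with floor/ceiling start positions and `floor()`/`ceiling()` convenience methods. The class and method names are hypothetical and the collection is a plain in-memory sorted list, not a multi-block structure.

```python
import bisect


class SortedCollection:
    """Hypothetical in-memory ordered collection; not an IPLD API."""

    def __init__(self, items=()):
        self._items = sorted(items)

    def iterate(self, start=None, reverse=False):
        """Yield elements in order, beginning at the floor/ceiling of start."""
        if reverse:
            # Descending, starting from the largest element <= start.
            end = len(self._items) if start is None else bisect.bisect_right(self._items, start)
            yield from reversed(self._items[:end])
        else:
            # Ascending, starting from the smallest element >= start.
            begin = 0 if start is None else bisect.bisect_left(self._items, start)
            yield from self._items[begin:]

    def floor(self, value):
        """Largest element <= value, or None if there is none."""
        return next(self.iterate(start=value, reverse=True), None)

    def ceiling(self, value):
        """Smallest element >= value, or None if there is none."""
        return next(self.iterate(start=value), None)


c = SortedCollection([10, 20, 30])
print(c.floor(25), c.ceiling(25), c.floor(5), c.ceiling(35))
```

Here the convenience methods are thin wrappers over the iterator, so they cost the same. In a collection spread over many blocks, an implementor might instead give `floor()` its own lookup path if that avoids loading blocks the iterator would otherwise touch.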

The same thing goes for Push(), Pop() vs Get(0), Set(e, 0) and a bunch of other cases. It's actually pretty eye-opening if you look across the standard library collection landscape: there are compromises and leaky abstractions absolutely everywhere, and I think that's just the nature of the beast. It'd be nice to have only a couple of pure collections that can serve every purpose, but there are good reasons that java.util.Collection has 55 "known implementing classes" in the base Java standard library alone. The trick is to provide some generic ones that are generically useful and some specific ones that are performant and/or ergonomic for some specific cases. That's repeated across ecosystems. That's where I've gone with this document.

Re the block diagrams: I've rejiggered it again in a way that I think reflects the ideal reality. "Utility Collections" sits underneath them all, and I've renamed that other block to "Specialized Data Structures & Utilities: VR, Geo, SQL, etc.". GeoSpatial will be more about data structures, I think, but SQL will be more about utilities on top of the base collections. That's where the interesting ecosystem of specialized use-cases will evolve and things will get really interesting!

@vmx
Member

vmx commented Apr 10, 2019

Re the block diagrams

I would add "Maps, Sets, etc." back to "Utility Collections", at least to me it makes things clearer.

@rvagg
Member Author

rvagg commented Apr 16, 2019

I think this is OK to land, aside from the "Map" and "List" naming conflicts discussed in #112 (comment) which I wouldn't mind clearing up a little.

@rvagg
Member Author

rvagg commented Apr 17, 2019

In an attempt to deal with the "Map" and "List" confusion with the data model, I've tried the following:

  • Renamed "Utility Collections" to "Multi-block Collections"
  • Added an introductory paragraph:

This document will re-use some terms found in the IPLD data model, in particular "Map" and "List". The two uses should not be confused, because they operate at different layers of the IPLD stack. In the context of the data model, these names represent forms that are serialized into blocks along with other primitive data kinds. However, a "Map" or a "List" as a multi-block collection is a structure that is mapped onto many blocks (making use of the primitive kinds in the data model within those blocks in various ways) and that exposes interfaces for building and interacting with complex and potentially arbitrarily large data structures. A multi-block collection combines specific data model encoding for individual blocks as well as logic that ties multiple blocks together into a useful data structure.

Here's what the stack looks like:

[image: ipld-stack]

Does that help or just make it more confusing?
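
One way to picture the distinction drawn in the introductory paragraph above is the Python sketch below. It is an assumption-laden illustration, not a proposed format: the `blocks` dictionary, the JSON encoding, the `width` parameter, and the single-level layout are all made up. Each block holds an ordinary data model List, while the multi-block List is the logic that ties those blocks together and answers `Get(index)`.

```python
import hashlib
import json

blocks = {}  # hypothetical block store: content hash -> encoded block


def put(node):
    """Encode a node as one block and return its SHA-256 hash."""
    data = json.dumps(node, sort_keys=True).encode()
    key = hashlib.sha256(data).hexdigest()
    blocks[key] = data
    return key


def get(key):
    """Load and decode a block by its hash."""
    return json.loads(blocks[key])


def build_list(values, width=3):
    """Store values as leaf blocks of at most `width` items, linked from a root."""
    leaves = [put({"values": values[i:i + width]})
              for i in range(0, len(values), width)]
    # "leaves" is a data model List living inside a single block (the root).
    return put({"width": width, "leaves": leaves})


def list_get(root_key, index):
    """Multi-block List Get(index): loads the root, then exactly one leaf."""
    root = get(root_key)
    leaf = get(root["leaves"][index // root["width"]])
    return leaf["values"][index % root["width"]]


root = build_list(list(range(100, 110)))
print(len(blocks), list_get(root, 0), list_get(root, 7))
```

A single root that links every leaf will eventually grow too large for one block, so a structure meant to be arbitrarily large would need further levels of nesting. That is left out here to keep the sketch short.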

@rvagg rvagg force-pushed the rvagg/collections-docs branch from 50050b5 to fb16d49 Compare April 26, 2019 04:44
@rvagg rvagg mentioned this pull request Apr 26, 2019
@rvagg
Member Author

rvagg commented May 1, 2019

I'd like to merge this, does anyone have objections? It's not a "spec" per se, more of an introduction to specs that will hopefully fill this directory.

@mikeal
Contributor

mikeal commented May 1, 2019

It's not a "spec" per se, more of an introduction to specs that will hopefully fill this directory.

We have a bunch of those so that shouldn’t be a blocker. An action item I have for us to talk through at the summit is some kind of staging process for specs so that we can clearly communicate the state and intention of different specs as we are currently documenting things that are not implemented, being implemented, and even things that are fully implemented that we are actively seeking to move away from. So, spec activity and newness is not a very useful signal and we’ll need to find something more explicit.

Anyway, +1 to merge.

@rvagg rvagg merged commit e918c10 into master May 2, 2019
@ghost ghost removed awaiting review status/in-progress In progress labels May 2, 2019
@rvagg rvagg deleted the rvagg/collections-docs branch May 2, 2019 02:36
Stebalien pushed a commit to Stebalien/specs that referenced this pull request Sep 18, 2019
prataprc pushed a commit to iprs-dev/ipld-specs that referenced this pull request Oct 13, 2020