Initial utility collections docs #110

rvagg · 2019-04-09T11:24:13Z

This is a WIP and I'm mainly interested in feedback from folks who have been thinking about this stuff longer than me.

I've done a few things here:

* Removed "HAMT" from the README and collapsed UnixFS v2 and MFS down a level. The HAMT for MFS is an implementation detail that I don't think is interesting for the purpose of this document (and I think there's a chance the HAMT used for MFS will be an MFS-specific thing rather than something to encourage end-users to adopt).
* Removed HAMT.md for now. It could come back later, but it's a distraction in the high-level overview. I'd argue that if it does come back it should have a name like "HashMap.md" (see next point). I'd love to see that directory fill up with a bunch of different data structures: general utility collections and specific use-case-focused ones (geo, etc.).
* Introduced "Utility Collections" as a component on top of "Complex Data Structures". The intent here is that these are data structures that are primarily focused on end-user application stacks. Of course a HAMT is likely going to be in there, but that's an implementation detail that we really don't need to list front-and-center. IPLD data structures should have use-focused names where possible, not algorithm names. We want people to actually use these things, not just admire how clever it all is.
* Added an overview of "Utility Collections" with some high-level thoughts on what it's aiming for, what types of data structures should be in there, what kinds of operations they'd expose, and some prior art from other standard libraries. At some point I'd also like to list the algorithms that those standard libraries use, because some of them may be portable to IPLD, like HAMT. There are also other places to dig for algorithms of course, like database storage engines, and I'd like to get to that too, but that information may not fit here.

vmx

Great write-up.

A general note (not really specific to this PR). Could we use lowercase directories/filenames? That makes things on the terminal on case-sensitive systems way easier :)

vmx · 2019-04-09T13:26:23Z

Data-Structures/Utility-Collections.md

+
+Collections are a fundamental primitive in every programming language. Being able to organize data into collections that allow for convenient and efficient access and modification is a core activity in programming.
+
+While IPLD is not a programming languagelement, indext represents enormous potential for sharing data and providing access to diverse and very large data sets. With sufficient data organization primitives, IPLD can replace many functions traditionally provided by a centralized database system. Client applications should be able to access and even manipulate data structures stored across many peers, from trivial lists to massive and complex data sets that are exposed with efficient query and search operations.


First sentence has typos and I can't really make sense of it.

mikeal · 2019-04-09T19:46:51Z

Data-Structures/Utility-Collections.md

+* `Remove(element)`
+* `Iterate()`
+* `Contains(element)?`
+* `Size()?`


I’m always concerned when we encode information like this into the root node, because you could just lie ;)

Without traversing the entire structure you don’t actually know the real size. For instance, we encode the length of files into unixfs-v2 but it can’t really be trusted. It’s useful in interfaces for showing “this file/directory is {size}” without actually requesting all the blocks, but I’m constantly worried people will trust it when they shouldn’t.

If we put a size property in the base interface I’m concerned that it will lead people to trust it, and it’s actually very easy to lie.

I don't think that's built into this document though; I'm leaving that open to the implementation and just listing this as a possible operation. In fact, I was just assuming here that it would be implemented as an actual traversal calculation, if it's implemented at all. That's what I did in iamap, but the user has to know the cost of calling it!
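To make the trade-off in this exchange concrete, here is a minimal sketch of the two ways a `Size()` operation could be answered. Everything in it is an assumption for illustration: the in-memory `store`, the node layout (`size`, `children`, `elements`), and both function names are hypothetical and are not taken from iamap, unixfs-v2, or any IPLD spec.

```python
def declared_size(store, root_id):
    """Return the size recorded in the root node without loading any children."""
    return store[root_id]["size"]


def traversed_size(store, node_id):
    """Count elements by visiting every block reachable from node_id."""
    node = store[node_id]
    if "elements" in node:
        return len(node["elements"])
    return sum(traversed_size(store, child) for child in node["children"])


# Hypothetical block store: block id -> node (leaf with elements, or branch with children).
store = {
    "leaf-a": {"elements": [1, 2, 3]},
    "leaf-b": {"elements": [4, 5]},
    # The root claims 1000 elements; nothing stops an author from writing this.
    "root": {"size": 1000, "children": ["leaf-a", "leaf-b"]},
}

cheap = declared_size(store, "root")    # one block read, unverified
actual = traversed_size(store, "root")  # reads every block, trustworthy
```

The declared value costs a single block read but is only a hint, while the traversal is authoritative and its cost grows with the number of blocks in the structure.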

mikeal · 2019-04-09T20:26:16Z

README.md

+           │    MFS    │ │       End-User Applications        │
+           │           │ │                                    │
+           ├───────────┤ ├─────────────────┬──────────────────┤            ┌───────────┐
+           │           │ │ Structured Data │     Utility      │            │           │


“Utility Collections” should actually just replace “Complex Data Structures.” unixfs relies on hamt which is a collection, and there’s no way you can make SQL and GeoSpatial interfaces without these as well.

I think I need to understand the geo etc. stuff better. I'm thinking the name "Specialized Data Structures" might be more suited and maybe "End-User Applications" should wrap around it so it's still also sitting on "Utility Collections".

So, I get GeoSpatial, but I don't know what "VR" is referring to in this context and I'm not sure "SQL" makes sense under that heading because it's an interfacing method rather than a collection type. @mikeal can you expand on the thoughts behind this category a bit?

I agree with @rvagg's version here. That's also my view on things. The old version was wrong anyway. A Geo index is not a sorted index.

rvagg · 2019-04-10T00:56:54Z

Oh, I've just worked out why some of the words are jumbled up: before pushing this I did a replace of the variable names I used in the methods. They were like this: `Add(e, i)`, so I did a global replace of `e, i` with `element, index`, but obviously didn't bound some of those with parens! I'll clean that up; it also looks like some of the methods are missing arguments.
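The failure mode described here can be reproduced in a few lines. This is a sketch, not the command that was actually run: it assumes the original sentence read "language, it represents" (inferred from the garbled "languagelement, indext"), and the sample string is made up for the demonstration.

```python
import re

original = (
    "While IPLD is not a programming language, it represents "
    "enormous potential. Add(e, i)"
)

# Unbounded replace: every "e, i" substring is rewritten, including the one
# that happens to span "language, it" in the prose.
unbounded = original.replace("e, i", "element, index")

# Bounded replace: only rewrite the pair when it sits inside parentheses.
bounded = re.sub(r"\(e, i\)", "(element, index)", original)
```

The unbounded version turns "language, it" into "languagelement, indext", while the bounded version only touches the method signature.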

rvagg · 2019-04-10T01:31:06Z

@mikeal

Re the floor/ceil thing: it was supposed to be implicit and it's really up to the collection implementor to decide on the reasons for exposing things in different ways. I've added this paragraph as well as some clarifying notes in the sorted collections:

Operations exposed by collections may depend on user ergonomics and the practicalities of the underlying algorithms. For example, an ordered collection may expose Floor() and Ceiling() convenience operations or only expose iterators with floor/ceiling modifiers that can serve the same purpose. The nature of collections algorithms is such that there may be efficiency reasons whereby apparent convenience methods provide significant performance gains over their long-hand versions.

The same thing goes for Push(), Pop() vs Get(0), Set(e, 0) and a bunch of other cases. It's actually pretty eye-opening if you look across the standard library collection landscape: there are compromises and leaky abstractions absolutely everywhere, and I think that's just the nature of the beast. It'd be nice to have only a couple of pure collections that can serve every purpose, but there are good reasons that java.util.Collection has 55 "known implementing classes" in the base Java standard library alone. The trick is to provide some generic ones that are generically useful and some specific ones that are performant and/or ergonomic for some specific cases. That's repeated across ecosystems. That's where I've gone with this document.
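As an illustration of the Floor()/Ceiling() point, the sketch below shows one ordered collection exposing both forms: dedicated convenience operations, and an iterator with a start bound that can serve the same purpose. The class `SortedKeys` and its method names are hypothetical; a sorted Python list with the standard `bisect` module stands in for whatever algorithm a real multi-block collection would use.

```python
import bisect


class SortedKeys:
    """Hypothetical ordered collection backed by a sorted Python list."""

    def __init__(self, keys):
        self._keys = sorted(keys)

    def iterate(self, start=None, reverse=False):
        """Long-hand form: iterate in order, optionally bounded by a start key.

        Forward iteration yields keys >= start; reverse yields keys <= start.
        """
        if reverse:
            hi = len(self._keys) if start is None else bisect.bisect_right(self._keys, start)
            return iter(self._keys[:hi][::-1])
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        return iter(self._keys[lo:])

    def ceiling(self, key):
        """Convenience form: smallest stored key >= key, or None."""
        i = bisect.bisect_left(self._keys, key)
        return self._keys[i] if i < len(self._keys) else None

    def floor(self, key):
        """Convenience form: largest stored key <= key, or None."""
        i = bisect.bisect_right(self._keys, key)
        return self._keys[i - 1] if i > 0 else None


keys = SortedKeys([10, 20, 30, 40])
ceiling_direct = keys.ceiling(25)
ceiling_via_iterator = next(keys.iterate(start=25), None)
floor_direct = keys.floor(25)
floor_via_iterator = next(keys.iterate(start=25, reverse=True), None)
```

Both routes give the same answers here. In this in-memory sketch the reverse iterator copies a slice before yielding anything, which hints at why a direct `floor()` can be cheaper than its long-hand equivalent.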

Re the block diagrams: I've rejiggered it again in a way that I think reflects the ideal reality. "Utility Collections" sits underneath them all, and I've renamed that other block to "Specialized Data Structures & Utilities: VR, Geo, SQL, etc.". GeoSpatial will be more about data structures I think, but SQL will be more about utilities on top of the base collections. That's where the interesting ecosystem of specialized use-cases will evolve and things will get really interesting!

vmx · 2019-04-10T11:43:02Z

Re the block diagrams

I would add "Maps, Sets, etc." back to "Utility Collections", at least to me it makes things clearer.

rvagg · 2019-04-16T06:57:01Z

I think this is OK to land, aside from the "Map" and "List" naming conflicts discussed in #112 (comment) which I wouldn't mind clearing up a little.

rvagg · 2019-04-17T05:25:25Z

In an attempt to deal with the "Map" and "List" confusion with the data model, I've tried the following:

Renamed "Utility Collections" to "Multi-block Collections"
Added an introductory paragraph:

This document will re-use some terms found in the IPLD data model, in particular "Map" and "List". These should not be confused, as they operate at different layers of the IPLD stack. In the context of the data model, these names represent forms that are serialized into blocks along with other primitive data kinds. However, a "Map" or a "List" as a multi-block collection is a structure that is mapped on to many blocks (making use of the primitive kinds in the data model within those blocks in various ways) and exposes interfaces for building and interacting with complex and potentially arbitrarily large data structures. A multi-block collection combines specific data model encoding for individual blocks as well as logic that ties multiple blocks together into a useful data structure.

Here's what the stack looks like:

Does that help or just make it more confusing?
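The two layers described in the introductory paragraph above can be sketched in code. This is only an illustration under stated assumptions: a SHA-256 hex digest over JSON stands in for real CIDs and codecs, and `put_block`, `get_block`, `build_list`, `list_get`, and the root layout (`width`, `leaves`) are hypothetical names, not part of any IPLD spec.

```python
import hashlib
import json


def put_block(store, value):
    """Serialize one data-model value as a block; return a content-derived key."""
    data = json.dumps(value, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()  # stand-in for a real CID
    store[key] = data
    return key


def get_block(store, key):
    """Load and decode a single block."""
    return json.loads(store[key].decode("utf-8"))


def build_list(store, elements, width=3):
    """Build a multi-block 'List'.

    Leaf blocks each hold a data-model list of at most `width` elements, and
    the root block holds a data-model map that links to the leaves.
    """
    leaves = [
        put_block(store, elements[i:i + width])
        for i in range(0, len(elements), width)
    ]
    return put_block(store, {"width": width, "leaves": leaves})


def list_get(store, root_key, index):
    """Collection-level Get(index): pick the leaf block, then index inside it."""
    root = get_block(store, root_key)
    leaf = get_block(store, root["leaves"][index // root["width"]])
    return leaf[index % root["width"]]


store = {}
root_key = build_list(store, list(range(10)), width=3)
value = list_get(store, root_key, 7)
```

Each block on its own contains only data-model kinds (a list or a map), while the "List" the caller interacts with exists only in the logic that ties the root and leaf blocks together.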

rvagg · 2019-05-01T10:33:11Z

I'd like to merge this, does anyone have objections? It's not a "spec" per se, more of an introduction to specs that will hopefully fill this directory.

mikeal · 2019-05-01T19:03:29Z

It's not a "spec" per se, more of an introduction to specs that will hopefully fill this directory.

We have a bunch of those so that shouldn’t be a blocker. An action item I have for us to talk through at the summit is some kind of staging process for specs so that we can clearly communicate the state and intention of different specs as we are currently documenting things that are not implemented, being implemented, and even things that are fully implemented that we are actively seeking to move away from. So, spec activity and newness is not a very useful signal and we’ll need to find something more explicit.

Anyway, +1 to merge.

ghost assigned rvagg Apr 9, 2019

ghost added awaiting review status/in-progress In progress labels Apr 9, 2019

rvagg force-pushed the rvagg/collections-docs branch from 807fb09 to 4585b77 Compare April 9, 2019 11:28

vmx reviewed Apr 9, 2019

mikeal reviewed Apr 9, 2019

Data-Structures/Utility-Collections.md (outdated)

mikeal reviewed Apr 9, 2019

rvagg force-pushed the rvagg/collections-docs branch from 42a715d to ce6cb1c Compare April 10, 2019 01:21

rvagg mentioned this pull request Apr 16, 2019

Data Model doc expanded, and in terms of Kinds. #112

Merged

mikeal reviewed Apr 17, 2019

data-structures/multiblock-collections.md (outdated)

rvagg mentioned this pull request Apr 24, 2019

IPLD Monthly Meetup (Virtual) - May 7th 2018 ipld/ipld#70

Open

Initial multi-block collections outline

fb16d49

rvagg force-pushed the rvagg/collections-docs branch from 50050b5 to fb16d49 Compare April 26, 2019 04:44

rvagg mentioned this pull request Apr 26, 2019

add basic spec for hamt #109

Closed

rvagg merged commit e918c10 into master May 2, 2019

ghost removed awaiting review status/in-progress In progress labels May 2, 2019

rvagg deleted the rvagg/collections-docs branch May 2, 2019 02:36

Stebalien pushed a commit to Stebalien/specs that referenced this pull request Sep 18, 2019

Merge pull request ipld#110 from filecoin-project/feat/rework-operation

0818a5f

Rework the operation.md doc a bit

prataprc pushed a commit to iprs-dev/ipld-specs that referenced this pull request Oct 13, 2020

Merge pull request ipld#110 from ipld/rvagg/collections-docs

86dc953

Initial utility collections docs


		Collections are a fundamental primitive in every programming language. Being able to organize data into collections that allow for convenient and efficient access and modification is a core activity in programming.

		While IPLD is not a programming languagelement, indext represents enormous potential for sharing data and providing access to diverse and very large data sets. With sufficient data organization primitives, IPLD can replace many functions traditionally provided by a centralized database system. Client applications should be able to access and even manipulate data structures stored across many peers, from trivial lists to massive and complex data sets that are exposed with efficient query and search operations.
