Persistent Data Structures #221

jackfirth · 2022-05-24T09:17:16Z

jackfirth
May 24, 2022
Collaborator Sponsor

This is an extension of #201. I've been thinking about persistent collections lately so I figured I'd write my thoughts down.

To provide a good collections library, we'll need some good immutable data structures. This discussion proposes three persistent data structures we can use as building blocks for the rest of the persistent collections library:

Relaxed radix balanced trees (RRB trees)
Sized and weighted persistent hash array mapped tries (HAMTs)
Sized and weighted persistent red-black trees (RB trees)

A sized tree is a tree where each branch node records the total number of elements stored in its subtree. A weighted tree is a tree where each leaf node, in addition to containing some user-supplied data, also contains a positive integer weight, and each branch records the total weight of its subtree. Recording the size of each subtree makes the size() operation trivially constant-time for all collections, but much more crucially, it enables efficient random access into the trees. Recording weights is similarly useful for collections like multisets where each leaf may correspond to many elements.

With these data structures, we can implement the following persistent collections:

Persistent list: use an RRB tree (Clojure's core.rrb-vector library does this)
Persistent map: use a HAMT (Racket already does this)
Persistent set: use a HAMT whose values are all #false (or some other constant) (Racket already does this)
Persistent multiset: use a HAMT whose values are all #false and whose weights are the number of times the element occurs in the multiset
Persistent sorted map: use an RB tree
Persistent sorted set: use an RB tree whose values are all #false
Persistent sorted multiset: use an RB tree whose values are all #false and whose weights are the number of times the element occurs in the multiset
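
To make the "sized" and "weighted" bookkeeping concrete, here is a minimal sketch in Python. It uses a plain binary tree rather than a HAMT or RB tree, and the names `Leaf`, `Branch`, and `branch` are hypothetical, chosen only for illustration. Each branch caches the totals of its subtrees when it is built, so reading either total at the root takes constant time.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    element: str
    weight: int = 1  # positive count of occurrences of this element

    @property
    def size(self) -> int:
        return 1  # a leaf always holds exactly one unique element


@dataclass(frozen=True)
class Branch:
    left: "Node"
    right: "Node"
    size: int    # number of leaves below this branch
    weight: int  # sum of the leaf weights below this branch


Node = Union[Leaf, Branch]


def branch(left: Node, right: Node) -> Branch:
    """Build a branch, caching the totals of its two subtrees."""
    return Branch(left, right, left.size + right.size, left.weight + right.weight)


# The multiset {a, a, a, b, c, c} stored as three weighted leaves.
tree = branch(branch(Leaf("a", 3), Leaf("b", 1)), Leaf("c", 2))
print(tree.size)    # unique elements: 3
print(tree.weight)  # total elements: 6
```

In this sketch a plain set would simply leave every weight at 1, so its size and weight coincide.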

Additionally, we can implement the following collection views:

Viewing a set as a list (the HAMT branch sizes enable indexing into the set)
Viewing a sublist of a list (RRB trees allow finding the start and end of a sublist efficiently)
Viewing a multiset as a list (the HAMT branch weights enable indexing into the multiset)
Viewing the unique elements of a multiset as a set (simply ignore the weights)
Viewing the unique elements of a multiset as a list (the HAMT branch sizes enable indexing into the unique elements)
Viewing the keys of a map as a set (simply ignore the values stored in each HAMT leaf node)
Viewing a multiset as a map from elements to counts
Viewing a map as a set of key-value entries
Viewing the values of a map as a list
Viewing a sorted set as a list (the RB tree branch sizes enable indexing into the sorted set)
Viewing a subset of a sorted set (e.g. "all elements of the set between 10 and 20")
Viewing a sublist of a sorted set (e.g. "the 5 smallest elements of the set")
Viewing a subset (or sublist) of a sorted multiset
Viewing a submap of a sorted map
Viewing the unique elements of a sorted multiset as a sorted set
Viewing the unique elements of a sorted multiset as a list
Viewing a sorted multiset as a list
Viewing the keys of a sorted map as a sorted set
Viewing a sorted multiset as a sorted map from elements to counts
Viewing a sorted map as a sorted set of key-value entries
Viewing the values of a sorted map as a list
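
As an illustration of what a "view" means here, the following Python sketch wraps a backing sequence and exposes a window onto it without copying any elements. `SublistView` is a hypothetical name, and a tuple stands in for the RRB tree; a real implementation would delegate indexing to the tree instead.

```python
from collections.abc import Sequence


class SublistView(Sequence):
    """A read-only window onto part of another sequence, without copying."""

    def __init__(self, backing, start, stop):
        self._backing = backing
        self._start = start
        self._stop = stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch supports integer indexes only")
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Translate the view's index into the backing sequence's index.
        return self._backing[self._start + index]


backing = ("a", "b", "c", "d", "e")
view = SublistView(backing, 1, 4)
print(list(view))  # ['b', 'c', 'd']
print(view[-1])    # d
```

Because the view only stores two offsets, creating it costs the same regardless of how long the backing sequence is.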

This lets us perform many complex queries on these collections efficiently. Here are some examples, with xs[a..b] being the syntax for selecting a subrange of a collection where both a and b endpoints are optional:

How many elements does this sorted set contain that are smaller than 10? xs[..10].size
What are the five smallest elements of this sorted set? xs.asList[..5]
How many unique elements does this sorted multiset contain that are between 10 and 20? xs.uniqueElements[10..20].size
What are the entries of the five smallest keys of this sorted map? xs.entries.asList[..5]
What are the counts of each of the unique elements in this sorted multiset between 10 and 20? xs[10..20].asMap.values
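
The first query above, `xs[..10].size`, can be answered without visiting every element when each node knows its subtree size. The sketch below assumes a plain balanced binary search tree in Python rather than an RB tree; `Node`, `build`, and `count_less_than` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"]
    right: Optional["Node"]
    size: int  # number of keys in this subtree, including this node


def size(node: Optional[Node]) -> int:
    return node.size if node is not None else 0


def build(sorted_keys: List[int]) -> Optional[Node]:
    """Build a balanced search tree from keys that are already sorted."""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    left = build(sorted_keys[:mid])
    right = build(sorted_keys[mid + 1:])
    return Node(sorted_keys[mid], left, right, size(left) + size(right) + 1)


def count_less_than(node: Optional[Node], bound: int) -> int:
    """Count keys smaller than bound, visiting one node per tree level."""
    count = 0
    while node is not None:
        if node.key < bound:
            # This key and its entire left subtree are below the bound.
            count += size(node.left) + 1
            node = node.right
        else:
            node = node.left
    return count


xs = build([2, 3, 5, 7, 11, 13, 17, 19])
print(count_less_than(xs, 10))  # 4, the keys 2, 3, 5, 7
```

The loop follows a single root-to-leaf path, so the cost is proportional to the tree's height rather than to the number of matching elements.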

Out of scope for this discussion

This discussion only covers persistent immutable collections. Non-persistent immutable collections backed by flat arrays and optimized for read-only use are out of scope for now. Those are important to have for performance, but the persistent implementations need to come first since the non-persistent ones would have to switch to the persistent implementations upon first modification.

Also, I'm not touching mutable collections yet. Those deserve their own separate discussion.

rocketnia · 2022-05-24T21:49:01Z

rocketnia
May 24, 2022

This is great!

> Viewing the unique elements of a multiset as a set (simply ignore the weights)

It looks like this view in particular is where it comes in handy to keep track of both the "total weight" and the "size" of each subtree as you were saying above, right? That way the total weight of the multiset is its .size, and the size of the multiset is the .size of the set view.

Juggling that terminology is a bit awkward. What I really want to say is that the multiset's .size is its "size" (not "total weight"), and the set view's .size is the original multiset's "unweighted size" or "deduplicated size" or something like that (not merely its "size").

9 replies

sorawee May 25, 2022

Another thought: instead of having a weight on every internal node, would it be cheaper if internal nodes didn't need to care about this weight at all? Instead, the weight could be kept in the top-level wrapper.

sorawee May 25, 2022

Concerning the interface:

Given a set of integers, how can I ask what is the smallest element greater than x? Same for:

smallest greater than or equal to x
largest element less than x
largest element less than or equal to x

More generally, how can I ask for the next k smallest elements greater than x, etc, etc.?

jackfirth May 25, 2022
Collaborator Author Sponsor

> What's the time complexity of cons, first, and rest for a list backed by RRB tree? And what is the overhead in practice for these operations? Is it efficient enough to supplant regular cons, first, and rest?

Amortized logarithmic for all operations. The branching factor is very high, though, so in practice it behaves more like a constant-time operation, similar to HAMTs. There are also extensions that make repeated access and update of the same element and nearby elements faster. The RRB tree paper mentions them, and Clojure and Scala both implement different extensions for this.

> Would implementing multiset as a map from key to list of "identical" elements (under some equivalence relation) make things easier? One nice thing about this is that you get (reversed) stability among "identical" elements for free (or if snoc is efficient, then it doesn't need to be reversed). I'm not sure how HAMT deals with the stability issue.

Implementing multisets as a map from unique elements to counts is simpler than implementing them as a map from unique elements to lists. Multisets are intended to be used with data where it doesn't matter which specific elements are used if they're equal; all that matters is that the count is correct. This lets multisets consume space proportional to the number of unique elements, not the total number of elements.
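
A small Python comparison of the two representations, as a sketch only: `collections.Counter` is a mutable stand-in for the proposed persistent map from elements to counts, and the `groups` dictionary models the map-to-lists alternative.

```python
from collections import Counter

words = ["tea", "milk", "tea", "tea", "milk", "sugar"]

# Counts representation: one entry per unique element.
counts = Counter(words)

# Lists representation: every equal element is kept individually.
groups = {}
for word in words:
    groups.setdefault(word, []).append(word)

stored_in_counts = len(counts)
stored_in_groups = sum(len(group) for group in groups.values())

print(stored_in_counts)      # 3 entries, one per unique element
print(stored_in_groups)      # 6 elements, one per occurrence
print(sum(counts.values()))  # 6, the multiset's total size is still recoverable
```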

jackfirth May 25, 2022
Collaborator Author Sponsor

> Another thought: instead of having a weight on every internal node, would it be cheaper if internal nodes didn't need to care about this weight at all? Instead, the weight could be kept in the top-level wrapper.

Internal weights are needed for random access. There's no way to figure out which subtree to traverse down when looking for the 10th element of a multiset if you don't know how many elements each subtree has.
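
The descent described here can be sketched as follows. This is an illustrative Python version over a plain binary tree, not the proposed HAMT; `Leaf`, `Branch`, `branch`, and `element_at` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    element: str
    weight: int  # how many times the element occurs


@dataclass(frozen=True)
class Branch:
    left: "Tree"
    right: "Tree"
    weight: int  # cached total weight of both subtrees


Tree = Union[Leaf, Branch]


def branch(left: Tree, right: Tree) -> Branch:
    return Branch(left, right, left.weight + right.weight)


def element_at(tree: Tree, index: int) -> str:
    """Return the element at a zero-based index, counting duplicates."""
    if not 0 <= index < tree.weight:
        raise IndexError(index)
    while isinstance(tree, Branch):
        if index < tree.left.weight:
            tree = tree.left
        else:
            # Skip every element stored in the left subtree.
            index -= tree.left.weight
            tree = tree.right
    return tree.element


# The multiset {a, a, a, b, c, c}.
multiset = branch(branch(Leaf("a", 3), Leaf("b", 1)), Leaf("c", 2))
print([element_at(multiset, i) for i in range(multiset.weight)])
# ['a', 'a', 'a', 'b', 'c', 'c']
```

With only a top-level total, the `index < tree.left.weight` test would have nothing to compare against, and the lookup would have to scan leaves one by one.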

jackfirth May 25, 2022
Collaborator Author Sponsor

> Concerning the interface:
>
> Given a set of integers, how can I ask what is the smallest element greater than x? Same for:
>
> smallest greater than or equal to x
> largest element less than x
> largest element less than or equal to x
>
> More generally, how can I ask for the next k smallest elements greater than x, etc, etc.?

First, select the subset for the range you want. Second, view that subset as a list. Third, select the item or sublist you want. Examples:

Smallest element greater than or equal to x. Take the range x.. (inclusive bound below, unbounded above) and select the subset using xs[x..] (indexing a sorted collection with a range constructs a subcollection view), then view it as a list using xs[x..].asList, then select the first element using xs[x..].asList[0].
Largest element less than x. Similar to above, but getting the last element requires reversing the list view. I forgot to mention above but we should be able to make a reverse view of a list for any list. So this would be xs[..x].asList.reversed[0].
Next k smallest elements greater than or equal to x. Similar again, except instead of selecting one element from the final list view, we select a sublist of it: xs[x..].asList[..k]
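
The three recipes above can be mimicked in Python to check their semantics. This is only a sketch: a sorted Python list stands in for the sorted set, `bisect_left` stands in for range selection, and Python slices copy their elements rather than producing the constant-size views proposed in this discussion.

```python
from bisect import bisect_left

xs = [2, 3, 5, 7, 11, 13]  # stand-in for a sorted set
x = 5
k = 3

split = bisect_left(xs, x)  # index of the first element >= x
at_or_above = xs[split:]    # plays the role of xs[x..].asList
below = xs[:split]          # plays the role of xs[..x].asList

smallest_at_or_above = at_or_above[0]  # xs[x..].asList[0]
largest_below = below[::-1][0]         # xs[..x].asList.reversed[0]
next_k = at_or_above[:k]               # xs[x..].asList[..k]

print(smallest_at_or_above)  # 5
print(largest_below)         # 3
print(next_k)                # [5, 7, 11]
```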

jackfirth · 2022-05-25T05:45:06Z

jackfirth
May 25, 2022
Collaborator Author Sponsor

@mflatt For HAMTs, Racket's HAMTs don't have the size and weight features needed to implement the proposed collection views. Do you think it would be better to extend and reuse Racket's HAMTs, or to make a new HAMT implementation for Rhombus?

3 replies

mflatt May 26, 2022
Maintainer

I think probably a new one, but with stencil vectors exposed to the Racket level for use in that implementation of HAMTs and possibly other data structures.

jackfirth May 26, 2022
Collaborator Author Sponsor

Reusing the stencil vectors is definitely a good idea.

jackfirth May 31, 2022
Collaborator Author Sponsor

Stencil vectors are exposed publicly as of racket/racket@235c86a.

soegaard · 2022-05-26T12:32:40Z

soegaard
May 26, 2022

On stencil vectors and HAMTs:

"Compiler and Runtime Support for HAMTS", Sona Torosyan
https://www.cs.utah.edu/docs/techreports/2021/PDF/UUCS-21-003.pdf

0 replies

gus-massa · 2022-05-27T13:30:56Z

gus-massa
May 27, 2022

I didn't know the terminology, so I'll copy the definition from the other thread in case it's useful for others:

Persistent collection — an immutable collection that supports efficient functional updates

0 replies

jackfirth · 2022-06-01T00:56:25Z

jackfirth
Jun 1, 2022
Collaborator Author Sponsor

This was the discussion topic at the May 26th meeting. We discussed the following:

Fast paths for collections with only a few elements: this doesn't need to be in the persistent data structure implementations. Instead, we'll have specialized implementations of the generic collection interfaces for empty collections and singleton collections, which are by far the most useful cases to specialize.
Fast repeated inserts into the head or tail of a list: this is an important use case to make fast since otherwise, Racket cons-list users will see it as a regression. There are two strategies we can employ. First, we can implement the RRB tree extension that Scala implements to make the structure behave more like a zipper, where there's a focus within the list and inserts and updates near the focus are constant time. Second, we can implement collection builders, which should take care of most of the cases where a list is being built up incrementally in a loop.
How these collections, especially the sorted collections, should interact with regard to equality and ordering APIs: as a first pass, unordered collections are expected to depend on a universal equal-always and hash code interface that all values implement. Sorted collections are expected to use a similar interface for ordering by default, which not all values necessarily implement. Default implementations of the ordering interface for data types are expected to be total orders that are usually, but not strictly, consistent with the Rhombus equality interface. Additionally, sorted collections are expected to provide the option for users to supply a custom comparator that provides a total order over elements. Custom comparators are intended for use with data that doesn't have an "obvious" ordering or with orderings that are inconsistent with equality, such as sorting people by name (name collisions happen!) or sorting other complex aggregates by a particular field that isn't guaranteed to be unique.
Low-level implementation steps: Racket's HAMTs currently rely on stencil vectors, which are a primitive of the CS runtime. Rhombus HAMTs will want to use stencil vectors too, so stencil vectors need to be exposed to the Racket layer. As of racket/racket@235c86a, @mflatt has already done this work.
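
To illustrate the point about comparators that are inconsistent with equality, here is a small Python sketch. `Person` and `by_name` are hypothetical, and Python's `sorted` with `functools.cmp_to_key` stands in for a sorted collection that accepts a custom comparator.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass(frozen=True)
class Person:
    name: str
    employee_id: int


def by_name(a: Person, b: Person) -> int:
    """Three-way comparison on names only: negative, zero, or positive."""
    return (a.name > b.name) - (a.name < b.name)


people = [Person("Sam", 7), Person("Alex", 3), Person("Sam", 2)]
ordered = sorted(people, key=cmp_to_key(by_name))

print([p.employee_id for p in ordered])    # [3, 7, 2]
print(by_name(people[0], people[2]) == 0)  # True: the comparator ties them
print(people[0] == people[2])              # False: equality tells them apart
```

The two people named Sam are distinct values that the comparator cannot separate, which is exactly the case a sorted collection with a custom comparator has to define a policy for.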

0 replies
