Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type representation design #2348

Open
nomeata opened this issue Feb 14, 2021 · 17 comments
Open

Type representation design #2348

nomeata opened this issue Feb 14, 2021 · 17 comments
Labels
compiler Motoko → Wasm P2 medium priority, resolve within a couple of milestones typing Involves the type system

Comments

@nomeata
Copy link
Collaborator

nomeata commented Feb 14, 2021

In this issue I’d like to design our initial type representation (typerep) design.

What’s this?

The (run-time) type representation is a value at run-time that indicates a type.

What for, and why now?

We need (or can use this) for:

  • Currently, we do type-driven code generation in the IR-to-IR passes (show.ml) and the code generator (for serialization/deseriaizaion). This could be replaced with type-driven generation of a typrep value, and then interpreting that value in (static) rust code.
    The benefits are:

    • Simpler code generation
    • We can write more code in Rust (simpler, easier to maintain, easier to do complex stuff)
    • (Likely) smaller code this way.
  • In particular, I don’t see how we can extend our existing approach to deserialization code generation to the new requirement of dynamic subtyping checks (Spec: Do a subtyping check when decoding candid#168); this really seems to require a typerep value (with identity).

  • A bit down the road, this is a requisite for Shared Generics (Shared generics #2096), and even if we don’t do that right away, it would be good to keep that use-case in mind in the design.

Because this seems to be a blocker for correct Candid deserialization, now seems to be the right time to tackele this. Nevertheless, it might be advisable to first migrate debug_show, as it is simpler.

Requirements

  • The typrep values should contain enough information to reconstruct the full type, including field names (if only for error messages), not just field hashes.
  • The typrep values need to be statically decomposable (e.g. the typrep for ?t should contain the typrep for t). Statically meaning the typrep for t exists in memory already. This seems to be necessary for all kind of recursive algorithms (show), and it would be odd if they have to allocate. This likely rules out a format like our “type hash” (which otherwise is, in a way, a very compact representation of the full type).
  • The typrep values must be able to represent coinductive values, i.e. be cyclic. This is a fundamental difference to Motoko value values!
  • The typrep values must support some form of identity, or at least a conservative to way to identify them (conservative in the sense that different typreps of the same type may look different, as long as one doesn’t run into cycles). The pointer to the static memory location of the typrep might work here. This is needed for the caching that one has to do when implementing the Candid subtyping check.
  • (soft goal) The typreps should be compact.
  • (soft goal) The strings the typrep (field names) should be deduplicated.
  • (soft goal) The typreps should be easy to parse/traverse
  • (soft goal) The typreps memory structure should be the representation of a suitable Rust enum/struct, for ease of writing the Rust code.

The following requirements come from the usecase of shared generics.

  • The typrep values need to be composable (e.g. from the typrep for t one can construct the typrep for ?t). This must also work for infinite types (e.g. from t construct let t2 = ?(t, t2) in t2).
  • The above implies that typrep values must be dynamically allocatable.
  • The above implies that they must have a GC-compatible form.

Approach 1: Use Motoko Values

One possible design is to define a Motoko type:

type TypRep = {
  #int;
  #nat;
  …
  #tuple : [TypRep];
  #array : ({#mut; #immut}, TypRep);
  #opt : TypRep;
  …
  #record : [(Text; TypRep)];
  …
}

and use the representation of that.

  • ➕ No need to adjust GC
  • ➕ Possible to write functions over this type in prelude.ml or in IR-to-IR passes
  • ➕ Existing functionality to read and write these heap objects in the RTS can be used
  • ➕ String de-duplication for free (we already de-duplicate static strings)
  • ➗ Motoko values are inductive; this needs to be coinductive. Unclear what breaks if we start to conflate the two
  • ➖ Not very compact. Using a type representation that was designed for uniformity and subtyping when neither is needed here (or is it?)
  • ➖ Maybe less convenient to use in Rust than dedicated structures
  • ➖ Either have to deal with ropes for the strings, or introduce the invariant that strings in a TypRep are always contiguous.

Approach 2: Dedicated heap objects

We could introduce dedicated heap objects for the TypReps.

(No details here yet.)

Pros and Cons mostly the inverse of Approach 1.

And now?

Discuss!

@nomeata
Copy link
Collaborator Author

nomeata commented Feb 14, 2021

Outcome of some thinking during a nice hike through the snow. Let’s collect all the things we have to think of now!

@rossberg
Copy link
Contributor

Thanks for collecting this! I agree with most of these points. A few questions below.

  • In particular, I don’t see how we can extend our existing approach to deserialization code generation to the new requirement of dynamic subtyping checks (dfinity/candid#168); this really seems to require a typerep value (with identity).

Can you elaborate on this? In particular, isn't dfinity/candid#168 about subtyping over Candid types, while typerep would represent Motoko types (which are different and have different subtyping rules)?

  • The typrep values should contain enough information to reconstruct the full type, including field names (if only for error messages), not just field hashes.

What error messages do you have in mind? This strikes me as rather an anti-requirement -- it would be unfortunate if we couldn't erase field names anymore, and if rtts were even more bloated than they already have to be. They are not supposed to become a user-facing reflection feature.

  • The typrep values must support some form of identity, or at least a conservative to way to identify them (conservative in the sense that different typreps of the same type may look different, as long as one doesn’t run into cycles). The pointer to the static memory location of the typrep might work here. This is needed for the caching that one has to do when implementing the Candid subtyping check.

Hm, where will type identity be needed? I can see that for optimisations, e.g., for memoisation, but that wouldn't necessarily need to be accurate. But for semantics, aren't you always inspecting the type instead of comparing it? Or if you compare, won't it be up to subtyping? Notably, even in the static semantics, there are very few occurrences of using type equivalence -- essentially, only subtyping over invariant type constructors (which could be replaced with antisymmetry).

  • The typrep values need to be composable (e.g. from the typrep for t one can construct the typrep for ?t). This must also work for infinite types (e.g. from t construct let t2 = ?(t, t2) in t2).

This will probably be the most tricky part, since it requires a form of substitution, so graph copy. (Fortunately not involving reduction, since we do not have higher-order polymorphism (yet?).)

  • ➕ No need to adjust GC
  • ➕ Possible to write functions over this type in prelude.ml or in IR-to-IR passes
  • ➕ Existing functionality to read and write these heap objects in the RTS can be used
  • ➕ String de-duplication for free (we already de-duplicate static strings)
  • ➗ Motoko values are inductive; this needs to be coinductive. Unclear what breaks if we start to conflate the two

In Motoko you could represent this with var indirections, though, can't you?

  • ➖ Not very compact. Using a type representation that was designed for uniformity and subtyping when neither is needed here (or is it?)

Yes, there is only one typerep type, so subtyping doesn't enter the picture, AFAICS.

  • ➖ Maybe less convenient to use in Rust than dedicated structures
  • ➖ Either have to deal with ropes for the strings, or introduce the invariant that strings in a TypRep are always contiguous.

Let's try to keep strings out of typereps?

I don't have a specific opinion regarding an embedded vs native representation. Would it be realistic to do some prototyping to see which one is simpler in practice?

@nomeata
Copy link
Collaborator Author

nomeata commented Feb 15, 2021

Thanks for collecting this! I agree with most of these points. A few questions below.

  • In particular, I don’t see how we can extend our existing approach to deserialization code generation to the new requirement of dynamic subtyping checks (dfinity/candid#168); this really seems to require a typerep value (with identity).

Can you elaborate on this? In particular, isn't dfinity/candid#168 about subtyping over Candid types, while typerep would represent Motoko types (which are different and have different subtyping rules)?

When implementing deserialization, you actually need

  • the input candid type (the type table from the candid message)
  • the expected candid type (to do subtyping hecks)
  • the expected Motoko type (to actually produce the value that you want).

The last point can’t be omitted: Blob and [Nat8] have the same candid type, but different Motoko representations. But the second point can be omitted; it can be calculated from the last one.

And if you don’t want to do a fresh candid subtyping check repeatedly on each recursion of what’s now decode_go, you need to do the subtye caching on paris of input candid type (index) and expected Motoko type, I fear.

  • The typrep values should contain enough information to reconstruct the full type, including field names (if only for error messages), not just field hashes.

What error messages do you have in mind? This strikes me as rather an anti-requirement -- it would be unfortunate if we couldn't erase field names anymore, and if rtts were even more bloated than they already have to be. They are not supposed to become a user-facing reflection feature.

Without them you can’t implement show. Also, without them, you’d have to store both Motoko and Candid hashes (or make sure you can calculate one of the other).

Field names are part of our types, for what it’s worth, I don’t think we can avoid them.

Hm, where will type identity be needed? I can see that for optimisations, e.g., for memoisation, but that wouldn't necessarily need to be accurate.

Yes, memoization for Candid decoding (only use-case so far) and yes, don’t need to be accurate (that’s what I wanted to say). We could rephrase this as “must support loop breakers”? Or what is the right technical term for the property we need?

I don't see a use-case yet for Motoko subyping on typreps (thankfully).

  • The typrep values need to be composable (e.g. from the typrep for t one can construct the typrep for ?t). This must also work for infinite types (e.g. from t construct let t2 = ?(t, t2) in t2).

This will probably be the most tricky part, since it requires a form of substitution, so graph copy. (Fortunately not involving reduction, since we do not have higher-order polymorphism (yet?).)

It’s probaly not too bad; you statically compile code that allocates and builds the graph with holes, and then put in the pointers to t. I woudn’t call it tricky, just a fun programming puzzle. Not harder than compiling a bunch of mutually recursive closures that capture a free variable, which we can already do.

  • heavy_division_sign Motoko values are inductive; this needs to be coinductive. Unclear what breaks if we start to conflate the two

In Motoko you could represent this with var indirections, though, can't you?

You could, but since we don't need the properties of mutable boxes (e.g. have identity), don’t actually see the gain over just creating a coinductive structure.

  • heavy_minus_sign Either have to deal with ropes for the strings, or introduce the invariant that strings in a TypRep are always contiguous.

Let's try to keep strings out of typereps?

See above, I don’t think we can.

I don't have a specific opinion regarding an embedded vs native representation. Would it be realistic to do some prototyping to see which one is simpler in practice?

Actually, I don’t think that choice is very important, most of the hard work is orthogonal to that.

Slightly leaning towards using the native representation, so that we can maybe write some useful helper functions in Motoko.

@nomeata
Copy link
Collaborator Author

nomeata commented Feb 15, 2021

On second thought, it must be possible to implement the Candid decoding statically; it's just the partial evaluation of the dynamic one (using typrep) to the statically created typerep. But probably not a lot of fun, with the memorization, and lots of code bloat. So this just as an aside

@nomeata
Copy link
Collaborator Author

nomeata commented Feb 15, 2021

I'm not worried about the field name strings. We already embed them in the code (show, or in Candid deserialisation for the error message about missing fields); moving that into the type rep will only reduce bloat. And they will all be just pointers to statically allocated strings (ah: and hence always contiguous), so effectively just words. What is your worry about them?

@rossberg
Copy link
Contributor

Hm, now this is why I'm so reluctant to let hacks like show into the language. Eventually they become a legacy cost infecting other mechanisms.

Ideally, typereps should have minimal overhead and only contain the information necessary to implement the language semantics. Syntactic names, whether variables or labels, aren't semantically relevant up to equivalence, so shouldn't need to be represented at runtime.

But FWIW, this is a debugging feature, isn't it? In principle, we could reasonably produce text names only in debugging mode. Whether it's worth bothering is a different question.

@nomeata
Copy link
Collaborator Author

nomeata commented Feb 15, 2021

I am not sure. Pretty-printing, or decoding/encoding , are very fundamental use cases for any kind of generic or deriving features. And I really don't see the problem with representing them with a unique number that happens to be a pointer to static memory. 🤷‍♂️

@crusso
Copy link
Contributor

crusso commented Feb 16, 2021

Aren't the hashes a leaky abstraction in the first place? I mean, in principle, users need to worry about hash collisions, which depends on the concrete choice, not distinction, of names, even if collisions are rare in practice.

Using a table index to represent label names doesn't seem like a terrible idea, though it would complicate separate compilation (and representing static data) in a way that hashing avoids.

Is the Motoko TypRep sketched above useful as stands? Shouldn't it carry a phantom type index so we can actually to strongly typed interpretation of the typrep.

I imagine show and eq would have type:

show : <A> TypRep<A> -> A -> Text
eq: <A> TypRep<A> -> (A,A) -> Bool

But if we add indexing, what would TypRep even look like for quantified function types or records/variants? I think PureScript uses row types to deals with these, but we don't have them.

@nomeata
Copy link
Collaborator Author

nomeata commented Feb 16, 2021

Using a table index to represent label names doesn't seem like a terrible idea, though it would complicate separate compilation (and representing static data) in a way that hashing avoids.

What is the advantage of a table over simply treating it like we treat other static string constants?

@nomeata
Copy link
Collaborator Author

nomeata commented Feb 16, 2021

Is the Motoko TypRep sketched above useful as stands?

I was am discussing the representation of TypRep in memory here. I expect the use initial cases to be “compile.ml generates typrep; RTS uses it”. You seem to be thinking one step further, about exposing typrep in IR (or source?). I agree that then we want some phantom type thingy (or make it a primitive type former).

@nomeata nomeata changed the title Type representation design Type representation representation design Feb 16, 2021
@crusso
Copy link
Contributor

crusso commented Feb 16, 2021

Using a table index to represent label names doesn't seem like a terrible idea, though it would complicate separate compilation (and representing static data) in a way that hashing avoids.

What is the advantage of a table over simply treating it like we treat other static string constants?

You mean dedup and allocate in static memory, and use static address for equality? I guess that would work for now, but break on separate compilation. I was imaging some scheme where each sep. compiled module has a table used to lookup runtime representations of labels, with tables patched on linking so they agree on label representation.

@nomeata
Copy link
Collaborator Author

nomeata commented Feb 16, 2021

Where do we need (at runtime) to compare the strings? We need them

  • to implement show,
  • for error messages when deserialization
  • to calculate candid hashes from
  • to calculate Motoko hashes from
  • (anything else?)

For the hashes, it may be worth including them in the typrep, to avoid doing the hashing dynamically, e.g. something like #record : [(Text; Int32; Int32; TypRep)];.

But I don’t see a use-case for comparing the labels of two type reps, unleess I am missing something. (Or did you think I was proposing to change our value representation as well, replacing field hashes with pointers to strings? That was not my intention here, sorry if I was unclear.)

@crusso
Copy link
Contributor

crusso commented Feb 16, 2021

But I don’t see a use-case for comparing the labels of two type reps, unleess I am missing something. (Or did you think I was proposing to change our value representation as well, replacing field hashes with pointers to strings? That was not my intention here, sorry if I was unclear.)

I did, in fact, interpret it this way...

@nomeata
Copy link
Collaborator Author

nomeata commented Mar 9, 2021

Playing around a little bit with this. I think it makes sense to pursue approach one and use Motoko-values (or rather, IR values), if only for the singular reason that we can generate the values in an IR-to-IR pass, so that we get type-checking of the result, and can run the IR interpreter on them.

I am pondering if we should allow cyclic values in our IR language, so that the typrep of

type T = ??T;

which looks like this in IR

  (LetD (VarP @typ_rep<0=?!0>) (PrimE (TagE opt) (VarE @typ_rep<0=?!0>)))

will compile. Would that be a viable direction?

@nomeata
Copy link
Collaborator Author

nomeata commented Mar 10, 2021

Would that be a viable direction?

Hmm, not sure. We’d have to

  • beef up the interpreter (at least the IR interpreter) to be able to handle cyclic data structures
  • no longer get the “accessing identifier before its definition” error from the IR interpreter (because we’d have to allow that)

Probably too big of a change.

I guess I can insert

type TypRep = {
  #rec : { mut typ : TypRep };  // or #rec : [mut TypRep], more effiicent
  …

into the value representation; this would allow me to tie the knot, and consumers of that type can just follow that dereference. This would work in the interpreter, but we’d get these extra indirections in the backend.

Maybe not too bad, I think I can still compile that to static memory. These mutable cells would become GC roots, because the backend doesn’t know that they are write-once.

Maybe good enough to get started, the rest is just optimization.

@nomeata
Copy link
Collaborator Author

nomeata commented Mar 10, 2021

Maybe good enough to get started, the rest is just optimization.

So, this seems to work. Not the most efficent in-memory representation right now, but maybe first get this working, and then refactor if needed (e.g. use less #rec indirections, make sure it can be compiled into static memory.)

Guess the next step is to actually use the type rep for something interesting, such as pretty-printing.

I am a bit torn whether I want to implement that completely in rust, or in compile.ml (with maybe some helper functions in internals.mo), or some mix.

@rossberg rossberg added compiler Motoko → Wasm P2 medium priority, resolve within a couple of milestones typing Involves the type system labels May 18, 2021
@nomeata
Copy link
Collaborator Author

nomeata commented Jan 14, 2022

NB: with dfinity/candid#311 changing the way Candid decoding works again we don't need this anymore, so the second bullet in the original motivation doesn't apply anymore. The rest does, but makes this less pressing.

@ggreif ggreif changed the title Type representation representation design Type representation design Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler Motoko → Wasm P2 medium priority, resolve within a couple of milestones typing Involves the type system
Projects
None yet
Development

No branches or pull requests

3 participants