`CodeBlock` source mapping #862

vlopes11 · 2023-04-17T21:50:22Z

vlopes11
Apr 17, 2023

A [CodeBlock] is either a pointer to a list of [CodeBlock] (Join, Split, Loop, Call, Proxy), or a list of [Operation] (Span). It is indexed by its [Digest] root.

A function F(CodeBlock::Span) |-> x will also map the [CodeBlock] itself.

CodeBlock::Join(Span(a), Span(b)) |-> (x, y) : F(Span(a)) |-> x, F(Span(b)) |-> y

A mapping from the root of a [CodeBlock::Span] to a list of code locations, given that a span is an ordered list of operations, will ultimately map any operation of a [CodeBlock] to a unique [SourceLocation]. This extends, by recursion, to the other variants of [CodeBlock]. Therefore, given F, we can map any [Operation] of a [CodeBlock] to a unique [SourceLocation].
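A minimal sketch of the recursion behind F, assuming toy stand-ins for the real types: this `CodeBlock` only has `Span` and `Join`, a span stores one location per operation, and the result is flattened into a single list. The shapes and the `locations` function are hypothetical and not the assembler's actual API.

```rust
// Toy stand-in for the real SourceLocation (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceLocation {
    pub line: u32,
    pub column: u32,
}

// Toy stand-in for CodeBlock with only two variants.
pub enum CodeBlock {
    // In this sketch a span carries one location per operation.
    Span(Vec<SourceLocation>),
    Join(Box<CodeBlock>, Box<CodeBlock>),
}

// F: collect the location of every operation, in execution order.
pub fn locations(block: &CodeBlock) -> Vec<SourceLocation> {
    match block {
        CodeBlock::Span(locs) => locs.clone(),
        CodeBlock::Join(first, second) => {
            // F(Join(a, b)) is built from F(a) followed by F(b).
            let mut out = locations(first);
            out.extend(locations(second));
            out
        }
    }
}
```

The other compound variants (Split, Loop) would recurse the same way over their children.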

In order to create such a map, we need to:

1. Link the parsed [Token] with a source location
2. Extend the link from the parsed [Token] to the [Instruction]
3. Extend the link from the [Instruction] to the [Operation]

And, whenever we create a list of [Operation] for a [CodeBlock::Span], we link the root of such span to the list of [SourceLocation].

The PR #861 introduces the foundation that will unblock [1].

To unblock [2], we need a mapping from an [Instruction] to a [SourceLocation]. This must be done efficiently to avoid an unnecessary increase in memory usage during compilation. A sequence of parsed [Instruction]s is represented as a [ProgramAst], which is a sequence of [Node]; [Node] behaves similarly to [CodeBlock], so we can use a similar approach: map a unique identifier of the [ProgramAst] to an ordered list of [SourceLocation] to achieve an Instruction |-> SourceLocation mapping.

From [2], we extend this mapping to [AssemblyContext], and transpose the Instruction |-> SourceLocation to Operation |-> SourceLocation. This is trivial because every instruction always compiles to one or more [Operation]; we just take the instruction map and extend for the [Operation] count created for a [CodeBlock::Span].
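A minimal sketch of that transposition, assuming each instruction is available as a pair of its location and the number of operations it compiled to. The input shape and the `expand_to_operations` name are hypothetical.

```rust
// Toy stand-in for the real SourceLocation (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceLocation {
    pub line: u32,
    pub column: u32,
}

// Repeat each instruction's location once per operation it produced,
// turning Instruction |-> SourceLocation into Operation |-> SourceLocation.
pub fn expand_to_operations(
    instructions: &[(SourceLocation, usize)],
) -> Vec<SourceLocation> {
    instructions
        .iter()
        .flat_map(|&(loc, op_count)| std::iter::repeat(loc).take(op_count))
        .collect()
}
```

The resulting vector is indexed by operation position within the span, so an instruction that compiles to two operations contributes the same location twice.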

bobbinth · 2023-04-18T08:49:09Z

bobbinth
Apr 18, 2023
Maintainer

Link the parsed [Token] with a source location

Extend the link from the parsed [Token] to the [Instruction]

Extend the link from the [Instruction] to the [Operation]

I agree with the first two points, but I was thinking about the third one somewhat differently. Specifically, the task of a source map would be to map an operation to a source location, probably bypassing the need to map operations to instructions. The way this could work is as follows:

Every cycle in the VM (i.e., when we execute a program via execute_iter) can be uniquely identified by the path in the MAST and operation index. To provide this info, we'd need to add something like block_hash to the VmState struct. Then, the task of a source map would be to map:

(mast_path, op_idx) -> SourceLocation

For MAST path, we don't actually need to store the full path - probably hash of the path (using a fast hash function - i.e., BLAKE3 with 160-bit output) will be sufficient.

For example, for a program:

JOIN
    SPAN push.1 push.2 add END
    SPAN push.3 push.4 mul END
END

MAST paths for blocks would be:

JOIN: hash(join_root)
SPAN1: hash(join_root | span1_root)
SPAN2: hash(join_root | span2_root)

Then, for example, when we execute push.2 on the VM, the caller of execute_iter would be able to uniquely identify it as

hash(join_root | span1_root)::2

And then the source map would tell us that hash(join_root | span1_root)::2 maps to (line=1, column=16) in the source.
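A minimal sketch of such a source map, assuming block roots are represented as plain strings and using the standard library's `DefaultHasher` purely as a placeholder for the fast hash mentioned above (it is not BLAKE3). `SourceMap`, `path_hash`, `insert` and `lookup` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Toy stand-in for the real SourceLocation (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceLocation {
    pub line: u32,
    pub column: u32,
}

// Hash a path of block roots. DefaultHasher is only a placeholder here.
pub fn path_hash(roots: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for root in roots {
        root.hash(&mut hasher);
    }
    hasher.finish()
}

// Maps (hash of MAST path, operation index) to a source location.
#[derive(Default)]
pub struct SourceMap {
    entries: HashMap<(u64, usize), SourceLocation>,
}

impl SourceMap {
    pub fn insert(&mut self, path: &[&str], op_idx: usize, loc: SourceLocation) {
        self.entries.insert((path_hash(path), op_idx), loc);
    }

    pub fn lookup(&self, path: &[&str], op_idx: usize) -> Option<SourceLocation> {
        self.entries.get(&(path_hash(path), op_idx)).copied()
    }
}
```

Only the path hash is stored, so the map stays small even for deeply nested programs.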


vlopes11 · 2023-05-22T11:35:24Z

vlopes11
May 22, 2023
Author

We might want an optional context that will provide mappings from a CodeBlock under execution context to a sequence of locations. Some preliminary remarks:

In CodeBlock, we might have CodeBlock::Span directly in case of very simple programs, such as:

begin add end

The CodeBlock is an enum that contains compound blocks (join, split, loop) and plain sequences (span). It also contains zero-operation blocks (call, proxy).

All of the above must be bound to source locations. This means a location does not necessarily map from an operation. Take Call as an example: a call exists as an AST instruction, so it must be mapped to a location.

We also have another exception: Join. It will not have a mapped instruction, as it is generated arbitrarily to optimize the CodeBlock execution and has no instruction counterpart. In other words, there is no linkable MASM instruction that will generate a Join. However, this exception can be ignored; a user, for instance, cannot set a breakpoint on a join, as it won't be decorated nor have an associated instruction, so we can just use SourceLocation::default there.

Considering this exception treatment for Join, we can define that all CodeBlock variants will contain, at least, one SourceLocation that will point to the block initialization. In the prior MASM example, it will point to begin.
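A minimal sketch of that rule, assuming a toy `SourceLocation` that derives `Default` and a hypothetical `BlockKind` enum standing in for the CodeBlock variants. The `declaration_location` helper is illustrative only.

```rust
// Toy stand-in for the real SourceLocation (assumed shape).
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct SourceLocation {
    pub line: u32,
    pub column: u32,
}

// Hypothetical tag for the CodeBlock variants.
pub enum BlockKind {
    Join,
    Split,
    Loop,
    Call,
    Proxy,
    Span,
}

// Every block gets at least one location pointing to its declaration.
// A Join has no instruction counterpart, so it falls back to the default.
pub fn declaration_location(
    kind: &BlockKind,
    parsed: Option<SourceLocation>,
) -> SourceLocation {
    match kind {
        BlockKind::Join => SourceLocation::default(),
        _ => parsed.unwrap_or_default(),
    }
}
```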

We aim to minimize the execution overhead for blocks that don't contain locations. The compilation time for Node increased by 0.89% after introducing a wrapper CodeBody that maps instructions to a vector of SourceLocation. We pay this overhead even when there are no locations and no heap allocation (i.e. we use [].to_vec()).

As suggested, we could have a map from a CodeBlock to a sequence of SourceLocation and use this map only for execute_iter so execute would never suffer such overhead. However, we have an edge case to consider:

export.foo add end
export.bar add end

Both procedures above will contain the same span CodeBlock root, but their locations exist in different places. This means the CodeBlock root alone is not sufficient as the index of this map.
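A minimal sketch of the collision, assuming the root is a plain `u64` placeholder for the real digest and using the procedure name as a hypothetical extra key component. Both indexing functions are illustrative only.

```rust
use std::collections::HashMap;

type Root = u64; // placeholder for the real digest
type Location = (u32, u32); // (line, column)

// Index by root alone: identical spans overwrite each other.
pub fn index_by_root(
    blocks: &[(&str, Root, Vec<Location>)],
) -> HashMap<Root, Vec<Location>> {
    let mut map = HashMap::new();
    for (_, root, locs) in blocks {
        map.insert(*root, locs.clone());
    }
    map
}

// Index by (context, root): both entries survive.
pub fn index_by_context(
    blocks: &[(&str, Root, Vec<Location>)],
) -> HashMap<(String, Root), Vec<Location>> {
    let mut map = HashMap::new();
    for (ctx, root, locs) in blocks {
        map.insert(((*ctx).to_string(), *root), locs.clone());
    }
    map
}
```

With the two `add` procedures above, the root-only index keeps a single entry, while the context-aware index keeps one per procedure.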

The suggestion above mentions using the MAST path to index the locations, and that might work, but we don't have such ability yet.

We could introduce a CodeBlockIdentifier that will allow the creation of such path for more complex situations, like this one

export.foo add end
begin call.foo end

In the example above, we have a call function that will be a single operation inside the begin body. The MAST paths of the snippet above would be [main, call] -> [loc(begin), loc(call), loc(end)], [main, call, foo] -> [loc(export.foo), loc(add), loc(end)].

There are exceptions, such as an exec instruction that is mapped to a single location but compiles to multiple operations in distinct locations; we can treat such exceptions in the future.

The main initial change would be in the internal implementation of Assembler::compile: for each compiled body, it would keep track of the MAST path and associate it with an optional context. Alternatively, we can create an Assembler::compile_with_locations to avoid such overhead in Assembler::compile.

Assembler::compile_with_locations would return a context that contains the aforementioned mapping, and such mapping would be passed to fn execute_iter; alternatively, we can embed that mapping into Program, having it filled optionally.


bobbinth May 23, 2023
Maintainer

I'm sorry - but I'm still not following. Are these calls listed in reverse order then?

add_block([], CodeBlock(join(begin, push, split(add, mul))), locations(push, if.true, end))
add_block([if.true], CodeBlock(split([add, if.true], mul)), locations(if.true, add, if.true, mul, end))
add_block([if.true, if.true], CodeBlock(split(mtree_get, mtree_set)), locations(if.true, mtree_get, mtree_set, end))

And why are we skipping SPAN blocks here (i.e., not calling add_block() for SPAN blocks)?

Another thing which is unclear is how we'd map locations to operations within a span. Is the mapping supposed to be done inside add_block()? As far as I can tell, add_block() takes locations as the 3rd parameter - but because an instruction could map to multiple operations, it needs some extra info to figure out the mapping.

bobbinth May 23, 2023
Maintainer

Could we do a slightly simpler example:

begin
  push.1
  if.true
    add
    assertz
  else
    mul
  end
end

The structure of this program (i.e., the actual operations which will be executed by the VM) is:

JOIN
  SPAN
    PAD
  END
  SPLIT
    SPAN
      ADD
      EQZ
      ASSERT
    END
    SPAN
      MUL
    END
  END
END

How would we use add_block() to process this program?

vlopes11 May 23, 2023
Author

The order of the calls is more of an implementation detail as the map doesn't depend on it - an entry won't ever be replaced given the unique property of the keys. We will, in practice, be calling that function once we backtrack from a compiled node, but that order does not define the correctness of the result.

first :=
  SPAN
    PAD
  END

second :=
  SPAN
    ADD
    EQZ
    ASSERT
  END

third :=
  SPAN
    MUL
  END

split :=
  SPLIT
    second
    third
  END

full :=
  JOIN
    first
    split
  END

add_block([], full, [(1, 1), (2, 3), (3, 3), (9, 1)])
add_block([(SPLIT, 0)], second, [(3, 3), (4, 5), (5, 5), (5, 5), (8, 3)])
add_block([(SPLIT, 1)], third, [(3, 3), (7, 5), (8, 3)])

With such calls, we will end up with a map described as follows:

H([], full.root) -> [(1, 1), (2, 3), (3, 3), (9, 1)]
H([(SPLIT, 0)], second.root) -> [(3, 3), (4, 5), (5, 5), (5, 5), (8, 3)]
H([(SPLIT, 1)], third.root) -> [(3, 3), (7, 5), (8, 3)]

When we are executing, let's say we have a breakpoint on the first assert. We will halt on the following node of the CodeBlock: JOIN.1.SPLIT.0.2: ASSERT. From the nodes of such a CodeBlock, we can fully compute the key H([(SPLIT, 0)], second.root), fetch the locations [(3, 3), (4, 5), (5, 5), (5, 5), (8, 3)], and pick the entry at index 3, that is, operation index 2 shifted by one because the first entry is always reserved for the block declaration: (5, 5)
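A minimal sketch of add_block and the lookup described above, assuming roots are `u64` placeholders, a path step is a (kind, index) pair, and `DefaultHasher` stands in for H. `LocationMap` and `locate` are hypothetical names, and the signature of `add_block` here is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

pub type Location = (u32, u32); // (line, column)
pub type PathStep = (&'static str, usize); // e.g. ("SPLIT", 0)

#[derive(Default)]
pub struct LocationMap {
    entries: HashMap<u64, Vec<Location>>,
}

// H(path, root); DefaultHasher is only a placeholder hash.
fn key(path: &[PathStep], root: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    root.hash(&mut hasher);
    hasher.finish()
}

impl LocationMap {
    pub fn add_block(&mut self, path: &[PathStep], root: u64, locations: Vec<Location>) {
        self.entries.insert(key(path, root), locations);
    }

    // Entry 0 is the block declaration, so operation `op_idx`
    // is stored at position `op_idx + 1`.
    pub fn locate(&self, path: &[PathStep], root: u64, op_idx: usize) -> Option<Location> {
        self.entries.get(&key(path, root))?.get(op_idx + 1).copied()
    }
}
```

Because each key is unique, the order in which add_block is called does not affect the final map.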

bobbinth May 23, 2023
Maintainer

OK - I think I'm getting it now. So, when we execute a program via execute_iter(), for every cycle we'd output an identifier. This identifier would consist of two things: (1) hash of the path to the current node, and (2) an index to a location within the current node.

So, for the program above, it would be something like:

JOIN      H([JOIN]): 0
SPAN      H([JOIN, 0, SPAN]): 0
PAD       H([JOIN, 0, SPAN]): 1
END       H([JOIN, 0, SPAN]): 2
SPLIT     H([JOIN, 1, SPLIT]): 0
SPAN      H([JOIN, 1, SPLIT, 0, SPAN]): 0  
ADD       H([JOIN, 1, SPLIT, 0, SPAN]): 1  
EQZ       H([JOIN, 1, SPLIT, 0, SPAN]): 2  
ASSERT    H([JOIN, 1, SPLIT, 0, SPAN]): 3  
END       H([JOIN, 1, SPLIT, 0, SPAN]): 4  
SPAN      H([JOIN, 1, SPLIT, 1, SPAN]): 0
MUL       H([JOIN, 1, SPLIT, 1, SPAN]): 1
END       H([JOIN, 1, SPLIT, 1, SPAN]): 2
END       H([JOIN, 1, SPLIT]): 1
END       H([JOIN]): 1

Is this correct?
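A minimal sketch that reproduces the listing above by walking a toy block tree, assuming a hypothetical `Block` enum with only Span, Join and Split, and leaving the path unhashed (as a string) for readability. The `trace` function is illustrative and is not what execute_iter does today.

```rust
// Toy block tree; operations are plain names.
pub enum Block {
    Span(Vec<&'static str>),
    Join(Box<Block>, Box<Block>),
    Split(Box<Block>, Box<Block>),
}

// Push (label, path id, index) triples in visiting order.
pub fn trace(block: &Block, prefix: &[String], out: &mut Vec<(String, String, usize)>) {
    let (name, children): (&str, Vec<&Block>) = match block {
        Block::Span(_) => ("SPAN", vec![]),
        Block::Join(a, b) => ("JOIN", vec![&**a, &**b]),
        Block::Split(a, b) => ("SPLIT", vec![&**a, &**b]),
    };
    let mut path = prefix.to_vec();
    path.push(name.to_string());
    let id = format!("H([{}])", path.join(", "));

    // Index 0 is reserved for the block declaration.
    out.push((name.to_string(), id.clone(), 0));
    let mut next = 1;
    if let Block::Span(ops) = block {
        for op in ops {
            out.push((op.to_string(), id.clone(), next));
            next += 1;
        }
    }
    // Each child path is suffixed with the child's index in its parent.
    for (i, child) in children.iter().enumerate() {
        let mut child_prefix = path.clone();
        child_prefix.push(i.to_string());
        trace(child, &child_prefix, out);
    }
    out.push(("END".to_string(), id, next));
}
```

This sketch visits both branches of a SPLIT to mirror the listing; a real execution would only enter the branch that is taken.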

vlopes11 May 23, 2023
Author

Could be that! We have some options to make the key unique, and this is one that would work well: we suffix each node with its index, if the node contains more than one variant.

The code root suffix is optional and can be discarded, but the information is available. What do you think if I make a proof of concept PR with this approach?
