diff --git a/proposals/0018-work-graphs.md b/proposals/0018-work-graphs.md new file mode 100644 index 00000000..2bb0212e --- /dev/null +++ b/proposals/0018-work-graphs.md @@ -0,0 +1,579 @@ + + +# Work Graphs + +* Proposal: [0018](0018-work-graphs.md) +* Author(s): [Chris Bieneman](https://github.com/llvm-beanz), + [Greg Roth](https://github.com/pow2clk) +* Sponsor: [Greg Roth](https://github.com/pow2clk), [Tex + Riddell](https://github.com/tex3d) +* Status: Complete +* Version: Shader Model 6.8 + +## Introduction + +This document describes Work Graphs, a new feature for GPU-based work generation +built on new Shader Model 6.8 DXIL features. + +Full documentation of the Work Graphs feature including DirectX runtime, HLSL, +and DXIL documentation is available in the [Work Graphs Spec on +GitHub](https://microsoft.github.io/DirectX-Specs/d3d/WorkGraphs.html). This +document subsumes that document specifically as it relates to HLSL and the +compiler features. + +## Motivation + +Current language and runtime limitations make generating work on GPU threads +insufficient to meet the needs of some workloads. If existing GPU features (like +ExecuteIndirect) can't sufficiently generate work the application must generate +the work from the CPU resulting in unnecessary round-tripping between the GPU +and CPU. The Work Graphs feature solves this problem by enabling more robust +GPU-based work creation APIs. + +## Proposed solution + +### HLSL Additions + +The Work Graphs feature allows an application to specify a set of tasks as +_nodes_ in a graph representing a more complex workload. Each _node_ has a fixed +shader which takes one or more _input records_ as input and can produce one or +more _output records_ as output. + +### Launch Modes + +_Node_ shaders have one of three _launch modes_: + +* Thread +* Broadcasting +* Coalescing + +_Thread launch_ nodes represent an individual thread of work that processes a +single _input record_. Thread launch nodes have no visible thread group, do not +require the `numthreads` attribute, have no access to groupshared memory, and +cannot use group-scope memory or sync barriers. + +_Broadcasting launch_ nodes represent a grid of work operating on a single +_input record_. Each input record to a broadcasting node launches a full +dispatch grid. The size of the dispatch grid can either be fixed for the node or +specified in the input. + +_Coalescing launch_ nodes represent a thread group operating on a shared array +of _input records_. The shader declares the maximum number of records per thread +group. + +### Records + +_Input records_ represent inputs to node shaders, and _output records_ represent +outputs from node shaders. Records can be singular or arrays. _Input records_ +can be read-only or read-write, while _output records_ are read-write but +uninitialized when created. + +## Detailed design + +### Node Entry Functions + +Node shaders are represented as entry functions built into library targets. Node shaders have +similar capabilities and execution semantics to compute shaders. Shader entries +annotated as `[shader("node")]` are usable as work graph nodes. + +As with all previous shader types, a shader entry function may have only one +shader stage annotation. + +#### Function Attributes + +All node shaders except _thread launch_ nodes must specify the thread group size +using the `[numthreads(, , )]` attribute. + +Node shaders support existing compute shader function annotations: `numthreads` + and `wavesize`. +They also have new annotations unique to node shaders. + +> Consistent with HLSL grammar, attribute names are case-insensitive. Any +> attributes that take string arguments, the string argument values are +> case-sensitive. + +**`[NodeLaunch("")]`** + +Valid values for node launch mode are `broadcasting`, `coalescing`, or `thread`. +If the `NodeLaunch` attribute is not specified on a node entry the default +launch mode is `broadcasting`. + +**`[NodeIsProgramEntry]`** + +Indicates a node that may be invoked as an entry directly by the API. +Shaders that can receive input records from both inside and outside the work +graph (i.e. from a command list) require this attribute. Graphs can have more +than one entry shader that receives inputs from outside the graph. +This property is implied if no other work graph nodes target this node, +so in that case the attribute is optional and may be omitted. + +**`[NodeID("", = 0)]`** + +Use the provided name to represent this node instead of the function name. +If present, the index parameter specifies the index into the named node array. +If this attribute is not present, the function name is the node name, +and the index is `0`. + +**`[NodeLocalRootArgumentsTableIndex()]`** + +The specified index indicates the record index into the local root arguments +table bound to the work graph. If this attribute is not specified and the shader +has a local root signature, the index defaults to an unallocated table location. + +**`[NodeShareInputOf("", = 0 )]`** + +Share the inputs for the node specified by name and optional index. +If present, the index parameter specifies the index into the named node array. +Both nodes must have identical input records, the same launch mode, and +identical dispatch grid size (if fixed). + +**`[NodeDispatchGrid(, , )]`** + +Specifies the size of the dispatch grid. Broadcast launch nodes must specify +either the dispatch grid size or maximum dispatch grid size in source. The `x` +`y` and `z` parameters individually cannot exceed (2^16)-1 (65535), and `x*y*z` +cannot exceed (2^24)-1 (16,777,215). + +**`[NodeMaxDispatchGrid(, , )]`** + +Specifies the maximum dispatch grid size when the dispatch grid size is +specified on the input record using the [`SV_DispatchGrid`](#sv_dispatchgrid) semantic. The `x` `y` +and `z` parameters individually cannot exceed (2^16)-1 (65535), and `x*y*z` cannot +exceed (2^24)-1 (16,777,215). + +**`[NodeMaxRecursionDepth()]`** + +Specifies the maximum recursion depth for a node. This attribute is required if +one of the outputs of a node is the ID of the node itself. + +#### Entry Parameters + +Specific types for input parameters depend on the node launch mode, but all node +shaders support two categories of parameters: + +* Zero or one [Node Input Parameter](#node-input-objects). +* Zero or more [Node Output Parameter(s)](#node-output-objects). + +##### Node Input Objects + +Node input objects come in three categories, one for each launch mode, with +read-only and read-write variants: + +* `{RW}ThreadNodeInputRecord` - for thread launch nodes. +* `{RW}DispatchNodeInputRecord` - for broadcasting launch nodes. +* `{RW}GroupNodeInputRecords`- for coalescing launch nodes. + +The pseudo-HLSL code below describes the basic interface of the +`{RW}{Thread|Dispatch|Group}NodeInputRecord{s}` classes: + +```c++ +namespace detail { +///@brief Common interfaces for Thread and Dispatch node input record types. +template class NodeInputRecordInterface { + /// @brief Get a copy of the underlying record. + RecordTy Get() const; + + /// @brief Get a writable reference to the underlying record. + /// + /// Only available for RW object variants. The non-const Get() returns a + /// reference to the underlying data. + std::enable_if_t::type &Get(); +}; + +/// @brief Interface for GroupNodeInputRecords and RWGroupNodeInputRecords +template class GroupNodeInputRecordsBase { + /// @brief Returns the number of records that have been coalesced into the + /// current thread group + + /// Returned value is in the range [1..._MaxCount_], where _MaxCount_ is + /// specified by the `[MaxRecords(_MaxCount_)]` attribute applied to the + /// parameter declaration. + uint Count() const; + + /// @brief Get a copy of the underlying record at the specified index. + RecordTy Get(uint Index) const; + + /// @brief Get a writable reference to the underlying record. + /// @param Index The record index to access. + /// + /// Only available for RW object variants. The non-const Get() returns a + /// reference to the underlying data. + std::enable_if_t::type &Get(uint Index); + + /// @brief Get a copy of the underlying record at the specified index. + /// @param Index The record index to access. + RecordTy operator[](uint Index) const; + + /// @brief Get a writable reference to the underlying record. + /// @param Index The record index to access. + /// + /// Only available for non-RW object variants. The non-const operator[] + /// returns a reference to the underlying data. + std::enable_if_t::type &operator[](uint Index); +}; + +} // namespace detail + +template +using ThreadNodeInputRecord = detail::NodeInputRecordInterface; + +template +using RWThreadNodeInputRecord = + detail::NodeInputRecordInterface; + +template +using DispatchNodeInputRecord = + detail::NodeInputRecordInterface; + +template +class RWDispatchNodeInputRecord + : public detail::NodeInputRecordInterface { + /// @brief Allows thread groups to coordinate reading and writing to a shared + /// set of records. + /// @returns Returns `false` for thread groups that are not the last to finish + /// and `true` for the last thread group to call this method. + /// + /// This method must be called by all threads in a dispatch or not called at + /// all. The call must be in dispatch grid uniform control flow. The callsite + /// must be uniform across all threads in all thread groups in the dispatch + /// grid. This method may be called at most once per thread. + /// + /// Any violation of these requirements is undefined behavior. + /// + /// This method returns `false` for thread groups that are not the last to + /// finish and these thread groups are not allowed to read or write to the + /// input. Reading or writing the input after this call on a thread that + /// returned `false` is undefined behavior. + /// + /// This method returns `true` for the last thread group to finish. That + /// thread group can continue reading and writing to the input. + bool FinishedCrossGroupSharing(); +}; + +template +using GroupNodeInputRecords = detail::GroupNodeInputRecordsBase; + +template +using RWGroupNodeInputRecords = detail::GroupNodeInputRecordsBase; +``` + +Coalescing launch nodes also accept the `EmptyNodeInput` input object for cases +without record data. The pseudo-HLSL interface for `EmptyNodeInput` is: + +```c++ +class EmptyNodeInput { + /// @brief Returns the number of records that have been coalesced into the + /// current thread group. + /// + /// Returns 1..._MaxCount_, where _MaxCount_ is specified by the + /// `[MaxRecords(_MaxCount_)]` attribute applied to the parameter declaration. + uint Count() const; +}; +``` + +##### System Value Parameters + +Broadcasting and Coalescing Launch shaders support a subset of compute shader +system value inputs. +These have the same types, meanings, and usages as they do for compute shaders. + +|system value semantic | supported launch modes | description | +|------------------------------|:---------:|-------------| +| `SV_GroupThreadID` | Broadcasting, Coalescing | Thread ID within group | +| `SV_GroupIndex` | Broadcasting, Coalescing | Flattened thread index within group | +| `SV_GroupID` | Broadcasting | Group ID within dispatch | +| `SV_DispatchThreadID` | Broadcasting | Thread ID within dispatch | + +##### Node Output Objects + +Node output objects either allocate records or increment a counter for empty +records. The following pseudo-HLSL defines the interfaces for the `NodeOutput` + and `EmptyNodeOutput` objects: + +```c++ +template class NodeOutput { + /// @brief Allocate a new ThreadNodeOutputRecords for this thread. + /// @returns A handle that collects a set of thread output records. + /// @param NumRecords The number of records to return for the calling thread. + /// + /// ThreadNodeOutputRecords are per-thread output records. Each thread can + /// produce a different number of outputs which are each unique per thread. + /// + /// Must be called in thread group uniform control flow. The value of + /// `NumRecords` is not required to be uniform. If the value of `NumRecords` + /// is `0`, the object returned is zero-sized and cannot be indexed on that + /// thread. + ThreadNodeOutputRecords GetThreadNodeOutputRecords(uint NumRecords); + + /// @brief Allocate GroupNodeOutputRecords for this thread group. + /// @returns A handle that collects a set of group node output records. + /// @param NumRecords The number of records to return. + /// + /// GroupNodeOutputRecords are per-group output records. The output record set + /// is shared across the thread group and the threads work cooperatively to + /// produce the output records. + /// + /// Must be called in thread group uniform control flow. The value of + /// `NumRecords` and `this` must be uniform across the thread group. If the + /// value of `NumRecords` is `0`, the object returned is zero-sized and cannot + /// be indexed. + /// + /// This method may not be called from _thread launch_ shaders since they do + /// not have a thread group. + GroupNodeOutputRecords GetGroupNodeOutputRecords(uint NumRecords); + + /// @brief Returns true if the specified output node is in the work graph. + bool IsValid() const; +}; + +class EmptyNodeOutput { + /// @brief Adds `Count` empty output records to the node output, where this + /// `Count` is specified per-thread. The total number added is the sum of + /// `Count` values for each thread in the group. + /// + /// Must be called in thread group uniform control flow. + void ThreadIncrementOutputCount(uint Count); + + /// @brief Adds `Count` empty output records to the node output, once for the + /// group, instead of summing the value across threads. + /// + /// Must be called in thread group uniform control flow. The value of + /// `Count` and `this` must be uniform across the thread group. + void GroupIncrementOutputCount(uint Count); + + /// @brief Identifies if the output node is valid to write to. + /// @returns True if the specified output node is in the work graph, or for + /// recursive nodes if the maximum recursion limit has not been reached. + bool IsValid() const; +}; +``` + +Array variations of the node output objects also exist exposing subscript +operators to index the individual output. The following pseudo-HLSL defines the +interfaces for the `NodeOutputArray` and `EmptyNodeOutputArray` objects: + +```c++ +namespace detail { +template class NodeOutputArrayBase { + /// @brief Returns the node output for the specified index. + /// @param Index The record index to access. + NodeOutputTy &operator[](uint Index); +}; +} // namespace detail + +template +using NodeOutputArray = detail::NodeOutputArrayBase>; + +using EmptyNodeOutputArray = detail::NodeOutputArrayBase; +``` + +Each node output can contain zero or more thread or group node output records +which feed into other nodes for processing. + +The following pseudo-HLSL defines the interfaces for the +`ThreadNodeOutputRecords` and `GroupNodeOutputRecords` objects: + +```c++ +namespace detail { +template class NodeOutputRecordsBase { + /// @brief Get a copy of the underlying record. + RecordTy &Get(uint Index); + + /// @brief Mark the output node as completed. + /// + /// Each thread producing an output must call `OutputComplete` at least once. + /// Calling `OutputComplete()` signals to the runtime that the node output + /// memory is finalized. The behavior of writes to the output after this call + /// is undefined. + /// + /// Calls to `OutputComplete` must be in thread group uniform control flow + /// otherwise the behavior is undefined. + void OutputComplete(); +}; +} // namespace detail + +template +using ThreadNodeOutputRecords = detail::NodeOutputRecordsBase; + +template +using GroupNodeOutputRecords = detail::NodeOutputRecordsBase; +``` + +##### Entry Parameter Attributes + +> Consistent with HLSL grammar attribute names are case-insensitive. Any +> attributes that take string arguments, the string argument values are +> case-sensitive. + +**`[MaxRecords()]`** + +Applies to node inputs in coalescing launch nodes or outputs for any launch mode. + +Required for node inputs for coalescing launch nodes, this attribute restricts +the maximum number of records per thread group. Implementations are not required +to fill to the specified maximum. + +When applied to node outputs, this attribute restricts the maximum number of +records produced to the output. When applied to a `NodeOutputArray`, the maximum +applies as the sum of all records across the output array, not per-node. + +Node outputs require either the `MaxRecords` or `MaxRecordsSharedWith` +attribute. + +**`[MaxRecordsSharedWith()]`** + +This attribute applies to node outputs. The named parameter must have the +`MaxRecords` attribute. This attribute and the `MaxRecords` attribute are +mutually exclusive. + +The node output that this attribute is applied to shares a maximum record +allocation with the named node output parameter. + +Node outputs require either the `MaxRecords` or `MaxRecordsSharedWith` +attribute. + +**`[NodeID("", = 0)]`** + +This attribute applies to output nodes and defines the name and index of the +output node. If not provided, the default index is 0. + +If this attribute is not present on an output, the default node ID for the +output is the name of the parameter, and the index is the default index (0). + +**`[AllowSparseNodes]`** + +This attribute applies to outputs and allows the work graph to be created even +if there is not a node defined for the specified output. If the output is an +array, each element may or may not have a downstream node defined in the graph. +`IsValid()` can be used to determine whether an output node is defined in the +graph. + +**`[NodeArraySize()]`** + +Specifies the output array size for `NodeOutputArray` or `EmptyNodeOutputArray` +objects. + +### New Built-in Functions + +#### GetRemainingRecursionLevels + +```c++ +/// @brief Returns the number of recursion levels remaining against the declared +/// `NodeMaxRecursionDepth`. +/// +/// Returns 0 for leaf nodes and if the current node is not recursive. +uint GetRemainingRecursionLevels(); +``` + +For nodes that recurse, the `GetRemainingRecursionLevels()` function returns the +number of remaining recursion levels before reaching the node's maximum +recursion depth. + +#### Barrier + +```c++ +enum MEMORY_TYPE_FLAG { + UAV_MEMORY = 0x00000001, + GROUP_SHARED_MEMORY = 0x00000002, + NODE_INPUT_MEMORY = 0x00000004, + NODE_OUTPUT_MEMORY = 0x00000008, + ALL_MEMORY = 0x0000000f, +}; + +enum SEMANTIC_FLAG { + GROUP_SYNC = 0x00000001, + GROUP_SCOPE = 0x00000002, + DEVICE_SCOPE = 0x00000004, +}; + +/// @brief Request a barrier for a set of memory types and/or thread group +/// execution sync. +/// @param MemoryTypeFlags Flag bits as defined by MEMORY_TYPE_FLAG. +/// @param SemanticFlags Flag bits as defined by SEMANTIC_FLAG. +/// +/// `Barrier` must be called from thread group uniform control flow when +/// `SemanticFlags` includes `GROUP_SYNC`. +void Barrier(uint MemoryTypeFlags, uint SemanticFlags); + +/// @brief Request a barrier for just the memory used by an object. +/// @param TargetObject The object or resource which owns the memory to apply +/// the barrier to. +/// @param SemanticFlags Flag bits as defined by SEMANTIC_FLAG. +/// +/// The TargetObject parameter can be a particular node input/output record +/// object or UAV resource. Groupshared variables are not currently supported. +/// +/// `Barrier` must be called from thread group uniform control flow when +/// `SemanticFlags` includes `GROUP_SYNC`. +void Barrier(Object TargetObject, uint SemanticFlags); +``` + +The Work Graphs feature introduces a new more flexible implementation of the memory barrier +functions. This function is available in all shader types (including non-node +shaders). + +The new `Barrier` function implements a superset of the existing memory barrier +functions which are still supported (i.e. `AllMemoryBarrier{WithGroupSync}()`, +`GroupMemoryBarrier{WithGroupSync}()`, `DeviceMemoryBarrier{WithGroupSync}()`). + +In the context of a node shader, `Barrier` enables requesting a memory barrier on +input and/or output record memory specifically, while the implementation is free +to store the data in any memory region. + +The pseudo-code below shows implementing the existing HLSL memory barrier +functions using the new `Barrier` function. + +```C++ +void AllMemoryBarrier() { Barrier(ALL_MEMORY, DEVICE_SCOPE); } + +void AllMemoryBarrierWithGroupSync() { + Barrier(ALL_MEMORY, DEVICE_SCOPE | GROUP_SYNC); +} + +void DeviceMemoryBarrier() { + Barrier(UAV_MEMORY, DEVICE_SCOPE); +} + +void DeviceMemoryBarrierWithGroupSync() { + Barrier(UAV_MEMORY, DEVICE_SCOPE | GROUP_SYNC); +} + +void GroupMemoryBarrier() { Barrier(GROUP_SHARED_MEMORY, GROUP_SCOPE); } + +void GroupMemoryBarrierWithGroupSync() { + Barrier(GROUP_SHARED_MEMORY, GROUP_SCOPE | GROUP_SYNC); +} + +``` + +### New Structure Attributes + +> Consistent with HLSL grammar attribute names are case-insensitive. Any +> attributes that take string arguments, the string argument values are +> case-sensitive. + +#### **`[NodeTrackRWInputSharing]`** + +If a `RWDispatchNodeInputRecord` is used for cross-group sharing and calls +`FinishedCrossGroupSharing`, the struct type `T` must have the +`[NodeTrackRWInputSharing]` attribute applied to it. This allocates memory in +the record allocation to track thread completion. + +### New Structure System Values + +#### SV_DispatchGrid + +`uint/uint2/uint3/uint16_t/uint16_t2/uint16_t3 SV_DispatchGrid` + +`SV_DispatchGrid` can optionally appear anywhere in a record. + +If the record arrives at a [broadcasting launch node](#launch-modes) that doesn't declare a fixed dispatch grid size via `[NodeDispatchGrid(x,y,z)]`, `SV_DispatchGrid` becomes the dynamic grid size used to launch at the node. The value has no special significance in other contexts. + +## Acknowledgments + +This spec is an extensive collaboration between the Microsoft HLSL and Direct3D +teams and IHV partners. + +Special thanks to Claire Andrews, Amar Patel, and Tex Riddell + +