Correct grammar and spelling mistakes in Readme.
(thanks ChatGPT!)
KarelPeeters committed Jan 29, 2024
1 parent 2ac867f commit 23206a3

## Overview

A neural network inference library, written in/for Rust. It can run ONNX files either on the CPU or on Nvidia GPUs using
CUDA/cuDNN/cuBLAS.

It is general enough to run all kinds of networks; it has been tested with:
* Simple fully connected networks

## System requirements

To use the CUDA crates, the appropriate libraries need to be installed on this system; they are not downloaded
automatically:

* CUDA (includes CUDA, cuBLAS, NVRTC): [installer](https://developer.nvidia.com/cuda-downloads), follow the
instructions.
Ensure that the environment variable `CUDA_PATH` points to the root directory of the install (i.e., `CUDA_PATH/bin/`
should exist).
* cuDNN: [archive file](https://developer.nvidia.com/cudnn), to be extracted to a location of your choosing.
If you choose the same location as `CUDA_PATH`, you don't need to do anything else. Otherwise, set the environment
variable `CUDNN_PATH` to the root directory of the cuDNN installation (i.e., `CUDNN_PATH/bin` should exist).

The project has been tested with CUDA `v12.2` and cuDNN version `v8.9.5`. Newer versions might work, but this is not
guaranteed since CUDA sometimes changes the name of or removes certain functions.

## Internals

The typical pipeline is shown in the first figure below.
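Roughly, the pipeline looks like this from user code. The module paths, function names, and signatures below are assumptions based on the crate names, so double-check them against the actual docs:

```rust
// Hedged sketch only: exact item paths and signatures are assumptions,
// not guaranteed to match the real kn-graph / kn-runtime API.
use kn_graph::dtype::DTensor;

fn run(inputs: &[DTensor]) -> Result<Vec<DTensor>, Box<dyn std::error::Error>> {
    // Load an ONNX file into the graph IR (assumed helper name and arguments).
    let graph = kn_graph::onnx::load_graph_from_onnx_path("model.onnx", false)?;
    // Optionally optimize; the graph is append-only, so a new graph is returned.
    let graph = kn_graph::optimizer::optimize_graph(&graph, Default::default());
    // Prepare the graph for a device (CPU or GPU) once, then evaluate it many times.
    let device = kn_runtime::Device::best();
    let prepared = device.prepare(graph, 1); // batch size 1, assumed parameter
    Ok(prepared.eval(inputs))
}
```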

### Graph IR

Central is the _Graph IR_, the intermediate representation for neural network graphs.

The structure is an [SSA](https://en.wikipedia.org/wiki/Static_single-assignment_form)-style directed acyclic graph,
where nodes are values with a shape, data type, and the operation that computes it. These values are abstract; they
don't have strides or memory locations yet.
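For intuition, here is a minimal self-contained sketch of such an SSA-style IR. It is not the actual `Graph`/`Operation` definition from `kn-graph`; it only illustrates the idea that nodes are values carrying a shape, a data type, and the operation that produced them, with no strides or memory attached:

```rust
/// Illustrative sketch of an SSA-style graph IR (not the real kn-graph types).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Value(usize); // index into the graph's node list

#[derive(Debug, Clone, Copy)]
enum DType { F32, I64 }

/// A few example operations; the real IR has many more (conv, matmul, reduce, ...).
#[derive(Debug, Clone)]
enum Operation {
    Input,
    Constant(Vec<f32>),
    Broadcast { input: Value },
    Binary { left: Value, right: Value, op: &'static str },
}

#[derive(Debug, Clone)]
struct ValueInfo {
    shape: Vec<usize>,
    dtype: DType,
    operation: Operation,
}

/// Append-only graph: pushing a node returns the `Value` that identifies it.
#[derive(Default, Debug)]
struct Graph {
    nodes: Vec<ValueInfo>,
}

impl Graph {
    fn push(&mut self, shape: Vec<usize>, dtype: DType, operation: Operation) -> Value {
        self.nodes.push(ValueInfo { shape, dtype, operation });
        Value(self.nodes.len() - 1)
    }
}

fn main() {
    // Build `input * 2.0` as abstract values: no strides or memory locations yet.
    let mut graph = Graph::default();
    let input = graph.push(vec![8, 16], DType::F32, Operation::Input);
    let two = graph.push(vec![], DType::F32, Operation::Constant(vec![2.0]));
    let two_b = graph.push(vec![8, 16], DType::F32, Operation::Broadcast { input: two });
    let _out = graph.push(vec![8, 16], DType::F32, Operation::Binary { left: input, right: two_b, op: "mul" });
    println!("graph has {} nodes", graph.nodes.len());
}
```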

The operations are similar to those of other frameworks but are kept as orthogonal as possible. Some example operations:
convolution, matmul, reshape, broadcast, slice, unary, binary, reduce, softmax, ...
See [the docs](https://docs.rs/kn-graph/latest/kn_graph/graph/enum.Operation.html) for the full list of graph
operations.

The graph can be constructed directly in code using
the [graph builder API](https://docs.rs/kn-graph/0.2.1/kn_graph/graph/struct.Graph.html), but for convenience, an ONNX
loader exists. It can read ONNX files and convert the supported subset of operations into those supported by the IR.

Because the graph IR is much more orthogonal than the ONNX specification, many ONNX operations are decomposed into separate steps. Some examples:

* ONNX binary operations implicitly broadcast their operands, but this step is a separate operation in the IR.
* ONNX convolution and matmul have a built-in optional bias operand; this also becomes a separate broadcast plus binary
addition operation.
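To make the bias example concrete, the snippet below spells out the three separate steps (matmul, broadcast, binary add) that hide inside a single ONNX matmul-with-bias node. It uses plain nested `Vec`s purely for illustration, not the library's tensor types:

```rust
// Illustrative only: the three IR-level steps hidden inside one ONNX matmul-with-bias node.
fn matmul(x: &[Vec<f32>], w: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (m, k, n) = (x.len(), w.len(), w[0].len());
    let mut y = vec![vec![0.0; n]; m];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                y[i][j] += x[i][p] * w[p][j];
            }
        }
    }
    y
}

fn main() {
    let x = vec![vec![1.0, 2.0], vec![3.0, 4.0]]; // [2, 2]
    let w = vec![vec![1.0, 0.0], vec![0.0, 1.0]]; // [2, 2]
    let bias = vec![0.5, -0.5];                   // [2]

    // Step 1: plain matmul, no bias.
    let y = matmul(&x, &w);
    // Step 2: explicit broadcast of the bias from [2] to [2, 2].
    let bias_b: Vec<Vec<f32>> = (0..y.len()).map(|_| bias.clone()).collect();
    // Step 3: explicit elementwise binary addition.
    let out: Vec<Vec<f32>> = y.iter().zip(&bias_b)
        .map(|(yr, br)| yr.iter().zip(br).map(|(a, b)| a + b).collect())
        .collect();

    println!("{:?}", out); // [[1.5, 1.5], [3.5, 3.5]]
}
```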

To figure out if an ONNX operation is supported, check the branches of the top-level match statement in the `visit_node`
function in [`load.rs`](https://github.com/KarelPeeters/Kyanite/blob/master/kn-graph/src/onnx/load.rs). Many common
operations are already implemented, and adding more operations shouldn't be too hard.

For a larger example of a typical graph, see [stable_diffusion_piece.svg](./docs/stable_diffusion_piece.svg), a small
section taken from the start of the stable diffusion model.

### Optimizer

The graph can optionally be optimized by the _optimizer_. Since the graph is append-only, a new graph is returned.

The optimizations that are currently implemented are:


### CPU executor

Finally, the graph needs to be executed. There is a simple _CPU executor_ that just directly runs each operation. No
major optimizations are attempted here, except for using BLAS routines for matmuls and im2col for convolutions. It's
important that this executor is as simple as possible because it serves as the baseline for unit tests that check the
correctness of the GPU executor.
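For reference, im2col is the standard trick of unrolling input patches into a matrix so that a convolution becomes a single matmul. A deliberately simplified version (single channel, stride 1, no padding; not the executor's actual code) looks like this:

```rust
/// Simplified im2col: turn each kxk patch of an hxw input into a column,
/// so that convolution reduces to a matrix multiplication.
/// Single channel, stride 1, no padding; returns a (k*k) x (out_h*out_w) matrix
/// stored row-major in a flat Vec.
fn im2col(input: &[f32], h: usize, w: usize, k: usize) -> Vec<f32> {
    let (out_h, out_w) = (h - k + 1, w - k + 1);
    let mut cols = vec![0.0; k * k * out_h * out_w];
    for ky in 0..k {
        for kx in 0..k {
            let row = ky * k + kx;
            for oy in 0..out_h {
                for ox in 0..out_w {
                    let col = oy * out_w + ox;
                    cols[row * (out_h * out_w) + col] = input[(oy + ky) * w + (ox + kx)];
                }
            }
        }
    }
    cols
}

fn main() {
    // 3x3 input, 2x2 kernel -> 4 patches of 4 elements each.
    let input = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0];
    let cols = im2col(&input, 3, 3, 2);
    // The convolution output is now `kernel (1x4) * cols (4x4)`, a plain matmul.
    assert_eq!(cols.len(), 16);
    println!("{:?}", cols);
}
```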

### Cuda Executor

The second (and more useful) way to run these graphs is with the _Cuda executor_. This involves running the graph
through the _Cuda Planner_, which outputs a predetermined schedule of Cuda operations and allocates the necessary memory
buffers. This is split out as a separate step so this expensive planning step only needs to be carried out once per
network architecture; the resulting plan can then be reused many times in the executor.

The planner has the following major responsibilities:

* Determine the memory layout of tensors: the strides and the memory offsets

  * This implicitly handles most reshape, broadcast, stride, ... operations (see the stride sketch after this list).
  * Buffers are also reused if possible, minimizing total memory usage. There is much room for improvement here;
    currently, this is just a single-pass algorithm (see the allocator sketch after this list).

[//]: # (TODO show memory usage graphs?)

* Decide which cuDNN/cuBLAS operations to run for convolutions and matmuls. If possible, operations are fused together.
Some examples:
  * cuDNN supports a single "convolution + residual + bias + relu" operation
  * cuBLAS matmuls can include a transpose of either input matrix and, equivalently, of the output by swapping the inputs.
  * cuDNN and cuBLAS operations sometimes include a "scalar" argument that is multiplied by some of the operands

[//]: # (TODO more fusing examples: cuBLAS + scale + transpose, are there others?)

* Compile custom kernels for the remaining scalar and compound operations using an _autokernel_ framework based on [NVRTC (Runtime Compilation)](https://docs.nvidia.com/cuda/nvrtc/index.html).

  * The operations handled by *autokernel* are: scalar operations, reduce, softmax, layernorm, gather.
  * Handwritten kernel templates are used, with details such as tensor shapes, strides, scalar operations, ... substituted in before compilation at runtime.
  * More operator fusion happens here:
    * Multiple scalar operations get compiled to a single kernel
    * Constant scalars are inlined
    * Some compound kernels support fusing input or output scalar operations
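As referenced in the first bullet above, here is a small self-contained stride sketch: it illustrates why choosing an offset and strides per tensor is enough to handle broadcast- and slice-style operations without copying any data. This is an illustration of the general idea, not the planner's actual code.

```rust
/// Illustrative tensor "view": data is never copied, only (offset, shape, strides) change.
#[derive(Debug, Clone)]
struct View {
    offset: usize,
    shape: Vec<usize>,
    strides: Vec<usize>, // in elements; 0 means "broadcast along this axis"
}

impl View {
    /// Contiguous row-major view over a freshly allocated buffer.
    fn contiguous(shape: Vec<usize>) -> View {
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        View { offset: 0, shape, strides }
    }

    /// Broadcast: prepend a new axis of length `n` with stride 0.
    fn broadcast_front(&self, n: usize) -> View {
        let mut shape = vec![n];
        shape.extend_from_slice(&self.shape);
        let mut strides = vec![0];
        strides.extend_from_slice(&self.strides);
        View { offset: self.offset, shape, strides }
    }

    /// Slice axis 0 starting at `start` with length `len`: only the offset changes.
    fn slice_front(&self, start: usize, len: usize) -> View {
        let mut shape = self.shape.clone();
        shape[0] = len;
        View {
            offset: self.offset + start * self.strides[0],
            shape,
            strides: self.strides.clone(),
        }
    }
}

fn main() {
    let base = View::contiguous(vec![4, 8]); // strides [8, 1]
    let sliced = base.slice_front(2, 2);     // offset 16, shape [2, 8]
    let bcast = sliced.broadcast_front(3);   // shape [3, 2, 8], stride 0 on the new axis
    println!("{:?}\n{:?}\n{:?}", base, sliced, bcast);
}
```

And the allocator sketch referenced above: one way a greedy, single-pass buffer-reuse scheme can work, assuming the schedule and each tensor's last use are known up front. Again, this is only an illustration of the idea, not Kyanite's actual algorithm.

```rust
use std::collections::HashMap;

/// One scheduled step: which tensor it produces, how many bytes that needs,
/// and which previously produced tensors it reads.
struct Step {
    output: usize,      // tensor id
    output_bytes: usize,
    inputs: Vec<usize>, // tensor ids
}

/// Greedy single-pass reuse: walk the schedule once, hand out buffers from a
/// free list when a large-enough one exists, and return a tensor's buffer to
/// the free list right after its last use.
fn assign_buffers(steps: &[Step]) -> HashMap<usize, usize> {
    // Precompute the last step at which each tensor is read.
    let mut last_use: HashMap<usize, usize> = HashMap::new();
    for (i, step) in steps.iter().enumerate() {
        for &t in &step.inputs {
            last_use.insert(t, i);
        }
    }

    let mut buffers: Vec<usize> = Vec::new(); // buffer id -> size in bytes
    let mut free: Vec<usize> = Vec::new();    // ids of currently unused buffers
    let mut assignment: HashMap<usize, usize> = HashMap::new(); // tensor id -> buffer id

    for (i, step) in steps.iter().enumerate() {
        // Reuse the first free buffer that is large enough, else allocate a new one.
        let buf = match free.iter().position(|&b| buffers[b] >= step.output_bytes) {
            Some(pos) => free.swap_remove(pos),
            None => {
                buffers.push(step.output_bytes);
                buffers.len() - 1
            }
        };
        assignment.insert(step.output, buf);

        // Release the buffers of tensors whose last use is this step.
        for &t in &step.inputs {
            if last_use.get(&t) == Some(&i) {
                if let Some(&b) = assignment.get(&t) {
                    free.push(b);
                }
            }
        }
    }
    assignment
}

fn main() {
    // a -> b -> c, where `a` dies after producing `b`, so `c` can reuse a's buffer.
    let steps = vec![
        Step { output: 0, output_bytes: 1024, inputs: vec![] },
        Step { output: 1, output_bytes: 1024, inputs: vec![0] },
        Step { output: 2, output_bytes: 512, inputs: vec![1] },
    ];
    println!("{:?}", assign_buffers(&steps));
}
```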


This final operator fusion can be significant and save a lot of redundant transfers to and from main memory. The same performance could be achieved by manually writing kernels for each used combination of operations, but the combinatorial explosion and associated maintenance would be huge.

An example generated scalar kernel with some handwritten clarifying comments is shown below:

## Comparison to other crates

See [Are We Learning Yet?](https://www.arewelearningyet.com/neural-networks/) for a full list of potential alternatives.

### Rust wrappers around existing runtimes

* PyTorch wrapper: [tch](https://crates.io/crates/tch)
* TensorFlow wrapper: [tensorflow](https://crates.io/crates/tensorflow)
* ONNXRuntime wrapper: [ort](https://github.com/pykeio/ort)

Positives:
* extensive support for many neural network operations
* support for many different backends (CPU, GPU (Nvidia + AMD), TPU, ...)

Negatives:
* not always great support for loading ONNX files (ort is great at this though, as the name suggests)
* large and somewhat black-box external dependency
* less operator fusion in many cases, although this is expected to improve in the future

Performance should be about the same as Kyanite for cases where operator fusion does not matter much; all libraries
mostly use the same underlying cuDNN and cuBLAS kernels.

### From-scratch Rust projects

* [tract](https://github.com/sonos/tract): larger coverage of the ONNX specification but only does CPU inference

## Development

While developing this crate, to update the ONNX proto, the [prost-build crate](https://crates.io/crates/prost-build) is
used. This requires that `protoc` is installed and that the `PROTOC` environment variable is set to point to the
executable. See their installation instructions (or the error message the build script shows if any) for more details.

To actually update the proto definition, replace `kn-graph/proto/onnx.proto3` with the newer version and run `cargo run --bin proto-to-rust`. Then commit both the `onnx.proto3` file and the generated `onnx.rs` file.
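For reference, the code generation itself boils down to a prost-build invocation along these lines. This is a sketch of what such a helper might do; the actual `proto-to-rust` binary and its output paths may differ:

```rust
// Sketch only: roughly what a proto-to-rust style helper does with prost-build.
// Requires `protoc` on the path or the PROTOC environment variable to be set.
fn main() -> std::io::Result<()> {
    prost_build::Config::new()
        // Write the generated file next to the checked-in onnx.rs (assumed path).
        .out_dir("kn-graph/src/onnx/proto")
        .compile_protos(&["kn-graph/proto/onnx.proto3"], &["kn-graph/proto/"])?;
    Ok(())
}
```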
