Correct grammar and spelling mistakes in Readme.
(thanks ChatGPT!)
KarelPeeters committed Jan 29, 2024
1 parent 2ac867f commit 23206a3

## Overview

A neural network inference library, written in/for Rust. It can run ONNX files either on the CPU or on Nvidia GPUs using
CUDA/cuDNN/cuBLAS.

It is general enough to run all kinds of networks; it has been tested with:
* Simple fully connected networks

## System requirements

To use the CUDA crates, the appropriate libraries need to be installed on this system; they are not downloaded
automatically:

* CUDA (includes CUDA, cuBLAS, NVRTC): [installer](https://developer.nvidia.com/cuda-downloads), follow the
instructions.
Ensure that the environment variable `CUDA_PATH` points to the root directory of the install (i.e., `CUDA_PATH/bin/`
should exist).
* cuDNN: [archive file](https://developer.nvidia.com/cudnn), to be extracted to a location of your choosing.
If you choose the same location as `CUDA_PATH`, you don't need to do anything else. Otherwise, set the environment
variable `CUDNN_PATH` to the root directory of the cuDNN installation (i.e., `CUDNN_PATH/bin` should exist).

The project has been tested with CUDA `v12.2` and cuDNN version `v8.9.5`. Newer versions might work, but this is not
guaranteed since CUDA sometimes changes the name of or removes certain functions.

## Internals

The typical pipeline is shown in the first figure below.
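Roughly, the pipeline looks like this from user code. The module paths, function names, and signatures below are assumptions based on the crate names, so double-check them against the actual docs:

```rust
// Hedged sketch only: exact item paths and signatures are assumptions,
// not guaranteed to match the real kn-graph / kn-runtime API.
use kn_graph::dtype::DTensor;

fn run(inputs: &[DTensor]) -> Result<Vec<DTensor>, Box<dyn std::error::Error>> {
    // Load an ONNX file into the graph IR (assumed helper name and arguments).
    let graph = kn_graph::onnx::load_graph_from_onnx_path("model.onnx", false)?;
    // Optionally optimize; the graph is append-only, so a new graph is returned.
    let graph = kn_graph::optimizer::optimize_graph(&graph, Default::default());
    // Prepare the graph for a device (CPU or GPU) once, then evaluate it many times.
    let device = kn_runtime::Device::best();
    let prepared = device.prepare(graph, 1); // batch size 1, assumed parameter
    Ok(prepared.eval(inputs))
}
```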

### Graph IR

Central is the _Graph IR_, the intermediate representation for neural network graphs.

The structure is an [SSA](https://en.wikipedia.org/wiki/Static_single-assignment_form)-style directed acyclic graph,
where nodes are values with a shape, data type, and the operation that computes it. These values are abstract; they
don't have strides or memory locations yet.
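For intuition, here is a minimal self-contained sketch of such an SSA-style IR. It is not the actual `Graph`/`Operation` definition from `kn-graph`; it only illustrates the idea that nodes are values carrying a shape, a data type, and the operation that produced them, with no strides or memory attached:

```rust
/// Illustrative sketch of an SSA-style graph IR (not the real kn-graph types).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Value(usize); // index into the graph's node list

#[derive(Debug, Clone, Copy)]
enum DType { F32, I64 }

/// A few example operations; the real IR has many more (conv, matmul, reduce, ...).
#[derive(Debug, Clone)]
enum Operation {
    Input,
    Constant(Vec<f32>),
    Broadcast { input: Value },
    Binary { left: Value, right: Value, op: &'static str },
}

#[derive(Debug, Clone)]
struct ValueInfo {
    shape: Vec<usize>,
    dtype: DType,
    operation: Operation,
}

/// Append-only graph: pushing a node returns the `Value` that identifies it.
#[derive(Default, Debug)]
struct Graph {
    nodes: Vec<ValueInfo>,
}

impl Graph {
    fn push(&mut self, shape: Vec<usize>, dtype: DType, operation: Operation) -> Value {
        self.nodes.push(ValueInfo { shape, dtype, operation });
        Value(self.nodes.len() - 1)
    }
}

fn main() {
    // Build `input * 2.0` as abstract values: no strides or memory locations yet.
    let mut graph = Graph::default();
    let input = graph.push(vec![8, 16], DType::F32, Operation::Input);
    let two = graph.push(vec![], DType::F32, Operation::Constant(vec![2.0]));
    let two_b = graph.push(vec![8, 16], DType::F32, Operation::Broadcast { input: two });
    let _out = graph.push(vec![8, 16], DType::F32, Operation::Binary { left: input, right: two_b, op: "mul" });
    println!("graph has {} nodes", graph.nodes.len());
}
```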

The operations are similar to those of other frameworks but are kept as orthogonal as possible. Some example operations:
convolution, matmul, reshape, broadcast, slice, unary, binary, reduce, softmax, ...
See [the docs](https://docs.rs/kn-graph/latest/kn_graph/graph/enum.Operation.html) for the full list of graph
operations.

The graph can be constructed directly in code using
the [graph builder API](https://docs.rs/kn-graph/0.2.1/kn_graph/graph/struct.Graph.html), but for convenience, an ONNX
loader exists. It can read ONNX files and convert the supported subset of operations into those supported by the IR.

Because the graph IR is much more orthogonal than the ONNX specification, many ONNX operations are decomposed into separate steps. Some examples:

* ONNX binary operations implicitly broadcast their operands, but this step is a separate operation in the IR.
* ONNX convolution and matmul have a built-in optional bias operand; this also becomes a separate broadcast plus binary
addition operation.
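To make the bias example concrete, the snippet below spells out the three separate steps (matmul, broadcast, binary add) that hide inside a single ONNX matmul-with-bias node. It uses plain nested `Vec`s purely for illustration, not the library's tensor types:

```rust
// Illustrative only: the three IR-level steps hidden inside one ONNX matmul-with-bias node.
fn matmul(x: &[Vec<f32>], w: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (m, k, n) = (x.len(), w.len(), w[0].len());
    let mut y = vec![vec![0.0; n]; m];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                y[i][j] += x[i][p] * w[p][j];
            }
        }
    }
    y
}

fn main() {
    let x = vec![vec![1.0, 2.0], vec![3.0, 4.0]]; // [2, 2]
    let w = vec![vec![1.0, 0.0], vec![0.0, 1.0]]; // [2, 2]
    let bias = vec![0.5, -0.5];                   // [2]

    // Step 1: plain matmul, no bias.
    let y = matmul(&x, &w);
    // Step 2: explicit broadcast of the bias from [2] to [2, 2].
    let bias_b: Vec<Vec<f32>> = (0..y.len()).map(|_| bias.clone()).collect();
    // Step 3: explicit elementwise binary addition.
    let out: Vec<Vec<f32>> = y.iter().zip(&bias_b)
        .map(|(yr, br)| yr.iter().zip(br).map(|(a, b)| a + b).collect())
        .collect();

    println!("{:?}", out); // [[1.5, 1.5], [3.5, 3.5]]
}
```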

To figure out if an ONNX operation is supported, check the branches of the top-level match statement in the `visit_node`
function in [`load.rs`](https://github.com/KarelPeeters/Kyanite/blob/master/kn-graph/src/onnx/load.rs). Many common
operations are already implemented, and adding more operations shouldn't be too hard.

For a larger example of a typical graph, see [stable_diffusion_piece.svg](./docs/stable_diffusion_piece.svg), a small
section taken from the start of the stable diffusion model.

### Optimizer

The graph can optionally be optimized by the _optimizer_. Since the graph is append-only, a new graph is returned.

The optimizations that are currently implemented are:


### CPU executor

Finally, the graph needs to be executed. There is a simple _CPU executor_ that just directly runs each operation. No
major optimizations are attempted here, except for using BLAS routines for matmuls and im2col for convolutions. It's
important that this executor is as simple as possible because it serves as the baseline for unit tests that check the
correctness of the GPU executor.
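For reference, im2col is the standard trick of unrolling input patches into a matrix so that a convolution becomes a single matmul. A deliberately simplified version (single channel, stride 1, no padding; not the executor's actual code) looks like this:

```rust
/// Simplified im2col: turn each kxk patch of an hxw input into a column,
/// so that convolution reduces to a matrix multiplication.
/// Single channel, stride 1, no padding; returns a (k*k) x (out_h*out_w) matrix
/// stored row-major in a flat Vec.
fn im2col(input: &[f32], h: usize, w: usize, k: usize) -> Vec<f32> {
    let (out_h, out_w) = (h - k + 1, w - k + 1);
    let mut cols = vec![0.0; k * k * out_h * out_w];
    for ky in 0..k {
        for kx in 0..k {
            let row = ky * k + kx;
            for oy in 0..out_h {
                for ox in 0..out_w {
                    let col = oy * out_w + ox;
                    cols[row * (out_h * out_w) + col] = input[(oy + ky) * w + (ox + kx)];
                }
            }
        }
    }
    cols
}

fn main() {
    // 3x3 input, 2x2 kernel -> 4 patches of 4 elements each.
    let input = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0];
    let cols = im2col(&input, 3, 3, 2);
    // The convolution output is now `kernel (1x4) * cols (4x4)`, a plain matmul.
    assert_eq!(cols.len(), 16);
    println!("{:?}", cols);
}
```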

### Cuda Executor

The second (and more useful) way to run these graphs is with the _Cuda executor_. This involves running the graph
through the _Cuda Planner_, which outputs a predetermined schedule of Cuda operations and allocates the necessary memory
buffers. This is split out as a separate step so this expensive planning step only needs to be carried out once per
network architecture; the resulting plan can then be reused many times in the executor.

The planner has the following major responsibilities:

* Determine the memory layout of tensors: the strides and the memory offsets

  * This implicitly handles most reshape, broadcast, stride, ... operations (see the stride sketch after this list).
  * Buffers are also reused if possible, minimizing total memory usage. There is much room for improvement here;
    currently, this is just a single-pass algorithm (see the allocator sketch after this list).

[//]: # (TODO show memory usage graphs?)

* Decide which cuDNN/cuBLAS operations to run for convolutions and matmuls. If possible, operations are fused together.
Some examples:
  * cuDNN supports a single "convolution + residual + bias + relu" operation
  * cuBLAS matmuls can include a transpose of either input matrix and, equivalently, of the output by swapping the inputs.
  * cuDNN and cuBLAS operations sometimes include a "scalar" argument that is multiplied by some of the operands

[//]: # (TODO more fusing examples: cuBLAS + scale + transpose, are there others?)

* Compile custom kernels for the remaining scalar and compound operations using an _autokernel_ framework based on [NVRTC (Runtime Compilation)](https://docs.nvidia.com/cuda/nvrtc/index.html).

  * The operations handled by *autokernel* are: scalar operations, reduce, softmax, layernorm, gather.
  * Handwritten kernel templates are used, with details such as tensor shapes, strides, scalar operations, ... substituted in before compilation at runtime.
  * More operator fusion happens here:
    * Multiple scalar operations get compiled to a single kernel
    * Constant scalars are inlined
    * Some compound kernels support fusing input or output scalar operations
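As referenced in the first bullet above, here is a small self-contained stride sketch: it illustrates why choosing an offset and strides per tensor is enough to handle broadcast- and slice-style operations without copying any data. This is an illustration of the general idea, not the planner's actual code.

```rust
/// Illustrative tensor "view": data is never copied, only (offset, shape, strides) change.
#[derive(Debug, Clone)]
struct View {
    offset: usize,
    shape: Vec<usize>,
    strides: Vec<usize>, // in elements; 0 means "broadcast along this axis"
}

impl View {
    /// Contiguous row-major view over a freshly allocated buffer.
    fn contiguous(shape: Vec<usize>) -> View {
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        View { offset: 0, shape, strides }
    }

    /// Broadcast: prepend a new axis of length `n` with stride 0.
    fn broadcast_front(&self, n: usize) -> View {
        let mut shape = vec![n];
        shape.extend_from_slice(&self.shape);
        let mut strides = vec![0];
        strides.extend_from_slice(&self.strides);
        View { offset: self.offset, shape, strides }
    }

    /// Slice axis 0 starting at `start` with length `len`: only the offset changes.
    fn slice_front(&self, start: usize, len: usize) -> View {
        let mut shape = self.shape.clone();
        shape[0] = len;
        View {
            offset: self.offset + start * self.strides[0],
            shape,
            strides: self.strides.clone(),
        }
    }
}

fn main() {
    let base = View::contiguous(vec![4, 8]); // strides [8, 1]
    let sliced = base.slice_front(2, 2);     // offset 16, shape [2, 8]
    let bcast = sliced.broadcast_front(3);   // shape [3, 2, 8], stride 0 on the new axis
    println!("{:?}\n{:?}\n{:?}", base, sliced, bcast);
}
```

And the allocator sketch referenced above: one way a greedy, single-pass buffer-reuse scheme can work, assuming the schedule and each tensor's last use are known up front. Again, this is only an illustration of the idea, not Kyanite's actual algorithm.

```rust
use std::collections::HashMap;

/// One scheduled step: which tensor it produces, how many bytes that needs,
/// and which previously produced tensors it reads.
struct Step {
    output: usize,      // tensor id
    output_bytes: usize,
    inputs: Vec<usize>, // tensor ids
}

/// Greedy single-pass reuse: walk the schedule once, hand out buffers from a
/// free list when a large-enough one exists, and return a tensor's buffer to
/// the free list right after its last use.
fn assign_buffers(steps: &[Step]) -> HashMap<usize, usize> {
    // Precompute the last step at which each tensor is read.
    let mut last_use: HashMap<usize, usize> = HashMap::new();
    for (i, step) in steps.iter().enumerate() {
        for &t in &step.inputs {
            last_use.insert(t, i);
        }
    }

    let mut buffers: Vec<usize> = Vec::new(); // buffer id -> size in bytes
    let mut free: Vec<usize> = Vec::new();    // ids of currently unused buffers
    let mut assignment: HashMap<usize, usize> = HashMap::new(); // tensor id -> buffer id

    for (i, step) in steps.iter().enumerate() {
        // Reuse the first free buffer that is large enough, else allocate a new one.
        let buf = match free.iter().position(|&b| buffers[b] >= step.output_bytes) {
            Some(pos) => free.swap_remove(pos),
            None => {
                buffers.push(step.output_bytes);
                buffers.len() - 1
            }
        };
        assignment.insert(step.output, buf);

        // Release the buffers of tensors whose last use is this step.
        for &t in &step.inputs {
            if last_use.get(&t) == Some(&i) {
                if let Some(&b) = assignment.get(&t) {
                    free.push(b);
                }
            }
        }
    }
    assignment
}

fn main() {
    // a -> b -> c, where `a` dies after producing `b`, so `c` can reuse a's buffer.
    let steps = vec![
        Step { output: 0, output_bytes: 1024, inputs: vec![] },
        Step { output: 1, output_bytes: 1024, inputs: vec![0] },
        Step { output: 2, output_bytes: 512, inputs: vec![1] },
    ];
    println!("{:?}", assign_buffers(&steps));
}
```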


This final operator fusion can be significant and save a lot of redundant transfers to and from main memory. The same performance could be achieved by manually writing kernels for each used combination of operations, but the combinatorial explosion and associated maintenance would be huge.

An example generated scalar kernel with some handwritten clarifying comments is shown below:

## Comparison to other crates

See [Are We Learning Yet?](https://www.arewelearningyet.com/neural-networks/) for a full list of potential alternatives.

### Rust wrappers around existing runtimes

* PyTorch wrapper: [tch](https://crates.io/crates/tch)
* TensorFlow wrapper: [tensorflow](https://crates.io/crates/tensorflow)
* ONNXRuntime wrapper: [ort](https://github.com/pykeio/ort)

Positives:
* extensive support for many neural network operations
* support for many different backends (CPU, GPU (Nvidia + AMD), TPU, ...)

Negatives:
* not always great support for loading ONNX files (ort is great at this though, as the name suggests)
* large and somewhat black-box external dependency
* less operator fusion in many cases, although this is expected to improve in the future

Performance should be about the same as Kyanite for cases where operator fusion does not matter much; all libraries
mostly use the same underlying cuDNN and cuBLAS kernels.

### From-scratch Rust projects

* [tract](https://github.com/sonos/tract): larger coverage of the ONNX specification but only does CPU inference

## Development

While developing this crate, to update the ONNX proto, the [prost-build crate](https://crates.io/crates/prost-build) is
used. This requires that `protoc` is installed and that the `PROTOC` environment variable is set to point to the
executable. See their installation instructions (or the error message the build script shows if any) for more details.

To actually update the proto definition, replace `kn-graph/proto/onnx.proto3` with the newer version and run `cargo run --bin proto-to-rust`. Then commit both the `onnx.proto3` file and the generated `onnx.rs` file.
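For reference, the code generation itself boils down to a prost-build invocation along these lines. This is a sketch of what such a helper might do; the actual `proto-to-rust` binary and its output paths may differ:

```rust
// Sketch only: roughly what a proto-to-rust style helper does with prost-build.
// Requires `protoc` on the path or the PROTOC environment variable to be set.
fn main() -> std::io::Result<()> {
    prost_build::Config::new()
        // Write the generated file next to the checked-in onnx.rs (assumed path).
        .out_dir("kn-graph/src/onnx/proto")
        .compile_protos(&["kn-graph/proto/onnx.proto3"], &["kn-graph/proto/"])?;
    Ok(())
}
```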
