This repository is part of my journey to learn the underlying infrastructure of deep learning systems through implementing a minimal version of the PyTorch library.
If you're interested in learning more, I highly recommend checking out the excellent MiniTorch lectures and YouTube playlist by Prof. Rush, and the self-study guide by Gabriel Mukobi that answers some common questions.
Topics covered:
- Basic Neural Networks and Modules
- Autodifferentiation for Scalars
- Tensors, Views, and Strides
- Parallel Tensor Operations
- GPU / CUDA Programming in Numba
- Convolutions and Pooling
- Advanced NN Functions
My virtual environment is based on Python 3.8. The Google Colab notebook (for accessing GPU) in Module 3 also requires Python 3.8.
conda create --name myenv python=3.8
conda activate myenv
Install dependencies
pip install -r requirements.txt
pip install -r requirements.extra.txt
pip install -Ue .
conda install llvmlite
pip install altair==4.0
Make sure that everything is installed by running python and then checking:
import minitorch
Use pre-commit to enforce the style and typing guidelines defined in the config file (.pre-commit-config.yaml):
pre-commit run --all-files
Automatically run pre-commit on git commit
pre-commit install
Tour of Repo:
.
|-- minitorch/
| `-- core of the minitorch library
|-- project/
| `-- code for building ML using minitorch
`-- tests/
`-- test cases for minitorch
To access the autograder:
- Module 0: https://classroom.github.com/a/qDYKZff9
- Module 1: https://classroom.github.com/a/6TiImUiy
- Module 2: https://classroom.github.com/a/0ZHJeTA0
- Module 3: https://classroom.github.com/a/U5CMJec1
- Module 4: https://classroom.github.com/a/04QA6HZK
- Quizzes: https://classroom.github.com/a/bGcGc12k
- Operators
- Testing and Debugging
- Functional Python
- Modules
- Visualization
To run visualization for Module 0, use:
streamlit run project/app.py -- 0
Then you should see interactive data visualizations where you can "turn the knobs" of the parameters to explore model behaviour.
- Numerical Derivatives
- Scalars
- Chain Rule
- Backpropagation
- Training
PyTorch terminology in the computational graph:
- Leaf nodes: variables created from scratch on the left-hand side (e.g., `minitorch.Scalar(0.0)`)
- Non-leaf nodes: variables created with a Function
- Constant nodes: terms that are passed in but are not variables (scalars without a history, e.g., `minitorch.Scalar(0.0, None)`)
Backpropagation algorithm (sketched in code below):
- Call topological sort. The result should start from the right of the computational graph (i.e., the output).
- Create a dict of variables and derivatives.
- For each node in the topological sort:
  - If the variable is a leaf, then add its final derivative.
  - If the variable is not a leaf:
    - Call backward with its derivative as $d$.
    - Loop through all the input variables and their derivatives.
    - Accumulate the derivative for each variable.
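Here is a minimal Python sketch of that algorithm. The helper names (`topological_sort`, `unique_id`, `is_leaf`, `is_constant`, `chain_rule`, `accumulate_derivative`) are assumptions for illustration and may not match minitorch's exact API:

```python
def backpropagate(variable, deriv):
    # Topological sort starting from the output (right end of the graph).
    order = topological_sort(variable)
    # Dict mapping each variable to its accumulated derivative.
    derivatives = {variable.unique_id: deriv}
    for var in order:
        d = derivatives.get(var.unique_id, 0.0)
        if var.is_leaf():
            # Leaf: store the final derivative on the variable itself.
            var.accumulate_derivative(d)
        else:
            # Non-leaf: call backward (chain rule) with derivative d,
            # then accumulate the derivative of every input variable.
            for parent, parent_d in var.chain_rule(d):
                if parent.is_constant():
                    continue
                derivatives[parent.unique_id] = (
                    derivatives.get(parent.unique_id, 0.0) + parent_d
                )
```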
To run visualization for Module 1, use:
streamlit run project/app.py -- 1
Here's the training result for the XOR dataset.
The parameters are:

```python
PTS = 50
DATASET = minitorch.datasets["Xor"](PTS)
HIDDEN = 10
RATE = 0.5
```
- Tensor Data - Indexing
- Tensor Broadcasting
- Tensor Operations
- Gradients and Autograd
- Training
Dimension
- Number of dimensions
- 1-dimensional tensors: vectors
- 2-dimensional tensors: matrices
- 3-dimensional tensors: cubes
Shape
- Number of cells per dimension
- 1-dimensional tensors: (5,) (row vector of size 5)
- 2-dimensional tensors: (2,3)
- 3-dimensional tensors: (2,3,3)
Size
- The number of scalars a tensor can store
- Tensor of shape (5,2) can store the same amount of data as a tensor of shape (1,5,2)
- Note: `torch.Tensor.shape` is an alias for `torch.Tensor.size()`
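For example, in PyTorch two tensors with different shapes but the same size store the same number of scalars:

```python
>>> import torch
>>> x = torch.zeros(5, 2)
>>> y = torch.zeros(1, 5, 2)
>>> x.shape, y.shape
(torch.Size([5, 2]), torch.Size([1, 5, 2]))
>>> x.numel(), y.numel()   # same size: both store 10 scalars
(10, 10)
```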
Visual Convention
- Indexing convention differs among different systems.
- For this tutorial, 3D tensor indexing follows `[depth, row, column]`.
- For example, `tensor[0, 1, 2]` would give the blue cube in a cube of shape (2, 3, 5).
Storage
- This is where the core data of the tensor is kept.
- It is always a 1-D array of numbers of length `size`, no matter the `dim` or `shape` of the tensor.
- Keeping a 1-D storage allows us to have tensors with different shapes point to the same type of underlying data.
Strides
- A tuple that provides the mapping from the user `index` (high-dimensional index) to the `position` (1-D index) in the 1-D `storage`.
- Prefer contiguous strides: the right-most stride is 1, with bigger strides to the left.
- For row vectors, strides are (1,).
- For column vectors, strides are (1, 1).
- Strides of (2, 1): each row moves 2 steps in storage and each column moves 1 step. A size-10 tensor with strides of (2, 1) gives a matrix of shape (5, 2).
- Strides of (1, 2): each row moves 1 step in storage and each column moves 2 steps. A size-10 tensor with strides of (1, 2) gives a matrix of shape (2, 5).
- Strides of (6, 3, 1): each depth moves 6 steps in storage, each row moves 3 steps, and each column moves 1 step. A size-12 tensor with strides of (6, 3, 1) gives a tensor of shape (2, 2, 3).
- Strides enable 3 operations:
  - Indexing: assuming strides of `(s1, s2, ...)`, the position of `tensor[index1, index2, ...]` in storage is `storage[s1*index1 + s2*index2 + ...]` (see the sketch after this list).
  - Movement: how to move to the next row/column.
  - Reverse indexing: how to find the `index` based on a `position`.
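For instance, here is a minimal sketch of the indexing computation (the helper name `index_to_position` is used for illustration and may not match minitorch's exact API):

```python
# Convert a multidimensional index into a position in 1-D storage
# by taking the dot product of the index with the strides.
def index_to_position(index, strides):
    return sum(i * s for i, s in zip(index, strides))

# A contiguous (2, 3, 5) tensor has strides (15, 5, 1):
# tensor[0, 1, 2] lives at storage position 0*15 + 1*5 + 2*1 = 7.
assert index_to_position((0, 1, 2), (15, 5, 1)) == 7
```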
View
- Returns a new tensor with the same data as the self tensor but of a different shape.
```python
>>> x = torch.randn(4, 4)
>>> x.size()
torch.Size([4, 4])
>>> y = x.view(16)
>>> y.size()
torch.Size([16])
>>> z = x.view(-1, 8)  # the size -1 is inferred from other dimensions
>>> z.size()
torch.Size([2, 8])
```
Permute
- Returns a view of the original tensor input with its dimensions permuted.
- Similar to matrix transpose
```python
>>> x = torch.randn(2, 3, 5)
>>> x.size()
torch.Size([2, 3, 5])
>>> torch.permute(x, (2, 0, 1)).size()
torch.Size([5, 2, 3])
```
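Permute does not copy data; it only reorders the strides over the same storage. PyTorch makes this visible via `.stride()`:

```python
>>> x = torch.randn(2, 3, 5)
>>> x.stride()
(15, 5, 1)
>>> torch.permute(x, (2, 0, 1)).stride()   # same storage, reordered strides
(1, 15, 5)
```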
Map
- Apply to all elements, e.g., `t1.log()`, `t1.exp()`, `-t1`
- `map(fn: Callable[[float], float]) -> Callable[[Tensor], Tensor]`
Zip
- Apply to all pairs of the same shape, e.g., `t1 + t2`, `t1 * t2` (not matmul), `t1 < t2`
- `zip(fn: Callable[[float, float], float]) -> Callable[[Tensor, Tensor], Tensor]`
Reduce
- Reduce the full tensor or reduce along 1 dimension, e.g., `t1.sum()`, `t1.mean(1)`
- `reduce(fn: Callable[[float, float], float], start: float = 0.0) -> Callable[[Tensor, int], Tensor]`
- If no argument is given, it returns a tensor of shape (1,).
- If 1 argument is given, it reduces along the given dimension.
- In MiniTorch, we keep the dimension that we reduced. However, in Torch and NumPy, the default behaviour is to remove the dimension entirely.

```python
>>> t = minitorch.rand((3, 4, 5)).mean(1)
>>> t.shape
(3, 1, 5)
>>> t = torch.mean(torch.rand(3, 4, 5), 1)
>>> t.shape
torch.Size([3, 5])
>>> t = torch.mean(torch.rand(3, 4, 5), 1, keepdim=True)
>>> t.shape
torch.Size([3, 1, 5])
```
Broadcasting
- Augmenting map & zip to make tensors with different shapes work together
- This avoids using for-loops and intermediate terms by making it easier to apply certain operations many times
- Rules to determine the output shape (a code sketch of these rules follows this section):
  - Dimensions of size 1 broadcast with anything.
    - E.g., `t1.view(4, 3) + t2.view(1, 3)` gives output shape (4, 3)
    - E.g., `t1.view(4, 3, 1) + t2.view(4, 1, 5)` gives output shape (4, 3, 5)
  - Zip automatically adds dimensions of size 1 to the left.
    - E.g., `t1.view(4, 3) + tensor([1, 2, 3])` gives output shape (4, 3)
    - E.g., `t1.view(4, 3, 1) + t2.view(1, 5)` gives output shape (4, 3, 5)
  - Extra dimensions of size 1 can be added with the `view` operation.
    - Note: `t1.view(4, 3) + tensor([1, 2, 3, 4])` would fail because a tensor of shape (4, 3) cannot be added to a tensor of shape (1, 4). `t1.view(4, 3, 1) + t2.view(4, 5)` would also fail.
    - E.g., `t1.view(4, 3) + tensor([1, 2, 3, 4]).view(4, 1)` gives output shape (4, 3)
- Broadcasting is only about shapes. It does not take strides into account in any way. It is purely a high-level protocol.
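Here is a minimal sketch of those shape rules (minitorch provides a similar shape-broadcasting helper, but this is not necessarily its exact implementation):

```python
def shape_broadcast(shape_a, shape_b):
    a, b = list(shape_a), list(shape_b)
    # Rule: implicitly add dimensions of size 1 on the left.
    while len(a) < len(b):
        a.insert(0, 1)
    while len(b) < len(a):
        b.insert(0, 1)
    out = []
    for da, db in zip(a, b):
        # Rule: dimensions of size 1 broadcast with anything.
        if da == db or da == 1 or db == 1:
            out.append(max(da, db))
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(out)

assert shape_broadcast((4, 3, 1), (1, 5)) == (4, 3, 5)
assert shape_broadcast((4, 3), (3,)) == (4, 3)
```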
Special PyTorch Syntax (not available in MiniTorch)
- Using `None` indexing to create an extra dimension of size 1:

  ```python
  >>> x = torch.tensor([1, 2, 3])
  >>> x.shape
  torch.Size([3])
  >>> x[None].shape
  torch.Size([1, 3])
  >>> x[:, None].shape
  torch.Size([3, 1])
  ```
torch.arange
- Returns a 1-D tensor with values from the interval `[start, end)` taken with common difference `step` beginning from `start`.

  ```python
  >>> torch.arange(1, 2.5, 0.5)
  tensor([1.0000, 1.5000, 2.0000])
  ```
torch.where
- This function acts as an if-else statement applied to every element of a tensor simultaneously.

  ```python
  >>> x = torch.where(
  ...     torch.tensor([True, False]),
  ...     torch.tensor([1, 1]),
  ...     torch.tensor([0, 0]),
  ... )
  >>> x
  tensor([1, 0])
  ```
Terminology
- Scalar --> Derivative
- Tensor --> Gradient (storing many derivatives)
Derivatives
- Function with a tensor input is like multiple args
- Function with a tensor output is like multiple functions
- Backward: chain rule from each output to each input
- Rule to determine the output shape of derivatives:
  - The derivative of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is $df: \mathbb{R}^n \rightarrow \mathbb{R}^{m \times n}$
- Tensor-to-scalar example:
  - $G([x_1, x_2, x_3]) = x_1 \cdot x_2 \cdot x_3$ is a tensor-to-scalar function, and $f(z)$ is a scalar-to-scalar function
  - $G'([x_1, x_2, x_3]) = [x_2 \cdot x_3, x_1 \cdot x_3, x_1 \cdot x_2]$ is a tensor-to-tensor function, and $f'(z) = d$
  - Applying the chain rule, $f'(G([x_1, x_2, x_3])) = [x_2 x_3 d, x_1 x_3 d, x_1 x_2 d]$
- Tensor-to-tensor example (verified numerically in the sketch below):
  - $G([x_1, x_2]) = [G^1(\textbf{x}), G^2(\textbf{x})] = [z_1, z_2] = \textbf{z}$ is a tensor-to-tensor function, where $G^1(\textbf{x}) = x_1$ and $G^2(\textbf{x}) = x_1 x_2$. $f(\textbf{z})$ is a tensor-to-scalar function
  - $G'^1_{x_1} = 1$, $G'^1_{x_2} = 0$, $G'^2_{x_1} = x_2$, $G'^2_{x_2} = x_1$, $f'(z_1) = d_1$, $f'(z_2) = d_2$
  - Applying the chain rule, $f'_{x_1}(G(\textbf{x})) = d_1 \times 1 + d_2 \times x_2$ and $f'_{x_2}(G(\textbf{x})) = d_1 \times 0 + d_2 \times x_1$
  - More generally, $f'_{x_j}(G(\textbf{x})) = \sum_i d_i G'^{i}_{x_j}(\textbf{x})$
  - There is a $d$ value for each of the output arguments
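The tensor-to-tensor example can be checked numerically with PyTorch autograd; this sketch uses $\textbf{x} = [2, 3]$ and upstream derivatives $\textbf{d} = [5, 7]$ as illustrative values:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
z = torch.stack([x[0], x[0] * x[1]])   # G(x) = [x1, x1*x2]
d = torch.tensor([5.0, 7.0])           # upstream derivatives d1, d2
z.backward(d)
# Expected: [d1*1 + d2*x2, d1*0 + d2*x1] = [5 + 7*3, 7*2] = [26, 14]
print(x.grad)                          # tensor([26., 14.])
```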
Backward for Map operations:
- $G'^i_{x_j}(\textbf{x}) = 0$ if $i \neq j$
- $f'_{x_j}(G(\textbf{x})) = d_j G'^j_{x_j}(\textbf{x})$
- In backward, we only need to compute the derivative for each scalar position and then apply a `mul` map, i.e., `mul_map(g'(x), d_out)`

Backward for Zip operations:
- $G'^i_{x_j}(\textbf{x}, \textbf{y}) = 0$ if $i \neq j$
- $f'_{x_j}(G(\textbf{x}, \textbf{y})) = d_j G'^j_{x_j}(\textbf{x}, \textbf{y})$ and $f'_{y_j}(G(\textbf{x}, \textbf{y})) = d_j G'^j_{y_j}(\textbf{x}, \textbf{y})$
- In backward, we only need to compute the derivative for each scalar position and then apply a `mul` map, i.e., `mul_map(g'(x, y), d_out)`

Backward for Reduce operations:
- $G'^i_{x_j}(\textbf{x}; dim) = 0$ if $i \neq j$
- $f'_{x_j}(G(\textbf{x}; dim)) = d_j G'^j_{x_j}(\textbf{x}; dim)$
- In backward, we only need to compute the derivative for each scalar position and then apply a `mul` map, i.e., `mul_map(g'(x), d_out)`. Here `d_out` is broadcast to match the shape of `g'(x)`.
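For example, with PyTorch the backward of a sum-reduce along one dimension is just `d_out` broadcast back to the input shape:

```python
import torch

x = torch.rand(3, 4, 5, requires_grad=True)
out = x.sum(1, keepdim=True)        # shape (3, 1, 5)
out.backward(torch.ones_like(out))  # d_out of shape (3, 1, 5)
print(x.grad.shape)                 # torch.Size([3, 4, 5]); every entry is 1
```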
- Parallelization
- Matrix Multiplication
- CUDA Operations
- CUDA Matrix
- Training
Numba is a numerical JIT (Just-in-Time) compiler for Python.
numba.jit
- A function decorator that tells Numba to compile a Python function into native machine code using just-in-time (JIT) compilation.
- It can be used to speed up the execution of the function by compiling it to machine code, which can be faster than interpreting the Python code.
numba.njit
- Similar to `numba.jit`, but with stricter requirements on the types of input and output that the function can accept and return.
- Faster and more memory-efficient than `numba.jit`.
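A small illustrative example of `numba.njit` (the function is made up for illustration and is not part of minitorch):

```python
from numba import njit
import numpy as np

@njit
def sum_array(x):
    # Compiled to machine code on the first call; later calls reuse it.
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
    return total

print(sum_array(np.arange(10.0)))   # 45.0
```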
numba.prange
- Runs loops in parallel.
- Nested parallelism does not occur: only the outermost of nested `prange` loops executes in parallel, and any inner `prange` loops are treated as standard `range`.
- Warning: you have to be a bit careful to ensure that the loop actually can be parallelized. In short, this means that steps cannot depend on each other and each iteration cannot write to the same output value.
  - Invalid example 1:

    ```python
    @njit(parallel=True)
    def prange_wrong_result(x):
        y = np.zeros(4)
        for i in prange(len(x)):
            # accumulating into the same element of `y` from different
            # parallel iterations of the loop results in a race condition
            y[i % 4] += x[i]
        return y
    ```
  - Invalid example 2:

    ```python
    @njit(parallel=True)
    def prange_wrong_result(x):
        y = 0
        for i in prange(len(x)):
            # a different order of x[i] will produce different results
            y = (y + x[i]) * 5
        return y
    ```
- For instance, when implementing reduce, you have to be careful when mixing parallel and non-parallel loops.
- Unsupported operations:
  - Mutating a list is not thread-safe. Invalid example:

    ```python
    @njit(parallel=True)
    def invalid():
        z = []
        for i in prange(10000):
            z.append(i)
        return z
    ```
  - Induction variables are not associated with thread ID. Invalid example:

    ```python
    @njit(parallel=True)
    def invalid():
        n = get_num_threads()
        z = [0 for _ in range(n)]
        for i in prange(100):
            z[i % n] += i
        return z
    ```
Putting `njit` and `prange` together:

```python
from numba import njit, prange

def map(fn):
    # Change 1: Move function from Python to JIT version.
    fn = njit()(fn)

    def _map(out, input):
        # Change 3: Run the loop in parallel (prange).
        for i in prange(len(out)):
            out[i] = fn(input[i])

    # Change 2: Internal _map must be JIT version as well.
    return njit()(_map)
```
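A hypothetical usage sketch of the `map` builder above (the lambda and array values are made up for illustration):

```python
import numpy as np

square_map = map(lambda a: a * a)   # compiles the lambda and the loop once
out = np.zeros(6)
square_map(out, np.arange(6.0))
print(out)                          # [ 0.  1.  4.  9. 16. 25.]
```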
- Another approach we can use to help improve tensor computations is to specialize commonly used combinations of operators.
- Combining operations can eliminate unnecessary intermediate tensors. This is particularly helpful for saving memory.
- In minitorch we will do this fusion by customizing special operations (e.g., MatMul).
Matrix Multiplication
- In past modules we have done matrix multiplication by applying a broadcasted `zip` and then a `reduce`. For example, consider a tensor of size (2, 4) and a tensor of size (4, 3). They are viewed as tensors of size (2, 4, 1) and (1, 4, 3). We first zip these together with broadcasting to produce a tensor of size (2, 4, 3), and then reduce it to a tensor of size (2, 1, 3), which we can view as (2, 3).
- An alternative option is to fuse these two operations together. We can then skip computing the intermediate value and directly compute the output.
- We do this by (1) walking over the output indices, (2) seeing which indices are reduced, and (3) seeing which were part of the zip.
- Forward: $f(A, B) = AB$
- Backward: $g'_A(f(A, B)) = \textbf{d} B^T$ and $g'_B(f(A, B)) = A^T \textbf{d}$
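A minimal NumPy sketch of the fused matmul loop and the backward rules above (illustrative only, not minitorch's implementation):

```python
import numpy as np

def matmul_forward(A, B):
    # Walk over the output indices and accumulate over the shared
    # (reduced) dimension, skipping the broadcasted intermediate tensor.
    out = np.zeros((A.shape[0], B.shape[1]))
    for i in range(out.shape[0]):          # output row index
        for j in range(out.shape[1]):      # output column index
            for k in range(A.shape[1]):    # reduced dimension
                out[i, j] += A[i, k] * B[k, j]
    return out

def matmul_backward(A, B, d):
    # grad_A = d @ B^T, grad_B = A^T @ d
    return d @ B.T, A.T @ d

A, B = np.random.rand(2, 4), np.random.rand(4, 3)
assert np.allclose(matmul_forward(A, B), A @ B)
```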
- There is a nice Numba library extension that allows us to code for GPUs directly in Python.
CUDA
- CUDA is a proprietary extension to C++ for Nvidia devices.
- Thread
  - A thread can run code and store a small amount of state.
  - Each thread has a tiny amount of fixed local memory it can manipulate, which has to be constant size.
  - All of the threads share the same CUDA code.
- Block
  - Threads hang out together in blocks.
  - You can determine the size of the blocks, but there are a lot of restrictions.
  - You can also have square or even cubic blocks.
  - Each thread knows exactly where it is in the block. It gets this information in local variables telling it the `thread index`.
  - Threads in the same block can also talk to each other through `shared memory`. This is another constant chunk of memory that is associated with the block and can be accessed and written to by all of these threads.
- Grid
  - Blocks come together to form a grid.
  - Each of the blocks has exactly the same size and shape, and all have their own shared memory.
  - Each thread also knows its position in the global grid. The global position (x, y) for a thread is:

    ```python
    def kernel():
        x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        print(x, y)
    ```
- CUDA code
  - Map over a tensor of size (7, 3):

    ```python
    from numba import cuda
    import numba

    # Helper function to call in CUDA
    @cuda.jit(device=True)
    def times(a, b):
        return a * b

    # Main cuda launcher
    @cuda.jit()
    def my_func(inp, out):
        # Create some local memory
        local = cuda.local.array(5, numba.float64)

        # Find my position.
        x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

        # Compute some information
        local[1] = 10

        # Guard some of the threads.
        if x < out.shape[0] and y < out.shape[1]:
            # Compute some global value
            out[x, y] = times(inp[x, y], local[1])

    # Launch my_func with instructions for how to set up the blocks and grid:
    # enough (4, 3) blocks of threads to cover the (7, 3) arrays inp and out.
    threadsperblock = (4, 3)
    blockspergrid = (2, 1)
    my_func[blockspergrid, threadsperblock](inp, out)
    ```
- 1D Convolution
- 2D Convolution
- Pooling
- Softmax and Dropout
- Extra Credit
- Training an Image Classifier