- Computation Graphs
- Variables, Operators and Implicit Dynamic Graphs
- Forward and Backward Passes
- Practical Examples
NNKit is a framework for building and training neural networks. It provides a collection of nodes to implement dynamic net topologies as well as optimizers to train these networks and a few other helper algorithms for things such as mini batch partitioning.
A note about mathematical conventions: Contrary to the more popular mathematical convention, NNKit vectors (and therefore vectors in this post) are assumed to be row-oriented by default. This stems from two personal preferences:
- It makes the input to a neural net consistent regardless of it being a single datapoint or a design matrix of n data points. That is, a single datapoint is of dimension and a sample of n points has dimentions ; datapoints in the design matrix are not
- It allows laying out (and thinking about) the dimensions of weight matrices for a whole layer as .
This being said, at some point I might switch the framework to column-oriented convention.
NNKit is based around the concept of computation graphs. A computation graph is a general way of describing an arbitrary function as a composition of individual operations.
For example, Figure 1 shows a hypothetical 3-class classifier as a neural network diagram. The network has a 3-3 topology, meaning 2 layers (excluding inputs) and 3 units per layer. As is customary in neural net diagrams, each ‘neuron’ performs a weighted linear combination of its inputs followed by a nonlinear transformation (i.e.: the activation function). In this example the first layer uses ReLU activations and the second uses Softmax, so outputs of the network can be interpreted as the probability that an input to the net maps to one of three classes. Each layer also has a dummy constant input of 1 to add a bias term into each linear combination.
In NNKit, the same classifier is described as a vectorized computation graph (Figure 2). This representation has two main differences with the one in Figure 1:
Neuron computations are broken down into its constituent operations, each with as a separate node: multiplication, addition of bias and nonlinear activation. A computation graph in general could also decompose activations into primitive operations but I chose to implement these as atomic operations.
Nodes for variables and operations are multidimensional. This allows to pack the equivalent of a layer’s several neurons into tensors and carry them out in parallel. Another observation from the vectorized quality of the graph is that it allows thinking about blocks of computation as implicit layers with shapes such as 2-in to 3-out, given by the shapes of the variable (parameter) nodes.
NNKit graphs are made up of two types of nodes: variables (
NetVar class) and operators (
variables hold a value in their
dataproperty and this value can be arbitrarily set. They also have a
gattribute (for gradient) which after a backward pass holds the gradient of the computation graph output w.r.t the variable. Examples of variables are inputs to a graph or parameters such as weights and biases.
NetVarsubclasses for which the
dataproperty is automatically (and only once) derived on instance initialization from their operands, which are other variables passed to their initializer. Examples of operators are multiplication, addition, ReLU and Softmax.
The operator initialization mechanism has two implications:
Graphs are Implicit:
There is no graph class in the framework. Instead, as operators are instantiated with other nodes as arguments they keep references to their parents, implicitly defining a dependency graph (Figure 2) thru which gradients will backpropagate. This being said, it is easy to create convenience wrapper classes for graphs or neural net types. The framework includes a
FFN (feed-forward network) convenience class as an example.
Graphs are Dynamic:
Because operators compute their value only once on initialization, one must re-instantiate these on subsequent forward passes, keeping parameters around between passes, in order to persist them. This also means that upon instantiation, a node will hold the value of the forward pass up to that point in the implicit graph. This mechanism makes graph topology dynamic, since it can change between passes depending on which operators are instantiated. This is very useful for sequential models of varying input length for example.
Eq. 1 defines the 3-class classifier from earlier as a function and Eq. 2 shows its decomposition into individual operators and variables:
Code 1 shows the NNKit implementation of Eq. 2, where arbitrary values have been assigned to the variables.
import nnkit as nn # Variables: x = nn.NetVar([1,2]) w1 = nn.NetVar([[3,4,5], [6,7,8]]) b1 = nn.NetVar([1,2,3]) w2 = nn.NetVar([[1,2,3], [3,4,5], [6,7,8]]) b2 = nn.NetVar([1,2,3]) # Layer 1 Operators: mul1 = nn.Multiply(x, w1) add1 = nn.Add(mul1, b1) relu1 = nn.ReLU(add1) # Layer 2 Operators: mul2 = nn.Multiply(relu1, w2) add2 = nn.Add(mul2, b2) smax2 = nn.SoftMax(add2)
As mentioned before, all nodes have a
data property which holds the value of the node. In the case of operators this is the value of the forward pass of the graph as constructed up to that point. So the
data of the last operator in a graph holds the value of the graph’s output (Figure 3).
The backward pass is computed by calling
back() on any operator. This causes each operator (starting from the one
back() was called on) to compute the gradient of itself w.r.t. its parents. Each node’s gradient is then accessible thru its
g attribute. Since it’s possible to call
back() on any operator, it’s possible to do backpropagation for a subgraph of a larger graph, though typically
back() is just called on the last node in a graph.
Eq. 3 shows the forward and backward pass computations at the operator level for a
Multiply node implementing a 2-in, 3-out layer (equivalent to the output of three neurons). As noted before, the operator itself has no concept of dimension. The 2-in, 3-out shape is actually derived from the operand shapes in this particular case. Code 2 shows the implementation of this node in NNKit and hints to how it’s possible to extend the framework with additional operators.
class Multiply(NetOp): """Multiplication. y = xw """ def __init__(self, x, w): """ :param x: NetVar: input 1. :param w: NetVar: input 2. """ super().__init__( x.data @ w.data, x, w ) def _back(self, x, w): x.g += self.g @ w.data.T w.g += x.data.T @ self.g super()._back()
Optimizers are algorithms that minimize some notion of distance or error between a graph’s output and a target. This notion of distance is given by a loss function (sometimes called a cost function when averaging or summing multiple losses).
An optimization step consists of nudging model parameters (denoted ) in the direction opposite of the gradient of the loss function w.r.t. the parameters. In NNKit a parameter is simply any
NetVar passed to an optimizer for which derivatives have been computed.
As an example, Eq 4. defines Gradient Descent with momentum and Code 3 shows this optimizer as implemented in NNKit.
class GD(Optimizer): """Gradient descent with optional momentum. Implements the following update for each parameter p: m = β1m + (1-β1)df/dp p = p - αm Attributes: . learnRate: α ∈ [0,1] how big of an adjustment each parameter undergoes during an optimization step. . momentum: β1 ∈ [0,1) over how many samples the exponential moving average m takes place. If set to 0 momentum is disabled and the algorithm becomes simply gradient descent. """ def __init__(self, params): super().__init__(params) self.momentum = 0.9 self.m = [np.zeros_like(p.data) for p in params] def step(self): for i, p in enumerate(self.params): self.m[i] = self.momentum * self.m[i] + (1 - self.momentum) * p.g p.data -= self.learnRate * self.m[i] p.reset()
NNKit currently implements Gradient Descent with momentum and Adam (which degenerates to RMSProp if momentum is set to 0.). Code 4 shows an example very similar to the one in Code 1 but using the
FFN convenience class along with an optimizer and a loss node.
The loss node can be appended to the graph as shown in the code, which could make sense from a computation graph point of view, but it can also be used standalone, resembling the functional notation in Eq. 4. It’s also worth noting that even though the
FFN class has a
vars property, the network’s parameters could have just been declared outside the network instance and passed directly to the optimizer.
import nnkit as nn # 1. input: x = nn.NetVar([1, 2]) # 2. initialize (2,3,3) network: net = nn.FFN( (nn.Multiply, nn.rand2(2,3)), (nn.Add, nn.rand2(3)), (nn.ReLU,), (nn.Multiply, nn.rand2(3,3)), (nn.Add, nn.rand2(3)), (nn.SoftMax, ) ) # 3. create optimizer and pass parameters to optimize: optimizer = nn.GD(net.vars) # 4. add cross-entropy loss node at the end of the net to optimize against: target = nn.NetVar([0, 0, 1]) net.topology.append((nn.CELoss, target)) # 5. optimize (a.k.a. train, a.k.a. learn): loss = float('inf') while loss > 0.01: print(loss) # 5.a forward pass: instantiate all operators (variables are persisted from step 2): loss = net(x).item() # 5.b backprop: compute loss output w.r.t parameters: net.back() # 5.c optimize parameters: optimizer.step() # 6. remove loss node. Network is trained and ready for testing / performing. net.topology.pop()