NNKit
A Python framework for dynamic neural networks
github project
 About
 Computation Graphs
 Variables, Operators and Implicit Dynamic Graphs
 Forward and Backward Passes
 Optimizers
 Practical Examples
About
NNKit is a framework for building and training neural networks. It provides a collection of nodes to implement dynamic net topologies as well as optimizers to train these networks and a few other helper algorithms for things such as mini batch partitioning.
A note about mathematical conventions: Contrary to the more popular mathematical convention, NNKit vectors (and therefore vectors in this post) are assumed to be roworiented by default. This stems from two personal preferences:
 It makes the input to a neural net consistent regardless of it being a single datapoint or a design matrix of n data points. That is, a single datapoint is of dimension and a sample of n points has dimentions ; datapoints in the design matrix are not
 It allows laying out (and thinking about) the dimensions of weight matrices for a whole layer as .
This being said, at some point I might switch the framework to columnoriented convention.
Computation Graphs
NNKit is based around the concept of computation graphs. A computation graph is a general way of describing an arbitrary function as a composition of individual operations.
For example, Figure 1 shows a hypothetical 3class classifier as a neural network diagram. The network has a 33 topology, meaning 2 layers (excluding inputs) and 3 units per layer. As is customary in neural net diagrams, each ‘neuron’ performs a weighted linear combination of its inputs followed by a nonlinear transformation (i.e.: the activation function). In this example the first layer uses ReLU activations and the second uses Softmax, so outputs of the network can be interpreted as the probability that an input to the net maps to one of three classes. Each layer also has a dummy constant input of 1 to add a bias term into each linear combination.
In NNKit, the same classifier is described as a vectorized computation graph (Figure 2). This representation has two main differences with the one in Figure 1:

Neuron computations are broken down into its constituent operations, each with as a separate node: multiplication, addition of bias and nonlinear activation. A computation graph in general could also decompose activations into primitive operations but I chose to implement these as atomic operations.

Nodes for variables and operations are multidimensional. This allows to pack the equivalent of a layer’s several neurons into tensors and carry them out in parallel. Another observation from the vectorized quality of the graph is that it allows thinking about blocks of computation as implicit layers with shapes such as 2in to 3out, given by the shapes of the variable (parameter) nodes.
Variables, Operators and Implicit Dynamic Graphs
NNKit graphs are made up of two types of nodes: variables (NetVar
class) and operators (NetOp
class).

variables hold a value in their
data
property and this value can be arbitrarily set. They also have ag
attribute (for gradient) which after a backward pass holds the gradient of the computation graph output w.r.t the variable. Examples of variables are inputs to a graph or parameters such as weights and biases. 
operators are
NetVar
subclasses for which thedata
property is automatically (and only once) derived on instance initialization from their operands, which are other variables passed to their initializer. Examples of operators are multiplication, addition, ReLU and Softmax.
The operator initialization mechanism has two implications:
Graphs are Implicit:
There is no graph class in the framework. Instead, as operators are instantiated with other nodes as arguments they keep references to their parents, implicitly defining a dependency graph (Figure 2) thru which gradients will backpropagate. This being said, it is easy to create convenience wrapper classes for graphs or neural net types. The framework includes a FFN
(feedforward network) convenience class as an example.
Graphs are Dynamic:
Because operators compute their value only once on initialization, one must reinstantiate these on subsequent forward passes, keeping parameters around between passes, in order to persist them. This also means that upon instantiation, a node will hold the value of the forward pass up to that point in the implicit graph. This mechanism makes graph topology dynamic, since it can change between passes depending on which operators are instantiated. This is very useful for sequential models of varying input length for example.
Forward and Backward Passes
Eq. 1 defines the 3class classifier from earlier as a function and Eq. 2 shows its decomposition into individual operators and variables:
Code 1 shows the NNKit implementation of Eq. 2, where arbitrary values have been assigned to the variables.
import nnkit as nn
# Variables:
x = nn.NetVar([1,2])
w1 = nn.NetVar([[3,4,5], [6,7,8]])
b1 = nn.NetVar([1,2,3])
w2 = nn.NetVar([[1,2,3], [3,4,5], [6,7,8]])
b2 = nn.NetVar([1,2,3])
# Layer 1 Operators:
mul1 = nn.Multiply(x, w1)
add1 = nn.Add(mul1, b1)
relu1 = nn.ReLU(add1)
# Layer 2 Operators:
mul2 = nn.Multiply(relu1, w2)
add2 = nn.Add(mul2, b2)
smax2 = nn.SoftMax(add2)
Forward Pass:
As mentioned before, all nodes have a data
property which holds the value of the node. In the case of operators this is the value of the forward pass of the graph as constructed up to that point. So the data
of the last operator in a graph holds the value of the graph’s output (Figure 3).
Backward Pass:
The backward pass is computed by calling back()
on any operator. This causes each operator (starting from the one back()
was called on) to compute the gradient of itself w.r.t. its parents. Each node’s gradient is then accessible thru its g
attribute. Since it’s possible to call back()
on any operator, it’s possible to do backpropagation for a subgraph of a larger graph, though typically back()
is just called on the last node in a graph.
Eq. 3 shows the forward and backward pass computations at the operator level for a Multiply
node implementing a 2in, 3out layer (equivalent to the output of three neurons). As noted before, the operator itself has no concept of dimension. The 2in, 3out shape is actually derived from the operand shapes in this particular case. Code 2 shows the implementation of this node in NNKit and hints to how it’s possible to extend the framework with additional operators.
class Multiply(NetOp):
"""Multiplication.
y = xw
"""
def __init__(self, x, w):
"""
:param x: NetVar: input 1.
:param w: NetVar: input 2.
"""
super().__init__(
x.data @ w.data,
x, w
)
def _back(self, x, w):
x.g += self.g @ w.data.T
w.g += x.data.T @ self.g
super()._back()
Optimizers
Optimizers are algorithms that minimize some notion of distance or error between a graph’s output and a target. This notion of distance is given by a loss function (sometimes called a cost function when averaging or summing multiple losses).
An optimization step consists of nudging model parameters (denoted ) in the direction opposite of the gradient of the loss function w.r.t. the parameters. In NNKit a parameter is simply any NetVar
passed to an optimizer for which derivatives have been computed.
As an example, Eq 4. defines Gradient Descent with momentum and Code 3 shows this optimizer as implemented in NNKit.
class GD(Optimizer):
"""Gradient descent with optional momentum.
Implements the following update for each parameter p:
m = β1m + (1β1)df/dp
p = p  αm
Attributes:
. learnRate: α ∈ [0,1]
how big of an adjustment each parameter undergoes during an optimization step.
. momentum: β1 ∈ [0,1)
over how many samples the exponential moving average m takes place.
If set to 0 momentum is disabled and the algorithm becomes simply gradient descent.
"""
def __init__(self, params):
super().__init__(params)
self.momentum = 0.9
self.m = [np.zeros_like(p.data) for p in params]
def step(self):
for i, p in enumerate(self.params):
self.m[i] = self.momentum * self.m[i] + (1  self.momentum) * p.g
p.data = self.learnRate * self.m[i]
p.reset()
NNKit currently implements Gradient Descent with momentum and Adam (which degenerates to RMSProp if momentum is set to 0.). Code 4 shows an example very similar to the one in Code 1 but using the FFN
convenience class along with an optimizer and a loss node.
The loss node can be appended to the graph as shown in the code, which could make sense from a computation graph point of view, but it can also be used standalone, resembling the functional notation in Eq. 4. It’s also worth noting that even though the FFN
class has a vars
property, the network’s parameters could have just been declared outside the network instance and passed directly to the optimizer.
import nnkit as nn
# 1. input:
x = nn.NetVar([1, 2])
# 2. initialize (2,3,3) network:
net = nn.FFN(
(nn.Multiply, nn.rand2(2,3)),
(nn.Add, nn.rand2(3)),
(nn.ReLU,),
(nn.Multiply, nn.rand2(3,3)),
(nn.Add, nn.rand2(3)),
(nn.SoftMax, )
)
# 3. create optimizer and pass parameters to optimize:
optimizer = nn.GD(net.vars)
# 4. add crossentropy loss node at the end of the net to optimize against:
target = nn.NetVar([0, 0, 1])
net.topology.append((nn.CELoss, target))
# 5. optimize (a.k.a. train, a.k.a. learn):
loss = float('inf')
while loss > 0.01:
print(loss)
# 5.a forward pass: instantiate all operators (variables are persisted from step 2):
loss = net(x).item()
# 5.b backprop: compute loss output w.r.t parameters:
net.back()
# 5.c optimize parameters:
optimizer.step()
# 6. remove loss node. Network is trained and ready for testing / performing.
net.topology.pop()
Practical Examples
For examples of the framework being applied to specific cases take a look at my posts on digits classification and deep Qlearning.