Neural Networks from Scratch

The math, intuition, and code behind how neural networks actually learn — built up from a single neuron to a working training loop. Based on Andrej Karpathy's micrograd tutorial.

Concepts: 8 · Lines of Python: ~50 · Demos: 5 · Source: 2h

Companion to Part 1: How LLMs Work. All concepts and code traced directly to Karpathy's micrograd lecture.

Q: What even is a neural network?
Chapter 1 · Motivation

The Problem We're Solving

Before we build anything, let's understand what we're trying to do. We have some inputs and we want to predict an output. For example: given these 4 measurements about a person, predict whether they'll like a movie.

The challenge: we don't know the formula. We can't write the rules by hand. Instead, we want a system that learns the formula from examples — just by seeing a lot of input/output pairs.

A neural network is that system. It starts as a completely random function. Then it looks at thousands of examples and adjusts itself — tiny nudge by tiny nudge — until its predictions get good. This process is called training.

The Connection to Part 1: GPT does exactly this — at massive scale. Its "inputs" are tokens, its "output" is the next token, and it was trained on roughly 15 trillion examples. The math is the same as what we'll build here. Just more of it.

Our Training Dataset

A tiny example to make it concrete. Four inputs, one target output:

Training Examples (xs → ys)
# inputs: [x1, x2, x3, x4]
xs = [
    [2.0, 3.0, -1.0, 1.0],
    [3.0, -1.0, 0.5, 1.0],
    [0.5, 1.0, 1.0, 1.0],
    [1.0, 1.0, -1.0, 1.0],
]

# targets: what we want the network to output
ys = [1.0, -1.0, -1.0, 1.0]
xs is our input data — 4 examples, each with 4 numbers. ys is the "right answer" for each example: 1.0 or -1.0. The network starts knowing nothing, and has to learn to map xs to ys.
Why -1 and 1? We use tanh as our activation function (more on this in §2), which outputs values between -1 and 1. So our targets are -1 and 1 to match that range.
Chapter 2 · The Building Block

What is a Neuron?

A neuron is just a tiny mathematical function. Think of it as a dimmer switch: it takes a bunch of inputs, decides how much each one matters (the weights), adds a personal default lean (the bias), and squishes the result to a bounded range.

The formula: output = tanh(w₁·x₁ + w₂·x₂ + b)

Where w₁, w₂ are weights ("how much does this input matter?"), b is the bias ("what's my default lean when inputs are zero?"), and tanh squishes the result to always land between -1 and 1.

import math

class Value:
    def __init__(self, data):
        self.data = data   # the actual number
        self.grad = 0.0    # gradient (filled in later)

    def __mul__(self, other):
        return Value(self.data * other.data)

    def __add__(self, other):
        return Value(self.data + other.data)

    def tanh(self):
        return Value(math.tanh(self.data))
self.data is just a number — like 0.5 or -1.3. self.grad starts at zero and gets filled in during backpropagation (§7). The __mul__ and __add__ methods let us use normal Python math operators (+, *) with Value objects. tanh squishes any number to the range (-1, 1).
Why tanh? Without an activation function, stacking neurons just produces a linear function — no matter how many layers you add. tanh introduces non-linearity, which is what lets networks learn complex patterns.
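To make the formula concrete, here is one neuron evaluated by hand with the Value class above. The inputs match the playground below (x₁ = 0.5, x₂ = −0.3); the weights and bias (0.8, −0.5, 0.1) are made-up numbers chosen just for this example.

# one neuron, step by step (weights and bias chosen arbitrarily for illustration)
x1, x2 = Value(0.5), Value(-0.3)    # inputs
w1, w2 = Value(0.8), Value(-0.5)    # weights: how much each input matters
b = Value(0.1)                      # bias: the default lean

raw = w1*x1 + w2*x2 + b             # 0.4 + 0.15 + 0.1 = 0.65
out = raw.tanh()                    # squash to (-1, 1)
print(raw.data, out.data)           # 0.65  0.572...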

Neuron Playground

Drag the sliders to change the weights and bias. Watch the neuron's output update live. Fixed inputs: x₁ = 0.5, x₂ = −0.3.

Interactive Neuron
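If you prefer to poke at the neuron in code rather than in the widget, a rough stand-in for the playground is a loop that sweeps one parameter while the others stay fixed. The specific weights and bias values below are arbitrary:

import math

# sweep the bias while inputs and weights stay fixed (all values arbitrary)
x1, x2 = 0.5, -0.3
w1, w2 = 0.8, -0.5

for b in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    raw = w1*x1 + w2*x2 + b
    print(f"b={b:+.1f}  raw sum={raw:+.2f}  tanh output={math.tanh(raw):+.3f}")
# a large |b| pushes the output toward -1 or +1: the neuron saturates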
Chapter 3 · Architecture

Layers & the MLP

One neuron isn't enough to learn complex patterns. We stack them into layers, and stack layers into a Multi-Layer Perceptron (MLP). Think of it as an assembly line: the first layer looks at raw inputs, the next layer looks at what the first layer found, and so on.

Every connection between neurons is one weight. A network with 4 inputs → 3 neurons → 1 output has (4×3) + (3×1) = 15 weights, plus biases. Frontier models like GPT-4 have the same structure, just with hundreds of billions of weights.

import random

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)

    def __call__(self, x):
        # w · x + b, starting the sum from the bias so every term is a Value
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        return [n(x) for n in self.neurons]

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x[0] if len(x) == 1 else x
Neuron(nin) creates a neuron with nin random weights. Layer(nin, nout) creates nout neurons that each accept nin inputs. MLP(4, [3,1]) creates a 4→3→1 network. __call__ is Python's way of making an object callable — so model(x) runs the forward pass.
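One helper the training loop in §7 and §8 relies on, n.parameters(), isn't shown above. A minimal sketch of it, following how micrograd collects parameters (each object just gathers the Values it owns); attaching the methods after the fact is only to keep the snippet short:

# every learnable Value in the model, flattened into one list, so the
# training loop can zero gradients and apply updates in a single place
Neuron.parameters = lambda self: self.w + [self.b]
Layer.parameters  = lambda self: [p for n in self.neurons for p in n.parameters()]
MLP.parameters    = lambda self: [p for layer in self.layers for p in layer.parameters()]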

MLP Architecture

A 4→3→1 network: 4 inputs, one hidden layer of 3 neurons, 1 output.

Create the model in one line
n = MLP(4, [3, 1]) # 4 inputs → 3 hidden → 1 output
That's it. We now have a randomly-initialized network with 4×3 + 3×1 = 15 weights plus 4 biases = 19 learnable parameters.
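A quick sanity check, assuming the parameters() helper sketched above:

print(len(n.parameters()))   # 19: (4*3 + 3) hidden parameters + (3*1 + 1) output parameters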
Chapter 4 · Computing the Output

The Forward Pass

The forward pass is simply: run data through the network from left to right. Each neuron fires in turn, passes its output to the next layer, and eventually we get a prediction out the end.

At the start, since weights are random, predictions are meaningless. But we can still run the forward pass — we need it to compute how wrong we are (§5), which tells us how to improve.

# Run one training example through the network
x = [Value(2.0), Value(3.0), Value(-1.0), Value(1.0)]
prediction = n(x)
print(prediction.data)   # e.g. 0.23
n(x) calls the MLP's __call__ method, which loops through each layer and applies it to x. Each layer call applies each neuron: compute w·x + b, apply tanh, pass output to next layer. The final value is our prediction. It's just a chain of multiplications and additions.
Everything is a number: at every step, we're just computing numbers. The complexity comes from doing it for billions of parameters simultaneously, but the operation on each one is simple arithmetic.

Forward Pass Visualizer

Click "Run Forward Pass" to watch data flow through a 2→3→1 network. Each node lights up with its computed value.

Interactive Forward Pass
Chapter 5 · Measuring Error

Loss — How Wrong Are We?

After the forward pass, we have predictions. Now we need to measure how wrong they are. The loss function is a report card: one single number that summarizes "here is how bad your predictions are right now."

We use Mean Squared Error (MSE): for each example, square the difference between our prediction and the target, then combine the results into one number. (Our code sums the squared errors rather than averaging them; with a fixed number of examples the two differ only by a constant factor.) Squaring ensures the loss is always positive and punishes big mistakes harder.

The goal of training: make this loss number as small as possible.

# Run all training examples through the network
ypred = [n(x) for x in xs]
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
print(loss.data)   # e.g. 4.73 — very wrong at first
For each example: yout is our network's prediction, ygt is the "ground truth" target. (yout - ygt)**2 squares the error. If we predicted 0.23 but the target was 1.0, the error is (0.23−1.0)² = 0.59. We sum all 4 example errors to get one number. The training job is to push this number toward zero.

Loss Landscape

The loss is a function of all the weights. Think of it as a hilly terrain — we want to roll the ball to the lowest point. Click "Step" to run one gradient descent update.

Gradient Descent on Loss Curve (demo starts at w = 3.00, loss = 1.00, with learning rate 0.10)
Loss is always a single number: this is crucial. We need one number so we have one direction to optimize. If we had a vector of losses, we wouldn't know which way to move.
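The widget above boils down to a few lines of code. As an illustration, assume a toy one-weight loss of loss(w) = (w − 2)², whose slope is 2·(w − 2); the real loss depends on all 19 parameters, but the update is the same idea:

# gradient descent on a toy 1-D loss: loss(w) = (w - 2)**2
w = 3.0        # starting position on the curve
lr = 0.1       # learning rate (step size)

for step in range(10):
    loss = (w - 2.0) ** 2
    grad = 2.0 * (w - 2.0)   # d(loss)/dw, worked out by hand for this toy loss
    w -= lr * grad           # step opposite the slope, i.e. downhill
    print(f"step {step}: w = {w:.3f}, loss = {loss:.4f}")
# w slides from 3.0 toward 2.0 and the loss shrinks toward 0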
Chapter 6 · The Math of Change

Derivatives — Which Way is Downhill?

We want to push the loss down. To do that, we need to know: for each weight, does increasing it make the loss go up or down? That's exactly what a derivative tells us.

Imagine standing on a hill. You can't see the whole landscape, but you can feel which direction your foot is going downhill. The derivative is that "feel" — the slope of the loss at your current position.

Key rules we'll use:

  • Addition: ∂(a + b)/∂a = 1 — an addition node passes the gradient through unchanged
  • Multiplication: ∂(a · b)/∂a = b — the gradient is scaled by the other factor
  • tanh: d/dx tanh(x) = 1 − tanh²(x) — the local derivative that the chain rule multiplies in
The Chain Rule in one sentence: if A affects B and B affects the loss, then A's effect on the loss = (A's effect on B) × (B's effect on the loss). We'll use this in §7 to propagate gradients backwards through the whole network.
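You can verify any of these rules numerically by nudging the input a tiny amount and measuring how the output moves, which is how the micrograd lecture builds the intuition. A small sketch (the test point x = 0.5 and the nudge h are arbitrary choices):

import math

# numerical slope: (f(x + h) - f(x)) / h for a tiny h
def f(x):
    return math.tanh(x)

x, h = 0.5, 1e-4
numeric  = (f(x + h) - f(x)) / h        # measured slope at x
analytic = 1 - math.tanh(x) ** 2        # the rule: d/dx tanh(x) = 1 - tanh²(x)
print(numeric, analytic)                # both ≈ 0.786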

Derivative Visualizer

Drag the point along f(x) = x². The tangent line shows the slope (derivative) at that position. Notice: at x=0 the slope is 0; it gets steeper as x moves away from center.

f(x) = x² (demo snapshot: x = 1.00, f(x) = 1.00, f′(x) = 2.00)
Gradient vs Derivative: a derivative is for a function of one variable. A gradient is the same idea for many variables — one slope per weight. The gradient of the loss tells us the slope in every weight direction at once.
Chapter 7 · Assigning Blame

Backpropagation

We know the loss. We know derivatives. But we have hundreds of weights — how do we know which ones to blame, and by how much?

Backpropagation solves this efficiently. It walks the computation graph backwards — from the loss, through every operation, back to every weight — computing each weight's gradient using the chain rule.

Karpathy's key insight: "The only thing backprop does is apply the chain rule recursively." Nothing more. It's mechanical, not magical.

# zero the gradients, run the forward pass, then backprop
for p in n.parameters():
    p.grad = 0.0

ypred = [n(x) for x in xs]
loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
loss.backward()
# Now every p has p.grad: "increase me → loss changes by p.grad"
p.grad = 0.0 clears leftover gradients from the previous training step — they'd otherwise accumulate. loss.backward() walks backwards through every operation that produced loss, applying the chain rule at each node. After this call, every weight has a .grad that says "nudge me up and the loss changes at this rate."
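The simplified Value from §2 can't actually do this yet: it never records which Values produced which, so there is nothing to walk backwards through. Below is a condensed sketch, in the spirit of micrograd, of the fuller class. Every operation remembers its inputs and a small _backward closure, and backward() replays those closures in reverse (topological) order. Some conveniences (division, exp, a nicer repr) are left out.

import math

class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)      # the Values this one was computed from
        self._backward = lambda: None    # how to pass this node's gradient back

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad             # d(a + b)/da = 1
            other.grad += out.grad            # d(a + b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad    # d(a * b)/da = b
            other.grad += self.data * out.grad    # d(a * b)/db = a
        out._backward = _backward
        return out

    def __pow__(self, k):                    # k is a plain number, e.g. **2
        out = Value(self.data ** k, (self,))
        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t ** 2) * out.grad  # d/dx tanh(x) = 1 - tanh²(x)
        out._backward = _backward
        return out

    def __neg__(self):         return self * -1
    def __sub__(self, other):  return self + (-other)
    def __radd__(self, other): return self + other    # lets sum() start from 0
    def __rmul__(self, other): return self * other

    def backward(self):
        # visit every node once, children before parents (topological order)
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                      # d(loss)/d(loss) = 1
        for node in reversed(topo):
            node._backward()

With this version of Value, the earlier snippets (the MLP forward pass, the squared-error loss, and loss.backward()) run end to end.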

Expression Graph

A simple expression: e = tanh((a·b) + a). Click "Forward Pass" to compute values, then "Backprop" to watch gradients flow backwards.

Computation Graph — a=2, b=-3
Why backwards? We already know the loss gradient (it's 1.0 — the loss with respect to itself). We work backwards because each node only needs the gradient from the node ahead of it (closer to the loss) to compute its own gradient. It's a clean recursive application of the chain rule.
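Here is the same tiny graph built in code with the Value sketch above, so you can see the numbers the demo animates (the printed values are approximate):

a = Value(2.0)
b = Value(-3.0)
e = ((a * b) + a).tanh()     # tanh(2*(-3) + 2) = tanh(-4) ≈ -0.9993

e.backward()
print(e.data)                # ≈ -0.9993
print(a.grad, b.grad)        # ≈ -0.0027 and 0.0027: how e responds to nudging a or b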
Chapter 8 · Learning

Gradient Descent — The Training Loop

We have gradients. Now we use them. The update rule is simple: for each weight, move it a tiny step in the opposite direction of its gradient. Opposite because the gradient points uphill — we want to go downhill.

# Create the network
n = MLP(4, [3, 1])

for step in range(100):
    ypred = [n(x) for x in xs]                                    # 1. forward
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))   # 2. loss
    for p in n.parameters():
        p.grad = 0.0                                              # 3. zero grads
    loss.backward()                                               # 4. backprop
    for p in n.parameters():
        p.data -= 0.01 * p.grad                                   # 5. update
    print(f"step {step} loss {loss.data:.4f}")
Line by line:
  1. Forward pass — run all 4 examples, get 4 predictions.
  2. Loss — square the errors and sum them into one number.
  3. Zero grads — gradients accumulate, so reset them before each backward pass.
  4. Backprop — fill every .grad via the chain rule.
  5. Update — p.data -= 0.01 * p.grad moves each weight a tiny step opposite its gradient; 0.01 is the learning rate.
Repeat 100 times and the loss drops from ~4 to near 0.

Loss During Training

The full loop, in words
  1. Run the data forward — get predictions
  2. Measure how wrong they are (loss)
  3. Walk backwards — compute how much each weight contributed to the error
  4. Nudge each weight in the direction that reduces error
  5. Repeat
Full Pipeline

From Random Weights to Predictions

01
Define a Neuron
Each neuron: weighted sum of inputs + bias → tanh. Weights start random. The Value class tracks numbers and their gradients.
Value class · weights · bias · tanh
02
Stack into an MLP
Neurons → Layers → MLP. The output of each layer feeds the next. Every connection is one weight. Our 4→3→1 network has 19 parameters.
Neuron · Layer · MLP
03
Forward Pass
Run training data left to right through the network. Each neuron computes and passes its output to the next layer. Result: one prediction per example.
__call__ · activations
04
Compute Loss
Squared-error loss: the sum of (prediction − target)² over the examples. One scalar number. Big at the start, shrinks toward zero as training proceeds.
MSE · scalar loss
05
Backpropagation
Walk the computation graph backwards. Chain rule at every node. Each weight gets a .grad: "increase me → loss changes by this much."
backward() · chain rule · gradients
06
Gradient Descent → Repeat
p.data -= lr * p.grad for every weight. Tiny step downhill. Zero gradients. Repeat forward → loss → backprop → update for hundreds of steps. Loss reaches ~0.
learning rate · weight update · training loop
Author: ynarwal · Based on Andrej Karpathy's micrograd lecture. Translated repost; all rights belong to the original author.