Why Maths for Machine Learning?
- Understand which areas of mathematics underpin machine learning
- Know why each area matters – not just what it is, but where it shows up
- Read basic mathematical notation without freezing
- Have a clear roadmap for the rest of this series
The Problem
You can get surprisingly far in machine learning without touching the maths. Scikit-learn's model.fit(X, y) does not ask you to derive anything. You can train a neural network in ten lines of Keras without knowing what a gradient is.
Until something goes wrong.
Your model does not converge. Your loss function explodes. Your predictions are confidently wrong. You stare at a learning rate and have no intuition for whether 0.001 is too large or too small. You read a paper and the notation looks like a foreign language.
The maths is not academic decoration. It is the why behind every design decision in ML:
Why does gradient descent work? → Calculus
Why can PCA compress 50 features into 5? → Linear algebra
Why does Naive Bayes assume feature independence? → Probability
Why does cross-entropy measure how wrong a prediction is? → Information theory
You do not need a maths degree. You need enough to build intuition, debug problems, and read documentation without guessing.
The Four Pillars
Machine learning stands on four areas of mathematics. Each one appears at specific, predictable points. The mindmap below shows how the series is structured – four pillars branching into their topics, with dashed boxes showing where each connects to ML.
1. Linear Algebra
What it is: The mathematics of vectors, matrices, and linear transformations.
Where it shows up in ML:
| ML Concept | Linear Algebra Behind It |
|---|---|
| A dataset | A matrix – each row is a sample, each column is a feature |
| A feature vector | A point in n-dimensional space |
| PCA | Eigenvalue decomposition – find the directions of maximum variance |
| Neural network layers | Matrix multiplication: output = weights × input + bias |
| Word embeddings | Vectors in high-dimensional space where direction encodes meaning |
When you call model.fit(X, y), X is a matrix. The model multiplies it by weight matrices, adds bias vectors, and applies transformations. Every forward pass through a neural network is a chain of matrix operations.
```python
import numpy as np

# A dataset IS a matrix
X = np.array([
    [45, 62, 120],  # server 1: [cpu, mem, disk_io]
    [92, 88, 450],  # server 2
    [38, 55, 95],   # server 3
])

# A neural network layer IS matrix multiplication
weights = np.array([
    [0.2, -0.1],
    [0.5, 0.3],
    [0.1, 0.4],
])  # shape: (3 features, 2 neurons)
bias = np.array([0.1, -0.2])

output = X @ weights + bias  # matrix multiply + broadcast add
print(output.shape)  # (3 samples, 2 neurons)
```

2. Calculus
What it is: The mathematics of change – derivatives tell you how fast something is changing and in which direction.
Where it shows up in ML:
| ML Concept | Calculus Behind It |
|---|---|
| Loss function | A surface in parameter space – calculus finds the lowest point |
| Gradient descent | Follow the negative derivative downhill to minimise loss |
| Backpropagation | Chain rule – propagate derivatives backwards through layers |
| Learning rate | Step size along the gradient – too big overshoots, too small stalls |
| Regularisation | Adding a penalty term changes the shape of the loss surface |
Every model learns by minimising a loss function. The derivative tells you which direction is "downhill". Gradient descent takes a step in that direction. Repeat until you reach the bottom.
```python
# Gradient descent in one dimension
# Minimise f(x) = x² (minimum at x=0)
x = 5.0  # start here
learning_rate = 0.1
for step in range(20):
    gradient = 2 * x  # derivative of x² is 2x
    x = x - learning_rate * gradient
    print(f"Step {step:2d}: x = {x:.4f}, f(x) = {x**2:.4f}")
```

That is the entire optimisation loop for every ML model, from linear regression to GPT. The complexity scales, but the principle does not change.
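The learning-rate warning in the table (too big overshoots, too small stalls) can be checked directly. This is a small sketch that reruns the same f(x) = x² descent with three step sizes; the specific values are arbitrary illustrations:

```python
# Same objective f(x) = x², derivative 2x, three different step sizes
def run(learning_rate, steps=20):
    x = 5.0
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

print(run(0.1))    # converges towards 0
print(run(0.001))  # barely moves: still close to the start at 5
print(run(1.1))    # each step overshoots the minimum and diverges
```

With a step size above 1.0, each update jumps past the minimum to a point farther away than where it started, so |x| grows without bound.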
3. Probability and Statistics
What it is: The mathematics of uncertainty – quantifying what we know, what we don't, and how confident we should be.
Where it shows up in ML:
| ML Concept | Probability Behind It |
|---|---|
| Classification | Predicting class probabilities, not just labels |
| Naive Bayes | Bayes' theorem with conditional independence assumption |
| Logistic regression | Sigmoid function outputs a probability between 0 and 1 |
| Overfitting | The model fits noise (random variation) instead of signal |
| Cross-validation | Statistical sampling to estimate generalisation performance |
| Confidence intervals | How much to trust a model's reported accuracy |
When a model says "this server has an 87% chance of failing", that number comes from probability theory. When you evaluate whether that model is reliable, you use statistics.
```python
# Bayes' theorem in action
# P(failure | high_cpu) = P(high_cpu | failure) × P(failure) / P(high_cpu)
p_failure = 0.05                 # 5% of servers fail
p_high_cpu_given_failure = 0.90  # 90% of failures had high CPU
p_high_cpu = 0.20                # 20% of all servers have high CPU

p_failure_given_high_cpu = (
    p_high_cpu_given_failure * p_failure / p_high_cpu
)
print(f"P(failure | high CPU) = {p_failure_given_high_cpu:.1%}")
# 22.5% – much higher than the base rate of 5%
```

4. Information Theory
What it is: The mathematics of information – quantifying surprise, uncertainty, and the difference between distributions.
Where it shows up in ML:
| ML Concept | Information Theory Behind It |
|---|---|
| Decision tree splits | Information gain – choose the feature that reduces entropy the most |
| Cross-entropy loss | The standard loss function for classification – measures how far predictions are from truth |
| KL divergence | Measures the difference between two probability distributions (used in VAEs, distillation) |
| Mutual information | Feature selection – how much does knowing feature X tell you about target Y? |
When a decision tree chooses to split on cpu_avg > 80 rather than disk_io > 200, it is because that split produces the largest reduction in entropy – the most information gain. Cross-entropy loss works the same way: it penalises predictions that are confident and wrong.
Reading Mathematical Notation
The biggest barrier is often notation, not concepts. Here is a survival guide:
| Symbol | Read As | Meaning |
|---|---|---|
| x | "x" | A scalar (single number) |
| **x** | "x vector" or "bold x" | A vector (list of numbers) |
| X | "capital X" | A matrix (grid of numbers) |
| Σ | "sum of" | Add up a series of terms |
| ∏ | "product of" | Multiply a series of terms |
| ∂f/∂x | "partial derivative of f with respect to x" | How f changes when you nudge x, holding everything else fixed |
| ∇f | "nabla f" or "gradient of f" | Vector of all partial derivatives – points uphill |
| \|\|x\|\| | "norm of x" | Length/magnitude of a vector |
| xᵀ | "x transpose" | Flip rows and columns |
| P(A\|B) | "probability of A given B" | How likely A is, knowing B happened |
| argminₓ f(x) | "the x that minimises f" | The input that gives the smallest output |
| ∈ | "in" or "element of" | x ∈ ℝ means x is a real number |
| ℝⁿ | "R n" | n-dimensional real number space |
You do not need to memorise this table. You need to recognise the symbols when you see them. Bookmark this page and come back when you hit unfamiliar notation.
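One way to make a symbol stick is to compute it. Most rows of the table have a direct NumPy equivalent; here is a quick sketch with arbitrary numbers:

```python
import numpy as np

x = np.array([3.0, 4.0])

total = np.sum(x)           # Σ: add up the terms -> 7.0
product = np.prod(x)        # ∏: multiply the terms -> 12.0
norm = np.linalg.norm(x)    # ||x||: length of the vector -> 5.0

X = np.array([[1, 2],
              [3, 4]])
Xt = X.T                    # Xᵀ: flip rows and columns

# argmin: the input that gives the smallest output of f(x) = x²
xs = np.array([-2.0, 0.0, 3.0])
best = xs[np.argmin(xs**2)]  # -> 0.0

print(total, product, norm, best)
print(Xt)
```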
The Roadmap
This series covers each pillar in depth, building from first principles:
Why Maths for Machine Learning? – you are here
Linear Algebra: Vectors and Spaces – vectors, dot products, basis, span, and why features are points in space
Linear Algebra: Matrices and Transformations – matrix operations, eigenvalues, SVD, and how PCA and neural network layers work
Calculus: Derivatives and the Chain Rule – limits, derivatives, partial derivatives, and the engine behind backpropagation
Calculus: Optimisation and Gradient Descent – loss surfaces, learning rates, momentum, and how every model actually learns
Probability: Distributions and Bayes' Theorem – random variables, distributions, conditional probability, and the maths behind Naive Bayes
Statistics: Estimation and Hypothesis Testing – mean, variance, confidence intervals, and knowing when your model results are real
Information Theory: Entropy and Cross-Entropy – entropy, KL divergence, cross-entropy loss, and why decision trees split where they do
Each post will cross-reference the ML Fundamentals series so you can see where the maths connects to the algorithms.
How to Use This Series
If you are following the ML Fundamentals series: read the maths posts as companions. When Part 6 (Neural Networks) references backpropagation, the calculus posts give you the full derivation.
If you are starting here: work through in order. Each post builds on the previous one. The exercises use Python so you can verify the maths computationally β no pen-and-paper proofs required.
If you already know some maths: skip to the parts you are fuzzy on. Each post is self-contained enough to read independently.
Next
Part 2: Linear Algebra – Vectors and Spaces. Starting from what a vector actually is, building up to dot products, basis vectors, and why your dataset is a point cloud in n-dimensional space.
Videos:
- 3Blue1Brown – Essence of Linear Algebra (16 videos) – the single best visual introduction to linear algebra. Watch before or alongside Parts 2-3.
- 3Blue1Brown – Essence of Calculus (12 videos) – derivatives, integrals, and the chain rule visualised beautifully. Pairs with Parts 4-5.
- StatQuest with Josh Starmer – short, clear explanations of statistics and ML concepts. Excellent for Parts 6-7.
- 3Blue1Brown – Bayes' theorem – the geometry of changing beliefs. Pairs with Part 6.
Reading:
- Mathematics for Machine Learning (Deisenroth, Faisal, Ong) – free PDF textbook covering all four pillars in depth.
- Khan Academy – Maths – if any topic in this series moves too fast, Khan Academy has step-by-step lessons from the very basics.
1. Notation practice. Read this expression and describe in plain English what it computes: Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)². Hint: it is one of the metrics from link:/posts/data-preprocessing-and-evaluation/[ML Fundamentals Part 2].
2. Identify the maths. Pick any ML algorithm you have used (even model.fit()). Look up its documentation or a tutorial. List which of the four pillars (linear algebra, calculus, probability, information theory) it relies on.
3. Bayes by hand. Your monitoring shows that 8% of deployments cause incidents. Of deployments that caused incidents, 75% were deployed on Fridays. Overall, 20% of deployments happen on Fridays. What is the probability that a Friday deployment will cause an incident?