Why Maths for Machine Learning?
- Understand which areas of mathematics underpin machine learning
- Know why each area matters – not just what it is, but where it shows up
- Read basic mathematical notation without freezing
- Have a clear roadmap for the rest of this series
The Problem
You can get surprisingly far in machine learning without touching the maths. Scikit-learn's model.fit(X, y) does not ask you to derive anything. You can train a neural network in ten lines of Keras without knowing what a gradient is.
Until something goes wrong.
Your model does not converge. Your loss function explodes. Your predictions are confidently wrong. You stare at a learning rate and have no intuition for whether 0.001 is too large or too small. You read a paper and the notation looks like a foreign language.
The maths is not academic decoration. It is the why behind every design decision in ML:
Why does gradient descent work? → Calculus
Why can PCA compress 50 features into 5? → Linear algebra
Why does Naive Bayes assume feature independence? → Probability
Why does cross-entropy measure how wrong a prediction is? → Information theory
You do not need a maths degree. You need enough to build intuition, debug problems, and read documentation without guessing.
The Four Pillars
Machine learning stands on four areas of mathematics. Each one appears at specific, predictable points. The mindmap below shows how the series is structured – four pillars branching into their topics, with dashed boxes showing where each connects to ML.
1. Linear Algebra
What it is: The mathematics of vectors, matrices, and linear transformations.
Where it shows up in ML:
| ML Concept | Linear Algebra Behind It |
|---|---|
| A dataset | A matrix – each row is a sample, each column is a feature |
| A feature vector | A point in n-dimensional space |
| PCA | Eigenvalue decomposition – find the directions of maximum variance |
| Neural network layers | Matrix multiplication: output = weights × input + bias |
| Word embeddings | Vectors in high-dimensional space where direction encodes meaning |
When you call model.fit(X, y), X is a matrix. The model multiplies it by weight matrices, adds bias vectors, and applies transformations. Every forward pass through a neural network is a chain of matrix operations.
```python
import numpy as np

# A dataset IS a matrix
X = np.array([
    [45, 62, 120],  # server 1: [cpu, mem, disk_io]
    [92, 88, 450],  # server 2
    [38, 55, 95],   # server 3
])

# A neural network layer IS matrix multiplication
weights = np.array([
    [0.2, -0.1],
    [0.5, 0.3],
    [0.1, 0.4],
])  # shape: (3 features, 2 neurons)
bias = np.array([0.1, -0.2])

output = X @ weights + bias  # matrix multiply + broadcast add
print(output.shape)  # (3 samples, 2 neurons)
```

2. Calculus
What it is: The mathematics of change – derivatives tell you how fast something is changing and in which direction.
Where it shows up in ML:
| ML Concept | Calculus Behind It |
|---|---|
| Loss function | A surface in parameter space – calculus finds the lowest point |
| Gradient descent | Follow the negative derivative downhill to minimise loss |
| Backpropagation | Chain rule – propagate derivatives backwards through layers |
| Learning rate | Step size along the gradient – too big overshoots, too small stalls |
| Regularisation | Adding a penalty term changes the shape of the loss surface |
Every model learns by minimising a loss function. The derivative tells you which direction is "downhill". Gradient descent takes a step in that direction. Repeat until you reach the bottom.
```python
# Gradient descent in one dimension
# Minimise f(x) = x² (minimum at x=0)
x = 5.0  # start here
learning_rate = 0.1
for step in range(20):
    gradient = 2 * x  # derivative of x² is 2x
    x = x - learning_rate * gradient
    print(f"Step {step:2d}: x = {x:.4f}, f(x) = {x**2:.4f}")
```

That is the entire optimisation loop for every ML model, from linear regression to GPT. The complexity scales, but the principle does not change.
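The learning-rate warning in the table (too big overshoots, too small stalls) can be checked directly. This is a small sketch that reruns the same f(x) = x² descent with three step sizes; the specific values are arbitrary illustrations:

```python
# Same objective f(x) = x², derivative 2x, three different step sizes
def run(learning_rate, steps=20):
    x = 5.0
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

print(run(0.1))    # converges towards 0
print(run(0.001))  # barely moves: still close to the start at 5
print(run(1.1))    # each step overshoots the minimum and diverges
```

With a step size above 1.0, each update jumps past the minimum to a point farther away than where it started, so |x| grows without bound.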
3. Probability and Statistics
What it is: The mathematics of uncertainty – quantifying what we know, what we don't, and how confident we should be.
Where it shows up in ML:
| ML Concept | Probability Behind It |
|---|---|
| Classification | Predicting class probabilities, not just labels |
| Naive Bayes | Bayes' theorem with conditional independence assumption |
| Logistic regression | Sigmoid function outputs a probability between 0 and 1 |
| Overfitting | The model fits noise (random variation) instead of signal |
| Cross-validation | Statistical sampling to estimate generalisation performance |
| Confidence intervals | How much to trust a model's reported accuracy |
When a model says "this server has an 87% chance of failing", that number comes from probability theory. When you evaluate whether that model is reliable, you use statistics.
```python
# Bayes' theorem in action
# P(failure | high_cpu) = P(high_cpu | failure) × P(failure) / P(high_cpu)
p_failure = 0.05                 # 5% of servers fail
p_high_cpu_given_failure = 0.90  # 90% of failures had high CPU
p_high_cpu = 0.20                # 20% of all servers have high CPU

p_failure_given_high_cpu = (
    p_high_cpu_given_failure * p_failure / p_high_cpu
)
print(f"P(failure | high CPU) = {p_failure_given_high_cpu:.1%}")
# 22.5% – much higher than the base rate of 5%
```

4. Information Theory
What it is: The mathematics of information – quantifying surprise, uncertainty, and the difference between distributions.
Where it shows up in ML:
| ML Concept | Information Theory Behind It |
|---|---|
| Decision tree splits | Information gain – choose the feature that reduces entropy the most |
| Cross-entropy loss | The standard loss function for classification – measures how far predictions are from truth |
| KL divergence | Measures the difference between two probability distributions (used in VAEs, distillation) |
| Mutual information | Feature selection – how much does knowing feature X tell you about target Y? |
When a decision tree chooses to split on cpu_avg > 80 rather than disk_io > 200, it is because that split produces the largest reduction in entropy – the most information gain. Cross-entropy loss works the same way: it penalises predictions that are confident and wrong.
Reading Mathematical Notation
The biggest barrier is often notation, not concepts. Here is a survival guide:
| Symbol | Read As | Meaning |
|---|---|---|
| x | "x" | A scalar (single number) |
| **x** | "x vector" or "bold x" | A vector (list of numbers) |
| X | "capital X" | A matrix (grid of numbers) |
| Σ | "sum of" | Add up a series of terms |
| ∏ | "product of" | Multiply a series of terms |
| ∂f/∂x | "partial derivative of f with respect to x" | How f changes when you nudge x, holding everything else fixed |
| ∇f | "nabla f" or "gradient of f" | Vector of all partial derivatives – points uphill |
| \|\|x\|\| | "norm of x" | Length/magnitude of a vector |
| xᵀ | "x transpose" | Flip rows and columns |
| P(A\|B) | "probability of A given B" | How likely A is, knowing B happened |
| argminₓ f(x) | "the x that minimises f" | The input that gives the smallest output |
| ∈ | "in" or "element of" | x ∈ ℝ means x is a real number |
| ℝⁿ | "R n" | n-dimensional real number space |
You do not need to memorise this table. You need to recognise the symbols when you see them. Bookmark this page and come back when you hit unfamiliar notation.
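One way to make a symbol stick is to compute it. Most rows of the table have a direct NumPy equivalent; here is a quick sketch with arbitrary numbers:

```python
import numpy as np

x = np.array([3.0, 4.0])

total = np.sum(x)           # Σ: add up the terms -> 7.0
product = np.prod(x)        # ∏: multiply the terms -> 12.0
norm = np.linalg.norm(x)    # ||x||: length of the vector -> 5.0

X = np.array([[1, 2],
              [3, 4]])
Xt = X.T                    # Xᵀ: flip rows and columns

# argmin: the input that gives the smallest output of f(x) = x²
xs = np.array([-2.0, 0.0, 3.0])
best = xs[np.argmin(xs**2)]  # -> 0.0

print(total, product, norm, best)
print(Xt)
```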
The Roadmap
This series covers each pillar in depth, building from first principles:
Why Maths for Machine Learning? – you are here
Linear Algebra: Vectors and Spaces – vectors, dot products, basis, span, and why features are points in space
Linear Algebra: Matrices and Transformations – matrix operations, eigenvalues, SVD, and how PCA and neural network layers work
Calculus: Derivatives and the Chain Rule – limits, derivatives, partial derivatives, and the engine behind backpropagation
Calculus: Optimisation and Gradient Descent – loss surfaces, learning rates, momentum, and how every model actually learns
Probability: Distributions and Bayes' Theorem – random variables, distributions, conditional probability, and the maths behind Naive Bayes
Statistics: Estimation and Hypothesis Testing – mean, variance, confidence intervals, and knowing when your model results are real
Information Theory: Entropy and Cross-Entropy – entropy, KL divergence, cross-entropy loss, and why decision trees split where they do
Each post will cross-reference the ML Fundamentals series so you can see where the maths connects to the algorithms.
How to Use This Series
If you are following the ML Fundamentals series: read the maths posts as companions. When Part 6 (Neural Networks) references backpropagation, the calculus posts give you the full derivation.
If you are starting here: work through in order. Each post builds on the previous one. The exercises use Python so you can verify the maths computationally β no pen-and-paper proofs required.
If you already know some maths: skip to the parts you are fuzzy on. Each post is self-contained enough to read independently.
Next
Part 2: Linear Algebra – Vectors and Spaces. Starting from what a vector actually is, building up to dot products, basis vectors, and why your dataset is a point cloud in n-dimensional space.
Videos:
- 3Blue1Brown – Essence of Linear Algebra (16 videos) – the single best visual introduction to linear algebra. Watch before or alongside Parts 2-3.
- 3Blue1Brown – Essence of Calculus (12 videos) – derivatives, integrals, and the chain rule visualised beautifully. Pairs with Parts 4-5.
- StatQuest with Josh Starmer – short, clear explanations of statistics and ML concepts. Excellent for Parts 6-7.
- 3Blue1Brown – Bayes' theorem – the geometry of changing beliefs. Pairs with Part 6.
Reading:
- Mathematics for Machine Learning (Deisenroth, Faisal, Ong) – free PDF textbook covering all four pillars in depth.
- Khan Academy – Maths – if any topic in this series moves too fast, Khan Academy has step-by-step lessons from the very basics.
1. Notation practice. Read this expression and describe in plain English what it computes: Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)². Hint: it is one of the metrics from link:/posts/data-preprocessing-and-evaluation/[ML Fundamentals Part 2].
2. Identify the maths. Pick any ML algorithm you have used (even model.fit()). Look up its documentation or a tutorial. List which of the four pillars (linear algebra, calculus, probability, information theory) it relies on.
3. Bayes by hand. Your monitoring shows that 8% of deployments cause incidents. Of deployments that caused incidents, 75% were deployed on Fridays. Overall, 20% of deployments happen on Fridays. What is the probability that a Friday deployment will cause an incident?