Regression and Decision Boundaries


AI Disclosure: This post was written by Claude Opus 4.6. References to “I” refer to the AI author, not the site owner.

AI edit history
| Date | Model | Action |
| --- | --- | --- |
| 2026-03-25 | Claude Opus 4.6 | authored |
🎯 What You Will Learn
  • Understand linear regression: the math, the assumptions, and how to interpret coefficients
  • Fit curves with polynomial regression and recognise the bias-variance tradeoff
  • Apply Ridge (L2) and Lasso (L1) regularisation to prevent overfitting
  • Use logistic regression for classification despite its name
  • Grasp the Perceptron as the simplest neural unit and understand its limitations
  • Visualise and compare decision boundaries across classifiers
📋 Prerequisites
link:/posts/what-is-machine-learning/[Part 1: What Is Machine Learning?] — supervised learning, features, and labels. link:/posts/data-preprocessing-and-evaluation/[Part 2: Data Pre-processing and Evaluation] — scaling, splitting, and evaluation metrics. Parts 3–4 — Python ML toolkit setup and classification fundamentals (KNN, Decision Trees, Naive Bayes, SVM).

Parts 1–4 gave you classification: discrete outputs, categories, yes/no. Now we cross into the other half of supervised learning — regression, where the output is a continuous number. How much CPU will this host use in an hour? When will this disk fill up? How long will this request take?

We will also circle back to classification with logistic regression and the Perceptron — two models that draw a line (literally) between classes. By the end you will be able to visualise exactly where and why a model makes its decisions.

Linear Regression

Linear regression is the most fundamental predictive model. It fits a straight line through your data to predict a continuous target.

The Math: Ordinary Least Squares

The model predicts:

ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

Where β₀ is the intercept (the prediction when all features are zero) and β₁…βₙ are the coefficients — how much the prediction changes per unit change in each feature.

The "least squares" part means we find the coefficients that minimise the sum of squared residuals — the total squared distance between each prediction and the actual value:

minimise Σ(yᵢ - ŷᵢ)²

Squaring the errors penalises large mistakes more heavily than small ones. A prediction that is off by 20% hurts four times as much as one that is off by 10%.
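This minimisation has a closed-form solution, the normal equation β = (XᵀX)⁻¹Xᵀy. A minimal NumPy sketch (toy data with known coefficients, not the server example below) recovers them with `np.linalg.lstsq`:

```python
import numpy as np

# Toy data with known coefficients: y ≈ 2 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, 50)

# Design matrix with a column of ones so the intercept beta_0 is learned too
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"beta_0 ≈ {beta[0]:.2f}, beta_1 ≈ {beta[1]:.2f}")  # close to 2 and 3
```

scikit-learn's `LinearRegression`, used throughout this post, solves the same least-squares problem under the hood.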

Assumptions

Linear regression assumes:

  • Linearity — the relationship between features and target is approximately linear

  • Independence — observations are independent of each other

  • Homoscedasticity — the variance of errors is roughly constant across all levels of the features

  • Normality — residuals are approximately normally distributed (matters most for confidence intervals)

In practice, mild violations are tolerable. But if the relationship is clearly curved, a straight line will systematically miss — and no amount of data will fix that.
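A quick way to check the linearity and homoscedasticity assumptions is a residual plot: residuals versus predictions should look like a structureless cloud around zero, while a curve or funnel shape signals a violation. A minimal sketch on simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Simulated data that satisfies the assumptions
rng = np.random.default_rng(42)
x = rng.uniform(0, 100, 200).reshape(-1, 1)
y = 10 + 0.5 * x.ravel() + rng.normal(0, 3, 200)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# A healthy fit: residuals scatter randomly around zero with constant spread
plt.scatter(model.predict(x), residuals, alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.title('Residual Plot')
plt.show()
```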

Implementation: Predicting CPU Usage

Suppose you have historical data: the number of active connections on a web server and the resulting CPU usage.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Simulated: active connections vs CPU usage
np.random.seed(42)
connections = np.random.randint(10, 500, size=100)
cpu_usage = 5 + 0.15 * connections + np.random.normal(0, 5, size=100)
cpu_usage = np.clip(cpu_usage, 0, 100)

X = connections.reshape(-1, 1)
y = cpu_usage

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Coefficient (β₁): {model.coef_[0]:.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R²:   {r2_score(y_test, y_pred):.3f}")

# Visualise
plt.scatter(X_test, y_test, alpha=0.6, label='Actual')
plt.plot(
    np.sort(X_test, axis=0),
    model.predict(np.sort(X_test, axis=0)),
    color='red', linewidth=2, label='Predicted'
)
plt.xlabel('Active Connections')
plt.ylabel('CPU Usage (%)')
plt.title('Linear Regression: Connections → CPU')
plt.legend()
plt.show()

Interpreting Coefficients

The output tells you:

  • Intercept (β₀) ≈ 5.0 — with zero connections, baseline CPU is ~5% (OS overhead, background processes)

  • Coefficient (β₁) ≈ 0.15 — each additional connection adds ~0.15% CPU usage

This is the power of linear regression for ops: the coefficients are directly interpretable. You can tell your team "each new connection costs 0.15% CPU" and plan capacity accordingly.
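That capacity planning is just algebra on the fitted line. A small sketch, with the intercept and coefficient values assumed from the simulated fit above:

```python
# Values assumed from the simulated fit above (not re-fitted here)
intercept = 5.0    # beta_0: baseline CPU %
coef = 0.15        # beta_1: CPU % per connection
threshold = 80.0   # capacity-planning threshold

# Invert the line: threshold = intercept + coef * connections
max_connections = (threshold - intercept) / coef
print(f"~{max_connections:.0f} connections before CPU crosses {threshold:.0f}%")
```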

Multiple Features

Real predictions use multiple inputs. Here is a model predicting response time from several server metrics:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

np.random.seed(42)
n = 200

data = pd.DataFrame({
    'cpu_pct': np.random.uniform(10, 95, n),
    'mem_pct': np.random.uniform(20, 90, n),
    'active_conns': np.random.randint(5, 300, n),
    'disk_io_mbps': np.random.uniform(1, 500, n),
})

# Response time depends on all four features
data['response_ms'] = (
    20
    + 0.8 * data['cpu_pct']
    + 0.3 * data['mem_pct']
    + 0.05 * data['active_conns']
    + 0.02 * data['disk_io_mbps']
    + np.random.normal(0, 5, n)
)

X = data.drop(columns=['response_ms'])
y = data['response_ms']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print(f"R²:   {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f} ms")
print("\nFeature importance (standardised coefficients):")
for name, coef in sorted(
    zip(X.columns, model.coef_), key=lambda x: abs(x[1]), reverse=True
):
    print(f"  {name:15s} {coef:+.3f}")

After scaling, the coefficients are comparable. The largest absolute coefficient tells you which feature moves response time most — CPU in this simulation, since its true coefficient multiplied by its spread dominates the others.

Polynomial Regression

Linear regression fails when the relationship curves. Disk usage over time often follows a non-linear trend — slow growth that accelerates as the filesystem fills.

Fitting Curves

Polynomial regression is still a linear model internally — it creates new features like x², x³, etc., and fits a linear model to the expanded feature set.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Simulated: disk usage over 30 days (accelerating growth)
np.random.seed(42)
days = np.arange(1, 31).reshape(-1, 1)
disk_pct = 20 + 0.5 * days.ravel() + 0.05 * days.ravel()**2 + np.random.normal(0, 2, 30)
disk_pct = np.clip(disk_pct, 0, 100)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, degree in enumerate([1, 2, 3]):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(days, disk_pct)
    y_pred = model.predict(days)

    rmse = np.sqrt(mean_squared_error(disk_pct, y_pred))
    r2 = r2_score(disk_pct, y_pred)

    axes[i].scatter(days, disk_pct, alpha=0.6)
    days_smooth = np.linspace(1, 30, 100).reshape(-1, 1)
    axes[i].plot(days_smooth, model.predict(days_smooth), color='red', linewidth=2)
    axes[i].set_title(f'Degree {degree} (RMSE={rmse:.1f}, R²={r2:.3f})')
    axes[i].set_xlabel('Day')
    axes[i].set_ylabel('Disk Usage (%)')

plt.tight_layout()
plt.show()

Degree 1 (linear) misses the curve. Degree 2 (quadratic) captures the acceleration. Degree 3 (cubic) fits slightly better on training data but is starting to chase noise.

The Bias-Variance Tradeoff

This is the central tension in machine learning, and polynomial regression makes it visible:

  • High bias (underfitting) — a degree-1 model cannot represent a curved relationship. It is systematically wrong regardless of how much data you give it.

  • High variance (overfitting) — a degree-15 model fits the training data perfectly but oscillates wildly between data points. It memorises noise.

The sweet spot is a model complex enough to capture the real pattern but not so complex that it fits the noise. Cross-validation (from Part 2) is how you find it.

from sklearn.model_selection import cross_val_score

for degree in [1, 2, 3, 5, 10]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, days, disk_pct, cv=5, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())
    print(f"Degree {degree:2d}: CV RMSE = {rmse:.2f}")

You will see the CV error decrease from degree 1 to 2, then start climbing again at higher degrees — the classic U-shaped curve.

Forecasting Disk Capacity

With a well-fitted polynomial model, you can extrapolate (cautiously) to answer "when does this disk hit 90%?"

model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(days, disk_pct)

# Forecast out to 60 days
future_days = np.arange(1, 61).reshape(-1, 1)
forecast = model.predict(future_days)

days_to_90 = future_days[forecast >= 90]
if len(days_to_90) > 0:
    print(f"Disk hits 90% around day {days_to_90[0][0]}")
else:
    print("Disk stays below 90% in the forecast window")

plt.scatter(days, disk_pct, alpha=0.6, label='Observed')
plt.plot(future_days, forecast, color='red', linewidth=2, label='Forecast')
plt.axhline(y=90, color='orange', linestyle='--', label='90% threshold')
plt.xlabel('Day')
plt.ylabel('Disk Usage (%)')
plt.title('Disk Capacity Forecast')
plt.legend()
plt.show()

A word of caution: polynomial extrapolation diverges quickly outside the training range. For long-term forecasting, time-series methods (covered later in the series) are more robust.

Regularisation: Ridge and Lasso

When you have many features or use polynomial expansion, overfitting becomes likely. Regularisation adds a penalty for large coefficients, forcing the model to stay simple.

Ridge Regression (L2)

Ridge adds the sum of squared coefficients to the loss:

minimise Σ(yᵢ - ŷᵢ)² + α Σ βⱼ²

The penalty α controls the tradeoff. Higher α means smaller coefficients, simpler model, more bias, less variance. Ridge shrinks coefficients toward zero but never exactly to zero — all features stay in the model.
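To see the shrinkage directly, here is a small sketch (synthetic data and illustrative alpha values of my choosing) that sweeps α and prints the resulting Ridge coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: two informative features, one irrelevant
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 0.5, 100)

# Coefficients shrink toward zero as the penalty grows, but never reach it
for alpha in [0, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>3}: coefficients = {np.round(model.coef_, 3)}")
```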

Lasso Regression (L1)

Lasso adds the sum of absolute coefficients:

minimise Σ(yᵢ - ŷᵢ)² + α Σ |βⱼ|

The key difference: Lasso can shrink coefficients to exactly zero, effectively removing features. This makes Lasso a feature selection tool as well as a regression model.

When to Use Which

| Method | Use When | Behaviour |
| --- | --- | --- |
| Ridge (L2) | Many features all contribute somewhat | Shrinks all coefficients; keeps every feature |
| Lasso (L1) | Many features but you suspect only a few matter | Sets irrelevant coefficients to exactly zero |
| ElasticNet | You want the benefits of both | Combines L1 and L2 penalties |

Implementation: Server Response Time with Regularisation

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# Generate data with some irrelevant features
np.random.seed(42)
n = 150

data = pd.DataFrame({
    'cpu_pct': np.random.uniform(10, 95, n),
    'mem_pct': np.random.uniform(20, 90, n),
    'active_conns': np.random.randint(5, 300, n),
    'disk_io_mbps': np.random.uniform(1, 500, n),
    'uptime_days': np.random.randint(1, 365, n),        # irrelevant
    'hostname_hash': np.random.randint(0, 1000, n),     # irrelevant
    'random_noise_1': np.random.normal(0, 1, n),        # noise
    'random_noise_2': np.random.normal(0, 1, n),        # noise
})

data['response_ms'] = (
    20
    + 0.8 * data['cpu_pct']
    + 0.3 * data['mem_pct']
    + 0.05 * data['active_conns']
    + np.random.normal(0, 5, n)
)

X = data.drop(columns=['response_ms'])
y = data['response_ms']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    'Linear':     LinearRegression(),
    'Ridge':      Ridge(alpha=1.0),
    'Lasso':      Lasso(alpha=0.5),
}

for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    non_zero = np.sum(np.abs(model.coef_) > 0.01)
    print(f"{name:10s}  RMSE={rmse:.2f}  R²={r2:.3f}  Non-zero coefficients: {non_zero}")

print("\nLasso coefficients:")
for name, coef in zip(X.columns, models['Lasso'].coef_):
    marker = " ← zeroed" if abs(coef) < 0.01 else ""
    print(f"  {name:18s} {coef:+.4f}{marker}")

Lasso correctly identifies uptime_days, hostname_hash, and the noise columns as irrelevant and drives their coefficients to zero. Ridge keeps all coefficients non-zero but makes them small.

Logistic Regression

Despite the name, logistic regression is a classification algorithm. It predicts the probability that an observation belongs to a class.

The Sigmoid Function

Logistic regression wraps a linear model in the sigmoid function (also called the logistic function):

σ(z) = 1 / (1 + e⁻ᶻ)

Where z = β₀ + β₁x₁ + β₂x₂ + … — the same linear combination as before. The sigmoid squashes any real number into the range (0, 1), giving us a probability.

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-8, 8, 200)
sigma = 1 / (1 + np.exp(-z))

plt.plot(z, sigma, linewidth=2)
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('z (linear combination)')
plt.ylabel('σ(z) — probability')
plt.title('The Sigmoid Function')
plt.show()

When the linear combination is strongly positive, the probability is near 1. When strongly negative, near 0. The decision boundary is where the probability is 0.5 — i.e., where the linear combination equals zero.

Implementation: Healthy vs Unhealthy Servers

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score

np.random.seed(42)
n = 300

# Healthy servers: lower CPU, lower error count
cpu_healthy = np.random.normal(40, 15, n // 2)
errors_healthy = np.random.normal(5, 3, n // 2)

# Unhealthy servers: higher CPU, higher error count
cpu_unhealthy = np.random.normal(80, 12, n // 2)
errors_unhealthy = np.random.normal(30, 10, n // 2)

X = np.vstack([
    np.column_stack([cpu_healthy, errors_healthy]),
    np.column_stack([cpu_unhealthy, errors_unhealthy]),
])
y = np.array([0] * (n // 2) + [1] * (n // 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(random_state=42)
model.fit(X_train_s, y_train)

y_pred = model.predict(X_test_s)
y_prob = model.predict_proba(X_test_s)[:, 1]

print(classification_report(y_test, y_pred, target_names=['healthy', 'unhealthy']))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")

print(f"\nCoefficients: cpu={model.coef_[0][0]:.3f}, errors={model.coef_[0][1]:.3f}")
print(f"Intercept: {model.intercept_[0]:.3f}")

The coefficients tell you which features push the prediction toward "unhealthy". Positive coefficient means higher values of that feature increase the probability of being unhealthy.
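The prediction mechanics are easy to verify by hand: `predict_proba` is just the sigmoid applied to the linear combination of features. A self-contained sketch on toy data (not the server dataset above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, purely to check the mechanics
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Reproduce predict_proba by hand: sigmoid of the linear combination
z = X @ clf.coef_.ravel() + clf.intercept_[0]
manual_prob = 1 / (1 + np.exp(-z))

print(np.allclose(manual_prob, clf.predict_proba(X)[:, 1]))  # True
```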

Decision Boundary

Logistic regression draws a linear boundary. On one side: healthy. On the other: unhealthy.

import matplotlib.pyplot as plt
import numpy as np

# Create mesh grid for decision boundary
x_min, x_max = X[:, 0].min() - 5, X[:, 0].max() + 5
y_min, y_max = X[:, 1].min() - 5, X[:, 1].max() + 5
xx, yy = np.meshgrid(
    np.linspace(x_min, x_max, 300),
    np.linspace(y_min, y_max, 300)
)

grid = np.column_stack([xx.ravel(), yy.ravel()])
grid_scaled = scaler.transform(grid)
Z = model.predict(grid_scaled).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlGn_r')
plt.scatter(X[y == 0, 0], X[y == 0, 1], alpha=0.5, label='Healthy', c='green', s=20)
plt.scatter(X[y == 1, 0], X[y == 1, 1], alpha=0.5, label='Unhealthy', c='red', s=20)
plt.xlabel('CPU Usage (%)')
plt.ylabel('Error Count (per min)')
plt.title('Logistic Regression: Decision Boundary')
plt.legend()
plt.colorbar(label='Predicted Class')
plt.show()

The boundary is a straight line in 2D (a hyperplane in higher dimensions). Everything on the green side is predicted healthy; everything on the red side is predicted unhealthy. This is a linear decision boundary.

The Perceptron

The Perceptron is the simplest possible neural network — a single neuron. Understanding it is the bridge from classical ML to neural networks.

How It Works

A Perceptron computes:

output = 1 if (w₁x₁ + w₂x₂ + ... + wₙxₙ + b) > 0 else 0

That is: take a weighted sum of inputs, add a bias, and apply a step function. If the result is positive, output 1; otherwise, output 0.

The learning algorithm is straightforward:

  1. Initialise weights randomly

  2. For each training sample, compute the prediction

  3. If wrong, nudge the weights: wᵢ = wᵢ + η(y - ŷ)xᵢ where η is the learning rate

  4. Repeat until convergence
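The four steps above can be sketched from scratch in a few lines. The helper name and toy data here are mine for illustration, not part of the post's server dataset:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=50):
    rng = np.random.default_rng(42)
    w = rng.normal(scale=0.01, size=X.shape[1])  # 1. small random weights
    b = 0.0
    for _ in range(epochs):                      # 4. repeat
        for xi, yi in zip(X, y):
            y_hat = int(np.dot(w, xi) + b > 0)   # 2. step-function prediction
            w += eta * (yi - y_hat) * xi         # 3. nudge only when wrong
            b += eta * (yi - y_hat)
    return w, b

# Linearly separable toy data: class 1 when x1 + x2 > 1, with a margin
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
X = X[np.abs(X.sum(axis=1) - 1) > 0.1]           # drop points near the line
y = (X.sum(axis=1) > 1).astype(int)

w, b = train_perceptron(X, y)
preds = (X @ w + b > 0).astype(int)
print(f"Training accuracy: {(preds == y).mean():.2f}")
```

scikit-learn's `Perceptron`, used below, implements the same rule with extras such as shuffling and early stopping.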

from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Same healthy/unhealthy server data from above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

perceptron = Perceptron(max_iter=1000, eta0=0.1, random_state=42)
perceptron.fit(X_train_s, y_train)

y_pred = perceptron.predict(X_test_s)
print(f"Perceptron accuracy: {accuracy_score(y_test, y_pred):.3f}")

The Linearity Limitation: The XOR Problem

The Perceptron can only learn linearly separable patterns. If you cannot draw a straight line between the classes, a single Perceptron fails.

The classic example is XOR — where the output is 1 when exactly one input is 1:

import numpy as np
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt

# XOR: not linearly separable
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

perceptron = Perceptron(max_iter=1000, random_state=42)
perceptron.fit(X_xor, y_xor)
print(f"XOR accuracy: {perceptron.score(X_xor, y_xor):.2f}")  # ~0.50 — random

# Visualise why it fails
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# XOR — not linearly separable
for cls, marker, colour in [(0, 'o', 'red'), (1, 's', 'green')]:
    mask = y_xor == cls
    axes[0].scatter(X_xor[mask, 0], X_xor[mask, 1],
                    marker=marker, c=colour, s=100, label=f'Class {cls}')
axes[0].set_title('XOR — No linear boundary exists')
axes[0].legend()

# AND — linearly separable
y_and = np.array([0, 0, 0, 1])
perceptron_and = Perceptron(max_iter=1000, random_state=42)
perceptron_and.fit(X_xor, y_and)
print(f"AND accuracy: {perceptron_and.score(X_xor, y_and):.2f}")  # 1.00

for cls, marker, colour in [(0, 'o', 'red'), (1, 's', 'green')]:
    mask = y_and == cls
    axes[1].scatter(X_xor[mask, 0], X_xor[mask, 1],
                    marker=marker, c=colour, s=100, label=f'Class {cls}')
axes[1].set_title('AND — Linearly separable')
axes[1].legend()

plt.tight_layout()
plt.show()

This limitation — the inability to solve XOR, highlighted by Minsky and Papert in their 1969 book Perceptrons — was what stalled neural network research for over a decade. The solution, which we will cover in Part 6, is to stack multiple Perceptrons into layers. A multi-layer network can learn non-linear boundaries.

Visualising Decision Boundaries

Different classifiers draw different decision boundaries. Seeing them side by side builds intuition for which model fits which problem.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate two moons dataset — a non-linearly separable problem
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

classifiers = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Perceptron': Perceptron(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'SVM (Linear)': SVC(kernel='linear', random_state=42),
}

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

x_min, x_max = X_train_s[:, 0].min() - 1, X_train_s[:, 0].max() + 1
y_min, y_max = X_train_s[:, 1].min() - 1, X_train_s[:, 1].max() + 1
xx, yy = np.meshgrid(
    np.linspace(x_min, x_max, 200),
    np.linspace(y_min, y_max, 200)
)
grid = np.column_stack([xx.ravel(), yy.ravel()])

for ax, (name, clf) in zip(axes.ravel(), classifiers.items()):
    clf.fit(X_train_s, y_train)
    Z = clf.predict(grid).reshape(xx.shape)
    acc = accuracy_score(y_test, clf.predict(X_test_s))

    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlGn_r')
    ax.scatter(X_train_s[y_train == 0, 0], X_train_s[y_train == 0, 1],
               c='green', s=10, alpha=0.5)
    ax.scatter(X_train_s[y_train == 1, 0], X_train_s[y_train == 1, 1],
               c='red', s=10, alpha=0.5)
    ax.set_title(f'{name}\nAccuracy: {acc:.3f}')

plt.tight_layout()
plt.show()

What you will see:

  • Logistic Regression and Perceptron draw straight lines — they cannot separate the moons

  • Decision Tree creates axis-aligned rectangular regions

  • KNN produces irregular boundaries that follow the local density of points

  • SVM (RBF) draws a smooth, non-linear curve that follows the gap between moons

  • SVM (Linear) draws a straight line, similar to logistic regression

The lesson: no single model is best for all problems. The shape of your data determines which boundary works.

Practical Comparison: When to Use What

| Model | Best For | Limitations | Interpretable? |
| --- | --- | --- | --- |
| Linear Regression | Continuous targets with roughly linear relationships — CPU forecasting, capacity planning | Cannot capture curves; sensitive to outliers | Yes — coefficients are directly meaningful |
| Polynomial Regression | Curved relationships — disk growth, non-linear resource scaling | Overfits easily at high degrees; extrapolation is unreliable | Somewhat — higher-order terms lose intuition |
| Ridge (L2) | Many correlated features — server metric bundles | Does not zero out features | Yes — all features retained |
| Lasso (L1) | Feature selection + regression — finding which metrics actually matter | Can be unstable with highly correlated features | Yes — irrelevant features zeroed out |
| Logistic Regression | Binary/multi-class classification with linear boundaries — server health, alert/no-alert | Cannot learn non-linear boundaries without feature engineering | Yes — coefficients show feature importance |
| Perceptron | Simple, fast binary classification — linearly separable problems | Cannot solve non-linearly separable problems (XOR) | Yes — weights are interpretable |

Rules of Thumb

  1. Start with linear or logistic regression. They are fast, interpretable, and often good enough. If they work, you are done.

  2. If the relationship curves, try polynomial features with degree 2 or 3. Use cross-validation to pick the degree.

  3. If you have many features, add regularisation. Use Lasso if you want automatic feature selection; Ridge if all features contribute.

  4. If the decision boundary is non-linear, move to SVM with RBF kernel, Decision Trees, or the ensemble methods coming in later parts.

  5. If you need to explain the model to other engineers or to management, linear/logistic regression wins. The coefficients tell a story.

📚 Resources

Videos:

  • StatQuest — Linear Regression — fitting a line, R², and residuals explained clearly.
  • StatQuest — Logistic Regression — from linear to sigmoid, decision boundaries.
  • StatQuest — Ridge vs Lasso Regression — L1/L2 regularisation visualised.
  • 3Blue1Brown — But what is a neural network? — the perceptron is the first step toward this.

Reading:

  • Scikit-learn — Linear Models — official docs for linear, logistic, Ridge, and Lasso regression.
  • Scikit-learn — Ridge coefficients as a function of regularisation — visual example of how Ridge shrinks coefficients.

Companion: link:/posts/why-maths-for-machine-learning/[Maths for ML] — Parts 4-5 cover the calculus behind gradient descent and optimisation that powers these models.

🔬 Try It Yourself

1. Predict response times. Export response time data from your monitoring system along with concurrent connections, CPU, and memory. Fit a linear regression. Which feature has the largest standardised coefficient? Does the R² suggest the model captures the relationship?

2. Forecast a resource. Pick a resource metric that is trending (disk, memory, log volume). Fit polynomial models of degree 1–5 and use cross-validation to find the best degree. Extrapolate: when does it hit a critical threshold?

3. Regularisation showdown. Add 10 random noise columns to your dataset. Train Linear, Ridge, and Lasso models. Does Lasso correctly zero out the noise features? How does Ridge compare?

4. Classify your servers. Label your hosts as healthy/unhealthy based on your own criteria. Train a logistic regression and plot the decision boundary (using 2 features). Where does the boundary fall? Does it match your intuition?

5. The XOR test. Create an XOR-like dataset: servers that are unhealthy when CPU is high AND error count is low, OR when CPU is low AND error count is high. Can logistic regression solve it? What about a Decision Tree?

Next

Part 6: Neural Networks from Scratch — building a multi-layer Perceptron by hand, understanding backpropagation, and seeing how stacking neurons solves the limitations we hit in this post.
