Regression and Decision Boundaries
AI Disclosure: This post was written by Claude Opus 4.6. References to “I” refer to the AI author, not the site owner.
AI edit history
| Date | Model | Action |
|---|---|---|
| 2026-03-25 | Claude Opus 4.6 | authored |
- Understand linear regression: the math, the assumptions, and how to interpret coefficients
- Fit curves with polynomial regression and recognise the bias-variance tradeoff
- Apply Ridge (L2) and Lasso (L1) regularisation to prevent overfitting
- Use logistic regression for classification despite its name
- Grasp the Perceptron as the simplest neural unit and understand its limitations
- Visualise and compare decision boundaries across classifiers
Parts 1–4 gave you classification: discrete outputs, categories, yes/no. Now we cross into the other half of supervised learning — regression, where the output is a continuous number. How much CPU will this host use in an hour? When will this disk fill up? How long will this request take?
We will also circle back to classification with logistic regression and the Perceptron — two models that draw a line (literally) between classes. By the end you will be able to visualise exactly where and why a model makes its decisions.
Linear Regression
Linear regression is the most fundamental predictive model. It fits a straight line through your data to predict a continuous target.
The Math: Ordinary Least Squares
The model predicts:
ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Where β₀ is the intercept (the prediction when all features are zero) and β₁…βₙ are the coefficients — how much the prediction changes per unit change in each feature.
The "least squares" part means we find the coefficients that minimise the sum of squared residuals — the total squared distance between each prediction and the actual value:
minimise Σ(yᵢ - ŷᵢ)²
Squaring the errors penalises large mistakes more heavily than small ones. A prediction that is off by 20% hurts four times as much as one that is off by 10%.
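For intuition, this minimisation has a direct solution via the normal equation, β = (XᵀX)⁻¹Xᵀy. Here is a minimal NumPy sketch (the data and true coefficients are invented for illustration):

```python
import numpy as np

# Toy data with known coefficients (invented for illustration):
# y = 2.0 + 1.5x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=50)

# Design matrix with a column of ones for the intercept β₀
X = np.column_stack([np.ones_like(x), x])

# np.linalg.lstsq solves the least-squares problem directly
# (numerically safer than inverting XᵀX by hand)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"β₀ ≈ {beta[0]:.2f}, β₁ ≈ {beta[1]:.2f}")  # close to 2.0 and 1.5
```

scikit-learn's `LinearRegression` does essentially this under the hood, plus convenience around intercepts and multiple targets.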
Assumptions
Linear regression assumes:
Linearity — the relationship between features and target is approximately linear
Independence — observations are independent of each other
Homoscedasticity — the variance of errors is roughly constant across all levels of the features
Normality — residuals are approximately normally distributed (matters most for confidence intervals)
In practice, mild violations are tolerable. But if the relationship is clearly curved, a straight line will systematically miss — and no amount of data will fix that.
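A quick diagnostic sketch (the curved data here is invented to make the point): fit a line, then inspect the residuals. If the linearity assumption holds, residuals look like structureless noise; a systematic pattern means the model is missing the curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Deliberately curved data: a straight line cannot capture the x² term
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 1 + 0.5 * x.ravel() ** 2 + rng.normal(0, 1, 100)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# Compare mean residuals at the edges vs the middle of the feature range.
# A U-shaped pattern (positive, negative, positive) reveals the missed curve.
print(f"Left edge: {np.mean(residuals[:20]):+.2f}")   # positive: line undershoots
print(f"Middle:    {np.mean(residuals[40:60]):+.2f}")  # negative: line overshoots
```

Plotting residuals against each feature is the usual form of this check; the split means above are just the same idea in print form.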
Implementation: Predicting CPU Usage
Suppose you have historical data: the number of active connections on a web server and the resulting CPU usage.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Simulated: active connections vs CPU usage
np.random.seed(42)
connections = np.random.randint(10, 500, size=100)
cpu_usage = 5 + 0.15 * connections + np.random.normal(0, 5, size=100)
cpu_usage = np.clip(cpu_usage, 0, 100)
X = connections.reshape(-1, 1)
y = cpu_usage
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Intercept (β₀): {model.intercept_:.2f}")
print(f"Coefficient (β₁): {model.coef_[0]:.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
# Visualise
plt.scatter(X_test, y_test, alpha=0.6, label='Actual')
plt.plot(
np.sort(X_test, axis=0),
model.predict(np.sort(X_test, axis=0)),
color='red', linewidth=2, label='Predicted'
)
plt.xlabel('Active Connections')
plt.ylabel('CPU Usage (%)')
plt.title('Linear Regression: Connections → CPU')
plt.legend()
plt.show()

Interpreting Coefficients
The output tells you:
- Intercept (β₀) ≈ 5.0 — with zero connections, baseline CPU is ~5% (OS overhead, background processes)
- Coefficient (β₁) ≈ 0.15 — each additional connection adds ~0.15% CPU usage
This is the power of linear regression for ops: the coefficients are directly interpretable. You can tell your team "each new connection costs 0.15% CPU" and plan capacity accordingly.
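You can push that arithmetic one step further for capacity planning. A back-of-envelope sketch using the approximate coefficients above (the 80% alert threshold is an invented example):

```python
# Back-of-envelope capacity planning from the fitted coefficients
# (approximate values from the model above; threshold is illustrative)
intercept = 5.0    # baseline CPU % at zero connections
coef = 0.15        # CPU % added per connection
threshold = 80.0   # alert threshold

# Solve threshold = intercept + coef * connections for connections
max_connections = (threshold - intercept) / coef
print(f"Estimated capacity before hitting {threshold:.0f}% CPU: "
      f"{max_connections:.0f} connections")
```

The same rearrangement works for any linear model: invert the prediction equation to find the input level that produces a target output.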
Multiple Features
Real predictions use multiple inputs. Here is a model predicting response time from several server metrics:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
np.random.seed(42)
n = 200
data = pd.DataFrame({
'cpu_pct': np.random.uniform(10, 95, n),
'mem_pct': np.random.uniform(20, 90, n),
'active_conns': np.random.randint(5, 300, n),
'disk_io_mbps': np.random.uniform(1, 500, n),
})
# Response time depends on all four features
data['response_ms'] = (
20
+ 0.8 * data['cpu_pct']
+ 0.3 * data['mem_pct']
+ 0.05 * data['active_conns']
+ 0.02 * data['disk_io_mbps']
+ np.random.normal(0, 5, n)
)
X = data.drop(columns=['response_ms'])
y = data['response_ms']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
print(f"R²: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f} ms")
print("\nFeature importance (standardised coefficients):")
for name, coef in sorted(
    zip(X.columns, model.coef_), key=lambda x: abs(x[1]), reverse=True
):
    print(f" {name:15s} {coef:+.3f}")

After scaling, the coefficients are comparable. The largest absolute coefficient tells you which feature has the most impact on response time — typically CPU in this scenario.
Polynomial Regression
Linear regression fails when the relationship curves. Disk usage over time often follows a non-linear trend — slow growth that accelerates as the filesystem fills.
Fitting Curves
Polynomial regression is still a linear model internally — it creates new features like x², x³, etc., and fits a linear model to the expanded feature set.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score
# Simulated: disk usage over 30 days (accelerating growth)
np.random.seed(42)
days = np.arange(1, 31).reshape(-1, 1)
disk_pct = 20 + 0.5 * days.ravel() + 0.05 * days.ravel()**2 + np.random.normal(0, 2, 30)
disk_pct = np.clip(disk_pct, 0, 100)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, degree in enumerate([1, 2, 3]):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(days, disk_pct)
    y_pred = model.predict(days)
    rmse = np.sqrt(mean_squared_error(disk_pct, y_pred))
    r2 = r2_score(disk_pct, y_pred)
    axes[i].scatter(days, disk_pct, alpha=0.6)
    days_smooth = np.linspace(1, 30, 100).reshape(-1, 1)
    axes[i].plot(days_smooth, model.predict(days_smooth), color='red', linewidth=2)
    axes[i].set_title(f'Degree {degree} (RMSE={rmse:.1f}, R²={r2:.3f})')
    axes[i].set_xlabel('Day')
    axes[i].set_ylabel('Disk Usage (%)')
plt.tight_layout()
plt.show()

Degree 1 (linear) misses the curve. Degree 2 (quadratic) captures the acceleration. Degree 3 (cubic) fits slightly better on training data but is starting to chase noise.
The Bias-Variance Tradeoff
This is the central tension in machine learning, and polynomial regression makes it visible:
High bias (underfitting) — a degree-1 model cannot represent a curved relationship. It is systematically wrong regardless of how much data you give it.
High variance (overfitting) — a degree-15 model fits the training data perfectly but oscillates wildly between data points. It memorises noise.
The sweet spot is a model complex enough to capture the real pattern but not so complex that it fits the noise. Cross-validation (from Part 2) is how you find it.
from sklearn.model_selection import cross_val_score
for degree in [1, 2, 3, 5, 10]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, days, disk_pct, cv=5, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-scores.mean())
    print(f"Degree {degree:2d}: CV RMSE = {rmse:.2f}")

You will see the CV error decrease from degree 1 to 2, then start climbing again at higher degrees — the classic U-shaped curve.
Forecasting Disk Capacity
With a well-fitted polynomial model, you can extrapolate (cautiously) to answer "when does this disk hit 90%?"
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(days, disk_pct)
# Forecast out to 60 days
future_days = np.arange(1, 61).reshape(-1, 1)
forecast = model.predict(future_days)
days_to_90 = future_days[forecast >= 90]
if len(days_to_90) > 0:
    print(f"Disk hits 90% around day {days_to_90[0][0]}")
else:
    print("Disk stays below 90% in the forecast window")
plt.scatter(days, disk_pct, alpha=0.6, label='Observed')
plt.plot(future_days, forecast, color='red', linewidth=2, label='Forecast')
plt.axhline(y=90, color='orange', linestyle='--', label='90% threshold')
plt.xlabel('Day')
plt.ylabel('Disk Usage (%)')
plt.title('Disk Capacity Forecast')
plt.legend()
plt.show()A word of caution: polynomial extrapolation diverges quickly outside the training range. For long-term forecasting, time-series methods (covered later in the series) are more robust.
Regularisation: Ridge and Lasso
When you have many features or use polynomial expansion, overfitting becomes likely. Regularisation adds a penalty for large coefficients, forcing the model to stay simple.
Ridge Regression (L2)
Ridge adds the sum of squared coefficients to the loss:
minimise Σ(yᵢ - ŷᵢ)² + α Σ βⱼ²
The penalty α controls the tradeoff. Higher α means smaller coefficients, simpler model, more bias, less variance. Ridge shrinks coefficients toward zero but never exactly to zero — all features stay in the model.
Lasso Regression (L1)
Lasso adds the sum of absolute coefficients:
minimise Σ(yᵢ - ŷᵢ)² + α Σ |βⱼ|
The key difference: Lasso can shrink coefficients to exactly zero, effectively removing features. This makes Lasso a feature selection tool as well as a regression model.
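A small sketch makes the contrast concrete (synthetic data with invented coefficients): as α grows, Ridge shrinks every coefficient smoothly, while Lasso zeroes the irrelevant one and, at high enough α, zeroes everything.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Three features; only the first two actually matter (invented example)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.5, 100)

for alpha in [0.01, 1.0, 10.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"α={alpha:5.2f}  Ridge: {np.round(ridge.coef_, 2)}  "
          f"Lasso: {np.round(lasso.coef_, 2)}")
```

Note that the two models scale α differently (Lasso's objective divides the squared error by the number of samples), so the same numeric α is not directly comparable between them; the qualitative behaviour is what matters here.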
When to Use Which
| Method | Use When | Behaviour |
|---|---|---|
Ridge (L2) | Many features all contribute somewhat | Shrinks all coefficients; keeps every feature |
Lasso (L1) | Many features but you suspect only a few matter | Sets irrelevant coefficients to exactly zero |
ElasticNet | You want the benefits of both | Combines L1 and L2 penalties |
Implementation: Server Response Time with Regularisation
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
# Generate data with some irrelevant features
np.random.seed(42)
n = 150
data = pd.DataFrame({
'cpu_pct': np.random.uniform(10, 95, n),
'mem_pct': np.random.uniform(20, 90, n),
'active_conns': np.random.randint(5, 300, n),
'disk_io_mbps': np.random.uniform(1, 500, n),
'uptime_days': np.random.randint(1, 365, n), # irrelevant
'hostname_hash': np.random.randint(0, 1000, n), # irrelevant
'random_noise_1': np.random.normal(0, 1, n), # noise
'random_noise_2': np.random.normal(0, 1, n), # noise
})
data['response_ms'] = (
20
+ 0.8 * data['cpu_pct']
+ 0.3 * data['mem_pct']
+ 0.05 * data['active_conns']
+ np.random.normal(0, 5, n)
)
X = data.drop(columns=['response_ms'])
y = data['response_ms']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
models = {
'Linear': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Lasso': Lasso(alpha=0.5),
}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    non_zero = np.sum(np.abs(model.coef_) > 0.01)
    print(f"{name:10s} RMSE={rmse:.2f} R²={r2:.3f} Non-zero coefficients: {non_zero}")
print("\nLasso coefficients:")
for name, coef in zip(X.columns, models['Lasso'].coef_):
    marker = " ← zeroed" if abs(coef) < 0.01 else ""
    print(f" {name:18s} {coef:+.4f}{marker}")

Lasso correctly identifies uptime_days, hostname_hash, and the noise columns as irrelevant and drives their coefficients to zero. Ridge keeps all coefficients non-zero but makes them small.
Logistic Regression
Despite the name, logistic regression is a classification algorithm. It predicts the probability that an observation belongs to a class.
The Sigmoid Function
Logistic regression wraps a linear model in the sigmoid function (also called the logistic function):
σ(z) = 1 / (1 + e⁻ᶻ)
Where z = β₀ + β₁x₁ + β₂x₂ + … — the same linear combination as before. The sigmoid squashes any real number into the range (0, 1), giving us a probability.
import numpy as np
import matplotlib.pyplot as plt
z = np.linspace(-8, 8, 200)
sigma = 1 / (1 + np.exp(-z))
plt.plot(z, sigma, linewidth=2)
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('z (linear combination)')
plt.ylabel('σ(z) — probability')
plt.title('The Sigmoid Function')
plt.show()

When the linear combination is strongly positive, the probability is near 1. When strongly negative, near 0. The decision boundary is where the probability is 0.5 — i.e., where the linear combination equals zero.
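You can recover that boundary directly from fitted coefficients. A self-contained sketch (synthetic data with a known linear boundary): solve β₀ + β₁x₁ + β₂x₂ = 0 for x₂, then check that points on the resulting line get a predicted probability of 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 2-feature data with a known linear class boundary (x1 + x2 = 0)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
b0 = model.intercept_[0]
b1, b2 = model.coef_[0]

# The boundary is where z = b0 + b1*x1 + b2*x2 = 0, i.e. x2 = -(b0 + b1*x1)/b2
x1 = np.linspace(-2, 2, 5)
x2_boundary = -(b0 + b1 * x1) / b2

# Points on this line sit exactly at σ(0) = 0.5
probs = model.predict_proba(np.column_stack([x1, x2_boundary]))[:, 1]
print(np.round(probs, 3))  # all 0.5
```

This is the same line the contour plot below draws by brute force over a mesh grid; the closed form just makes the geometry explicit.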
Implementation: Healthy vs Unhealthy Servers
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
np.random.seed(42)
n = 300
# Healthy servers: lower CPU, lower error count
cpu_healthy = np.random.normal(40, 15, n // 2)
errors_healthy = np.random.normal(5, 3, n // 2)
# Unhealthy servers: higher CPU, higher error count
cpu_unhealthy = np.random.normal(80, 12, n // 2)
errors_unhealthy = np.random.normal(30, 10, n // 2)
X = np.vstack([
np.column_stack([cpu_healthy, errors_healthy]),
np.column_stack([cpu_unhealthy, errors_unhealthy]),
])
y = np.array([0] * (n // 2) + [1] * (n // 2))
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
model = LogisticRegression(random_state=42)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
y_prob = model.predict_proba(X_test_s)[:, 1]
print(classification_report(y_test, y_pred, target_names=['healthy', 'unhealthy']))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")
print(f"\nCoefficients: cpu={model.coef_[0][0]:.3f}, errors={model.coef_[0][1]:.3f}")
print(f"Intercept: {model.intercept_[0]:.3f}")

The coefficients tell you which features push the prediction toward "unhealthy". A positive coefficient means higher values of that feature increase the probability of being unhealthy.
Decision Boundary
Logistic regression draws a linear boundary. On one side: healthy. On the other: unhealthy.
import matplotlib.pyplot as plt
import numpy as np
# Create mesh grid for decision boundary
x_min, x_max = X[:, 0].min() - 5, X[:, 0].max() + 5
y_min, y_max = X[:, 1].min() - 5, X[:, 1].max() + 5
xx, yy = np.meshgrid(
np.linspace(x_min, x_max, 300),
np.linspace(y_min, y_max, 300)
)
grid = np.column_stack([xx.ravel(), yy.ravel()])
grid_scaled = scaler.transform(grid)
Z = model.predict(grid_scaled).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlGn_r')
plt.scatter(X[y == 0, 0], X[y == 0, 1], alpha=0.5, label='Healthy', c='green', s=20)
plt.scatter(X[y == 1, 0], X[y == 1, 1], alpha=0.5, label='Unhealthy', c='red', s=20)
plt.xlabel('CPU Usage (%)')
plt.ylabel('Error Count (per min)')
plt.title('Logistic Regression: Decision Boundary')
plt.legend()
plt.colorbar(label='Predicted Class')
plt.show()

The boundary is a straight line in 2D (a hyperplane in higher dimensions). Everything on the green side is predicted healthy; everything on the red side is predicted unhealthy. This is a linear decision boundary.
The Perceptron
The Perceptron is the simplest possible neural network — a single neuron. Understanding it is the bridge from classical ML to neural networks.
How It Works
A Perceptron computes:
output = 1 if (w₁x₁ + w₂x₂ + ... + wₙxₙ + b) > 0 else 0
That is: take a weighted sum of inputs, add a bias, and apply a step function. If the result is positive, output 1; otherwise, output 0.
The learning algorithm is straightforward:
1. Initialise weights randomly
2. For each training sample, compute the prediction
3. If wrong, nudge the weights: wᵢ = wᵢ + η(y - ŷ)xᵢ, where η is the learning rate
4. Repeat until convergence
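That update rule fits in a dozen lines. A minimal from-scratch sketch (not the scikit-learn implementation), trained here on the AND function, which is linearly separable:

```python
import numpy as np

# Minimal from-scratch Perceptron implementing the update rule above
def perceptron_train(X, y, lr=0.1, epochs=50):
    rng = np.random.default_rng(42)
    w = rng.normal(0, 0.01, X.shape[1])  # small random initial weights
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = int(w @ xi + b > 0)  # step activation
            if y_hat != yi:              # nudge weights only on mistakes
                w += lr * (yi - y_hat) * xi
                b += lr * (yi - y_hat)
    return w, b

# AND is linearly separable, so the Perceptron converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
preds = [int(w @ xi + b > 0) for xi in X]
print(preds)  # [0, 0, 0, 1]
```

The Perceptron convergence theorem guarantees this loop terminates with a perfect separator whenever one exists; for non-separable data it cycles forever, which is why `max_iter` matters in the scikit-learn version below.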
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Same healthy/unhealthy server data from above
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
perceptron = Perceptron(max_iter=1000, eta0=0.1, random_state=42)
perceptron.fit(X_train_s, y_train)
y_pred = perceptron.predict(X_test_s)
print(f"Perceptron accuracy: {accuracy_score(y_test, y_pred):.3f}")

The Linearity Limitation: The XOR Problem
The Perceptron can only learn linearly separable patterns. If you cannot draw a straight line between the classes, a single Perceptron fails.
The classic example is XOR — where the output is 1 when exactly one input is 1:
import numpy as np
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
# XOR: not linearly separable
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
perceptron = Perceptron(max_iter=1000, random_state=42)
perceptron.fit(X_xor, y_xor)
print(f"XOR accuracy: {perceptron.score(X_xor, y_xor):.2f}") # ~0.50 — random
# Visualise why it fails
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# XOR — not linearly separable
for cls, marker, colour in [(0, 'o', 'red'), (1, 's', 'green')]:
    mask = y_xor == cls
    axes[0].scatter(X_xor[mask, 0], X_xor[mask, 1],
                    marker=marker, c=colour, s=100, label=f'Class {cls}')
axes[0].set_title('XOR — No linear boundary exists')
axes[0].legend()
# AND — linearly separable
y_and = np.array([0, 0, 0, 1])
perceptron_and = Perceptron(max_iter=1000, random_state=42)
perceptron_and.fit(X_xor, y_and)
print(f"AND accuracy: {perceptron_and.score(X_xor, y_and):.2f}") # 1.00
for cls, marker, colour in [(0, 'o', 'red'), (1, 's', 'green')]:
    mask = y_and == cls
    axes[1].scatter(X_xor[mask, 0], X_xor[mask, 1],
                    marker=marker, c=colour, s=100, label=f'Class {cls}')
axes[1].set_title('AND — Linearly separable')
axes[1].legend()
plt.tight_layout()
plt.show()This limitation — the inability to solve XOR — was what stalled neural network research in the 1960s. The solution, which we will cover in Part 6, is to stack multiple Perceptrons into layers. A multi-layer network can learn non-linear boundaries.
Visualising Decision Boundaries
Different classifiers draw different decision boundaries. Seeing them side by side builds intuition for which model fits which problem.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Generate two moons dataset — a non-linearly separable problem
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
classifiers = {
'Logistic Regression': LogisticRegression(random_state=42),
'Perceptron': Perceptron(max_iter=1000, random_state=42),
'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
'SVM (RBF)': SVC(kernel='rbf', random_state=42),
'SVM (Linear)': SVC(kernel='linear', random_state=42),
}
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
x_min, x_max = X_train_s[:, 0].min() - 1, X_train_s[:, 0].max() + 1
y_min, y_max = X_train_s[:, 1].min() - 1, X_train_s[:, 1].max() + 1
xx, yy = np.meshgrid(
np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200)
)
grid = np.column_stack([xx.ravel(), yy.ravel()])
for ax, (name, clf) in zip(axes.ravel(), classifiers.items()):
    clf.fit(X_train_s, y_train)
    Z = clf.predict(grid).reshape(xx.shape)
    acc = accuracy_score(y_test, clf.predict(X_test_s))
    ax.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlGn_r')
    ax.scatter(X_train_s[y_train == 0, 0], X_train_s[y_train == 0, 1],
               c='green', s=10, alpha=0.5)
    ax.scatter(X_train_s[y_train == 1, 0], X_train_s[y_train == 1, 1],
               c='red', s=10, alpha=0.5)
    ax.set_title(f'{name}\nAccuracy: {acc:.3f}')
plt.tight_layout()
plt.show()

What you will see:
Logistic Regression and Perceptron draw straight lines — they cannot separate the moons
Decision Tree creates axis-aligned rectangular regions
KNN produces irregular boundaries that follow the local density of points
SVM (RBF) draws a smooth, non-linear curve that follows the gap between moons
SVM (Linear) draws a straight line, similar to logistic regression
The lesson: no single model is best for all problems. The shape of your data determines which boundary works.
Practical Comparison: When to Use What
| Model | Best For | Limitations | Interpretable? |
|---|---|---|---|
Linear Regression | Continuous targets with roughly linear relationships — CPU forecasting, capacity planning | Cannot capture curves; sensitive to outliers | Yes — coefficients are directly meaningful |
Polynomial Regression | Curved relationships — disk growth, non-linear resource scaling | Overfits easily at high degrees; extrapolation is unreliable | Somewhat — higher-order terms lose intuition |
Ridge (L2) | Many correlated features — server metric bundles | Does not zero out features | Yes — all features retained |
Lasso (L1) | Feature selection + regression — finding which metrics actually matter | Can be unstable with highly correlated features | Yes — irrelevant features zeroed out |
Logistic Regression | Binary/multi-class classification with linear boundaries — server health, alert/no-alert | Cannot learn non-linear boundaries without feature engineering | Yes — coefficients show feature importance |
Perceptron | Simple, fast binary classification — linearly separable problems | Cannot solve non-linearly separable problems (XOR) | Yes — weights are interpretable |
Rules of Thumb
Start with linear or logistic regression. They are fast, interpretable, and often good enough. If they work, you are done.
If the relationship curves, try polynomial features with degree 2 or 3. Use cross-validation to pick the degree.
If you have many features, add regularisation. Use Lasso if you want automatic feature selection; Ridge if all features contribute.
If the decision boundary is non-linear, move to SVM with RBF kernel, Decision Trees, or the ensemble methods coming in later parts.
If you need to explain the model to other engineers or to management, linear/logistic regression wins. The coefficients tell a story.
Videos:
- StatQuest — Linear Regression — fitting a line, R², and residuals explained clearly.
- StatQuest — Logistic Regression — from linear to sigmoid, decision boundaries.
- StatQuest — Ridge vs Lasso Regression — L1/L2 regularisation visualised.
- 3Blue1Brown — But what is a neural network? — the perceptron is the first step toward this.
Reading:
- Scikit-learn — Linear Models — official docs for linear, logistic, Ridge, and Lasso regression.
- Scikit-learn — Ridge coefficients as a function of regularisation — visual example of how Ridge shrinks coefficients.
Companion: [Maths for ML](/posts/why-maths-for-machine-learning/) — Parts 4-5 cover the calculus behind gradient descent and optimisation that powers these models.
1. Predict response times. Export response time data from your monitoring system along with concurrent connections, CPU, and memory. Fit a linear regression. Which feature has the largest standardised coefficient? Does the R² suggest the model captures the relationship?
2. Forecast a resource. Pick a resource metric that is trending (disk, memory, log volume). Fit polynomial models of degree 1–5 and use cross-validation to find the best degree. Extrapolate: when does it hit a critical threshold?
3. Regularisation showdown. Add 10 random noise columns to your dataset. Train Linear, Ridge, and Lasso models. Does Lasso correctly zero out the noise features? How does Ridge compare?
4. Classify your servers. Label your hosts as healthy/unhealthy based on your own criteria. Train a logistic regression and plot the decision boundary (using 2 features). Where does the boundary fall? Does it match your intuition?
5. The XOR test. Create an XOR-like dataset: servers that are unhealthy when CPU is high AND error count is low, OR when CPU is low AND error count is high. Can logistic regression solve it? What about a Decision Tree?
Next
Part 6: Neural Networks from Scratch — building a multi-layer Perceptron by hand, understanding backpropagation, and seeing how stacking neurons solves the limitations we hit in this post.