Data Pre-processing and Evaluation
- Clean and transform raw data into ML-ready features
- Understand feature selection and dimensionality reduction (PCA)
- Split data correctly into training, validation, and test sets
- Evaluate models using accuracy, precision, recall, F1, ROC curves, and RMSE
- Recognise and prevent overfitting
Why Pre-processing Matters
Raw data is never ready for machine learning. Logs have missing fields. Metrics arrive at irregular intervals. Categorical values like hostnames need encoding. Scales differ wildly: CPU percentage ranges from 0 to 100 while network throughput might be 0 to 10,000 Mbps.
If you feed raw data into a model, the model will either fail or learn the wrong things. Pre-processing is the bridge between your data as it exists and your data as the model needs it.
In practice, data preparation takes more time than model selection. Getting this right is what separates a model that works from one that looks good in a notebook but falls over in production.
The Pre-processing Pipeline
Every ML project follows roughly the same pipeline:
Raw Data → Clean → Transform → Select Features → Split → Train → Evaluate

Let's walk through each stage.
1. Handling Missing Data
Real-world data has gaps. A monitoring agent drops offline, a log field is optional, a sensor returns null. You have three options:
import pandas as pd
import numpy as np
# Simulated server metrics with gaps
df = pd.DataFrame({
    'cpu_avg': [45, 92, np.nan, 87, 38, 91],
    'mem_pct': [62, 88, 55, np.nan, 55, 80],
    'disk_io': [120, 450, 95, 520, np.nan, 480],
    'error_count': [3, 47, 1, 62, 1, 55],
    'failed': [0, 1, 0, 1, 0, 1]
})
print("Missing values per column:")
print(df.isnull().sum())

Option A: Drop rows. Simplest, but you lose data.
df_dropped = df.dropna()
print(f"Rows: {len(df)} → {len(df_dropped)}")

Option B: Fill with a statistic (mean, median, or mode). Preserves row count.
df_filled = df.fillna(df.median(numeric_only=True))

Option C: Forward/backward fill. Useful for time-series where the last known value is a reasonable estimate.
df_ffill = df.ffill()

Which to use depends on context. For time-series metrics, forward fill usually makes sense. For independent samples, median imputation is safer than the mean (outliers skew the mean).
2. Feature Encoding
ML models work with numbers. Categorical values need encoding.
Label encoding assigns integers. Use it when there is a natural order.
from sklearn.preprocessing import LabelEncoder
severities = ['info', 'warning', 'error', 'critical']
le = LabelEncoder()
encoded = le.fit_transform(severities)
print(dict(zip(severities, encoded)))
# {'info': 2, 'warning': 3, 'error': 1, 'critical': 0}

One-hot encoding creates binary columns. Use it when there is no natural order.
import pandas as pd
hosts = pd.DataFrame({'host': ['griffin', 'wolfhound', 'blackjack', 'griffin']})
encoded = pd.get_dummies(hosts, columns=['host'], dtype=int)
print(encoded)
# host_blackjack host_griffin host_wolfhound
# 0 0 1 0
# 1 0 0 1
# 2 1 0 0
# 3 0 1 0

The trap with label encoding is that the model may interpret the integers as having magnitude: it might think critical (0) < info (2), which is meaningless. One-hot avoids this but adds columns.
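If your categories do carry a real order, LabelEncoder is the wrong tool anyway, since it assigns integers alphabetically. A minimal sketch using an explicit mapping instead (the severity ordering and the `logs` frame are made up for illustration):

```python
import pandas as pd

# Severity levels in their true order, least to most severe
severity_order = ['info', 'warning', 'error', 'critical']
mapping = {level: rank for rank, level in enumerate(severity_order)}

logs = pd.DataFrame({'severity': ['info', 'critical', 'warning', 'error']})
logs['severity_rank'] = logs['severity'].map(mapping)
print(logs)
# info → 0, warning → 1, error → 2, critical → 3
```

Now the integers really do mean "higher is worse", so a model treating them as magnitudes is behaving sensibly.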
3. Feature Scaling
When features have different scales, algorithms that rely on distance (KNN, SVM, neural networks) will be dominated by the largest-scale feature.
Standardisation (z-score) centres data around 0 with unit variance. A good default.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_filled.drop(columns=['failed'])),
    columns=df_filled.drop(columns=['failed']).columns
)
print(df_scaled.describe().round(2))

Min-max normalisation scales to [0, 1]. Use it when you need bounded values.
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
df_normed = pd.DataFrame(
    minmax.fit_transform(df_filled.drop(columns=['failed'])),
    columns=df_filled.drop(columns=['failed']).columns
)

Rule of thumb: standardise unless you have a reason not to. Tree-based models (Decision Trees, Random Forest, XGBoost) are scale-invariant and do not need scaling.
Feature Selection
Not all features help. Some are redundant, some are noise, and some actively hurt performance by introducing irrelevant patterns the model tries to learn.
Correlation Analysis
Start by checking which features correlate with your target.
# Correlation with the target variable
correlations = df_filled.corr(numeric_only=True)['failed'].drop('failed')
print(correlations.sort_values(ascending=False))

Features with near-zero correlation to the target are candidates for removal. Features that are highly correlated with each other (multicollinearity) are redundant: keep one and drop the rest.
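Checking every pair by eye does not scale, so here is one way to flag multicollinear pairs automatically. The metrics and the 0.9 cut-off are invented for illustration; `load_1m` is built to shadow `cpu_avg`:

```python
import numpy as np
import pandas as pd

# Invented metrics: load_1m deliberately tracks cpu_avg, disk_io does not
df = pd.DataFrame({
    'cpu_avg': [45, 92, 60, 87, 38, 91],
    'load_1m': [1.2, 7.8, 2.1, 7.1, 0.9, 7.5],
    'disk_io': [300, 310, 120, 520, 450, 150],
})

# Absolute pairwise correlations; keep only the upper triangle so each
# pair is checked once and the diagonal is ignored
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # arbitrary cut-off for "redundant"
redundant = [(row, col)
             for col in upper.columns
             for row in upper.index
             if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold]
print(redundant)
```

For each flagged pair, keep whichever feature is cheaper to collect or easier to explain, and drop the other.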
Dimensionality Reduction with PCA
Principal Component Analysis compresses many features into fewer dimensions while preserving as much variance as possible. Useful when you have dozens or hundreds of metrics.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Reduce 4 features to 2 principal components
pca = PCA(n_components=2)
X = df_filled.drop(columns=['failed'])
X_pca = pca.fit_transform(StandardScaler().fit_transform(X))
print(f"Explained variance: {pca.explained_variance_ratio_.round(3)}")
print(f"Total variance retained: {sum(pca.explained_variance_ratio_):.1%}")
# Visualise
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df_filled['failed'], cmap='RdYlGn_r')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Servers in PCA space')
plt.colorbar(label='Failed')
plt.show()

PCA is a black box: the resulting components are linear combinations of the original features and lose interpretability. Use it when you need to reduce dimensionality, not when you need to explain which features matter.
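Rather than hard-coding n_components=2, you can pass PCA a float: scikit-learn then keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic data (two hidden factors driving ten noisy metrics; the shapes and noise scale are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# 200 samples of 10 metrics driven by 2 hidden factors plus a little noise
factors = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X = factors @ loadings + rng.normal(scale=0.05, size=(200, 10))

# Keep however many components are needed for 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))

print(f"{X.shape[1]} features reduced to {pca.n_components_} components")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```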
Splitting Your Data
The cardinal rule: never evaluate a model on data it trained on.
Train / Test Split
The simplest approach. Hold back a portion of data for testing.
from sklearn.model_selection import train_test_split
X = df_filled.drop(columns=['failed'])
y = df_filled['failed']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

stratify=y ensures the class balance in the split matches the original. If 30% of your servers failed, both train and test will have ~30% failures.
Train / Validation / Test Split
For tuning hyperparameters, you need three sets:
- Training set: the model learns from this
- Validation set: used to tune hyperparameters and make design decisions
- Test set: touched only once, at the very end, to report final performance
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Second split: separate validation from training
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
# Result: 60% train, 20% validation, 20% test

Cross-Validation
When data is limited, holding out 40% is expensive. K-fold cross-validation uses all the data for both training and testing by rotating the held-out fold.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Fold scores: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

With 5-fold CV, each sample appears in the held-out fold exactly once. The mean score is a more reliable estimate than a single train/test split.
Evaluating Your Model
A model that is 95% accurate sounds great, until you realise 95% of your servers are healthy and the model just predicts "healthy" every time. Evaluation metrics need to match your problem.
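scikit-learn's DummyClassifier makes that trap easy to demonstrate: it ignores the features and, with strategy='most_frequent', always predicts the majority class. The 95/5 split below is invented for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 95 healthy servers (0), 5 failed (1); the features don't matter here
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")  # 0.95 - looks great
print(f"Recall:   {recall_score(y, y_pred):.2f}")    # 0.00 - catches no failures
```

Any real model should beat this baseline before you celebrate its accuracy.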
Classification Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report, confusion_matrix
)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['healthy', 'failed']))

| Metric | What It Measures | When It Matters |
|---|---|---|
| Accuracy | Correct predictions / total predictions | Balanced classes only |
| Precision | Of predicted positives, how many were correct? | When false positives are costly (unnecessary alerts) |
| Recall | Of actual positives, how many did we catch? | When false negatives are costly (missed failures) |
| F1 Score | Harmonic mean of precision and recall | When you need to balance both |
For infrastructure monitoring, recall usually matters more: missing a server failure (false negative) is worse than a false alarm (false positive).
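One practical lever for that trade-off is the classification threshold: predict() effectively cuts predict_proba at 0.5, but you can cut lower to catch more failures at the price of more false alarms. A sketch on synthetic imbalanced data (logistic regression is used here because it produces smooth probabilities; all numbers are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with ~10% positives, standing in for server failures
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Lowering the threshold trades precision for recall
for threshold in (0.5, 0.3, 0.1):
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"recall={recall_score(y_test, y_pred):.2f}, "
          f"precision={precision_score(y_test, y_pred):.2f}")
```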
The Confusion Matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
            xticklabels=['healthy', 'failed'],
            yticklabels=['healthy', 'failed'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

The confusion matrix shows you exactly where the model is wrong. The off-diagonal cells are your errors: top-right is false positives, bottom-left is false negatives.
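For a binary problem the four counts can also be unpacked directly: confusion_matrix lays the matrix out as [[TN, FP], [FN, TP]], so ravel() yields them in that order. Toy labels for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Binary layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Precision and recall fall straight out of the counts
print(f"Recall:    {tp / (tp + fn):.2f}")
print(f"Precision: {tp / (tp + fp):.2f}")
```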
ROC Curves
The Receiver Operating Characteristic curve plots true positive rate against false positive rate at every classification threshold.
from sklearn.metrics import roc_curve, roc_auc_score
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

AUC (Area Under the Curve) gives a single number: 1.0 is perfect, 0.5 is random guessing. Anything above 0.8 is generally useful.
Regression Metrics
For continuous predictions (CPU forecasting, capacity planning):
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# Example: predicting CPU usage
y_actual = np.array([45, 82, 38, 91, 67])
y_predicted = np.array([48, 79, 42, 88, 70])
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))
mae = mean_absolute_error(y_actual, y_predicted)
r2 = r2_score(y_actual, y_predicted)
print(f"RMSE: {rmse:.2f}") # Root Mean Squared Error
print(f"MAE: {mae:.2f}") # Mean Absolute Error
print(f"R²: {r2:.3f}")   # Coefficient of determination

| Metric | Interpretation |
|---|---|
| RMSE | Average prediction error in the same units as the target. Lower is better. Penalises large errors. |
| MAE | Average absolute error. Less sensitive to outliers than RMSE. |
| R² | Proportion of variance explained. 1.0 is perfect, 0 means the model is no better than predicting the mean. |
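A tiny numeric check makes the RMSE-versus-MAE distinction concrete. The two prediction sets below (made-up numbers) have the same total absolute error, but one concentrates it in a single large miss:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_actual = np.array([50, 50, 50, 50])
candidates = {
    'steady': np.array([52, 48, 52, 48]),  # errors: 2, 2, 2, 2
    'spiky':  np.array([50, 50, 50, 58]),  # errors: 0, 0, 0, 8
}

results = {}
for name, y_pred in candidates.items():
    mae = mean_absolute_error(y_actual, y_pred)
    rmse = np.sqrt(mean_squared_error(y_actual, y_pred))
    results[name] = (mae, rmse)
    print(f"{name}: MAE={mae:.1f}, RMSE={rmse:.1f}")
# MAE is 2.0 in both cases; RMSE doubles for the spiky errors
```

Same MAE, very different RMSE: that is the "penalises large errors" behaviour in action.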
Overfitting and Underfitting
The most common failure mode in ML is overfitting: the model memorises the training data instead of learning generalisable patterns.
How to Detect It
from sklearn.tree import DecisionTreeClassifier
# Unrestricted tree β will overfit
model_overfit = DecisionTreeClassifier(random_state=42)
model_overfit.fit(X_train, y_train)
train_acc = model_overfit.score(X_train, y_train)
test_acc = model_overfit.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")

If training accuracy is significantly higher than test accuracy, the model is overfitting. A model that scores 0.99 on training data and 0.72 on test data has memorised the noise.
How to Prevent It
| Technique | How It Works |
|---|---|
| More data | The single most effective remedy. More examples make it harder to memorise. |
| Simpler model | Reduce tree depth, fewer neurons, fewer features. Less capacity to memorise. |
| Cross-validation | Evaluate on multiple folds to get a realistic performance estimate. |
| Regularisation | Add a penalty for model complexity (L1/L2 regularisation, dropout in neural networks). |
| Early stopping | Stop training when validation performance stops improving. |
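The "simpler model" row in action: an unrestricted tree compared against a depth-limited one, on synthetic data with deliberately noisy labels (flip_y=0.2 flips 20% of them; all parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small noisy dataset: easy to memorise, hard to generalise
X, y = make_classification(n_samples=200, flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_scores, gaps = {}, {}
for depth in (None, 3):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_scores[depth] = model.score(X_train, y_train)
    gaps[depth] = train_scores[depth] - model.score(X_test, y_test)
    print(f"max_depth={depth}: train-test gap = {gaps[depth]:.3f}")
```

The unrestricted tree fits the training set perfectly, noise included; capping the depth removes the capacity to memorise that noise, which typically shrinks the gap.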
Underfitting is the opposite: the model is too simple to capture the patterns. The fix is a more complex model or better features. You know you are underfitting when both training and test performance are poor.
Putting It All Together
Here is a complete pipeline, from raw data to evaluated model:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
# 1. Load and inspect
df = pd.read_csv('server_metrics.csv')
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")
# 2. Handle missing data
df = df.fillna(df.median(numeric_only=True))
# 3. Encode categoricals
df = pd.get_dummies(df, columns=['host', 'region'], dtype=int)
# 4. Split features and target
X = df.drop(columns=['failed'])
y = df['failed']
# 5. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 6. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # use train stats!
# 7. Train
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train_scaled, y_train)
# 8. Evaluate
print(classification_report(y_test, model.predict(X_test_scaled)))
# 9. Cross-validate
scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Note step 6: scaler.transform(X_test), not fit_transform. The scaler must be fitted on training data only. Fitting on the test set leaks information from the future.
Next
Part 3: Python ML Toolkit – setting up your environment with Pandas, NumPy, Scikit-learn, and TensorFlow, with practical patterns for ML workflows.
Videos:
- StatQuest – Cross Validation: a clear visual explanation of k-fold CV and why it matters.
- StatQuest – ROC and AUC: how ROC curves work and what AUC actually measures.
- StatQuest – Confusion Matrix: precision, recall, F1 demystified.
- 3Blue1Brown – Essence of Linear Algebra Ch.1: visual intuition for vectors and spaces, useful for understanding PCA.
Reading:
- Scikit-learn – Preprocessing data: official docs for scaling, encoding, and imputation.
- Scikit-learn – Model evaluation: complete reference for all metrics covered in this post.
Companion: [Maths for ML Part 1](/posts/why-maths-for-machine-learning/) covers why evaluation metrics work the way they do.
1. Clean a real dataset. Export a day’s worth of metrics from your monitoring system (Netdata, Prometheus, or even /proc stats). How many missing values are there? Which imputation method makes the most sense?
2. Feature correlation. Using the same dataset, compute correlations between your metrics. Which features are redundant? Which correlate most strongly with a metric you care about (e.g., response time)?
3. Overfit on purpose. Train a DecisionTreeClassifier with no depth limit on a small dataset (<100 rows). Compare train vs test accuracy. Then add max_depth=3 and see how the gap changes.
4. Evaluate properly. Pick any classification task from Part 1’s exercises. Implement a full pipeline: clean → split → train → evaluate with precision, recall, and F1. Is accuracy alone misleading for your dataset?