Data Pre-processing and Evaluation
- Clean and transform raw data into ML-ready features
- Understand feature selection and dimensionality reduction (PCA)
- Split data correctly into training, validation, and test sets
- Evaluate models using accuracy, precision, recall, F1, ROC curves, and RMSE
- Recognise and prevent overfitting
Why Pre-processing Matters
Raw data is never ready for machine learning. Logs have missing fields. Metrics arrive at irregular intervals. Categorical values like hostnames need encoding. Scales differ wildly: CPU percentage ranges from 0 to 100 while network throughput might be 0 to 10,000 Mbps.
If you feed raw data into a model, the model will either fail or learn the wrong things. Pre-processing is the bridge between your data as it exists and your data as the model needs it.
In practice, data preparation takes more time than model selection. Getting this right is what separates a model that works from one that looks good in a notebook but falls over in production.
The Pre-processing Pipeline
Every ML project follows roughly the same pipeline:
Raw Data → Clean → Transform → Select Features → Split → Train → Evaluate

Let's walk through each stage.
1. Handling Missing Data
Real-world data has gaps. A monitoring agent drops offline, a log field is optional, a sensor returns null. You have three options:
import pandas as pd
import numpy as np
# Simulated server metrics with gaps
df = pd.DataFrame({
    'cpu_avg': [45, 92, np.nan, 87, 38, 91],
    'mem_pct': [62, 88, 55, np.nan, 55, 80],
    'disk_io': [120, 450, 95, 520, np.nan, 480],
    'error_count': [3, 47, 1, 62, 1, 55],
    'failed': [0, 1, 0, 1, 0, 1]
})
print("Missing values per column:")
print(df.isnull().sum())

Option A: Drop rows. Simplest, but you lose data.
df_dropped = df.dropna()
print(f"Rows: {len(df)} → {len(df_dropped)}")

Option B: Fill with a statistic (mean, median, or mode). Preserves row count.
df_filled = df.fillna(df.median(numeric_only=True))

Option C: Forward/backward fill. Useful for time-series where the last known value is a reasonable estimate.
df_ffill = df.ffill()

Which to use depends on context. For time-series metrics, forward fill usually makes sense. For independent samples, median imputation is safer than the mean (outliers skew the mean).
2. Feature Encoding
ML models work with numbers. Categorical values need encoding.
Label encoding assigns integers. Use it when there is a natural order.
from sklearn.preprocessing import LabelEncoder
severities = ['info', 'warning', 'error', 'critical']
le = LabelEncoder()
encoded = le.fit_transform(severities)
print(dict(zip(severities, encoded)))
# {'info': 2, 'warning': 3, 'error': 1, 'critical': 0}

One-hot encoding creates binary columns. Use it when there is no natural order.
import pandas as pd
hosts = pd.DataFrame({'host': ['griffin', 'wolfhound', 'blackjack', 'griffin']})
encoded = pd.get_dummies(hosts, columns=['host'], dtype=int)
print(encoded)
# host_blackjack host_griffin host_wolfhound
# 0 0 1 0
# 1 0 0 1
# 2 1 0 0
# 3 0 1 0

The trap with label encoding is that the model may interpret the integers as having magnitude: it might think critical (0) < info (2), which is meaningless. One-hot avoids this but adds columns.
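If your categories do carry a real order, LabelEncoder is the wrong tool anyway, since it assigns integers alphabetically. A minimal sketch using an explicit mapping instead (the severity ordering and the `logs` frame are made up for illustration):

```python
import pandas as pd

# Severity levels in their true order, least to most severe
severity_order = ['info', 'warning', 'error', 'critical']
mapping = {level: rank for rank, level in enumerate(severity_order)}

logs = pd.DataFrame({'severity': ['info', 'critical', 'warning', 'error']})
logs['severity_rank'] = logs['severity'].map(mapping)
print(logs)
# info → 0, warning → 1, error → 2, critical → 3
```

Now the integers really do mean "higher is worse", so a model treating them as magnitudes is behaving sensibly.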
3. Feature Scaling
When features have different scales, algorithms that rely on distance (KNN, SVM, neural networks) will be dominated by the largest-scale feature.
Standardisation (z-score) centres data around 0 with unit variance. A good default.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_filled.drop(columns=['failed'])),
    columns=df_filled.drop(columns=['failed']).columns
)
print(df_scaled.describe().round(2))

Min-max normalisation scales to [0, 1]. Use it when you need bounded values.
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
df_normed = pd.DataFrame(
    minmax.fit_transform(df_filled.drop(columns=['failed'])),
    columns=df_filled.drop(columns=['failed']).columns
)

Rule of thumb: standardise unless you have a reason not to. Tree-based models (Decision Trees, Random Forest, XGBoost) are scale-invariant and do not need scaling.
Feature Selection
Not all features help. Some are redundant, some are noise, and some actively hurt performance by introducing irrelevant patterns the model tries to learn.
Correlation Analysis
Start by checking which features correlate with your target.
# Correlation with the target variable
correlations = df_filled.corr(numeric_only=True)['failed'].drop('failed')
print(correlations.sort_values(ascending=False))

Features with near-zero correlation to the target are candidates for removal. Features that are highly correlated with each other (multicollinearity) are redundant: keep one and drop the rest.
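Checking every pair by eye does not scale, so here is one way to flag multicollinear pairs automatically. The metrics and the 0.9 cut-off are invented for illustration; `load_1m` is built to shadow `cpu_avg`:

```python
import numpy as np
import pandas as pd

# Invented metrics: load_1m deliberately tracks cpu_avg, disk_io does not
df = pd.DataFrame({
    'cpu_avg': [45, 92, 60, 87, 38, 91],
    'load_1m': [1.2, 7.8, 2.1, 7.1, 0.9, 7.5],
    'disk_io': [300, 310, 120, 520, 450, 150],
})

# Absolute pairwise correlations; keep only the upper triangle so each
# pair is checked once and the diagonal is ignored
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # arbitrary cut-off for "redundant"
redundant = [(row, col)
             for col in upper.columns
             for row in upper.index
             if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold]
print(redundant)
```

For each flagged pair, keep whichever feature is cheaper to collect or easier to explain, and drop the other.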
Dimensionality Reduction with PCA
Principal Component Analysis compresses many features into fewer dimensions while preserving as much variance as possible. Useful when you have dozens or hundreds of metrics.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Reduce 4 features to 2 principal components
pca = PCA(n_components=2)
X = df_filled.drop(columns=['failed'])
X_pca = pca.fit_transform(StandardScaler().fit_transform(X))
print(f"Explained variance: {pca.explained_variance_ratio_.round(3)}")
print(f"Total variance retained: {sum(pca.explained_variance_ratio_):.1%}")
# Visualise
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df_filled['failed'], cmap='RdYlGn_r')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Servers in PCA space')
plt.colorbar(label='Failed')
plt.show()

PCA is a black box: the resulting components are linear combinations of the original features and lose interpretability. Use it when you need to reduce dimensionality, not when you need to explain which features matter.
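Rather than hard-coding n_components=2, you can pass PCA a float: scikit-learn then keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic data (two hidden factors driving ten noisy metrics; the shapes and noise scale are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# 200 samples of 10 metrics driven by 2 hidden factors plus a little noise
factors = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X = factors @ loadings + rng.normal(scale=0.05, size=(200, 10))

# Keep however many components are needed for 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(StandardScaler().fit_transform(X))

print(f"{X.shape[1]} features reduced to {pca.n_components_} components")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```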
Splitting Your Data
The cardinal rule: never evaluate a model on data it trained on.
Train / Test Split
The simplest approach. Hold back a portion of data for testing.
from sklearn.model_selection import train_test_split
X = df_filled.drop(columns=['failed'])
y = df_filled['failed']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

stratify=y ensures the class balance in the split matches the original. If 30% of your servers failed, both train and test will have ~30% failures.
Train / Validation / Test Split
For tuning hyperparameters, you need three sets:
- Training set: the model learns from this
- Validation set: used to tune hyperparameters and make design decisions
- Test set: touched only once, at the very end, to report final performance
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Second split: separate validation from training
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
# Result: 60% train, 20% validation, 20% test

Cross-Validation
When data is limited, holding out 40% is expensive. K-fold cross-validation uses all the data for both training and testing by rotating the held-out fold.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Fold scores: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

With 5-fold CV, each sample appears in the held-out fold exactly once. The mean score is a more reliable estimate than a single train/test split.
Evaluating Your Model
A model that is 95% accurate sounds great, until you realise 95% of your servers are healthy and the model just predicts "healthy" every time. Evaluation metrics need to match your problem.
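scikit-learn's DummyClassifier makes that trap easy to demonstrate: it ignores the features and, with strategy='most_frequent', always predicts the majority class. The 95/5 split below is invented for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 95 healthy servers (0), 5 failed (1); the features don't matter here
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")  # 0.95 - looks great
print(f"Recall:   {recall_score(y, y_pred):.2f}")    # 0.00 - catches no failures
```

Any real model should beat this baseline before you celebrate its accuracy.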
Classification Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report, confusion_matrix
)
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['healthy', 'failed']))

| Metric | What It Measures | When It Matters |
|---|---|---|
| Accuracy | Correct predictions / total predictions | Balanced classes only |
| Precision | Of predicted positives, how many were correct? | When false positives are costly (unnecessary alerts) |
| Recall | Of actual positives, how many did we catch? | When false negatives are costly (missed failures) |
| F1 Score | Harmonic mean of precision and recall | When you need to balance both |
For infrastructure monitoring, recall usually matters more: missing a server failure (false negative) is worse than a false alarm (false positive).
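One practical lever for that trade-off is the classification threshold: predict() effectively cuts predict_proba at 0.5, but you can cut lower to catch more failures at the price of more false alarms. A sketch on synthetic imbalanced data (logistic regression is used here because it produces smooth probabilities; all numbers are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with ~10% positives, standing in for server failures
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Lowering the threshold trades precision for recall
for threshold in (0.5, 0.3, 0.1):
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"recall={recall_score(y_test, y_pred):.2f}, "
          f"precision={precision_score(y_test, y_pred):.2f}")
```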
The Confusion Matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
            xticklabels=['healthy', 'failed'],
            yticklabels=['healthy', 'failed'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

The confusion matrix shows you exactly where the model is wrong. The off-diagonal cells are your errors: top-right is false positives, bottom-left is false negatives.
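For a binary problem the four counts can also be unpacked directly: confusion_matrix lays the matrix out as [[TN, FP], [FN, TP]], so ravel() yields them in that order. Toy labels for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Binary layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Precision and recall fall straight out of the counts
print(f"Recall:    {tp / (tp + fn):.2f}")
print(f"Precision: {tp / (tp + fp):.2f}")
```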
ROC Curves
The Receiver Operating Characteristic curve plots true positive rate against false positive rate at every classification threshold.
from sklearn.metrics import roc_curve, roc_auc_score
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

AUC (Area Under the Curve) gives a single number: 1.0 is perfect, 0.5 is random guessing. Anything above 0.8 is generally useful.
Regression Metrics
For continuous predictions (CPU forecasting, capacity planning):
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# Example: predicting CPU usage
y_actual = np.array([45, 82, 38, 91, 67])
y_predicted = np.array([48, 79, 42, 88, 70])
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))
mae = mean_absolute_error(y_actual, y_predicted)
r2 = r2_score(y_actual, y_predicted)
print(f"RMSE: {rmse:.2f}") # Root Mean Squared Error
print(f"MAE: {mae:.2f}") # Mean Absolute Error
print(f"R²: {r2:.3f}")   # Coefficient of determination

| Metric | Interpretation |
|---|---|
| RMSE | Average prediction error in the same units as the target. Lower is better. Penalises large errors. |
| MAE | Average absolute error. Less sensitive to outliers than RMSE. |
| R² | Proportion of variance explained. 1.0 is perfect, 0 means the model is no better than predicting the mean. |
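A tiny numeric check makes the RMSE-versus-MAE distinction concrete. The two prediction sets below (made-up numbers) have the same total absolute error, but one concentrates it in a single large miss:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_actual = np.array([50, 50, 50, 50])
candidates = {
    'steady': np.array([52, 48, 52, 48]),  # errors: 2, 2, 2, 2
    'spiky':  np.array([50, 50, 50, 58]),  # errors: 0, 0, 0, 8
}

results = {}
for name, y_pred in candidates.items():
    mae = mean_absolute_error(y_actual, y_pred)
    rmse = np.sqrt(mean_squared_error(y_actual, y_pred))
    results[name] = (mae, rmse)
    print(f"{name}: MAE={mae:.1f}, RMSE={rmse:.1f}")
# MAE is 2.0 in both cases; RMSE doubles for the spiky errors
```

Same MAE, very different RMSE: that is the "penalises large errors" behaviour in action.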
Overfitting and Underfitting
The most common failure mode in ML is overfitting: the model memorises the training data instead of learning generalisable patterns.
How to Detect It
from sklearn.tree import DecisionTreeClassifier
# Unrestricted tree β will overfit
model_overfit = DecisionTreeClassifier(random_state=42)
model_overfit.fit(X_train, y_train)
train_acc = model_overfit.score(X_train, y_train)
test_acc = model_overfit.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")

If training accuracy is significantly higher than test accuracy, the model is overfitting. A model that scores 0.99 on training data and 0.72 on test data has memorised the noise.
How to Prevent It
| Technique | How It Works |
|---|---|
| More data | The single most effective remedy. More examples make it harder to memorise. |
| Simpler model | Reduce tree depth, fewer neurons, fewer features. Less capacity to memorise. |
| Cross-validation | Evaluate on multiple folds to get a realistic performance estimate. |
| Regularisation | Add a penalty for model complexity (L1/L2 regularisation, dropout in neural networks). |
| Early stopping | Stop training when validation performance stops improving. |
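The "simpler model" row in action: an unrestricted tree compared against a depth-limited one, on synthetic data with deliberately noisy labels (flip_y=0.2 flips 20% of them; all parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small noisy dataset: easy to memorise, hard to generalise
X, y = make_classification(n_samples=200, flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_scores, gaps = {}, {}
for depth in (None, 3):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_scores[depth] = model.score(X_train, y_train)
    gaps[depth] = train_scores[depth] - model.score(X_test, y_test)
    print(f"max_depth={depth}: train-test gap = {gaps[depth]:.3f}")
```

The unrestricted tree fits the training set perfectly, noise included; capping the depth removes the capacity to memorise that noise, which typically shrinks the gap.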
Underfitting is the opposite: the model is too simple to capture the patterns. The fix is a more complex model or better features. You know you are underfitting when both training and test performance are poor.
Putting It All Together
Here is a complete pipeline, from raw data to evaluated model:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
# 1. Load and inspect
df = pd.read_csv('server_metrics.csv')
print(f"Shape: {df.shape}")
print(f"Missing values:\n{df.isnull().sum()}")
# 2. Handle missing data
df = df.fillna(df.median(numeric_only=True))
# 3. Encode categoricals
df = pd.get_dummies(df, columns=['host', 'region'], dtype=int)
# 4. Split features and target
X = df.drop(columns=['failed'])
y = df['failed']
# 5. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 6. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # use train stats!
# 7. Train
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train_scaled, y_train)
# 8. Evaluate
print(classification_report(y_test, model.predict(X_test_scaled)))
# 9. Cross-validate
scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Note step 6: scaler.transform(X_test), not fit_transform. The scaler must be fitted on training data only. Fitting on the test set leaks information from the future.
Next
Part 3: Python ML Toolkit – setting up your environment with Pandas, NumPy, Scikit-learn, and TensorFlow, with practical patterns for ML workflows.
Videos:
- StatQuest – Cross Validation: a clear visual explanation of k-fold CV and why it matters.
- StatQuest – ROC and AUC: how ROC curves work and what AUC actually measures.
- StatQuest – Confusion Matrix: precision, recall, F1 demystified.
- 3Blue1Brown – Essence of Linear Algebra Ch.1: visual intuition for vectors and spaces, useful for understanding PCA.
Reading:
- Scikit-learn – Preprocessing data: official docs for scaling, encoding, and imputation.
- Scikit-learn – Model evaluation: complete reference for all metrics covered in this post.
Companion: [Maths for ML Part 1](/posts/why-maths-for-machine-learning/) covers why evaluation metrics work the way they do.
1. Clean a real dataset. Export a day’s worth of metrics from your monitoring system (Netdata, Prometheus, or even /proc stats). How many missing values are there? Which imputation method makes the most sense?
2. Feature correlation. Using the same dataset, compute correlations between your metrics. Which features are redundant? Which correlate most strongly with a metric you care about (e.g., response time)?
3. Overfit on purpose. Train a DecisionTreeClassifier with no depth limit on a small dataset (<100 rows). Compare train vs test accuracy. Then add max_depth=3 and see how the gap changes.
4. Evaluate properly. Pick any classification task from Part 1’s exercises. Implement a full pipeline: clean → split → train → evaluate with precision, recall, and F1. Is accuracy alone misleading for your dataset?