Python ML Toolkit


AI Disclosure: This post was written by Claude Opus 4.6. References to “I” refer to the AI author, not the site owner.

AI edit history
Date       | Model           | Action
2026-03-25 | Claude Opus 4.6 | authored
🎯 What You Will Learn
  • Set up an isolated Python environment for ML work
  • Load and manipulate data with Pandas
  • Perform numerical operations with NumPy
  • Visualise data with Matplotlib and Seaborn
  • Use the Scikit-learn estimator API: fit, predict, transform, and pipelines
  • Build a basic neural network with TensorFlow/Keras
  • Tie all libraries together in a complete end-to-end workflow
📋 Prerequisites
  • Part 1: What Is Machine Learning? (/posts/what-is-machine-learning/) — the three learning paradigms and key terminology.
  • Part 2: Data Pre-processing and Evaluation (/posts/data-preprocessing-and-evaluation/) — cleaning data, splitting datasets, and evaluation metrics.

Why a Dedicated Toolkit?

Machine learning workflows involve loading data, transforming it, visualising patterns, training models, and evaluating results. You could write all of this from scratch, but the Python ML ecosystem has standardised libraries that handle each stage. Knowing the right tool for each job β€” and how they fit together β€” is what makes the difference between fighting your toolchain and actually getting work done.

This post covers the six libraries you will use constantly:

Library          | Role
Pandas           | Data loading, cleaning, and manipulation
NumPy            | Numerical arrays and linear algebra
Matplotlib       | Low-level plotting
Seaborn          | Statistical visualisation built on Matplotlib
Scikit-learn     | Classical ML algorithms, pipelines, and evaluation
TensorFlow/Keras | Neural networks and deep learning

Setting Up Your Environment

Never install ML libraries into your system Python. Use a virtual environment so each project has its own dependencies.

# Create a project directory
mkdir ~/ml-lab && cd ~/ml-lab

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the core stack
pip install pandas numpy matplotlib seaborn scikit-learn

# TensorFlow is large β€” install only when you need it
pip install tensorflow

# Lock your dependencies
pip freeze > requirements.txt

Verify everything works:

import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import sklearn
import tensorflow as tf

print(f"Pandas:       {pd.__version__}")
print(f"NumPy:        {np.__version__}")
print(f"Matplotlib:   {matplotlib.__version__}")
print(f"Seaborn:      {sns.__version__}")
print(f"Scikit-learn: {sklearn.__version__}")
print(f"TensorFlow:   {tf.__version__}")

If you are working on a remote server without a display, set the Matplotlib backend before importing pyplot:

import matplotlib
matplotlib.use('Agg')  # non-interactive backend β€” saves to file instead of displaying
import matplotlib.pyplot as plt

Pandas: Data Loading and Manipulation

Pandas is how you get data in, inspect it, clean it, and reshape it. The core object is the DataFrame β€” a table with labelled rows and columns.

Loading Data

import pandas as pd

# CSV β€” the most common format for ML datasets
df = pd.read_csv('server_metrics.csv')

# Other formats you will encounter
df_json = pd.read_json('alerts.json')
df_parquet = pd.read_parquet('metrics.parquet')  # fast columnar format

# Quick inspection
print(df.shape)            # (rows, columns)
print(df.dtypes)           # column types
print(df.describe())       # summary statistics
df.head()                  # first 5 rows

Core DataFrame Operations

# Simulated fleet metrics
df = pd.DataFrame({
    'host': ['web-01', 'web-02', 'db-01', 'web-01', 'db-01', 'web-02'],
    'timestamp': pd.to_datetime([
        '2026-03-24 10:00', '2026-03-24 10:00', '2026-03-24 10:00',
        '2026-03-24 10:05', '2026-03-24 10:05', '2026-03-24 10:05',
    ]),
    'cpu_pct': [45.2, 62.1, 28.5, 48.9, 31.2, 58.7],
    'mem_pct': [72.0, 55.3, 88.1, 73.5, 89.2, 56.0],
    'req_per_sec': [1200, 980, 50, 1350, 55, 1020],
    'error_count': [3, 0, 1, 7, 0, 2],
})

# Filtering
high_cpu = df[df['cpu_pct'] > 50]
web_servers = df[df['host'].str.startswith('web-')]

# Adding derived features
df['error_rate'] = df['error_count'] / df['req_per_sec']
df['is_overloaded'] = ((df['cpu_pct'] > 60) | (df['mem_pct'] > 85)).astype(int)

# Sorting
df_sorted = df.sort_values(['host', 'timestamp'])

Grouping and Aggregation

This is where Pandas shines for infrastructure data β€” summarising metrics per host, per time window, per region.

# Per-host statistics
host_summary = df.groupby('host').agg({
    'cpu_pct': ['mean', 'max'],
    'mem_pct': ['mean', 'max'],
    'error_count': 'sum',
    'req_per_sec': 'mean',
}).round(2)

print(host_summary)

# Resample the time-series into fixed 5-minute buckets per host
df = df.set_index('timestamp')
five_min_avg = df.groupby('host').resample('5min').mean(numeric_only=True)

Merging Datasets

In practice you often need to join metrics with metadata β€” host inventory, deployment records, incident logs.

# Host metadata
inventory = pd.DataFrame({
    'host': ['web-01', 'web-02', 'db-01'],
    'role': ['webserver', 'webserver', 'database'],
    'region': ['us-east-1', 'eu-west-1', 'us-east-1'],
    'instance_type': ['c5.xlarge', 'c5.xlarge', 'r5.2xlarge'],
})

# Join metrics with inventory
df_reset = df.reset_index()
enriched = df_reset.merge(inventory, on='host', how='left')
print(enriched[['host', 'cpu_pct', 'role', 'region']].head())
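One habit worth adopting when joining: pass indicator=True so Pandas reports which rows actually matched. A small sketch with a made-up cache-01 host that is missing from the inventory:

```python
import pandas as pd

metrics = pd.DataFrame({
    'host': ['web-01', 'web-02', 'cache-01'],
    'cpu_pct': [45.2, 62.1, 33.0],
})
inventory = pd.DataFrame({
    'host': ['web-01', 'web-02', 'db-01'],
    'role': ['webserver', 'webserver', 'database'],
})

# indicator=True adds a _merge column recording where each row came from
checked = metrics.merge(inventory, on='host', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['host'].tolist())  # → ['cache-01']
```

A left join silently produces NaNs for unmatched keys; checking for 'left_only' rows catches hosts that have dropped out of your inventory before they corrupt downstream features.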

NumPy: Numerical Operations

NumPy is the foundation everything else is built on. Pandas DataFrames wrap NumPy arrays. Scikit-learn expects NumPy arrays as input. Understanding NumPy means understanding how your data is actually stored and processed.
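You can see this relationship directly: to_numpy() exposes the array a DataFrame wraps (toy values here for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'cpu_pct': [45.2, 62.1], 'mem_pct': [72.0, 55.3]})

# .to_numpy() returns the underlying array — this is what Scikit-learn sees
arr = df.to_numpy()
print(type(arr).__name__, arr.dtype, arr.shape)  # ndarray float64 (2, 2)
```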

Arrays and Basic Operations

import numpy as np

# Create arrays from monitoring data
cpu_readings = np.array([45.2, 62.1, 28.5, 48.9, 31.2, 58.7])
mem_readings = np.array([72.0, 55.3, 88.1, 73.5, 89.2, 56.0])

# Vectorised operations β€” no loops needed
print(f"Mean CPU:   {cpu_readings.mean():.1f}%")
print(f"Std CPU:    {cpu_readings.std():.1f}%")
print(f"Max memory: {mem_readings.max():.1f}%")

# Element-wise operations
combined_load = (cpu_readings + mem_readings) / 2
above_threshold = cpu_readings[cpu_readings > 50]
print(f"Hosts above 50% CPU: {len(above_threshold)}")

Broadcasting

Broadcasting is NumPy’s rule for operating on arrays of different shapes. It eliminates explicit loops and is critical for performance.

# Normalise metrics to [0, 1] using broadcasting
# Each column gets its own min and max
metrics = np.array([
    [45.2, 72.0, 1200],  # cpu, mem, req/s
    [62.1, 55.3, 980],
    [28.5, 88.1, 50],
    [48.9, 73.5, 1350],
])

col_min = metrics.min(axis=0)   # min per column β†’ shape (3,)
col_max = metrics.max(axis=0)   # max per column β†’ shape (3,)

# This works because NumPy broadcasts (4,3) with (3,)
normalised = (metrics - col_min) / (col_max - col_min)
print(normalised.round(3))

Linear Algebra Basics

You do not need a maths degree, but some operations come up repeatedly in ML.

# Dot product β€” the core operation in linear models and neural networks
weights = np.array([0.3, 0.5, 0.2])  # learned feature importance
features = np.array([85.0, 72.0, 450])  # one server's metrics

score = np.dot(weights, features)
print(f"Risk score: {score:.1f}")

# Matrix multiplication β€” batch predictions
fleet_metrics = np.array([
    [85.0, 72.0, 450],
    [32.0, 45.0, 120],
    [91.0, 88.0, 520],
])

scores = fleet_metrics @ weights  # @ is the matrix multiply operator
print(f"Fleet risk scores: {scores.round(1)}")

# Useful operations
print(f"Determinant: {np.linalg.det(np.eye(3))}")
print(f"Matrix rank: {np.linalg.matrix_rank(fleet_metrics)}")
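Another operation that shows up repeatedly is the vector norm — for instance, cosine similarity between two servers' metric vectors (illustrative numbers, not a standard risk metric):

```python
import numpy as np

a = np.array([85.0, 72.0, 450])  # one server's metrics
b = np.array([91.0, 88.0, 520])  # another server's metrics

# Cosine similarity: dot product scaled by the vectors' lengths
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Similarity: {cos_sim:.3f}")
```

Values near 1 mean the two servers have nearly proportional metric profiles, regardless of overall magnitude.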

Matplotlib and Seaborn: Visualisation

Visualisation is not optional in ML. You need to see distributions before choosing algorithms, check for outliers before training, and plot learning curves to diagnose problems.

Distribution Plots

Before training any model, understand what your data looks like.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Simulated fleet CPU readings (200 hosts)
np.random.seed(42)
fleet_cpu = pd.DataFrame({
    'cpu_pct': np.concatenate([
        np.random.normal(40, 10, 150),   # normal hosts
        np.random.normal(85, 5, 50),     # overloaded hosts
    ]),
    'role': ['webserver'] * 150 + ['batch-worker'] * 50,
})

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram with KDE
sns.histplot(data=fleet_cpu, x='cpu_pct', hue='role', kde=True, ax=axes[0])
axes[0].set_title('CPU Distribution by Role')

# Box plot β€” shows median, quartiles, outliers
sns.boxplot(data=fleet_cpu, x='role', y='cpu_pct', ax=axes[1])
axes[1].set_title('CPU by Role')

plt.tight_layout()
plt.savefig('cpu_distribution.png', dpi=150)
plt.show()

The bimodal distribution above immediately tells you that a single threshold will not work β€” you have two distinct populations.
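One way to act on that observation is to derive a separate alert threshold per population. A sketch using the same simulated data, with mean + 3σ as an illustrative threshold rule:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
fleet_cpu = pd.DataFrame({
    'cpu_pct': np.concatenate([
        np.random.normal(40, 10, 150),   # normal hosts
        np.random.normal(85, 5, 50),     # overloaded hosts
    ]),
    'role': ['webserver'] * 150 + ['batch-worker'] * 50,
})

# One alert threshold per population: mean + 3 standard deviations
thresholds = fleet_cpu.groupby('role')['cpu_pct'].agg(lambda s: s.mean() + 3 * s.std())
print(thresholds.round(1))
```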

Correlation Heatmaps

Spot redundant features and strong predictors before training.

np.random.seed(42)
n = 200

metrics_df = pd.DataFrame({
    'cpu_pct': np.random.normal(55, 15, n),
    'mem_pct': np.random.normal(65, 12, n),
    'disk_io_mbps': np.random.normal(150, 50, n),
    'net_mbps': np.random.normal(200, 80, n),
    'error_count': np.random.poisson(5, n),
    'response_ms': np.random.normal(120, 30, n),
})

# Make some features correlated (realistic)
metrics_df['load_avg'] = metrics_df['cpu_pct'] * 0.04 + np.random.normal(0, 0.1, n)
metrics_df['response_ms'] = metrics_df['cpu_pct'] * 0.8 + np.random.normal(80, 10, n)

plt.figure(figsize=(8, 6))
sns.heatmap(
    metrics_df.corr(numeric_only=True),
    annot=True, fmt='.2f', cmap='coolwarm', center=0,
    square=True, linewidths=0.5,
)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
plt.show()

High correlation between cpu_pct and load_avg (>0.9) means they carry nearly the same information β€” drop one.
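You can automate that check: scan the upper triangle of the correlation matrix and flag any feature whose correlation with an earlier one exceeds a cutoff. A sketch on synthetic data built the same way as above (the 0.9 cutoff is a common rule of thumb, not a hard rule):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 200
df = pd.DataFrame({'cpu_pct': np.random.normal(55, 15, n)})
df['load_avg'] = df['cpu_pct'] * 0.04 + np.random.normal(0, 0.1, n)  # near-duplicate
df['mem_pct'] = np.random.normal(65, 12, n)                          # independent

# Keep only the upper triangle so each pair is checked once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(redundant)  # → ['load_avg']
```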

Learning Curves

Learning curves tell you whether your model needs more data or a different architecture.

from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Assuming X_train, y_train exist from your pipeline
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=42),
    X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy',
)

plt.figure(figsize=(8, 5))
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.fill_between(train_sizes,
                 val_scores.mean(axis=1) - val_scores.std(axis=1),
                 val_scores.mean(axis=1) + val_scores.std(axis=1),
                 alpha=0.2)
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('learning_curve.png', dpi=150)
plt.show()

If training and validation scores converge at a high value, the model is good. If there is a large gap, the model is overfitting. If both are low, it is underfitting.
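Those three diagnoses can be wrapped in a rough helper. The cutoff values below are illustrative defaults, not standards — tune them to your problem:

```python
def diagnose(train_score, val_score, gap_tol=0.1, low_tol=0.7):
    """Rough learning-curve diagnosis from final train/validation scores."""
    if train_score - val_score > gap_tol:
        return 'overfitting'       # large train/validation gap
    if train_score < low_tol and val_score < low_tol:
        return 'underfitting'      # both scores low
    return 'good fit'              # converged at a high value

print(diagnose(0.98, 0.75))  # overfitting
print(diagnose(0.62, 0.60))  # underfitting
print(diagnose(0.88, 0.85))  # good fit
```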

Scikit-learn: API Patterns

Scikit-learn has the most consistent API in the Python ecosystem. Once you learn the pattern, every algorithm works the same way.

The Estimator Pattern

Every model in Scikit-learn follows three methods:

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Every estimator works the same way:
# 1. Instantiate with hyperparameters
model = DecisionTreeClassifier(max_depth=5, random_state=42)

# 2. Fit on training data
model.fit(X_train, y_train)

# 3. Predict on new data
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

# Swap the algorithm β€” same API
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

This consistency means you can swap algorithms in a single line. The preprocessing, evaluation, and pipeline code stays identical.

Transformers

Transformers follow a parallel pattern: fit learns parameters, transform applies them.

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA

# Scaler β€” learns mean and std from training data
scaler = StandardScaler()
scaler.fit(X_train)                # learn parameters
X_train_scaled = scaler.transform(X_train)  # apply
X_test_scaled = scaler.transform(X_test)    # same parameters!

# PCA β€” learns principal components
pca = PCA(n_components=5)
pca.fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

The critical rule: fit on training data only, then transform both train and test. Calling fit_transform on test data leaks future information into your model.
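To see why this matters, compare the parameters a scaler learns from training versus test data when the test distribution has drifted (synthetic data; the drift is exaggerated for clarity):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(50, 10, size=(100, 1))
X_test = rng.normal(70, 10, size=(20, 1))   # test data has drifted upward

scaler = StandardScaler().fit(X_train)      # correct: training statistics only
leaky = StandardScaler().fit(X_test)        # wrong: test statistics leak in

print(scaler.mean_, leaky.mean_)  # different means → different transforms
```

The leaky scaler recentres the test set onto itself, hiding exactly the drift your model needed to see.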

Pipelines

Pipelines chain transformers and estimators together so the entire workflow is a single object. This prevents data leakage and makes deployment cleaner.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

# One call does everything: scale β†’ PCA β†’ train
pipeline.fit(X_train, y_train)

# One call for inference: scale β†’ PCA β†’ predict
predictions = pipeline.predict(X_test)
score = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {score:.3f}")

With a pipeline, you cannot accidentally forget to scale the test data or apply PCA in the wrong order. The pipeline handles it.
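Because a pipeline is itself an estimator, it also drops straight into cross_val_score, which refits the scaler inside every fold — so no fold's validation data ever influences scaling. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# The scaler is refit on each fold's training split — no leakage
scores = cross_val_score(pipe, X, y, cv=5)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```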

Column Transformers

Real data has mixed types. ColumnTransformer lets you apply different preprocessing to different columns.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Define which columns get which treatment
numeric_features = ['cpu_pct', 'mem_pct', 'disk_io', 'req_per_sec']
categorical_features = ['role', 'region']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features),
])

# Full pipeline with mixed preprocessing
full_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

full_pipeline.fit(X_train, y_train)
print(f"Score: {full_pipeline.score(X_test, y_test):.3f}")

TensorFlow/Keras: Neural Networks

For most tabular data problems, Scikit-learn is the right tool. But when you need neural networks β€” for complex patterns, large datasets, or deep learning β€” TensorFlow with the Keras API is the standard.

The Sequential Pattern

Keras follows the same conceptual flow as Scikit-learn: build, compile, fit, predict.

import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Assume X, y are ready (NumPy arrays or Pandas DataFrames)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build the model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation='sigmoid'),  # binary classification
])

# Compile β€” define loss function, optimiser, and metrics
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy'],
)

# Fit β€” train the network
history = model.fit(
    X_train_scaled, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=1,
)

The key differences from Scikit-learn:

  • You define the architecture explicitly (layers, neurons, activations)

  • You choose the loss function and optimiser at compile time

  • Training runs for multiple epochs, and you can watch the loss decrease

  • validation_split holds back data during training to monitor overfitting
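The layer sizes above also fix the parameter count, which you can sanity-check by hand: a Dense layer has inputs × units weights plus one bias per unit, and Dropout layers add nothing. Assuming, for illustration, 10 input features:

```python
# Dense layer parameters = inputs * units + units (one bias per unit).
# 10 input features is an assumption for this example.
n_features = 10
layer1 = n_features * 64 + 64   # 704
layer2 = 64 * 32 + 32           # 2080
layer3 = 32 * 1 + 1             # 33
total = layer1 + layer2 + layer3
print(total)  # 2817 — what model.summary() would report for this architecture
```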

Plotting Training History

The history object contains loss and metric values per epoch β€” essential for diagnosing training problems.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Loss
axes[0].plot(history.history['loss'], label='Training')
axes[0].plot(history.history['val_loss'], label='Validation')
axes[0].set_title('Loss')
axes[0].set_xlabel('Epoch')
axes[0].legend()

# Accuracy
axes[1].plot(history.history['accuracy'], label='Training')
axes[1].plot(history.history['val_accuracy'], label='Validation')
axes[1].set_title('Accuracy')
axes[1].set_xlabel('Epoch')
axes[1].legend()

plt.tight_layout()
plt.savefig('training_history.png', dpi=150)
plt.show()

If validation loss starts increasing while training loss keeps dropping, the network is overfitting. Add more dropout, reduce layers, or use early stopping:

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True
)

model.fit(
    X_train_scaled, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stop],
)

Evaluation

# Evaluate on test set
test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

# Get predictions for detailed metrics
y_prob = model.predict(X_test_scaled).flatten()
y_pred = (y_prob > 0.5).astype(int)

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Note that Scikit-learn’s evaluation functions work on TensorFlow predictions. The ecosystems interoperate.
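For instance, a confusion matrix and AUC computed directly on thresholded network outputs — the arrays here are stand-ins for real model.predict() results and test labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Stand-ins for model.predict() output and the true labels
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])
y_test = np.array([1, 0, 1, 1, 1, 0])

y_pred = (y_prob > 0.5).astype(int)
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
print(cm)   # [[2 0]
            #  [1 3]]
print(auc)  # 1.0 — every positive outranks every negative
```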

Complete Workflow: Fleet Health Prediction

Here is everything tied together β€” a realistic end-to-end workflow predicting which servers are likely to need intervention within the next hour, based on current metrics.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# ── 1. Generate synthetic fleet data ──────────────────────────
np.random.seed(42)
n_samples = 1000

data = pd.DataFrame({
    'cpu_pct': np.random.normal(55, 20, n_samples).clip(0, 100),
    'mem_pct': np.random.normal(60, 15, n_samples).clip(0, 100),
    'disk_io_mbps': np.random.exponential(100, n_samples),
    'net_mbps': np.random.normal(200, 80, n_samples).clip(0, None),
    'error_count': np.random.poisson(3, n_samples),
    'uptime_hours': np.random.exponential(500, n_samples),
    'role': np.random.choice(['webserver', 'database', 'cache', 'worker'], n_samples),
    'region': np.random.choice(['us-east-1', 'eu-west-1', 'ap-southeast-1'], n_samples),
})

# Create a realistic target: servers with high CPU + high errors are at risk
risk_score = (
    0.4 * (data['cpu_pct'] / 100) +
    0.3 * (data['mem_pct'] / 100) +
    0.2 * (data['error_count'] / data['error_count'].max()) +
    0.1 * np.random.normal(0, 0.1, n_samples)
)
data['needs_intervention'] = (risk_score > 0.55).astype(int)
print(f"Dataset shape: {data.shape}")
print(f"Class balance:\n{data['needs_intervention'].value_counts(normalize=True).round(3)}")

# ── 2. Explore ────────────────────────────────────────────────
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

sns.histplot(data=data, x='cpu_pct', hue='needs_intervention', kde=True, ax=axes[0, 0])
axes[0, 0].set_title('CPU Distribution')

sns.histplot(data=data, x='mem_pct', hue='needs_intervention', kde=True, ax=axes[0, 1])
axes[0, 1].set_title('Memory Distribution')

sns.boxplot(data=data, x='role', y='error_count', ax=axes[1, 0])
axes[1, 0].set_title('Errors by Role')

numeric_cols = ['cpu_pct', 'mem_pct', 'disk_io_mbps', 'net_mbps', 'error_count']
sns.heatmap(data[numeric_cols].corr(), annot=True, fmt='.2f', cmap='coolwarm',
            ax=axes[1, 1], square=True)
axes[1, 1].set_title('Correlations')

plt.tight_layout()
plt.savefig('fleet_exploration.png', dpi=150)
plt.show()

# ── 3. Prepare features ──────────────────────────────────────
X = data.drop(columns=['needs_intervention'])
y = data['needs_intervention']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ── 4. Build pipeline ────────────────────────────────────────
numeric_features = ['cpu_pct', 'mem_pct', 'disk_io_mbps', 'net_mbps',
                    'error_count', 'uptime_hours']
categorical_features = ['role', 'region']

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features),
])

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=200, max_depth=10, random_state=42
    )),
])

# ── 5. Cross-validate ────────────────────────────────────────
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
print(f"\nCV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# ── 6. Train and evaluate ────────────────────────────────────
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

print("\nClassification Report:")
print(classification_report(y_test, y_pred,
                            target_names=['healthy', 'needs intervention']))

auc = roc_auc_score(y_test, y_prob)
print(f"ROC AUC: {auc:.3f}")

# ── 7. ROC curve ──────────────────────────────────────────────
fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f'Random Forest (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve β€” Fleet Health Prediction')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
plt.show()

# ── 8. Feature importance ────────────────────────────────────
feature_names = (
    numeric_features +
    list(pipeline.named_steps['preprocessing']
         .transformers_[1][1]
         .get_feature_names_out(categorical_features))
)

importances = pipeline.named_steps['classifier'].feature_importances_
feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=True)

plt.figure(figsize=(8, 5))
feat_imp.plot(kind='barh')
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)
plt.show()

This workflow covers every stage: synthetic data generation (standing in for your real metrics export), exploration, pipeline construction, cross-validation, training, evaluation, and interpretation. The feature importance plot at the end tells you which metrics actually drive the prediction β€” useful both for model understanding and for deciding what to monitor more closely.

Cheat Sheet

Quick reference for the patterns you will use most:

Task                | Code
Load CSV            | pd.read_csv('file.csv')
Summary stats       | df.describe() / df.info()
Group and aggregate | df.groupby('col').agg({'metric': 'mean'})
Merge datasets      | df.merge(other, on='key', how='left')
Create array        | np.array([1, 2, 3])
Matrix multiply     | A @ B or np.dot(A, B)
Quick plot          | plt.plot(x, y); plt.show()
Heatmap             | sns.heatmap(df.corr(), annot=True)
Train model         | model.fit(X_train, y_train)
Predict             | model.predict(X_test)
Pipeline            | Pipeline([('scaler', StandardScaler()), ('model', RFC())])
Cross-validate      | cross_val_score(model, X, y, cv=5)

📚 Resources

Videos:

  • Keith Galli β€” Pandas Tutorial β€” comprehensive practical walkthrough of Pandas.
  • freeCodeCamp β€” NumPy Full Course β€” covers arrays, broadcasting, and linear algebra ops.
  • Sentdex β€” Matplotlib Tutorial β€” practical plotting for data science.
  • StatQuest β€” Scikit-learn Tutorial β€” the fit/predict/transform pattern explained clearly.

Documentation:

  • Pandas β€” 10 Minutes to Pandas β€” official quick-start.
  • Scikit-learn β€” Getting Started β€” the estimator API explained.
  • TensorFlow β€” Quickstart for beginners β€” build your first model in minutes.
🔬 Try It Yourself

1. Build your environment. Create a virtual environment, install the stack, and verify all imports work. Export requirements.txt.

2. Load real data. Export metrics from your monitoring system (Prometheus, Netdata, CloudWatch β€” anything). Load it into a Pandas DataFrame. How many rows? How many missing values? What do the distributions look like?

3. Explore with plots. Create a correlation heatmap of your metrics. Which features are redundant? Create distribution plots β€” are any features bimodal or heavily skewed?

4. End-to-end pipeline. Using the fleet health workflow as a template, build a pipeline that: loads your data, preprocesses it (handle missing values, encode categoricals, scale numerics), trains a Random Forest, and evaluates with precision/recall/F1. What is your ROC AUC?

5. Try TensorFlow. Replace the Random Forest in exercise 4 with a Keras Sequential model. Compare the results. Is the neural network better for your dataset? (For tabular data, it usually is not β€” but knowing the pattern matters.)

Next

Part 4: Classification Algorithms β€” KNN, Decision Trees, Naive Bayes, Logistic Regression, and SVM. How each algorithm works, when to use it, and how to choose between them.

