There’s a comforting myth that machine learning is about choosing the cleverest algorithm. In practice, the algorithm is the easy part. The discipline lives in the preparation — and a repeatable sequence beats raw cleverness every time. Here are the fifteen steps, grouped into four phases, exactly as I run them.

The four phases at a glance
  • Prepare — understand, clean, and treat the data before anything else
  • Engineer — encode, reduce, and split deliberately
  • Model — run templates, compare on the right metrics
  • Ship — validate on real data, then keep the model honest

P1: Prepare the data

1

Exploratory Data Analysis

Check variable types and shape. Plot histograms, box plots, and a correlation matrix. Lean on Tableau or Power BI for deeper visual analysis before you assume anything.

types & shape · histogram · box plot · correlation

2

Correlation & multicollinearity

Find variables that move together and drop or combine them. Multicollinearity quietly destabilises coefficients — catch it now, not after modelling.
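
As a rough illustration of "variables that move together", here is a minimal sketch that flags highly correlated column pairs. The 0.9 threshold, the toy columns, and the helper names `pearson` and `correlated_pairs` are my assumptions for the example, not part of the original workflow.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| at or above threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Toy data (hypothetical): height_cm and height_in carry the same information.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41.0, 23.0, 35.0, 29.0, 52.0],
}
flagged = correlated_pairs(columns)
print(flagged)  # only the height_cm / height_in pair is flagged
```

Pairwise correlation only catches two-variable redundancy. A variable that is a combination of several others can slip through, which is where a variance inflation factor check is the usual next step.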

3

Detect & treat outliers

Identify outliers and replace them with NA so they can be handled consistently in the next step — rather than silently skewing your distributions.
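
The step above does not name a detection rule, so the sketch below assumes the common 1.5 × IQR fence; swap in whatever rule suits your data. The function name `mask_outliers_iqr` and the income figures are made up for illustration.

```python
from statistics import quantiles


def mask_outliers_iqr(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with None (our stand-in for NA)."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v if v is not None and low <= v <= high else None for v in values]


incomes = [42, 45, 47, 50, 52, 55, 58, 400]  # hypothetical values, one obvious outlier
masked = mask_outliers_iqr(incomes)
print(masked)  # [42, 45, 47, 50, 52, 55, 58, None]
```

Masking instead of deleting keeps the row in place, so the NA treatment in the next step sees outliers and genuinely missing values through the same path.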

4

Treat NA / null values — and branch your datasets

Don’t guess at one imputation. Create multiple candidate datasets and let the metrics decide which survives.

Dataset variants to compare:

    # build several, feed each to the algorithm, keep the best
    df_original   = drop_na(raw)   # baseline
    df_na_treated = impute(raw, method="mice", diagnostic="cooks_distance")
    df_out_capped = treat_outliers(raw, method="capping", fill="mean")
    df_binned     = bin_and_log(raw)   # binning + log transform

P2: Engineer the features

5

Encode categorical variables

Treat factor (categorical) variables with an encoder that suits them: one-hot for unordered categories, ordinal for categories with a natural order. The encoding choice often matters more than the model choice.
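
To make the encoder choice concrete, here is a small sketch of two common options: one-hot for unordered categories and ordinal for ordered ones. The helper names `one_hot` and `ordinal` and the sample columns are assumptions for the example; in practice a library encoder would do this for you.

```python
def one_hot(values, drop_first=False):
    """Encode a categorical column as a list of {indicator_name: 0/1} rows."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # drop one level to avoid a perfectly collinear column
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]


def ordinal(values, order):
    """Map ordered categories to integer ranks using an explicit order."""
    rank = {lvl: i for i, lvl in enumerate(order)}
    return [rank[v] for v in values]


colours = ["red", "green", "blue", "green"]    # no natural order
sizes = ["small", "large", "medium", "small"]  # natural order

encoded_colours = one_hot(colours)
encoded_sizes = ordinal(sizes, order=["small", "medium", "large"])
print(encoded_colours[0])  # {'is_blue': 0, 'is_green': 0, 'is_red': 1}
print(encoded_sizes)       # [0, 2, 1, 0]
```

Giving an unordered column integer ranks invents an ordering the model may take seriously, which is one way the encoding can hurt more than the algorithm helps.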

6

Feature selection

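Reduce dimensionality with Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) to cut noise and speed up training. Strictly speaking, both are feature extraction: they build new combined features instead of picking a subset of the originals. PCA is unsupervised; LDA needs class labels, so it applies only to classification problems.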

PCA · LDA

7

Split: train / test / holdout

Partition deliberately — 70% train, 20% test, 10% holdout. The holdout is your honesty check; never let it leak into training.

70 / 20 / 10 split:

    train, temp = split(data, ratio=0.70)
    test, holdout = split(temp, ratio=2/3)   # 2/3 of the remaining 30% gives 20% test, 10% holdout
    assert abs(len(train) / len(data) - 0.70) < 0.01   # tolerance, since exact float equality is brittle
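
For a version you can run as-is, here is a minimal sketch of the same 70 / 20 / 10 partition in plain Python. The function name `three_way_split` and the fixed seed are my choices for the example; a library splitter would work just as well.

```python
import random


def three_way_split(rows, train=0.70, test=0.20, seed=42):
    """Shuffle once, then cut into train / test / holdout by position."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # whatever remains is the holdout
    )


data = list(range(1000))
train_set, test_set, holdout_set = three_way_split(data)
print(len(train_set), len(test_set), len(holdout_set))  # 700 200 100
```

Cutting one shuffled list by position guarantees the three sets are disjoint, which is the property the holdout depends on. For time-ordered data a random shuffle is usually the wrong choice; split by date instead.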

P3: Model and measure

8

Run multiple algorithm templates

Push each candidate dataset through a battery of models rather than betting on one. Below: the templates I keep ready for regression and classification.

Model templates, regression vs classification:

    # Regression (caret package)
    models_reg = ["linear_regression", "neural_network", "CTREE", "random_forest"]
    # Classification (C4.5 builds classification trees only, so it belongs in this list alone)
    models_clf = ["logistic_regression", "naive_bayes", "SVM", "C4.5", "CTREE", "random_forest"]

    # task is assumed to be set to "regression" or "classification" for the problem at hand
    models = models_reg if task == "regression" else models_clf
    for ds in [df_original, df_na_treated, df_out_capped, df_binned]:
        for m in models:
            fit_and_score(m, ds)   # compare on the metrics in step 10

9

Combine insights across models

Pull insights from several algorithms together — agreement and disagreement between models is itself a business signal.
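
One simple way to read agreement and disagreement is to compare predictions row by row. The sketch below is an assumption about how you might do that, not a prescribed method; the model names and predictions are invented for the example.

```python
from itertools import combinations


def agreement_rates(predictions):
    """Share of rows on which each pair of models makes the same prediction."""
    return {
        (a, b): sum(x == y for x, y in zip(predictions[a], predictions[b]))
        / len(predictions[a])
        for a, b in combinations(predictions, 2)
    }


def contested_rows(predictions):
    """Indices where the models do not all agree; these merit a closer look."""
    rows = zip(*predictions.values())
    return [i for i, row in enumerate(rows) if len(set(row)) > 1]


# Hypothetical class predictions from three fitted models on the same five rows.
predictions = {
    "logistic_regression": [1, 0, 1, 1, 0],
    "random_forest": [1, 0, 0, 1, 0],
    "naive_bayes": [1, 1, 0, 1, 0],
}
rates = agreement_rates(predictions)
contested = contested_rows(predictions)
print(rates)
print(contested)  # [1, 2]
```

Rows where every model agrees are the safe ground. The contested rows are often the interesting customers, transactions, or cases to take back to the business.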

10

Check the right metrics

Match the metric to the task. The wrong metric will happily reward a useless model.

Regression: RMSE, R², Adjusted R², MAPE
Classification: Accuracy, ROC AUC, Sensitivity, Specificity, F1, Precision

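
To show how the wrong metric rewards a useless model, here is a minimal sketch of most of the metrics listed above in plain Python. The function names and the toy labels are assumptions for the example; ROC AUC is left out because it needs predicted scores, not hard labels.

```python
from math import sqrt


def regression_metrics(y_true, y_pred, n_features):
    """RMSE, R², adjusted R², and MAPE (in percent) for one set of predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {
        "rmse": sqrt(ss_res / n),
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        # MAPE is undefined when any true value is zero.
        "mape": 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n,
    }


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, precision, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": precision,
        "f1": (2 * precision * sensitivity / (precision + sensitivity)
               if precision + sensitivity else 0.0),
    }


# 95 negatives, 5 positives, and a "model" that always predicts negative.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
clf = classification_metrics(y_true, y_pred)
print(clf["accuracy"], clf["sensitivity"], clf["f1"])  # 0.95 0.0 0.0
```

The always-negative model scores 95% accuracy while finding none of the positives. On imbalanced classes, sensitivity and F1 expose what accuracy hides.
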
11

Templatise steps 2–10

Wrap the repeatable work into library functions and keep them in a template file. Future-you solves the next problem in a fraction of the time.

P4: Ship and maintain

12

Translate to business insight

Convert model output into decisions a stakeholder can act on. A metric nobody acts on is a vanity number.

13

Validate with real-time data

Test against live data, not just your holdout. The world drifts from your training set the moment you deploy.
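
One way to put a number on that drift is the Population Stability Index (PSI), which compares how a feature is distributed in training versus live data. The original step does not name a technique, so PSI, the ten equal-width bins, and the `psi` helper below are all my assumptions for this sketch.

```python
from math import log


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the training sample is constant

    def shares(sample):
        counts = [0] * bins
        for v in sample:
            # Clamp so live values outside the training range land in the end bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(sample), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * log(ai / ei) for ai, ei in zip(a, e))


training = list(range(100))                 # hypothetical feature values at training time
live_same = list(range(100))                # live data with the same distribution
live_shifted = [v + 30 for v in range(100)]  # live data that has drifted upward

print(psi(training, live_same))     # 0.0
print(psi(training, live_shifted))  # large: the distribution has clearly moved
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions, not guarantees. Calibrate them against your own data.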

14

Make it read new data continuously

Wire the model to ingest fresh data on an ongoing basis so it stays relevant instead of decaying.

15

Revise to hold your metrics

Monitor and retrain to keep performance within tolerance. A model is a product, not a delivery — it needs maintenance.
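
"Within tolerance" can be as simple as a threshold check on a rolling metric. The sketch below assumes you log one score per monitoring window; the function name `needs_retraining`, the 0.05 tolerance, and the sample scores are illustrative only.

```python
def needs_retraining(baseline, recent_scores, tolerance=0.05, higher_is_better=True):
    """True when the average recent score has slipped past the allowed tolerance."""
    current = sum(recent_scores) / len(recent_scores)
    drop = baseline - current if higher_is_better else current - baseline
    return drop > tolerance


# F1 was 0.82 at sign-off; the last four weekly checks are drifting down.
print(needs_retraining(0.82, [0.80, 0.78, 0.75, 0.73]))  # True

# For error metrics such as RMSE, lower is better, so flip the direction.
print(needs_retraining(10.0, [10.2, 12.0], tolerance=0.5, higher_is_better=False))  # True
```

Averaging over a few windows keeps a single noisy week from triggering a retrain. How many windows to average is a judgment call for your own traffic.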

The first seven steps are data work. Only one step, step 8, is “pick a model.” That ratio is the whole lesson.