There’s a comforting myth that machine learning is about choosing the cleverest algorithm. In practice, the algorithm is the easy part. The discipline lives in the preparation — and a repeatable sequence beats raw cleverness every time. Here are the fifteen steps, grouped into four phases, exactly as I run them.
- Prepare — understand, clean, and treat the data before anything else
- Engineer — encode, reduce, and split deliberately
- Model — run templates, compare on the right metrics
- Ship — validate on real data, then keep the model honest
P1 Prepare the data
Exploratory Data Analysis
Check variable types and shape. Plot histograms, box plots, and a correlation matrix. Lean on Tableau or Power BI for deeper visual analysis before you assume anything.
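The type-and-shape check can be sketched in a few lines. This is a minimal illustration, assuming the data arrives as a list of dictionaries; `summarize` and the sample `rows` are hypothetical names, not part of any library. In practice a data-frame library would likely do the same job in one call.

```python
from statistics import mean, median


def summarize(rows):
    """Return the shape plus a per-column profile for a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        is_numeric = bool(present) and len(numeric) == len(present)
        info = {
            "type": "numeric" if is_numeric else "categorical",
            "missing": len(values) - len(present),
        }
        if is_numeric:
            info.update(min=min(numeric), max=max(numeric),
                        mean=mean(numeric), median=median(numeric))
        else:
            info["levels"] = sorted(set(map(str, present)))
        profile[col] = info
    return {"shape": (len(rows), len(columns)), "columns": profile}


# Hypothetical sample records, for illustration only.
rows = [
    {"age": 34, "income": 52000, "segment": "retail"},
    {"age": 41, "income": None, "segment": "corporate"},
    {"age": 29, "income": 48000, "segment": "retail"},
]
report = summarize(rows)
```

The report gives you row and column counts, a type per column, and a missing-value count, which is enough to decide what to plot first.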
Correlation & multicollinearity
Find variables that move together and drop or combine them. Multicollinearity quietly destabilises coefficients — catch it now, not after modelling.
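One way to find variables that move together is to compute pairwise Pearson correlations and flag the strong ones. The sketch below assumes numeric columns held in a dictionary; `correlated_pairs`, the sample columns, and the 0.9 threshold are illustrative assumptions, so pick a cutoff that suits your data.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation; treat as none
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.9):
    """List (name_a, name_b, r) for every pair with |r| at or above threshold."""
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical columns: height_cm and height_in carry the same information.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [9.0, 6.0, 11.0, 7.0, 8.0],
}
pairs = correlated_pairs(columns)
```

Here only the two height columns are flagged, so one of them is a candidate to drop or combine.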
Detect & treat outliers
Identify outliers and replace them with NA so they can be handled consistently in the next step — rather than silently skewing your distributions.
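The article does not name a detection rule, so the sketch below assumes the common Tukey fence (1.5 times the interquartile range) as one possible choice. `mask_outliers` and the sample readings are hypothetical; `None` stands in for NA.

```python
from statistics import quantiles


def mask_outliers(values, k=1.5):
    """Replace points outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR) with None."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v if low <= v <= high else None for v in values]


# Hypothetical readings with one obvious outlier.
readings = [10, 12, 11, 13, 12, 11, 10, 300]
masked = mask_outliers(readings)
```

Because the outlier is now a missing value, whatever imputation strategy you choose next handles it along with the genuine gaps.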
Treat NA / null values — and branch your datasets
Don’t guess at one imputation. Create multiple candidate datasets and let the metrics decide which survives.
# build several, feed each to the algorithm, keep the best
df_original = drop_na(raw) # baseline
df_na_treated = impute(raw, method="mice", diagnostic="cooks_distance")
df_out_capped = treat_outliers(raw, method="capping", fill="mean")
df_binned = bin_and_log(raw) # binning + log transform
P2 Engineer the features
Encode categorical variables
Treat factor (categorical) variables with an encoder that fits them: one-hot for unordered categories, ordinal codes for ranked ones. The right encoding choice often matters more than the model choice.
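Two common encoders, sketched by hand to show what each one produces. The helper names and the sample columns are hypothetical; this is an illustration of the idea rather than a replacement for a library encoder.

```python
def one_hot(values):
    """One-hot encode a list of category labels; returns (column_names, rows)."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


def ordinal(values, order):
    """Map ranked categories to integers following the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


# Hypothetical columns for illustration.
colours = ["red", "green", "red", "blue"]          # no natural order
sizes = ["small", "large", "medium", "small"]      # ranked

names, colour_rows = one_hot(colours)
size_codes = ordinal(sizes, order=["small", "medium", "large"])
```

One-hot suits unordered categories such as colour, while ordinal codes keep the ranking in something like size. Using ordinal codes on unordered data invents an order the model may then try to learn.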
Feature selection
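Reduce dimensionality with Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) to cut noise and speed up training. Strictly speaking, both are feature extraction: they build new combined features rather than picking a subset of the original columns, and LDA needs class labels to do it.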
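To make PCA less of a black box, here is a minimal sketch that estimates only the first principal component by power iteration on the covariance matrix. It is a teaching aid under stated assumptions, not a production implementation: `first_principal_component` and the sample data are hypothetical, and a real project would more likely use a library routine.

```python
from math import sqrt


def first_principal_component(rows, iterations=100):
    """Estimate the leading principal axis of `rows` by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0 / sqrt(d)] * d  # arbitrary unit starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # no variance left to explain
        vec = [x / norm for x in nxt]
    # Project each centred row onto the axis to get its one-dimensional score.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centred]
    return vec, scores


# Hypothetical 2-D data lying on the line y = 2x, so one component explains everything.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
axis, scores = first_principal_component(rows)
```

The two columns collapse into a single score per row with no information lost, because the second column was just twice the first. To respect the holdout, estimate the axis on the training partition only and reuse it to project the test and holdout rows.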
Split: train / test / holdout
Partition deliberately — 70% train, 20% test, 10% holdout. The holdout is your honesty check; never let it leak into training.
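train, temp = split(data, ratio=0.70) # 70% train, 30% left over
test, holdout = split(temp, ratio=2/3) # 2/3 of the remainder: 20% test, 10% holdout
assert abs(len(train) / len(data) - 0.70) < 0.01 # allow for rounding; exact float equality is brittle
P3 Model and measure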
Run multiple algorithm templates
Push each candidate dataset through a battery of models rather than betting on one. Below: the templates I keep ready for regression and classification.
# Regression (caret package)
models_reg = ["linear_regression", "neural_network", "C4.5", "CTREE", "random_forest"]
# Classification
models_clf = ["logistic_regression", "naive_bayes", "SVM", "C4.5", "CTREE", "random_forest"]
# `task` is a placeholder set to "regression" or "classification"; run only the matching templates
models = models_reg if task == "regression" else models_clf
for ds in [df_original, df_na_treated, df_out_capped, df_binned]:
    for m in models:
        fit_and_score(m, ds) # compare on the metrics in step 10
Combine insights across models
Pull insights from several algorithms together — agreement and disagreement between models is itself a business signal.
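One simple way to surface agreement and disagreement is a per-row majority vote with an agreement share. The sketch assumes you already have class predictions from each model on the same rows; `agreement_report` and the sample predictions are hypothetical.

```python
from collections import Counter


def agreement_report(predictions):
    """Majority vote per row plus the share of models that agree with it."""
    model_names = sorted(predictions)
    n_rows = len(predictions[model_names[0]])
    report = []
    for i in range(n_rows):
        votes = Counter(predictions[name][i] for name in model_names)
        label, count = votes.most_common(1)[0]
        report.append({"row": i, "majority": label,
                       "agreement": count / len(model_names)})
    return report


# Hypothetical class predictions from three fitted models on the same four rows.
predictions = {
    "logistic_regression": ["churn", "stay", "stay", "churn"],
    "random_forest":       ["churn", "stay", "churn", "churn"],
    "naive_bayes":         ["churn", "stay", "stay", "stay"],
}
report = agreement_report(predictions)
contested = [r["row"] for r in report if r["agreement"] < 1.0]
```

The contested rows are the ones worth a closer look: they are where the models see the customer differently.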
Check the right metrics
Match the metric to the task. The wrong metric will happily reward a useless model.
| Task | Metrics |
|---|---|
| Regression | RMSE, R², Adjusted R², MAPE |
| Classification | Accuracy, ROC, Sensitivity, Specificity, F1, Precision |
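The sketch below spells out several of these metrics from their textbook definitions and shows how accuracy alone can flatter a useless model. The function names and the imbalanced sample are hypothetical; treat it as an illustration rather than a substitute for a metrics library.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error; undefined when an actual value is 0."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def classification_metrics(actual, predicted, positive=1):
    """Confusion-matrix metrics for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"accuracy": (tp + tn) / len(actual), "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Hypothetical imbalanced labels: 5% positives, and a "model" that always predicts 0.
actual = [1] * 5 + [0] * 95
always_negative = [0] * 100
scores = classification_metrics(actual, always_negative)
```

The always-negative model scores 95% accuracy while its sensitivity and F1 are both zero: it never finds a single positive case. That is the wrong metric rewarding a useless model.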
Templatise steps 2–10
Wrap the repeatable work into library functions and keep them in a template file. Future-you solves the next problem in a fraction of the time.
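A template file can be as simple as a set of small functions plus one helper that chains them. The sketch below is one possible shape, assuming each step takes data and returns data; `make_pipeline` and the two toy steps are hypothetical stand-ins for your real helpers.

```python
def make_pipeline(*steps):
    """Chain (name, function) steps into one callable that logs what ran."""
    def run(data):
        log = []
        for name, fn in steps:
            data = fn(data)
            log.append(name)
        return data, log
    return run


# Hypothetical, deliberately tiny steps standing in for the real helpers.
def drop_missing(values):
    return [v for v in values if v is not None]


def cap_at_100(values):
    return [min(v, 100) for v in values]


prepare = make_pipeline(("drop_missing", drop_missing), ("cap_at_100", cap_at_100))
clean, steps_run = prepare([12, None, 250, 40])
```

Keeping the step names in a log makes it easy to record which preparation recipe produced each candidate dataset.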
P4 Ship and maintain
Translate to business insight
Convert model output into decisions a stakeholder can act on. A metric nobody acts on is a vanity number.
Validate with real-time data
Test against live data, not just your holdout. The world drifts from your training set the moment you deploy.
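One way to put a number on drift is the Population Stability Index (PSI), which compares how a feature is distributed in training data versus live data. The article does not prescribe a drift measure, so this is a swapped-in technique; the `psi` helper, the equal-width binning, and the small floor used to avoid dividing by zero are all assumptions of this sketch.

```python
from math import log


def psi(reference, live, bins=10, floor=1e-6):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference column

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            # Clamp values outside the reference range into the end bins.
            counts[max(0, min(idx, bins - 1))] += 1
        # Floor empty bins so the logarithm below stays defined.
        return [max(c / len(sample), floor) for c in counts]

    ref_p, live_p = proportions(reference), proportions(live)
    return sum((lv - rf) * log(lv / rf) for rf, lv in zip(ref_p, live_p))


# Hypothetical feature values: live data has shifted upward by 30 units.
training_values = list(range(100))
live_values = [x + 30 for x in training_values]

no_drift = psi(training_values, training_values)
drift = psi(training_values, live_values)
```

A PSI of zero means the two distributions match bin for bin. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that cutoff is a convention, not a law, so calibrate it against your own data.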
Make it read new data continuously
Wire the model to ingest fresh data on an ongoing basis so it stays relevant instead of decaying.
Revise to hold your metrics
Monitor and retrain to keep performance within tolerance. A model is a product, not a delivery — it needs maintenance.
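Monitoring can start small: track the error over a rolling window of recent predictions and raise a flag when it leaves tolerance. The sketch below assumes a regression model scored by RMSE; the `RmseMonitor` class, the window size, and the tolerance are hypothetical choices for illustration.

```python
from collections import deque
from math import sqrt


class RmseMonitor:
    """Track RMSE over the most recent predictions and flag a tolerance breach."""

    def __init__(self, tolerance, window=50):
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)  # oldest squared errors drop off automatically

    def record(self, actual, predicted):
        self.errors.append((actual - predicted) ** 2)

    @property
    def rmse(self):
        return sqrt(sum(self.errors) / len(self.errors)) if self.errors else 0.0

    def needs_retrain(self):
        return self.rmse > self.tolerance


# Hypothetical stream: the model is accurate at first, then the world shifts.
monitor = RmseMonitor(tolerance=2.0, window=5)
for actual, predicted in [(10, 10.5), (12, 11.5), (11, 11.0)]:
    monitor.record(actual, predicted)
healthy = monitor.needs_retrain()

for actual, predicted in [(20, 12.0), (22, 13.0), (21, 12.5)]:
    monitor.record(actual, predicted)
degraded = monitor.needs_retrain()
```

The first batch stays within tolerance and the second does not, which is the signal to retrain. The rolling window keeps the check focused on recent behaviour instead of averaging a decline away over the model's whole history.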