Why Random Forest was my first production model
When I started building the ML Production Pipeline project, I had a choice to make before any of the interesting MLOps work began: what model do I actually train?
I picked Random Forest. Not because it is the fanciest option in machine learning, but because it works well out of the box, which makes it a strong first model when you are learning how to take something from your notes to a system you can run and trust.
This post explains what Random Forest is, where it fits in the ML landscape, and why it was the right call for that project.
Start with the decision tree
Before Random Forest, you need a decision tree.
A decision tree is a flowchart the model learns from data. At each step it asks a question about the inputs:
- Is the transaction amount over $500?
- Is the time of day unusual?
- Does feature `v3` look like past fraud cases?
Each branch leads to another question or a final prediction. For fraud detection, the leaf might say fraud or not fraud.
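To make the flowchart idea concrete, here is a tiny hand-written tree as plain Python. This is only a sketch: a real tree learns its questions and thresholds from data, and the `hour < 6` and `v3 < -2.0` cut-offs (and the function name `toy_fraud_tree`) are made up for illustration. Only the $500 question comes from the list above.

```python
def toy_fraud_tree(amount: float, hour: int, v3: float) -> str:
    """Walk a hard-coded flowchart and return the label at the leaf we reach."""
    if amount > 500:                 # Is the transaction amount over $500?
        if hour < 6:                 # Is the time of day unusual? (hypothetical cut-off)
            return "fraud"
        # Does v3 look like past fraud cases? (hypothetical cut-off)
        return "fraud" if v3 < -2.0 else "not fraud"
    return "not fraud"               # Small amounts end at a "not fraud" leaf


print(toy_fraud_tree(amount=900.0, hour=3, v3=0.1))
print(toy_fraud_tree(amount=900.0, hour=14, v3=0.1))
print(toy_fraud_tree(amount=40.0, hour=3, v3=-5.0))
```

Every prediction is one path from the top question down to a leaf, which is why a single tree is easy to read.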
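A single tree has a weakness, though: grown deep enough, it tends to overfit, memorizing quirks of the training data that do not hold for new transactions. Random Forest fixes that by training many trees and letting them vote.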
What Random Forest actually is
Random Forest is an ensemble model. Instead of one decision tree, you build hundreds or thousands of them. Each tree trains on a random sample of the training rows (drawn with replacement), and at each split it considers only a random subset of the features. When you need a prediction, the trees vote (for classification) or average (for regression).
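The sample-then-vote mechanics fit in a few lines of standard-library Python. This is a toy sketch, not how a real library implements it: each "tree" here is a one-split stump on a single randomly chosen feature, and the rows, labels, and function names (`fit_stump`, `fit_forest`, `predict`) are all invented for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a one-split tree on one feature, keeping the threshold with the fewest errors."""
    best = None  # (errors, threshold, left_label, right_label)
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda row: left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample of rows and one random feature."""
    rng = random.Random(seed)
    n_rows, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.randrange(n_rows) for _ in range(n_rows)]  # rows drawn with replacement
        feature = rng.randrange(n_features)                      # random feature for this split
        trees.append(fit_stump([rows[i] for i in sample],
                               [labels[i] for i in sample], feature))
    return trees


def predict(trees, row):
    """Classification: every tree votes and the majority label wins."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Made-up rows of (amount, hour); label 1 = fraud, 0 = not fraud.
rows = [(20, 14), (35, 10), (50, 16), (15, 12), (900, 3), (1200, 2), (750, 4), (40, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 0]

forest = fit_forest(rows, labels)
print(predict(forest, (5000, 1)))   # large amount in the middle of the night
print(predict(forest, (10, 13)))    # small amount in the afternoon
```

Each stump sees slightly different data, so each one makes slightly different mistakes. The vote is what smooths those mistakes out.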
The full name is useful:
| Term | Plain English |
|---|---|
| Random | Each tree trains on a different random slice of the data and features |
| Forest | Many trees together, not just one |
In ML taxonomy, Random Forest is usually described as:
- Supervised learning (you train on labeled examples)
- Classification or regression (predict a category or a number)
- Tree-based (built from decision trees)
- Ensemble learning (combines many models)
- Bagging (bootstrap aggregating: train on random samples, aggregate the results)
This is a lot of information to soak in, and I’m not a fan of memorizing things. I like to learn through osmosis. So the practical takeaway here: it is a well-tested workhorse for structured, tabular data (rows and columns in a spreadsheet or database), not images or free text.
Why I chose it for a fraud pipeline
Strong baseline for tabular data. Fraud scores, transaction amounts, timestamps, and engineered features live in columns. Tree models handle that shape well without a lot of tuning on day one.
Class imbalance is common in fraud. Most transactions are legitimate; fraud is rare. Random Forest supports balanced class weights so the model does not learn to always say “not fraud” and look accurate while missing everything important.
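A quick sketch shows both the trap and the fix. The label counts below are made up, and `balanced_class_weights` is a hypothetical helper. It uses the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which I am assuming here as the weighting scheme.

```python
from collections import Counter

# Made-up dataset: 990 legitimate transactions (0) and 10 fraud cases (1).
labels = [0] * 990 + [1] * 10

# A useless "model" that always answers "not fraud".
predictions = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = fraud_caught / labels.count(1)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {cls: len(labels) / (len(counts) * n) for cls, n in counts.items()}


weights = balanced_class_weights(labels)
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
print(weights)
```

With these counts the always-"not fraud" model is right on 990 of 1,000 rows and catches none of the fraud, while the balanced weights make each fraud row count 100 times as much as a legitimate one during training.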
Feature importances help you learn. After training, you can see which inputs contributed most to the trees' splits. That is useful when you are still building intuition about the dataset.
Fast enough to iterate. Training completes quickly on a laptop. When the goal is learning the full loop (train, track in MLflow, serve with FastAPI, log to Redis, watch for drift), you want a model that gets out of your way.
Interpretable enough for a first production story. Deep neural networks can be stronger on some problems, but they are harder to explain and heavier to serve when you are still learning the ops side. Random Forest let me focus on serving, logging, and monitoring without fighting the model itself.
What I did not optimize for (on purpose)
Random Forest is not the best answer to every ML problem.
Skip it (or at least do not start there) when:
- Images, audio, or raw text are the main input. Convolutional nets, transformers, and embedding pipelines exist for a reason.
- You need the highest possible accuracy on a huge, well-labeled dataset and you have the team and infrastructure to train and serve something heavier.
- Relationships over time dominate the signal. Sequences and trends often need different model families or feature engineering first.
For my project, the learning goal was MLOps, not winning a Kaggle competition. A solid tree ensemble was the right trade-off.
How this connects to “production”
Training the forest is step one. The rest of the project is about what happens after the `.pkl` file or MLflow artifact exists:
- Serve predictions behind an API
- Log what the model saw in production
- Compare live inputs to the training baseline
- Alert when data drift suggests the world changed
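The last two steps can be sketched with nothing but the standard library. This is not the drift checker from the repo. It is a minimal illustration using the Population Stability Index (PSI), a technique I am swapping in for the example. The `psi` function, the sample amounts, and the 0.2 alert threshold (a common rule of thumb) are all assumptions.

```python
import math


def psi(baseline, live, n_bins=5, eps=1e-6):
    """Population Stability Index between a training baseline and live values."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if the baseline is constant

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), n_bins - 1)] += 1  # clamp out-of-range values to edge bins
        return [max(c / len(values), eps) for c in counts]  # eps keeps log() defined

    expected, actual = proportions(baseline), proportions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Made-up transaction amounts standing in for one feature.
baseline = list(range(0, 500, 5))            # what the model saw in training
live_same = list(baseline)                   # production looks like training
live_shifted = [v + 400 for v in baseline]   # production amounts jumped

DRIFT_THRESHOLD = 0.2  # rule-of-thumb cut-off, an assumption
for name, live in [("same", live_same), ("shifted", live_shifted)]:
    score = psi(baseline, live)
    print(name, round(score, 3), "ALERT" if score > DRIFT_THRESHOLD else "ok")
```

In a real pipeline the live values would come from whatever the API logged, and the check would run per feature on a schedule.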
Random Forest does not solve any of that. It just gives you a realistic model worth serving while you build the operational loop around it.
That is the same lesson I wrote up on the project page: training is a milestone, not a finish line. Picking Random Forest kept the milestone reachable so I could spend energy on the parts production teams actually run.
What changed for me
I started thinking more in loops while working on this project: train, serve, log, monitor, retrain when the data says so.
Random Forest was not the hero of that story. It was the practical starting point that let the rest of the system become the lesson.
If you are building your first end-to-end ML demo, start with a model you can train quickly, explain roughly, and put behind an API. For tabular problems like fraud scoring, Random Forest is a good candidate.
The code, Docker Compose stack, and drift checker live in the ML Production Pipeline repo. Train the forest first. Then spend your curiosity on everything that comes after the model file exists.