How does a machine learning model know whether it is performing well or not? It needs a way to measure how far off its predictions are from reality. That’s where the loss function comes in.
Think of the loss function as the model’s internal GPS telling it, “You’re this far away from your destination—time to adjust!”
Simple analogy
Think of loss like a golf score. The farther each shot (prediction) lands from the hole (target), the more it adds to your score (loss). The aim is always to get the lowest possible score.
What is a Loss Function?
A loss function quantifies the difference between what the model predicts and what the actual value is. It’s like a scorecard: lower scores mean the model’s predictions are close to reality, higher scores mean it’s way off.
During training, the model continuously tweaks its parameters to minimize this loss, aiming to improve prediction accuracy step by step.
Why is the Loss Function Important?
Without a loss function, the model wouldn’t know which way to adjust its parameters. The loss acts like a guide, telling the model whether it’s improving or getting worse. Optimizers (like gradient descent) rely on this value to decide how to update the model.
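To make this concrete, here’s a minimal sketch (not from the original post, with made-up numbers) of how gradient descent might use a squared-error loss to adjust a single parameter w in a toy model y = w * x:

```python
# Toy gradient descent: fit y = w * x to one data point using squared-error loss.
# All values here are illustrative.
x, y_true = 2.0, 10.0   # input and target (the true w would be 5.0)
w = 0.0                 # start with a bad guess
lr = 0.1                # learning rate

for step in range(50):
    y_pred = w * x
    loss = (y_pred - y_true) ** 2      # squared-error loss
    grad = 2 * (y_pred - y_true) * x   # derivative of the loss with respect to w
    w -= lr * grad                     # nudge w in the direction that lowers the loss

print(f"learned w: {w:.3f}")  # approaches 5.0
```

The loss value itself never updates anything directly; it’s the gradient of the loss that tells the optimizer which direction to move each parameter.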
Types of Loss Functions
Depending on what problem you’re solving, you pick different loss functions:
- Mean Squared Error (MSE): Used for regression problems (predicting prices, temperatures, etc.).
- Cross-Entropy Loss: Used for classification (cats vs. dogs).
Each has a shape and penalty style tailored to the type of task.
A simple example:
Imagine we built a model for house price prediction. If it predicts $200,000 but the actual price is $220,000, that’s a $20,000 error.
Why use the Squared Error?
We use the squared difference—like in Mean Squared Error (MSE)—because it serves two key purposes:
- Amplifies bigger mistakes: Squaring the error means that larger mistakes grow disproportionately. An error of 20 becomes 400, while an error of 2 becomes just 4. This encourages the model to pay special attention to large misses.
- Eliminates direction bias: If we simply summed up raw errors, positive and negative mistakes could cancel each other out (like +20 and -20 = 0), misleading the model into thinking it’s performing perfectly. Squaring makes all errors positive, so the loss truly reflects total deviation.
That’s why, in our house price example, we used $(20,000)^2 = 400,000,000$. It highlights not just that the prediction was wrong, but how wrong it was—pushing the model to correct aggressively when far off.
Why didn’t we use MSE here?
When we have just one prediction, like in this house price example, we simply compute the squared error. But in most real-world machine learning, we deal with many predictions across lots of data points. In those cases, we calculate the Mean Squared Error (MSE) — the average of all the squared errors.
So:
- One prediction: squared error = $(20,000)^2$
- Many predictions: MSE = average of all squared errors
This averaging ensures the loss isn’t tied just to the number of samples, making it a fair metric across different dataset sizes.
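Continuing the house price example with a couple of extra (invented) data points, the step from individual squared errors to MSE is just an average:

```python
# Squared error per prediction vs. MSE over the dataset (illustrative numbers).
predicted = [200_000, 310_000, 148_000]
actual    = [220_000, 300_000, 150_000]

squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)     # [400000000, 100000000, 4000000]
print(f"MSE: {mse:,.0f}")  # MSE: 168,000,000
```

Because it’s an average, the MSE stays comparable whether you evaluate on 3 houses or 3 million.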
A tiny demo in Python
Here’s a quick snippet showing how raw errors can cancel out, while squared errors properly reflect total mistakes.
errors = [20, -20, 2, -2]
# Sum of raw errors
total_error = sum(errors)
print(f"Sum of raw errors: {total_error}")
# Sum of squared errors
total_squared_error = sum(e**2 for e in errors)
print(f"Sum of squared errors: {total_squared_error}")
Output:
Sum of raw errors: 0
Sum of squared errors: 808
Even though the sum of raw errors misleadingly says “0,” the squared errors show there’s actually a significant total deviation.
Wrap up
The loss function sits at the heart of machine learning. It’s the compass guiding your model toward better predictions by telling it exactly how far off it is at each step. In the next post, we’ll explore how optimizers use this loss to actually adjust the model’s parameters—getting it ever closer to its goal.
