You have built a machine learning model. But how do you know if it's actually good?
In this part of the series, we will break down the core evaluation metrics used for classification models:
- Confusion matrix
- Precision
- Recall
- F1 Score
- ROC Curve
Confusion Matrix
A confusion matrix is a table showing how many predictions your model got right or wrong, and in what way.
|             | Predicted: No       | Predicted: Yes      |
|-------------|---------------------|---------------------|
| Actual: No  | True Negative (TN)  | False Positive (FP) |
| Actual: Yes | False Negative (FN) | True Positive (TP)  |
Sample Python Code Snippet:
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 1, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1]  # model predictions
# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
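To pull the four counts out as named variables, you can flatten the same matrix with ravel(), continuing from the snippet above:
# ravel() flattens the 2x2 matrix in row order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=2, FP=1, FN=1, TP=3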
Real-world use cases:
In fraud detection, false positives are legitimate users flagged as fraudsters, while false negatives are fraudsters slipping through.
Precision
Precision answers the question: of all the items the model predicted as positive, how many were actually correct?
Formula:
Precision = (TP) / (TP + FP)
Sample Python Code Snippet:
from sklearn.metrics import precision_score
# Uses y_true and y_pred from the confusion matrix example above
print(precision_score(y_true, y_pred))  # 0.75
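As a quick sanity check, you can compute the same value by hand from the confusion matrix above, which had TP = 3 and FP = 1:
tp, fp = 3, 1  # counts from the confusion matrix example
print(tp / (tp + fp))  # 0.75, the same value precision_score returns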
Real-world use cases:
In email spam filters, precision tells you how many flagged emails were actually spam.
Recall
Recall answers the question: of all the actual positives, how many did the model catch?
Formula:
Recall = (TP) / (TP + FN)
Sample Python Code Snippet:
from sklearn.metrics import recall_score
print(recall_score(y_true, y_pred))  # 0.75 for the example data above
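The same sanity check works here, using TP = 3 and FN = 1 from the confusion matrix:
tp, fn = 3, 1  # counts from the confusion matrix example
print(tp / (tp + fn))  # 0.75, the same value recall_score returns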
Real-world use cases:
In medical screening, high recall ensures you catch as many real cases as possible, even if it means more false alarms.
F1 Score
The F1 score is the harmonic mean of precision and recall. It balances the two and is especially useful when classes are imbalanced.
Formula:
F1 score = 2 * ((Precision * Recall) / (Precision + Recall))
Sample Python Code Snippet:
from sklearn.metrics import f1_score
print(f1_score(y_true, y_pred))  # 0.75 for the example data above
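To see why the harmonic mean is used, here is a small illustration with made-up numbers (not from our example): when precision and recall diverge, F1 drops sharply, unlike a simple average.
# Illustrative numbers only: high precision but very low recall
precision, recall = 0.9, 0.1
f1 = 2 * ((precision * recall) / (precision + recall))
print(f1)  # ~0.18, versus an arithmetic mean of 0.5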
Real-world use cases:
In fraud detection, F1 Score is crucial because catching fraud (recall) and not falsely accusing customers (precision) are both important.
ROC Curve & AUC
A graph showing how well the model separates classes as the decision threshold changes.
- True Positive Rate (Recall) vs False Positive Rate.
- The Area Under the Curve (AUC) summarizes the overall performance in a single number.
Sample Python Code Snippet:
from sklearn.metrics import roc_curve, auc
# y_scores: example predicted probabilities for the positive class
# (in practice, take these from your model's predict_proba output)
y_scores = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7, 0.6]
fpr, tpr, _ = roc_curve(y_true, y_scores)
print("AUC Score:", auc(fpr, tpr))
Real-world use case:
Used by banks to measure how well a credit model can distinguish between good and risky borrowers.
Summary:
| Metric    | What it measures                          | When to focus on it                             |
|-----------|-------------------------------------------|-------------------------------------------------|
| Precision | How many predicted positives are correct  | When false positives are costly                 |
| Recall    | How many actual positives are caught      | When missing positives is risky                 |
| F1 Score  | Balance of precision and recall           | When both matter (for example, fraud, medical)  |
| ROC AUC   | Model’s ability to distinguish classes    | General evaluation for binary classification    |
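If you want most of these numbers at once, scikit-learn's classification_report prints precision, recall, and F1 for each class in a single call. A short, self-contained sketch using the same example data as above:
from sklearn.metrics import classification_report
y_true = [1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1]
# One table with precision, recall, F1, and support per class
print(classification_report(y_true, y_pred))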
Up Next
In Part 5, we will complete the series with how to tune your model using:
- Hyperparameter tuning
- Gradient descent
- Epochs
- Loss functions