Jun 1, 2026

Did Your Model Actually Learn Anything? Evaluating Honestly

Series: Practical PyTorch · II (Phase II) — Part 6 of 9

You just fine-tuned a model (Part 5), the loss fell beautifully, and the training log says it’s getting 99% of its examples right. Time to ship? Absolutely not — because that 99% is the model’s score on the exact questions it studied, and a student who memorized the answer key isn’t the same as a student who understood the material. This post is about telling those two apart. It’s the difference between a model that learned and a model that merely memorized, and it’s the most important skill in this whole series that nobody puts on a slide.

We’ll cover why you split your data into three piles, the everyday metrics in plain English, how to read overfitting and underfitting straight off the numbers, and how to wire real metrics into the Trainer with the evaluate library so you get an honest report card automatically.

You can’t grade on what you studied

The single rule underneath all of evaluation: never measure a model on data it learned from. If you do, you’re rewarding memorization, and memorization looks identical to understanding right up until the model meets something new and falls apart.

So you split your data into three piles before training starts:

Train — the examples the model actually learns from. The biggest pile, usually ~80%.
Validation — held back during training, used to check progress and tune your choices (learning rate, number of epochs, which model to keep). The model never learns from these, but you look at them constantly. ~10%.
Test — locked in a drawer until the very end. You touch it once, to get a final honest number, and then you’re done. ~10%.

Why two held-out piles instead of one? Because the moment you start tuning your setup to make the validation numbers look good — trying another learning rate, training one more epoch — you’re subtly fitting to the validation set too. It stops being a clean measure. The test set is the final exam you never peeked at, so its number is the one you can actually trust and quote. Validation is the practice exam; test is the real thing.

The good news: the libraries make this nearly free. A Hugging Face Dataset splits in one line, and you hand the train and validation pieces to the Trainer.

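# `dataset` is a DatasetDict with built-in splits, e.g. "train" and "test".
# Carve a validation pile out of "train"; if the dataset ships its own "test" split, leave that sealed.
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_ds = split["train"]        # the model learns from these
eval_ds  = split["test"]         # validation pile (train_test_split names it "test"); never used for training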

That seed=42 matters more than it looks — it makes the split reproducible, so you and a colleague grading the same model are grading on the same questions.
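The snippet above yields two piles, while the text describes three. Here is a minimal pure-Python sketch of an 80/10/10 split over example indices, just to show the mechanics and what the seed buys you. The function name three_way_split and its parameters are hypothetical, made up for this illustration; with a Hugging Face Dataset, one way to get the same effect is to call train_test_split a second time on the held-out piece.

```python
import random


def three_way_split(n_examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Return (train, val, test) index lists that are disjoint and reproducible.

    Hypothetical helper for illustration only; not part of any library.
    """
    indices = list(range(n_examples))
    # A dedicated Random instance seeded the same way always shuffles the same way.
    random.Random(seed).shuffle(indices)

    n_test = round(n_examples * test_frac)
    n_val = round(n_examples * val_frac)

    test_idx = indices[:n_test]                  # sealed until the very end
    val_idx = indices[n_test:n_test + n_val]     # checked during development
    train_idx = indices[n_test + n_val:]         # the only pile the model learns from
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100

# Same seed, same piles: two people running this grade on the same questions.
assert three_way_split(1000) == (train_idx, val_idx, test_idx)
```

Because the three lists are slices of one shuffled list, no example can land in two piles, which is the property that keeps the validation and test scores honest.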

The metrics that matter, in plain terms

Once you have predictions on held-out data, you need a number that says how good they are. The obvious one is accuracy, and it’s exactly what it sounds like.

Accuracy — the fraction the model got right. 90 correct out of 100 is 0.90. Intuitive, and the right default for balanced problems where each category shows up about equally often.

But accuracy has a famous failure mode. Imagine a fraud detector where 99% of transactions are legitimate. A “model” that blindly says legitimate every single time scores 99% accuracy — and catches exactly zero fraud. The number looks brilliant; the model is useless. This is why a single accuracy figure isn’t the whole story, and why you want two more:

Precision — of the things the model flagged, how many were actually right? High precision means few false alarms. (When it cries “fraud!”, believe it.)
Recall — of the things that should have been flagged, how many did the model catch? High recall means few misses. (It rarely lets fraud slip through.)

These two pull against each other. Flag everything and you catch all the fraud (great recall) but bury people in false alarms (terrible precision). Flag only the cases you’re nearly certain about and your few alarms are almost all correct (great precision), but you miss most of the fraud (terrible recall). Which you favor depends on the cost of being wrong in each direction: a spam filter wrongly trashing a real email is worse than letting one spam through, so it leans toward precision.

F1 — a single number that balances precision and recall (their harmonic mean, if you want the term). It only stays high when both are decent, so it’s the honest summary metric for imbalanced problems where plain accuracy lies. When in doubt, look at F1 alongside accuracy.
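To make the fraud example concrete, here is a small sketch that computes all four numbers by hand on made-up data. The helper names, the 1,000-transaction dataset, and the "detector" predictions are invented for illustration, and returning 0.0 when a ratio has a zero denominator is a convention assumed here, not something the metrics themselves dictate.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one positive class (here: 1 = fraud)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Made-up data: 10 fraudulent transactions (1) and 990 legitimate ones (0).
y_true = [1] * 10 + [0] * 990

# The lazy "model" that always says legitimate.
always_legit = [0] * 1000

# A hypothetical detector: catches 8 of the 10 frauds, raises 4 false alarms.
detector = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 986

for name, preds in [("always_legit", always_legit), ("detector", detector)]:
    p, r, f = precision_recall_f1(y_true, preds)
    print(f"{name}: acc={accuracy(y_true, preds):.3f} "
          f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# always_legit: acc=0.990 precision=0.00 recall=0.00 f1=0.00
# detector: acc=0.994 precision=0.67 recall=0.80 f1=0.73
```

The two accuracies differ by less than half a point, while F1 goes from 0.00 to 0.73. That gap is the information accuracy was hiding.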

You don’t compute any of these by hand. The evaluate library has them ready to load, which we’ll get to in a moment.

Overfitting vs. underfitting — read it off train vs. validation

Here’s the payoff for keeping that validation pile separate: comparing the model’s train score to its validation score tells you, at a glance, what kind of trouble you’re in.

Overfitting — great on train, bad on validation. The model memorized the training examples instead of learning the general pattern. Like a student who can recite last year’s exam but freezes on a question phrased differently. The tell:

train accuracy:       0.99
validation accuracy:  0.74    ← big gap = overfitting

A wide gap means the model is too cozy with its training data. Fixes: more training data, stop training sooner (fewer epochs), or a smaller / more constrained model. You’ll meet a particularly elegant fix in Part 7.

Underfitting — bad on train, bad on validation. The model didn’t learn enough even on the material it studied. Like a student who didn’t open the book. The tell:

train accuracy:       0.61
validation accuracy:  0.59    ← both low = underfitting

Both numbers are weak and close together. Fixes: train longer, use a bigger or less constrained model, or check that your data and labels are actually sane.

The healthy middle — good on train, nearly as good on validation. A small gap is normal and fine.

train accuracy:       0.94
validation accuracy:  0.91    ← small gap = healthy

That’s the whole diagnostic, and it costs you nothing but the discipline to look at both numbers instead of just the one that flatters you.
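If you want that habit in code, a rough sketch might look like the function below. The name diagnose and both thresholds (0.70 for "low", 0.10 for "big gap") are illustrative assumptions, not standard values; what counts as low or wide depends entirely on your task.

```python
def diagnose(train_acc, val_acc, low=0.70, gap=0.10):
    """Rough heuristic: label a run from its train and validation accuracy.

    `low` and `gap` are hypothetical thresholds chosen for this example.
    """
    if train_acc < low and val_acc < low:
        return "underfitting"      # weak even on the material it studied
    if train_acc - val_acc > gap:
        return "overfitting"       # strong on train, much weaker on validation
    return "healthy"               # good on train, nearly as good on validation


# The three scenarios from this section:
print(diagnose(0.99, 0.74))  # overfitting
print(diagnose(0.61, 0.59))  # underfitting
print(diagnose(0.94, 0.91))  # healthy
```

Treat the output as a prompt to go look at the numbers, not as a verdict.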

Wiring real metrics into the Trainer

In Part 5 the Trainer reported loss and not much else. Loss is fine for the optimizer, but it’s not a number you can explain to a PM. Let’s make the Trainer report accuracy and F1 every time it evaluates, by writing a tiny compute_metrics function and handing it over.

First, load the metrics:

import numpy as np
import evaluate

acc = evaluate.load("accuracy")
f1  = evaluate.load("f1")

evaluate.load("accuracy") fetches a ready-made metric with a .compute() method — you don’t implement the math. Now the bridge between what the Trainer produces and what those metrics want:

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)   # turn raw scores into a predicted class
    return {
        **acc.compute(predictions=preds, references=labels),
        **f1.compute(predictions=preds, references=labels, average="weighted"),
    }

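Two things are happening. The model outputs logits: raw, unnormalized scores, one per class. np.argmax(..., axis=-1) picks the highest-scoring class for each example, turning a row of scores into a single predicted label. Then each metric compares those predictions against the true labels. The ** spreads both result dicts into one, so you get {"accuracy": ..., "f1": ...} in a single return. The average argument matters once you have more than two classes, because the metric’s default, "binary", only works for two-class problems. average="weighted" computes F1 per class and averages those scores in proportion to how often each class appears, which gives a sensible overall summary. The catch is that common classes dominate the result, so if the rare class is the one you care about (the imbalance issue from earlier), also look at average="macro", which counts every class equally.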
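To see what the argmax step and the average setting do, here is a pure-Python sketch on toy data. It mimics the weighted and macro averaging schemes by hand rather than calling the evaluate library; the helper names and the logits are made up for illustration.

```python
def argmax_rows(logits):
    """Index of the highest score in each row (what np.argmax(..., axis=-1) does)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def per_class_f1(y_true, y_pred):
    """F1 for each class, treating that class as the positive one."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0  # assumed: 0/0 counts as 0.0
    return scores


# Toy logits: one row per example, one column per class.
# Class 0 always scores highest, so the model never predicts the rare class 1.
logits = [[2.1, -0.3]] * 8 + [[0.9, 0.4]] * 2
labels = [0] * 8 + [1] * 2

preds = argmax_rows(logits)
f1_by_class = per_class_f1(labels, preds)

# "weighted": average per-class F1 in proportion to each class's frequency.
weighted = sum(f1_by_class[c] * labels.count(c) for c in f1_by_class) / len(labels)
# "macro": plain average, every class counts the same.
macro = sum(f1_by_class.values()) / len(f1_by_class)

print(preds)                                        # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(f"weighted={weighted:.2f} macro={macro:.2f}")  # weighted=0.71 macro=0.44
```

The model ignores class 1 entirely, yet the weighted score still looks passable. The macro score drops much further because the missed class counts as much as the common one.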

Hand it to the Trainer with one extra argument:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,   # ← the new piece
)

trainer.train()
metrics = trainer.evaluate()
print(metrics)
# {'eval_loss': 0.31, 'eval_accuracy': 0.91, 'eval_f1': 0.90, ...}

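Now every evaluation reports accuracy and F1 alongside loss, on your held-out data, automatically: the explicit trainer.evaluate() call above, plus any evaluations the Trainer runs during training if your TrainingArguments schedule them. That’s a report card you can actually read, and defend.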

Gotchas

The mistakes here are quiet ones: they don’t crash, they just hand you a number that’s wrong in a way you’ll believe.

Testing on training data. The original sin. If your evaluation set overlaps with your training set, your score is inflated and meaningless — you’re grading the answer key. Split before you train, and keep the piles separate.
Trusting accuracy on imbalanced data. When one class dominates, accuracy can look fantastic while the model ignores the rare class entirely. If your classes aren’t roughly balanced, look at F1 (and at precision/recall) before you believe the accuracy number.
Peeking at the test set. The instant you tune anything based on test-set results, it stops being a clean final exam and becomes just another validation set. Tune on validation; touch test once, at the end. Treat it like a sealed envelope.
Reporting only the metric that flatters the model. Cherry-picking “94% accuracy!” while quietly omitting that recall is 0.20 is how good-looking models ship broken. Report a metric that fits the task, not the one that reads best.
Forgetting model.eval() when evaluating by hand. The Trainer handles this for you, but if you ever loop over the validation set yourself, call model.eval() first; otherwise dropout and friends stay active and quietly distort your numbers. Run the loop inside torch.no_grad() as well, which skips gradient tracking to save memory and time. Same habit from Part 2.
A tiny validation set. Measuring accuracy on 20 examples gives you a noisy number that can swing 5–10 points between runs by luck alone. If your held-out pile is small, treat its score as a rough estimate, not gospel.
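For the last gotcha, a back-of-the-envelope check shows how much a score can wobble with the size of the held-out pile. This sketch uses the normal approximation to a binomial proportion, which is itself rough for very small sets; the function name and the example numbers are illustrative assumptions.

```python
import math


def accuracy_interval(accuracy, n_examples, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_examples.

    Uses the normal approximation; treat the result as a rough guide only.
    """
    std_err = math.sqrt(accuracy * (1 - accuracy) / n_examples)
    low = max(0.0, accuracy - z * std_err)
    high = min(1.0, accuracy + z * std_err)
    return low, high


for n in (20, 2000):
    low, high = accuracy_interval(0.90, n)
    print(f"n={n}: {low:.2f} to {high:.2f}")
# n=20: 0.77 to 1.00
# n=2000: 0.89 to 0.91
```

The same measured 0.90 is consistent with anything from mediocre to perfect on 20 examples, and pinned down to about a point either way on 2,000.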

What’s next

You can now answer the only question that matters after training — did it actually learn? — with numbers you trust instead of numbers that flatter. Split the data, watch the train-vs-validation gap, reach for F1 when the classes are lopsided, and keep the test set sealed until the end.

There’s one fix we kept gesturing at: when fine-tuning a large model is slow, memory-hungry, or prone to overfitting, you don’t have to retrain the whole thing. You can train a tiny set of new parameters bolted onto a frozen model and get most of the benefit for a fraction of the cost.

Next: Part 7 — Parameter-Efficient Fine-Tuning (LoRA & PEFT), fine-tuning big models without the big bill.
