Train Your First Model


Last time you learned the big idea: show a computer labeled examples and it finds the pattern. Today you actually do it. Using the penguins from Phase III, you’ll train a real model to guess a penguin’s species from its measurements, in about three lines. The tool is scikit-learn, one of the most popular machine-learning libraries, and it comes preinstalled in Colab.

💡 Do this lesson in Colab. (scikit-learn is already installed, so no pip install is needed.)


Set up the examples

Remember features (the clues) and label (the answer)? For penguins, the features are measurements and the label is the species:

import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier

penguins = sns.load_dataset("penguins").dropna()

X = penguins[["bill_length_mm", "flipper_length_mm", "body_mass_g"]]   # features
y = penguins["species"]                                                # label

By tradition, the features are called X (a table of clues) and the label is called y (the answers). Each row of X is one penguin’s measurements; the matching y is that penguin’s species.
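
To see how X and y line up, here is a small sketch. It uses three made-up rows (not the real penguins data) so it runs on its own without downloading anything; the column names match the ones above.

```python
import pandas as pd

# Made-up toy rows (NOT the real penguins data), just for illustration.
toy = pd.DataFrame({
    "bill_length_mm":    [39, 47, 50],
    "flipper_length_mm": [181, 215, 196],
    "body_mass_g":       [3750, 5000, 3700],
    "species":           ["Adelie", "Gentoo", "Chinstrap"],
})

X = toy[["bill_length_mm", "flipper_length_mm", "body_mass_g"]]  # features: a table
y = toy["species"]                                               # label: one column

print(X.shape)   # (rows, columns): one row per penguin, one column per feature
print(y.shape)   # one answer per penguin

# Row i of X and item i of y describe the same penguin.
first_clues = X.iloc[0].tolist()
first_answer = y.iloc[0]
print(first_clues, "->", first_answer)
```

Try the same two shape checks on the real X and y from the lesson to see how many penguins you have.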

Train it: .fit()

This is the moment of learning. Create a model and call .fit(X, y) — “learn the pattern from these examples”:

model = KNeighborsClassifier()
model.fit(X, y)

That’s it. The model just studied hundreds of penguins and learned how measurements relate to species. fit is the learning step — the same “learn from examples” idea, now real code.

Use it: .predict()

Now give it a new penguin’s measurements and ask what species it is:

guess = model.predict([[45, 210, 4500]])
print("I think this penguin is a:", guess[0])

You handed it a bill length of 45mm, a flipper of 210mm, and a mass of 4500g — measurements it had never seen — and it predicted a species from the pattern it learned. You trained an AI!
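
You can also ask about several penguins at once: each inner list (or table row) is one penguin. The sketch below assumes a tiny made-up training table in place of the real penguins so it runs on its own. It passes the new penguins as a DataFrame with the same column names used for training, because recent versions of scikit-learn may print a harmless warning about feature names when a model trained on a DataFrame is given a plain list.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy measurements (NOT the real penguins data), so this runs on its own.
toy = pd.DataFrame({
    "bill_length_mm":    [38, 39, 40, 37, 41, 47, 48, 46, 49, 50],
    "flipper_length_mm": [185, 190, 188, 182, 192, 215, 220, 212, 218, 222],
    "body_mass_g":       [3600, 3700, 3800, 3500, 3900, 5000, 5200, 4900, 5300, 5400],
    "species":           ["Adelie"] * 5 + ["Gentoo"] * 5,
})
features = ["bill_length_mm", "flipper_length_mm", "body_mass_g"]

model = KNeighborsClassifier()
model.fit(toy[features], toy["species"])

# Two new penguins in one table, with the same column names as the training data.
new_penguins = pd.DataFrame(
    [[38, 186, 3650], [48, 217, 5100]],
    columns=features,
)
guesses = model.predict(new_penguins)   # one guess per row

for measurements, guess in zip(new_penguins.values.tolist(), guesses):
    print(measurements, "->", guess)
```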

Try it 🎯

  1. Change the three numbers and predict again. Tiny penguin? Big one?
  2. Look at a real row (penguins.head()), copy its three measurements into predict, and check if the model gets that species right.

How does KNeighbors decide?

The model you used, KNeighborsClassifier, has a wonderfully simple idea: to label a new penguin, it finds the most similar penguins it already knows (its “neighbors”, 5 of them by default) and goes with the majority. “This new one is closest to a bunch of Gentoos, so… Gentoo.” Similarity, not magic.
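
Here is that idea as a toy sketch in plain Python. It is not scikit-learn's real code, just a simplified version of the same recipe: measure the distance to every known penguin, keep the closest few, and let them vote. The function name nearest_neighbor_vote and the measurements are made up for illustration.

```python
from collections import Counter
import math

def nearest_neighbor_vote(known, new_point, k=5):
    """Toy version of the nearest-neighbors idea (not scikit-learn's real code).

    known: list of (measurements, species) pairs
    new_point: measurements of the penguin to label
    """
    # Sort the known penguins by straight-line distance to the new one.
    by_distance = sorted(known, key=lambda pair: math.dist(pair[0], new_point))
    # Keep the k closest and count their species.
    votes = Counter(species for _, species in by_distance[:k])
    # The most common species among the neighbors wins.
    return votes.most_common(1)[0][0]

# Made-up measurements: (bill mm, flipper mm, mass g)
known_penguins = [
    ((38, 185, 3600), "Adelie"),
    ((39, 190, 3700), "Adelie"),
    ((40, 188, 3800), "Adelie"),
    ((47, 215, 5000), "Gentoo"),
    ((48, 220, 5200), "Gentoo"),
    ((46, 212, 4900), "Gentoo"),
    ((49, 218, 5300), "Gentoo"),
]

guess = nearest_neighbor_vote(known_penguins, (45, 210, 4500), k=3)
print(guess)
```

With k=3, the three closest known penguins to the new one are all Gentoos in this toy list, so the vote comes out Gentoo.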

Think about it 🔮

You trained on bill_length_mm, flipper_length_mm, and body_mass_g. If you gave the model a penguin’s island instead of measurements, could it predict species? (No — it only learned from those three number features. A model can only use the kinds of clues it was trained on.)

Fix the bug 🐞

This trains a model but crashes on predict, because the new penguin is given as a flat list instead of a list-of-lists (the model expects a table of penguins, even if it’s just one):

model.predict([45, 210, 4500])

(Wrap it in another set of brackets — one row inside a table: model.predict([[45, 210, 4500]]).)
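
Why do the extra brackets matter? scikit-learn treats its input as a table with rows and columns. A quick way to see the difference is to check the shape of each version with NumPy (a small sketch; the variable names are just for illustration):

```python
import numpy as np

flat = np.array([45, 210, 4500])        # just a list of three numbers
table = np.array([[45, 210, 4500]])     # a table with one row and three columns

print(flat.shape)    # (3,)
print(table.shape)   # (1, 3)

# Another way to turn a flat list into a one-row table:
reshaped = flat.reshape(1, -1)
print(reshaped.shape)  # (1, 3)
```

The flat version has no rows at all, so the model cannot tell where one penguin ends and the next begins.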

Your mission 🚀

Train the penguin classifier, then test it on three made-up penguins (three different sets of measurements). Print each prediction. Then grab a real penguin from penguins.head() and check whether the model labels it correctly.

What you learned today

  • scikit-learn trains machine-learning models; it comes preinstalled in Colab.
  • Features go in X, labels go in y.
  • model.fit(X, y) is the learning step; model.predict([[...]]) makes a guess on new data.
  • KNeighborsClassifier decides by finding the most similar examples it knows.

You trained a model — but is it any good? Next time we test it honestly and measure its accuracy. 🎯
