Filtering: Keep Only the Rows You Want


A real data detective is always asking “show me only the ones that…”. Only the heavy penguins. Only the Gentoos. Only the ones from a certain island. Today you learn to filter — keep just the rows that match a condition.

💡 Load the data first:

import pandas as pd
import seaborn as sns
# Load the penguins table; .dropna() removes rows with missing measurements.
penguins = sns.load_dataset("penguins").dropna()

A condition picks rows

Remember comparisons from Phase II (>, ==)? Put one inside the brackets, and pandas keeps only the rows where it’s True:

heavy = penguins[penguins["body_mass_g"] > 4500]
print("How many heavy penguins:", len(heavy))
heavy.head()  # last line of the cell, so the notebook displays it

Read penguins[penguins["body_mass_g"] > 4500] as: “from penguins, keep the rows where body_mass_g is over 4500.” The result is a smaller table of just those penguins.
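
To see what pandas is doing behind the scenes, it can help to look at the condition by itself. This is only a sketch on a tiny made-up table (the names and weights are invented for illustration, not real penguin data), small enough to check every row by eye:

```python
import pandas as pd

# A tiny made-up table, NOT the real penguins data.
tiny = pd.DataFrame({
    "name": ["Pip", "Waddles", "Flip", "Snowy"],
    "body_mass_g": [3200, 4800, 4500, 5100],
})

# The condition on its own gives one True/False answer per row.
is_heavy = tiny["body_mass_g"] > 4500
print(is_heavy.tolist())            # [False, True, False, True]

# Putting those True/False answers inside the brackets keeps only the True rows.
heavy_tiny = tiny[is_heavy]
print(heavy_tiny["name"].tolist())  # ['Waddles', 'Snowy']
```

Notice that Flip, at exactly 4500, is left out: > means strictly greater than. Use >= if you want to include 4500 too.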

Try it 🎯

  1. Keep penguins with flipper_length_mm > 210. How many are there?
  2. Keep the light penguins: body_mass_g < 3500.

Filtering by a word

Conditions work on text columns too — great for picking a category:

gentoos = penguins[penguins["species"] == "Gentoo"]
print("Number of Gentoo penguins:", len(gentoos))
gentoos.head()
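
One thing to watch with words: the spelling has to match exactly, capital letters included. A small sketch, again on a made-up table rather than the real data:

```python
import pandas as pd

# A tiny made-up table, NOT the real penguins data.
tiny = pd.DataFrame({
    "name": ["Pip", "Waddles", "Flip", "Snowy"],
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
})

# Exact spelling: two rows match.
gentoos_tiny = tiny[tiny["species"] == "Gentoo"]
print(len(gentoos_tiny))  # 2

# Lowercase "gentoo" is a different word to Python, so nothing matches.
no_match = tiny[tiny["species"] == "gentoo"]
print(len(no_match))      # 0
```

If a filter gives you an empty table, checking the spelling and capitals is a good first step.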

Try it 🎯

Make a table of only the penguins from the island "Biscoe".

Sorting the results

.sort_values(...) puts rows in order. Add ascending=False for biggest-first:

penguins.sort_values("body_mass_g", ascending=False).head()

That shows the five heaviest penguins, with the heaviest at the top. Leave out ascending=False to get smallest-first.
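
Here is a small sketch of both sort directions side by side, using a made-up four-row table (invented names and weights) so the order is easy to check:

```python
import pandas as pd

# A tiny made-up table, NOT the real penguins data.
tiny = pd.DataFrame({
    "name": ["Pip", "Waddles", "Flip", "Snowy"],
    "body_mass_g": [3200, 4800, 4500, 5100],
})

lightest_first = tiny.sort_values("body_mass_g")
heaviest_first = tiny.sort_values("body_mass_g", ascending=False)

print(lightest_first["name"].tolist())  # ['Pip', 'Flip', 'Waddles', 'Snowy']
print(heaviest_first["name"].tolist())  # ['Snowy', 'Waddles', 'Flip', 'Pip']

# .head(2) after sorting keeps just the top two rows.
top_two = heaviest_first.head(2)
print(top_two["name"].tolist())         # ['Snowy', 'Waddles']

# Sorting returns a new table; the original keeps its order.
print(tiny["name"].tolist())            # ['Pip', 'Waddles', 'Flip', 'Snowy']
```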

Combining two conditions

You can require two things at once. In pandas, use & for “and”, | for “or”, and wrap each condition in parentheses:

big_adelies = penguins[(penguins["species"] == "Adelie") & (penguins["body_mass_g"] > 3800)]
big_adelies.head()

⚠️ Two pandas rules: use &/| (not the words and/or), and put parentheses around each part. Forgetting the parentheses is one of the most common filtering bugs.
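
To see the difference between & and |, here is a sketch on a made-up five-row table (the names and numbers are invented for illustration). Each condition gets its own name first, which is just one possible way to keep things readable:

```python
import pandas as pd

# A tiny made-up table, NOT the real penguins data.
tiny = pd.DataFrame({
    "name": ["Pip", "Waddles", "Flip", "Snowy", "Pebble"],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Adelie"],
    "body_mass_g": [3200, 4100, 3700, 5100, 3900],
})

is_adelie = tiny["species"] == "Adelie"
is_big = tiny["body_mass_g"] > 3800

# & keeps rows where BOTH conditions are True.
both = tiny[is_adelie & is_big]
print(both["name"].tolist())    # ['Waddles', 'Pebble']

# | keeps rows where AT LEAST ONE condition is True.
either = tiny[is_adelie | is_big]
print(either["name"].tolist())  # ['Pip', 'Waddles', 'Snowy', 'Pebble']
```

Because each condition is stored in its own variable here, no extra parentheses are needed. When you write everything on one line, the parentheses around each part are still required.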

Predict it 🔮

Roughly, will there be more penguins with body_mass_g > 3000 or with body_mass_g > 5000? Filter both and compare len(...) for each. (Far more over 3000: most penguins weigh more than 3000g, but only the biggest top 5000g.)

Fix the bug 🐞

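This tries to combine two conditions but crashes, because the parentheses are missing. Without them, Python handles & before == and >, so it tries to combine "Adelie" with the body_mass_g column first: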

penguins[penguins["species"] == "Adelie" & penguins["body_mass_g"] > 3800]

(Wrap each condition: penguins[(penguins["species"] == "Adelie") & (penguins["body_mass_g"] > 3800)].)
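
If you want to watch the crash safely, the sketch below catches the error on a made-up three-row table (invented values, not real penguin data). The exact error message may differ between pandas versions, so the code only reports the kind of error:

```python
import pandas as pd

# A tiny made-up table, NOT the real penguins data.
tiny = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3200.0, 4100.0, 5100.0],
})

# Buggy version: no parentheses, so Python tries
# "Adelie" & tiny["body_mass_g"] before anything else.
try:
    tiny[tiny["species"] == "Adelie" & tiny["body_mass_g"] > 3800]
    crashed = False
except (TypeError, ValueError) as err:
    crashed = True
    print("Crashed with", type(err).__name__)

# Fixed version: each condition wrapped in its own parentheses.
fixed = tiny[(tiny["species"] == "Adelie") & (tiny["body_mass_g"] > 3800)]
print(len(fixed))  # 1
```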

Your mission 🚀

Be a penguin detective. In separate cells: (1) find how many penguins weigh over 5000g, (2) make a table of just the Chinstrap penguins, and (3) show the 5 penguins with the longest flippers (sort descending and take .head()).

What you learned today

  • Filter rows with a condition in brackets: penguins[penguins["col"] > value].
  • Conditions work on words too: penguins["species"] == "Gentoo".
  • .sort_values("col", ascending=False) orders the rows.
  • Combine conditions with & / | and parentheses around each part.

Next time we answer “how many of each?” and “what’s the average per group?” — counting and grouping, the heart of data analysis. 📊
