Filtering: Keep Only the Rows You Want
A real data detective is always asking “show me only the ones that…”. Only the heavy penguins. Only the Gentoos. Only the ones from a certain island. Today you learn to filter — keep just the rows that match a condition.
💡 Load the data first:
import pandas as pd
import seaborn as sns
penguins = sns.load_dataset("penguins").dropna()
A condition picks rows
Remember comparisons from Phase II (>, ==)? Put one inside the brackets, and pandas keeps only the rows where it’s True:
heavy = penguins[penguins["body_mass_g"] > 4500]
print("How many heavy penguins:", len(heavy))
heavy.head()
Read penguins[penguins["body_mass_g"] > 4500] as: “from penguins, keep the rows where body_mass_g is over 4500.” The result is a smaller table of just those penguins.
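To see what is happening inside the brackets, it can help to look at the condition on its own. Here is a small sketch on a tiny made-up table (the three rows and the names tiny, mask, and heavy_tiny are invented for illustration, not taken from the real penguins data):

```python
import pandas as pd

# A tiny made-up table (NOT the real penguins data), just for illustration
tiny = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3700, 5200, 4600],
})

# The condition by itself gives one True/False answer per row
mask = tiny["body_mass_g"] > 4500
print(mask.tolist())  # [False, True, True]

# Putting it inside the brackets keeps only the rows marked True
heavy_tiny = tiny[mask]
print(heavy_tiny)
print("Rows kept:", len(heavy_tiny))  # Rows kept: 2
```

So the brackets are really doing two steps: first ask every row a yes/no question, then keep the rows that said yes.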
Try it 🎯
- Keep penguins with flipper_length_mm > 210. How many are there?
- Keep the light penguins: body_mass_g < 3500.
Filtering by a word
Conditions work on text columns too — great for picking a category:
gentoos = penguins[penguins["species"] == "Gentoo"]
print("Number of Gentoo penguins:", len(gentoos))
gentoos.head()
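One thing to watch for: the text has to match exactly, capital letters included. A small sketch on a made-up table (the rows and the names tiny, exact, and wrong_case are invented for illustration):

```python
import pandas as pd

# A tiny made-up table (NOT the real penguins data), just for illustration
tiny = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3700, 5200, 4900],
})

# Exact spelling and capital letters: matches two rows
exact = tiny[tiny["species"] == "Gentoo"]

# Lowercase "g": matches nothing, and pandas does not warn you
wrong_case = tiny[tiny["species"] == "gentoo"]

print(len(exact))       # 2
print(len(wrong_case))  # 0
```

If a filter gives you an empty table, checking the spelling of the word is a good first move.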
Try it 🎯
Make a table of only the penguins from the island "Biscoe".
Sorting the results
.sort_values(...) puts rows in order. Add ascending=False for biggest-first:
penguins.sort_values("body_mass_g", ascending=False).head()
That shows the five heaviest penguins, with the heaviest in the top row. Drop ascending=False for smallest-first.
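Here is a small sketch of both directions side by side, on a made-up table (the rows and the names tiny, smallest_first, and biggest_first are invented for illustration). It also shows that sorting gives you a new table and leaves the original alone:

```python
import pandas as pd

# A tiny made-up table (NOT the real penguins data), just for illustration
tiny = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "body_mass_g": [3700, 5200, 4600, 3100],
})

# Default order is smallest-first
smallest_first = tiny.sort_values("body_mass_g")

# ascending=False flips it to biggest-first
biggest_first = tiny.sort_values("body_mass_g", ascending=False)

print(smallest_first["body_mass_g"].tolist())  # [3100, 3700, 4600, 5200]
print(biggest_first["body_mass_g"].tolist())   # [5200, 4600, 3700, 3100]

# The original table keeps its old order
print(tiny["body_mass_g"].tolist())            # [3700, 5200, 4600, 3100]
```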
Combining two conditions
You can require two things at once. In pandas, use & for “and”, | for “or”, and wrap each condition in parentheses:
big_adelies = penguins[(penguins["species"] == "Adelie") & (penguins["body_mass_g"] > 3800)]
big_adelies.head()
⚠️ Two pandas rules: use & / | (not the words and / or), and put parentheses around each part. Forgetting the parentheses is the most common filtering bug.
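The big_adelies example uses & ("and"). Here is a sketch that puts | ("or") next to it on a made-up table, so you can compare how many rows each one keeps (the rows and the names tiny, both, and either are invented for illustration):

```python
import pandas as pd

# A tiny made-up table (NOT the real penguins data), just for illustration
tiny = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3500, 4000, 5200, 3600],
})

# "and": a row is kept only if BOTH conditions are True
both = tiny[(tiny["species"] == "Adelie") & (tiny["body_mass_g"] > 3800)]

# "or": a row is kept if AT LEAST ONE condition is True
either = tiny[(tiny["species"] == "Adelie") | (tiny["body_mass_g"] > 3800)]

print(len(both))    # 1  (only the 4000 g Adelie)
print(len(either))  # 3  (both Adelies, plus the heavy Gentoo)
```

With the same two conditions, "or" never keeps fewer rows than "and".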
Predict it 🔮
Roughly, will there be more penguins with body_mass_g > 3000 or with body_mass_g > 5000? Filter both and compare the len(...). (Far more over 3000 — most penguins weigh more than 3000g, but only the biggest top 5000g.)
Fix the bug 🐞
This tries to combine two conditions but crashes. It’s missing the parentheses pandas needs:
penguins[penguins["species"] == "Adelie" & penguins["body_mass_g"] > 3800]
(Wrap each condition: penguins[(penguins["species"] == "Adelie") & (penguins["body_mass_g"] > 3800)].)
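Why do the parentheses matter? Python works out & before it works out == and >, so without them it tries to combine "Adelie" with the body mass column first. One way to make this bug harder to hit, shown here as a sketch on a made-up table, is to give each condition its own name before combining them (the rows and the names tiny, is_adelie, is_big, and big_adelies_tiny are invented for illustration):

```python
import pandas as pd

# A tiny made-up table (NOT the real penguins data), just for illustration
tiny = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3500, 4000, 5200],
})

# Each condition is finished on its own line, so there is nothing to wrap
is_adelie = tiny["species"] == "Adelie"
is_big = tiny["body_mass_g"] > 3800

# Combine the two named conditions with &
big_adelies_tiny = tiny[is_adelie & is_big]
print(big_adelies_tiny)
print(len(big_adelies_tiny))  # 1
```

This is just another way to write the same filter. The version with parentheses on one line works equally well.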
Your mission 🚀
Be a penguin detective. In separate cells: (1) find how many penguins weigh over 5000g, (2) make a table of just the Chinstrap penguins, and (3) show the 5 penguins with the longest flippers (sort descending and take .head()).
What you learned today
- Filter rows with a condition in brackets: penguins[penguins["col"] > value].
- Conditions work on words too: penguins["species"] == "Gentoo".
- .sort_values("col", ascending=False) orders the rows.
- Combine conditions with & / | and parentheses around each part.
Next time we answer “how many of each?” and “what’s the average per group?” — counting and grouping, the heart of data analysis. 📊