Counting and Grouping

Answer the big questions about data: how many of each kind? What's the average per group? Meet value_counts and groupby, the heart of data analysis.

Now for the questions data is really good at: How many of each kind are there? Which group is biggest, heaviest, fastest? Today you learn the two tools that answer them: value_counts and groupby. This is the heart of what data scientists do.

💡 Load the data first:

import pandas as pd
import seaborn as sns
penguins = sns.load_dataset("penguins").dropna()

How many of each? value_counts

.value_counts() counts how many times each value appears in a column:

penguins["species"].value_counts()

In one line, you learn how many Adelie, Gentoo, and Chinstrap penguins are in the data. Try it on another category:

penguins["island"].value_counts()
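If you want to see value_counts work without loading the full dataset, here is a small sketch. The six-row Series below is made up for illustration (it is not the real penguins data). It also shows the optional normalize=True argument, which returns shares instead of raw counts.

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the penguins "species" column.
species = pd.Series(
    ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"],
    name="species",
)

counts = species.value_counts()                # raw counts, most common first
shares = species.value_counts(normalize=True)  # fractions that add up to 1

print(counts)
print(shares.round(2))
```

Notice that the most common value comes first, so the top row answers "which kind is most common?" right away.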

Try it 🎯

Count how many penguins of each sex there are.

Average per group: groupby

Here’s the powerful one. groupby splits the data into groups, then computes something for each group. “What’s the average body mass for each species?”

penguins.groupby("species")["body_mass_g"].mean()

Read it left to right: take penguins, group by species, look at the body_mass_g column, and find the mean (average) of each group. The result is one number per species, so you can instantly see which species is heaviest on average.
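To see what groupby is doing for you, it can help to do the same job by hand once. The sketch below uses a made-up four-row table (not the real penguins data) and compares the one-line groupby with a loop that splits the rows, averages each piece, and collects the results.

```python
import pandas as pd

# Hypothetical toy data; the real penguins table has many more rows.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3500, 3900, 5000, 5400],
})

# One line with groupby ...
by_group = toy.groupby("species")["body_mass_g"].mean()

# ... versus doing the same steps by hand.
by_hand = {}
for name in toy["species"].unique():
    rows = toy[toy["species"] == name]          # split: keep one species
    by_hand[name] = rows["body_mass_g"].mean()  # apply the mean, then store it

print(by_group)
print(by_hand)
```

Both approaches give one average per species. The groupby version just does it in a single line.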

Try it 🎯

  1. Average flipper_length_mm per species.
  2. Average bill_length_mm per island.

Other things per group

You can ask for more than the average. Swap .mean() for:

  • .max() — the biggest in each group
  • .min() — the smallest
  • .count() — how many in each group

penguins.groupby("species")["body_mass_g"].max()
penguins.groupby("island")["species"].count()
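If you want several of these answers at once, one possible approach is .agg() with a list of names. This is a sketch on a made-up five-row table (not the real penguins data), and .agg() is an extra tool that this lesson does not otherwise cover.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3500, 3900, 5000, 5400, 5200],
})

# Ask for the smallest, the biggest, and the count in one go.
summary = toy.groupby("species")["body_mass_g"].agg(["min", "max", "count"])

print(summary)
```

The result is a small table with one row per species and one column per question.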

Predict it 🔮

Gentoo penguins are the largest species. So in penguins.groupby("species")["body_mass_g"].mean(), which species do you expect to have the highest number? Run it and check. (Gentoo, by a lot. Their average body mass is well above the other two.)

Fix the bug 🐞

This is meant to find the average body mass per species, but in pandas 2.0 and later it raises a TypeError (older versions only warn and quietly skip the text columns). The column to average is missing, so you have to say which one:

penguins.groupby("species").mean()

(Tell it which column to average: penguins.groupby("species")["body_mass_g"].mean(). Without a column, pandas tries to average every column, including text columns such as island and sex, and text has no average.)
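Picking one column is the fix this lesson uses. If you ever want the average of every numeric column at once, another option is the numeric_only=True argument, which tells pandas to skip text columns. The sketch below uses a made-up four-row table (not the real penguins data).

```python
import pandas as pd

# Hypothetical toy data with one text column ("island") mixed in.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Dream", "Biscoe", "Biscoe"],
    "body_mass_g": [3500, 3900, 5000, 5400],
    "flipper_length_mm": [181, 190, 217, 221],
})

# Option 1: pick the column first (the fix from this lesson).
one_column = toy.groupby("species")["body_mass_g"].mean()

# Option 2: average every numeric column and skip the text ones.
all_numeric = toy.groupby("species").mean(numeric_only=True)

print(one_column)
print(all_numeric)
```

Option 1 gives one number per species. Option 2 gives a table with one column for each numeric measurement.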

Your mission 🚀

Investigate the penguins. In separate cells: (1) count how many penguins live on each island, (2) find the average flipper length per species, and (3) find the body mass of the heaviest penguin in each species using .max(). Then write a sentence (as a print) saying which species is the heaviest on average.

What you learned today

  • .value_counts() counts how many of each value are in a column.
  • groupby("col")["other"].mean() computes a value per group — the analysis workhorse.
  • Swap .mean() for .max(), .min(), or .count() to ask different questions.
  • These few lines answer questions that would take a long loop to do by hand.

You can find answers now. Next time, we start turning them into pictures — your first chart. 📈
