Counting and Grouping


Now for the questions data is really good at: How many of each kind are there? Which group is biggest, heaviest, fastest? Today you learn the two tools that answer them: value_counts and groupby. This is the heart of what data scientists do.


💡 Load the data first:

import pandas as pd
import seaborn as sns
# Load the penguins table and drop any rows that have missing values
penguins = sns.load_dataset("penguins").dropna()

How many of each? value_counts

.value_counts() counts how many times each value appears in a column:

penguins["species"].value_counts()

In one line, you learn how many Adelie, Gentoo, and Chinstrap penguins are in the data. Try it on another category:

penguins["island"].value_counts()
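To see what value_counts hands back, here is a minimal sketch on a tiny made-up table (the six rows are invented for illustration, not taken from the penguins data). The result is a pandas Series with the most common value first, so you can look up one count by its label. The normalize=True option, a standard value_counts parameter, gives fractions of the total instead of raw counts.

```python
import pandas as pd

# Made-up mini table, invented for illustration only (not the real penguins data)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"],
})

counts = toy["species"].value_counts()
print(counts)            # most common value comes first

# Look up a single count by its label
print(counts["Adelie"])  # 3

# normalize=True gives fractions of the total instead of raw counts
shares = toy["species"].value_counts(normalize=True)
print(shares["Adelie"])  # 0.5
```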

Try it 🎯

Count how many penguins of each sex there are.

Average per group: groupby

Here’s the powerful one. groupby splits the data into groups, then computes something for each group. For example, to answer “What’s the average body mass for each species?”:

penguins.groupby("species")["body_mass_g"].mean()

Read it left to right: take penguins, group by species, look at the body_mass_g column, and find the mean (average) of each group. The result is one number per species, so you can instantly see which species is heaviest on average.
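If the one-liner feels like magic, it can help to do the same job the long way. The sketch below uses a small made-up table (the numbers are invented, not real penguin measurements) and compares a hand-written loop with groupby. Both should give the same average per species.

```python
import pandas as pd

# Made-up mini table, invented for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3500, 3900, 5000, 5200, 3700],
})

# The long way: split into groups by hand, then average each one
by_hand = {}
for name in toy["species"].unique():
    rows = toy[toy["species"] == name]          # keep one species
    by_hand[name] = rows["body_mass_g"].mean()  # average that group

# The short way: groupby does the split and the average in one line
by_group = toy.groupby("species")["body_mass_g"].mean()

print(by_hand)
print(by_group)
```

The loop and the groupby line answer the same question. groupby just saves you from writing the splitting step yourself.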

Try it 🎯

  1. Average flipper_length_mm per species.
  2. Average bill_length_mm per island.

Other things per group

You can ask for more than the average. Swap .mean() for:

  • .max(): the biggest in each group
  • .min(): the smallest
  • .count(): how many in each group

penguins.groupby("species")["body_mass_g"].max()
penguins.groupby("island")["species"].count()
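Here is a small sketch of the three swaps side by side, on a made-up table (values invented for illustration). One body mass is left missing on purpose to show a detail: .count() counts the values that are actually filled in. In this lesson the data was loaded with .dropna(), so the difference does not show up there.

```python
import pandas as pd

# Made-up mini table, invented for illustration; one mass is missing on purpose
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3500, 3900, 5000, None, 5400],
})

grouped = toy.groupby("species")["body_mass_g"]

heaviest = grouped.max()    # biggest value in each group
lightest = grouped.min()    # smallest value in each group
measured = grouped.count()  # non-missing values in each group

print(heaviest)
print(lightest)
print(measured)  # Gentoo shows 2, not 3, because one mass is missing
```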

Predict it 🔮

Gentoo penguins are the largest species. So in penguins.groupby("species")["body_mass_g"].mean(), which species do you expect to have the highest number? Run it and check. (Gentoo, by a lot. Their average body mass is well above the other two.)

Fix the bug 🐞

This is meant to find the average body mass per species, but no column is picked, so pandas tries to average every column, including text ones like island and sex. In recent pandas versions (2.0 and later) that raises an error:

penguins.groupby("species").mean()

(Tell it which column to average: penguins.groupby("species")["body_mass_g"].mean(). Without a column, pandas tries to average text columns such as island and sex, and text has no average.)
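You can reproduce the bug safely on a tiny made-up table (invented values, with one text column standing in for island). This is a sketch, and what you see depends on your pandas version: recent versions raise an error, while some older ones quietly skip the text column.

```python
import pandas as pd

# Made-up mini table with one text column besides the grouping column
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Dream", "Biscoe", "Biscoe"],
    "body_mass_g": [3500, 3900, 5000, 5200],
})

try:
    toy.groupby("species").mean()  # tries to average "island" too
    print("No error: this pandas version handled the text column quietly")
except Exception as err:
    print("Error:", type(err).__name__, err)

# The fix: pick the column to average
fixed = toy.groupby("species")["body_mass_g"].mean()
print(fixed)
```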

Your mission 🚀

Investigate the penguins. In separate cells: (1) count how many penguins live on each island, (2) find the average flipper length per species, and (3) find the heaviest penguin in each species using .max(). Then write a sentence (as a print) saying which species is the heaviest on average.

What you learned today

  • .value_counts() counts how many of each value are in a column.
  • groupby("col")["other"].mean() computes a value per group. It is the analysis workhorse.
  • Swap .mean() for .max(), .min(), or .count() to ask different questions.
  • These few lines answer questions that would take a long loop to do by hand.

You can find answers now. Next time, we start turning them into pictures with your first chart. 📈
