Meet a Real Dataset


Time for the real thing. Today you load an actual dataset — measurements of real penguins from Antarctica — and take your first look around. The tool for this is pandas, the most popular way to work with data in Python. It’s already installed in Colab.


💡 Type each code block into a cell in your notebook and run it with Shift + Enter.

Loading the penguins

Run this in a cell:

import pandas as pd
import seaborn as sns

penguins = sns.load_dataset("penguins")
penguins = penguins.dropna()
penguins.head()

A few things happened:

  • import pandas as pd brings in the data tool (we nickname it pd).
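  • import seaborn as sns brings in seaborn, a plotting library that also comes with a loader for a few practice datasets (we nickname it sns).
  • sns.load_dataset("penguins") downloads the penguins practice dataset (this needs an internet connection) and hands it back as a table.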
  • .dropna() throws out any rows with missing measurements (some penguins weren’t fully measured), so our data is clean.
  • .head() shows the first 5 rows.
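To see what .dropna() does on a small scale, here is a minimal sketch with a tiny hand-made table. The three rows and their values are made up for illustration; they are not taken from the penguins dataset.

```python
import pandas as pd

# A made-up table: the second row is missing its body mass.
tiny = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, None, 3800.0],
})

print("Before:", len(tiny), "rows")

# Keep only the rows that have no missing values.
cleaned = tiny.dropna()
print("After:", len(cleaned), "rows")
print(cleaned)
```

Note that .dropna() returns a new table and leaves the old one alone, which is why the lesson code saves the result back into penguins.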

You should see a neat table with columns like species, island, bill_length_mm, flipper_length_mm, body_mass_g, and sex. Real data about real animals!

💡 You can load any CSV from the web the same way: pd.read_csv("https://...some-file.csv"). load_dataset is just a shortcut for a few practice datasets.
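If you want to try pd.read_csv without hunting for a file, the sketch below feeds it a few lines of CSV text instead of a URL. The text is made up for illustration, and io.StringIO is only a stand-in that makes a string behave like a file.

```python
import io
import pandas as pd

# Stand-in for a CSV file: this text is made up for illustration.
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Gentoo,Biscoe,5000
"""

# pd.read_csv accepts a file path, a URL, or a file-like object like this one.
small = pd.read_csv(io.StringIO(csv_text))

print(small)
print("Rows and columns:", small.shape)
```

With a real file you would pass its path or web address in place of io.StringIO(csv_text).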

A DataFrame is a smart spreadsheet

What you loaded is called a DataFrame. Think of it as a spreadsheet: rows (one per penguin) and columns (one per measurement). Each column is like a labeled list — exactly the lists and labels you learned in Phase II, now holding hundreds of values at once.
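To make the "labeled list" idea concrete, here is a small sketch that builds a DataFrame from two ordinary Python lists. The names and numbers are invented for illustration, and mini is just a hypothetical variable name.

```python
import pandas as pd

# Two ordinary Python lists (made-up values for illustration).
names = ["Pip", "Waddles", "Skipper"]
flipper_mm = [181, 195, 210]

# Each dictionary key becomes a column label; each list becomes that column.
mini = pd.DataFrame({"name": names, "flipper_length_mm": flipper_mm})

print(mini)

# A column can be turned back into a plain list.
print(list(mini["flipper_length_mm"]))
```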

How big is it?

print("Rows and columns:", penguins.shape)
print("Number of penguins:", len(penguins))
print("The columns are:", list(penguins.columns))

  • .shape gives (rows, columns): the size of the table.
  • len(...) counts the rows (the penguins).
  • .columns lists the column names.

You’re looking at a few hundred penguins, each with several measurements. That’s more data than you’d ever want to type by hand — and pandas makes it easy.

Peeking at the data

A few ways to look:

penguins.head(10)    # first 10 rows
penguins.tail()      # last 5 rows
penguins.sample(5)   # 5 random rows

Try it 🎯

  1. Show the first 3 rows with penguins.head(3).
  2. Run penguins.sample(5) a few times — different penguins each time.
  3. Print just the number of columns: penguins.shape[1].

A quick summary

pandas can summarize all the number columns at once:

penguins.describe()

This shows the count, average (mean), smallest (min), and largest (max) for each numeric column, plus the spread (std) and the 25%, 50%, and 75% marks. Glance at the body_mass_g column: its min and max rows give the range of penguin weights in grams. (We’ll make our own summaries soon.)

Predict it 🔮

Before running, what do you think penguins.shape looks like — one number or two? (It’s two numbers in parentheses: rows first, then columns, like (333, 7).)

Fix the bug 🐞

This cell errors with a KeyError. The column name doesn’t match — capitals and spelling must be exact:

penguins["Species"].head()

(The column is species (lowercase). Use penguins["species"].head(). Column names are case-sensitive.)
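One way to dodge this kind of KeyError is to check the exact column names before using one. The sketch below uses a made-up one-row table as a stand-in for penguins, so it runs on its own.

```python
import pandas as pd

# A made-up one-row table standing in for the penguins DataFrame.
df = pd.DataFrame({"species": ["Adelie"], "island": ["Torgersen"]})

print(list(df.columns))          # see the exact spelling of every column
print("Species" in df.columns)   # capital S: not a column
print("species" in df.columns)   # lowercase: a column

# Asking for a name that is not there raises KeyError.
try:
    df["Species"]
except KeyError as err:
    print("KeyError for:", err)
```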

Your mission 🚀

Load the penguins (with .dropna()), then in separate cells: print how many penguins there are, print the list of columns, and show 5 random penguins. You’re now officially holding real data.

What you learned today

  • pandas (import pandas as pd) is Python’s data tool, and it comes preinstalled in Colab.
  • A DataFrame is a smart spreadsheet: rows and columns, each column like a labeled list.
  • .head(), .tail(), .sample() peek at rows; .shape, len(), .columns describe the size; .describe() summarizes the numbers.
  • .dropna() cleans out rows with missing values.

Next time, we’ll pull out exactly the columns and rows we care about — the first move of every data detective. 🔎
