Meet a Real Dataset
Time for the real thing. Today you load an actual dataset — measurements of real penguins from Antarctica — and take your first look around. The tool for this is pandas, the most popular way to work with data in Python. It’s already installed in Colab.
💡 Type each code block into a cell in your notebook and run it with Shift + Enter.
Loading the penguins
Run this in a cell:
import pandas as pd
import seaborn as sns
penguins = sns.load_dataset("penguins")
penguins = penguins.dropna()
penguins.head()
A few things happened:
- import pandas as pd brings in the data tool (we nickname it pd).
- sns.load_dataset("penguins") fetches a built-in dataset.
- .dropna() throws out any rows with missing measurements (some penguins weren’t fully measured), so our data is clean.
- .head() shows the first 5 rows.
You should see a neat table with columns like species, island, bill_length_mm, flipper_length_mm, body_mass_g, and sex. Real data about real animals!
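To see what .dropna() does on its own, here is a small sketch using a tiny hand-made table. The values are made up for illustration and are not real penguin measurements; the names tiny and cleaned are just placeholders.

```python
import pandas as pd

# Made-up values for illustration only (not real penguin measurements).
tiny = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3700.0, None, 3800.0],  # None marks a missing measurement
})

# Keep only the rows where every column has a value.
cleaned = tiny.dropna()

print("Rows before:", len(tiny))
print("Rows after:", len(cleaned))
print(cleaned)
```

The row with the missing weight disappears, and the other rows are left untouched.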
💡 You can load any CSV from the web the same way:
pd.read_csv("https://...some-file.csv")
load_dataset is just a shortcut for a few practice datasets.
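If you want to try read_csv without a real web address, one option is to hand it CSV text directly. This sketch assumes a tiny made-up CSV held in a string; io.StringIO makes the string behave like a file, which pd.read_csv also accepts.

```python
import io
import pandas as pd

# A tiny made-up CSV, written as text (illustrative values only).
csv_text = """species,flipper_length_mm
Adelie,181
Gentoo,217
"""

# io.StringIO lets read_csv treat the text like a file.
small = pd.read_csv(io.StringIO(csv_text))

print(small)
print(list(small.columns))
```

The first line of the CSV becomes the column names, and each later line becomes one row.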
A DataFrame is a smart spreadsheet
What you loaded is called a DataFrame. Think of it as a spreadsheet: rows (one per penguin) and columns (one per measurement). Each column is like a labeled list — exactly the lists and labels you learned in Phase II, now holding hundreds of values at once.
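One way to see the "labeled lists" idea is to build a small DataFrame by hand from a dictionary of lists. This is only a sketch with made-up values; the name table is a placeholder.

```python
import pandas as pd

# Each key is a column label; each list holds that column's values.
# Made-up values for illustration only.
table = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "flipper_length_mm": [181, 217, 195],
})

# Pull out one column by its label, then turn it back into a plain list.
flippers = table["flipper_length_mm"]
print(flippers.tolist())
```

The penguins DataFrame works the same way, just with many more rows.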
How big is it?
print("Rows and columns:", penguins.shape)
print("Number of penguins:", len(penguins))
print("The columns are:", list(penguins.columns))
- .shape gives (rows, columns): how big the table is.
- len(...) counts the rows (the penguins).
- .columns lists the column names.
You’re looking at a few hundred penguins, each with several measurements. That’s more data than you’d ever want to type by hand — and pandas makes it easy.
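Because .shape is a pair of numbers, you can unpack it into two variables. A minimal sketch on a made-up table (the names table, rows, and cols are placeholders):

```python
import pandas as pd

# Made-up values for illustration only.
table = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "body_mass_g": [3700, 5000, 3800, 3600],
})

# .shape is (rows, columns), so it unpacks into two names.
rows, cols = table.shape

print("Rows:", rows)
print("Columns:", cols)
print("Same as len()?", rows == len(table))
```

The same unpacking works on penguins.shape.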
Peeking at the data
A few ways to look:
penguins.head(10) # first 10 rows
penguins.tail() # last 5 rows
penguins.sample(5) # 5 random rows
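.sample() picks different rows on each run. If you ever want the same "random" rows every time (say, to compare notes with a friend), pandas accepts a random_state number. A small sketch on a made-up table:

```python
import pandas as pd

# Made-up values for illustration only.
table = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie", "Gentoo"],
    "body_mass_g": [3700, 5000, 3800, 3600, 5200],
})

# The same random_state gives the same rows on every run.
first_pick = table.sample(2, random_state=0)
second_pick = table.sample(2, random_state=0)

print(first_pick)
print("Same rows both times?", first_pick.equals(second_pick))
```

Any whole number works as the random_state; 0 here is an arbitrary choice.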
Try it 🎯
- Show the first 3 rows with penguins.head(3).
- Run penguins.sample(5) a few times: different penguins each time.
- Print just the number of columns: penguins.shape[1].
A quick summary
pandas can summarize all the number columns at once:
penguins.describe()
This shows the count, average (mean), smallest (min), and largest (max) for each numeric column, along with the spread (std) and the quartiles (25%, 50%, 75%). Glance at the body_mass_g column: its min and max rows give the range of penguin weights in grams. (We’ll make our own summaries soon.)
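The result of .describe() is itself a table, so you can pull a single number out of it. This sketch uses a made-up table and .loc[row, column] to grab one cell; summary and mean_mass are placeholder names.

```python
import pandas as pd

# Made-up values for illustration only.
table = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3000.0, 4000.0, 5000.0],
})

# describe() summarizes only the number columns by default.
summary = table.describe()
print(summary)

# Rows are the statistics, columns are the measurements.
mean_mass = summary.loc["mean", "body_mass_g"]
print("Average weight in grams:", mean_mass)
```

Notice that the text column species is left out of the summary.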
Predict it 🔮
Before running, what do you think penguins.shape looks like — one number or two? (It’s two numbers in parentheses: rows first, then columns, like (333, 7).)
Fix the bug 🐞
This cell errors with a KeyError. The column name doesn’t match — capitals and spelling must be exact:
penguins["Species"].head()
(The column is species (lowercase). Use penguins["species"].head(). Column names are case-sensitive.)
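When you are not sure how a column is spelled, you can check before you ask for it. The sketch below uses a made-up table and a hypothetical helper called find_column (it is not part of pandas) that looks for a match while ignoring capitals.

```python
import pandas as pd

# Made-up values for illustration only.
table = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "body_mass_g": [3700, 5000],
})

def find_column(df, name):
    """Hypothetical helper: return the real column name that matches
    `name` when capitals are ignored, or None if nothing matches."""
    for col in df.columns:
        if col.lower() == name.lower():
            return col
    return None

print("Species" in table.columns)      # exact match only
print(find_column(table, "Species"))   # the correctly spelled name, if any
```

Printing list(penguins.columns) first is often the quickest way to avoid a KeyError.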
Your mission 🚀
Load the penguins (with .dropna()), then in separate cells: print how many penguins there are, print the list of columns, and show 5 random penguins. You’re now officially holding real data.
What you learned today
- pandas (import pandas as pd) is Python’s data tool, built into Colab.
- A DataFrame is a smart spreadsheet: rows and columns, each column like a labeled list.
- .head(), .tail(), .sample() peek at rows; .shape, len(), .columns describe the size; .describe() summarizes the numbers.
- .dropna() cleans out rows with missing values.
Next time, we’ll pull out exactly the columns and rows we care about — the first move of every data detective. 🔎