Open-ended assignments#
In Homework 1, Homework 4, and the Final Project, you will pick your own dataset(s). For each:
Use at least one dataset that you aren’t familiar with.
Using data from a primary source is preferred.
It should have between one thousand and one million rows.
If it’s larger than that, you can make it smaller.
Finding a dataset available in CSV or JSON is recommended, though pandas can read other formats.
It’s ok if you pick the same dataset as another student, as long as you’re following the Academic Integrity rules.
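Once a dataset is loaded, the row-count guideline above can be checked in a couple of lines. The following is a minimal sketch assuming pandas; the stand-in DataFrame, the `check_row_count` helper, and the sample size are all made up for illustration and are not part of the assignment.

```python
import pandas as pd

MIN_ROWS = 1_000
MAX_ROWS = 1_000_000


def check_row_count(df: pd.DataFrame) -> str:
    """Describe where a DataFrame falls relative to the row-count guideline."""
    n = len(df)
    if n < MIN_ROWS:
        return "too small"
    if n > MAX_ROWS:
        return "too large"
    return "ok"


# Stand-in data for illustration; replace with your own dataset.
df = pd.DataFrame({"id": range(5_000), "value": [i % 7 for i in range(5_000)]})
print(check_row_count(df))

# If a dataset is too large, one option is a reproducible random sample.
smaller = df.sample(n=2_000, random_state=0)
print(len(smaller))
```

Random sampling is only one way to shrink a dataset; filtering to the rows you actually need is often a better fit.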
Read like a blog post#
For these assignments:
Pretend you’re explaining to a peer who hasn’t taken this class. You don’t need to teach them to code, but they should be able to follow what’s going on.
Walk the reader through what you’re doing in every step and what they should be taking away from it.
You are more than welcome to inject some personality; it doesn’t need to be dry.
Use text cells with Markdown for formatting.
You’ll need to change the cell type to Markdown.
Open data portals#
There are countless places to get data. Below are some examples.
Primary sources#
Local:
Scout can be used to find datasets with certain columns
U.S. Federal:
Organisation for Economic Co-operation and Development (OECD)
Secondary sources#
Directories of open data portals#
Housing data#
Inspiration#
For starters, see the Final Project examples from past semesters.
It’s probably not realistic to make visualizations as fancy as these ones made by professionals, but they may give you ideas. Some also include links/downloads of the source data.
Storing data#
To work with uploaded files in Google Colab, you have a lot of options. A few recommended options:
Direct upload#
Fewer steps, but your file(s) will disappear when your session ends.
In the Google Colab sidebar, click the Files icon (A).
Click the upload button (B).
Select your file.
Google Drive#
More steps, but your file(s) are preserved between sessions.
These steps expand on the instructions to mount Google Drive locally.

Upload the file(s) somewhere in Drive.
In the Google Colab sidebar, click the Files icon (A).
Click the Mount Drive icon (B).
You may need to run the code it injects to authorize it (C).
Think of this as attaching your Drive to your Google Colab instance, as if you were plugging in a USB flash drive.
Read the file in Python (below).
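For reference, the code that the Mount Drive button injects is typically a call to `drive.mount()` from the `google.colab` package. The sketch below wraps that call in a hypothetical `mount_drive` helper (not something Colab provides) so the cell also runs, and simply skips the mount, outside of Colab.

```python
def mount_drive(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive when running in Colab; return whether it was mounted."""
    try:
        from google.colab import drive  # only importable inside Google Colab
    except ImportError:
        print("Not running in Google Colab; skipping the mount.")
        return False
    drive.mount(mount_point)  # prompts for authorization the first time
    return True


mounted = mount_drive()
```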
Reading files in Python#
Navigate to the file (D).
You may need to click into content, then drive.
Next to the filename, click the three dots.
Click Copy path (E).
The value should be something like /content/drive/My Drive/....
Use this path with read_csv() (F).
Google Colab cannot access files on your local machine; in other words, the path shouldn’t start with C:\ or anything like that. More info about file paths.
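The steps above can be sketched as follows. This is a minimal illustration assuming pandas: the `looks_like_local_windows_path` helper and both example paths are hypothetical, and a temporary file stands in for a file in the Colab file system so the cell runs anywhere.

```python
import tempfile
from pathlib import Path, PureWindowsPath

import pandas as pd


def looks_like_local_windows_path(path: str) -> bool:
    """True for paths that start with a Windows drive letter, which Colab cannot see."""
    return PureWindowsPath(path).drive != ""


# Hypothetical paths, for illustration only.
print(looks_like_local_windows_path(r"C:\Users\me\data.csv"))
print(looks_like_local_windows_path("/content/example.csv"))

# Stand-in for a file in the Colab file system: write a tiny CSV to a
# temporary folder, then read it back by its path with read_csv().
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "example.csv"
    csv_path.write_text("city,population\nA,100\nB,250\n")
    df = pd.read_csv(csv_path)

print(df.shape)
```

In your own notebook, you would pass the path copied from the sidebar to `pd.read_csv()` instead of the temporary file.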
InteractiveSheet#
You can import your data into a Google Sheet and read it directly from Colab.
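from google.colab import sheets
# Paste the URL of your Google Sheet between the quotes.
sheet_url = "…"
# Treat the first row of the sheet as column names.
sheet = sheets.InteractiveSheet(url=sheet_url, include_column_headers=True)
# Convert the sheet's contents to a pandas DataFrame.
df = sheet.as_df()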
Reducing data size#
You can make data smaller before uploading by filtering it through:
The data portal, if it supports it
This makes the download faster, including only the data you need.
Instructions for Socrata-based portals:
The $limit parameter (or equivalent), if using an API
In a spreadsheet program
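As a rough sketch of the `$limit` approach, a Socrata-style CSV endpoint can be capped by adding the parameter to the URL's query string. The `socrata_csv_url` helper, the portal domain, and the dataset ID below are hypothetical placeholders; check your portal's API documentation for the real endpoint.

```python
from urllib.parse import urlencode


def socrata_csv_url(domain: str, dataset_id: str, limit: int) -> str:
    """Build a CSV download URL that returns at most `limit` rows."""
    query = urlencode({"$limit": limit}, safe="$")  # keep "$" readable
    return f"https://{domain}/resource/{dataset_id}.csv?{query}"


# Hypothetical portal domain and dataset ID, for illustration only.
url = socrata_csv_url("data.example.gov", "abcd-1234", limit=5000)
print(url)
```

With a real domain and dataset ID, the resulting URL can be passed straight to `pd.read_csv()`.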