Open-ended assignments#
In Homework 1, Homework 4, and the Final Project, you will pick your own dataset(s).
Use at least one dataset that you aren’t familiar with.
Using data from a primary source is preferred.
It should have between one thousand and one million rows.
If it’s larger than that, you can make it smaller.
Finding a dataset available in CSV or JSON is recommended, though pandas can read other formats.
It’s ok if you pick the same dataset as another student, as long as you’re following the Academic Integrity rules.
If you’d be interested in working with SIPA alumni employment data, reach out to the instructor.
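As a rough illustration of the size and format guidelines above, here is a minimal sketch that loads a dataset with pandas, checks whether its row count falls between one thousand and one million, and samples it down if it is too large. The helper names (`size_status`, `fit_to_size_limits`) and the stand-in data are made up for this example; they are not part of pandas or of the course materials.

```python
import io

import pandas as pd

MIN_ROWS = 1_000
MAX_ROWS = 1_000_000


def size_status(df, min_rows=MIN_ROWS, max_rows=MAX_ROWS):
    """Report whether the row count falls inside the suggested range."""
    n = len(df)
    if n < min_rows:
        return "too small"
    if n > max_rows:
        return "too large"
    return "ok"


def fit_to_size_limits(df, max_rows=MAX_ROWS, seed=0):
    """Return df unchanged if it is small enough, else a random sample of max_rows rows."""
    if len(df) > max_rows:
        return df.sample(n=max_rows, random_state=seed)
    return df


# Stand-in data so the sketch runs anywhere. With a real file you would call
# pd.read_csv("MY_FILENAME.csv") or pd.read_json("MY_FILENAME.json") instead.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(2_500))
df = pd.read_csv(io.StringIO(csv_text))
print(len(df), size_status(df))

# Pretend the cap is 2,000 rows to show the down-sampling step.
smaller = fit_to_size_limits(df, max_rows=2_000)
print(len(smaller), size_status(smaller))
```

Random sampling is only one way to shrink a dataset; filtering to the rows you actually need is often a better fit for the analysis.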
Open data portals#
There are countless places to get data, notably:
Local:
Scout can be used to find datasets with certain columns
U.S. Federal:
Lists of open data portals:
Inspiration#
For starters, see the Final Project examples from past semesters.
It’s probably not realistic to make visualizations as fancy as these ones made by professionals, but they may give you ideas. Some also include links to, or downloads of, the source data.
Storing data#
To work with uploaded files in Google Colab, you have two options.
Direct upload#
Fewer steps, but your file(s) will disappear when your session ends.

In the Google Colab sidebar, click the Files icon (A).
Click the upload button (B).
Select your file.
You’ll use read_csv("MY_FILENAME.csv") in your code.
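Because directly uploaded files vanish when the session ends, a later `read_csv()` call can fail with a confusing error. The sketch below wraps the call in a hypothetical helper, `read_uploaded_csv`, that gives a clearer message when the file is missing. It assumes the upload lands in the notebook's working directory; the temporary file only stands in for an upload so that the example runs outside Colab.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_uploaded_csv(filename):
    """Read a CSV, raising a clearer error if the file is not there."""
    path = Path(filename)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found. If your Colab session restarted, upload the file again."
        )
    return pd.read_csv(path)


# Stand-in for an uploaded file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "MY_FILENAME.csv"
    demo_file.write_text("name,score\nA,1\nB,2\n")
    df = read_uploaded_csv(demo_file)

print(df.head())
```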
Google Drive#
More steps, but your file(s) are preserved between sessions.
Upload the file(s) somewhere in Drive.
In the Google Colab sidebar, click the Files icon (A).
Click the Mount Drive icon (B).
You may need to run the code it injects to authorize it (C).
Think of this as attaching your Drive to your Google Colab instance, as if you were plugging in a USB flash drive.
Navigate to the file (D).
You may need to click into content, then drive.
Next to the filename, click the three dots.
Click Copy path (E).
The value should be something like /content/drive/My Drive/...
Use this path with read_csv() (F).
Google Colab cannot access the file on your local machine; in other words, the path shouldn’t start with C:\\ or anything like that. More info about file paths.
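The mounting and path steps above can be sketched in code. The code that the Mount Drive button injects likely resembles the `drive.mount()` call below, which only works inside Google Colab, so it is guarded here. The two path-checking helpers are hypothetical, written for this example as a rough heuristic for catching a local path before passing it to `read_csv()`. The file name `data.csv` is also made up.

```python
import re


def looks_like_local_path(path):
    """Heuristic: True for Windows drive-letter paths or typical home folders."""
    is_windows_drive = re.match(r"^[A-Za-z]:[\\/]", path) is not None
    is_home_folder = path.startswith(("/Users/", "/home/"))
    return is_windows_drive or is_home_folder


def looks_like_drive_path(path):
    """True if the path points inside the mounted Drive folder."""
    return path.startswith("/content/drive/")


try:
    from google.colab import drive  # only importable inside Google Colab

    drive.mount("/content/drive")
except ImportError:
    print("Not running in Colab; skipping drive.mount().")

# Hypothetical paths for illustration.
for candidate in ["C:\\Users\\me\\data.csv", "/content/drive/My Drive/data.csv"]:
    if looks_like_local_path(candidate):
        print(candidate, "-> local path; Colab cannot read this")
    elif looks_like_drive_path(candidate):
        print(candidate, "-> Drive path; pass it to read_csv()")
```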
Reducing data size#
You can make data smaller before uploading by filtering it:
Through the data portal, if it supports it
This makes the download faster, including only the data you need.
With the $limit parameter (or equivalent), if using an API
In a spreadsheet program
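For the API option, here is a minimal sketch of adding `$limit` to a request URL using only the standard library. The endpoint is a placeholder, not a real dataset, and `limited_url` is a hypothetical helper. Whether an API accepts `$limit` depends on the portal, so check its documentation.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration; substitute your portal's API URL.
BASE_URL = "https://data.example.org/resource/abcd-1234.csv"


def limited_url(base_url, limit, **extra_params):
    """Build a request URL that asks the API for at most `limit` rows."""
    params = {"$limit": limit, **extra_params}
    # safe="$" keeps the dollar sign readable instead of encoding it as %24.
    return f"{base_url}?{urlencode(params, safe='$')}"


url = limited_url(BASE_URL, 5_000)
print(url)

# With a real endpoint, pandas could then read just those rows:
#   import pandas as pd
#   df = pd.read_csv(url)
```

If the file is already downloaded, `pd.read_csv(path, nrows=...)` offers a similar cap by reading only the first rows.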