Open-ended assignments#
In Homework 1 and the Final Project, you will pick your own dataset(s).
Use at least one dataset that you aren’t familiar with.
Using data from a primary source is preferred.
Finding a dataset available in CSV or JSON is recommended, though pandas can read other formats.
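As a rough sketch of what loading those two formats can look like in pandas, the example below reads a tiny CSV and an equivalent JSON document from in-memory strings. The column names and values are made up for illustration; with a real dataset you would pass a file path rather than a StringIO object.

```python
from io import StringIO

import pandas as pd

# Made-up sample data; a real dataset would come from a file instead.
csv_text = "city,population\nAlbany,100\nBuffalo,250\n"
json_text = '[{"city": "Albany", "population": 100}, {"city": "Buffalo", "population": 250}]'

# read_csv and read_json both accept a file path or a file-like object.
csv_df = pd.read_csv(StringIO(csv_text))
json_df = pd.read_json(StringIO(json_text))

print(csv_df)
print(json_df)
```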
Open data portals#
There are countless places to get data, notably:
Local:
-
Scout can be used to find datasets with certain columns
-
U.S. Federal:
Lists of open data portals:
Inspiration#
For starters, see the Final Project examples from past semesters.
It's probably not realistic to make visualizations as fancy as these, which were made by professionals, but they may give you ideas. Some also include links to or downloads of the source data.
Storing data#
Open the JupyterHub file browser.
Navigate to the folder your notebook is in.
From Python, use read_csv("./<filename>.csv").
Note that the file path should be relative to the notebook within JupyterHub: ./ means “in the same directory”. JupyterHub cannot access files on your local machine; in other words, the path shouldn’t start with C:\\ or anything like that. More info about file paths.
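To make the relative-versus-local distinction concrete, here is a minimal sketch of a check you could run on a path before handing it to read_csv. The helper name is_relative_to_notebook is hypothetical and not part of pandas or JupyterHub; it only inspects the text of the path and does not check whether the file exists.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_relative_to_notebook(path: str) -> bool:
    """Return True if `path` looks like a relative path usable on JupyterHub.

    Hypothetical helper for illustration: it rejects Windows drive paths
    (such as C:\\Users) and absolute paths (such as /home/user).
    """
    has_drive = PureWindowsPath(path).drive != ""
    is_absolute = PurePosixPath(path).is_absolute()
    return not has_drive and not is_absolute


# Print each candidate path alongside whether it passes the check.
for candidate in ["./data.csv", "data/data.csv", "C:\\Users\\me\\data.csv", "/tmp/data.csv"]:
    print(candidate, is_relative_to_notebook(candidate))
```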
Limits#
JupyterHub has a disk storage limit of 1GB (a.k.a. 1,024 MB or 1,048,576 KB) across all your files, and a memory limit of 3GB.
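One way to see how close you are to the disk limit is to add up file sizes yourself. The sketch below assumes the limit applies to everything under the folder you point it at; folder_size_bytes is a hypothetical helper name, and how JupyterHub itself measures usage may differ.

```python
from pathlib import Path

DISK_LIMIT_BYTES = 1024 ** 3  # 1 GB = 1,024 MB = 1,048,576 KB


def folder_size_bytes(folder: str = ".") -> int:
    """Sum the sizes of all regular files under `folder` (recursively)."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


# Report usage for the current folder; pass a different folder as needed.
used = folder_size_bytes(".")
print(f"Using {used / 1024 ** 2:.1f} MB of {DISK_LIMIT_BYTES / 1024 ** 2:.0f} MB")
```

The memory limit is separate from disk usage: a loaded DataFrame can take more room in memory than its file does on disk, and pandas can report that figure through DataFrame.memory_usage(deep=True).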
Reducing data size#
You can make data smaller before uploading by filtering it through:
The data portal, if it supports it
This makes the download faster, since it includes only the data you need.
The $limit parameter (or equivalent), if using an API
In a spreadsheet program
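For the API option, a minimal sketch of adding a row limit to a request URL is shown below. The endpoint is a placeholder rather than a real dataset, and limited_url is a hypothetical helper; APIs that use a different parameter name would need it swapped in.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration; substitute the portal's real API URL.
BASE_URL = "https://data.example.org/resource/abcd-1234.csv"


def limited_url(base_url: str, limit: int) -> str:
    """Append a $limit query parameter so the API returns at most `limit` rows."""
    # safe="$" keeps the dollar sign readable instead of encoding it as %24.
    return f"{base_url}?{urlencode({'$limit': limit}, safe='$')}"


url = limited_url(BASE_URL, 1000)
print(url)
```

Since read_csv accepts URLs as well as file paths, a URL built this way could be passed to it directly, or opened in a browser to download the smaller file.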