Open-ended assignments

In Homework 1, Homework 4, and the Final Project, you will pick your own dataset(s). For each:

Use at least one dataset that you aren’t familiar with.
- Using data from a primary source is preferred.
It should have between one thousand and one million rows.
- If it’s larger than that, you can make it smaller.
Finding a dataset available in CSV or JSON is recommended, though pandas can read other formats.
It’s ok if you pick the same dataset as another student, as long as you’re following the Academic Integrity rules.
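To check whether a dataset fits the size guideline, you can count its rows after loading it with pandas. The following is a minimal sketch: the helper names `size_check` and `shrink` are made up for illustration, the bounds are treated as inclusive (an assumption), and random sampling is just one way to make a dataset smaller.

```python
import pandas as pd

# Bounds taken from the guideline above; treated as inclusive here (an assumption).
MIN_ROWS = 1_000
MAX_ROWS = 1_000_000


def size_check(df: pd.DataFrame) -> str:
    """Classify a DataFrame's row count against the assignment's bounds."""
    n_rows = len(df)
    if n_rows < MIN_ROWS:
        return "too small"
    if n_rows > MAX_ROWS:
        return "too large"
    return "ok"


def shrink(df: pd.DataFrame, max_rows: int = MAX_ROWS, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it fits, else a reproducible random sample of max_rows rows."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


# Stand-in data for illustration; a real dataset would come from
# pd.read_csv() or pd.read_json().
demo = pd.DataFrame({"value": range(5_000)})
print(size_check(demo))                    # ok
print(len(shrink(demo, max_rows=2_000)))   # 2000
```

If you do sample, mention it in your write-up so the reader knows they are looking at a subset.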

Read like a blog post#

For these assignments:

Pretend you’re explaining to a peer who hasn’t taken this class. You don’t need to teach them to code, but they should be able to follow what’s going on.
Walk the reader through what you’re doing in every step and what they should be taking away from it.
- You are more than welcome to inject some personality; it doesn’t need to be dry.
Use text cells with Markdown for formatting.
- You’ll need to change the cell type to Markdown.

Open data portals#

There are countless places to get data. Below are some examples.

Primary sources#

NYU Libraries Data Sources
Local:
- NYC Open Data
  - Scout can be used to find datasets with certain columns
- BetaNYC
U.S. Federal:
United Nations
World Bank
World Health Organization (WHO)
The Humanitarian Data Exchange
Economic Policy Institute
Organisation for Economic Co-operation and Development (OECD)
Gallup Global Datasets
International Telecommunication Union (ITU)

Secondary sources#

Directories of open data portals#

Housing data#

Inspiration#

For starters, see the Final Project examples from past semesters.

It’s probably not realistic to make visualizations as fancy as these ones made by professionals, but they may give you ideas. Some also include links to, or downloads of, the source data.

Storing data#

There are many ways to work with uploaded files in Google Colab. A few recommended options:

Direct upload#

Fewer steps, but your file(s) will disappear when your session ends.

Steps to get data into Google Colab directly

In the Google Colab sidebar, click the Files icon (A).
Click the upload button (B).
Select your file.
Read the file in Python.
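The last step might look like the sketch below. It assumes the upload landed in `/content`, which is usually where the Files sidebar puts it; `read_uploaded_csv` and the filename `my_data.csv` are hypothetical names used only for illustration.

```python
from pathlib import Path

import pandas as pd


def read_uploaded_csv(filename: str, folder: str = "/content") -> pd.DataFrame:
    """Read an uploaded CSV, with a clearer error if the file is no longer there."""
    path = Path(folder) / filename
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; directly uploaded files are lost when the "
            "session ends, so you may need to upload it again."
        )
    return pd.read_csv(path)


# In Colab (hypothetical filename):
# df = read_uploaded_csv("my_data.csv")
```

A plain `pd.read_csv("my_data.csv")` does the same job; the wrapper only adds a reminder about why the file may have gone missing.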

Google Drive#

More steps, but your file(s) are preserved between sessions.

These steps expand on the instructions to mount Google Drive locally.

Steps to get data into Google Colab via Drive

Upload the file(s) somewhere in Drive.
In the Google Colab sidebar, click the Files icon (A).
Click the Mount Drive icon (B).
- You may need to run the code it injects to authorize it (C).
- Think of this as attaching your Drive to your Google Colab instance, as if you were plugging in a USB flash drive.
Read the file in Python (below).
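The code that the Mount Drive button injects is typically along the lines of the sketch below. The wrapper function `mount_drive` is a hypothetical addition so that the cell also runs outside Colab without an error; inside Colab, the two lines that matter are the import and the `drive.mount()` call.

```python
def mount_drive(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only available inside Colab
    except ImportError:
        print("Not running in Google Colab; nothing to mount.")
        return False
    drive.mount(mount_point)  # asks for authorization the first time
    return True


mounted = mount_drive()
```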

Reading files in Python#

Navigate to the file (D).
- You may need to click into content, then drive.
Next to the filename, click the three dots.
Click Copy path (E).
- The value should be something like /content/drive/My Drive/....
Use this path with read_csv() (F).

Google Colab cannot access files on your local machine; in other words, the path shouldn’t start with C:\ or anything like that. More info about file paths.
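As a rough illustration of the point about paths, the sketch below refuses a path that looks like it came from a Windows machine before handing it to `read_csv()`. The helper names and the example paths (including the `data.csv` filename) are hypothetical; your copied path will differ.

```python
import re

import pandas as pd


def looks_like_local_windows_path(path: str) -> bool:
    """True for paths that start with a drive letter, such as C:\\ or D:/."""
    return re.match(r"^[A-Za-z]:[\\/]", path) is not None


def read_colab_csv(path: str) -> pd.DataFrame:
    """Read a CSV, rejecting paths that point at your own computer."""
    if looks_like_local_windows_path(path):
        raise ValueError(
            f"{path!r} is a path on your local machine; use the path copied "
            "from the Colab Files sidebar instead."
        )
    return pd.read_csv(path)


print(looks_like_local_windows_path(r"C:\Users\me\data.csv"))              # True
print(looks_like_local_windows_path("/content/drive/My Drive/data.csv"))  # False
```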

InteractiveSheet#

You can import your data into a Google Sheet and read it directly from Colab.

from google.colab import sheets

sheet_url = "…"
sheet = sheets.InteractiveSheet(url=sheet_url, include_column_headers=True)
df = sheet.as_df()

Reducing data size#

You can make data smaller before uploading by filtering it through:

The data portal, if it supports it
- This makes the download faster, since it includes only the data you need.
- Instructions for Socrata-based portals:
  - NYC’s guide
  - Official
The $limit parameter (or equivalent), if using an API
- Socrata documentation
In a spreadsheet program
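For the API route, here is a minimal sketch of adding `$limit` to a Socrata-style endpoint URL. The function name `with_row_limit` and the endpoint URL are placeholders, not a real dataset, and the sketch assumes the endpoint URL has no query string yet.

```python
from urllib.parse import urlencode


def with_row_limit(endpoint_url: str, limit: int) -> str:
    """Append a $limit query parameter to an endpoint URL that has no query string."""
    if limit <= 0:
        raise ValueError("limit must be a positive number of rows")
    # safe="$" keeps the dollar sign readable instead of encoding it as %24.
    query = urlencode({"$limit": limit}, safe="$")
    return f"{endpoint_url}?{query}"


# Placeholder endpoint for illustration only.
url = with_row_limit("https://example.com/resource/abcd-1234.csv", 5_000)
print(url)  # https://example.com/resource/abcd-1234.csv?$limit=5000

# With a real endpoint, pandas can read straight from the URL:
# df = pd.read_csv(url)
```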