Lecture 1: Working with data#

Please sign attendance sheet; close devices

Challenge#

Complete the demos and exercise today with generative AI only.

  • Allowed

    • Prompts

    • Copy-pasting

  • Not allowed

    • Googling

    • Editing

Spoiler: This is an example of what not to do 😉

I’ll be using Gemini, built into Colab; you can use a different tool if you prefer.

Working with CSVs in pure Python#

We will use csv.DictReader from Python’s standard library. We’ll open the file, parse it as a CSV, then operate on it row by row.

# our code here
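As a rough sketch of the row-by-row approach described above: the snippet below parses a tiny in-memory CSV with csv.DictReader and tallies one column. The sample rows and the column names ("Unique Key", "Agency", "Complaint Type") are assumptions made for illustration, not values taken from the course file.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for an open CSV file; column names are assumptions.
sample_csv = io.StringIO(
    "Unique Key,Agency,Complaint Type\n"
    "1,NYPD,Noise - Residential\n"
    "2,HPD,HEAT/HOT WATER\n"
    "3,NYPD,Noise - Residential\n"
)

complaint_counts = Counter()

# DictReader uses the first line as the header and yields one dict per row.
reader = csv.DictReader(sample_csv)
for row in reader:
    complaint_counts[row["Complaint Type"]] += 1

print(complaint_counts.most_common(1))
```

With a real file on disk, the io.StringIO object would be replaced by something like `with open(path, newline="") as f:`, passing `f` to csv.DictReader.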

In-class exercise#

311 requests#

Who’s called 311 before?

NYC 311 homepage

311 data#

Today’s goal#

  • Which 311 complaints are most common?

  • Which agencies are responsible for handling them?

Pandas#

  • A Python package (bundled-up code that you can reuse)

  • Very common for data science in Python

  • A lot like R

    • Both organize around “data frames”

Load data#

Pull data from:

https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip

We’re using a sample to make the data easier and faster to work with. Loading it will still take a while (~30 seconds).

# our code here
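One possible way to load the file, assuming pandas is installed (it is preinstalled in Colab): pd.read_csv accepts a URL and infers zip compression from the ".zip" suffix. The helper name load_requests is hypothetical; the call that downloads the data is left commented out because it needs network access.

```python
import pandas as pd

# URL from the lecture notes above.
DATA_URL = (
    "https://storage.googleapis.com/python-public-policy2/"
    "data/311_requests_2018-19_sample.csv.zip"
)


def load_requests(source=DATA_URL):
    """Read 311 requests from a path or URL into a DataFrame.

    Hypothetical helper: compression is inferred from the file extension.
    """
    return pd.read_csv(source)


# In a notebook cell (requires network access):
# df = load_requests()
```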

If you see a DtypeWarning, ignore it for now. We’ll come back to it.

Preview the data#

# our code here
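A minimal sketch of previewing a DataFrame with head(). The small DataFrame below is invented so the snippet runs on its own; in class, df would be the 311 data loaded earlier, and the column names here are assumptions.

```python
import pandas as pd

# Invented rows standing in for the loaded 311 data.
df = pd.DataFrame(
    {
        "Agency": ["NYPD", "HPD", "NYPD", "DOT", "HPD"],
        "Complaint Type": [
            "Noise - Residential",
            "HEAT/HOT WATER",
            "Illegal Parking",
            "Street Condition",
            "PLUMBING",
        ],
    }
)

# head(n) returns the first n rows (5 by default) as a new DataFrame.
preview = df.head(3)
print(preview)
```

In a notebook, leaving `df.head()` as the last line of a cell renders the preview as a table without needing print.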

Pandas data structures#

Diagram showing a DataFrame, Series, labels, and indexes
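The pieces named in the diagram can also be poked at in code. This is a sketch on invented data with assumed column names: a DataFrame is the whole table, a Series is a single column, the column names are the labels, and the index identifies each row.

```python
import pandas as pd

# Invented three-row table.
df = pd.DataFrame(
    {
        "Agency": ["NYPD", "HPD", "DOT"],
        "Complaint Type": ["Illegal Parking", "PLUMBING", "Street Condition"],
    }
)

# Selecting one column by its label gives a Series.
agencies = df["Agency"]

print(type(df).__name__)        # DataFrame
print(type(agencies).__name__)  # Series
print(list(df.columns))         # column labels
print(list(df.index))           # default row index: [0, 1, 2]

# loc looks up a value by row index and column label.
print(df.loc[1, "Agency"])      # HPD
```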

DataFrame information#

# our code here
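A few standard calls for summarizing a DataFrame, sketched on invented data (the columns and the missing value are assumptions for illustration): info() prints column types and non-null counts, shape gives the dimensions, and isna() helps count missing values.

```python
import pandas as pd

# Invented rows; one complaint type is deliberately missing.
df = pd.DataFrame(
    {
        "Unique Key": [1, 2, 3],
        "Agency": ["NYPD", "HPD", "DOT"],
        "Complaint Type": ["Illegal Parking", None, "Street Condition"],
    }
)

# Prints the column names, non-null counts, and dtypes.
df.info()

# (number of rows, number of columns)
print(df.shape)

# Number of missing values in one column.
missing_types = int(df["Complaint Type"].isna().sum())
print(missing_types)
```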

Demo#

Analysis#

Which complaints are most common?#

# code goes here
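One way to answer this question is value_counts(), which tallies each distinct value in a column and sorts from most to least frequent. The data and the "Complaint Type" column name below are assumptions for illustration.

```python
import pandas as pd

# Invented rows standing in for the 311 data.
df = pd.DataFrame(
    {
        "Complaint Type": [
            "Noise - Residential",
            "HEAT/HOT WATER",
            "Noise - Residential",
            "Illegal Parking",
            "Noise - Residential",
            "HEAT/HOT WATER",
        ]
    }
)

# Count occurrences of each complaint type, most common first.
counts = df["Complaint Type"].value_counts()
print(counts.head(10))
```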

What’s the most frequent request per agency?#

# code goes here
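A possible approach, not necessarily the one used in class: group the rows by agency, then take the most common complaint type within each group. The data and the column names ("Agency", "Complaint Type") are assumptions for illustration.

```python
import pandas as pd

# Invented rows; each agency has one clear most-frequent complaint.
df = pd.DataFrame(
    {
        "Agency": ["NYPD", "NYPD", "NYPD", "HPD", "HPD", "HPD"],
        "Complaint Type": [
            "Noise - Residential",
            "Illegal Parking",
            "Noise - Residential",
            "HEAT/HOT WATER",
            "PLUMBING",
            "HEAT/HOT WATER",
        ],
    }
)

# For each agency, count its complaint types and keep the top label.
top_per_agency = df.groupby("Agency")["Complaint Type"].agg(
    lambda types: types.value_counts().idxmax()
)
print(top_per_agency)
```

If two complaint types tie within an agency, this sketch reports only one of them, so ties would need separate handling.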

Exclude bad records from the DataFrame#

Let’s look at the complaint types.

# code goes here
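To look at the complaint types, one option is to list the distinct values in the column. The odd-looking entries below ("???" and a missing value) are invented to stand in for bad records; they are not taken from the real data.

```python
import pandas as pd

# Invented rows, including two made-up "bad" values.
df = pd.DataFrame(
    {
        "Complaint Type": [
            "Noise - Residential",
            "???",
            "HEAT/HOT WATER",
            None,
            "Noise - Residential",
        ]
    }
)

# Distinct, non-missing complaint types in sorted order.
complaint_types = sorted(df["Complaint Type"].dropna().unique())
print(complaint_types)
```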

How should we go about cleaning those up?

# code goes here
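One possible cleanup, sketched under a stated assumption: a complaint type counts as valid only if it is present and starts with a letter. That rule is hypothetical, as are the sample rows; the right rule depends on what the bad records in the real data look like.

```python
import pandas as pd

# Invented rows, including two made-up "bad" values.
df = pd.DataFrame(
    {
        "Agency": ["NYPD", "HPD", "DOT", "NYPD"],
        "Complaint Type": ["Noise - Residential", "???", None, "Illegal Parking"],
    }
)

# Assumed rule: valid types start with a letter; missing values count as invalid.
is_valid = df["Complaint Type"].str.match(r"[A-Za-z]", na=False)

# Boolean indexing keeps only the rows where is_valid is True.
cleaned = df[is_valid]
print(cleaned)
```

Filtering into a new variable, rather than overwriting df, keeps the original rows around for comparing counts before and after.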

Reflections?#

  • What worked well?

  • What didn’t work well?

  • Did this change how you’re thinking about generative AI?

Best practices#

Homework 1#