# a bit of setup just for lecture; ignore me
import numpy as np

np.random.seed(5)

Lecture 1: Working with data#

Please sign attendance sheet

Additional programming concepts#

Functions#

  • Functions == methods

  • Arguments == parameters

For simplicity, we’ll use them interchangeably.

Packages#

  • a.k.a. “libraries”

  • Developers have create them to make code/functionality reusable and easily sharable

  • Software plugins that you import

  • Main packages we’ll use:

    • pandas

    • plotly

A module is a file containing Python definitions and statements.

https://docs.python.org/3/tutorial/modules.html

Working with files in Python#

We’ll open the file, then read it row by row.

# set up the file object
with open("moby_dick.txt") as file:
    for line in file:
        print(line)

Working with CSVs in pure Python#

We will use Python’s CSV DictReader. We’ll open the file, parse it as a CSV, then operate row by row.

Example#

import csv


# set up the file object
with open("people.csv") as csvfile:
    # set up the reader
    reader = csv.DictReader(csvfile)
    # loop through the rows
    for row in reader:
        # access the data in various columns
        first = row["first_name"]
        last = row["last_name"]

        print(f"{first} {last}")

In-class exercise#

311 requests#

Who’s called 311 before?

NYC 311 homepage

311 data#

Today’s goal#

  • Which 311 complaints are most common?

  • Which agencies are responsible for handling them?

Pandas#

  • A Python package (bundled up code that you can reuse)

  • Very common for data science in Python

  • A lot like R

    • Both organize around “data frames”

Import package#

import pandas as pd

Read data#

We’re using a sample to make it easier/faster to work with. This will take a while (~30 seconds).

requests = pd.read_csv(
    "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip"
)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_94278/4056869583.py:1: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(

Ignore the DtypeWarning for now; we’ll come back to it.

Preview the data#

requests
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Latitude Longitude Location
0 39885889 08/01/2018 12:05:13 AM 08/01/2018 12:05:13 AM DOT Department of Transportation Street Condition Pothole NaN 11235 3143 SHORE PARKWAY ... NaN NaN NaN NaN NaN NaN NaN 40.585156 -73.959119 (40.585155533520144, -73.95911915841708)
1 39886470 08/01/2018 12:06:05 AM 08/01/2018 12:06:05 AM DOT Department of Transportation Street Condition Pothole NaN 11235 3153 SHORE PARKWAY ... NaN NaN NaN NaN NaN NaN NaN 40.585218 -73.958608 (40.58521848090658, -73.95860788382927)
2 39893543 08/01/2018 12:06:16 AM 08/03/2018 02:03:55 PM HPD Department of Housing Preservation and Develop... HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 11221 729 LAFAYETTE AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.690733 -73.943964 (40.69073285353906, -73.943963521266)
3 39886233 08/01/2018 12:06:29 AM 08/01/2018 02:54:24 AM NYPD New York City Police Department Noise - Residential Banging/Pounding Residential Building/House 11693 82-01 BEACH CHANNEL DRIVE ... NaN NaN NaN NaN NaN NaN NaN 40.589931 -73.808896 (40.58993080750793, -73.80889570815852)
4 39880309 08/01/2018 12:06:51 AM 08/01/2018 04:54:26 AM NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 11216 64 HERKIMER STREET ... NaN NaN NaN NaN NaN NaN NaN 40.679716 -73.951234 (40.67971590505359, -73.95123396494363)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
499995 43622302 08/24/2019 01:46:09 AM NaN NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 10009.0 431 EAST 9 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.727536 -73.983295 (40.72753608835362, -73.98329522742081)
499996 43619709 08/24/2019 01:49:49 AM NaN NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 10304.0 191 BROAD STREET ... NaN NaN NaN NaN NaN NaN NaN 40.624157 -74.081006 (40.62415703282506, -74.08100614362155)
499997 43623124 08/24/2019 01:56:35 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Music/Party Street/Sidewalk 10031.0 534 WEST 153 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.830718 -73.945006 (40.83071800761314, -73.94500557250639)
499998 43625595 08/24/2019 01:56:40 AM NaN NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 10452.0 EAST 170 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.839882 -73.916783 (40.839882158779105, -73.91678321635897)
499999 43622817 08/24/2019 01:57:58 AM NaN NYPD New York City Police Department Noise - Commercial Loud Music/Party Store/Commercial 10033.0 247 AUDUBON AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.846376 -73.934048 (40.84637632367179, -73.93404825809533)

500000 rows × 41 columns

.head() defaults to providing the first 5 if you don’t specify a number. Can also use .tail().

requests.head(10)
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Latitude Longitude Location
0 39885889 08/01/2018 12:05:13 AM 08/01/2018 12:05:13 AM DOT Department of Transportation Street Condition Pothole NaN 11235 3143 SHORE PARKWAY ... NaN NaN NaN NaN NaN NaN NaN 40.585156 -73.959119 (40.585155533520144, -73.95911915841708)
1 39886470 08/01/2018 12:06:05 AM 08/01/2018 12:06:05 AM DOT Department of Transportation Street Condition Pothole NaN 11235 3153 SHORE PARKWAY ... NaN NaN NaN NaN NaN NaN NaN 40.585218 -73.958608 (40.58521848090658, -73.95860788382927)
2 39893543 08/01/2018 12:06:16 AM 08/03/2018 02:03:55 PM HPD Department of Housing Preservation and Develop... HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 11221 729 LAFAYETTE AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.690733 -73.943964 (40.69073285353906, -73.943963521266)
3 39886233 08/01/2018 12:06:29 AM 08/01/2018 02:54:24 AM NYPD New York City Police Department Noise - Residential Banging/Pounding Residential Building/House 11693 82-01 BEACH CHANNEL DRIVE ... NaN NaN NaN NaN NaN NaN NaN 40.589931 -73.808896 (40.58993080750793, -73.80889570815852)
4 39880309 08/01/2018 12:06:51 AM 08/01/2018 04:54:26 AM NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 11216 64 HERKIMER STREET ... NaN NaN NaN NaN NaN NaN NaN 40.679716 -73.951234 (40.67971590505359, -73.95123396494363)
5 39878305 08/01/2018 12:07:41 AM 08/01/2018 12:07:41 AM DOT Department of Transportation Street Condition Pothole NaN 11235 3167 SHORE PARKWAY ... NaN NaN NaN NaN NaN NaN NaN 40.585254 -73.958302 (40.585254052666755, -73.95830184003603)
6 39878821 08/01/2018 12:10:04 AM 08/01/2018 02:04:49 AM NYPD New York City Police Department Illegal Parking Commercial Overnight Parking Street/Sidewalk 10465 245 THROGGS NECK BOULEVARD ... NaN NaN NaN NaN NaN NaN NaN 40.818533 -73.813294 (40.8185333998848, -73.81329367027502)
7 39881328 08/01/2018 12:11:00 AM 08/09/2018 12:00:00 AM DSNY Department of Sanitation Request Large Bulky Item Collection Request Large Bulky Item Collection Sidewalk 11367 147-46 77 ROAD ... NaN NaN NaN NaN NaN NaN NaN 40.721380 -73.816326 (40.72137977337607, -73.8163256713087)
8 39885268 08/01/2018 12:14:41 AM 08/01/2018 03:03:31 AM NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 10459 951 HOE AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.822088 -73.890340 (40.82208801817637, -73.89033972476959)
9 39897573 08/01/2018 12:17:02 AM 08/13/2018 01:05:32 PM HPD Department of Housing Preservation and Develop... HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 11101 33-01 38 AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.752911 -73.930358 (40.752910688469996, -73.93035822987532)

10 rows × 41 columns

requests.sample(5)
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Latitude Longitude Location
311150 42044618 03/25/2019 10:16:38 PM 03/25/2019 11:47:09 PM NYPD New York City Police Department Noise - Residential Loud Talking Residential Building/House 10453.0 1800 POPHAM AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.851389 -73.917651 (40.85138859491611, -73.91765079814148)
32142 40107788 08/25/2018 10:24:40 AM 08/25/2018 04:30:08 PM NYPD New York City Police Department Blocked Driveway No Access Street/Sidewalk 11419.0 133-05 107 AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.687721 -73.811536 (40.6877211401399, -73.8115361385404)
120849 40692417 10/29/2018 07:23:28 AM 11/01/2018 09:52:41 PM HPD Department of Housing Preservation and Develop... HEAT/HOT WATER APARTMENT ONLY RESIDENTIAL BUILDING 10040 660 FT WASHINGTON AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.856431 -73.936092 (40.856431325229096, -73.93609157070478)
471582 43386392 07/26/2019 11:17:21 PM 07/27/2019 02:36:26 AM NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 10303.0 351 HARBOR ROAD ... NaN NaN NaN NaN NaN NaN NaN 40.627510 -74.160446 (40.62751009515846, -74.16044618563123)
455213 43253780 07/11/2019 03:23:28 PM 07/12/2019 11:25:43 AM NYPD New York City Police Department Illegal Parking Double Parked Blocking Traffic Street/Sidewalk 11204.0 61 STREET ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 41 columns

Pandas data structures#

Diagram showing a DataFrame, Series, labels, and indexes

DataFrame information#

requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      500000 non-null  int64  
 1   Created Date                    500000 non-null  object 
 2   Closed Date                     476156 non-null  object 
 3   Agency                          500000 non-null  object 
 4   Agency Name                     500000 non-null  object 
 5   Complaint Type                  500000 non-null  object 
 6   Descriptor                      492534 non-null  object 
 7   Location Type                   392590 non-null  object 
 8   Incident Zip                    480411 non-null  object 
 9   Incident Address                434544 non-null  object 
 10  Street Name                     434519 non-null  object 
 11  Cross Street 1                  300838 non-null  object 
 12  Cross Street 2                  299635 non-null  object 
 13  Intersection Street 1           107383 non-null  object 
 14  Intersection Street 2           107048 non-null  object 
 15  Address Type                    451018 non-null  object 
 16  City                            476649 non-null  object 
 17  Landmark                        32521 non-null   object 
 18  Facility Type                   134923 non-null  object 
 19  Status                          500000 non-null  object 
 20  Due Date                        171559 non-null  object 
 21  Resolution Description          457389 non-null  object 
 22  Resolution Action Updated Date  488801 non-null  object 
 23  Community Board                 500000 non-null  object 
 24  BBL                             407355 non-null  float64
 25  Borough                         500000 non-null  object 
 26  X Coordinate (State Plane)      470830 non-null  float64
 27  Y Coordinate (State Plane)      470830 non-null  float64
 28  Open Data Channel Type          500000 non-null  object 
 29  Park Facility Name              499973 non-null  object 
 30  Park Borough                    500000 non-null  object 
 31  Vehicle Type                    37 non-null      object 
 32  Taxi Company Borough            403 non-null     object 
 33  Taxi Pick Up Location           4475 non-null    object 
 34  Bridge Highway Name             697 non-null     object 
 35  Bridge Highway Direction        766 non-null     object 
 36  Road Ramp                       759 non-null     object 
 37  Bridge Highway Segment          1027 non-null    object 
 38  Latitude                        470830 non-null  float64
 39  Longitude                       470830 non-null  float64
 40  Location                        470830 non-null  object 
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB

Demo#

Analysis#

Which complaints are most common?#

# code goes here

What’s the most frequent request per agency?#

# code goes here

Exclude bad records from the DataFrame#

Let’s look at the complaint types.

# code goes here

How should we go about cleaning those up?

# code goes here

Best practices#

Homework 1#