# a bit of setup just for lecture; ignore me
import numpy as np
np.random.seed(5)
Lecture 1: Working with data#
Please sign attendance sheet
How was the homework? 👍/👎
Questions?
Reminder about the between-class participation
Additional programming concepts#
Functions#
Functions == methods
Arguments == parameters
For simplicity, we’ll use them interchangeably.
Packages#
a.k.a. “libraries”
Developers have create them to make code/functionality reusable and easily sharable
Software plugins that you
import
Main packages we’ll use:
pandas
plotly
A module is a file containing Python definitions and statements.
Working with files in Python#
We’ll open the file, then read it row by row.
# set up the file object
with open("moby_dick.txt") as file:
for line in file:
print(line)
Working with CSVs in pure Python#
We will use Python’s CSV DictReader. We’ll open the file, parse it as a CSV, then operate row by row.
Example#
import csv
# set up the file object
with open("people.csv") as csvfile:
# set up the reader
reader = csv.DictReader(csvfile)
# loop through the rows
for row in reader:
# access the data in various columns
first = row["first_name"]
last = row["last_name"]
print(f"{first} {last}")
In-class exercise#
311 requests#
Who’s called 311 before?
311 data#
Today’s goal#
Which 311 complaints are most common?
Which agencies are responsible for handling them?
Pandas#
A Python package (bundled up code that you can reuse)
Very common for data science in Python
-
Both organize around “data frames”
Import package#
import pandas as pd
Read data#
We’re using a sample to make it easier/faster to work with. This will take a while (~30 seconds).
requests = pd.read_csv(
"https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip"
)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_94278/4056869583.py:1: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
requests = pd.read_csv(
Ignore the DtypeWarning
for now; we’ll come back to it.
Preview the data#
requests
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39885889 | 08/01/2018 12:05:13 AM | 08/01/2018 12:05:13 AM | DOT | Department of Transportation | Street Condition | Pothole | NaN | 11235 | 3143 SHORE PARKWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.585156 | -73.959119 | (40.585155533520144, -73.95911915841708) |
1 | 39886470 | 08/01/2018 12:06:05 AM | 08/01/2018 12:06:05 AM | DOT | Department of Transportation | Street Condition | Pothole | NaN | 11235 | 3153 SHORE PARKWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.585218 | -73.958608 | (40.58521848090658, -73.95860788382927) |
2 | 39893543 | 08/01/2018 12:06:16 AM | 08/03/2018 02:03:55 PM | HPD | Department of Housing Preservation and Develop... | HEAT/HOT WATER | ENTIRE BUILDING | RESIDENTIAL BUILDING | 11221 | 729 LAFAYETTE AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.690733 | -73.943964 | (40.69073285353906, -73.943963521266) |
3 | 39886233 | 08/01/2018 12:06:29 AM | 08/01/2018 02:54:24 AM | NYPD | New York City Police Department | Noise - Residential | Banging/Pounding | Residential Building/House | 11693 | 82-01 BEACH CHANNEL DRIVE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.589931 | -73.808896 | (40.58993080750793, -73.80889570815852) |
4 | 39880309 | 08/01/2018 12:06:51 AM | 08/01/2018 04:54:26 AM | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11216 | 64 HERKIMER STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.679716 | -73.951234 | (40.67971590505359, -73.95123396494363) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
499995 | 43622302 | 08/24/2019 01:46:09 AM | NaN | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10009.0 | 431 EAST 9 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.727536 | -73.983295 | (40.72753608835362, -73.98329522742081) |
499996 | 43619709 | 08/24/2019 01:49:49 AM | NaN | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10304.0 | 191 BROAD STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.624157 | -74.081006 | (40.62415703282506, -74.08100614362155) |
499997 | 43623124 | 08/24/2019 01:56:35 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 10031.0 | 534 WEST 153 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.830718 | -73.945006 | (40.83071800761314, -73.94500557250639) |
499998 | 43625595 | 08/24/2019 01:56:40 AM | NaN | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10452.0 | EAST 170 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.839882 | -73.916783 | (40.839882158779105, -73.91678321635897) |
499999 | 43622817 | 08/24/2019 01:57:58 AM | NaN | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Store/Commercial | 10033.0 | 247 AUDUBON AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.846376 | -73.934048 | (40.84637632367179, -73.93404825809533) |
500000 rows × 41 columns
.head()
defaults to providing the first 5 if you don’t specify a number. Can also use .tail()
.
requests.head(10)
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39885889 | 08/01/2018 12:05:13 AM | 08/01/2018 12:05:13 AM | DOT | Department of Transportation | Street Condition | Pothole | NaN | 11235 | 3143 SHORE PARKWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.585156 | -73.959119 | (40.585155533520144, -73.95911915841708) |
1 | 39886470 | 08/01/2018 12:06:05 AM | 08/01/2018 12:06:05 AM | DOT | Department of Transportation | Street Condition | Pothole | NaN | 11235 | 3153 SHORE PARKWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.585218 | -73.958608 | (40.58521848090658, -73.95860788382927) |
2 | 39893543 | 08/01/2018 12:06:16 AM | 08/03/2018 02:03:55 PM | HPD | Department of Housing Preservation and Develop... | HEAT/HOT WATER | ENTIRE BUILDING | RESIDENTIAL BUILDING | 11221 | 729 LAFAYETTE AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.690733 | -73.943964 | (40.69073285353906, -73.943963521266) |
3 | 39886233 | 08/01/2018 12:06:29 AM | 08/01/2018 02:54:24 AM | NYPD | New York City Police Department | Noise - Residential | Banging/Pounding | Residential Building/House | 11693 | 82-01 BEACH CHANNEL DRIVE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.589931 | -73.808896 | (40.58993080750793, -73.80889570815852) |
4 | 39880309 | 08/01/2018 12:06:51 AM | 08/01/2018 04:54:26 AM | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11216 | 64 HERKIMER STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.679716 | -73.951234 | (40.67971590505359, -73.95123396494363) |
5 | 39878305 | 08/01/2018 12:07:41 AM | 08/01/2018 12:07:41 AM | DOT | Department of Transportation | Street Condition | Pothole | NaN | 11235 | 3167 SHORE PARKWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.585254 | -73.958302 | (40.585254052666755, -73.95830184003603) |
6 | 39878821 | 08/01/2018 12:10:04 AM | 08/01/2018 02:04:49 AM | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 10465 | 245 THROGGS NECK BOULEVARD | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.818533 | -73.813294 | (40.8185333998848, -73.81329367027502) |
7 | 39881328 | 08/01/2018 12:11:00 AM | 08/09/2018 12:00:00 AM | DSNY | Department of Sanitation | Request Large Bulky Item Collection | Request Large Bulky Item Collection | Sidewalk | 11367 | 147-46 77 ROAD | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.721380 | -73.816326 | (40.72137977337607, -73.8163256713087) |
8 | 39885268 | 08/01/2018 12:14:41 AM | 08/01/2018 03:03:31 AM | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10459 | 951 HOE AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.822088 | -73.890340 | (40.82208801817637, -73.89033972476959) |
9 | 39897573 | 08/01/2018 12:17:02 AM | 08/13/2018 01:05:32 PM | HPD | Department of Housing Preservation and Develop... | HEAT/HOT WATER | ENTIRE BUILDING | RESIDENTIAL BUILDING | 11101 | 33-01 38 AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.752911 | -73.930358 | (40.752910688469996, -73.93035822987532) |
10 rows × 41 columns
requests.sample(5)
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
311150 | 42044618 | 03/25/2019 10:16:38 PM | 03/25/2019 11:47:09 PM | NYPD | New York City Police Department | Noise - Residential | Loud Talking | Residential Building/House | 10453.0 | 1800 POPHAM AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.851389 | -73.917651 | (40.85138859491611, -73.91765079814148) |
32142 | 40107788 | 08/25/2018 10:24:40 AM | 08/25/2018 04:30:08 PM | NYPD | New York City Police Department | Blocked Driveway | No Access | Street/Sidewalk | 11419.0 | 133-05 107 AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.687721 | -73.811536 | (40.6877211401399, -73.8115361385404) |
120849 | 40692417 | 10/29/2018 07:23:28 AM | 11/01/2018 09:52:41 PM | HPD | Department of Housing Preservation and Develop... | HEAT/HOT WATER | APARTMENT ONLY | RESIDENTIAL BUILDING | 10040 | 660 FT WASHINGTON AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.856431 | -73.936092 | (40.856431325229096, -73.93609157070478) |
471582 | 43386392 | 07/26/2019 11:17:21 PM | 07/27/2019 02:36:26 AM | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10303.0 | 351 HARBOR ROAD | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.627510 | -74.160446 | (40.62751009515846, -74.16044618563123) |
455213 | 43253780 | 07/11/2019 03:23:28 PM | 07/12/2019 11:25:43 AM | NYPD | New York City Police Department | Illegal Parking | Double Parked Blocking Traffic | Street/Sidewalk | 11204.0 | 61 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 41 columns
Pandas data structures#
DataFrame information#
requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unique Key 500000 non-null int64
1 Created Date 500000 non-null object
2 Closed Date 476156 non-null object
3 Agency 500000 non-null object
4 Agency Name 500000 non-null object
5 Complaint Type 500000 non-null object
6 Descriptor 492534 non-null object
7 Location Type 392590 non-null object
8 Incident Zip 480411 non-null object
9 Incident Address 434544 non-null object
10 Street Name 434519 non-null object
11 Cross Street 1 300838 non-null object
12 Cross Street 2 299635 non-null object
13 Intersection Street 1 107383 non-null object
14 Intersection Street 2 107048 non-null object
15 Address Type 451018 non-null object
16 City 476649 non-null object
17 Landmark 32521 non-null object
18 Facility Type 134923 non-null object
19 Status 500000 non-null object
20 Due Date 171559 non-null object
21 Resolution Description 457389 non-null object
22 Resolution Action Updated Date 488801 non-null object
23 Community Board 500000 non-null object
24 BBL 407355 non-null float64
25 Borough 500000 non-null object
26 X Coordinate (State Plane) 470830 non-null float64
27 Y Coordinate (State Plane) 470830 non-null float64
28 Open Data Channel Type 500000 non-null object
29 Park Facility Name 499973 non-null object
30 Park Borough 500000 non-null object
31 Vehicle Type 37 non-null object
32 Taxi Company Borough 403 non-null object
33 Taxi Pick Up Location 4475 non-null object
34 Bridge Highway Name 697 non-null object
35 Bridge Highway Direction 766 non-null object
36 Road Ramp 759 non-null object
37 Bridge Highway Segment 1027 non-null object
38 Latitude 470830 non-null float64
39 Longitude 470830 non-null float64
40 Location 470830 non-null object
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB
Demo#
Analysis#
Which complaints are most common?#
# code goes here
What’s the most frequent request per agency?#
# code goes here
groupby()
similar to pivot tables in spreadsheets
Exclude bad records from the DataFrame#
Let’s look at the complaint types.
# code goes here
How should we go about cleaning those up?
# code goes here