# a bit of setup just for lecture; ignore me
import numpy as np
np.random.seed(5)
Class 1: Working with data#
Questions?
How was the homework? 👍/👎
Fundamental programming concepts#
Variables:
Statistics: “any characteristics, number, or quantity that can be measured or counted,” which we’ll see as columns of tables
Programming: Buckets for values
Functions: Buckets for code
You’ll learn more about these in the homework
Variables and functions#
Make it possible to reuse the respective thing elsewhere
Should have names that are meaningful, for readability
Also:
Function arguments are basically variables within that function, which get assigned to whatever is passed in
Functions == methods
Arguments == parameters
For simplicity, we’ll use them interchangeably.
Packages#
a.k.a. “libraries” or “modules”
Developers have create them to make code/functionality reusable and easily sharable
Software plugins that you
import
Main packages we’ll use:
pandas
plotly
Working with CSVs in pure Python#
We will use Python’s CSV DictReader. We’ll open the file, parse it as a CSV, then operate row by row.
Example#
import csv
# set up the file object
with open("people.csv") as csvfile:
# set up the reader
reader = csv.DictReader(csvfile)
# loop through the rows
for row in reader:
# access the data in various columns
first = row["first_name"]
last = row["last_name"]
print(f"{first} {last}")
In-class exercise#
311 requests#
Who’s called 311 before?
311 data#
Today’s goal#
Which 311 complaints are most common?
Which agencies are responsible for handling them?
Pandas#
A Python package (bundled up code that you can reuse)
Very common for data science in Python
-
Both organize around “data frames”
Start by importing necessary packages#
import pandas as pd
Read and save 311 Service Requests dataset as a pandas dataframe#
We’re using a sample to make it easier/faster to work with. This will take a while (~30 seconds).
requests = pd.read_csv(
"https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip"
)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_94333/4056869583.py:1: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
requests = pd.read_csv(
Ignore the DtypeWarning
for now; we’ll come back to it.
Preview the data contents#
requests.head() # defaults to providing the first 5 if you don't specify a number
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39885889 | 08/01/2018 12:05:13 AM | 08/01/2018 12:05:13 AM | DOT | Department of Transportation | Street Condition | Pothole | NaN | 11235 | 3143 SHORE PARKWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.585156 | -73.959119 | (40.585155533520144, -73.95911915841708) |
1 | 39886470 | 08/01/2018 12:06:05 AM | 08/01/2018 12:06:05 AM | DOT | Department of Transportation | Street Condition | Pothole | NaN | 11235 | 3153 SHORE PARKWAY | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.585218 | -73.958608 | (40.58521848090658, -73.95860788382927) |
2 | 39893543 | 08/01/2018 12:06:16 AM | 08/03/2018 02:03:55 PM | HPD | Department of Housing Preservation and Develop... | HEAT/HOT WATER | ENTIRE BUILDING | RESIDENTIAL BUILDING | 11221 | 729 LAFAYETTE AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.690733 | -73.943964 | (40.69073285353906, -73.943963521266) |
3 | 39886233 | 08/01/2018 12:06:29 AM | 08/01/2018 02:54:24 AM | NYPD | New York City Police Department | Noise - Residential | Banging/Pounding | Residential Building/House | 11693 | 82-01 BEACH CHANNEL DRIVE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.589931 | -73.808896 | (40.58993080750793, -73.80889570815852) |
4 | 39880309 | 08/01/2018 12:06:51 AM | 08/01/2018 04:54:26 AM | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11216 | 64 HERKIMER STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.679716 | -73.951234 | (40.67971590505359, -73.95123396494363) |
5 rows × 41 columns
requests.tail(10) # last 10 records in the dataframe
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
499990 | 43622247 | 08/24/2019 01:33:43 AM | NaN | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11385.0 | 578 FAIRVIEW AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.707576 | -73.907325 | (40.70757578135031, -73.90732527364065) |
499991 | 43620877 | 08/24/2019 01:34:32 AM | NaN | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11201.0 | 160 NAVY WALK | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.693967 | -73.980210 | (40.6939671536727, -73.98020958205214) |
499992 | 43619232 | 08/24/2019 01:38:44 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 11238.0 | 981 DEAN STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.678030 | -73.957620 | (40.67803039848778, -73.95762012074778) |
499993 | 43626613 | 08/24/2019 01:43:57 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 10023.0 | WEST 65 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.775372 | -73.987710 | (40.7753720958196, -73.98770974232366) |
499994 | 43619756 | 08/24/2019 01:44:27 AM | NaN | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 11208.0 | 211 NICHOLS AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.685495 | -73.868876 | (40.68549515215576, -73.8688762456483) |
499995 | 43622302 | 08/24/2019 01:46:09 AM | NaN | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10009.0 | 431 EAST 9 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.727536 | -73.983295 | (40.72753608835362, -73.98329522742081) |
499996 | 43619709 | 08/24/2019 01:49:49 AM | NaN | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10304.0 | 191 BROAD STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.624157 | -74.081006 | (40.62415703282506, -74.08100614362155) |
499997 | 43623124 | 08/24/2019 01:56:35 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Music/Party | Street/Sidewalk | 10031.0 | 534 WEST 153 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.830718 | -73.945006 | (40.83071800761314, -73.94500557250639) |
499998 | 43625595 | 08/24/2019 01:56:40 AM | NaN | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10452.0 | EAST 170 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.839882 | -73.916783 | (40.839882158779105, -73.91678321635897) |
499999 | 43622817 | 08/24/2019 01:57:58 AM | NaN | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Store/Commercial | 10033.0 | 247 AUDUBON AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.846376 | -73.934048 | (40.84637632367179, -73.93404825809533) |
10 rows × 41 columns
requests.sample(5) # random sample of size determined by you
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | ... | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
311150 | 42044618 | 03/25/2019 10:16:38 PM | 03/25/2019 11:47:09 PM | NYPD | New York City Police Department | Noise - Residential | Loud Talking | Residential Building/House | 10453.0 | 1800 POPHAM AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.851389 | -73.917651 | (40.85138859491611, -73.91765079814148) |
32142 | 40107788 | 08/25/2018 10:24:40 AM | 08/25/2018 04:30:08 PM | NYPD | New York City Police Department | Blocked Driveway | No Access | Street/Sidewalk | 11419.0 | 133-05 107 AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.687721 | -73.811536 | (40.6877211401399, -73.8115361385404) |
120849 | 40692417 | 10/29/2018 07:23:28 AM | 11/01/2018 09:52:41 PM | HPD | Department of Housing Preservation and Develop... | HEAT/HOT WATER | APARTMENT ONLY | RESIDENTIAL BUILDING | 10040 | 660 FT WASHINGTON AVENUE | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.856431 | -73.936092 | (40.856431325229096, -73.93609157070478) |
471582 | 43386392 | 07/26/2019 11:17:21 PM | 07/27/2019 02:36:26 AM | NYPD | New York City Police Department | Noise - Residential | Loud Music/Party | Residential Building/House | 10303.0 | 351 HARBOR ROAD | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.627510 | -74.160446 | (40.62751009515846, -74.16044618563123) |
455213 | 43253780 | 07/11/2019 03:23:28 PM | 07/12/2019 11:25:43 AM | NYPD | New York City Police Department | Illegal Parking | Double Parked Blocking Traffic | Street/Sidewalk | 11204.0 | 61 STREET | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 41 columns
Pandas data structures#
How many records are in the dataset?#
info()
method#
requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unique Key 500000 non-null int64
1 Created Date 500000 non-null object
2 Closed Date 476156 non-null object
3 Agency 500000 non-null object
4 Agency Name 500000 non-null object
5 Complaint Type 500000 non-null object
6 Descriptor 492534 non-null object
7 Location Type 392590 non-null object
8 Incident Zip 480411 non-null object
9 Incident Address 434544 non-null object
10 Street Name 434519 non-null object
11 Cross Street 1 300838 non-null object
12 Cross Street 2 299635 non-null object
13 Intersection Street 1 107383 non-null object
14 Intersection Street 2 107048 non-null object
15 Address Type 451018 non-null object
16 City 476649 non-null object
17 Landmark 32521 non-null object
18 Facility Type 134923 non-null object
19 Status 500000 non-null object
20 Due Date 171559 non-null object
21 Resolution Description 457389 non-null object
22 Resolution Action Updated Date 488801 non-null object
23 Community Board 500000 non-null object
24 BBL 407355 non-null float64
25 Borough 500000 non-null object
26 X Coordinate (State Plane) 470830 non-null float64
27 Y Coordinate (State Plane) 470830 non-null float64
28 Open Data Channel Type 500000 non-null object
29 Park Facility Name 499973 non-null object
30 Park Borough 500000 non-null object
31 Vehicle Type 37 non-null object
32 Taxi Company Borough 403 non-null object
33 Taxi Pick Up Location 4475 non-null object
34 Bridge Highway Name 697 non-null object
35 Bridge Highway Direction 766 non-null object
36 Road Ramp 759 non-null object
37 Bridge Highway Segment 1027 non-null object
38 Latitude 470830 non-null float64
39 Longitude 470830 non-null float64
40 Location 470830 non-null object
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB
What are the distinct sets of values in columns that seem most useful?#
unique()
method for getting list of unique values#
Let’s look at the “status” column. What are the status options for these 311 complaints?
requests["Status"].unique()
array(['Closed', 'In Progress', 'Pending', 'Started', 'Open', 'Assigned',
'Email Sent'], dtype=object)
requests["Open Data Channel Type"].unique()
array(['UNKNOWN', 'MOBILE', 'PHONE', 'ONLINE', 'OTHER'], dtype=object)
requests["Agency"].unique()
array(['DOT', 'HPD', 'NYPD', 'DSNY', 'DEP', 'DOB', 'DPR', 'DOHMH', 'DHS',
'DCA', 'DOF', 'DFTA', 'TLC', 'HRA', 'EDC', 'DOE', 'DCAS', 'DOITT',
'ACS', 'DCP', 'NYCEM', 'FDNY', 'TAX', 'MOC', 'TAT', 'COIB', 'DVS'],
dtype=object)
Excluding bad records from the dataframe#
Let’s look at the list of distinct complaint types.
requests["Complaint Type"].unique()
array(['Street Condition', 'HEAT/HOT WATER', 'Noise - Residential',
'Illegal Parking', 'Request Large Bulky Item Collection', 'Noise',
'Noise - Street/Sidewalk', 'Electronics Waste Appointment',
'Blocked Driveway', 'Dirty Conditions', 'Curb Condition',
'Noise - Commercial', 'General Construction/Plumbing',
'Traffic Signal Condition', 'Street Light Condition', 'Lead',
'Street Sign - Damaged', 'Noise - Vehicle', 'New Tree Request',
'Sanitation Condition', 'Mosquitoes', 'WATER LEAK',
'UNSANITARY CONDITION', 'Root/Sewer/Sidewalk Condition',
'Dead/Dying Tree', 'Derelict Vehicles', 'Collection Truck Noise',
'Sewer', 'GENERAL', 'Overflowing Litter Baskets', 'Vacant Lot',
'Sidewalk Condition', 'PAINT/PLASTER', 'Building/Use',
'Street Sign - Dangling', 'Construction Safety Enforcement',
'PLUMBING', 'Derelict Vehicle', 'Homeless Person Assistance',
'ELECTRIC', 'Water System', 'Damaged Tree',
'Missed Collection (All Materials)', 'Animal Abuse',
'Borough Office', 'Elevator', 'Food Establishment',
'Maintenance or Facility', 'FLOORING/STAIRS', 'Consumer Complaint',
'Overgrown Tree/Branches', 'Graffiti', 'APPLIANCE',
'Other Enforcement', 'DOF Property - Payment Issue',
'Hazardous Materials', 'Water Conservation', 'Water Quality',
'Plumbing', 'DOF Property - Update Account', 'DOOR/WINDOW',
'Housing - Low Income Senior', 'BEST/Site Safety', 'SCRIE',
'Animal in a Park', 'DOF Parking - Payment Issue',
'For Hire Vehicle Complaint', 'Benefit Card Replacement',
'DOF Property - Reduction Issue',
'Special Projects Inspection Team (SPIT)', 'SAFETY',
'Taxi Complaint', 'Beach/Pool/Sauna Complaint',
'Homeless Encampment', 'DOF Property - RPIE Issue',
'Recycling Enforcement', 'DPR Internal', 'DRIE',
'Street Sign - Missing', 'Taxi Report',
'DOF Parking - Request Status', 'Air Quality',
'Broken Parking Meter', 'Mold', 'HPD Literature Request',
'Poison Ivy', 'Non-Emergency Police Matter',
'DOF Property - Request Copy', 'Housing Options',
'Illegal Tree Damage', 'Traffic', 'Noise - Park', 'Asbestos',
'DCA / DOH New License Application Request', 'Cooling Tower',
'Drinking', 'Indoor Sewage', 'Vending',
'Emergency Response Team (ERT)', 'Electrical',
'DOF Property - Owner Issue', 'Parking Card', 'Indoor Air Quality',
'Violation of Park Rules', 'Mobile Food Vendor',
'Highway Condition', 'ELEVATOR', 'Noise - Helicopter',
'Unsanitary Animal Pvt Property', 'Rodent', 'Standing Water',
'Food Poisoning', 'Litter Basket / Request', 'Day Care',
'Executive Inspections', 'DOF Parking - Tax Exemption',
'OUTSIDE BUILDING', 'DOF Parking - Request Copy',
'DOF Property - Property Value', 'School Maintenance', 'Question',
'Industrial Waste', 'Senior Center Complaint', 'Smoking',
'Quality of Life', 'Derelict Bicycle', 'Ferry Complaint',
'Sweeping/Missed', 'Ferry Inquiry', 'Sweeping/Inadequate',
'Found Property', 'LinkNYC', 'Sustainability Enforcement',
'Taxpayer Advocate Inquiry', 'Investigations and Discipline (IAD)',
'Noise - House of Worship', 'Scaffold Safety', 'Pet Shop',
'Bus Stop Shelter Complaint', 'Forms', 'Elder Abuse',
'Drug Activity', 'Unsanitary Pigeon Condition', 'Comments',
'Boilers', 'Building Marshals office',
'Home Delivered Meal - Missed Delivery',
'Bike/Roller/Skate Chronic', 'Posting Advertisement',
'Research Questions', 'Urinating in Public',
'Public Payphone Complaint', 'Unleashed Dog', 'Facades',
'Illegal Animal Sold', 'Cranes and Derricks',
'Bus Stop Shelter Placement', 'Panhandling', 'Bridge Condition',
'Illegal Animal Kept as Pet', 'OEM Literature Request',
'Advocate-Personal Exemptions', 'Abandoned Vehicle',
"Alzheimer's Care", 'Taxi Compliment', 'Disorderly Youth',
'Public Assembly', 'Advocate-Prop Refunds/Credits',
'Unsanitary Condition', 'DOF Parking - DMV Clearance',
'Home Delivered Meal Complaint',
'Transportation Provider Complaint', 'Ferry Permit',
'For Hire Vehicle Report', 'Harboring Bees/Wasps',
'Advocate - Other', 'Home Care Provider Complaint',
'Miscellaneous Categories', 'Municipal Parking Facility', 'FATF',
'Special Natural Area District (SNAD)', 'Illegal Fireworks',
'Highway Sign - Damaged', 'Drinking Water', 'Calorie Labeling',
'Case Management Agency Complaint', 'Bike Rack Condition',
'Unsanitary Animal Facility', 'Plant', 'Public Toilet',
'Tattooing', 'Animal Facility - No Permit', 'Electronics Waste',
'General', 'Snow', 'Unlicensed Dog', 'Taxi Licensee Complaint',
'FHV Licensee Complaint', 'Overflowing Recycling Baskets',
'Highway Sign - Missing', 'Non-Residential Heat',
'Special Operations', 'Safety', 'Radioactive Material',
'Dispatched Taxi Complaint', 'Lifeguard', 'Window Guard',
'Select Message Type...', 'Bereavement Support Group',
'Sweeping/Missed-Inadequate', 'Squeegee', 'Tunnel Condition',
'Foam Ban Enforcement', 'Request Xmas Tree Collection',
'Advocate-Business Tax', 'Advocate - RPIE',
'Highway Sign - Dangling', 'X-Ray Machine/Equipment',
'Advocate-Co-opCondo Abatement', 'Forensic Engineering',
'Outside Building', 'Lost Property',
'(select extractvalue(xmltyp...', "Misc. Comments' and 7702=77...",
'qfix4${695*589}lixaf', "eval(compile('for x in rang...",
'Misc. Comments"|echo 2...', 'Misc. Comments"|ping -...',
"Misc. Comments'|echo y9v6fc...", 'file:///c:/windows/win.ini',
"Misc. Comments'|ping -c 21 ...", '../../../../../../../../../...',
'\\windows\\win.ini', '..././..././..././..././......',
'idexf3mrb7)(!(objectClass=*)', 'Misc. Comments BCC:a5glxzf...',
'$(sleep 11)', '@(9313*3464)', "{!xmlparser v='<!DOCTYPE...",
'../WEB-INF/web.xml', 'Animal-Abuse', 'Missed Collection',
'DEP Street Condition', 'Green Taxi Report',
'Green Taxi Complaint', 'Bottled Water',
'Single Occupancy Bathroom', 'Appliance'], dtype=object)
Let’s make that a bit easier to read:
complaints = requests["Complaint Type"].unique()
complaints.sort()
list(complaints)
['$(sleep 11)',
'(select extractvalue(xmltyp...',
'..././..././..././..././......',
'../../../../../../../../../...',
'../WEB-INF/web.xml',
'@(9313*3464)',
'APPLIANCE',
'Abandoned Vehicle',
'Advocate - Other',
'Advocate - RPIE',
'Advocate-Business Tax',
'Advocate-Co-opCondo Abatement',
'Advocate-Personal Exemptions',
'Advocate-Prop Refunds/Credits',
'Air Quality',
"Alzheimer's Care",
'Animal Abuse',
'Animal Facility - No Permit',
'Animal in a Park',
'Animal-Abuse',
'Appliance',
'Asbestos',
'BEST/Site Safety',
'Beach/Pool/Sauna Complaint',
'Benefit Card Replacement',
'Bereavement Support Group',
'Bike Rack Condition',
'Bike/Roller/Skate Chronic',
'Blocked Driveway',
'Boilers',
'Borough Office',
'Bottled Water',
'Bridge Condition',
'Broken Parking Meter',
'Building Marshals office',
'Building/Use',
'Bus Stop Shelter Complaint',
'Bus Stop Shelter Placement',
'Calorie Labeling',
'Case Management Agency Complaint',
'Collection Truck Noise',
'Comments',
'Construction Safety Enforcement',
'Consumer Complaint',
'Cooling Tower',
'Cranes and Derricks',
'Curb Condition',
'DCA / DOH New License Application Request',
'DEP Street Condition',
'DOF Parking - DMV Clearance',
'DOF Parking - Payment Issue',
'DOF Parking - Request Copy',
'DOF Parking - Request Status',
'DOF Parking - Tax Exemption',
'DOF Property - Owner Issue',
'DOF Property - Payment Issue',
'DOF Property - Property Value',
'DOF Property - RPIE Issue',
'DOF Property - Reduction Issue',
'DOF Property - Request Copy',
'DOF Property - Update Account',
'DOOR/WINDOW',
'DPR Internal',
'DRIE',
'Damaged Tree',
'Day Care',
'Dead/Dying Tree',
'Derelict Bicycle',
'Derelict Vehicle',
'Derelict Vehicles',
'Dirty Conditions',
'Disorderly Youth',
'Dispatched Taxi Complaint',
'Drinking',
'Drinking Water',
'Drug Activity',
'ELECTRIC',
'ELEVATOR',
'Elder Abuse',
'Electrical',
'Electronics Waste',
'Electronics Waste Appointment',
'Elevator',
'Emergency Response Team (ERT)',
'Executive Inspections',
'FATF',
'FHV Licensee Complaint',
'FLOORING/STAIRS',
'Facades',
'Ferry Complaint',
'Ferry Inquiry',
'Ferry Permit',
'Foam Ban Enforcement',
'Food Establishment',
'Food Poisoning',
'For Hire Vehicle Complaint',
'For Hire Vehicle Report',
'Forensic Engineering',
'Forms',
'Found Property',
'GENERAL',
'General',
'General Construction/Plumbing',
'Graffiti',
'Green Taxi Complaint',
'Green Taxi Report',
'HEAT/HOT WATER',
'HPD Literature Request',
'Harboring Bees/Wasps',
'Hazardous Materials',
'Highway Condition',
'Highway Sign - Damaged',
'Highway Sign - Dangling',
'Highway Sign - Missing',
'Home Care Provider Complaint',
'Home Delivered Meal - Missed Delivery',
'Home Delivered Meal Complaint',
'Homeless Encampment',
'Homeless Person Assistance',
'Housing - Low Income Senior',
'Housing Options',
'Illegal Animal Kept as Pet',
'Illegal Animal Sold',
'Illegal Fireworks',
'Illegal Parking',
'Illegal Tree Damage',
'Indoor Air Quality',
'Indoor Sewage',
'Industrial Waste',
'Investigations and Discipline (IAD)',
'Lead',
'Lifeguard',
'LinkNYC',
'Litter Basket / Request',
'Lost Property',
'Maintenance or Facility',
'Misc. Comments BCC:a5glxzf...',
'Misc. Comments"|echo 2...',
'Misc. Comments"|ping -...',
"Misc. Comments' and 7702=77...",
"Misc. Comments'|echo y9v6fc...",
"Misc. Comments'|ping -c 21 ...",
'Miscellaneous Categories',
'Missed Collection',
'Missed Collection (All Materials)',
'Mobile Food Vendor',
'Mold',
'Mosquitoes',
'Municipal Parking Facility',
'New Tree Request',
'Noise',
'Noise - Commercial',
'Noise - Helicopter',
'Noise - House of Worship',
'Noise - Park',
'Noise - Residential',
'Noise - Street/Sidewalk',
'Noise - Vehicle',
'Non-Emergency Police Matter',
'Non-Residential Heat',
'OEM Literature Request',
'OUTSIDE BUILDING',
'Other Enforcement',
'Outside Building',
'Overflowing Litter Baskets',
'Overflowing Recycling Baskets',
'Overgrown Tree/Branches',
'PAINT/PLASTER',
'PLUMBING',
'Panhandling',
'Parking Card',
'Pet Shop',
'Plant',
'Plumbing',
'Poison Ivy',
'Posting Advertisement',
'Public Assembly',
'Public Payphone Complaint',
'Public Toilet',
'Quality of Life',
'Question',
'Radioactive Material',
'Recycling Enforcement',
'Request Large Bulky Item Collection',
'Request Xmas Tree Collection',
'Research Questions',
'Rodent',
'Root/Sewer/Sidewalk Condition',
'SAFETY',
'SCRIE',
'Safety',
'Sanitation Condition',
'Scaffold Safety',
'School Maintenance',
'Select Message Type...',
'Senior Center Complaint',
'Sewer',
'Sidewalk Condition',
'Single Occupancy Bathroom',
'Smoking',
'Snow',
'Special Natural Area District (SNAD)',
'Special Operations',
'Special Projects Inspection Team (SPIT)',
'Squeegee',
'Standing Water',
'Street Condition',
'Street Light Condition',
'Street Sign - Damaged',
'Street Sign - Dangling',
'Street Sign - Missing',
'Sustainability Enforcement',
'Sweeping/Inadequate',
'Sweeping/Missed',
'Sweeping/Missed-Inadequate',
'Tattooing',
'Taxi Complaint',
'Taxi Compliment',
'Taxi Licensee Complaint',
'Taxi Report',
'Taxpayer Advocate Inquiry',
'Traffic',
'Traffic Signal Condition',
'Transportation Provider Complaint',
'Tunnel Condition',
'UNSANITARY CONDITION',
'Unleashed Dog',
'Unlicensed Dog',
'Unsanitary Animal Facility',
'Unsanitary Animal Pvt Property',
'Unsanitary Condition',
'Unsanitary Pigeon Condition',
'Urinating in Public',
'Vacant Lot',
'Vending',
'Violation of Park Rules',
'WATER LEAK',
'Water Conservation',
'Water Quality',
'Water System',
'Window Guard',
'X-Ray Machine/Equipment',
'\\windows\\win.ini',
"eval(compile('for x in rang...",
'file:///c:/windows/win.ini',
'idexf3mrb7)(!(objectClass=*)',
'qfix4${695*589}lixaf',
"{!xmlparser v='<!DOCTYPE..."]
Let’s see how frequently these invalid Complaint Type values appear in the data.
Use .groupby().size()
to get the count of 311 requests per complaint type value. This is very similar to pivot tables in spreadsheets.
with pd.option_context("display.max_rows", 1000):
display(requests.groupby("Complaint Type").size())
Complaint Type
$(sleep 11) 1
(select extractvalue(xmltyp... 1
..././..././..././..././...... 1
../../../../../../../../../... 2
../WEB-INF/web.xml 1
@(9313*3464) 1
APPLIANCE 2539
Abandoned Vehicle 1655
Advocate - Other 26
Advocate - RPIE 1
Advocate-Business Tax 2
Advocate-Co-opCondo Abatement 4
Advocate-Personal Exemptions 24
Advocate-Prop Refunds/Credits 39
Air Quality 1457
Alzheimer's Care 25
Animal Abuse 1282
Animal Facility - No Permit 10
Animal in a Park 456
Animal-Abuse 261
Appliance 1
Asbestos 426
BEST/Site Safety 359
Beach/Pool/Sauna Complaint 42
Benefit Card Replacement 1649
Bereavement Support Group 1
Bike Rack Condition 20
Bike/Roller/Skate Chronic 97
Blocked Driveway 25530
Boilers 449
Borough Office 900
Bottled Water 2
Bridge Condition 64
Broken Parking Meter 2499
Building Marshals office 311
Building/Use 4869
Bus Stop Shelter Complaint 172
Bus Stop Shelter Placement 43
Calorie Labeling 4
Case Management Agency Complaint 4
Collection Truck Noise 36
Comments 16
Construction Safety Enforcement 1740
Consumer Complaint 2892
Cooling Tower 7
Cranes and Derricks 48
Curb Condition 1476
DCA / DOH New License Application Request 186
DEP Street Condition 46
DOF Parking - DMV Clearance 22
DOF Parking - Payment Issue 532
DOF Parking - Request Copy 97
DOF Parking - Request Status 94
DOF Parking - Tax Exemption 157
DOF Property - Owner Issue 280
DOF Property - Payment Issue 1092
DOF Property - Property Value 106
DOF Property - RPIE Issue 80
DOF Property - Reduction Issue 1846
DOF Property - Request Copy 1000
DOF Property - Update Account 802
DOOR/WINDOW 6479
DPR Internal 245
DRIE 233
Damaged Tree 6065
Day Care 98
Dead/Dying Tree 1995
Derelict Bicycle 457
Derelict Vehicle 6056
Derelict Vehicles 6759
Dirty Conditions 6776
Disorderly Youth 10
Dispatched Taxi Complaint 5
Drinking 311
Drinking Water 16
Drug Activity 352
ELECTRIC 4670
ELEVATOR 250
Elder Abuse 43
Electrical 551
Electronics Waste 1
Electronics Waste Appointment 5535
Elevator 3004
Emergency Response Team (ERT) 872
Executive Inspections 73
FATF 23
FHV Licensee Complaint 4
FLOORING/STAIRS 4177
Facades 72
Ferry Complaint 38
Ferry Inquiry 96
Ferry Permit 8
Foam Ban Enforcement 5
Food Establishment 2061
Food Poisoning 549
For Hire Vehicle Complaint 2789
For Hire Vehicle Report 112
Forensic Engineering 1
Forms 57
Found Property 25
GENERAL 4565
General 188
General Construction/Plumbing 9537
Graffiti 4286
Green Taxi Complaint 7
Green Taxi Report 2
HEAT/HOT WATER 39095
HPD Literature Request 405
Harboring Bees/Wasps 31
Hazardous Materials 549
Highway Condition 628
Highway Sign - Damaged 20
Highway Sign - Dangling 3
Highway Sign - Missing 12
Home Care Provider Complaint 4
Home Delivered Meal - Missed Delivery 15
Home Delivered Meal Complaint 13
Homeless Encampment 977
Homeless Person Assistance 3670
Housing - Low Income Senior 768
Housing Options 251
Illegal Animal Kept as Pet 37
Illegal Animal Sold 7
Illegal Fireworks 138
Illegal Parking 34297
Illegal Tree Damage 642
Indoor Air Quality 848
Indoor Sewage 214
Industrial Waste 263
Investigations and Discipline (IAD) 152
Lead 1825
Lifeguard 7
LinkNYC 40
Litter Basket / Request 463
Lost Property 472
Maintenance or Facility 3055
Misc. Comments BCC:a5glxzf... 1
Misc. Comments"|echo 2... 1
Misc. Comments"|ping -... 1
Misc. Comments' and 7702=77... 1
Misc. Comments'|echo y9v6fc... 1
Misc. Comments'|ping -c 21 ... 1
Miscellaneous Categories 34
Missed Collection 55
Missed Collection (All Materials) 8340
Mobile Food Vendor 383
Mold 81
Mosquitoes 139
Municipal Parking Facility 15
New Tree Request 3715
Noise 10937
Noise - Commercial 7894
Noise - Helicopter 303
Noise - House of Worship 148
Noise - Park 792
Noise - Residential 41311
Noise - Street/Sidewalk 16876
Noise - Vehicle 7244
Non-Emergency Police Matter 1628
Non-Residential Heat 169
OEM Literature Request 25
OUTSIDE BUILDING 150
Other Enforcement 1349
Outside Building 2
Overflowing Litter Baskets 140
Overflowing Recycling Baskets 1
Overgrown Tree/Branches 2935
PAINT/PLASTER 9877
PLUMBING 9007
Panhandling 96
Parking Card 2
Pet Shop 28
Plant 22
Plumbing 1482
Poison Ivy 47
Posting Advertisement 23
Public Assembly 1
Public Payphone Complaint 75
Public Toilet 9
Quality of Life 65
Question 171
Radioactive Material 3
Recycling Enforcement 134
Request Large Bulky Item Collection 30939
Request Xmas Tree Collection 2
Research Questions 4
Rodent 6253
Root/Sewer/Sidewalk Condition 1485
SAFETY 1668
SCRIE 770
Safety 65
Sanitation Condition 7151
Scaffold Safety 38
School Maintenance 442
Select Message Type... 1
Senior Center Complaint 52
Sewer 6757
Sidewalk Condition 6488
Single Occupancy Bathroom 4
Smoking 456
Snow 297
Special Natural Area District (SNAD) 14
Special Operations 39
Special Projects Inspection Team (SPIT) 1168
Squeegee 1
Standing Water 484
Street Condition 16895
Street Light Condition 13425
Street Sign - Damaged 1176
Street Sign - Dangling 482
Street Sign - Missing 1080
Sustainability Enforcement 247
Sweeping/Inadequate 55
Sweeping/Missed 174
Sweeping/Missed-Inadequate 11
Tattooing 21
Taxi Complaint 2416
Taxi Compliment 41
Taxi Licensee Complaint 5
Taxi Report 318
Taxpayer Advocate Inquiry 19
Traffic 868
Traffic Signal Condition 7306
Transportation Provider Complaint 7
Tunnel Condition 1
UNSANITARY CONDITION 14913
Unleashed Dog 128
Unlicensed Dog 6
Unsanitary Animal Facility 15
Unsanitary Animal Pvt Property 327
Unsanitary Condition 942
Unsanitary Pigeon Condition 103
Urinating in Public 112
Vacant Lot 318
Vending 873
Violation of Park Rules 492
WATER LEAK 6641
Water Conservation 853
Water Quality 332
Water System 12949
Window Guard 6
X-Ray Machine/Equipment 1
\windows\win.ini 1
eval(compile('for x in rang... 1
file:///c:/windows/win.ini 1
idexf3mrb7)(!(objectClass=*) 1
qfix4${695*589}lixaf 1
{!xmlparser v='<!DOCTYPE... 1
dtype: int64
with pd.option_context("display.max_rows", 1000):
display(...)
What this code is doing: showing all cells in a DataFrame with rich output.
requests.groupby('Complaint Type').size()
What this code is doing:
Group the records in the dataset based on their
Complaint Type
valueCount the records that have been grouped together by their shared
Complaint Type
value
How should we find junk records to delete?
It looks like most invalid complaint types only have a few records. Try excluding all complaint type categories with < 3 records, assuming that all complaint type categories with < 3 instances in the data are bad data entries.
Why 3? It’s arbitrary. We’re looking for trends in the data, and in this case we don’t care about low frequency entries.
Create a dataframe that captures the count of records per Complaint Type
value.
counts = requests.groupby("Complaint Type").size().reset_index(name="count")
counts
# .reset_index(name='count') allows us to name the new column that contains the count of rows
Complaint Type | count | |
---|---|---|
0 | $(sleep 11) | 1 |
1 | (select extractvalue(xmltyp... | 1 |
2 | ..././..././..././..././...... | 1 |
3 | ../../../../../../../../../... | 2 |
4 | ../WEB-INF/web.xml | 1 |
... | ... | ... |
243 | eval(compile('for x in rang... | 1 |
244 | file:///c:/windows/win.ini | 1 |
245 | idexf3mrb7)(!(objectClass=*) | 1 |
246 | qfix4${695*589}lixaf | 1 |
247 | {!xmlparser v='<!DOCTYPE... | 1 |
248 rows × 2 columns
You can also use .count()
but the output is a little different.
Let’s create a “series” that only lists the Complaint Type
values that have record counts >= 3.
Remember: A single column from a pandas dataframe is called a series. It’s essentially a list containing all the values in the column.
requests.groupby('Complaint Type').size().reset_index(name='size')
is equivalent to:
requests.groupby('Complaint Type', as_index=False).size()
Without the reset_index()
or as_index=False
, the Complaint Type
values are returned as the index. See also: value_counts()
.
valid_complaint_types = counts["Complaint Type"][counts["count"] >= 3]
valid_complaint_types
6 APPLIANCE
7 Abandoned Vehicle
8 Advocate - Other
11 Advocate-Co-opCondo Abatement
12 Advocate-Personal Exemptions
...
236 WATER LEAK
237 Water Conservation
238 Water Quality
239 Water System
240 Window Guard
Name: Complaint Type, Length: 213, dtype: object
Filter our requests
dataframe to only keep the rows where the Complaint Type
value is in the valid_complaint_types
series we created in the previous step. Save the result in a new dataframe.
df_cleaned = requests[requests["Complaint Type"].isin(valid_complaint_types)]
How can we make sure this worked? Let’s check how many records there were originally in requests
vs how many are in df_cleaned
.
Before:
requests["Unique Key"].size
500000
After:
df_cleaned["Unique Key"].size
499958
We can also print the set of complaint_type values from our cleaned dataframe to make sure they look correct.
df_cleaned["Complaint Type"].sample(10)
349480 Borough Office
254578 PLUMBING
198644 Request Large Bulky Item Collection
319316 Street Light Condition
29580 General Construction/Plumbing
398404 Street Light Condition
303870 HPD Literature Request
86399 Sewer
22631 Noise - Street/Sidewalk
398130 Illegal Parking
Name: Complaint Type, dtype: object
Great, now those invalid records will be excluded from our analysis!
Another approach to excluding those invalid records would be to use regular expressions (“RegExes”) to find records with weird characters.
Filtering rows#
Slicing and dicing is done through indexing.
Boolean indexing#
The table on the right is missing the state
column - need to fix.
How it works#
Done with clean up! Time for the actual analysis:#
Which 311 complaints are most common and which agencies are responsible for handling them?#
Which complaints are the most common?#
df_cleaned.groupby("Complaint Type").size().reset_index(name="count")
Complaint Type | count | |
---|---|---|
0 | APPLIANCE | 2539 |
1 | Abandoned Vehicle | 1655 |
2 | Advocate - Other | 26 |
3 | Advocate-Co-opCondo Abatement | 4 |
4 | Advocate-Personal Exemptions | 24 |
... | ... | ... |
208 | WATER LEAK | 6641 |
209 | Water Conservation | 853 |
210 | Water Quality | 332 |
211 | Water System | 12949 |
212 | Window Guard | 6 |
213 rows × 2 columns
.reset_index(name='count')
isn’t necessary, but it’s helpful to include because it allows us to name the new column that contains the count of rows.
Which agencies are responsible for handling these complaint categories?#
df_cleaned.groupby(["Agency", "Complaint Type"]).size().reset_index(name="count")
Agency | Complaint Type | count | |
---|---|---|---|
0 | ACS | Forms | 56 |
1 | COIB | Forms | 1 |
2 | DCA | Consumer Complaint | 2892 |
3 | DCA | DCA / DOH New License Application Request | 186 |
4 | DCAS | Comments | 13 |
... | ... | ... | ... |
216 | TLC | Lost Property | 472 |
217 | TLC | Taxi Complaint | 2416 |
218 | TLC | Taxi Compliment | 41 |
219 | TLC | Taxi Licensee Complaint | 5 |
220 | TLC | Taxi Report | 318 |
221 rows × 3 columns
Which agencies receive the most total 311 requests?#
df_cleaned.groupby("Agency").size().reset_index(name="count")
Agency | count | |
---|---|---|
0 | ACS | 56 |
1 | COIB | 1 |
2 | DCA | 3078 |
3 | DCAS | 149 |
4 | DCP | 4 |
5 | DEP | 36283 |
6 | DFTA | 1182 |
7 | DHS | 3670 |
8 | DOB | 26024 |
9 | DOE | 442 |
10 | DOF | 7223 |
11 | DOHMH | 12690 |
12 | DOITT | 115 |
13 | DOT | 52001 |
14 | DPR | 21086 |
15 | DSNY | 73227 |
16 | EDC | 303 |
17 | HPD | 105634 |
18 | HRA | 1649 |
19 | NYCEM | 25 |
20 | NYPD | 148884 |
21 | TAT | 1 |
22 | TAX | 37 |
23 | TLC | 6194 |
What is the most frequent request per agency?#
First, create a dataframe that contains the count of complaints per Agency
per Complaint Type
.
agency_counts = df_cleaned.groupby(["Agency", "Complaint Type"]).size().reset_index(name="count")
agency_counts.head(20)
Agency | Complaint Type | count | |
---|---|---|---|
0 | ACS | Forms | 56 |
1 | COIB | Forms | 1 |
2 | DCA | Consumer Complaint | 2892 |
3 | DCA | DCA / DOH New License Application Request | 186 |
4 | DCAS | Comments | 13 |
5 | DCAS | Question | 136 |
6 | DCP | Research Questions | 4 |
7 | DEP | Air Quality | 1457 |
8 | DEP | Asbestos | 316 |
9 | DEP | FATF | 23 |
10 | DEP | Hazardous Materials | 549 |
11 | DEP | Industrial Waste | 263 |
12 | DEP | Lead | 1825 |
13 | DEP | Noise | 10937 |
14 | DEP | Plant | 22 |
15 | DEP | Sewer | 6757 |
16 | DEP | Water Conservation | 853 |
17 | DEP | Water Quality | 332 |
18 | DEP | Water System | 12949 |
19 | DFTA | Alzheimer's Care | 25 |
Use drop_duplicates()
to keep the row with the highest value per Agency
.
sorted_agency_counts = agency_counts.sort_values("count", ascending=False)
sorted_agency_counts
Agency | Complaint Type | count | |
---|---|---|---|
198 | NYPD | Noise - Residential | 41311 |
169 | HPD | HEAT/HOT WATER | 39095 |
194 | NYPD | Illegal Parking | 34297 |
154 | DSNY | Request Large Bulky Item Collection | 30939 |
186 | NYPD | Blocked Driveway | 25530 |
... | ... | ... | ... |
119 | DOT | Highway Sign - Dangling | 3 |
94 | DOHMH | Radioactive Material | 3 |
139 | DPR | Unsanitary Condition | 1 |
207 | TAT | Question | 1 |
1 | COIB | Forms | 1 |
221 rows × 3 columns
top_agency_counts = sorted_agency_counts.drop_duplicates("Agency")
top_agency_counts
Agency | Complaint Type | count | |
---|---|---|---|
198 | NYPD | Noise - Residential | 41311 |
169 | HPD | HEAT/HOT WATER | 39095 |
154 | DSNY | Request Large Bulky Item Collection | 30939 |
124 | DOT | Street Condition | 16895 |
18 | DEP | Water System | 12949 |
42 | DOB | General Construction/Plumbing | 9537 |
95 | DOHMH | Rodent | 6253 |
132 | DPR | Damaged Tree | 6065 |
29 | DHS | Homeless Person Assistance | 3670 |
2 | DCA | Consumer Complaint | 2892 |
212 | TLC | For Hire Vehicle Complaint | 2789 |
67 | DOF | DOF Property - Reduction Issue | 1846 |
180 | HRA | Benefit Card Replacement | 1649 |
25 | DFTA | Housing - Low Income Senior | 768 |
53 | DOE | School Maintenance | 442 |
161 | EDC | Noise - Helicopter | 303 |
5 | DCAS | Question | 136 |
106 | DOITT | Public Payphone Complaint | 75 |
0 | ACS | Forms | 56 |
209 | TAX | Question | 34 |
181 | NYCEM | OEM Literature Request | 25 |
6 | DCP | Research Questions | 4 |
207 | TAT | Question | 1 |
1 | COIB | Forms | 1 |
For nicer display, order alphabetically:
top_agency_counts.sort_values("Agency")
Agency | Complaint Type | count | |
---|---|---|---|
0 | ACS | Forms | 56 |
1 | COIB | Forms | 1 |
2 | DCA | Consumer Complaint | 2892 |
5 | DCAS | Question | 136 |
6 | DCP | Research Questions | 4 |
18 | DEP | Water System | 12949 |
25 | DFTA | Housing - Low Income Senior | 768 |
29 | DHS | Homeless Person Assistance | 3670 |
42 | DOB | General Construction/Plumbing | 9537 |
53 | DOE | School Maintenance | 442 |
67 | DOF | DOF Property - Reduction Issue | 1846 |
95 | DOHMH | Rodent | 6253 |
106 | DOITT | Public Payphone Complaint | 75 |
124 | DOT | Street Condition | 16895 |
132 | DPR | Damaged Tree | 6065 |
154 | DSNY | Request Large Bulky Item Collection | 30939 |
161 | EDC | Noise - Helicopter | 303 |
169 | HPD | HEAT/HOT WATER | 39095 |
180 | HRA | Benefit Card Replacement | 1649 |
181 | NYCEM | OEM Literature Request | 25 |
198 | NYPD | Noise - Residential | 41311 |
207 | TAT | Question | 1 |
209 | TAX | Question | 34 |
212 | TLC | For Hire Vehicle Complaint | 2789 |
Another way, only sorting it once:
agency_counts.sort_values(["Agency", "count"]).drop_duplicates("Agency", keep="last")
Agency | Complaint Type | count | |
---|---|---|---|
0 | ACS | Forms | 56 |
1 | COIB | Forms | 1 |
2 | DCA | Consumer Complaint | 2892 |
5 | DCAS | Question | 136 |
6 | DCP | Research Questions | 4 |
18 | DEP | Water System | 12949 |
25 | DFTA | Housing - Low Income Senior | 768 |
29 | DHS | Homeless Person Assistance | 3670 |
42 | DOB | General Construction/Plumbing | 9537 |
53 | DOE | School Maintenance | 442 |
67 | DOF | DOF Property - Reduction Issue | 1846 |
95 | DOHMH | Rodent | 6253 |
106 | DOITT | Public Payphone Complaint | 75 |
124 | DOT | Street Condition | 16895 |
132 | DPR | Damaged Tree | 6065 |
154 | DSNY | Request Large Bulky Item Collection | 30939 |
161 | EDC | Noise - Helicopter | 303 |
169 | HPD | HEAT/HOT WATER | 39095 |
180 | HRA | Benefit Card Replacement | 1649 |
181 | NYCEM | OEM Literature Request | 25 |
198 | NYPD | Noise - Residential | 41311 |
207 | TAT | Question | 1 |
209 | TAX | Question | 34 |
212 | TLC | For Hire Vehicle Complaint | 2789 |
Some pandas/Jupyter best practices#
Make variable names descriptive
Ignore that all examples use
requests
Only do one thing per line
Makes troubleshooting easier
Make notebooks idempotent
Makes your work reproducible
Use
Restart and run all
(⏩ button in toolbar)