# a bit of setup just for lecture; ignore me
import numpy as np

np.random.seed(5)

Class 1: Working with data#

  • Questions?

  • How was the homework? 👍/👎

Fundamental programming concepts#

Variables and functions#

Also:

  • Function arguments are basically variables within that function, which get assigned to whatever is passed in

  • Functions == methods

  • Arguments == parameters

For simplicity, we’ll use them interchangeably.

Packages#

  • a.k.a. “libraries” or “modules”

  • Developers have create them to make code/functionality reusable and easily sharable

  • Software plugins that you import

  • Main packages we’ll use:

    • pandas

    • plotly

Working with CSVs in pure Python#

We will use Python’s CSV DictReader. We’ll open the file, parse it as a CSV, then operate row by row.

Example#

import csv


# set up the file object
with open("people.csv") as csvfile:
    # set up the reader
    reader = csv.DictReader(csvfile)
    # loop through the rows
    for row in reader:
        # access the data in various columns
        first = row["first_name"]
        last = row["last_name"]

        print(f"{first} {last}")

In-class exercise#

311 requests#

Who’s called 311 before?

NYC 311 homepage

311 data#

Today’s goal#

  • Which 311 complaints are most common?

  • Which agencies are responsible for handling them?

Pandas#

  • A Python package (bundled up code that you can reuse)

  • Very common for data science in Python

  • A lot like R

    • Both organize around “data frames”

Start by importing necessary packages#

import pandas as pd

Read and save 311 Service Requests dataset as a pandas dataframe#

We’re using a sample to make it easier/faster to work with. This will take a while (~30 seconds).

requests = pd.read_csv(
    "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip"
)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_27038/4056869583.py:1: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(

Ignore the DtypeWarning for now; we’ll come back to it.

Preview the data contents#

requests.head()  # defaults to providing the first 5 if you don't specify a number
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Latitude Longitude Location
0 39885889 08/01/2018 12:05:13 AM 08/01/2018 12:05:13 AM DOT Department of Transportation Street Condition Pothole NaN 11235 3143 SHORE PARKWAY ... NaN NaN NaN NaN NaN NaN NaN 40.585156 -73.959119 (40.585155533520144, -73.95911915841708)
1 39886470 08/01/2018 12:06:05 AM 08/01/2018 12:06:05 AM DOT Department of Transportation Street Condition Pothole NaN 11235 3153 SHORE PARKWAY ... NaN NaN NaN NaN NaN NaN NaN 40.585218 -73.958608 (40.58521848090658, -73.95860788382927)
2 39893543 08/01/2018 12:06:16 AM 08/03/2018 02:03:55 PM HPD Department of Housing Preservation and Develop... HEAT/HOT WATER ENTIRE BUILDING RESIDENTIAL BUILDING 11221 729 LAFAYETTE AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.690733 -73.943964 (40.69073285353906, -73.943963521266)
3 39886233 08/01/2018 12:06:29 AM 08/01/2018 02:54:24 AM NYPD New York City Police Department Noise - Residential Banging/Pounding Residential Building/House 11693 82-01 BEACH CHANNEL DRIVE ... NaN NaN NaN NaN NaN NaN NaN 40.589931 -73.808896 (40.58993080750793, -73.80889570815852)
4 39880309 08/01/2018 12:06:51 AM 08/01/2018 04:54:26 AM NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 11216 64 HERKIMER STREET ... NaN NaN NaN NaN NaN NaN NaN 40.679716 -73.951234 (40.67971590505359, -73.95123396494363)

5 rows × 41 columns

requests.tail(10)  # last 10 records in the dataframe
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Latitude Longitude Location
499990 43622247 08/24/2019 01:33:43 AM NaN NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 11385.0 578 FAIRVIEW AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.707576 -73.907325 (40.70757578135031, -73.90732527364065)
499991 43620877 08/24/2019 01:34:32 AM NaN NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 11201.0 160 NAVY WALK ... NaN NaN NaN NaN NaN NaN NaN 40.693967 -73.980210 (40.6939671536727, -73.98020958205214)
499992 43619232 08/24/2019 01:38:44 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Music/Party Street/Sidewalk 11238.0 981 DEAN STREET ... NaN NaN NaN NaN NaN NaN NaN 40.678030 -73.957620 (40.67803039848778, -73.95762012074778)
499993 43626613 08/24/2019 01:43:57 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 10023.0 WEST 65 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.775372 -73.987710 (40.7753720958196, -73.98770974232366)
499994 43619756 08/24/2019 01:44:27 AM NaN NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 11208.0 211 NICHOLS AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.685495 -73.868876 (40.68549515215576, -73.8688762456483)
499995 43622302 08/24/2019 01:46:09 AM NaN NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 10009.0 431 EAST 9 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.727536 -73.983295 (40.72753608835362, -73.98329522742081)
499996 43619709 08/24/2019 01:49:49 AM NaN NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 10304.0 191 BROAD STREET ... NaN NaN NaN NaN NaN NaN NaN 40.624157 -74.081006 (40.62415703282506, -74.08100614362155)
499997 43623124 08/24/2019 01:56:35 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Music/Party Street/Sidewalk 10031.0 534 WEST 153 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.830718 -73.945006 (40.83071800761314, -73.94500557250639)
499998 43625595 08/24/2019 01:56:40 AM NaN NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 10452.0 EAST 170 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.839882 -73.916783 (40.839882158779105, -73.91678321635897)
499999 43622817 08/24/2019 01:57:58 AM NaN NYPD New York City Police Department Noise - Commercial Loud Music/Party Store/Commercial 10033.0 247 AUDUBON AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.846376 -73.934048 (40.84637632367179, -73.93404825809533)

10 rows × 41 columns

requests.sample(5)  # random sample of size determined by you
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Latitude Longitude Location
311150 42044618 03/25/2019 10:16:38 PM 03/25/2019 11:47:09 PM NYPD New York City Police Department Noise - Residential Loud Talking Residential Building/House 10453.0 1800 POPHAM AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.851389 -73.917651 (40.85138859491611, -73.91765079814148)
32142 40107788 08/25/2018 10:24:40 AM 08/25/2018 04:30:08 PM NYPD New York City Police Department Blocked Driveway No Access Street/Sidewalk 11419.0 133-05 107 AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.687721 -73.811536 (40.6877211401399, -73.8115361385404)
120849 40692417 10/29/2018 07:23:28 AM 11/01/2018 09:52:41 PM HPD Department of Housing Preservation and Develop... HEAT/HOT WATER APARTMENT ONLY RESIDENTIAL BUILDING 10040 660 FT WASHINGTON AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.856431 -73.936092 (40.856431325229096, -73.93609157070478)
471582 43386392 07/26/2019 11:17:21 PM 07/27/2019 02:36:26 AM NYPD New York City Police Department Noise - Residential Loud Music/Party Residential Building/House 10303.0 351 HARBOR ROAD ... NaN NaN NaN NaN NaN NaN NaN 40.627510 -74.160446 (40.62751009515846, -74.16044618563123)
455213 43253780 07/11/2019 03:23:28 PM 07/12/2019 11:25:43 AM NYPD New York City Police Department Illegal Parking Double Parked Blocking Traffic Street/Sidewalk 11204.0 61 STREET ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 41 columns

Pandas data structures#

Diagram showing a DataFrame, Series, labels, and indexes

How many records are in the dataset?#

info() method#

requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      500000 non-null  int64  
 1   Created Date                    500000 non-null  object 
 2   Closed Date                     476156 non-null  object 
 3   Agency                          500000 non-null  object 
 4   Agency Name                     500000 non-null  object 
 5   Complaint Type                  500000 non-null  object 
 6   Descriptor                      492534 non-null  object 
 7   Location Type                   392590 non-null  object 
 8   Incident Zip                    480411 non-null  object 
 9   Incident Address                434544 non-null  object 
 10  Street Name                     434519 non-null  object 
 11  Cross Street 1                  300838 non-null  object 
 12  Cross Street 2                  299635 non-null  object 
 13  Intersection Street 1           107383 non-null  object 
 14  Intersection Street 2           107048 non-null  object 
 15  Address Type                    451018 non-null  object 
 16  City                            476649 non-null  object 
 17  Landmark                        32521 non-null   object 
 18  Facility Type                   134923 non-null  object 
 19  Status                          500000 non-null  object 
 20  Due Date                        171559 non-null  object 
 21  Resolution Description          457389 non-null  object 
 22  Resolution Action Updated Date  488801 non-null  object 
 23  Community Board                 500000 non-null  object 
 24  BBL                             407355 non-null  float64
 25  Borough                         500000 non-null  object 
 26  X Coordinate (State Plane)      470830 non-null  float64
 27  Y Coordinate (State Plane)      470830 non-null  float64
 28  Open Data Channel Type          500000 non-null  object 
 29  Park Facility Name              499973 non-null  object 
 30  Park Borough                    500000 non-null  object 
 31  Vehicle Type                    37 non-null      object 
 32  Taxi Company Borough            403 non-null     object 
 33  Taxi Pick Up Location           4475 non-null    object 
 34  Bridge Highway Name             697 non-null     object 
 35  Bridge Highway Direction        766 non-null     object 
 36  Road Ramp                       759 non-null     object 
 37  Bridge Highway Segment          1027 non-null    object 
 38  Latitude                        470830 non-null  float64
 39  Longitude                       470830 non-null  float64
 40  Location                        470830 non-null  object 
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB

What are the distinct sets of values in columns that seem most useful?#

unique() method for getting list of unique values#

Let’s look at the “status” column. What are the status options for these 311 complaints?

requests["Status"].unique()
array(['Closed', 'In Progress', 'Pending', 'Started', 'Open', 'Assigned',
       'Email Sent'], dtype=object)
requests["Open Data Channel Type"].unique()
array(['UNKNOWN', 'MOBILE', 'PHONE', 'ONLINE', 'OTHER'], dtype=object)
requests["Agency"].unique()
array(['DOT', 'HPD', 'NYPD', 'DSNY', 'DEP', 'DOB', 'DPR', 'DOHMH', 'DHS',
       'DCA', 'DOF', 'DFTA', 'TLC', 'HRA', 'EDC', 'DOE', 'DCAS', 'DOITT',
       'ACS', 'DCP', 'NYCEM', 'FDNY', 'TAX', 'MOC', 'TAT', 'COIB', 'DVS'],
      dtype=object)

Excluding bad records from the dataframe#

Let’s look at the list of distinct complaint types.

requests["Complaint Type"].unique()
array(['Street Condition', 'HEAT/HOT WATER', 'Noise - Residential',
       'Illegal Parking', 'Request Large Bulky Item Collection', 'Noise',
       'Noise - Street/Sidewalk', 'Electronics Waste Appointment',
       'Blocked Driveway', 'Dirty Conditions', 'Curb Condition',
       'Noise - Commercial', 'General Construction/Plumbing',
       'Traffic Signal Condition', 'Street Light Condition', 'Lead',
       'Street Sign - Damaged', 'Noise - Vehicle', 'New Tree Request',
       'Sanitation Condition', 'Mosquitoes', 'WATER LEAK',
       'UNSANITARY CONDITION', 'Root/Sewer/Sidewalk Condition',
       'Dead/Dying Tree', 'Derelict Vehicles', 'Collection Truck Noise',
       'Sewer', 'GENERAL', 'Overflowing Litter Baskets', 'Vacant Lot',
       'Sidewalk Condition', 'PAINT/PLASTER', 'Building/Use',
       'Street Sign - Dangling', 'Construction Safety Enforcement',
       'PLUMBING', 'Derelict Vehicle', 'Homeless Person Assistance',
       'ELECTRIC', 'Water System', 'Damaged Tree',
       'Missed Collection (All Materials)', 'Animal Abuse',
       'Borough Office', 'Elevator', 'Food Establishment',
       'Maintenance or Facility', 'FLOORING/STAIRS', 'Consumer Complaint',
       'Overgrown Tree/Branches', 'Graffiti', 'APPLIANCE',
       'Other Enforcement', 'DOF Property - Payment Issue',
       'Hazardous Materials', 'Water Conservation', 'Water Quality',
       'Plumbing', 'DOF Property - Update Account', 'DOOR/WINDOW',
       'Housing - Low Income Senior', 'BEST/Site Safety', 'SCRIE',
       'Animal in a Park', 'DOF Parking - Payment Issue',
       'For Hire Vehicle Complaint', 'Benefit Card Replacement',
       'DOF Property - Reduction Issue',
       'Special Projects Inspection Team (SPIT)', 'SAFETY',
       'Taxi Complaint', 'Beach/Pool/Sauna Complaint',
       'Homeless Encampment', 'DOF Property - RPIE Issue',
       'Recycling Enforcement', 'DPR Internal', 'DRIE',
       'Street Sign - Missing', 'Taxi Report',
       'DOF Parking - Request Status', 'Air Quality',
       'Broken Parking Meter', 'Mold', 'HPD Literature Request',
       'Poison Ivy', 'Non-Emergency Police Matter',
       'DOF Property - Request Copy', 'Housing Options',
       'Illegal Tree Damage', 'Traffic', 'Noise - Park', 'Asbestos',
       'DCA / DOH New License Application Request', 'Cooling Tower',
       'Drinking', 'Indoor Sewage', 'Vending',
       'Emergency Response Team (ERT)', 'Electrical',
       'DOF Property - Owner Issue', 'Parking Card', 'Indoor Air Quality',
       'Violation of Park Rules', 'Mobile Food Vendor',
       'Highway Condition', 'ELEVATOR', 'Noise - Helicopter',
       'Unsanitary Animal Pvt Property', 'Rodent', 'Standing Water',
       'Food Poisoning', 'Litter Basket / Request', 'Day Care',
       'Executive Inspections', 'DOF Parking - Tax Exemption',
       'OUTSIDE BUILDING', 'DOF Parking - Request Copy',
       'DOF Property - Property Value', 'School Maintenance', 'Question',
       'Industrial Waste', 'Senior Center Complaint', 'Smoking',
       'Quality of Life', 'Derelict Bicycle', 'Ferry Complaint',
       'Sweeping/Missed', 'Ferry Inquiry', 'Sweeping/Inadequate',
       'Found Property', 'LinkNYC', 'Sustainability Enforcement',
       'Taxpayer Advocate Inquiry', 'Investigations and Discipline (IAD)',
       'Noise - House of Worship', 'Scaffold Safety', 'Pet Shop',
       'Bus Stop Shelter Complaint', 'Forms', 'Elder Abuse',
       'Drug Activity', 'Unsanitary Pigeon Condition', 'Comments',
       'Boilers', 'Building Marshals office',
       'Home Delivered Meal - Missed Delivery',
       'Bike/Roller/Skate Chronic', 'Posting Advertisement',
       'Research Questions', 'Urinating in Public',
       'Public Payphone Complaint', 'Unleashed Dog', 'Facades',
       'Illegal Animal Sold', 'Cranes and Derricks',
       'Bus Stop Shelter Placement', 'Panhandling', 'Bridge Condition',
       'Illegal Animal Kept as Pet', 'OEM Literature Request',
       'Advocate-Personal Exemptions', 'Abandoned Vehicle',
       "Alzheimer's Care", 'Taxi Compliment', 'Disorderly Youth',
       'Public Assembly', 'Advocate-Prop Refunds/Credits',
       'Unsanitary Condition', 'DOF Parking - DMV Clearance',
       'Home Delivered Meal Complaint',
       'Transportation Provider Complaint', 'Ferry Permit',
       'For Hire Vehicle Report', 'Harboring Bees/Wasps',
       'Advocate - Other', 'Home Care Provider Complaint',
       'Miscellaneous Categories', 'Municipal Parking Facility', 'FATF',
       'Special Natural Area District (SNAD)', 'Illegal Fireworks',
       'Highway Sign - Damaged', 'Drinking Water', 'Calorie Labeling',
       'Case Management Agency Complaint', 'Bike Rack Condition',
       'Unsanitary Animal Facility', 'Plant', 'Public Toilet',
       'Tattooing', 'Animal Facility - No Permit', 'Electronics Waste',
       'General', 'Snow', 'Unlicensed Dog', 'Taxi Licensee Complaint',
       'FHV Licensee Complaint', 'Overflowing Recycling Baskets',
       'Highway Sign - Missing', 'Non-Residential Heat',
       'Special Operations', 'Safety', 'Radioactive Material',
       'Dispatched Taxi Complaint', 'Lifeguard', 'Window Guard',
       'Select Message Type...', 'Bereavement Support Group',
       'Sweeping/Missed-Inadequate', 'Squeegee', 'Tunnel Condition',
       'Foam Ban Enforcement', 'Request Xmas Tree Collection',
       'Advocate-Business Tax', 'Advocate - RPIE',
       'Highway Sign - Dangling', 'X-Ray Machine/Equipment',
       'Advocate-Co-opCondo Abatement', 'Forensic Engineering',
       'Outside Building', 'Lost Property',
       '(select extractvalue(xmltyp...', "Misc. Comments' and 7702=77...",
       'qfix4${695*589}lixaf', "eval(compile('for x in rang...",
       'Misc. Comments"|echo 2...', 'Misc. Comments"|ping -...',
       "Misc. Comments'|echo y9v6fc...", 'file:///c:/windows/win.ini',
       "Misc. Comments'|ping -c 21 ...", '../../../../../../../../../...',
       '\\windows\\win.ini', '..././..././..././..././......',
       'idexf3mrb7)(!(objectClass=*)', 'Misc. Comments  BCC:a5glxzf...',
       '$(sleep 11)', '@(9313*3464)', "{!xmlparser v='<!DOCTYPE...",
       '../WEB-INF/web.xml', 'Animal-Abuse', 'Missed Collection',
       'DEP Street Condition', 'Green Taxi Report',
       'Green Taxi Complaint', 'Bottled Water',
       'Single Occupancy Bathroom', 'Appliance'], dtype=object)

Let’s make that a bit easier to read:

complaints = requests["Complaint Type"].unique()
complaints.sort()
list(complaints)
['$(sleep 11)',
 '(select extractvalue(xmltyp...',
 '..././..././..././..././......',
 '../../../../../../../../../...',
 '../WEB-INF/web.xml',
 '@(9313*3464)',
 'APPLIANCE',
 'Abandoned Vehicle',
 'Advocate - Other',
 'Advocate - RPIE',
 'Advocate-Business Tax',
 'Advocate-Co-opCondo Abatement',
 'Advocate-Personal Exemptions',
 'Advocate-Prop Refunds/Credits',
 'Air Quality',
 "Alzheimer's Care",
 'Animal Abuse',
 'Animal Facility - No Permit',
 'Animal in a Park',
 'Animal-Abuse',
 'Appliance',
 'Asbestos',
 'BEST/Site Safety',
 'Beach/Pool/Sauna Complaint',
 'Benefit Card Replacement',
 'Bereavement Support Group',
 'Bike Rack Condition',
 'Bike/Roller/Skate Chronic',
 'Blocked Driveway',
 'Boilers',
 'Borough Office',
 'Bottled Water',
 'Bridge Condition',
 'Broken Parking Meter',
 'Building Marshals office',
 'Building/Use',
 'Bus Stop Shelter Complaint',
 'Bus Stop Shelter Placement',
 'Calorie Labeling',
 'Case Management Agency Complaint',
 'Collection Truck Noise',
 'Comments',
 'Construction Safety Enforcement',
 'Consumer Complaint',
 'Cooling Tower',
 'Cranes and Derricks',
 'Curb Condition',
 'DCA / DOH New License Application Request',
 'DEP Street Condition',
 'DOF Parking - DMV Clearance',
 'DOF Parking - Payment Issue',
 'DOF Parking - Request Copy',
 'DOF Parking - Request Status',
 'DOF Parking - Tax Exemption',
 'DOF Property - Owner Issue',
 'DOF Property - Payment Issue',
 'DOF Property - Property Value',
 'DOF Property - RPIE Issue',
 'DOF Property - Reduction Issue',
 'DOF Property - Request Copy',
 'DOF Property - Update Account',
 'DOOR/WINDOW',
 'DPR Internal',
 'DRIE',
 'Damaged Tree',
 'Day Care',
 'Dead/Dying Tree',
 'Derelict Bicycle',
 'Derelict Vehicle',
 'Derelict Vehicles',
 'Dirty Conditions',
 'Disorderly Youth',
 'Dispatched Taxi Complaint',
 'Drinking',
 'Drinking Water',
 'Drug Activity',
 'ELECTRIC',
 'ELEVATOR',
 'Elder Abuse',
 'Electrical',
 'Electronics Waste',
 'Electronics Waste Appointment',
 'Elevator',
 'Emergency Response Team (ERT)',
 'Executive Inspections',
 'FATF',
 'FHV Licensee Complaint',
 'FLOORING/STAIRS',
 'Facades',
 'Ferry Complaint',
 'Ferry Inquiry',
 'Ferry Permit',
 'Foam Ban Enforcement',
 'Food Establishment',
 'Food Poisoning',
 'For Hire Vehicle Complaint',
 'For Hire Vehicle Report',
 'Forensic Engineering',
 'Forms',
 'Found Property',
 'GENERAL',
 'General',
 'General Construction/Plumbing',
 'Graffiti',
 'Green Taxi Complaint',
 'Green Taxi Report',
 'HEAT/HOT WATER',
 'HPD Literature Request',
 'Harboring Bees/Wasps',
 'Hazardous Materials',
 'Highway Condition',
 'Highway Sign - Damaged',
 'Highway Sign - Dangling',
 'Highway Sign - Missing',
 'Home Care Provider Complaint',
 'Home Delivered Meal - Missed Delivery',
 'Home Delivered Meal Complaint',
 'Homeless Encampment',
 'Homeless Person Assistance',
 'Housing - Low Income Senior',
 'Housing Options',
 'Illegal Animal Kept as Pet',
 'Illegal Animal Sold',
 'Illegal Fireworks',
 'Illegal Parking',
 'Illegal Tree Damage',
 'Indoor Air Quality',
 'Indoor Sewage',
 'Industrial Waste',
 'Investigations and Discipline (IAD)',
 'Lead',
 'Lifeguard',
 'LinkNYC',
 'Litter Basket / Request',
 'Lost Property',
 'Maintenance or Facility',
 'Misc. Comments  BCC:a5glxzf...',
 'Misc. Comments"|echo 2...',
 'Misc. Comments"|ping -...',
 "Misc. Comments' and 7702=77...",
 "Misc. Comments'|echo y9v6fc...",
 "Misc. Comments'|ping -c 21 ...",
 'Miscellaneous Categories',
 'Missed Collection',
 'Missed Collection (All Materials)',
 'Mobile Food Vendor',
 'Mold',
 'Mosquitoes',
 'Municipal Parking Facility',
 'New Tree Request',
 'Noise',
 'Noise - Commercial',
 'Noise - Helicopter',
 'Noise - House of Worship',
 'Noise - Park',
 'Noise - Residential',
 'Noise - Street/Sidewalk',
 'Noise - Vehicle',
 'Non-Emergency Police Matter',
 'Non-Residential Heat',
 'OEM Literature Request',
 'OUTSIDE BUILDING',
 'Other Enforcement',
 'Outside Building',
 'Overflowing Litter Baskets',
 'Overflowing Recycling Baskets',
 'Overgrown Tree/Branches',
 'PAINT/PLASTER',
 'PLUMBING',
 'Panhandling',
 'Parking Card',
 'Pet Shop',
 'Plant',
 'Plumbing',
 'Poison Ivy',
 'Posting Advertisement',
 'Public Assembly',
 'Public Payphone Complaint',
 'Public Toilet',
 'Quality of Life',
 'Question',
 'Radioactive Material',
 'Recycling Enforcement',
 'Request Large Bulky Item Collection',
 'Request Xmas Tree Collection',
 'Research Questions',
 'Rodent',
 'Root/Sewer/Sidewalk Condition',
 'SAFETY',
 'SCRIE',
 'Safety',
 'Sanitation Condition',
 'Scaffold Safety',
 'School Maintenance',
 'Select Message Type...',
 'Senior Center Complaint',
 'Sewer',
 'Sidewalk Condition',
 'Single Occupancy Bathroom',
 'Smoking',
 'Snow',
 'Special Natural Area District (SNAD)',
 'Special Operations',
 'Special Projects Inspection Team (SPIT)',
 'Squeegee',
 'Standing Water',
 'Street Condition',
 'Street Light Condition',
 'Street Sign - Damaged',
 'Street Sign - Dangling',
 'Street Sign - Missing',
 'Sustainability Enforcement',
 'Sweeping/Inadequate',
 'Sweeping/Missed',
 'Sweeping/Missed-Inadequate',
 'Tattooing',
 'Taxi Complaint',
 'Taxi Compliment',
 'Taxi Licensee Complaint',
 'Taxi Report',
 'Taxpayer Advocate Inquiry',
 'Traffic',
 'Traffic Signal Condition',
 'Transportation Provider Complaint',
 'Tunnel Condition',
 'UNSANITARY CONDITION',
 'Unleashed Dog',
 'Unlicensed Dog',
 'Unsanitary Animal Facility',
 'Unsanitary Animal Pvt Property',
 'Unsanitary Condition',
 'Unsanitary Pigeon Condition',
 'Urinating in Public',
 'Vacant Lot',
 'Vending',
 'Violation of Park Rules',
 'WATER LEAK',
 'Water Conservation',
 'Water Quality',
 'Water System',
 'Window Guard',
 'X-Ray Machine/Equipment',
 '\\windows\\win.ini',
 "eval(compile('for x in rang...",
 'file:///c:/windows/win.ini',
 'idexf3mrb7)(!(objectClass=*)',
 'qfix4${695*589}lixaf',
 "{!xmlparser v='<!DOCTYPE..."]

Let’s see how frequently these invalid Complaint Type values appear in the data.

Use .groupby().size() to get the count of 311 requests per complaint type value. This is very similar to pivot tables in spreadsheets.

with pd.option_context("display.max_rows", 1000):
    display(requests.groupby("Complaint Type").size())
Complaint Type
$(sleep 11)                                      1
(select extractvalue(xmltyp...                   1
..././..././..././..././......                   1
../../../../../../../../../...                   2
../WEB-INF/web.xml                               1
@(9313*3464)                                     1
APPLIANCE                                     2539
Abandoned Vehicle                             1655
Advocate - Other                                26
Advocate - RPIE                                  1
Advocate-Business Tax                            2
Advocate-Co-opCondo Abatement                    4
Advocate-Personal Exemptions                    24
Advocate-Prop Refunds/Credits                   39
Air Quality                                   1457
Alzheimer's Care                                25
Animal Abuse                                  1282
Animal Facility - No Permit                     10
Animal in a Park                               456
Animal-Abuse                                   261
Appliance                                        1
Asbestos                                       426
BEST/Site Safety                               359
Beach/Pool/Sauna Complaint                      42
Benefit Card Replacement                      1649
Bereavement Support Group                        1
Bike Rack Condition                             20
Bike/Roller/Skate Chronic                       97
Blocked Driveway                             25530
Boilers                                        449
Borough Office                                 900
Bottled Water                                    2
Bridge Condition                                64
Broken Parking Meter                          2499
Building Marshals office                       311
Building/Use                                  4869
Bus Stop Shelter Complaint                     172
Bus Stop Shelter Placement                      43
Calorie Labeling                                 4
Case Management Agency Complaint                 4
Collection Truck Noise                          36
Comments                                        16
Construction Safety Enforcement               1740
Consumer Complaint                            2892
Cooling Tower                                    7
Cranes and Derricks                             48
Curb Condition                                1476
DCA / DOH New License Application Request      186
DEP Street Condition                            46
DOF Parking - DMV Clearance                     22
DOF Parking - Payment Issue                    532
DOF Parking - Request Copy                      97
DOF Parking - Request Status                    94
DOF Parking - Tax Exemption                    157
DOF Property - Owner Issue                     280
DOF Property - Payment Issue                  1092
DOF Property - Property Value                  106
DOF Property - RPIE Issue                       80
DOF Property - Reduction Issue                1846
DOF Property - Request Copy                   1000
DOF Property - Update Account                  802
DOOR/WINDOW                                   6479
DPR Internal                                   245
DRIE                                           233
Damaged Tree                                  6065
Day Care                                        98
Dead/Dying Tree                               1995
Derelict Bicycle                               457
Derelict Vehicle                              6056
Derelict Vehicles                             6759
Dirty Conditions                              6776
Disorderly Youth                                10
Dispatched Taxi Complaint                        5
Drinking                                       311
Drinking Water                                  16
Drug Activity                                  352
ELECTRIC                                      4670
ELEVATOR                                       250
Elder Abuse                                     43
Electrical                                     551
Electronics Waste                                1
Electronics Waste Appointment                 5535
Elevator                                      3004
Emergency Response Team (ERT)                  872
Executive Inspections                           73
FATF                                            23
FHV Licensee Complaint                           4
FLOORING/STAIRS                               4177
Facades                                         72
Ferry Complaint                                 38
Ferry Inquiry                                   96
Ferry Permit                                     8
Foam Ban Enforcement                             5
Food Establishment                            2061
Food Poisoning                                 549
For Hire Vehicle Complaint                    2789
For Hire Vehicle Report                        112
Forensic Engineering                             1
Forms                                           57
Found Property                                  25
GENERAL                                       4565
General                                        188
General Construction/Plumbing                 9537
Graffiti                                      4286
Green Taxi Complaint                             7
Green Taxi Report                                2
HEAT/HOT WATER                               39095
HPD Literature Request                         405
Harboring Bees/Wasps                            31
Hazardous Materials                            549
Highway Condition                              628
Highway Sign - Damaged                          20
Highway Sign - Dangling                          3
Highway Sign - Missing                          12
Home Care Provider Complaint                     4
Home Delivered Meal - Missed Delivery           15
Home Delivered Meal Complaint                   13
Homeless Encampment                            977
Homeless Person Assistance                    3670
Housing - Low Income Senior                    768
Housing Options                                251
Illegal Animal Kept as Pet                      37
Illegal Animal Sold                              7
Illegal Fireworks                              138
Illegal Parking                              34297
Illegal Tree Damage                            642
Indoor Air Quality                             848
Indoor Sewage                                  214
Industrial Waste                               263
Investigations and Discipline (IAD)            152
Lead                                          1825
Lifeguard                                        7
LinkNYC                                         40
Litter Basket / Request                        463
Lost Property                                  472
Maintenance or Facility                       3055
Misc. Comments  BCC:a5glxzf...                   1
Misc. Comments"|echo 2...                        1
Misc. Comments"|ping -...                        1
Misc. Comments' and 7702=77...                   1
Misc. Comments'|echo y9v6fc...                   1
Misc. Comments'|ping -c 21 ...                   1
Miscellaneous Categories                        34
Missed Collection                               55
Missed Collection (All Materials)             8340
Mobile Food Vendor                             383
Mold                                            81
Mosquitoes                                     139
Municipal Parking Facility                      15
New Tree Request                              3715
Noise                                        10937
Noise - Commercial                            7894
Noise - Helicopter                             303
Noise - House of Worship                       148
Noise - Park                                   792
Noise - Residential                          41311
Noise - Street/Sidewalk                      16876
Noise - Vehicle                               7244
Non-Emergency Police Matter                   1628
Non-Residential Heat                           169
OEM Literature Request                          25
OUTSIDE BUILDING                               150
Other Enforcement                             1349
Outside Building                                 2
Overflowing Litter Baskets                     140
Overflowing Recycling Baskets                    1
Overgrown Tree/Branches                       2935
PAINT/PLASTER                                 9877
PLUMBING                                      9007
Panhandling                                     96
Parking Card                                     2
Pet Shop                                        28
Plant                                           22
Plumbing                                      1482
Poison Ivy                                      47
Posting Advertisement                           23
Public Assembly                                  1
Public Payphone Complaint                       75
Public Toilet                                    9
Quality of Life                                 65
Question                                       171
Radioactive Material                             3
Recycling Enforcement                          134
Request Large Bulky Item Collection          30939
Request Xmas Tree Collection                     2
Research Questions                               4
Rodent                                        6253
Root/Sewer/Sidewalk Condition                 1485
SAFETY                                        1668
SCRIE                                          770
Safety                                          65
Sanitation Condition                          7151
Scaffold Safety                                 38
School Maintenance                             442
Select Message Type...                           1
Senior Center Complaint                         52
Sewer                                         6757
Sidewalk Condition                            6488
Single Occupancy Bathroom                        4
Smoking                                        456
Snow                                           297
Special Natural Area District (SNAD)            14
Special Operations                              39
Special Projects Inspection Team (SPIT)       1168
Squeegee                                         1
Standing Water                                 484
Street Condition                             16895
Street Light Condition                       13425
Street Sign - Damaged                         1176
Street Sign - Dangling                         482
Street Sign - Missing                         1080
Sustainability Enforcement                     247
Sweeping/Inadequate                             55
Sweeping/Missed                                174
Sweeping/Missed-Inadequate                      11
Tattooing                                       21
Taxi Complaint                                2416
Taxi Compliment                                 41
Taxi Licensee Complaint                          5
Taxi Report                                    318
Taxpayer Advocate Inquiry                       19
Traffic                                        868
Traffic Signal Condition                      7306
Transportation Provider Complaint                7
Tunnel Condition                                 1
UNSANITARY CONDITION                         14913
Unleashed Dog                                  128
Unlicensed Dog                                   6
Unsanitary Animal Facility                      15
Unsanitary Animal Pvt Property                 327
Unsanitary Condition                           942
Unsanitary Pigeon Condition                    103
Urinating in Public                            112
Vacant Lot                                     318
Vending                                        873
Violation of Park Rules                        492
WATER LEAK                                    6641
Water Conservation                             853
Water Quality                                  332
Water System                                 12949
Window Guard                                     6
X-Ray Machine/Equipment                          1
\windows\win.ini                                 1
eval(compile('for x in rang...                   1
file:///c:/windows/win.ini                       1
idexf3mrb7)(!(objectClass=*)                     1
qfix4${695*589}lixaf                             1
{!xmlparser v='<!DOCTYPE...                      1
dtype: int64
with pd.option_context("display.max_rows", 1000):
    display(...)

What this code is doing: showing all cells in a DataFrame with rich output.

requests.groupby('Complaint Type').size()

What this code is doing:

  1. Group the records in the dataset based on their Complaint Type value

  2. Count the records that have been grouped together by their shared Complaint Type value

How should we find junk records to delete?

It looks like most invalid complaint types only have a few records. Try excluding all complaint type categories with < 3 records, assuming that all complaint type categories with < 3 instances in the data are bad data entries.

Why 3? It’s arbitrary. We’re looking for trends in the data, and in this case we don’t care about low frequency entries.

Create a dataframe that captures the count of records per Complaint Type value.

counts = requests.groupby("Complaint Type").size().reset_index(name="count")
counts
# .reset_index(name='count') allows us to name the new column that contains the count of rows
Complaint Type count
0 $(sleep 11) 1
1 (select extractvalue(xmltyp... 1
2 ..././..././..././..././...... 1
3 ../../../../../../../../../... 2
4 ../WEB-INF/web.xml 1
... ... ...
243 eval(compile('for x in rang... 1
244 file:///c:/windows/win.ini 1
245 idexf3mrb7)(!(objectClass=*) 1
246 qfix4${695*589}lixaf 1
247 {!xmlparser v='<!DOCTYPE... 1

248 rows × 2 columns

You can also use .count() but the output is a little different.

Let’s create a “series” that only lists the Complaint Type values that have record counts >= 3.

Remember: A single column from a pandas dataframe is called a series. It’s essentially a list containing all the values in the column.

requests.groupby('Complaint Type').size().reset_index(name='size')

is equivalent to:

requests.groupby('Complaint Type', as_index=False).size()

Without the reset_index() or as_index=False, the Complaint Type values are returned as the index. See also: value_counts().

valid_complaint_types = counts["Complaint Type"][counts["count"] >= 3]
valid_complaint_types
6                          APPLIANCE
7                  Abandoned Vehicle
8                   Advocate - Other
11     Advocate-Co-opCondo Abatement
12      Advocate-Personal Exemptions
                   ...              
236                       WATER LEAK
237               Water Conservation
238                    Water Quality
239                     Water System
240                     Window Guard
Name: Complaint Type, Length: 213, dtype: object

Filter our requests dataframe to only keep the rows where the Complaint Type value is in the valid_complaint_types series we created in the previous step. Save the result in a new dataframe.

df_cleaned = requests[requests["Complaint Type"].isin(valid_complaint_types)]

How can we make sure this worked? Let’s check how many records there were originally in requests vs how many are in df_cleaned.

Before:

requests["Unique Key"].size
500000

After:

df_cleaned["Unique Key"].size
499958

We can also print the set of complaint_type values from our cleaned dataframe to make sure they look correct.

df_cleaned["Complaint Type"].sample(10)
349480                         Borough Office
254578                               PLUMBING
198644    Request Large Bulky Item Collection
319316                 Street Light Condition
29580           General Construction/Plumbing
398404                 Street Light Condition
303870                 HPD Literature Request
86399                                   Sewer
22631                 Noise - Street/Sidewalk
398130                        Illegal Parking
Name: Complaint Type, dtype: object

Great, now those invalid records will be excluded from our analysis!

Another approach to excluding those invalid records would be to use regular expressions (“RegExes”) to find records with weird characters.

Filtering rows#

Slicing and dicing is done through indexing.

DataFrame

Boolean indexing#

DataFrame and Series

The table on the right is missing the state column - need to fix.

How it works#

Done with clean up! Time for the actual analysis:#

Which 311 complaints are most common and which agencies are responsible for handling them?#

Which complaints are the most common?#

df_cleaned.groupby("Complaint Type").size().reset_index(name="count")
Complaint Type count
0 APPLIANCE 2539
1 Abandoned Vehicle 1655
2 Advocate - Other 26
3 Advocate-Co-opCondo Abatement 4
4 Advocate-Personal Exemptions 24
... ... ...
208 WATER LEAK 6641
209 Water Conservation 853
210 Water Quality 332
211 Water System 12949
212 Window Guard 6

213 rows × 2 columns

.reset_index(name='count') isn’t necessary, but it’s helpful to include because it allows us to name the new column that contains the count of rows.

Which agencies are responsible for handling these complaint categories?#

df_cleaned.groupby(["Agency", "Complaint Type"]).size().reset_index(name="count")
Agency Complaint Type count
0 ACS Forms 56
1 COIB Forms 1
2 DCA Consumer Complaint 2892
3 DCA DCA / DOH New License Application Request 186
4 DCAS Comments 13
... ... ... ...
216 TLC Lost Property 472
217 TLC Taxi Complaint 2416
218 TLC Taxi Compliment 41
219 TLC Taxi Licensee Complaint 5
220 TLC Taxi Report 318

221 rows × 3 columns

Which agencies receive the most total 311 requests?#

df_cleaned.groupby("Agency").size().reset_index(name="count")
Agency count
0 ACS 56
1 COIB 1
2 DCA 3078
3 DCAS 149
4 DCP 4
5 DEP 36283
6 DFTA 1182
7 DHS 3670
8 DOB 26024
9 DOE 442
10 DOF 7223
11 DOHMH 12690
12 DOITT 115
13 DOT 52001
14 DPR 21086
15 DSNY 73227
16 EDC 303
17 HPD 105634
18 HRA 1649
19 NYCEM 25
20 NYPD 148884
21 TAT 1
22 TAX 37
23 TLC 6194

What is the most frequent request per agency?#

First, create a dataframe that contains the count of complaints per Agency per Complaint Type.

agency_counts = df_cleaned.groupby(["Agency", "Complaint Type"]).size().reset_index(name="count")
agency_counts.head(20)
Agency Complaint Type count
0 ACS Forms 56
1 COIB Forms 1
2 DCA Consumer Complaint 2892
3 DCA DCA / DOH New License Application Request 186
4 DCAS Comments 13
5 DCAS Question 136
6 DCP Research Questions 4
7 DEP Air Quality 1457
8 DEP Asbestos 316
9 DEP FATF 23
10 DEP Hazardous Materials 549
11 DEP Industrial Waste 263
12 DEP Lead 1825
13 DEP Noise 10937
14 DEP Plant 22
15 DEP Sewer 6757
16 DEP Water Conservation 853
17 DEP Water Quality 332
18 DEP Water System 12949
19 DFTA Alzheimer's Care 25

Use drop_duplicates() to keep the row with the highest value per Agency.

sorted_agency_counts = agency_counts.sort_values("count", ascending=False)
sorted_agency_counts
Agency Complaint Type count
198 NYPD Noise - Residential 41311
169 HPD HEAT/HOT WATER 39095
194 NYPD Illegal Parking 34297
154 DSNY Request Large Bulky Item Collection 30939
186 NYPD Blocked Driveway 25530
... ... ... ...
119 DOT Highway Sign - Dangling 3
94 DOHMH Radioactive Material 3
139 DPR Unsanitary Condition 1
207 TAT Question 1
1 COIB Forms 1

221 rows × 3 columns

top_agency_counts = sorted_agency_counts.drop_duplicates("Agency")
top_agency_counts
Agency Complaint Type count
198 NYPD Noise - Residential 41311
169 HPD HEAT/HOT WATER 39095
154 DSNY Request Large Bulky Item Collection 30939
124 DOT Street Condition 16895
18 DEP Water System 12949
42 DOB General Construction/Plumbing 9537
95 DOHMH Rodent 6253
132 DPR Damaged Tree 6065
29 DHS Homeless Person Assistance 3670
2 DCA Consumer Complaint 2892
212 TLC For Hire Vehicle Complaint 2789
67 DOF DOF Property - Reduction Issue 1846
180 HRA Benefit Card Replacement 1649
25 DFTA Housing - Low Income Senior 768
53 DOE School Maintenance 442
161 EDC Noise - Helicopter 303
5 DCAS Question 136
106 DOITT Public Payphone Complaint 75
0 ACS Forms 56
209 TAX Question 34
181 NYCEM OEM Literature Request 25
6 DCP Research Questions 4
207 TAT Question 1
1 COIB Forms 1

For nicer display, order alphabetically:

top_agency_counts.sort_values("Agency")
Agency Complaint Type count
0 ACS Forms 56
1 COIB Forms 1
2 DCA Consumer Complaint 2892
5 DCAS Question 136
6 DCP Research Questions 4
18 DEP Water System 12949
25 DFTA Housing - Low Income Senior 768
29 DHS Homeless Person Assistance 3670
42 DOB General Construction/Plumbing 9537
53 DOE School Maintenance 442
67 DOF DOF Property - Reduction Issue 1846
95 DOHMH Rodent 6253
106 DOITT Public Payphone Complaint 75
124 DOT Street Condition 16895
132 DPR Damaged Tree 6065
154 DSNY Request Large Bulky Item Collection 30939
161 EDC Noise - Helicopter 303
169 HPD HEAT/HOT WATER 39095
180 HRA Benefit Card Replacement 1649
181 NYCEM OEM Literature Request 25
198 NYPD Noise - Residential 41311
207 TAT Question 1
209 TAX Question 34
212 TLC For Hire Vehicle Complaint 2789

Another way, only sorting it once:

agency_counts.sort_values(["Agency", "count"]).drop_duplicates("Agency", keep="last")
Agency Complaint Type count
0 ACS Forms 56
1 COIB Forms 1
2 DCA Consumer Complaint 2892
5 DCAS Question 136
6 DCP Research Questions 4
18 DEP Water System 12949
25 DFTA Housing - Low Income Senior 768
29 DHS Homeless Person Assistance 3670
42 DOB General Construction/Plumbing 9537
53 DOE School Maintenance 442
67 DOF DOF Property - Reduction Issue 1846
95 DOHMH Rodent 6253
106 DOITT Public Payphone Complaint 75
124 DOT Street Condition 16895
132 DPR Damaged Tree 6065
154 DSNY Request Large Bulky Item Collection 30939
161 EDC Noise - Helicopter 303
169 HPD HEAT/HOT WATER 39095
180 HRA Benefit Card Replacement 1649
181 NYCEM OEM Literature Request 25
198 NYPD Noise - Residential 41311
207 TAT Question 1
209 TAX Question 34
212 TLC For Hire Vehicle Complaint 2789

Some pandas/Jupyter best practices#

  • Make variable names descriptive

    • Ignore that all examples use requests

  • Only do one thing per line

    • Makes troubleshooting easier

  • Make notebooks idempotent

    • Makes your work reproducible

    • Use Restart and run all (⏩ button in toolbar)

Homework 1#