Homework 2#
General assignment information
Coding: Using keywords to categorize 311 requests#
Problem Statement: When you read through the descriptor
and resolution_description
columns in the 311 data, you will see that complaints related to graffiti are actually scattered throughout multiple complaint_type
categories. We want to identify all complaints related to graffiti and see which community districts have the most instances of graffiti.
To help make this assignment easier, there’s a smaller subset of the 311 data for you to use:
https://storage.googleapis.com/python-public-policy/data/cleaned_311_data_hw2.csv.zip
This smaller dataset only contains ~65,000 records from relevant complaint type categories, and has columns renamed to be lowercase and underscored.
Hints#
You can adapt the
recode_borocd_counts()
example from Lecture 2 for this problem.You may run into issues with empty values; how to deal with them.
Ways to do case-insensitive string comparison in pure Python, which translates over to pandas.
Setup#
import ipytest
ipytest.autoconfig()
Step 0#
Load the data.
import pandas as pd
# your code here
Step 1#
Create a flag_graffiti
function that checks each row in the 311 dataframe to see if the word “graffiti” is present in the complaint_type
, descriptor
, and/or resolution_description
. Any of the columns may contain the word, so you should check all of them. If the word “graffiti” is found, the function should return the boolean value True
. If “graffiti” is not found, the function should return the boolean value False
.
Hints#
Make sure to look for “graffiti” in those strings. The strings may contain more than just that word.
Capitalization may be inconsistent.
# your code here
Test by passing in a fake row.
%%ipytest --tb=short --verbosity=1
def test_complaint_type():
test_row = pd.Series({
"complaint_type": "graffiti",
"descriptor": "",
"resolution_description": ""
})
assert flag_graffiti(test_row) == True
def test_descriptor():
test_row = pd.Series({
"complaint_type": "",
"descriptor": "graffiti",
"resolution_description": ""
})
assert flag_graffiti(test_row) == True
def test_description():
test_row = pd.Series({
"complaint_type": "",
"descriptor": "",
"resolution_description": "graffiti"
})
assert flag_graffiti(test_row) == True
def test_none():
test_row = pd.Series({
"complaint_type": "",
"descriptor": "",
"resolution_description": ""
})
assert flag_graffiti(test_row) == False
def test_mixed_cases():
test_row = pd.Series({
"complaint_type": "GrafFiti",
"descriptor": "",
"resolution_description": ""
})
assert flag_graffiti(test_row) == True
def test_substring():
test_row = pd.Series({
"complaint_type": "",
"descriptor": "there's graffiti on the wall",
"resolution_description": ""
})
assert flag_graffiti(test_row) == True
Step 2#
Apply the function created in Step 1 to the 311 dataframe and create a new column called graffiti_flag
that captures the output from the function.
Tip: There are two checks you can use to confirm that the function worked as expected.
Make sure there are records tagged with
graffiti_flag
True
.Make sure that more than one
complaint_type
hasgraffiti_flag
True
(andFalse
).
# your code here
%%ipytest --tb=short --verbosity=1
def test_graffiti_flag():
assert 'graffiti_flag' in df.columns, "column missing"
assert df.dtypes['graffiti_flag'] == 'bool', "column should be booleans"
Step 3#
Create another dataframe df_graffiti
that only contains records where graffiti_flag
is True
.
# your code here
%%ipytest --tb=short --verbosity=1
def test_all_have_graffiti():
assert df_graffiti['graffiti_flag'].all(), "not all have graffiti_flag set to True"
Step 4#
Group your dataframe df_graffiti
to get the count of requests per community_board
. Identify which Community District has the highest count.
# your code here
Bonus 0#
5 points
Create a graffiti_flag2
column using only built-in pandas operations, i.e. without using a custom function (def
). Another way to think about it: Instead of operating on a single row at a time, how can you operate across entire columns? See working with text data for clues.
# your code here
Bonus 1#
5 points
Clean another column of the dataset. Include explanation and code for how you got there.
# your code here
Tutorials#
“You’re Not Mapping Rats, You’re Mapping Gentrification”—article about bias in 311 data
Intro to Plotly Express. You don’t have to work through every one of these examples; just review to get familiar with what types of charts are possible.
Optional#
Participation#
Reminder about the between-class participation requirement.