Homework 2#

General assignment information

Coding: Using keywords to categorize 311 requests#

Problem Statement: When you read through the descriptor and resolution_description columns in the 311 data, you will see that complaints related to graffiti are actually scattered throughout multiple complaint_type categories. We want to identify all complaints related to graffiti and see which community districts have the most instances of graffiti.

To help make this assignment easier, there’s a smaller subset of the 311 data for you to use:

https://storage.googleapis.com/python-public-policy/data/cleaned_311_data_hw2.csv.zip

This smaller dataset only contains ~65,000 records from relevant complaint type categories, and has columns renamed to be lowercase and underscored.

Hints#

Setup#

import ipytest

ipytest.autoconfig()

Step 0#

Load the data.

import pandas as pd
# your code here

Step 1#

Create a flag_graffiti function that checks each row in the 311 dataframe to see if the word “graffiti” is present in the complaint_type, descriptor, and/or resolution_description. Any of the columns may contain the word, so you should check all of them. If the word “graffiti” is found, the function should return the boolean value True. If “graffiti” is not found, the function should return the boolean value False.

Hints#

  • Make sure to look for “graffiti” in those strings. The strings may contain more than just that word.

  • Capitalization may be inconsistent.

# your code here

Test by passing in a fake row.

%%ipytest --tb=short --verbosity=1


def test_complaint_type():
    test_row = pd.Series({
        "complaint_type": "graffiti",
        "descriptor": "",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == True

    
def test_descriptor():
    test_row = pd.Series({
        "complaint_type": "",
        "descriptor": "graffiti",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == True

    
def test_description():
    test_row = pd.Series({
        "complaint_type": "",
        "descriptor": "",
        "resolution_description": "graffiti"
    })
    assert flag_graffiti(test_row) == True

    
def test_none():
    test_row = pd.Series({
        "complaint_type": "",
        "descriptor": "",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == False


def test_mixed_cases():
    test_row = pd.Series({
        "complaint_type": "GrafFiti",
        "descriptor": "",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == True

    
def test_substring():
    test_row = pd.Series({
        "complaint_type": "",
        "descriptor": "there's graffiti on the wall",
        "resolution_description": ""
    })
    assert flag_graffiti(test_row) == True

Step 2#

Apply the function created in Step 1 to the 311 dataframe and create a new column called graffiti_flag that captures the output from the function.

Tip: There are two checks you can use to confirm that the function worked as expected.

  • Make sure there are records tagged with graffiti_flag True.

  • Make sure that more than one complaint_type has graffiti_flag True (and False).

# your code here
%%ipytest --tb=short --verbosity=1

def test_graffiti_flag():
    assert 'graffiti_flag' in df.columns, "column missing"
    assert df.dtypes['graffiti_flag'] == 'bool', "column should be booleans"

Step 3#

Create another dataframe df_graffiti that only contains records where graffiti_flag is True.

# your code here
%%ipytest --tb=short --verbosity=1

def test_all_have_graffiti():
    assert df_graffiti['graffiti_flag'].all(), "not all have graffiti_flag set to True"

Step 4#

Group your dataframe df_graffiti to get the count of requests per community_board. Identify which Community District has the highest count.

# your code here

Bonus 0#

5 points

Create a graffiti_flag2 column using only built-in pandas operations, i.e. without using a custom function (def). Another way to think about it: Instead of operating on a single row at a time, how can you operate across entire columns? See working with text data for clues.

# your code here

Bonus 1#

5 points

Clean another column of the dataset. Include explanation and code for how you got there.

# your code here

Now turn in the assignment.

Tutorials#

Optional#

Participation#

Reminder about the between-class participation requirement.