# Class 2: Manipulating and combining data

How did the homework go?

## Various notes

- Written responses
- Instructions are specific
- [Data cleaning thread](https://edstem.org/us/courses/56920/discussion/4676852) - we'll talk more about it shortly

## Feeling overwhelmed?

Reminder that learning to code is like learning a spoken language. It's not obvious, and people pick it up at different speeds and get stuck at different spots. Try:

- Taking notes in the lecture notebooks
- Using [another Python/pandas learning resource](https://python-public-policy.afeld.me/en/nyu/resources.html)
   - Hear things explained another way
   - Ask in [Ed Discussions](https://brightspace.nyu.edu/d2l/le/lessons/366164/topics/9996174) if others have recommendations
- [Comment-driven development](https://www.sitepoint.com/comment-driven-development/)
   - Otherwise, you're trying to do two steps in your head at once:
      1. Figuring out the logic
      1. Figuring out the syntax

Small example of comment-driven development:

```python
# find valid ZIP codes
# filter the DataFrame to only invalid ZIP codes
```

## Data cleaning

> Data Cleansing is a process of removing or fixing incorrect, malformed, incomplete, duplicate, or corrupted data

https://hevodata.com/learn/data-cleansing-a-simplified-guide/

### Things to check for

From [my workshop on data cleaning](https://github.com/afeld/data-cleaning):

- Missing data
   - Empty values
- Bad (junk) values
   - Duplicates
   - Mismatched types/formatting
- Categorical data
   - Unique values (cardinality)
   - Value counts
- Continuous values
   - Ranges
   - Spread (distribution)

Notes:

- "Values" in this case can be a single cell (in the spreadsheet sense) or a whole row
- "Missing" or "duplicates" can be columns (Series), tables (DataFrames), rows, or cells
- "Categorical data" have a fixed set of values
- This isn’t everything you can check for, but should cover most things

### Data cleaning [mnemonic](https://literaryterms.net/mnemonic/)

- Empty
- Bad
- Unique
- Spread
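
As a rough sketch of how the four checks might look in pandas, the block below runs each one against a small made-up DataFrame. The data and the column names (`borough`, `response_hours`) are invented for illustration and are not from the 311 dataset.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame(
    {
        "borough": ["BRONX", "QUEENS", "QUEENS", None, "bronx"],
        "response_hours": [2.0, 5.5, 5.5, 1.0, 400.0],
    }
)

# Empty: count missing values per column
empty_counts = df.isna().sum()

# Bad: count fully duplicated rows (one kind of junk value)
duplicate_rows = df.duplicated().sum()

# Unique: cardinality and value counts for a categorical column
unique_boroughs = df["borough"].nunique()
borough_counts = df["borough"].value_counts()

# Spread: range and distribution of a continuous column
spread = df["response_hours"].describe()

print(empty_counts, duplicate_rows, unique_boroughs, borough_counts, spread, sep="\n\n")
```

Note that `BRONX` and `bronx` count as two different categories here, which is the kind of mismatched formatting the "Unique" check tends to surface.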

## **Today's goal**: Which Community Districts have the most 311 requests? Why might that be?

### What's a Community District?

- 59 local governance districts each run by an appointed [Community Board](https://en.wikipedia.org/wiki/Community_boards_of_New_York_City)
- Community boards advise on land use and zoning, participate in the city budget process, and address service delivery in their district.
- Community boards are each composed of up to 50 volunteer members appointed by the local borough president, half from nominations by the local City Council members.

![Map of community districts from Wikipedia](https://upload.wikimedia.org/wikipedia/commons/4/41/New_York_City_community_districts.svg)

## Setup

In [1]:
import pandas as pd

In [2]:
# Display more rows and columns in the DataFrames
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

### Read our cleaned 311 Service Requests dataset

In [3]:
url = "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip"
requests = pd.read_csv(url)


## Dealing with dtypes

More data cleaning!

![Minion character vacuuming](https://impulsecreative.com/hs-fs/hubfs/cleaning-minion-gif.gif?width=490&name=cleaning-minion-gif.gif)

```
DtypeWarning: Columns (8,20,31,34) have mixed types.
```

In [4]:
requests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499958 entries, 0 to 499957
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      499958 non-null  int64  
 1   Created Date                    499958 non-null  object 
 2   Closed Date                     476140 non-null  object 
 3   Agency                          499958 non-null  object 
 4   Agency Name                     499958 non-null  object 
 5   Complaint Type                  499958 non-null  object 
 6   Descriptor                      492496 non-null  object 
 7   Location Type                   392573 non-null  object 
 8   Incident Zip                    480394 non-null  object 
 9   Incident Address                434529 non-null  object 
 10  Street Name                     434504 non-null  object 
 11  Cross Street 1                  300825 non-null  object 
 12  Cross Street 2  

In [5]:
list(requests["Incident Zip"].unique())

['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 nan,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '10

ZIP codes _look_ numeric, but aren't really.

[Read the ZIP codes in as strings.](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-data-types)

In [6]:
requests2 = pd.read_csv(url, dtype={"Incident Zip": "string"})


We fixed the dtype warning for column 8 (`Incident Zip`).
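
One reason ZIP codes aren't really numeric is that leading zeros carry meaning. The sketch below reads a tiny made-up CSV twice, once with pandas' default type inference and once with the column forced to `"string"`. The CSV text is invented for illustration.

```python
import io

import pandas as pd

# Made-up CSV; the first ZIP code has a leading zero
csv_text = "Incident Zip\n07114\n10001\n"

as_default = pd.read_csv(io.StringIO(csv_text))
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"Incident Zip": "string"})

print(as_default["Incident Zip"].tolist())  # parsed as integers, so the leading zero is lost
print(as_string["Incident Zip"].tolist())  # kept exactly as written
```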

In [7]:
list(requests2["Incident Zip"].unique())

['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 <NA>,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '1

### Find invalid ZIP codes

Use a [regular expression (regex)](https://regexone.com/) to [find strings that match a pattern](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#testing-for-strings-that-match-or-contain-a-pattern):

```
^\d{5}(?:-\d{4})?$
│ │ │  │        │└─ end of string
│ │ │  │        └─ optional
│ │ │  └─ non-capturing group
│ │ └─ count
│ └─ numeric/digit character
└─ start of string
```

[regex101](https://regex101.com/) is useful for testing them.
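
The pattern can also be tried out in plain Python with the built-in `re` module before using it in pandas. This is only a sketch: most of the candidate strings are taken from the invalid values that show up in this dataset, while `10001-1234` is a made-up ZIP+4 example.

```python
import re

ZIP_PATTERN = re.compile(r"^\d{5}(?:-\d{4})?$")

candidates = ["11235", "10001-1234", "14614-195", "NJ 07114", "100000", "IDK"]

# Map each candidate to whether it matches the pattern
results = {value: bool(ZIP_PATTERN.search(value)) for value in candidates}

for value, is_valid in results.items():
    print(f"{value!r}: {is_valid}")
```

Only the first two candidates match: five digits, optionally followed by a hyphen and exactly four more digits.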

In [8]:
# find valid ZIP codes
valid_zips = requests2["Incident Zip"].str.contains(r"^\d{5}(?:-\d{4})?$")

# filter the DataFrame to only invalid ZIP codes
invalid_zips = valid_zips == False
requests_with_invalid_zips = requests2[invalid_zips]
requests_with_invalid_zips["Incident Zip"]

55017     HARRISBURG
58100         N5X3A6
80798         100000
120304           IDK
123304          1801
173518     14614-195
192034        979113
201463           100
207158          8682
216745        000000
325071      NJ 07114
425985          1101
441166         DID N
Name: Incident Zip, dtype: string

[Clear](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#inserting-missing-data) any invalid ZIP codes:

In [9]:
requests2.loc[invalid_zips, "Incident Zip"] = None
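
To see what `.loc[mask, column] = value` does in isolation, here is a minimal sketch on a toy DataFrame. The data is made up, and passing `na=False` to `.str.contains()` is a choice made for this sketch (not something the cells above do) so that the mask never contains missing values.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Incident Zip": pd.array(["11235", "IDK", None, "14614-195"], dtype="string")}
)

# na=False: rows that are already missing yield False instead of <NA>
is_valid = toy["Incident Zip"].str.contains(r"^\d{5}(?:-\d{4})?$", na=False)

# Rows that have a value, but one that doesn't match the pattern
is_invalid = toy["Incident Zip"].notna() & ~is_valid
num_cleared = int(is_invalid.sum())

# Overwrite only the selected rows, and only in this one column
toy.loc[is_invalid, "Incident Zip"] = pd.NA

print(num_cleared)
print(toy)
```

Counting how many values were cleared is one way to document the cleaning step.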

Additional data cleaning tips:

- Hard part is finding what needs to be done
- Will be specific to your use case
- Document what you did, since it will affect your results

## [In-class exercise](https://python-public-policy.afeld.me/en/nyu/lecture_2_exercise.html)

_Skipping to save time. While not graded, I encourage you to do it on your own, as it will help prepare you for the Final Project._

## View the contents of the `Community Board` column in our 311 data

In [10]:
requests["Community Board"].unique()

array(['15 BROOKLYN', '03 BROOKLYN', '14 QUEENS', '10 BRONX', '08 QUEENS',
       '02 BRONX', '01 QUEENS', '11 QUEENS', '02 MANHATTAN',
       '18 BROOKLYN', '12 QUEENS', '01 STATEN ISLAND', '12 BRONX',
       '05 BROOKLYN', '01 BRONX', '09 QUEENS', '04 BROOKLYN',
       '10 BROOKLYN', '02 STATEN ISLAND', '05 QUEENS', '04 MANHATTAN',
       '11 BRONX', 'Unspecified BROOKLYN', '09 BRONX', '12 MANHATTAN',
       '09 BROOKLYN', '14 BROOKLYN', '06 MANHATTAN', '10 MANHATTAN',
       'Unspecified QUEENS', '01 MANHATTAN', '03 MANHATTAN', '05 BRONX',
       '08 BROOKLYN', '02 QUEENS', '12 BROOKLYN', '01 BROOKLYN',
       '16 BROOKLYN', '13 BROOKLYN', '06 QUEENS', '07 MANHATTAN',
       '11 BROOKLYN', 'Unspecified BRONX', '08 MANHATTAN',
       '03 STATEN ISLAND', '06 BROOKLYN', '03 BRONX', '05 MANHATTAN',
       '07 QUEENS', '13 QUEENS', '17 BROOKLYN', '06 BRONX', '02 BROOKLYN',
       '10 QUEENS', 'Unspecified MANHATTAN', '03 QUEENS', '04 BRONX',
       '11 MANHATTAN', '08 BRONX', '07 BROOKLY

### Get the count of 311 requests per Community District

In [11]:
cb_counts = requests.groupby("Community Board").size().reset_index(name="num_311_requests")
cb_counts = cb_counts.sort_values("num_311_requests", ascending=False)
cb_counts

Unnamed: 0,Community Board,num_311_requests
50,12 MANHATTAN,14110
23,05 QUEENS,12487
51,12 QUEENS,12228
2,01 BROOKLYN,11863
12,03 BROOKLYN,11615
5,01 STATEN ISLAND,11438
31,07 QUEENS,11210
21,05 BROOKLYN,10862
16,04 BRONX,10628
4,01 QUEENS,10410
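
The chain of calls above can be easier to follow on a handful of rows. The sketch below uses a made-up six-row DataFrame, with each step commented.

```python
import pandas as pd

# Toy data, invented for illustration
toy_requests = pd.DataFrame(
    {
        "Community Board": [
            "12 MANHATTAN",
            "05 QUEENS",
            "12 MANHATTAN",
            "05 QUEENS",
            "12 MANHATTAN",
            "01 BRONX",
        ]
    }
)

toy_counts = (
    toy_requests.groupby("Community Board")
    .size()  # number of rows in each group, as a Series indexed by group
    .reset_index(name="num_311_requests")  # move the group labels into a column and name the counts
    .sort_values("num_311_requests", ascending=False)
)

print(toy_counts)
```

The left-most numbers in the printed result are the index labels assigned before sorting, which is why they appear out of order.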


## **Research Question:** What may account for the variance in count of requests per community district?

### **Hypothesis:** Population size may help explain the variance.

We can combine the counts per community district dataset with population data for each community district.

We'll use [pandas' `.merge()`](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging), comparable to:

- [SQL `JOIN`](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#join)
- [Spreadsheet `VLOOKUP`](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_spreadsheets.html#merging)

In general, this is called ["record linkage" or "entity resolution"](https://en.wikipedia.org/wiki/Record_linkage).

### Let's load the population dataset and check out its contents

[Data source for population by Community District](https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Community-Districts/xi7c-iiu2/data)

In [12]:
population = pd.read_csv("https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv")
population.head()

Unnamed: 0,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population
0,Bronx,1,"Melrose, Mott Haven, Port Morris",138557,78441,77214,82159,91497
1,Bronx,2,"Hunts Point, Longwood",99493,34399,39443,46824,52246
2,Bronx,3,"Morrisania, Crotona Park East",150636,53635,57162,68574,79762
3,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441
4,Bronx,5,"University Hts., Fordham, Mt. Hope",121807,107995,118435,128313,128200


## In order to join the two dataframes, we need to create a common ID in each.

[`BORO CODE`](https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/pluto_datadictionary.pdf#page=38) (a.k.a. `BoroCode`, `borocd`, and `boro_cd`) is a commonly used unique ID for community districts. Let's write functions that create that unique ID in our datasets.

**BoroCD** is a three-digit integer that captures the borough and district number. The first digit represents the borough. The district number is padded with zeros so it's always two digits long.

Boroughs are recoded into the following numbers:
- 1: Manhattan
- 2: Bronx
- 3: Brooklyn
- 4: Queens
- 5: Staten Island

Ex: 
- Manhattan 12 --> 112
- Brooklyn 6 --> 306

## First, let's create a `boro_cd` column in the `cb_counts` DataFrame

In [13]:
cb_counts.head()

Unnamed: 0,Community Board,num_311_requests
50,12 MANHATTAN,14110
23,05 QUEENS,12487
51,12 QUEENS,12228
2,01 BROOKLYN,11863
12,03 BROOKLYN,11615


[`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) can be used for transforming data with a custom function. How does it work?

```python
def my_function(row):
    # do stuff
    return some_value

new_values = dataframe.apply(my_function, axis=1)
```

While pandas generally operates on an entire column at once, `apply()` is similar to [working with CSVs in pure Python](https://python-public-policy.afeld.me/en/nyu/lecture_1.html#working-with-csvs-in-pure-python) in that you are operating row by row.
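
To make the row-by-row idea concrete, here is a small sketch on made-up data. The function name `describe_district()` and the toy values are assumptions for illustration. It computes the same result twice: once with `apply()`, and once by operating on whole columns.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({"Borough": ["Bronx", "Queens"], "CD Number": [1, 12]})


def describe_district(row):
    # `row` is a Series whose labels are the column names
    return f"{row['Borough']} CD {row['CD Number']}"


# Row by row: the function is called once per row
via_apply = toy.apply(describe_district, axis=1)

# Column at a time: the same labels built from whole columns
via_columns = toy["Borough"] + " CD " + toy["CD Number"].astype(str)

print(via_apply.tolist())
print(via_columns.tolist())
```

When a column-wise version is easy to write, it is usually faster, but `apply()` is often simpler to reason about when the logic has several branches.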

Let's create a function called `recode_borocd_counts` that takes a `row` and converts the `Community Board` value into a `borocd` value.

In [14]:
def recode_borocd_counts(row):
    if "MANHATTAN" in row["Community Board"]:
        return "1" + row["Community Board"][0:2]
        # [0:2] provides the first 2 characters, i.e. characters at indexes 0 and 1.
        # you could also use [:2] without the zero.
    elif "BRONX" in row["Community Board"]:
        return "2" + row["Community Board"][0:2]
    elif "BROOKLYN" in row["Community Board"]:
        return "3" + row["Community Board"][0:2]
    elif "QUEENS" in row["Community Board"]:
        return "4" + row["Community Board"][0:2]
    elif "STATEN ISLAND" in row["Community Board"]:
        return "5" + row["Community Board"][0:2]
    else:
        return "Invalid BoroCD"

Let's test out that function in isolation. We'll grab one of the rows and pass it into the function.

In [15]:
sample_row = cb_counts.iloc[0]
sample_row

Community Board     12 MANHATTAN
num_311_requests           14110
Name: 50, dtype: object

In [16]:
recode_borocd_counts(sample_row)

'112'

Now we use `apply()` to do that across _all_ the rows.

In [17]:
cb_counts["boro_cd"] = cb_counts.apply(recode_borocd_counts, axis=1)

- `apply()` (the way we're using it) takes a function and runs it against each row of a DataFrame, returning the results as a Series
- `axis=1` specifies that the function receives one row at a time, rather than one column at a time (the default, `axis=0`)
- `cb_counts["boro_cd"] = …` creates a new column in the DataFrame called `boro_cd`

In [18]:
cb_counts

Unnamed: 0,Community Board,num_311_requests,boro_cd
50,12 MANHATTAN,14110,112
23,05 QUEENS,12487,405
51,12 QUEENS,12228,412
2,01 BROOKLYN,11863,301
12,03 BROOKLYN,11615,303
5,01 STATEN ISLAND,11438,501
31,07 QUEENS,11210,407
21,05 BROOKLYN,10862,305
16,04 BRONX,10628,204
4,01 QUEENS,10410,401


Uh oh, there are some unexpected `Unspecified` values in here - how can we get around them?

Let's only recode records that don't start with "U".

In [19]:
def recode_borocd_counts(row):
    if "MANHATTAN" in row["Community Board"] and row["Community Board"][0] != "U":
        return "1" + row["Community Board"][0:2]
    elif "BRONX" in row["Community Board"] and row["Community Board"][0] != "U":
        return "2" + row["Community Board"][0:2]
    elif "BROOKLYN" in row["Community Board"] and row["Community Board"][0] != "U":
        return "3" + row["Community Board"][0:2]
    elif "QUEENS" in row["Community Board"] and row["Community Board"][0] != "U":
        return "4" + row["Community Board"][0:2]
    elif "STATEN ISLAND" in row["Community Board"] and row["Community Board"][0] != "U":
        return "5" + row["Community Board"][0:2]
    else:
        return "Invalid BoroCD"


cb_counts["boro_cd"] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts

Unnamed: 0,Community Board,num_311_requests,boro_cd
50,12 MANHATTAN,14110,112
23,05 QUEENS,12487,405
51,12 QUEENS,12228,412
2,01 BROOKLYN,11863,301
12,03 BROOKLYN,11615,303
5,01 STATEN ISLAND,11438,501
31,07 QUEENS,11210,407
21,05 BROOKLYN,10862,305
16,04 BRONX,10628,204
4,01 QUEENS,10410,401


We can make this function easier to read by isolating the logic that applies to all the conditions. This is called "refactoring".

In [20]:
def recode_borocd_counts(row):
    board = row["Community Board"]

    # doing a check and then returning from a function early is known as a "guard clause"
    if board.startswith("U"):
        return "Invalid BoroCD"

    num = board[0:2]

    if "MANHATTAN" in board:
        return "1" + num
    elif "BRONX" in board:
        return "2" + num
    elif "BROOKLYN" in board:
        return "3" + num
    elif "QUEENS" in board:
        return "4" + num
    elif "STATEN ISLAND" in board:
        return "5" + num
    else:
        return "Invalid BoroCD"

In [21]:
cb_counts["boro_cd"] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts

Unnamed: 0,Community Board,num_311_requests,boro_cd
50,12 MANHATTAN,14110,112
23,05 QUEENS,12487,405
51,12 QUEENS,12228,412
2,01 BROOKLYN,11863,301
12,03 BROOKLYN,11615,303
5,01 STATEN ISLAND,11438,501
31,07 QUEENS,11210,407
21,05 BROOKLYN,10862,305
16,04 BRONX,10628,204
4,01 QUEENS,10410,401
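
One possible further refactoring, offered only as a sketch: the chain of `elif` branches can be replaced with a dictionary lookup. `BOROUGH_DIGITS` and `recode_borocd_lookup()` are hypothetical names, and this version takes the `Community Board` string directly rather than a whole row.

```python
BOROUGH_DIGITS = {
    "MANHATTAN": "1",
    "BRONX": "2",
    "BROOKLYN": "3",
    "QUEENS": "4",
    "STATEN ISLAND": "5",
}


def recode_borocd_lookup(board):
    # Split "12 MANHATTAN" at the first space into "12" and "MANHATTAN"
    num, _, borough = board.partition(" ")

    # Guard clause: anything without a numeric district or a known borough is invalid
    if not num.isdigit() or borough not in BOROUGH_DIGITS:
        return "Invalid BoroCD"

    # zfill(2) pads a single-digit district number to two digits
    return BOROUGH_DIGITS[borough] + num.zfill(2)


print(recode_borocd_lookup("12 MANHATTAN"))
print(recode_borocd_lookup("01 STATEN ISLAND"))
print(recode_borocd_lookup("Unspecified QUEENS"))
```

Under that assumption, it could be applied to the column with something like `cb_counts["Community Board"].apply(recode_borocd_lookup)`.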


## Next, let's create the `borocd` column in the population dataset

In [22]:
population.head()

Unnamed: 0,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population
0,Bronx,1,"Melrose, Mott Haven, Port Morris",138557,78441,77214,82159,91497
1,Bronx,2,"Hunts Point, Longwood",99493,34399,39443,46824,52246
2,Bronx,3,"Morrisania, Crotona Park East",150636,53635,57162,68574,79762
3,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441
4,Bronx,5,"University Hts., Fordham, Mt. Hope",121807,107995,118435,128313,128200


In [23]:
population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Borough          59 non-null     object
 1   CD Number        59 non-null     int64 
 2   CD Name          59 non-null     object
 3   1970 Population  59 non-null     int64 
 4   1980 Population  59 non-null     int64 
 5   1990 Population  59 non-null     int64 
 6   2000 Population  59 non-null     int64 
 7   2010 Population  59 non-null     int64 
dtypes: int64(6), object(2)
memory usage: 3.8+ KB


Create a function `recode_borocd_pop` that combines and recodes the Borough and CD Number values to create a BoroCD unique ID.

In [24]:
def recode_borocd_pop(row):
    if row.Borough == "Manhattan":
        return str(100 + row["CD Number"])
    elif row.Borough == "Bronx":
        return str(200 + row["CD Number"])
    elif row.Borough == "Brooklyn":
        return str(300 + row["CD Number"])
    elif row.Borough == "Queens":
        return str(400 + row["CD Number"])
    elif row.Borough == "Staten Island":
        return str(500 + row["CD Number"])
    else:
        return "Invalid BoroCD"

This is different from `recode_borocd_counts()` because:

- The `Borough` and `CD Number` are separate columns in the `population` DataFrame, rather than combined in one like the 311 data
- We are working with the `CD Number` as an integer rather than a string
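
Because the district number is already an integer here, the same recoding could also be done on whole columns rather than row by row. The following is a sketch on made-up rows, with `borough_hundreds` as a hypothetical lookup table. Unlike the function above, it does not handle unrecognized boroughs.

```python
import pandas as pd

# Toy data, invented for illustration
toy_population = pd.DataFrame(
    {"Borough": ["Manhattan", "Brooklyn", "Staten Island"], "CD Number": [12, 6, 1]}
)

borough_hundreds = {
    "Manhattan": 100,
    "Bronx": 200,
    "Brooklyn": 300,
    "Queens": 400,
    "Staten Island": 500,
}

# Look up each borough's base number, add the district number, then convert to text
toy_population["borocd"] = (
    toy_population["Borough"].map(borough_hundreds) + toy_population["CD Number"]
).astype(str)

print(toy_population)
```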

In [25]:
population["borocd"] = population.apply(recode_borocd_pop, axis=1)
population

Unnamed: 0,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population,borocd
0,Bronx,1,"Melrose, Mott Haven, Port Morris",138557,78441,77214,82159,91497,201
1,Bronx,2,"Hunts Point, Longwood",99493,34399,39443,46824,52246,202
2,Bronx,3,"Morrisania, Crotona Park East",150636,53635,57162,68574,79762,203
3,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441,204
4,Bronx,5,"University Hts., Fordham, Mt. Hope",121807,107995,118435,128313,128200,205
5,Bronx,6,"East Tremont, Belmont",114137,65016,68061,75688,83268,206
6,Bronx,7,"Bedford Park, Norwood, Fordham",113764,116827,128588,141411,139286,207
7,Bronx,8,"Riverdale, Kingsbridge, Marble Hill",103543,98275,97030,101332,101731,208
8,Bronx,9,"Soundview, Parkchester",166442,167627,155970,167859,172298,209
9,Bronx,10,"Throgs Nk., Co-op City, Pelham Bay",84948,106516,108093,115948,120392,210


## Join the population data onto the counts data after creating shared `borocd` unique ID

To join dataframes together, we will use the [pandas `.merge()` function](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html#join-tables-using-a-common-identifier).

![merge diagram](https://pandas.pydata.org/pandas-docs/stable/_images/08_merge_left.svg)

In [26]:
merged_data = pd.merge(left=cb_counts, right=population, left_on="boro_cd", right_on="borocd")
merged_data

Unnamed: 0,Community Board,num_311_requests,boro_cd,Borough,CD Number,CD Name,1970 Population,1980 Population,1990 Population,2000 Population,2010 Population,borocd
0,12 MANHATTAN,14110,112,Manhattan,12,"Washington Heights, Inwood",180561,179941,198192,208414,190020,112
1,05 QUEENS,12487,405,Queens,5,"Ridgewood, Glendale, Maspeth",161022,150142,149126,165911,169190,405
2,12 QUEENS,12228,412,Queens,12,"Jamaica, St. Albans, Hollis",206639,189383,201293,223602,225919,412
3,01 BROOKLYN,11863,301,Brooklyn,1,"Williamsburg, Greenpoint",179390,142942,155972,160338,173083,301
4,03 BROOKLYN,11615,303,Brooklyn,3,Bedford Stuyvesant,203380,133379,138696,143867,152985,303
5,01 STATEN ISLAND,11438,501,Staten Island,1,"Stapleton, Port Richmond",135875,138489,137806,162609,175756,501
6,07 QUEENS,11210,407,Queens,7,"Flushing, Bay Terrace",207589,204785,220508,242952,247354,407
7,05 BROOKLYN,10862,305,Brooklyn,5,"East New York, Starrett City",170791,154931,161350,173198,182896,305
8,04 BRONX,10628,204,Bronx,4,"Highbridge, Concourse Village",144207,114312,119962,139563,146441,204
9,01 QUEENS,10410,401,Queens,1,"Astoria, Long Island City",185925,185198,188549,211220,191105,401


[Different types of merges](https://pandas.pydata.org/docs/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra)
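
By default, `pd.merge()` performs an inner merge, so rows without a match on both sides are dropped from the result. The sketch below uses made-up frames (the `population` column and all the numbers are invented) to compare that with `how="left"`, and uses `indicator=True` to find rows that had no match.

```python
import pandas as pd

# Toy data, invented for illustration
toy_counts = pd.DataFrame(
    {"boro_cd": ["112", "405", "Invalid BoroCD"], "num_311_requests": [30, 20, 5]}
)
toy_population = pd.DataFrame(
    {"borocd": ["112", "405", "301"], "population": [1000, 2000, 3000]}
)

# Inner merge (the default): only IDs present in both frames survive
inner = pd.merge(toy_counts, toy_population, left_on="boro_cd", right_on="borocd")

# Left merge: keep every row of the left frame; indicator=True adds a `_merge` column
left = pd.merge(
    toy_counts,
    toy_population,
    left_on="boro_cd",
    right_on="borocd",
    how="left",
    indicator=True,
)

# Rows from the left frame that found no partner on the right
unmatched = left[left["_merge"] == "left_only"]

print(len(inner), len(left))
print(unmatched)
```

Checking the unmatched rows is a quick way to confirm that only the expected records, such as the `Invalid BoroCD` ones, fell out of the merge.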

In [27]:
# remove the redundant column
merged_data = merged_data.drop("borocd", axis="columns")

# save the data to a file
# merged_data.to_csv("data/community_district_311.csv", index=False)

## [Homework 2](https://python-public-policy.afeld.me/en/nyu/hw_2.html)