Class 2: Manipulating and combining data#

Feeling overwhlemed?#

Reminder that learning to code is like learning a spoken language. It’s not obvious, and people will pick it up at different speeds at different spots. Try:

Small example of comment-driven development:

# find valid ZIP codes
# filter the DataFrame to only invalid ZIP codes

Data cleaning#

Data Cleansing is a process of removing or fixing incorrect, malformed, incomplete, duplicate, or corrupted data

https://hevodata.com/learn/data-cleansing-a-simplified-guide/

Things to check for#

From my workshop on data cleaning:

  • Missing data

    • Empty values

  • Bad (junk) values

    • Duplicates

    • Mismatched types/formatting

  • Categorical data

    • Unique values (cardinality)

    • Value counts

  • Continuous values

    • Ranges

    • Spread (distribution)

Notes:

  • “Values” in this case can be a single cell (in the spreadsheet sense) or a whole row

  • “Missing” or “duplicates” can be columns (Series), tables (DataFrames), rows, or cells

  • “Categorical data” have a fixed set of values

  • This isn’t everything you can check for, but should cover most things

Data cleaning mnemonic#

  • Empty

  • Bad

  • Unique

  • Spread

Today’s goal: Which Community Districts have the most 311 requests? Why might that be?#

What’s a Community District?#

  • 59 local governance districts each run by an appointed Community Board

  • Community boards advise on land use and zoning, participate in the city budget process, and address service delivery in their district.

  • Community boards are each composed of up to 50 volunteer members appointed by the local borough president, half from nominations by the local City Council members.

Map of community districts from Wikipedia

Setup#

import pandas as pd
# Display more rows and columns in the DataFrames
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

Read our cleaned 311 Service Requests dataset#

url = "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip"
requests = pd.read_csv(url)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_27079/610958175.py:2: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(url)

Dealing with dtypes#

More data cleaning!

Minion character vacuuming

DtypeWarning: Columns (8,20,31,34) have mixed types.
requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499958 entries, 0 to 499957
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      499958 non-null  int64  
 1   Created Date                    499958 non-null  object 
 2   Closed Date                     476140 non-null  object 
 3   Agency                          499958 non-null  object 
 4   Agency Name                     499958 non-null  object 
 5   Complaint Type                  499958 non-null  object 
 6   Descriptor                      492496 non-null  object 
 7   Location Type                   392573 non-null  object 
 8   Incident Zip                    480394 non-null  object 
 9   Incident Address                434529 non-null  object 
 10  Street Name                     434504 non-null  object 
 11  Cross Street 1                  300825 non-null  object 
 12  Cross Street 2                  299624 non-null  object 
 13  Intersection Street 1           107377 non-null  object 
 14  Intersection Street 2           107042 non-null  object 
 15  Address Type                    451006 non-null  object 
 16  City                            476632 non-null  object 
 17  Landmark                        32516 non-null   object 
 18  Facility Type                   134918 non-null  object 
 19  Status                          499958 non-null  object 
 20  Due Date                        171534 non-null  object 
 21  Resolution Description          457354 non-null  object 
 22  Resolution Action Updated Date  488788 non-null  object 
 23  Community Board                 499958 non-null  object 
 24  BBL                             407338 non-null  float64
 25  Borough                         499958 non-null  object 
 26  X Coordinate (State Plane)      470815 non-null  float64
 27  Y Coordinate (State Plane)      470815 non-null  float64
 28  Open Data Channel Type          499958 non-null  object 
 29  Park Facility Name              499931 non-null  object 
 30  Park Borough                    499958 non-null  object 
 31  Vehicle Type                    37 non-null      object 
 32  Taxi Company Borough            403 non-null     object 
 33  Taxi Pick Up Location           4474 non-null    object 
 34  Bridge Highway Name             696 non-null     object 
 35  Bridge Highway Direction        765 non-null     object 
 36  Road Ramp                       759 non-null     object 
 37  Bridge Highway Segment          1007 non-null    object 
 38  Latitude                        470815 non-null  float64
 39  Longitude                       470815 non-null  float64
 40  Location                        470815 non-null  object 
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB
list(requests["Incident Zip"].unique())
['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 nan,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '10452',
 '11377',
 '10029',
 '10003',
 '10075',
 '11222',
 '10128',
 '11415',
 '11204',
 '10030',
 '11432',
 '10308',
 '11102',
 '10016',
 '10463',
 '11412',
 '10011',
 '11106',
 '10457',
 '11375',
 '11356',
 '11228',
 '11369',
 '10458',
 '11411',
 '11215',
 '10309',
 '11358',
 '11355',
 '11219',
 '10031',
 '10303',
 '11232',
 '10302',
 '11238',
 '10005',
 '11429',
 '11205',
 '10023',
 '11373',
 '10033',
 '10039',
 '10028',
 '11363',
 '11354',
 '11692',
 '10455',
 '11370',
 '10035',
 '10012',
 '10024',
 '10009',
 '10034',
 '11001',
 '10466',
 '11427',
 '11004',
 '10454',
 '11414',
 '11360',
 '10461',
 '10018',
 '10006',
 '11236',
 '11428',
 '10474',
 '10471',
 '10037',
 '10475',
 '11430',
 '10119',
 '10038',
 '11426',
 '11239',
 '10151',
 '10120',
 '10112',
 '10168',
 '10004',
 '10464',
 '18773-9640',
 '10282',
 '11109',
 '10280',
 '07114',
 '11040',
 '10000',
 '07090',
 '10020',
 '00083',
 '10044',
 '10069',
 '10118',
 '07102',
 '11359',
 '11005',
 '10169',
 '11697',
 '10115',
 11226.0,
 11236.0,
 10466.0,
 11210.0,
 10452.0,
 10457.0,
 10304.0,
 11416.0,
 11205.0,
 11214.0,
 11692.0,
 11101.0,
 11201.0,
 10470.0,
 11375.0,
 10305.0,
 11420.0,
 11421.0,
 11419.0,
 11412.0,
 10309.0,
 11385.0,
 11379.0,
 11356.0,
 10021.0,
 11417.0,
 11377.0,
 11423.0,
 10001.0,
 10469.0,
 10463.0,
 11372.0,
 11422.0,
 10017.0,
 11368.0,
 11213.0,
 11220.0,
 10312.0,
 10025.0,
 11357.0,
 11230.0,
 10456.0,
 11104.0,
 11232.0,
 11208.0,
 11206.0,
 11229.0,
 10465.0,
 10472.0,
 10039.0,
 10016.0,
 10009.0,
 11355.0,
 11211.0,
 10040.0,
 11366.0,
 11234.0,
 10013.0,
 11207.0,
 10032.0,
 10036.0,
 10029.0,
 11204.0,
 11435.0,
 11373.0,
 10007.0,
 11216.0,
 10306.0,
 11378.0,
 11414.0,
 10314.0,
 10128.0,
 10027.0,
 11221.0,
 10024.0,
 11364.0,
 10468.0,
 11233.0,
 10014.0,
 11238.0,
 11374.0,
 11105.0,
 11219.0,
 11106.0,
 10003.0,
 10026.0,
 11411.0,
 11223.0,
 11693.0,
 11365.0,
 11209.0,
 11103.0,
 11361.0,
 11427.0,
 10467.0,
 11203.0,
 11102.0,
 10455.0,
 10454.0,
 11225.0,
 11212.0,
 11434.0,
 10461.0,
 11369.0,
 11432.0,
 10031.0,
 11217.0,
 11235.0,
 10458.0,
 10453.0,
 11215.0,
 11367.0,
 11231.0,
 11413.0,
 11358.0,
 11694.0,
 10462.0,
 11218.0,
 11430.0,
 10473.0,
 10011.0,
 10460.0,
 10019.0,
 10310.0,
 10037.0,
 10451.0,
 10301.0,
 10010.0,
 11354.0,
 11222.0,
 11691.0,
 10034.0,
 10308.0,
 10012.0,
 11237.0,
 10801.0,
 10030.0,
 11428.0,
 11418.0,
 10033.0,
 10459.0,
 11429.0,
 10065.0,
 10022.0,
 10005.0,
 10002.0,
 10035.0,
 11370.0,
 10004.0,
 11415.0,
 11426.0,
 11362.0,
 10006.0,
 11360.0,
 10038.0,
 11249.0,
 11001.0,
 11224.0,
 10075.0,
 10018.0,
 10302.0,
 11436.0,
 11433.0,
 10474.0,
 10023.0,
 11004.0,
 10028.0,
 11228.0,
 10475.0,
 10069.0,
 11363.0,
 10303.0,
 10282.0,
 10307.0,
 10103.0,
 10280.0,
 10471.0,
 10271.0,
 10000.0,
 10464.0,
 10107.0,
 7078.0,
 10044.0,
 10020.0,
 56303.0,
 10119.0,
 11040.0,
 11386.0,
 10952.0,
 6811.0,
 11239.0,
 10121.0,
 37214.0,
 10538.0,
 10112.0,
 10279.0,
 11109.0,
 11371.0,
 18017.0,
 7115.0,
 77036.0,
 7114.0,
 '10123',
 '11030',
 '10801',
 '92626-1902',
 '07302',
 '18773',
 '12222',
 '10710',
 '10103',
 '07057',
 '10162',
 '11582',
 '10281',
 '10271',
 '10107',
 'HARRISBURG',
 '11735',
 '07305',
 'N5X3A6',
 '11746',
 '11371',
 '23450',
 11580.0,
 10591.0,
 11005.0,
 19034.0,
 11596.0,
 11779.0,
 7621.0,
 11021.0,
 11241.0,
 10169.0,
 100000.0,
 11242.0,
 10151.0,
 11697.0,
 7086.0,
 10177.0,
 10118.0,
 10105.0,
 10152.0,
 10168.0,
 7093.0,
 10917.0,
 10110.0,
 10153.0,
 10178.0,
 11570.0,
 10601.0,
 10704.0,
 7424.0,
 10281.0,
 10158.0,
 '10158',
 '10172',
 '10179',
 'IDK',
 '11801',
 '10601',
 '11590',
 '10155',
 '11202',
 '1801',
 '11581',
 11359.0,
 11758.0,
 10278.0,
 43017.0,
 10154.0,
 11553.0,
 10162.0,
 11695.0,
 10041.0,
 11741.0,
 98335.0,
 14814.0,
 10111.0,
 '12345',
 '11572',
 '11520',
 '14614-195',
 '10121',
 '10105',
 '10701',
 0.0,
 17106.0,
 979113.0,
 10120.0,
 12345.0,
 11030.0,
 11797.0,
 100.0,
 11710.0,
 33624.0,
 8682.0,
 11747.0,
 1757.0,
 11561.0,
 7304.0,
 6851.0,
 11590.0,
 94267.0,
 10167.0,
 11749.0,
 11756.0,
 10174.0,
 10550.0,
 89119.0,
 14068.0,
 11722.0,
 11520.0,
 6460.0,
 32255.0,
 10173.0,
 10165.0,
 11946.0,
 '29616-0759',
 '07032',
 '10278',
 '11575',
 11963.0,
 10106.0,
 83.0,
 11566.0,
 6870.0,
 7001.0,
 10710.0,
 11735.0,
 11572.0,
 '10165',
 '10279',
 '11251',
 'NJ 07114',
 '10106',
 '07666',
 '11516',
 '10177',
 '10170',
 '43215-1441',
 '00000',
 '08081',
 '10803',
 11507.0,
 11701.0,
 11563.0,
 7047.0,
 3108.0,
 '11021',
 '07003',
 '10152',
 '07029',
 '10041',
 '31093',
 '11735-0230',
 89118.0,
 11803.0,
 11559.0,
 11565.0,
 7080.0,
 12601.0,
 10155.0,
 10171.0,
 7208.0,
 11757.0,
 11042.0,
 '1101',
 '10111',
 '10173',
 '10096',
 '07087',
 'DID N',
 '10956',
 10048.0,
 10123.0,
 10122.0,
 11251.0]

ZIP codes look numeric, but aren’t really.

Read the ZIP codes in as strings.

requests2 = pd.read_csv(url, dtype={"Incident Zip": "string"})
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_27079/1354368569.py:1: DtypeWarning: Columns (20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests2 = pd.read_csv(url, dtype={"Incident Zip": "string"})

We fixed the dtype warning for column 8 (Incident Zip).

list(requests2["Incident Zip"].unique())
['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 <NA>,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '10452',
 '11377',
 '10029',
 '10003',
 '10075',
 '11222',
 '10128',
 '11415',
 '11204',
 '10030',
 '11432',
 '10308',
 '11102',
 '10016',
 '10463',
 '11412',
 '10011',
 '11106',
 '10457',
 '11375',
 '11356',
 '11228',
 '11369',
 '10458',
 '11411',
 '11215',
 '10309',
 '11358',
 '11355',
 '11219',
 '10031',
 '10303',
 '11232',
 '10302',
 '11238',
 '10005',
 '11429',
 '11205',
 '10023',
 '11373',
 '10033',
 '10039',
 '10028',
 '11363',
 '11354',
 '11692',
 '10455',
 '11370',
 '10035',
 '10012',
 '10024',
 '10009',
 '10034',
 '11001',
 '10466',
 '11427',
 '11004',
 '10454',
 '11414',
 '11360',
 '10461',
 '10018',
 '10006',
 '11236',
 '11428',
 '10474',
 '10471',
 '10037',
 '10475',
 '11430',
 '10119',
 '10038',
 '11426',
 '11239',
 '10151',
 '10120',
 '10112',
 '10168',
 '10004',
 '10464',
 '18773-9640',
 '10282',
 '11109',
 '10280',
 '07114',
 '11040',
 '10000',
 '07090',
 '10020',
 '00083',
 '10044',
 '10069',
 '10118',
 '07102',
 '11359',
 '11005',
 '10169',
 '11697',
 '10115',
 '10801',
 '10103',
 '10271',
 '10107',
 '07078',
 '56303',
 '11386',
 '10952',
 '06811',
 '10121',
 '37214',
 '10538',
 '10279',
 '11371',
 '18017',
 '07115',
 '77036',
 '10123',
 '11030',
 '92626-1902',
 '07302',
 '18773',
 '12222',
 '10710',
 '07057',
 '10162',
 '11582',
 '10281',
 'HARRISBURG',
 '11735',
 '07305',
 'N5X3A6',
 '11746',
 '23450',
 '11580',
 '10591',
 '19034',
 '11596',
 '11779',
 '07621',
 '11021',
 '11241',
 '100000',
 '11242',
 '07086',
 '10177',
 '10105',
 '10152',
 '07093',
 '10917',
 '10110',
 '10153',
 '10178',
 '11570',
 '10601',
 '10704',
 '07424',
 '10158',
 '10172',
 '10179',
 'IDK',
 '11801',
 '11590',
 '10155',
 '11202',
 '1801',
 '11581',
 '11758',
 '10278',
 '43017',
 '10154',
 '11553',
 '11695',
 '10041',
 '11741',
 '98335',
 '14814',
 '10111',
 '12345',
 '11572',
 '11520',
 '14614-195',
 '10701',
 '00000',
 '17106',
 '979113',
 '11797',
 '100',
 '11710',
 '33624',
 '8682',
 '11747',
 '01757',
 '11561',
 '07304',
 '000000',
 '06851',
 '94267',
 '10167',
 '11749',
 '11756',
 '10174',
 '10550',
 '89119',
 '14068',
 '11722',
 '06460',
 '32255',
 '10173',
 '10165',
 '11946',
 '29616-0759',
 '07032',
 '11575',
 '11963',
 '10106',
 '11566',
 '06870',
 '07001',
 '11251',
 'NJ 07114',
 '07666',
 '11516',
 '10170',
 '43215-1441',
 '08081',
 '10803',
 '11507',
 '11701',
 '11563',
 '07047',
 '03108',
 '07003',
 '07029',
 '31093',
 '11735-0230',
 '89118',
 '11803',
 '11559',
 '11565',
 '07080',
 '12601',
 '10171',
 '07208',
 '11757',
 '11042',
 '1101',
 '10096',
 '07087',
 'DID N',
 '10956',
 '10048',
 '10122']

Find invalid ZIP codes#

Use a regular expression (regex) to find strings that match a pattern:

^\d{5}(?:-\d{4})?$
│ │ │  │        │└─ end of string
│ │ │  │        └─ optional
│ │ │  └─ capture group
│ │ └─ count
│ └─ numeric/digit character
└─ start of string

regex101 is useful for testing them.

# find valid ZIP codes
valid_zips = requests2["Incident Zip"].str.contains(r"^\d{5}(?:-\d{4})?$")

# filter the DataFrame to only invalid ZIP codes
invalid_zips = valid_zips == False
requests_with_invalid_zips = requests2[invalid_zips]
requests_with_invalid_zips["Incident Zip"]
55017     HARRISBURG
58100         N5X3A6
80798         100000
120304           IDK
123304          1801
173518     14614-195
192034        979113
201463           100
207158          8682
216745        000000
325071      NJ 07114
425985          1101
441166         DID N
Name: Incident Zip, dtype: string

Clear any invalid ZIP codes:

requests2.loc[invalid_zips, "Incident Zip"] = None

Additonal data cleaning tips:

  • Hard part is finding what needs to be done

  • Will be specific to your use case

  • Document what you did, since it will affect your results

In-class exercise#

Skipping to save time. While not graded, I encourage you to do it on your own, as it will help prepare you for the Final Project.

View the contents of the community_board column in our 311 data#

requests["Community Board"].unique()
array(['15 BROOKLYN', '03 BROOKLYN', '14 QUEENS', '10 BRONX', '08 QUEENS',
       '02 BRONX', '01 QUEENS', '11 QUEENS', '02 MANHATTAN',
       '18 BROOKLYN', '12 QUEENS', '01 STATEN ISLAND', '12 BRONX',
       '05 BROOKLYN', '01 BRONX', '09 QUEENS', '04 BROOKLYN',
       '10 BROOKLYN', '02 STATEN ISLAND', '05 QUEENS', '04 MANHATTAN',
       '11 BRONX', 'Unspecified BROOKLYN', '09 BRONX', '12 MANHATTAN',
       '09 BROOKLYN', '14 BROOKLYN', '06 MANHATTAN', '10 MANHATTAN',
       'Unspecified QUEENS', '01 MANHATTAN', '03 MANHATTAN', '05 BRONX',
       '08 BROOKLYN', '02 QUEENS', '12 BROOKLYN', '01 BROOKLYN',
       '16 BROOKLYN', '13 BROOKLYN', '06 QUEENS', '07 MANHATTAN',
       '11 BROOKLYN', 'Unspecified BRONX', '08 MANHATTAN',
       '03 STATEN ISLAND', '06 BROOKLYN', '03 BRONX', '05 MANHATTAN',
       '07 QUEENS', '13 QUEENS', '17 BROOKLYN', '06 BRONX', '02 BROOKLYN',
       '10 QUEENS', 'Unspecified MANHATTAN', '03 QUEENS', '04 BRONX',
       '11 MANHATTAN', '08 BRONX', '07 BROOKLYN', '07 BRONX',
       '0 Unspecified', '09 MANHATTAN', '04 QUEENS', '83 QUEENS',
       '80 QUEENS', 'Unspecified STATEN ISLAND', '55 BROOKLYN',
       '82 QUEENS', '64 MANHATTAN', '28 BRONX', '95 STATEN ISLAND',
       '81 QUEENS', '27 BRONX', '26 BRONX', '56 BROOKLYN', '84 QUEENS'],
      dtype=object)

Get the count of 311 requests per Community District#

cb_counts = requests.groupby("Community Board").size().reset_index(name="num_311_requests")
cb_counts = cb_counts.sort_values("num_311_requests", ascending=False)
cb_counts
Community Board num_311_requests
50 12 MANHATTAN 14110
23 05 QUEENS 12487
51 12 QUEENS 12228
2 01 BROOKLYN 11863
12 03 BROOKLYN 11615
5 01 STATEN ISLAND 11438
31 07 QUEENS 11210
21 05 BROOKLYN 10862
16 04 BRONX 10628
4 01 QUEENS 10410
58 17 BROOKLYN 10208
54 14 BROOKLYN 10179
28 07 BRONX 9841
59 18 BROOKLYN 9717
39 09 QUEENS 9599
53 13 QUEENS 9459
48 12 BRONX 9412
45 11 BROOKLYN 9309
20 05 BRONX 9094
13 03 MANHATTAN 8905
35 08 QUEENS 8661
17 04 BROOKLYN 8639
42 10 MANHATTAN 8592
15 03 STATEN ISLAND 8524
56 15 BROOKLYN 8508
43 10 QUEENS 8333
49 12 BROOKLYN 8214
30 07 MANHATTAN 8141
36 09 BRONX 8092
41 10 BROOKLYN 7808
33 08 BROOKLYN 7797
7 02 BROOKLYN 7747
37 09 BROOKLYN 7571
25 06 BROOKLYN 7373
10 02 STATEN ISLAND 7059
0 0 Unspecified 6882
14 03 QUEENS 6799
38 09 MANHATTAN 6792
73 Unspecified BROOKLYN 6771
18 04 MANHATTAN 6765
22 05 MANHATTAN 6599
34 08 MANHATTAN 6579
8 02 MANHATTAN 6514
44 11 BRONX 6448
74 Unspecified MANHATTAN 6427
29 07 BROOKLYN 6364
9 02 QUEENS 6142
19 04 QUEENS 5824
32 08 BRONX 5753
47 11 QUEENS 5748
24 06 BRONX 5746
40 10 BRONX 5719
46 11 MANHATTAN 5714
72 Unspecified BRONX 5349
55 14 QUEENS 5284
11 03 BRONX 5262
26 06 MANHATTAN 5158
57 16 BROOKLYN 5100
1 01 BRONX 4915
75 Unspecified QUEENS 4870
27 06 QUEENS 4607
52 13 BROOKLYN 3840
3 01 MANHATTAN 3623
6 02 BRONX 3470
76 Unspecified STATEN ISLAND 541
69 83 QUEENS 193
65 64 MANHATTAN 126
66 80 QUEENS 102
68 82 QUEENS 72
67 81 QUEENS 54
63 55 BROOKLYN 53
62 28 BRONX 26
60 26 BRONX 25
71 95 STATEN ISLAND 21
70 84 QUEENS 18
61 27 BRONX 17
64 56 BROOKLYN 13

Research Question: What may account for the variance in count of requests per community district?#

Hypothesis: Population size may help explain the variance.#

We can combine the counts per community district dataset with population data for each community district.

We’ll use pandas’ .merge(), comparable to:

In general, called “record linkage” or “entity resolution”.

Let’s load the population dataset and check out its contents#

Data source for population by Community District

population = pd.read_csv("https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv")
population.head()
Borough CD Number CD Name 1970 Population 1980 Population 1990 Population 2000 Population 2010 Population
0 Bronx 1 Melrose, Mott Haven, Port Morris 138557 78441 77214 82159 91497
1 Bronx 2 Hunts Point, Longwood 99493 34399 39443 46824 52246
2 Bronx 3 Morrisania, Crotona Park East 150636 53635 57162 68574 79762
3 Bronx 4 Highbridge, Concourse Village 144207 114312 119962 139563 146441
4 Bronx 5 University Hts., Fordham, Mt. Hope 121807 107995 118435 128313 128200

In order to join the two dataframes, we need to create a common ID in each.#

BORO CODE (a.k.a. BoroCode, borocd, and boro_cd) is a commonly-used a unique ID for community districts. Let’s create functions that create that unique ID in our datasets.

BoroCD is a 3 digit integer that captures the borough and district number. The borough is represented by the first digit. The district number is padded with zeros so it’s always two digits long.

Boroughs are recoded into the following numbers:

  • 1: Manhattan

  • 2: Bronx

  • 3: Brooklyn

  • 4: Queens

  • 5: Staten Island

Ex:

  • Manhattan 12 –> 112

  • Brooklyn 6 –> 306

First, let’s create a borocd column in cb_counts dataframe#

cb_counts.head()
Community Board num_311_requests
50 12 MANHATTAN 14110
23 05 QUEENS 12487
51 12 QUEENS 12228
2 01 BROOKLYN 11863
12 03 BROOKLYN 11615

apply() can be used for transforming data with a custom function. How does it work?

def my_function(row):
    # do stuff
    return some_value

new_values = dataframe.apply(my_function, axis=1)

While pandas generally operates on an entire column at once, apply() is similar to working with CSVs in pure Python in that you are operating row by row.

Let’s create a function called recode_borocd_counts that takes a row and converts the Community Board value into a borocd value.

def recode_borocd_counts(row):
    if "MANHATTAN" in row["Community Board"]:
        return "1" + row["Community Board"][0:2]
        # [0:2] provides the first 2 characters, i.e. characters at indexes 0 and 1.
        # you could also use [:2] without the zero.
    elif "BRONX" in row["Community Board"]:
        return "2" + row["Community Board"][0:2]
    elif "BROOKLYN" in row["Community Board"]:
        return "3" + row["Community Board"][0:2]
    elif "QUEENS" in row["Community Board"]:
        return "4" + row["Community Board"][0:2]
    elif "STATEN ISLAND" in row["Community Board"]:
        return "5" + row["Community Board"][0:2]
    else:
        return "Invalid BoroCD"

Let’s test out that function in isolation. We’ll grab one of the rows and pass it into the function.

sample_row = cb_counts.iloc[0]
sample_row
Community Board     12 MANHATTAN
num_311_requests           14110
Name: 50, dtype: object
recode_borocd_counts(sample_row)
'112'

Now we use apply() to do that across all the rows.

cb_counts["boro_cd"] = cb_counts.apply(recode_borocd_counts, axis=1)
  • apply() (the way we’re using it) takes a function and runs it against each row of a DataFrame, returning the results as a Series

  • axis=1 specifies that you want to apply the function across the rows instead of columns

  • cb_counts['borocd'] = creates a new column in the DataFrame called borocd

cb_counts
Community Board num_311_requests boro_cd
50 12 MANHATTAN 14110 112
23 05 QUEENS 12487 405
51 12 QUEENS 12228 412
2 01 BROOKLYN 11863 301
12 03 BROOKLYN 11615 303
5 01 STATEN ISLAND 11438 501
31 07 QUEENS 11210 407
21 05 BROOKLYN 10862 305
16 04 BRONX 10628 204
4 01 QUEENS 10410 401
58 17 BROOKLYN 10208 317
54 14 BROOKLYN 10179 314
28 07 BRONX 9841 207
59 18 BROOKLYN 9717 318
39 09 QUEENS 9599 409
53 13 QUEENS 9459 413
48 12 BRONX 9412 212
45 11 BROOKLYN 9309 311
20 05 BRONX 9094 205
13 03 MANHATTAN 8905 103
35 08 QUEENS 8661 408
17 04 BROOKLYN 8639 304
42 10 MANHATTAN 8592 110
15 03 STATEN ISLAND 8524 503
56 15 BROOKLYN 8508 315
43 10 QUEENS 8333 410
49 12 BROOKLYN 8214 312
30 07 MANHATTAN 8141 107
36 09 BRONX 8092 209
41 10 BROOKLYN 7808 310
33 08 BROOKLYN 7797 308
7 02 BROOKLYN 7747 302
37 09 BROOKLYN 7571 309
25 06 BROOKLYN 7373 306
10 02 STATEN ISLAND 7059 502
0 0 Unspecified 6882 Invalid BoroCD
14 03 QUEENS 6799 403
38 09 MANHATTAN 6792 109
73 Unspecified BROOKLYN 6771 3Un
18 04 MANHATTAN 6765 104
22 05 MANHATTAN 6599 105
34 08 MANHATTAN 6579 108
8 02 MANHATTAN 6514 102
44 11 BRONX 6448 211
74 Unspecified MANHATTAN 6427 1Un
29 07 BROOKLYN 6364 307
9 02 QUEENS 6142 402
19 04 QUEENS 5824 404
32 08 BRONX 5753 208
47 11 QUEENS 5748 411
24 06 BRONX 5746 206
40 10 BRONX 5719 210
46 11 MANHATTAN 5714 111
72 Unspecified BRONX 5349 2Un
55 14 QUEENS 5284 414
11 03 BRONX 5262 203
26 06 MANHATTAN 5158 106
57 16 BROOKLYN 5100 316
1 01 BRONX 4915 201
75 Unspecified QUEENS 4870 4Un
27 06 QUEENS 4607 406
52 13 BROOKLYN 3840 313
3 01 MANHATTAN 3623 101
6 02 BRONX 3470 202
76 Unspecified STATEN ISLAND 541 5Un
69 83 QUEENS 193 483
65 64 MANHATTAN 126 164
66 80 QUEENS 102 480
68 82 QUEENS 72 482
67 81 QUEENS 54 481
63 55 BROOKLYN 53 355
62 28 BRONX 26 228
60 26 BRONX 25 226
71 95 STATEN ISLAND 21 595
70 84 QUEENS 18 484
61 27 BRONX 17 227
64 56 BROOKLYN 13 356

Uh oh, there are some unexpected Unspecified values in here - how can we get around them?

Let’s only recode records that don’t start with “U”.

def recode_borocd_counts(row):
    if "MANHATTAN" in row["Community Board"] and row["Community Board"][0] != "U":
        return "1" + row["Community Board"][0:2]
    elif "BRONX" in row["Community Board"] and row["Community Board"][0] != "U":
        return "2" + row["Community Board"][0:2]
    elif "BROOKLYN" in row["Community Board"] and row["Community Board"][0] != "U":
        return "3" + row["Community Board"][0:2]
    elif "QUEENS" in row["Community Board"] and row["Community Board"][0] != "U":
        return "4" + row["Community Board"][0:2]
    elif "STATEN ISLAND" in row["Community Board"] and row["Community Board"][0] != "U":
        return "5" + row["Community Board"][0:2]
    else:
        return "Invalid BoroCD"


cb_counts["boro_cd"] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts
Community Board num_311_requests boro_cd
50 12 MANHATTAN 14110 112
23 05 QUEENS 12487 405
51 12 QUEENS 12228 412
2 01 BROOKLYN 11863 301
12 03 BROOKLYN 11615 303
5 01 STATEN ISLAND 11438 501
31 07 QUEENS 11210 407
21 05 BROOKLYN 10862 305
16 04 BRONX 10628 204
4 01 QUEENS 10410 401
58 17 BROOKLYN 10208 317
54 14 BROOKLYN 10179 314
28 07 BRONX 9841 207
59 18 BROOKLYN 9717 318
39 09 QUEENS 9599 409
53 13 QUEENS 9459 413
48 12 BRONX 9412 212
45 11 BROOKLYN 9309 311
20 05 BRONX 9094 205
13 03 MANHATTAN 8905 103
35 08 QUEENS 8661 408
17 04 BROOKLYN 8639 304
42 10 MANHATTAN 8592 110
15 03 STATEN ISLAND 8524 503
56 15 BROOKLYN 8508 315
43 10 QUEENS 8333 410
49 12 BROOKLYN 8214 312
30 07 MANHATTAN 8141 107
36 09 BRONX 8092 209
41 10 BROOKLYN 7808 310
33 08 BROOKLYN 7797 308
7 02 BROOKLYN 7747 302
37 09 BROOKLYN 7571 309
25 06 BROOKLYN 7373 306
10 02 STATEN ISLAND 7059 502
0 0 Unspecified 6882 Invalid BoroCD
14 03 QUEENS 6799 403
38 09 MANHATTAN 6792 109
73 Unspecified BROOKLYN 6771 Invalid BoroCD
18 04 MANHATTAN 6765 104
22 05 MANHATTAN 6599 105
34 08 MANHATTAN 6579 108
8 02 MANHATTAN 6514 102
44 11 BRONX 6448 211
74 Unspecified MANHATTAN 6427 Invalid BoroCD
29 07 BROOKLYN 6364 307
9 02 QUEENS 6142 402
19 04 QUEENS 5824 404
32 08 BRONX 5753 208
47 11 QUEENS 5748 411
24 06 BRONX 5746 206
40 10 BRONX 5719 210
46 11 MANHATTAN 5714 111
72 Unspecified BRONX 5349 Invalid BoroCD
55 14 QUEENS 5284 414
11 03 BRONX 5262 203
26 06 MANHATTAN 5158 106
57 16 BROOKLYN 5100 316
1 01 BRONX 4915 201
75 Unspecified QUEENS 4870 Invalid BoroCD
27 06 QUEENS 4607 406
52 13 BROOKLYN 3840 313
3 01 MANHATTAN 3623 101
6 02 BRONX 3470 202
76 Unspecified STATEN ISLAND 541 Invalid BoroCD
69 83 QUEENS 193 483
65 64 MANHATTAN 126 164
66 80 QUEENS 102 480
68 82 QUEENS 72 482
67 81 QUEENS 54 481
63 55 BROOKLYN 53 355
62 28 BRONX 26 228
60 26 BRONX 25 226
71 95 STATEN ISLAND 21 595
70 84 QUEENS 18 484
61 27 BRONX 17 227
64 56 BROOKLYN 13 356

We can make this function easier to read by isolating the logic that applies to all the conditions. This is called “refactoring”.

def recode_borocd_counts(row):
    board = row["Community Board"]

    # doing a check and then returning from a function early is known as a "guard clause"
    if board.startswith("U"):
        return "Invalid BoroCD"

    num = board[0:2]

    if "MANHATTAN" in board:
        return "1" + num
    elif "BRONX" in board:
        return "2" + num
    elif "BROOKLYN" in board:
        return "3" + num
    elif "QUEENS" in board:
        return "4" + num
    elif "STATEN ISLAND" in board:
        return "5" + num
    else:
        return "Invalid BoroCD"
cb_counts["boro_cd"] = cb_counts.apply(recode_borocd_counts, axis=1)
cb_counts
Community Board num_311_requests boro_cd
50 12 MANHATTAN 14110 112
23 05 QUEENS 12487 405
51 12 QUEENS 12228 412
2 01 BROOKLYN 11863 301
12 03 BROOKLYN 11615 303
5 01 STATEN ISLAND 11438 501
31 07 QUEENS 11210 407
21 05 BROOKLYN 10862 305
16 04 BRONX 10628 204
4 01 QUEENS 10410 401
58 17 BROOKLYN 10208 317
54 14 BROOKLYN 10179 314
28 07 BRONX 9841 207
59 18 BROOKLYN 9717 318
39 09 QUEENS 9599 409
53 13 QUEENS 9459 413
48 12 BRONX 9412 212
45 11 BROOKLYN 9309 311
20 05 BRONX 9094 205
13 03 MANHATTAN 8905 103
35 08 QUEENS 8661 408
17 04 BROOKLYN 8639 304
42 10 MANHATTAN 8592 110
15 03 STATEN ISLAND 8524 503
56 15 BROOKLYN 8508 315
43 10 QUEENS 8333 410
49 12 BROOKLYN 8214 312
30 07 MANHATTAN 8141 107
36 09 BRONX 8092 209
41 10 BROOKLYN 7808 310
33 08 BROOKLYN 7797 308
7 02 BROOKLYN 7747 302
37 09 BROOKLYN 7571 309
25 06 BROOKLYN 7373 306
10 02 STATEN ISLAND 7059 502
0 0 Unspecified 6882 Invalid BoroCD
14 03 QUEENS 6799 403
38 09 MANHATTAN 6792 109
73 Unspecified BROOKLYN 6771 Invalid BoroCD
18 04 MANHATTAN 6765 104
22 05 MANHATTAN 6599 105
34 08 MANHATTAN 6579 108
8 02 MANHATTAN 6514 102
44 11 BRONX 6448 211
74 Unspecified MANHATTAN 6427 Invalid BoroCD
29 07 BROOKLYN 6364 307
9 02 QUEENS 6142 402
19 04 QUEENS 5824 404
32 08 BRONX 5753 208
47 11 QUEENS 5748 411
24 06 BRONX 5746 206
40 10 BRONX 5719 210
46 11 MANHATTAN 5714 111
72 Unspecified BRONX 5349 Invalid BoroCD
55 14 QUEENS 5284 414
11 03 BRONX 5262 203
26 06 MANHATTAN 5158 106
57 16 BROOKLYN 5100 316
1 01 BRONX 4915 201
75 Unspecified QUEENS 4870 Invalid BoroCD
27 06 QUEENS 4607 406
52 13 BROOKLYN 3840 313
3 01 MANHATTAN 3623 101
6 02 BRONX 3470 202
76 Unspecified STATEN ISLAND 541 Invalid BoroCD
69 83 QUEENS 193 483
65 64 MANHATTAN 126 164
66 80 QUEENS 102 480
68 82 QUEENS 72 482
67 81 QUEENS 54 481
63 55 BROOKLYN 53 355
62 28 BRONX 26 228
60 26 BRONX 25 226
71 95 STATEN ISLAND 21 595
70 84 QUEENS 18 484
61 27 BRONX 17 227
64 56 BROOKLYN 13 356

Next, let’s create the borocd column in the population dataset#

population.head()
Borough CD Number CD Name 1970 Population 1980 Population 1990 Population 2000 Population 2010 Population
0 Bronx 1 Melrose, Mott Haven, Port Morris 138557 78441 77214 82159 91497
1 Bronx 2 Hunts Point, Longwood 99493 34399 39443 46824 52246
2 Bronx 3 Morrisania, Crotona Park East 150636 53635 57162 68574 79762
3 Bronx 4 Highbridge, Concourse Village 144207 114312 119962 139563 146441
4 Bronx 5 University Hts., Fordham, Mt. Hope 121807 107995 118435 128313 128200
population.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Borough          59 non-null     object
 1   CD Number        59 non-null     int64 
 2   CD Name          59 non-null     object
 3   1970 Population  59 non-null     int64 
 4   1980 Population  59 non-null     int64 
 5   1990 Population  59 non-null     int64 
 6   2000 Population  59 non-null     int64 
 7   2010 Population  59 non-null     int64 
dtypes: int64(6), object(2)
memory usage: 3.8+ KB

Create a function recode_borocd_pop that combines and recodes the Borough and CD Number values to create a BoroCD unique ID.

def recode_borocd_pop(row):
    if row.Borough == "Manhattan":
        return str(100 + row["CD Number"])
    elif row.Borough == "Bronx":
        return str(200 + row["CD Number"])
    elif row.Borough == "Brooklyn":
        return str(300 + row["CD Number"])
    elif row.Borough == "Queens":
        return str(400 + row["CD Number"])
    elif row.Borough == "Staten Island":
        return str(500 + row["CD Number"])
    else:
        return "Invalid BoroCD"

This is different than recode_borocd_counts() because:

  • The Borough and CD Number are seprate columns in the population DataFrame, rather than combined in one like the 311 data

  • We are working with the CD Number as an integer rather than a string

population["borocd"] = population.apply(recode_borocd_pop, axis=1)
population
Borough CD Number CD Name 1970 Population 1980 Population 1990 Population 2000 Population 2010 Population borocd
0 Bronx 1 Melrose, Mott Haven, Port Morris 138557 78441 77214 82159 91497 201
1 Bronx 2 Hunts Point, Longwood 99493 34399 39443 46824 52246 202
2 Bronx 3 Morrisania, Crotona Park East 150636 53635 57162 68574 79762 203
3 Bronx 4 Highbridge, Concourse Village 144207 114312 119962 139563 146441 204
4 Bronx 5 University Hts., Fordham, Mt. Hope 121807 107995 118435 128313 128200 205
5 Bronx 6 East Tremont, Belmont 114137 65016 68061 75688 83268 206
6 Bronx 7 Bedford Park, Norwood, Fordham 113764 116827 128588 141411 139286 207
7 Bronx 8 Riverdale, Kingsbridge, Marble Hill 103543 98275 97030 101332 101731 208
8 Bronx 9 Soundview, Parkchester 166442 167627 155970 167859 172298 209
9 Bronx 10 Throgs Nk., Co-op City, Pelham Bay 84948 106516 108093 115948 120392 210
10 Bronx 11 Pelham Pkwy, Morris Park, Laconia 105980 99080 97842 110706 113232 211
11 Bronx 12 Wakefield, Williamsbridge 135010 128226 129620 149077 152344 212
12 Brooklyn 1 Williamsburg, Greenpoint 179390 142942 155972 160338 173083 301
13 Brooklyn 2 Brooklyn Heights, Fort Greene 110221 92732 94534 98620 99617 302
14 Brooklyn 3 Bedford Stuyvesant 203380 133379 138696 143867 152985 303
15 Brooklyn 4 Bushwick 137902 92497 102572 104358 112634 304
16 Brooklyn 5 East New York, Starrett City 170791 154931 161350 173198 182896 305
17 Brooklyn 6 Park Slope, Carroll Gardens 138933 110228 102724 104054 104709 306
18 Brooklyn 7 Sunset Park, Windsor Terrace 111607 98567 102553 120063 126230 307
19 Brooklyn 8 Crown Heights North 121821 88796 96400 96076 96317 308
20 Brooklyn 9 Crown Heights South, Wingate 101047 96669 110715 104014 98429 309
21 Brooklyn 10 Bay Ridge, Dyker Heights 129822 118187 110612 122542 124491 310
22 Brooklyn 11 Bensonhurst, Bath Beach 170119 155072 149994 172129 181981 311
23 Brooklyn 12 Borough Park, Ocean Parkway 166301 155899 160018 185046 191382 312
24 Brooklyn 13 Coney Island, Brighton Beach 97750 100030 102596 106120 104278 313
25 Brooklyn 14 Flatbush, Midwood 137041 143859 159825 168806 160664 314
26 Brooklyn 15 Sheepshead Bay, Gerritsen Beach 164815 149572 143477 160319 159650 315
27 Brooklyn 16 Brownsville, Ocean Hill 122589 73801 84923 85343 86468 316
28 Brooklyn 17 East Flatbush, Rugby, Farragut 149496 154596 161261 165753 155252 317
29 Brooklyn 18 Canarsie, Flatlands 188643 169092 162428 194653 193543 318
30 Manhattan 1 Battery Park City, Tribeca 7706 15918 25366 34420 60978 101
31 Manhattan 2 Greenwich Village, Soho 84337 87069 94105 93119 90016 102
32 Manhattan 3 Lower East Side, Chinatown 181845 154848 161617 164407 163277 103
33 Manhattan 4 Chelsea, Clinton 83601 82164 84431 87479 103245 104
34 Manhattan 5 Midtown Business District 31076 39544 43507 44028 51673 105
35 Manhattan 6 Stuyvesant Town, Turtle Bay 122465 127554 133748 136152 142745 106
36 Manhattan 7 West Side, Upper West Side 212422 206669 210993 207699 209084 107
37 Manhattan 8 Upper East Side 200851 204305 210880 217063 219920 108
38 Manhattan 9 Manhattanville, Hamilton Heights 113606 103038 106978 111724 110193 109
39 Manhattan 10 Central Harlem 159267 105641 99519 107109 115723 110
40 Manhattan 11 East Harlem 154662 114569 110508 117743 120511 111
41 Manhattan 12 Washington Heights, Inwood 180561 179941 198192 208414 190020 112
42 Queens 1 Astoria, Long Island City 185925 185198 188549 211220 191105 401
43 Queens 2 Sunnyside, Woodside 95073 88927 94845 109920 113200 402
44 Queens 3 Jackson Heights, North Corona 123635 122090 128924 169083 171576 403
45 Queens 4 Elmhurst, South Corona 108233 118430 137023 167005 172598 404
46 Queens 5 Ridgewood, Glendale, Maspeth 161022 150142 149126 165911 169190 405
47 Queens 6 Forest Hills, Rego Park 120429 112245 106996 115967 113257 406
48 Queens 7 Flushing, Bay Terrace 207589 204785 220508 242952 247354 407
49 Queens 8 Fresh Meadows, Briarwood 142468 125312 132101 146594 151107 408
50 Queens 9 Woodhaven, Richmond Hill 110367 109505 112151 141608 143317 409
51 Queens 10 Ozone Park, Howard Beach 113857 105651 107768 127274 122396 410
52 Queens 11 Bayside, Douglaston, Little Neck 127883 110963 108056 116404 116431 411
53 Queens 12 Jamaica, St. Albans, Hollis 206639 189383 201293 223602 225919 412
54 Queens 13 Queens Village, Rosedale 184647 173178 177535 196284 188593 413
55 Queens 14 The Rockaways, Broad Channel 98228 100592 100596 106686 114978 414
56 Staten Island 1 Stapleton, Port Richmond 135875 138489 137806 162609 175756 501
57 Staten Island 2 New Springville, South Beach 85985 105128 113944 127071 132003 502
58 Staten Island 3 Tottenville, Woodrow, Great Kills 72815 108249 126956 152908 160209 503

Join the population data onto the counts data after creating shared borocd unique ID#

To join dataframes together, we will use the pandas .merge() function.

merge diagram

merged_data = pd.merge(left=cb_counts, right=population, left_on="boro_cd", right_on="borocd")
merged_data
Community Board num_311_requests boro_cd Borough CD Number CD Name 1970 Population 1980 Population 1990 Population 2000 Population 2010 Population borocd
0 12 MANHATTAN 14110 112 Manhattan 12 Washington Heights, Inwood 180561 179941 198192 208414 190020 112
1 05 QUEENS 12487 405 Queens 5 Ridgewood, Glendale, Maspeth 161022 150142 149126 165911 169190 405
2 12 QUEENS 12228 412 Queens 12 Jamaica, St. Albans, Hollis 206639 189383 201293 223602 225919 412
3 01 BROOKLYN 11863 301 Brooklyn 1 Williamsburg, Greenpoint 179390 142942 155972 160338 173083 301
4 03 BROOKLYN 11615 303 Brooklyn 3 Bedford Stuyvesant 203380 133379 138696 143867 152985 303
5 01 STATEN ISLAND 11438 501 Staten Island 1 Stapleton, Port Richmond 135875 138489 137806 162609 175756 501
6 07 QUEENS 11210 407 Queens 7 Flushing, Bay Terrace 207589 204785 220508 242952 247354 407
7 05 BROOKLYN 10862 305 Brooklyn 5 East New York, Starrett City 170791 154931 161350 173198 182896 305
8 04 BRONX 10628 204 Bronx 4 Highbridge, Concourse Village 144207 114312 119962 139563 146441 204
9 01 QUEENS 10410 401 Queens 1 Astoria, Long Island City 185925 185198 188549 211220 191105 401
10 17 BROOKLYN 10208 317 Brooklyn 17 East Flatbush, Rugby, Farragut 149496 154596 161261 165753 155252 317
11 14 BROOKLYN 10179 314 Brooklyn 14 Flatbush, Midwood 137041 143859 159825 168806 160664 314
12 07 BRONX 9841 207 Bronx 7 Bedford Park, Norwood, Fordham 113764 116827 128588 141411 139286 207
13 18 BROOKLYN 9717 318 Brooklyn 18 Canarsie, Flatlands 188643 169092 162428 194653 193543 318
14 09 QUEENS 9599 409 Queens 9 Woodhaven, Richmond Hill 110367 109505 112151 141608 143317 409
15 13 QUEENS 9459 413 Queens 13 Queens Village, Rosedale 184647 173178 177535 196284 188593 413
16 12 BRONX 9412 212 Bronx 12 Wakefield, Williamsbridge 135010 128226 129620 149077 152344 212
17 11 BROOKLYN 9309 311 Brooklyn 11 Bensonhurst, Bath Beach 170119 155072 149994 172129 181981 311
18 05 BRONX 9094 205 Bronx 5 University Hts., Fordham, Mt. Hope 121807 107995 118435 128313 128200 205
19 03 MANHATTAN 8905 103 Manhattan 3 Lower East Side, Chinatown 181845 154848 161617 164407 163277 103
20 08 QUEENS 8661 408 Queens 8 Fresh Meadows, Briarwood 142468 125312 132101 146594 151107 408
21 04 BROOKLYN 8639 304 Brooklyn 4 Bushwick 137902 92497 102572 104358 112634 304
22 10 MANHATTAN 8592 110 Manhattan 10 Central Harlem 159267 105641 99519 107109 115723 110
23 03 STATEN ISLAND 8524 503 Staten Island 3 Tottenville, Woodrow, Great Kills 72815 108249 126956 152908 160209 503
24 15 BROOKLYN 8508 315 Brooklyn 15 Sheepshead Bay, Gerritsen Beach 164815 149572 143477 160319 159650 315
25 10 QUEENS 8333 410 Queens 10 Ozone Park, Howard Beach 113857 105651 107768 127274 122396 410
26 12 BROOKLYN 8214 312 Brooklyn 12 Borough Park, Ocean Parkway 166301 155899 160018 185046 191382 312
27 07 MANHATTAN 8141 107 Manhattan 7 West Side, Upper West Side 212422 206669 210993 207699 209084 107
28 09 BRONX 8092 209 Bronx 9 Soundview, Parkchester 166442 167627 155970 167859 172298 209
29 10 BROOKLYN 7808 310 Brooklyn 10 Bay Ridge, Dyker Heights 129822 118187 110612 122542 124491 310
30 08 BROOKLYN 7797 308 Brooklyn 8 Crown Heights North 121821 88796 96400 96076 96317 308
31 02 BROOKLYN 7747 302 Brooklyn 2 Brooklyn Heights, Fort Greene 110221 92732 94534 98620 99617 302
32 09 BROOKLYN 7571 309 Brooklyn 9 Crown Heights South, Wingate 101047 96669 110715 104014 98429 309
33 06 BROOKLYN 7373 306 Brooklyn 6 Park Slope, Carroll Gardens 138933 110228 102724 104054 104709 306
34 02 STATEN ISLAND 7059 502 Staten Island 2 New Springville, South Beach 85985 105128 113944 127071 132003 502
35 03 QUEENS 6799 403 Queens 3 Jackson Heights, North Corona 123635 122090 128924 169083 171576 403
36 09 MANHATTAN 6792 109 Manhattan 9 Manhattanville, Hamilton Heights 113606 103038 106978 111724 110193 109
37 04 MANHATTAN 6765 104 Manhattan 4 Chelsea, Clinton 83601 82164 84431 87479 103245 104
38 05 MANHATTAN 6599 105 Manhattan 5 Midtown Business District 31076 39544 43507 44028 51673 105
39 08 MANHATTAN 6579 108 Manhattan 8 Upper East Side 200851 204305 210880 217063 219920 108
40 02 MANHATTAN 6514 102 Manhattan 2 Greenwich Village, Soho 84337 87069 94105 93119 90016 102
41 11 BRONX 6448 211 Bronx 11 Pelham Pkwy, Morris Park, Laconia 105980 99080 97842 110706 113232 211
42 07 BROOKLYN 6364 307 Brooklyn 7 Sunset Park, Windsor Terrace 111607 98567 102553 120063 126230 307
43 02 QUEENS 6142 402 Queens 2 Sunnyside, Woodside 95073 88927 94845 109920 113200 402
44 04 QUEENS 5824 404 Queens 4 Elmhurst, South Corona 108233 118430 137023 167005 172598 404
45 08 BRONX 5753 208 Bronx 8 Riverdale, Kingsbridge, Marble Hill 103543 98275 97030 101332 101731 208
46 11 QUEENS 5748 411 Queens 11 Bayside, Douglaston, Little Neck 127883 110963 108056 116404 116431 411
47 06 BRONX 5746 206 Bronx 6 East Tremont, Belmont 114137 65016 68061 75688 83268 206
48 10 BRONX 5719 210 Bronx 10 Throgs Nk., Co-op City, Pelham Bay 84948 106516 108093 115948 120392 210
49 11 MANHATTAN 5714 111 Manhattan 11 East Harlem 154662 114569 110508 117743 120511 111
50 14 QUEENS 5284 414 Queens 14 The Rockaways, Broad Channel 98228 100592 100596 106686 114978 414
51 03 BRONX 5262 203 Bronx 3 Morrisania, Crotona Park East 150636 53635 57162 68574 79762 203
52 06 MANHATTAN 5158 106 Manhattan 6 Stuyvesant Town, Turtle Bay 122465 127554 133748 136152 142745 106
53 16 BROOKLYN 5100 316 Brooklyn 16 Brownsville, Ocean Hill 122589 73801 84923 85343 86468 316
54 01 BRONX 4915 201 Bronx 1 Melrose, Mott Haven, Port Morris 138557 78441 77214 82159 91497 201
55 06 QUEENS 4607 406 Queens 6 Forest Hills, Rego Park 120429 112245 106996 115967 113257 406
56 13 BROOKLYN 3840 313 Brooklyn 13 Coney Island, Brighton Beach 97750 100030 102596 106120 104278 313
57 01 MANHATTAN 3623 101 Manhattan 1 Battery Park City, Tribeca 7706 15918 25366 34420 60978 101
58 02 BRONX 3470 202 Bronx 2 Hunts Point, Longwood 99493 34399 39443 46824 52246 202

Different types of merges

# remove the redundant column
merged_data = merged_data.drop("borocd", axis="columns")

# save the data to a file
# merged_data.to_csv("data/community_district_311.csv", index=False)

Homework 2#