Lecture 2: Manipulating and combining data#

Please sign attendance sheet

Feeling overwhelmed?#

Reminder that learning to code is like learning a spoken language. It’s not obvious, and people pick it up at different speeds and get stuck in different spots. Try:

Small example of comment-driven development:

# find valid ZIP codes
# filter the DataFrame to only invalid ZIP codes
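The idea is to write the comments first, then fill in the code underneath each one. A minimal sketch of what that might look like, using a small hypothetical DataFrame (not the lecture’s dataset) and a simple all-digits check for validity:

```python
import pandas as pd

# hypothetical practice data, not the lecture's dataset
df = pd.DataFrame({"zip": ["10001", "ABCDE", "11215"]})

# find valid ZIP codes: exactly five characters, all digits
valid = df["zip"].str.isdigit() & (df["zip"].str.len() == 5)

# filter the DataFrame to only invalid ZIP codes
invalid_rows = df[~valid]
print(invalid_rows)
```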

Boolean indexing#

import pandas as pd

people = pd.DataFrame(
    {
        "names": ["Fred", "Krishna", "Saron", "Sally"],
        "age": [50, 13, 45, 22],
    }
)
people
names age
0 Fred 50
1 Krishna 13
2 Saron 45
3 Sally 22
people["age"]
0    50
1    13
2    45
3    22
Name: age, dtype: int64
type(people["age"])
pandas.core.series.Series

When we compare single values (like x > 6), we get a single boolean back. Here, we are checking a bunch of values, so we’re going to get multiple booleans, returned as a Series.

over_40 = people["age"] > 40
over_40
0     True
1    False
2     True
3    False
Name: age, dtype: bool
type(over_40)
pandas.core.series.Series
people[over_40]
names age
0 Fred 50
2 Saron 45

We can also combine these steps into one line:

people[people["age"] > 40]
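Boolean indexing also supports multiple conditions. A sketch, reusing the `people` DataFrame from above: each comparison goes in its own parentheses, joined with `&` ("and") or `|` ("or") rather than Python’s `and`/`or` keywords.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "names": ["Fred", "Krishna", "Saron", "Sally"],
        "age": [50, 13, 45, 22],
    }
)

# each comparison needs its own parentheses; & means "and", | means "or"
over_40_or_minor = people[(people["age"] > 40) | (people["age"] < 18)]
print(over_40_or_minor)
```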

Data cleaning#

Data cleansing is a process of removing or fixing incorrect, malformed, incomplete, duplicate, or corrupted data.

https://hevodata.com/learn/data-cleansing-a-simplified-guide/

When have you needed to clean data?

What are continuous values?

What are categorical values?

Things to check for#

From my workshop on data cleaning:

  • Missing data

    • Empty values

  • Bad (junk) values

    • Duplicates

    • Mismatched types/formatting

  • Categorical values

    • Uniqueness (cardinality)

    • Value counts

  • Continuous values

    • Ranges

    • Spread (distribution)

Notes:

  • “Values” in this case can be a single cell (in the spreadsheet sense) or a whole row

  • “Missing” or “duplicates” can be columns (Series), tables (DataFrames), rows, or cells

Data cleaning mnemonic#

  • Empty

  • Bad

  • Unique

  • Spread
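Each letter of the mnemonic maps to common pandas checks. A sketch using small hypothetical data (the column names and values here are made up for illustration):

```python
import pandas as pd

# hypothetical data containing each kind of problem
df = pd.DataFrame(
    {
        "zip": ["10001", None, "10001", "99999"],
        "age": [30, 45, 30, 200],
    }
)

# Empty: count missing values per column
print(df.isna().sum())

# Bad: count exact-duplicate rows
print(df.duplicated().sum())

# Unique: cardinality and value counts of a categorical column
print(df["zip"].nunique())
print(df["zip"].value_counts())

# Spread: range and distribution of a continuous column
print(df["age"].describe())
```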

Setup#

# Display more rows and columns in the DataFrames
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

Read our cleaned 311 Service Requests dataset#

url = "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip"
requests = pd.read_csv(url)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_8064/610958175.py:2: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(url)

Dealing with dtypes#

More data cleaning!

Minion character vacuuming

DtypeWarning: Columns (8,20,31,34) have mixed types.
requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499958 entries, 0 to 499957
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      499958 non-null  int64  
 1   Created Date                    499958 non-null  object 
 2   Closed Date                     476140 non-null  object 
 3   Agency                          499958 non-null  object 
 4   Agency Name                     499958 non-null  object 
 5   Complaint Type                  499958 non-null  object 
 6   Descriptor                      492496 non-null  object 
 7   Location Type                   392573 non-null  object 
 8   Incident Zip                    480394 non-null  object 
 9   Incident Address                434529 non-null  object 
 10  Street Name                     434504 non-null  object 
 11  Cross Street 1                  300825 non-null  object 
 12  Cross Street 2                  299624 non-null  object 
 13  Intersection Street 1           107377 non-null  object 
 14  Intersection Street 2           107042 non-null  object 
 15  Address Type                    451006 non-null  object 
 16  City                            476632 non-null  object 
 17  Landmark                        32516 non-null   object 
 18  Facility Type                   134918 non-null  object 
 19  Status                          499958 non-null  object 
 20  Due Date                        171534 non-null  object 
 21  Resolution Description          457354 non-null  object 
 22  Resolution Action Updated Date  488788 non-null  object 
 23  Community Board                 499958 non-null  object 
 24  BBL                             407338 non-null  float64
 25  Borough                         499958 non-null  object 
 26  X Coordinate (State Plane)      470815 non-null  float64
 27  Y Coordinate (State Plane)      470815 non-null  float64
 28  Open Data Channel Type          499958 non-null  object 
 29  Park Facility Name              499931 non-null  object 
 30  Park Borough                    499958 non-null  object 
 31  Vehicle Type                    37 non-null      object 
 32  Taxi Company Borough            403 non-null     object 
 33  Taxi Pick Up Location           4474 non-null    object 
 34  Bridge Highway Name             696 non-null     object 
 35  Bridge Highway Direction        765 non-null     object 
 36  Road Ramp                       759 non-null     object 
 37  Bridge Highway Segment          1007 non-null    object 
 38  Latitude                        470815 non-null  float64
 39  Longitude                       470815 non-null  float64
 40  Location                        470815 non-null  object 
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB
list(requests["Incident Zip"].unique())
['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 nan,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '10452',
 '11377',
 '10029',
 '10003',
 '10075',
 '11222',
 '10128',
 '11415',
 '11204',
 '10030',
 '11432',
 '10308',
 '11102',
 '10016',
 '10463',
 '11412',
 '10011',
 '11106',
 '10457',
 '11375',
 '11356',
 '11228',
 '11369',
 '10458',
 '11411',
 '11215',
 '10309',
 '11358',
 '11355',
 '11219',
 '10031',
 '10303',
 '11232',
 '10302',
 '11238',
 '10005',
 '11429',
 '11205',
 '10023',
 '11373',
 '10033',
 '10039',
 '10028',
 '11363',
 '11354',
 '11692',
 '10455',
 '11370',
 '10035',
 '10012',
 '10024',
 '10009',
 '10034',
 '11001',
 '10466',
 '11427',
 '11004',
 '10454',
 '11414',
 '11360',
 '10461',
 '10018',
 '10006',
 '11236',
 '11428',
 '10474',
 '10471',
 '10037',
 '10475',
 '11430',
 '10119',
 '10038',
 '11426',
 '11239',
 '10151',
 '10120',
 '10112',
 '10168',
 '10004',
 '10464',
 '18773-9640',
 '10282',
 '11109',
 '10280',
 '07114',
 '11040',
 '10000',
 '07090',
 '10020',
 '00083',
 '10044',
 '10069',
 '10118',
 '07102',
 '11359',
 '11005',
 '10169',
 '11697',
 '10115',
 11226.0,
 11236.0,
 10466.0,
 11210.0,
 10452.0,
 10457.0,
 10304.0,
 11416.0,
 11205.0,
 11214.0,
 11692.0,
 11101.0,
 11201.0,
 10470.0,
 11375.0,
 10305.0,
 11420.0,
 11421.0,
 11419.0,
 11412.0,
 10309.0,
 11385.0,
 11379.0,
 11356.0,
 10021.0,
 11417.0,
 11377.0,
 11423.0,
 10001.0,
 10469.0,
 10463.0,
 11372.0,
 11422.0,
 10017.0,
 11368.0,
 11213.0,
 11220.0,
 10312.0,
 10025.0,
 11357.0,
 11230.0,
 10456.0,
 11104.0,
 11232.0,
 11208.0,
 11206.0,
 11229.0,
 10465.0,
 10472.0,
 10039.0,
 10016.0,
 10009.0,
 11355.0,
 11211.0,
 10040.0,
 11366.0,
 11234.0,
 10013.0,
 11207.0,
 10032.0,
 10036.0,
 10029.0,
 11204.0,
 11435.0,
 11373.0,
 10007.0,
 11216.0,
 10306.0,
 11378.0,
 11414.0,
 10314.0,
 10128.0,
 10027.0,
 11221.0,
 10024.0,
 11364.0,
 10468.0,
 11233.0,
 10014.0,
 11238.0,
 11374.0,
 11105.0,
 11219.0,
 11106.0,
 10003.0,
 10026.0,
 11411.0,
 11223.0,
 11693.0,
 11365.0,
 11209.0,
 11103.0,
 11361.0,
 11427.0,
 10467.0,
 11203.0,
 11102.0,
 10455.0,
 10454.0,
 11225.0,
 11212.0,
 11434.0,
 10461.0,
 11369.0,
 11432.0,
 10031.0,
 11217.0,
 11235.0,
 10458.0,
 10453.0,
 11215.0,
 11367.0,
 11231.0,
 11413.0,
 11358.0,
 11694.0,
 10462.0,
 11218.0,
 11430.0,
 10473.0,
 10011.0,
 10460.0,
 10019.0,
 10310.0,
 10037.0,
 10451.0,
 10301.0,
 10010.0,
 11354.0,
 11222.0,
 11691.0,
 10034.0,
 10308.0,
 10012.0,
 11237.0,
 10801.0,
 10030.0,
 11428.0,
 11418.0,
 10033.0,
 10459.0,
 11429.0,
 10065.0,
 10022.0,
 10005.0,
 10002.0,
 10035.0,
 11370.0,
 10004.0,
 11415.0,
 11426.0,
 11362.0,
 10006.0,
 11360.0,
 10038.0,
 11249.0,
 11001.0,
 11224.0,
 10075.0,
 10018.0,
 10302.0,
 11436.0,
 11433.0,
 10474.0,
 10023.0,
 11004.0,
 10028.0,
 11228.0,
 10475.0,
 10069.0,
 11363.0,
 10303.0,
 10282.0,
 10307.0,
 10103.0,
 10280.0,
 10471.0,
 10271.0,
 10000.0,
 10464.0,
 10107.0,
 7078.0,
 10044.0,
 10020.0,
 56303.0,
 10119.0,
 11040.0,
 11386.0,
 10952.0,
 6811.0,
 11239.0,
 10121.0,
 37214.0,
 10538.0,
 10112.0,
 10279.0,
 11109.0,
 11371.0,
 18017.0,
 7115.0,
 77036.0,
 7114.0,
 '10123',
 '11030',
 '10801',
 '92626-1902',
 '07302',
 '18773',
 '12222',
 '10710',
 '10103',
 '07057',
 '10162',
 '11582',
 '10281',
 '10271',
 '10107',
 'HARRISBURG',
 '11735',
 '07305',
 'N5X3A6',
 '11746',
 '11371',
 '23450',
 11580.0,
 10591.0,
 11005.0,
 19034.0,
 11596.0,
 11779.0,
 7621.0,
 11021.0,
 11241.0,
 10169.0,
 100000.0,
 11242.0,
 10151.0,
 11697.0,
 7086.0,
 10177.0,
 10118.0,
 10105.0,
 10152.0,
 10168.0,
 7093.0,
 10917.0,
 10110.0,
 10153.0,
 10178.0,
 11570.0,
 10601.0,
 10704.0,
 7424.0,
 10281.0,
 10158.0,
 '10158',
 '10172',
 '10179',
 'IDK',
 '11801',
 '10601',
 '11590',
 '10155',
 '11202',
 '1801',
 '11581',
 11359.0,
 11758.0,
 10278.0,
 43017.0,
 10154.0,
 11553.0,
 10162.0,
 11695.0,
 10041.0,
 11741.0,
 98335.0,
 14814.0,
 10111.0,
 '12345',
 '11572',
 '11520',
 '14614-195',
 '10121',
 '10105',
 '10701',
 0.0,
 17106.0,
 979113.0,
 10120.0,
 12345.0,
 11030.0,
 11797.0,
 100.0,
 11710.0,
 33624.0,
 8682.0,
 11747.0,
 1757.0,
 11561.0,
 7304.0,
 6851.0,
 11590.0,
 94267.0,
 10167.0,
 11749.0,
 11756.0,
 10174.0,
 10550.0,
 89119.0,
 14068.0,
 11722.0,
 11520.0,
 6460.0,
 32255.0,
 10173.0,
 10165.0,
 11946.0,
 '29616-0759',
 '07032',
 '10278',
 '11575',
 11963.0,
 10106.0,
 83.0,
 11566.0,
 6870.0,
 7001.0,
 10710.0,
 11735.0,
 11572.0,
 '10165',
 '10279',
 '11251',
 'NJ 07114',
 '10106',
 '07666',
 '11516',
 '10177',
 '10170',
 '43215-1441',
 '00000',
 '08081',
 '10803',
 11507.0,
 11701.0,
 11563.0,
 7047.0,
 3108.0,
 '11021',
 '07003',
 '10152',
 '07029',
 '10041',
 '31093',
 '11735-0230',
 89118.0,
 11803.0,
 11559.0,
 11565.0,
 7080.0,
 12601.0,
 10155.0,
 10171.0,
 7208.0,
 11757.0,
 11042.0,
 '1101',
 '10111',
 '10173',
 '10096',
 '07087',
 'DID N',
 '10956',
 10048.0,
 10123.0,
 10122.0,
 11251.0]

ZIP codes look numeric, but aren’t really.

Read the ZIP codes in as strings by passing the dtype parameter:

requests2 = pd.read_csv(url, dtype={"Incident Zip": "string"})
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_8064/1354368569.py:1: DtypeWarning: Columns (20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests2 = pd.read_csv(url, dtype={"Incident Zip": "string"})

We fixed the dtype warning for column 8 (Incident Zip).

list(requests2["Incident Zip"].unique())
['11235',
 '11221',
 '11693',
 '11216',
 '10465',
 '11367',
 '10459',
 '11101',
 '11362',
 '10014',
 '11234',
 '11436',
 '10305',
 '10467',
 '11208',
 '10451',
 '11419',
 '11237',
 '11220',
 '10469',
 '11385',
 '10470',
 '11694',
 '10036',
 <NA>,
 '10473',
 '11435',
 '10040',
 '10472',
 '11225',
 '10019',
 '11434',
 '11226',
 '10010',
 '11211',
 '11421',
 '10026',
 '10013',
 '11423',
 '10002',
 '10453',
 '11213',
 '11104',
 '11249',
 '11361',
 '11233',
 '11224',
 '11374',
 '10025',
 '10022',
 '11214',
 '11209',
 '11366',
 '10304',
 '10027',
 '11378',
 '11206',
 '10021',
 '11364',
 '10065',
 '10456',
 '10314',
 '10312',
 '11212',
 '11379',
 '10462',
 '11231',
 '10460',
 '11416',
 '10001',
 '11357',
 '11413',
 '11210',
 '11217',
 '11223',
 '11417',
 '11418',
 '11218',
 '11230',
 '11207',
 '11691',
 '10468',
 '10007',
 '10310',
 '10306',
 '11103',
 '11105',
 '11433',
 '11203',
 '10307',
 '11229',
 '11372',
 '10032',
 '11420',
 '10017',
 '10301',
 '11368',
 '11201',
 '11365',
 '11422',
 '10452',
 '11377',
 '10029',
 '10003',
 '10075',
 '11222',
 '10128',
 '11415',
 '11204',
 '10030',
 '11432',
 '10308',
 '11102',
 '10016',
 '10463',
 '11412',
 '10011',
 '11106',
 '10457',
 '11375',
 '11356',
 '11228',
 '11369',
 '10458',
 '11411',
 '11215',
 '10309',
 '11358',
 '11355',
 '11219',
 '10031',
 '10303',
 '11232',
 '10302',
 '11238',
 '10005',
 '11429',
 '11205',
 '10023',
 '11373',
 '10033',
 '10039',
 '10028',
 '11363',
 '11354',
 '11692',
 '10455',
 '11370',
 '10035',
 '10012',
 '10024',
 '10009',
 '10034',
 '11001',
 '10466',
 '11427',
 '11004',
 '10454',
 '11414',
 '11360',
 '10461',
 '10018',
 '10006',
 '11236',
 '11428',
 '10474',
 '10471',
 '10037',
 '10475',
 '11430',
 '10119',
 '10038',
 '11426',
 '11239',
 '10151',
 '10120',
 '10112',
 '10168',
 '10004',
 '10464',
 '18773-9640',
 '10282',
 '11109',
 '10280',
 '07114',
 '11040',
 '10000',
 '07090',
 '10020',
 '00083',
 '10044',
 '10069',
 '10118',
 '07102',
 '11359',
 '11005',
 '10169',
 '11697',
 '10115',
 '10801',
 '10103',
 '10271',
 '10107',
 '07078',
 '56303',
 '11386',
 '10952',
 '06811',
 '10121',
 '37214',
 '10538',
 '10279',
 '11371',
 '18017',
 '07115',
 '77036',
 '10123',
 '11030',
 '92626-1902',
 '07302',
 '18773',
 '12222',
 '10710',
 '07057',
 '10162',
 '11582',
 '10281',
 'HARRISBURG',
 '11735',
 '07305',
 'N5X3A6',
 '11746',
 '23450',
 '11580',
 '10591',
 '19034',
 '11596',
 '11779',
 '07621',
 '11021',
 '11241',
 '100000',
 '11242',
 '07086',
 '10177',
 '10105',
 '10152',
 '07093',
 '10917',
 '10110',
 '10153',
 '10178',
 '11570',
 '10601',
 '10704',
 '07424',
 '10158',
 '10172',
 '10179',
 'IDK',
 '11801',
 '11590',
 '10155',
 '11202',
 '1801',
 '11581',
 '11758',
 '10278',
 '43017',
 '10154',
 '11553',
 '11695',
 '10041',
 '11741',
 '98335',
 '14814',
 '10111',
 '12345',
 '11572',
 '11520',
 '14614-195',
 '10701',
 '00000',
 '17106',
 '979113',
 '11797',
 '100',
 '11710',
 '33624',
 '8682',
 '11747',
 '01757',
 '11561',
 '07304',
 '000000',
 '06851',
 '94267',
 '10167',
 '11749',
 '11756',
 '10174',
 '10550',
 '89119',
 '14068',
 '11722',
 '06460',
 '32255',
 '10173',
 '10165',
 '11946',
 '29616-0759',
 '07032',
 '11575',
 '11963',
 '10106',
 '11566',
 '06870',
 '07001',
 '11251',
 'NJ 07114',
 '07666',
 '11516',
 '10170',
 '43215-1441',
 '08081',
 '10803',
 '11507',
 '11701',
 '11563',
 '07047',
 '03108',
 '07003',
 '07029',
 '31093',
 '11735-0230',
 '89118',
 '11803',
 '11559',
 '11565',
 '07080',
 '12601',
 '10171',
 '07208',
 '11757',
 '11042',
 '1101',
 '10096',
 '07087',
 'DID N',
 '10956',
 '10048',
 '10122']

Find invalid ZIP codes#

Use a regular expression (regex) to find strings that match a pattern:

^\d{5}(?:-\d{4})?$
│ │ │  │        │└─ end of string
│ │ │  │        └─ optional
│ │ │  └─ non-capturing group
│ │ └─ count (exactly 5)
│ └─ numeric/digit character
└─ start of string

regex101 is useful for testing them.
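Before applying the pattern to the whole column, we can try it on a few values with Python’s built-in re module (the sample values below are hypothetical):

```python
import re

# the same pattern, tried on a few sample values
zip_pattern = re.compile(r"^\d{5}(?:-\d{4})?$")

print(bool(zip_pattern.match("11235")))       # five digits: valid
print(bool(zip_pattern.match("18773-9640")))  # ZIP+4 format: valid
print(bool(zip_pattern.match("N5X3A6")))      # a postal code, but not a ZIP
```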

# find valid ZIP codes
valid_zips = requests2["Incident Zip"].str.contains(r"^\d{5}(?:-\d{4})?$")
valid_zips
0         True
1         True
2         True
3         True
4         True
          ... 
499953    True
499954    True
499955    True
499956    True
499957    True
Name: Incident Zip, Length: 499958, dtype: boolean
requests2.shape
(499958, 41)
requests2[valid_zips].shape
(480381, 41)
# filter the DataFrame to only invalid ZIP codes
invalid_zips = valid_zips == False
invalid_zips
0         False
1         False
2         False
3         False
4         False
          ...  
499953    False
499954    False
499955    False
499956    False
499957    False
Name: Incident Zip, Length: 499958, dtype: boolean
requests_with_invalid_zips = requests2[invalid_zips]
requests_with_invalid_zips["Incident Zip"]
55017     HARRISBURG
58100         N5X3A6
80798         100000
120304           IDK
123304          1801
173518     14614-195
192034        979113
201463           100
207158          8682
216745        000000
325071      NJ 07114
425985          1101
441166         DID N
Name: Incident Zip, dtype: string

Clear any invalid ZIP codes:

requests2.loc[invalid_zips, "Incident Zip"] = None

.loc[] is used for overwriting a subset of values.
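A miniature sketch of the same pattern on hypothetical data, to show the two pieces `.loc[]` takes (a row selector and a column name):

```python
import pandas as pd

# hypothetical miniature version of the same fix
df = pd.DataFrame({"Incident Zip": ["10001", "IDK", "11215"]})

# ~ negates each boolean, flipping valid to invalid
bad = ~df["Incident Zip"].str.contains(r"^\d{5}$")

# .loc[row_selector, column] both selects and overwrites
df.loc[bad, "Incident Zip"] = None
print(df)
```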

Additional data cleaning tips:

  • The hard part is figuring out what needs to be done

  • The fixes will be specific to your use case

  • Document what you did, since it will affect your results

In-class exercise#

Simple merge#

I had Copilot generate the DataFrames, so the numbers may not be accurate.

populations = pd.DataFrame(
    {
        "Country": ["China", "India", "Pakistan"],
        "Population": [1444216107, 1393409038, 220892331],
    }
)

populations
Country Population
0 China 1444216107
1 India 1393409038
2 Pakistan 220892331
gdps = pd.DataFrame(
    {
        "Country": ["China", "India", "United States", "Indonesia", "Pakistan"],
        "GDP": [14342903, 2875142, 21433226, 1058393, 263687],
    }
)
gdps
Country GDP
0 China 14342903
1 India 2875142
2 United States 21433226
3 Indonesia 1058393
4 Pakistan 263687

How should we combine them?

populations
Country Population
0 China 1444216107
1 India 1393409038
2 Pakistan 220892331
gdps
Country GDP
0 China 14342903
1 India 2875142
2 United States 21433226
3 Indonesia 1058393
4 Pakistan 263687

To join dataframes together, we will use the pandas .merge() function.

merge diagram

.merge() is comparable to a JOIN in SQL or a VLOOKUP in a spreadsheet.

Matching records across datasets like this is known more generally as “record linkage” or “entity resolution”.

pd.merge(populations, gdps, on="Country")
Country Population GDP
0 China 1444216107 14342903
1 India 1393409038 2875142
2 Pakistan 220892331 263687
pd.merge(populations, gdps, on="Country", how="outer")
Country Population GDP
0 China 1.444216e+09 14342903
1 India 1.393409e+09 2875142
2 Indonesia NaN 1058393
3 Pakistan 2.208923e+08 263687
4 United States NaN 21433226

Different types of merges
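The how parameter controls which rows survive the merge. A sketch comparing the default inner merge against how="right", using the same two DataFrames:

```python
import pandas as pd

populations = pd.DataFrame(
    {
        "Country": ["China", "India", "Pakistan"],
        "Population": [1444216107, 1393409038, 220892331],
    }
)
gdps = pd.DataFrame(
    {
        "Country": ["China", "India", "United States", "Indonesia", "Pakistan"],
        "GDP": [14342903, 2875142, 21433226, 1058393, 263687],
    }
)

# inner (the default) keeps only countries present in both tables
inner = pd.merge(populations, gdps, on="Country")

# right keeps every row of gdps, filling missing Population values with NaN
right = pd.merge(populations, gdps, on="Country", how="right")
print(len(inner), len(right))
```

As with the outer merge above, rows that exist on only one side get NaN for the other side’s columns.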

In-class exercise 2#

Let’s compute the migrant population as a percent of total by country using UN data.

Homework 2#