Lecture 2: Manipulating and combining data#
Please sign attendance sheet
Feeling overwhlemed?#
Reminder that learning to code is like learning a spoken language. It’s not obvious, and people will pick it up at different speeds at different spots. Try:
Taking notes in the lecture notebooks
Using another Python/pandas learning resource
Hear things explained another way
Ask in Ed Discussions if others have recommendations
-
Otherwise, trying to do two steps in your head:
Figuring out the logic
Figuring out the syntax
Small example of comment-driven development:
# find valid ZIP codes
# filter the DataFrame to only invalid ZIP codes
Boolean indexing#
import pandas as pd
people = pd.DataFrame(
{
"names": ["Fred", "Krishna", "Saron", "Sally"],
"age": [50, 13, 45, 22],
}
)
people
names | age | |
---|---|---|
0 | Fred | 50 |
1 | Krishna | 13 |
2 | Saron | 45 |
3 | Sally | 22 |
people["age"]
0 50
1 13
2 45
3 22
Name: age, dtype: int64
type(people["age"])
pandas.core.series.Series
When we compare single values (like x > 6
), we get a single boolean back. Here, we are checking a bunch of values, so we’re going to get multiple booleans, returned as a Series.
over_40 = people["age"] > 40
over_40
0 True
1 False
2 True
3 False
Name: age, dtype: bool
type(over_40)
pandas.core.series.Series
people[over_40]
names | age | |
---|---|---|
0 | Fred | 50 |
2 | Saron | 45 |
Can also stuff into one line:
people[people["age"] > 40]
Data cleaning#
Data Cleansing is a process of removing or fixing incorrect, malformed, incomplete, duplicate, or corrupted data
https://hevodata.com/learn/data-cleansing-a-simplified-guide/
When have you needed to clean data?
What are continuous values?
What are categorical values?
Things to check for#
From my workshop on data cleaning:
Missing data
Empty values
Bad (junk) values
Duplicates
Mismatched types/formatting
Categorical values
Uniqueness (cardinality)
Value counts
Continuous values
Ranges
Spread (distribution)
Notes:
“Values” in this case can be a single cell (in the spreadsheet sense) or a whole row
“Missing” or “duplicates” can be columns (Series), tables (DataFrames), rows, or cells
Data cleaning mnemonic#
Empty
Bad
Unique
Spread
Setup#
# Display more rows and columns in the DataFrames
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
Read our cleaned 311 Service Requests dataset#
url = "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip"
requests = pd.read_csv(url)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_8064/610958175.py:2: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
requests = pd.read_csv(url)
Dealing with dtypes#
More data cleaning!
DtypeWarning: Columns (8,20,31,34) have mixed types.
requests.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499958 entries, 0 to 499957
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unique Key 499958 non-null int64
1 Created Date 499958 non-null object
2 Closed Date 476140 non-null object
3 Agency 499958 non-null object
4 Agency Name 499958 non-null object
5 Complaint Type 499958 non-null object
6 Descriptor 492496 non-null object
7 Location Type 392573 non-null object
8 Incident Zip 480394 non-null object
9 Incident Address 434529 non-null object
10 Street Name 434504 non-null object
11 Cross Street 1 300825 non-null object
12 Cross Street 2 299624 non-null object
13 Intersection Street 1 107377 non-null object
14 Intersection Street 2 107042 non-null object
15 Address Type 451006 non-null object
16 City 476632 non-null object
17 Landmark 32516 non-null object
18 Facility Type 134918 non-null object
19 Status 499958 non-null object
20 Due Date 171534 non-null object
21 Resolution Description 457354 non-null object
22 Resolution Action Updated Date 488788 non-null object
23 Community Board 499958 non-null object
24 BBL 407338 non-null float64
25 Borough 499958 non-null object
26 X Coordinate (State Plane) 470815 non-null float64
27 Y Coordinate (State Plane) 470815 non-null float64
28 Open Data Channel Type 499958 non-null object
29 Park Facility Name 499931 non-null object
30 Park Borough 499958 non-null object
31 Vehicle Type 37 non-null object
32 Taxi Company Borough 403 non-null object
33 Taxi Pick Up Location 4474 non-null object
34 Bridge Highway Name 696 non-null object
35 Bridge Highway Direction 765 non-null object
36 Road Ramp 759 non-null object
37 Bridge Highway Segment 1007 non-null object
38 Latitude 470815 non-null float64
39 Longitude 470815 non-null float64
40 Location 470815 non-null object
dtypes: float64(5), int64(1), object(35)
memory usage: 156.4+ MB
list(requests["Incident Zip"].unique())
['11235',
'11221',
'11693',
'11216',
'10465',
'11367',
'10459',
'11101',
'11362',
'10014',
'11234',
'11436',
'10305',
'10467',
'11208',
'10451',
'11419',
'11237',
'11220',
'10469',
'11385',
'10470',
'11694',
'10036',
nan,
'10473',
'11435',
'10040',
'10472',
'11225',
'10019',
'11434',
'11226',
'10010',
'11211',
'11421',
'10026',
'10013',
'11423',
'10002',
'10453',
'11213',
'11104',
'11249',
'11361',
'11233',
'11224',
'11374',
'10025',
'10022',
'11214',
'11209',
'11366',
'10304',
'10027',
'11378',
'11206',
'10021',
'11364',
'10065',
'10456',
'10314',
'10312',
'11212',
'11379',
'10462',
'11231',
'10460',
'11416',
'10001',
'11357',
'11413',
'11210',
'11217',
'11223',
'11417',
'11418',
'11218',
'11230',
'11207',
'11691',
'10468',
'10007',
'10310',
'10306',
'11103',
'11105',
'11433',
'11203',
'10307',
'11229',
'11372',
'10032',
'11420',
'10017',
'10301',
'11368',
'11201',
'11365',
'11422',
'10452',
'11377',
'10029',
'10003',
'10075',
'11222',
'10128',
'11415',
'11204',
'10030',
'11432',
'10308',
'11102',
'10016',
'10463',
'11412',
'10011',
'11106',
'10457',
'11375',
'11356',
'11228',
'11369',
'10458',
'11411',
'11215',
'10309',
'11358',
'11355',
'11219',
'10031',
'10303',
'11232',
'10302',
'11238',
'10005',
'11429',
'11205',
'10023',
'11373',
'10033',
'10039',
'10028',
'11363',
'11354',
'11692',
'10455',
'11370',
'10035',
'10012',
'10024',
'10009',
'10034',
'11001',
'10466',
'11427',
'11004',
'10454',
'11414',
'11360',
'10461',
'10018',
'10006',
'11236',
'11428',
'10474',
'10471',
'10037',
'10475',
'11430',
'10119',
'10038',
'11426',
'11239',
'10151',
'10120',
'10112',
'10168',
'10004',
'10464',
'18773-9640',
'10282',
'11109',
'10280',
'07114',
'11040',
'10000',
'07090',
'10020',
'00083',
'10044',
'10069',
'10118',
'07102',
'11359',
'11005',
'10169',
'11697',
'10115',
11226.0,
11236.0,
10466.0,
11210.0,
10452.0,
10457.0,
10304.0,
11416.0,
11205.0,
11214.0,
11692.0,
11101.0,
11201.0,
10470.0,
11375.0,
10305.0,
11420.0,
11421.0,
11419.0,
11412.0,
10309.0,
11385.0,
11379.0,
11356.0,
10021.0,
11417.0,
11377.0,
11423.0,
10001.0,
10469.0,
10463.0,
11372.0,
11422.0,
10017.0,
11368.0,
11213.0,
11220.0,
10312.0,
10025.0,
11357.0,
11230.0,
10456.0,
11104.0,
11232.0,
11208.0,
11206.0,
11229.0,
10465.0,
10472.0,
10039.0,
10016.0,
10009.0,
11355.0,
11211.0,
10040.0,
11366.0,
11234.0,
10013.0,
11207.0,
10032.0,
10036.0,
10029.0,
11204.0,
11435.0,
11373.0,
10007.0,
11216.0,
10306.0,
11378.0,
11414.0,
10314.0,
10128.0,
10027.0,
11221.0,
10024.0,
11364.0,
10468.0,
11233.0,
10014.0,
11238.0,
11374.0,
11105.0,
11219.0,
11106.0,
10003.0,
10026.0,
11411.0,
11223.0,
11693.0,
11365.0,
11209.0,
11103.0,
11361.0,
11427.0,
10467.0,
11203.0,
11102.0,
10455.0,
10454.0,
11225.0,
11212.0,
11434.0,
10461.0,
11369.0,
11432.0,
10031.0,
11217.0,
11235.0,
10458.0,
10453.0,
11215.0,
11367.0,
11231.0,
11413.0,
11358.0,
11694.0,
10462.0,
11218.0,
11430.0,
10473.0,
10011.0,
10460.0,
10019.0,
10310.0,
10037.0,
10451.0,
10301.0,
10010.0,
11354.0,
11222.0,
11691.0,
10034.0,
10308.0,
10012.0,
11237.0,
10801.0,
10030.0,
11428.0,
11418.0,
10033.0,
10459.0,
11429.0,
10065.0,
10022.0,
10005.0,
10002.0,
10035.0,
11370.0,
10004.0,
11415.0,
11426.0,
11362.0,
10006.0,
11360.0,
10038.0,
11249.0,
11001.0,
11224.0,
10075.0,
10018.0,
10302.0,
11436.0,
11433.0,
10474.0,
10023.0,
11004.0,
10028.0,
11228.0,
10475.0,
10069.0,
11363.0,
10303.0,
10282.0,
10307.0,
10103.0,
10280.0,
10471.0,
10271.0,
10000.0,
10464.0,
10107.0,
7078.0,
10044.0,
10020.0,
56303.0,
10119.0,
11040.0,
11386.0,
10952.0,
6811.0,
11239.0,
10121.0,
37214.0,
10538.0,
10112.0,
10279.0,
11109.0,
11371.0,
18017.0,
7115.0,
77036.0,
7114.0,
'10123',
'11030',
'10801',
'92626-1902',
'07302',
'18773',
'12222',
'10710',
'10103',
'07057',
'10162',
'11582',
'10281',
'10271',
'10107',
'HARRISBURG',
'11735',
'07305',
'N5X3A6',
'11746',
'11371',
'23450',
11580.0,
10591.0,
11005.0,
19034.0,
11596.0,
11779.0,
7621.0,
11021.0,
11241.0,
10169.0,
100000.0,
11242.0,
10151.0,
11697.0,
7086.0,
10177.0,
10118.0,
10105.0,
10152.0,
10168.0,
7093.0,
10917.0,
10110.0,
10153.0,
10178.0,
11570.0,
10601.0,
10704.0,
7424.0,
10281.0,
10158.0,
'10158',
'10172',
'10179',
'IDK',
'11801',
'10601',
'11590',
'10155',
'11202',
'1801',
'11581',
11359.0,
11758.0,
10278.0,
43017.0,
10154.0,
11553.0,
10162.0,
11695.0,
10041.0,
11741.0,
98335.0,
14814.0,
10111.0,
'12345',
'11572',
'11520',
'14614-195',
'10121',
'10105',
'10701',
0.0,
17106.0,
979113.0,
10120.0,
12345.0,
11030.0,
11797.0,
100.0,
11710.0,
33624.0,
8682.0,
11747.0,
1757.0,
11561.0,
7304.0,
6851.0,
11590.0,
94267.0,
10167.0,
11749.0,
11756.0,
10174.0,
10550.0,
89119.0,
14068.0,
11722.0,
11520.0,
6460.0,
32255.0,
10173.0,
10165.0,
11946.0,
'29616-0759',
'07032',
'10278',
'11575',
11963.0,
10106.0,
83.0,
11566.0,
6870.0,
7001.0,
10710.0,
11735.0,
11572.0,
'10165',
'10279',
'11251',
'NJ 07114',
'10106',
'07666',
'11516',
'10177',
'10170',
'43215-1441',
'00000',
'08081',
'10803',
11507.0,
11701.0,
11563.0,
7047.0,
3108.0,
'11021',
'07003',
'10152',
'07029',
'10041',
'31093',
'11735-0230',
89118.0,
11803.0,
11559.0,
11565.0,
7080.0,
12601.0,
10155.0,
10171.0,
7208.0,
11757.0,
11042.0,
'1101',
'10111',
'10173',
'10096',
'07087',
'DID N',
'10956',
10048.0,
10123.0,
10122.0,
11251.0]
ZIP codes look numeric, but aren’t really.
Read the ZIP codes in as strings.
requests2 = pd.read_csv(url, dtype={"Incident Zip": "string"})
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_8064/1354368569.py:1: DtypeWarning: Columns (20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
requests2 = pd.read_csv(url, dtype={"Incident Zip": "string"})
We fixed the dtype warning for column 8 (Incident Zip
).
list(requests2["Incident Zip"].unique())
['11235',
'11221',
'11693',
'11216',
'10465',
'11367',
'10459',
'11101',
'11362',
'10014',
'11234',
'11436',
'10305',
'10467',
'11208',
'10451',
'11419',
'11237',
'11220',
'10469',
'11385',
'10470',
'11694',
'10036',
<NA>,
'10473',
'11435',
'10040',
'10472',
'11225',
'10019',
'11434',
'11226',
'10010',
'11211',
'11421',
'10026',
'10013',
'11423',
'10002',
'10453',
'11213',
'11104',
'11249',
'11361',
'11233',
'11224',
'11374',
'10025',
'10022',
'11214',
'11209',
'11366',
'10304',
'10027',
'11378',
'11206',
'10021',
'11364',
'10065',
'10456',
'10314',
'10312',
'11212',
'11379',
'10462',
'11231',
'10460',
'11416',
'10001',
'11357',
'11413',
'11210',
'11217',
'11223',
'11417',
'11418',
'11218',
'11230',
'11207',
'11691',
'10468',
'10007',
'10310',
'10306',
'11103',
'11105',
'11433',
'11203',
'10307',
'11229',
'11372',
'10032',
'11420',
'10017',
'10301',
'11368',
'11201',
'11365',
'11422',
'10452',
'11377',
'10029',
'10003',
'10075',
'11222',
'10128',
'11415',
'11204',
'10030',
'11432',
'10308',
'11102',
'10016',
'10463',
'11412',
'10011',
'11106',
'10457',
'11375',
'11356',
'11228',
'11369',
'10458',
'11411',
'11215',
'10309',
'11358',
'11355',
'11219',
'10031',
'10303',
'11232',
'10302',
'11238',
'10005',
'11429',
'11205',
'10023',
'11373',
'10033',
'10039',
'10028',
'11363',
'11354',
'11692',
'10455',
'11370',
'10035',
'10012',
'10024',
'10009',
'10034',
'11001',
'10466',
'11427',
'11004',
'10454',
'11414',
'11360',
'10461',
'10018',
'10006',
'11236',
'11428',
'10474',
'10471',
'10037',
'10475',
'11430',
'10119',
'10038',
'11426',
'11239',
'10151',
'10120',
'10112',
'10168',
'10004',
'10464',
'18773-9640',
'10282',
'11109',
'10280',
'07114',
'11040',
'10000',
'07090',
'10020',
'00083',
'10044',
'10069',
'10118',
'07102',
'11359',
'11005',
'10169',
'11697',
'10115',
'10801',
'10103',
'10271',
'10107',
'07078',
'56303',
'11386',
'10952',
'06811',
'10121',
'37214',
'10538',
'10279',
'11371',
'18017',
'07115',
'77036',
'10123',
'11030',
'92626-1902',
'07302',
'18773',
'12222',
'10710',
'07057',
'10162',
'11582',
'10281',
'HARRISBURG',
'11735',
'07305',
'N5X3A6',
'11746',
'23450',
'11580',
'10591',
'19034',
'11596',
'11779',
'07621',
'11021',
'11241',
'100000',
'11242',
'07086',
'10177',
'10105',
'10152',
'07093',
'10917',
'10110',
'10153',
'10178',
'11570',
'10601',
'10704',
'07424',
'10158',
'10172',
'10179',
'IDK',
'11801',
'11590',
'10155',
'11202',
'1801',
'11581',
'11758',
'10278',
'43017',
'10154',
'11553',
'11695',
'10041',
'11741',
'98335',
'14814',
'10111',
'12345',
'11572',
'11520',
'14614-195',
'10701',
'00000',
'17106',
'979113',
'11797',
'100',
'11710',
'33624',
'8682',
'11747',
'01757',
'11561',
'07304',
'000000',
'06851',
'94267',
'10167',
'11749',
'11756',
'10174',
'10550',
'89119',
'14068',
'11722',
'06460',
'32255',
'10173',
'10165',
'11946',
'29616-0759',
'07032',
'11575',
'11963',
'10106',
'11566',
'06870',
'07001',
'11251',
'NJ 07114',
'07666',
'11516',
'10170',
'43215-1441',
'08081',
'10803',
'11507',
'11701',
'11563',
'07047',
'03108',
'07003',
'07029',
'31093',
'11735-0230',
'89118',
'11803',
'11559',
'11565',
'07080',
'12601',
'10171',
'07208',
'11757',
'11042',
'1101',
'10096',
'07087',
'DID N',
'10956',
'10048',
'10122']
Find invalid ZIP codes#
Use a regular expression (regex) to find strings that match a pattern:
^\d{5}(?:-\d{4})?$
│ │ │ │ │└─ end of string
│ │ │ │ └─ optional
│ │ │ └─ capture group
│ │ └─ count
│ └─ numeric/digit character
└─ start of string
regex101 is useful for testing them.
# find valid ZIP codes
valid_zips = requests2["Incident Zip"].str.contains(r"^\d{5}(?:-\d{4})?$")
valid_zips
0 True
1 True
2 True
3 True
4 True
...
499953 True
499954 True
499955 True
499956 True
499957 True
Name: Incident Zip, Length: 499958, dtype: boolean
requests2.shape
(499958, 41)
requests2[valid_zips].shape
(480381, 41)
# filter the DataFrame to only invalid ZIP codes
invalid_zips = valid_zips == False
invalid_zips
0 False
1 False
2 False
3 False
4 False
...
499953 False
499954 False
499955 False
499956 False
499957 False
Name: Incident Zip, Length: 499958, dtype: boolean
requests_with_invalid_zips = requests2[invalid_zips]
requests_with_invalid_zips["Incident Zip"]
55017 HARRISBURG
58100 N5X3A6
80798 100000
120304 IDK
123304 1801
173518 14614-195
192034 979113
201463 100
207158 8682
216745 000000
325071 NJ 07114
425985 1101
441166 DID N
Name: Incident Zip, dtype: string
Clear any invalid ZIP codes:
requests2.loc[invalid_zips, "Incident Zip"] = None
.loc[]
is used for overwriting a subset of values.
Additonal data cleaning tips:
Hard part is finding what needs to be done
Will be specific to your use case
Document what you did, since it will affect your results
In-class exercise#
Simple merge#
I had Copilot generate the DataFrames, so no idea if the numbers are real.
populations = pd.DataFrame(
{
"Country": ["China", "India", "Pakistan"],
"Population": [1444216107, 1393409038, 220892331],
}
)
populations
Country | Population | |
---|---|---|
0 | China | 1444216107 |
1 | India | 1393409038 |
2 | Pakistan | 220892331 |
gdps = pd.DataFrame(
{
"Country": ["China", "India", "United States", "Indonesia", "Pakistan"],
"GDP": [14342903, 2875142, 21433226, 1058393, 263687],
}
)
gdps
Country | GDP | |
---|---|---|
0 | China | 14342903 |
1 | India | 2875142 |
2 | United States | 21433226 |
3 | Indonesia | 1058393 |
4 | Pakistan | 263687 |
How should we combine them?
populations
Country | Population | |
---|---|---|
0 | China | 1444216107 |
1 | India | 1393409038 |
2 | Pakistan | 220892331 |
gdps
Country | GDP | |
---|---|---|
0 | China | 14342903 |
1 | India | 2875142 |
2 | United States | 21433226 |
3 | Indonesia | 1058393 |
4 | Pakistan | 263687 |
To join dataframes together, we will use the pandas .merge()
function.
.merge()
is comparable to:
In general, called “record linkage” or “entity resolution”.
pd.merge(populations, gdps, on="Country")
Country | Population | GDP | |
---|---|---|---|
0 | China | 1444216107 | 14342903 |
1 | India | 1393409038 | 2875142 |
2 | Pakistan | 220892331 | 263687 |
pd.merge(populations, gdps, on="Country", how="outer")
Country | Population | GDP | |
---|---|---|---|
0 | China | 1.444216e+09 | 14342903 |
1 | India | 1.393409e+09 | 2875142 |
2 | Indonesia | NaN | 1058393 |
3 | Pakistan | 2.208923e+08 | 263687 |
4 | United States | NaN | 21433226 |
In-class exercise 2#
Let’s compute the migrant population as a percent of total by country using UN data.