Investigating Elevated Blood Lead Level Rates in Children in NYC
By Tara Merigan
Introduction
Dataset:
I am using the Children Under 6 yrs with Elevated Blood Lead Levels dataset from NYC Open Data. The dataset contains 576 rows. Each row gives the number of children with elevated blood lead levels (BLL) at three thresholds (>=5, >=10 and >=15 µg/dL), the number of children tested, and the rate per 1000 tested at each threshold. The rows cover different locations (citywide, borough and neighbourhood) and years (2005 to 2016).
Questions:
How has the number of children under 6 with elevated blood lead levels in NYC changed over the years?
Does this vary at different concentrations (µg/dL)?
How does the number of children with elevated blood lead levels vary between boroughs?
Between neighbourhoods?
Does this vary at different concentrations?
How does the change in rate of children with elevated blood lead levels over the years vary between neighbourhoods/boroughs?
Hypothesis:
I hypothesise that the rate of children with elevated blood lead levels will fall over the years, as regulation and infrastructure adapt to growing knowledge about the dangers of lead exposure to children. I predict that some areas (both boroughs and neighbourhoods) will have higher concentrations of children with elevated blood lead levels, and that these areas are more likely to have higher numbers of children at the more severe levels (10 or 15µg/dL). I also expect that in areas with higher concentrations of elevated blood lead levels the decrease over the years will be more pronounced than in areas with low concentrations, as the severity of these cases calls for more urgent government intervention.
Step 1
I began by importing the necessary packages: pandas for handling the data and plotly, which creates charts, graphs and other data visualisations. I also imported another plotly package to allow for PDF export.
I then read the CSV file (Children_Elevated_BLL.csv) into a dataframe and displayed the first few rows to check that it had been read correctly. I also inspected the contents of several columns (namely one of the notes columns and geo_type) and printed summary information about the dataframe, which will be helpful later.
import pandas as pd
import plotly.express as px
bll_df = pd.read_csv('Children_Elevated_BLL.csv')
bll_df.head()
geo_type | geo_area_id | geo_area_name | borough_id | time_period | Children under 6 years with elevated blood lead levels (BLL) Number BLL >=5 µg/dL | Children under 6 years with elevated blood lead levels (BLL) Number BLL >=5 µg/dL _NOTES | Children under 6 years with elevated blood lead levels (BLL) Number BLL>=10 µg/dL | Children under 6 years with elevated blood lead levels (BLL) Number BLL>=10 µg/dL _NOTES | Children under 6 years with elevated blood lead levels (BLL) Number BLL>=15 µg/dL | Children under 6 years with elevated blood lead levels (BLL) Number BLL>=15 µg/dL _NOTES | Children under 6 years with elevated blood lead levels (BLL) Number Tested | Children under 6 years with elevated blood lead levels (BLL) Number Tested _NOTES | Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=5 µg/dL per 1,000 tested | Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=5 µg/dL per 1,000 tested_NOTES | Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=10 µg/dL per 1,000 tested | Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=10 µg/dL per 1,000 tested_NOTES | Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=15 µg/dL per 1,000 tested | Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=15 µg/dL per 1,000 tested_NOTES | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Borough | 1 | Bronx | 1.0 | 2005 | 8245 | NaN | 595 | NaN | 167 | NaN | 64500 | NaN | 127.7 | NaN | 9.2 | NaN | 2.6 | NaN |
1 | Borough | 1 | Bronx | 1.0 | 2006 | 7272 | NaN | 474 | NaN | 144 | NaN | 67200 | NaN | 108.2 | NaN | 7.1 | NaN | 2.1 | NaN |
2 | Borough | 1 | Bronx | 1.0 | 2007 | 6174 | NaN | 438 | NaN | 135 | NaN | 68300 | NaN | 90.4 | NaN | 6.4 | NaN | 2.0 | NaN |
3 | Borough | 1 | Bronx | 1.0 | 2008 | 4254 | NaN | 292 | NaN | 105 | NaN | 69800 | NaN | 60.9 | NaN | 4.2 | NaN | 1.5 | NaN |
4 | Borough | 1 | Bronx | 1.0 | 2009 | 2742 | NaN | 278 | NaN | 103 | NaN | 70000 | NaN | 39.2 | NaN | 4.0 | NaN | 1.5 | NaN |
bll_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 576 entries, 0 to 575
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geo_type 576 non-null object
1 geo_area_id 576 non-null int64
2 geo_area_name 576 non-null object
3 borough_id 564 non-null float64
4 time_period 576 non-null int64
5 Children under 6 years with elevated blood lead levels (BLL) Number BLL >=5 µg/dL 576 non-null int64
6 Children under 6 years with elevated blood lead levels (BLL) Number BLL >=5 µg/dL _NOTES 9 non-null object
7 Children under 6 years with elevated blood lead levels (BLL) Number BLL>=10 µg/dL 576 non-null int64
8 Children under 6 years with elevated blood lead levels (BLL) Number BLL>=10 µg/dL _NOTES 139 non-null object
9 Children under 6 years with elevated blood lead levels (BLL) Number BLL>=15 µg/dL 576 non-null int64
10 Children under 6 years with elevated blood lead levels (BLL) Number BLL>=15 µg/dL _NOTES 295 non-null object
11 Children under 6 years with elevated blood lead levels (BLL) Number Tested 576 non-null int64
12 Children under 6 years with elevated blood lead levels (BLL) Number Tested _NOTES 0 non-null float64
13 Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=5 µg/dL per 1,000 tested 576 non-null float64
14 Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=5 µg/dL per 1,000 tested_NOTES 9 non-null object
15 Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=10 µg/dL per 1,000 tested 576 non-null float64
16 Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=10 µg/dL per 1,000 tested_NOTES 139 non-null object
17 Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=15 µg/dL per 1,000 tested 576 non-null float64
18 Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=15 µg/dL per 1,000 tested_NOTES 295 non-null object
dtypes: float64(5), int64(6), object(8)
memory usage: 85.6+ KB
bll_df['Children under 6 years with elevated blood lead levels (BLL) Number BLL>=10 µg/dL _NOTES'].unique()
array([nan,
'*Estimate is based on small numbers so should be interpreted with caution.'],
dtype=object)
bll_df['geo_type'].unique()
array(['Borough', 'Neighborhood (UHF 42)', 'Citywide'], dtype=object)
Step 2
After looking at the contents of the ‘notes’ columns, I chose to remove them from the dataframe as they did not include information that I think is useful for the questions I am investigating. I believe the warning to interpret estimates based on small numbers with caution applies mainly to regression analysis or to making inferences about the wider population. I then renamed some of the columns to make the output tables easier to read and to remove extraneous information from the column titles.
bll_df.drop(bll_df.columns[[6, 8, 10, 12, 14, 16, 18]], axis = 1, inplace = True)
bll_df.head()
geo_type | geo_area_id | geo_area_name | borough_id | time_period | Children under 6 years with elevated blood lead levels (BLL) Number BLL >=5 µg/dL | Children under 6 years with elevated blood lead levels (BLL) Number BLL>=10 µg/dL | Children under 6 years with elevated blood lead levels (BLL) Number BLL>=15 µg/dL | Children under 6 years with elevated blood lead levels (BLL) Number Tested | Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=5 µg/dL per 1,000 tested | Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=10 µg/dL per 1,000 tested | Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=15 µg/dL per 1,000 tested | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Borough | 1 | Bronx | 1.0 | 2005 | 8245 | 595 | 167 | 64500 | 127.7 | 9.2 | 2.6 |
1 | Borough | 1 | Bronx | 1.0 | 2006 | 7272 | 474 | 144 | 67200 | 108.2 | 7.1 | 2.1 |
2 | Borough | 1 | Bronx | 1.0 | 2007 | 6174 | 438 | 135 | 68300 | 90.4 | 6.4 | 2.0 |
3 | Borough | 1 | Bronx | 1.0 | 2008 | 4254 | 292 | 105 | 69800 | 60.9 | 4.2 | 1.5 |
4 | Borough | 1 | Bronx | 1.0 | 2009 | 2742 | 278 | 103 | 70000 | 39.2 | 4.0 | 1.5 |
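Dropping columns by position works here, but it would remove the wrong columns if the layout of the file ever changed. A minimal sketch of a name-based alternative, assuming (as the `info()` output above suggests) that every notes column name ends in `_NOTES`. The toy frame and its shortened column names are hypothetical stand-ins for `bll_df`:

```python
import pandas as pd

# Toy frame standing in for bll_df; column names are shortened, hypothetical stand-ins.
toy = pd.DataFrame({
    "geo_area_name": ["Bronx", "Bronx"],
    "Number BLL>=5": [8245, 7272],
    "Number BLL>=5 _NOTES": [None, None],
    "Rate BLL>=5 per 1,000 tested": [127.7, 108.2],
    "Rate BLL>=5 per 1,000 tested_NOTES": [None, None],
})

# Select the notes columns by their shared suffix instead of by position.
notes_cols = [col for col in toy.columns if col.endswith("_NOTES")]
cleaned = toy.drop(columns=notes_cols)
print(cleaned.columns.tolist())
```

Selecting by suffix keeps the intent visible in the code: whatever ends in `_NOTES` goes, regardless of where it sits in the file.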
bll_df.rename(columns = {'Children under 6 years with elevated blood lead levels (BLL) Number BLL >=5 µg/dL':
'Elevated BLL >=5',
'Children under 6 years with elevated blood lead levels (BLL) Number BLL>=10 µg/dL':
'Elevated BLL >=10',
'Children under 6 years with elevated blood lead levels (BLL) Number BLL>=15 µg/dL':
'Elevated BLL >=15',
'Children under 6 years with elevated blood lead levels (BLL) Number Tested':
'Number Tested',
'Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=5 µg/dL per 1,000 tested':
'Rate BLL>=5 per 1000 tested',
'Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=10 µg/dL per 1,000 tested':
'Rate BLL>=10 per 1000 tested',
'Children under 6 years with elevated blood lead levels (BLL) Rate BLL>=15 µg/dL per 1,000 tested':
'Rate BLL>=15 per 1000 tested'},
inplace = True)
bll_df.head()
geo_type | geo_area_id | geo_area_name | borough_id | time_period | Elevated BLL >=5 | Elevated BLL >=10 | Elevated BLL >=15 | Number Tested | Rate BLL>=5 per 1000 tested | Rate BLL>=10 per 1000 tested | Rate BLL>=15 per 1000 tested | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Borough | 1 | Bronx | 1.0 | 2005 | 8245 | 595 | 167 | 64500 | 127.7 | 9.2 | 2.6 |
1 | Borough | 1 | Bronx | 1.0 | 2006 | 7272 | 474 | 144 | 67200 | 108.2 | 7.1 | 2.1 |
2 | Borough | 1 | Bronx | 1.0 | 2007 | 6174 | 438 | 135 | 68300 | 90.4 | 6.4 | 2.0 |
3 | Borough | 1 | Bronx | 1.0 | 2008 | 4254 | 292 | 105 | 69800 | 60.9 | 4.2 | 1.5 |
4 | Borough | 1 | Bronx | 1.0 | 2009 | 2742 | 278 | 103 | 70000 | 39.2 | 4.0 | 1.5 |
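As a sanity check on what the rate columns mean, the sketch below recomputes the rate per 1000 tested from the counts for the first three Bronx rows shown above. It assumes the rate is simply the number of children with elevated BLL divided by the number tested, times 1000. The "Number Tested" values appear to be rounded to the nearest hundred, so small differences from the published rates are to be expected:

```python
# (year, children with BLL >= 5, number tested, reported rate per 1000) for the Bronx,
# copied from the first rows of the table above.
bronx_rows = [
    (2005, 8245, 64500, 127.7),
    (2006, 7272, 67200, 108.2),
    (2007, 6174, 68300, 90.4),
]

differences = []
for year, n_elevated, n_tested, reported_rate in bronx_rows:
    # Assumed definition: elevated cases per 1000 children tested.
    recomputed_rate = n_elevated / n_tested * 1000
    differences.append(abs(recomputed_rate - reported_rate))
    print(year, round(recomputed_rate, 1), reported_rate)
```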
Step 3
I then created a new dataframe which only included the rows where the geographic area name was ‘New York City’, as these rows provide data for the entire city for each year. I then reshaped this new dataframe from wide to long format using the ‘melt’ function, keeping the year, the rate category (BLL threshold in µg/dL) and the number per 1000 children tested. This allowed me to create a line chart with a separate line for each BLL threshold.
citywide = bll_df[bll_df.geo_area_name == 'New York City']
citywide = citywide.sort_values('time_period')
citywide
geo_type | geo_area_id | geo_area_name | borough_id | time_period | Elevated BLL >=5 | Elevated BLL >=10 | Elevated BLL >=15 | Number Tested | Rate BLL>=5 per 1000 tested | Rate BLL>=10 per 1000 tested | Rate BLL>=15 per 1000 tested | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
98 | Citywide | 1 | New York City | NaN | 2005 | 37344 | 3082 | 1014 | 310100 | 120.4 | 9.9 | 3.3 |
345 | Citywide | 1 | New York City | NaN | 2006 | 34629 | 2767 | 928 | 313900 | 110.3 | 8.8 | 3.0 |
327 | Citywide | 1 | New York City | NaN | 2007 | 30493 | 2282 | 745 | 318200 | 95.8 | 7.2 | 2.3 |
65 | Citywide | 1 | New York City | NaN | 2008 | 20423 | 1803 | 612 | 328000 | 62.3 | 5.5 | 1.9 |
212 | Citywide | 1 | New York City | NaN | 2009 | 15224 | 1565 | 565 | 331800 | 45.9 | 4.7 | 1.7 |
204 | Citywide | 1 | New York City | NaN | 2010 | 13951 | 1574 | 566 | 340900 | 40.9 | 4.6 | 1.7 |
90 | Citywide | 1 | New York City | NaN | 2011 | 11437 | 1332 | 447 | 342900 | 33.4 | 3.9 | 1.3 |
259 | Citywide | 1 | New York City | NaN | 2012 | 8179 | 1053 | 392 | 328600 | 24.9 | 3.2 | 1.2 |
313 | Citywide | 1 | New York City | NaN | 2013 | 7204 | 910 | 325 | 322900 | 22.3 | 2.8 | 1.0 |
523 | Citywide | 1 | New York City | NaN | 2014 | 6550 | 959 | 341 | 314500 | 20.8 | 3.0 | 1.1 |
446 | Citywide | 1 | New York City | NaN | 2015 | 5371 | 908 | 318 | 311300 | 17.3 | 2.9 | 1.0 |
319 | Citywide | 1 | New York City | NaN | 2016 | 4928 | 822 | 300 | 299000 | 16.5 | 2.7 | 1.0 |
citywide_conc = citywide.melt(id_vars='time_period',
value_vars=['Rate BLL>=5 per 1000 tested',
'Rate BLL>=10 per 1000 tested',
'Rate BLL>=15 per 1000 tested'],
var_name='Rates', value_name='Number per 1000 Tested')
citywide_conc.sample(5)
time_period | Rates | Number per 1000 Tested | |
---|---|---|---|
0 | 2005 | Rate BLL>=5 per 1000 tested | 120.4 |
20 | 2013 | Rate BLL>=10 per 1000 tested | 2.8 |
7 | 2012 | Rate BLL>=5 per 1000 tested | 24.9 |
35 | 2016 | Rate BLL>=15 per 1000 tested | 1.0 |
17 | 2010 | Rate BLL>=10 per 1000 tested | 4.6 |
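To make the reshaping concrete, here is a small sketch of `melt` on a two-year toy frame built from the citywide values above. Each rate column becomes a set of rows, with the former column name stored in `Rates`:

```python
import pandas as pd

# Two citywide rows in wide format: one column per BLL threshold.
wide = pd.DataFrame({
    "time_period": [2005, 2006],
    "Rate BLL>=5 per 1000 tested": [120.4, 110.3],
    "Rate BLL>=10 per 1000 tested": [9.9, 8.8],
    "Rate BLL>=15 per 1000 tested": [3.3, 3.0],
})

# Long format: one row per (year, threshold) pair.
# With value_vars omitted, melt unpivots every column not listed in id_vars.
long = wide.melt(id_vars="time_period",
                 var_name="Rates",
                 value_name="Number per 1000 Tested")
print(long)
```

The long layout is what lets plotly colour the lines by the `Rates` column.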
fig = px.line(citywide_conc, x= "time_period", y="Number per 1000 Tested",
title = "Citywide Elevated BLL Rates per 1000 tested",
color = "Rates",
labels={'time_period':'Year'})
fig.show()
Citywide Line Chart
The above line chart shows that elevated BLL rates have fallen over the years, consistent with my hypothesis. As relatively few children per 1000 tested have elevated BLL >=10 or >=15 µg/dL, I chose to carry out the majority of my analysis using the BLL >=5 µg/dL rate per 1000 tested. Additionally, the CDC blood lead reference value is 3.5µg/dL, so the >=5µg/dL threshold is adequate for analysing elevated BLL in children across the city.
Step 4
I am now looking at the elevated BLL rates for each borough, using the >=5 µg/dL concentration. I began by creating a new dataframe by finding the rows that had ‘Borough’ in the geo_type column. I then produced a line chart which had the rate per 1000 tested for each borough over the years.
boroughs = bll_df[bll_df.geo_type == 'Borough']
boroughs = boroughs.sort_values('time_period')
boroughs.head()
geo_type | geo_area_id | geo_area_name | borough_id | time_period | Elevated BLL >=5 | Elevated BLL >=10 | Elevated BLL >=15 | Number Tested | Rate BLL>=5 per 1000 tested | Rate BLL>=10 per 1000 tested | Rate BLL>=15 per 1000 tested | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Borough | 1 | Bronx | 1.0 | 2005 | 8245 | 595 | 167 | 64500 | 127.7 | 9.2 | 2.6 |
24 | Borough | 3 | Manhattan | 3.0 | 2005 | 4851 | 324 | 85 | 43900 | 110.6 | 7.4 | 1.9 |
36 | Borough | 4 | Queens | 4.0 | 2005 | 8238 | 750 | 278 | 80400 | 102.5 | 9.3 | 3.5 |
12 | Borough | 2 | Brooklyn | 2.0 | 2005 | 15015 | 1301 | 448 | 106800 | 140.6 | 12.2 | 4.2 |
48 | Borough | 5 | Staten Island | 5.0 | 2005 | 990 | 112 | 36 | 14500 | 68.2 | 7.7 | 2.5 |
fig = px.line(boroughs, x= "time_period", y="Rate BLL>=5 per 1000 tested", color = 'geo_area_name',
title = "Elevated BLL Rate per 1000 tested",
labels={'time_period':'Year', "Rate BLL>=5 per 1000 tested":'Rate BLL>=5 µg/dL per 1000 tested',
'geo_area_name':'Borough'})
fig.show()
Borough Line Chart
The above line chart displays the rate of children per 1000 tested who had elevated BLL >=5µg/dL. All five boroughs show a reduction in the rate over time. Staten Island shows year-on-year increases in 2007, 2013 and 2016, and Queens a slight increase in 2014, though both still trend downward overall. Brooklyn consistently has the highest rate of all the boroughs across the years, although over time all five converge to relatively low rates per 1000. In 2005, Manhattan had the median rate across the five boroughs, but by 2016 it had the lowest.
Step 5
I then pivoted the boroughs dataframe so that each year became a column, allowing me to later find the relative change in the rate per 1000 children tested for each borough (>=5 µg/dL concentration). I converted the year column titles from integers to strings so I could refer to them by their title as text. I then created a column for the relative change, calculated as the percentage change from the 2005 value to the 2016 value for each borough.
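Year-on-year rises can be checked directly rather than read off the chart. The sketch below uses the borough rates (>=5 µg/dL per 1000 tested) copied from the pivot table output in Step 5 and lists, for each borough, the years in which the rate was higher than in the previous year:

```python
years = list(range(2005, 2017))

# Rates per 1000 tested (BLL >= 5 µg/dL), copied from the borough pivot table.
rates = {
    "Bronx": [127.7, 108.2, 90.4, 60.9, 39.2, 37.5, 28.5, 20.9, 20.1, 18.7, 15.7, 15.0],
    "Brooklyn": [140.6, 136.9, 120.7, 76.6, 60.0, 52.6, 45.8, 35.3, 30.1, 26.8, 22.6, 22.3],
    "Manhattan": [110.6, 101.7, 88.2, 55.4, 36.4, 27.2, 22.2, 15.3, 15.1, 14.0, 10.6, 8.1],
    "Queens": [102.5, 92.3, 79.6, 53.7, 40.2, 38.2, 28.5, 20.6, 18.2, 18.6, 15.4, 14.3],
    "Staten Island": [68.2, 62.9, 63.5, 38.2, 33.8, 24.3, 20.6, 16.8, 17.7, 17.0, 11.9, 14.8],
}

# Keep only boroughs with at least one year-on-year increase.
increases = {}
for borough, values in rates.items():
    up_years = [years[i] for i in range(1, len(values)) if values[i] > values[i - 1]]
    if up_years:
        increases[borough] = up_years

print(increases)
```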
boroughs_five = boroughs.pivot(index='geo_area_name', columns='time_period', values='Rate BLL>=5 per 1000 tested')
boroughs_five
geo_area_name | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Bronx | 127.7 | 108.2 | 90.4 | 60.9 | 39.2 | 37.5 | 28.5 | 20.9 | 20.1 | 18.7 | 15.7 | 15.0 |
Brooklyn | 140.6 | 136.9 | 120.7 | 76.6 | 60.0 | 52.6 | 45.8 | 35.3 | 30.1 | 26.8 | 22.6 | 22.3 |
Manhattan | 110.6 | 101.7 | 88.2 | 55.4 | 36.4 | 27.2 | 22.2 | 15.3 | 15.1 | 14.0 | 10.6 | 8.1 |
Queens | 102.5 | 92.3 | 79.6 | 53.7 | 40.2 | 38.2 | 28.5 | 20.6 | 18.2 | 18.6 | 15.4 | 14.3 |
Staten Island | 68.2 | 62.9 | 63.5 | 38.2 | 33.8 | 24.3 | 20.6 | 16.8 | 17.7 | 17.0 | 11.9 | 14.8 |
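A small sketch of the pivot and the column-label conversion, using a hypothetical two-borough, two-year toy frame. It illustrates that the year labels produced by `pivot` are integers, so they can be referenced as integers (`wide[2005]`) or, after the `astype(str)` conversion used in this notebook, as strings (`wide['2005']`):

```python
import pandas as pd

# Long-format toy data: one row per borough and year.
long = pd.DataFrame({
    "geo_area_name": ["Bronx", "Bronx", "Manhattan", "Manhattan"],
    "time_period": [2005, 2016, 2005, 2016],
    "Rate BLL>=5 per 1000 tested": [127.7, 15.0, 110.6, 8.1],
})

# One row per borough, one column per year.
wide = long.pivot(index="geo_area_name", columns="time_period",
                  values="Rate BLL>=5 per 1000 tested")

# The column labels are integers here; convert them so they can be used as strings.
wide.columns = wide.columns.astype(str)
print(wide.loc["Bronx", "2016"])
```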
boroughs_five.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Bronx to Staten Island
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 2005 5 non-null float64
1 2006 5 non-null float64
2 2007 5 non-null float64
3 2008 5 non-null float64
4 2009 5 non-null float64
5 2010 5 non-null float64
6 2011 5 non-null float64
7 2012 5 non-null float64
8 2013 5 non-null float64
9 2014 5 non-null float64
10 2015 5 non-null float64
11 2016 5 non-null float64
dtypes: float64(12)
memory usage: 520.0+ bytes
boroughs_five.columns = boroughs_five.columns.astype(str)
boroughs_five['Relative Change'] = ((boroughs_five['2016'] - boroughs_five['2005']) / boroughs_five['2005'])*100
boroughs_five.head()
geo_area_name | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | Relative Change |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Bronx | 127.7 | 108.2 | 90.4 | 60.9 | 39.2 | 37.5 | 28.5 | 20.9 | 20.1 | 18.7 | 15.7 | 15.0 | -88.253720 |
Brooklyn | 140.6 | 136.9 | 120.7 | 76.6 | 60.0 | 52.6 | 45.8 | 35.3 | 30.1 | 26.8 | 22.6 | 22.3 | -84.139403 |
Manhattan | 110.6 | 101.7 | 88.2 | 55.4 | 36.4 | 27.2 | 22.2 | 15.3 | 15.1 | 14.0 | 10.6 | 8.1 | -92.676311 |
Queens | 102.5 | 92.3 | 79.6 | 53.7 | 40.2 | 38.2 | 28.5 | 20.6 | 18.2 | 18.6 | 15.4 | 14.3 | -86.048780 |
Staten Island | 68.2 | 62.9 | 63.5 | 38.2 | 33.8 | 24.3 | 20.6 | 16.8 | 17.7 | 17.0 | 11.9 | 14.8 | -78.299120 |
Borough Relative Change
The above table shows the relative change in the rate of elevated BLL for each borough from 2005 to 2016. All five show a reduction of roughly 78% to 93%, so the relative changes are fairly similar. Manhattan had the largest relative reduction (about 93%) and Staten Island the smallest (about 78%).
Step 6
I then wanted to compare the 2005 rate per 1000 with the relative change, to see if there appeared to be a correlation between high rates and more urgent intervention (as measured by relative change). I created a new dataframe, dropped the columns from 2006 to 2016, reset the dataframe’s index, sorted the boroughs by the 2005 rate and took the absolute value of the relative change column. I then created a bar chart from this dataframe to visually compare the 2005 rates and absolute relative change across the boroughs.
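The relative change used in the table is the percentage change from the 2005 rate to the 2016 rate. A short worked sketch, using the Bronx and Manhattan values from the table (the helper name `relative_change` is introduced here for illustration only):

```python
def relative_change(start: float, end: float) -> float:
    """Percentage change from start to end; negative values mean a reduction."""
    return (end - start) / start * 100


# 2005 and 2016 rates per 1000 tested, taken from the table above.
bronx = relative_change(127.7, 15.0)
manhattan = relative_change(110.6, 8.1)
print(round(bronx, 1), round(manhattan, 1))
```

Because the change is expressed relative to the 2005 starting rate, boroughs with very different starting rates can be compared on the same scale.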
boroughs_compare = boroughs_five.drop(['2006', '2007', '2008', '2009', '2010',
'2011', '2012', '2013', '2014', '2015', '2016'], axis=1)
boroughs_compare = boroughs_compare.reset_index()
boroughs_compare = boroughs_compare.sort_values(('2005'), ascending=False)
boroughs_compare['Relative Change'] = boroughs_compare['Relative Change'].abs()
boroughs_compare.head()
time_period | geo_area_name | 2005 | Relative Change |
---|---|---|---|
1 | Brooklyn | 140.6 | 84.139403 |
0 | Bronx | 127.7 | 88.253720 |
2 | Manhattan | 110.6 | 92.676311 |
3 | Queens | 102.5 | 86.048780 |
4 | Staten Island | 68.2 | 78.299120 |
fig = px.bar(boroughs_compare, x='geo_area_name', y=['2005', 'Relative Change'], barmode='group',
title = 'Relative Change and 2005 Rate per 1000 Comparison by Borough',
labels={'geo_area_name':'Borough', 'value':'Value', 'variable': 'Variable'})
fig.show()
Bar Chart
Whilst the values for relative change and the rate are in different units, this bar chart does not show a clear relationship between the elevated BLL rate per 1000 tested in 2005 and the absolute relative change for each borough. Brooklyn had the highest 2005 rate but only the fourth-largest relative change, while Manhattan, with the median 2005 rate, had the largest.
Step 7
I then filtered the original dataframe to include only neighbourhoods by selecting the rows with ‘Neighborhood (UHF 42)’ in the ‘geo_type’ column, and sorted by ‘geo_area_id’ to group them by borough. Whilst I will refer to each row in the dataframe as a ‘neighbourhood’, these areas can include multiple neighbourhoods grouped together and approximate community planning districts.
I then created another dataframe with only the values from 2005 to get the initial rates, and sorted it by the elevated BLL rate per 1000 tested (>=5µg/dL). I then wanted to create a stacked bar chart to show the breakdown of the elevated BLL rate of each neighbourhood by borough. I replaced the ‘borough_id’ values with their corresponding borough names in the dataframe to make the chart easier to interpret.
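One way to put a number on this comparison is a Pearson correlation coefficient between the two columns. The sketch below is an illustration rather than part of the original analysis, and with only five boroughs any coefficient is very uncertain and should not be over-interpreted. The short column names are hypothetical:

```python
import pandas as pd

# 2005 rate and absolute relative change for each borough, copied from the table above.
compare = pd.DataFrame({
    "borough": ["Brooklyn", "Bronx", "Manhattan", "Queens", "Staten Island"],
    "rate_2005": [140.6, 127.7, 110.6, 102.5, 68.2],
    "abs_relative_change": [84.139403, 88.253720, 92.676311, 86.048780, 78.299120],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = compare["rate_2005"].corr(compare["abs_relative_change"])
print(round(r, 2))
```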
neighbourhoods = bll_df[bll_df.geo_type == 'Neighborhood (UHF 42)']
neighbourhoods = neighbourhoods.sort_values('geo_area_id')
neighbourhoods.head()
geo_type | geo_area_id | geo_area_name | borough_id | time_period | Elevated BLL >=5 | Elevated BLL >=10 | Elevated BLL >=15 | Number Tested | Rate BLL>=5 per 1000 tested | Rate BLL>=10 per 1000 tested | Rate BLL>=15 per 1000 tested | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
201 | Neighborhood (UHF 42) | 101 | Kingsbridge - Riverdale | 1.0 | 2009 | 85 | 7 | 1 | 3300 | 25.5 | 2.1 | 0.3 |
307 | Neighborhood (UHF 42) | 101 | Kingsbridge - Riverdale | 1.0 | 2013 | 30 | 4 | 0 | 3300 | 9.0 | 1.2 | 0.0 |
218 | Neighborhood (UHF 42) | 101 | Kingsbridge - Riverdale | 1.0 | 2007 | 243 | 13 | 4 | 3200 | 76.6 | 4.1 | 1.3 |
233 | Neighborhood (UHF 42) | 101 | Kingsbridge - Riverdale | 1.0 | 2005 | 198 | 22 | 4 | 2800 | 71.9 | 8.0 | 1.5 |
175 | Neighborhood (UHF 42) | 101 | Kingsbridge - Riverdale | 1.0 | 2008 | 147 | 3 | 1 | 3200 | 45.7 | 0.9 | 0.3 |
neighbourhoods_initial = neighbourhoods[neighbourhoods.time_period == 2005]
neighbourhoods_initial = neighbourhoods_initial.sort_values('Rate BLL>=5 per 1000 tested', ascending=False)
neighbourhoods_initial.head()
geo_type | geo_area_id | geo_area_name | borough_id | time_period | Elevated BLL >=5 | Elevated BLL >=10 | Elevated BLL >=15 | Number Tested | Rate BLL>=5 per 1000 tested | Rate BLL>=10 per 1000 tested | Rate BLL>=15 per 1000 tested | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
339 | Neighborhood (UHF 42) | 211 | Williamsburg - Bushwick | 2.0 | 2005 | 2072 | 178 | 70 | 11600 | 178.3 | 15.3 | 6.0 |
360 | Neighborhood (UHF 42) | 201 | Greenpoint | 2.0 | 2005 | 869 | 71 | 22 | 5100 | 171.1 | 14.0 | 4.3 |
521 | Neighborhood (UHF 42) | 204 | East New York | 2.0 | 2005 | 1767 | 156 | 47 | 10800 | 163.8 | 14.5 | 4.4 |
515 | Neighborhood (UHF 42) | 303 | East Harlem | 3.0 | 2005 | 778 | 42 | 6 | 4800 | 161.8 | 8.7 | 1.2 |
520 | Neighborhood (UHF 42) | 203 | Bedford Stuyvesant - Crown Heights | 2.0 | 2005 | 2528 | 220 | 69 | 16100 | 156.8 | 13.6 | 4.3 |
neighbourhoods_initial.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 42 entries, 339 to 268
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 geo_type 42 non-null object
1 geo_area_id 42 non-null int64
2 geo_area_name 42 non-null object
3 borough_id 42 non-null float64
4 time_period 42 non-null int64
5 Elevated BLL >=5 42 non-null int64
6 Elevated BLL >=10 42 non-null int64
7 Elevated BLL >=15 42 non-null int64
8 Number Tested 42 non-null int64
9 Rate BLL>=5 per 1000 tested 42 non-null float64
10 Rate BLL>=10 per 1000 tested 42 non-null float64
11 Rate BLL>=15 per 1000 tested 42 non-null float64
dtypes: float64(4), int64(6), object(2)
memory usage: 4.3+ KB
# Assign the result back instead of using inplace on a column selection
neighbourhoods_initial['borough_id'] = neighbourhoods_initial['borough_id'].replace(
    {1:'Bronx', 2:'Brooklyn', 3:'Manhattan', 4:'Queens', 5:'Staten Island'})
neighbourhoods_initial.head()
geo_type | geo_area_id | geo_area_name | borough_id | time_period | Elevated BLL >=5 | Elevated BLL >=10 | Elevated BLL >=15 | Number Tested | Rate BLL>=5 per 1000 tested | Rate BLL>=10 per 1000 tested | Rate BLL>=15 per 1000 tested | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
339 | Neighborhood (UHF 42) | 211 | Williamsburg - Bushwick | Brooklyn | 2005 | 2072 | 178 | 70 | 11600 | 178.3 | 15.3 | 6.0 |
360 | Neighborhood (UHF 42) | 201 | Greenpoint | Brooklyn | 2005 | 869 | 71 | 22 | 5100 | 171.1 | 14.0 | 4.3 |
521 | Neighborhood (UHF 42) | 204 | East New York | Brooklyn | 2005 | 1767 | 156 | 47 | 10800 | 163.8 | 14.5 | 4.4 |
515 | Neighborhood (UHF 42) | 303 | East Harlem | Manhattan | 2005 | 778 | 42 | 6 | 4800 | 161.8 | 8.7 | 1.2 |
520 | Neighborhood (UHF 42) | 203 | Bedford Stuyvesant - Crown Heights | Brooklyn | 2005 | 2528 | 220 | 69 | 16100 | 156.8 | 13.6 | 4.3 |
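An alternative to `replace` for this kind of relabelling is `Series.map`, sketched below on a hypothetical three-row toy series. Unlike `replace`, `map` turns any value that is missing from the dictionary into NaN, which makes unmapped ids easy to spot. The keys are written as floats because `borough_id` is stored as float64 in this dataset:

```python
import pandas as pd

# Borough ids as stored in the dataset (float64); NaN stands for a row with no borough.
borough_id = pd.Series([2.0, 3.0, float("nan")])

borough_names = {1.0: "Bronx", 2.0: "Brooklyn", 3.0: "Manhattan",
                 4.0: "Queens", 5.0: "Staten Island"}

# Values without a matching key (here the NaN) come back as NaN.
labelled = borough_id.map(borough_names)
print(labelled.tolist())
```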
fig = px.bar(neighbourhoods_initial,
x='borough_id',
y='Rate BLL>=5 per 1000 tested',
hover_data=['geo_area_name'],
labels={'borough_id':'Borough'}, title='Rate per 1000 Tested by Borough and Neighbourhood, 2005')
fig.show()
Stacked Bar Chart
The above bar chart shows the elevated BLL rates per 1000 tested in each neighbourhood by borough in 2005. Hovering over the chart allows one to see where the highest rates are in each borough. Williamsburg/Bushwick (Brooklyn), East Harlem (Manhattan), Hunts Point/Mott Haven (the Bronx), West Queens (Queens) and Port Richmond (Staten Island) each had the highest elevated BLL rates for their respective boroughs.
Step 8
I then wanted to look at the relative change for each neighbourhood between 2005 and 2016, and compare this to the initial elevated BLL rates per 1000 tested. I began by creating a choropleth map of the initial rate for each neighbourhood, allowing me to see where the highest rates were concentrated across the city. I then wanted to create a new dataframe that could be used to find the relative change for each neighbourhood. I did this by pivoting the neighbourhoods dataframe to create a new dataframe that had a column for each year. I created a new column in this dataframe for the relative change using the 2005 and 2016 values. Next, I filtered this dataframe to only include the columns relevant to the choropleth map and reset the index. Finally, I used this dataframe to create the choropleth map of relative change in rates across each neighbourhood.
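The highest-rate neighbourhood in each borough can also be found in code rather than by hovering. A minimal sketch using `groupby` with `idxmax`, on a toy frame containing only the five rows shown in the `head()` output above (so only Brooklyn and Manhattan appear). The short column names `borough` and `rate` are hypothetical:

```python
import pandas as pd

# The five highest 2005 rates, copied from the head() output above.
toy = pd.DataFrame({
    "borough": ["Brooklyn", "Brooklyn", "Brooklyn", "Manhattan", "Brooklyn"],
    "geo_area_name": ["Williamsburg - Bushwick", "Greenpoint", "East New York",
                      "East Harlem", "Bedford Stuyvesant - Crown Heights"],
    "rate": [178.3, 171.1, 163.8, 161.8, 156.8],
})

# idxmax returns the row label of the maximum rate within each borough.
top_rows = toy.loc[toy.groupby("borough")["rate"].idxmax()]
top_by_borough = dict(zip(top_rows["borough"], top_rows["geo_area_name"]))
print(top_by_borough)
```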
import json
# Use a context manager so the file is closed once it has been loaded
with open('uhf42.geojson') as f:
    geojson = json.load(f)
geojson['features'][1]['properties']
{'cartodb_id': 5,
'objectid': 5,
'borough': 'Bronx',
'uhf_neigh': 'Pelham - Throgs Neck',
'shape_area': 386573664.368,
'shape_leng': 250903.372273,
'uhfcode': 104}
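The choropleth call that follows matches each `geo_area_id` in the dataframe to a feature through `properties.uhfcode`. As far as I am aware, rows whose id has no matching feature are simply not drawn, so it can be worth checking the two sets of ids against each other first. A minimal sketch with a hypothetical toy GeoJSON and id list:

```python
# Toy GeoJSON with the same 'properties.uhfcode' layout as the feature printed above.
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"uhfcode": 101}, "geometry": None},
        {"type": "Feature", "properties": {"uhfcode": 104}, "geometry": None},
    ],
}

# Hypothetical geo_area_id values taken from the dataframe.
data_ids = [101, 104, 105]

# Ids present in the data but absent from the map would not be drawn.
geo_ids = {feature["properties"]["uhfcode"] for feature in toy_geojson["features"]}
missing_from_map = sorted(set(data_ids) - geo_ids)
print(missing_from_map)
```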
fig = px.choropleth_mapbox(neighbourhoods_initial,
geojson=geojson,
locations='geo_area_id',
featureidkey='properties.uhfcode',
color='Rate BLL>=5 per 1000 tested',
hover_data=['geo_area_name'],
center = {'lat': 40.73, 'lon': -73.98},
zoom=9,
mapbox_style='carto-positron',
title='Elevated BLL Rate per 1000 Tested, 2005')
fig.update_layout(height=700)
fig.show()
2005 Neighbourhood Rates Choropleth Map
The above map shows the initial elevated BLL rates (>=5µg/dL concentration) for each neighbourhood area. The highest frequency of elevated BLL per 1000 tested was in Williamsburg/Bushwick and the other notably high rates of elevated BLL were concentrated nearby in other parts of Brooklyn. Staten Island had noticeably low rates per 1000 tested in 2005. The Bronx did not have the highest elevated BLL rates, but had multiple neighbourhoods with rates on the higher end. The disparity between rates in neighbouring areas in Manhattan is worth noting - with East Harlem/Harlem/Morningside Heights having some of the highest elevated BLL rates across the city and being right next to the Upper West and East Sides which had some of the lowest.
neighbourhoods_five = neighbourhoods.pivot(index=['geo_area_id', 'geo_area_name'],
columns='time_period', values='Rate BLL>=5 per 1000 tested')
neighbourhoods_five.head()
geo_area_id | geo_area_name | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
101 | Kingsbridge - Riverdale | 71.9 | 60.8 | 76.6 | 45.7 | 25.5 | 21.4 | 18.1 | 9.8 | 9.0 | 10.2 | 9.5 | 9.0 |
102 | Northeast Bronx | 121.6 | 103.5 | 89.5 | 57.3 | 36.5 | 34.7 | 26.4 | 19.7 | 18.6 | 18.6 | 16.2 | 14.3 |
103 | Fordham - Bronx Pk | 130.5 | 109.7 | 97.0 | 69.5 | 46.5 | 45.1 | 32.5 | 27.0 | 24.2 | 22.2 | 17.4 | 17.5 |
104 | Pelham - Throgs Neck | 113.2 | 108.2 | 84.6 | 54.9 | 37.4 | 32.8 | 27.6 | 16.1 | 17.2 | 17.8 | 18.5 | 16.9 |
105 | Crotona -Tremont | 135.5 | 111.9 | 90.5 | 61.4 | 40.3 | 36.5 | 28.4 | 22.8 | 19.0 | 17.5 | 15.1 | 13.4 |
neighbourhoods_five.columns = neighbourhoods_five.columns.astype(str)
neighbourhoods_five['Relative Change'] = ((neighbourhoods_five['2016'] - neighbourhoods_five['2005']) / neighbourhoods_five['2005'])*100
neighbourhoods_five.head()
geo_area_id | geo_area_name | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | Relative Change |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
101 | Kingsbridge - Riverdale | 71.9 | 60.8 | 76.6 | 45.7 | 25.5 | 21.4 | 18.1 | 9.8 | 9.0 | 10.2 | 9.5 | 9.0 | -87.482615 |
102 | Northeast Bronx | 121.6 | 103.5 | 89.5 | 57.3 | 36.5 | 34.7 | 26.4 | 19.7 | 18.6 | 18.6 | 16.2 | 14.3 | -88.240132 |
103 | Fordham - Bronx Pk | 130.5 | 109.7 | 97.0 | 69.5 | 46.5 | 45.1 | 32.5 | 27.0 | 24.2 | 22.2 | 17.4 | 17.5 | -86.590038 |
104 | Pelham - Throgs Neck | 113.2 | 108.2 | 84.6 | 54.9 | 37.4 | 32.8 | 27.6 | 16.1 | 17.2 | 17.8 | 18.5 | 16.9 | -85.070671 |
105 | Crotona -Tremont | 135.5 | 111.9 | 90.5 | 61.4 | 40.3 | 36.5 | 28.4 | 22.8 | 19.0 | 17.5 | 15.1 | 13.4 | -90.110701 |
neighbourhoods_change = neighbourhoods_five.filter(['geo_area_id', 'Relative Change'], axis=1)
neighbourhoods_change = neighbourhoods_change.reset_index(level=[0,1])
neighbourhoods_change.head()
time_period | geo_area_id | geo_area_name | Relative Change |
---|---|---|---|
0 | 101 | Kingsbridge - Riverdale | -87.482615 |
1 | 102 | Northeast Bronx | -88.240132 |
2 | 103 | Fordham - Bronx Pk | -86.590038 |
3 | 104 | Pelham - Throgs Neck | -85.070671 |
4 | 105 | Crotona -Tremont | -90.110701 |
fig = px.choropleth_mapbox(neighbourhoods_change,
geojson=geojson,
locations='geo_area_id',
featureidkey='properties.uhfcode',
color='Relative Change',
hover_data=['geo_area_name'],
center = {'lat': 40.73, 'lon': -73.98},
zoom=9,
mapbox_style='carto-positron',
title='Relative Change in Elevated BLL Rate per 1000 Tested, 2005-2016')
fig.update_layout(height=700)
fig.show()
Neighbourhood Relative Change Choropleth Map
The above map displays the relative change in elevated BLL rates per 1000 tested for each neighbourhood between 2005 and 2016. Staten Island’s neighbourhoods had a comparatively low relative change; however, as seen on the previous map, they also had low elevated BLL rates in 2005. Greenpoint had the lowest relative change in elevated BLL rate per 1000 tested between 2005 and 2016. It is worth noting that Greenpoint also had the second highest rate in 2005.
Conclusions
The data were consistent with my hypothesis that the number of children under six with elevated BLL in NYC would fall over the years. This held at all three thresholds used to identify elevated BLL in children. Due to the low frequency of elevated BLL at the 10µg/dL and 15µg/dL thresholds citywide, my analysis dealt mostly with the 5µg/dL threshold.
In line with my hypothesis, there were areas that had higher rates than others across the city. Brooklyn consistently had the highest rate per 1000 tested at the 5µg/dL threshold over the years, although all five boroughs saw some convergence to lower rates over time. This was similarly reflected in the breakdown by neighbourhood areas.
I hypothesised that there would be larger relative changes between the 2005 and 2016 elevated BLL rates for areas that had high concentrations; however, in light of the low frequency of the more severe cases, I chose to use the 2005 elevated BLL rate per 1000 tested as a measure of severity. Notably, there did not appear to be a clear correlation between the 2005 rates and the relative change in either boroughs or neighbourhoods. Manhattan had the highest relative change, despite having the median 2005 rate per 1000 tested. In a similar vein, Greenpoint had the second highest rate in 2005 and the lowest relative change. This may speak to socio-economic and geographic factors that influence the speed of infrastructure and regulatory change and compliance: government priorities, physical or financial barriers to changing infrastructure in areas with high lead exposure risks, and individuals’ abilities to guard their children against lead exposure.
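The neighbourhoods with the largest and smallest reductions can also be read off by sorting on the relative change column. The sketch below uses only the five Bronx rows shown in the `head()` output above, so it illustrates the technique and does not reproduce the citywide ranking:

```python
import pandas as pd

# Relative change values for the five Bronx neighbourhoods shown above.
change = pd.DataFrame({
    "geo_area_name": ["Kingsbridge - Riverdale", "Northeast Bronx", "Fordham - Bronx Pk",
                      "Pelham - Throgs Neck", "Crotona -Tremont"],
    "Relative Change": [-87.482615, -88.240132, -86.590038, -85.070671, -90.110701],
})

# Most negative value first: the largest reduction is at the top, the smallest at the bottom.
ranked = change.sort_values("Relative Change")
largest_reduction = ranked.iloc[0]["geo_area_name"]
smallest_reduction = ranked.iloc[-1]["geo_area_name"]
print(largest_reduction, "|", smallest_reduction)
```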