Class 3: Data visualization#
“Data visualization”, “chart”, “graph”, and will be used interchangeably.
Today’s goal: Visualizing requests per community district#
This should help us better understand trends across the city.
Data from where we left off last class#
Derived dataset containing count of complaints per community district.
import pandas as pd
districts = pd.read_csv(
"https://storage.googleapis.com/python-public-policy/data/community_district_311.csv.zip"
)
districts.head()
Community Board | num_311_requests | boro_cd | Borough | CD Number | CD Name | 1970 Population | 1980 Population | 1990 Population | 2000 Population | 2010 Population | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12 MANHATTAN | 14110 | 112 | Manhattan | 12 | Washington Heights, Inwood | 180561 | 179941 | 198192 | 208414 | 190020 |
1 | 05 QUEENS | 12487 | 405 | Queens | 5 | Ridgewood, Glendale, Maspeth | 161022 | 150142 | 149126 | 165911 | 169190 |
2 | 12 QUEENS | 12228 | 412 | Queens | 12 | Jamaica, St. Albans, Hollis | 206639 | 189383 | 201293 | 223602 | 225919 |
3 | 01 BROOKLYN | 11863 | 301 | Brooklyn | 1 | Williamsburg, Greenpoint | 179390 | 142942 | 155972 | 160338 | 173083 |
4 | 03 BROOKLYN | 11615 | 303 | Brooklyn | 3 | Bedford Stuyvesant | 203380 | 133379 | 138696 | 143867 | 152985 |
Start by importing necessary packages#
import plotly.express as px
# boilerplate for allowing PDF export
import plotly.io as pio
pio.renderers.default = "notebook_connected+pdf"
Let’s start with making a histogram to better visualize the difference in scale of 311 requests across community boards#
Adapting the basic histogram example:
fig = px.histogram(
districts,
x="num_311_requests",
title="Distribution of number of 311 requests by number of Community Districts",
)
fig.show()
Note that histogram values are equivalent to doing a .groupby().size()
.
Looking at raw volume is probably less useful than density.
Calculate 311 requests per capita#
Divide request count by 2010 population to get requests per capita
districts["requests_per_capita"] = districts["num_311_requests"] / districts["2010 Population"]
districts.head()
Community Board | num_311_requests | boro_cd | Borough | CD Number | CD Name | 1970 Population | 1980 Population | 1990 Population | 2000 Population | 2010 Population | requests_per_capita | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12 MANHATTAN | 14110 | 112 | Manhattan | 12 | Washington Heights, Inwood | 180561 | 179941 | 198192 | 208414 | 190020 | 0.074255 |
1 | 05 QUEENS | 12487 | 405 | Queens | 5 | Ridgewood, Glendale, Maspeth | 161022 | 150142 | 149126 | 165911 | 169190 | 0.073805 |
2 | 12 QUEENS | 12228 | 412 | Queens | 12 | Jamaica, St. Albans, Hollis | 206639 | 189383 | 201293 | 223602 | 225919 | 0.054126 |
3 | 01 BROOKLYN | 11863 | 301 | Brooklyn | 1 | Williamsburg, Greenpoint | 179390 | 142942 | 155972 | 160338 | 173083 | 0.068539 |
4 | 03 BROOKLYN | 11615 | 303 | Brooklyn | 3 | Bedford Stuyvesant | 203380 | 133379 | 138696 | 143867 | 152985 | 0.075922 |
Let’s create a simplified new dataframe that only include the columns we care about and in a better order.
columns = [
"boro_cd",
"Borough",
"CD Name",
"2010 Population",
"num_311_requests",
"requests_per_capita",
]
cd_data = districts[columns]
cd_data
boro_cd | Borough | CD Name | 2010 Population | num_311_requests | requests_per_capita | |
---|---|---|---|---|---|---|
0 | 112 | Manhattan | Washington Heights, Inwood | 190020 | 14110 | 0.074255 |
1 | 405 | Queens | Ridgewood, Glendale, Maspeth | 169190 | 12487 | 0.073805 |
2 | 412 | Queens | Jamaica, St. Albans, Hollis | 225919 | 12228 | 0.054126 |
3 | 301 | Brooklyn | Williamsburg, Greenpoint | 173083 | 11863 | 0.068539 |
4 | 303 | Brooklyn | Bedford Stuyvesant | 152985 | 11615 | 0.075922 |
5 | 501 | Staten Island | Stapleton, Port Richmond | 175756 | 11438 | 0.065079 |
6 | 407 | Queens | Flushing, Bay Terrace | 247354 | 11210 | 0.045320 |
7 | 305 | Brooklyn | East New York, Starrett City | 182896 | 10862 | 0.059389 |
8 | 204 | Bronx | Highbridge, Concourse Village | 146441 | 10628 | 0.072575 |
9 | 401 | Queens | Astoria, Long Island City | 191105 | 10410 | 0.054473 |
10 | 317 | Brooklyn | East Flatbush, Rugby, Farragut | 155252 | 10208 | 0.065751 |
11 | 314 | Brooklyn | Flatbush, Midwood | 160664 | 10179 | 0.063356 |
12 | 207 | Bronx | Bedford Park, Norwood, Fordham | 139286 | 9841 | 0.070653 |
13 | 318 | Brooklyn | Canarsie, Flatlands | 193543 | 9717 | 0.050206 |
14 | 409 | Queens | Woodhaven, Richmond Hill | 143317 | 9599 | 0.066977 |
15 | 413 | Queens | Queens Village, Rosedale | 188593 | 9459 | 0.050156 |
16 | 212 | Bronx | Wakefield, Williamsbridge | 152344 | 9412 | 0.061781 |
17 | 311 | Brooklyn | Bensonhurst, Bath Beach | 181981 | 9309 | 0.051154 |
18 | 205 | Bronx | University Hts., Fordham, Mt. Hope | 128200 | 9094 | 0.070936 |
19 | 103 | Manhattan | Lower East Side, Chinatown | 163277 | 8905 | 0.054539 |
20 | 408 | Queens | Fresh Meadows, Briarwood | 151107 | 8661 | 0.057317 |
21 | 304 | Brooklyn | Bushwick | 112634 | 8639 | 0.076700 |
22 | 110 | Manhattan | Central Harlem | 115723 | 8592 | 0.074246 |
23 | 503 | Staten Island | Tottenville, Woodrow, Great Kills | 160209 | 8524 | 0.053206 |
24 | 315 | Brooklyn | Sheepshead Bay, Gerritsen Beach | 159650 | 8508 | 0.053292 |
25 | 410 | Queens | Ozone Park, Howard Beach | 122396 | 8333 | 0.068082 |
26 | 312 | Brooklyn | Borough Park, Ocean Parkway | 191382 | 8214 | 0.042919 |
27 | 107 | Manhattan | West Side, Upper West Side | 209084 | 8141 | 0.038937 |
28 | 209 | Bronx | Soundview, Parkchester | 172298 | 8092 | 0.046965 |
29 | 310 | Brooklyn | Bay Ridge, Dyker Heights | 124491 | 7808 | 0.062719 |
30 | 308 | Brooklyn | Crown Heights North | 96317 | 7797 | 0.080951 |
31 | 302 | Brooklyn | Brooklyn Heights, Fort Greene | 99617 | 7747 | 0.077768 |
32 | 309 | Brooklyn | Crown Heights South, Wingate | 98429 | 7571 | 0.076918 |
33 | 306 | Brooklyn | Park Slope, Carroll Gardens | 104709 | 7373 | 0.070414 |
34 | 502 | Staten Island | New Springville, South Beach | 132003 | 7059 | 0.053476 |
35 | 403 | Queens | Jackson Heights, North Corona | 171576 | 6799 | 0.039627 |
36 | 109 | Manhattan | Manhattanville, Hamilton Heights | 110193 | 6792 | 0.061637 |
37 | 104 | Manhattan | Chelsea, Clinton | 103245 | 6765 | 0.065524 |
38 | 105 | Manhattan | Midtown Business District | 51673 | 6599 | 0.127707 |
39 | 108 | Manhattan | Upper East Side | 219920 | 6579 | 0.029915 |
40 | 102 | Manhattan | Greenwich Village, Soho | 90016 | 6514 | 0.072365 |
41 | 211 | Bronx | Pelham Pkwy, Morris Park, Laconia | 113232 | 6448 | 0.056945 |
42 | 307 | Brooklyn | Sunset Park, Windsor Terrace | 126230 | 6364 | 0.050416 |
43 | 402 | Queens | Sunnyside, Woodside | 113200 | 6142 | 0.054258 |
44 | 404 | Queens | Elmhurst, South Corona | 172598 | 5824 | 0.033743 |
45 | 208 | Bronx | Riverdale, Kingsbridge, Marble Hill | 101731 | 5753 | 0.056551 |
46 | 411 | Queens | Bayside, Douglaston, Little Neck | 116431 | 5748 | 0.049368 |
47 | 206 | Bronx | East Tremont, Belmont | 83268 | 5746 | 0.069006 |
48 | 210 | Bronx | Throgs Nk., Co-op City, Pelham Bay | 120392 | 5719 | 0.047503 |
49 | 111 | Manhattan | East Harlem | 120511 | 5714 | 0.047415 |
50 | 414 | Queens | The Rockaways, Broad Channel | 114978 | 5284 | 0.045957 |
51 | 203 | Bronx | Morrisania, Crotona Park East | 79762 | 5262 | 0.065971 |
52 | 106 | Manhattan | Stuyvesant Town, Turtle Bay | 142745 | 5158 | 0.036134 |
53 | 316 | Brooklyn | Brownsville, Ocean Hill | 86468 | 5100 | 0.058981 |
54 | 201 | Bronx | Melrose, Mott Haven, Port Morris | 91497 | 4915 | 0.053718 |
55 | 406 | Queens | Forest Hills, Rego Park | 113257 | 4607 | 0.040677 |
56 | 313 | Brooklyn | Coney Island, Brighton Beach | 104278 | 3840 | 0.036825 |
57 | 101 | Manhattan | Battery Park City, Tribeca | 60978 | 3623 | 0.059415 |
58 | 202 | Bronx | Hunts Point, Longwood | 52246 | 3470 | 0.066417 |
Let’s check out which Community Districts have the highest complaints per capita
cd_data.sort_values("requests_per_capita", ascending=False).head(10)
boro_cd | Borough | CD Name | 2010 Population | num_311_requests | requests_per_capita | |
---|---|---|---|---|---|---|
38 | 105 | Manhattan | Midtown Business District | 51673 | 6599 | 0.127707 |
30 | 308 | Brooklyn | Crown Heights North | 96317 | 7797 | 0.080951 |
31 | 302 | Brooklyn | Brooklyn Heights, Fort Greene | 99617 | 7747 | 0.077768 |
32 | 309 | Brooklyn | Crown Heights South, Wingate | 98429 | 7571 | 0.076918 |
21 | 304 | Brooklyn | Bushwick | 112634 | 8639 | 0.076700 |
4 | 303 | Brooklyn | Bedford Stuyvesant | 152985 | 11615 | 0.075922 |
0 | 112 | Manhattan | Washington Heights, Inwood | 190020 | 14110 | 0.074255 |
22 | 110 | Manhattan | Central Harlem | 115723 | 8592 | 0.074246 |
1 | 405 | Queens | Ridgewood, Glendale, Maspeth | 169190 | 12487 | 0.073805 |
8 | 204 | Bronx | Highbridge, Concourse Village | 146441 | 10628 | 0.072575 |
While Inwood (112) had the highest number of complaints, it ranks further down on the list for requests per capita. Midtown may also be an outlier, based on it’s low residential population.
In-class exercise#
How does the per-capita distribution compare to that of the raw counts?
fig = px.histogram(districts, x="requests_per_capita", height=200)
fig.show()
fig = px.histogram(districts, x="num_311_requests", height=200)
fig.show()
Let’s improve the formatting (based on the .histogram()
documentation):
fig = px.histogram(
districts,
x="requests_per_capita",
title="Volume of 311 requests, 2018-2019",
labels={"requests_per_capita": "311 requests per capita"},
)
# y-axis needs to be done separately, since it's derived
fig.update_layout(yaxis_title_text="Number of community districts")
fig.show()
Scatterplot#
fig = px.scatter(
districts,
x="2010 Population",
y="num_311_requests",
hover_data=["boro_cd", "CD Name"],
title="Number of 311 requests per Community District by population",
)
fig.show()
fig = px.scatter(
districts,
x="2010 Population",
y="num_311_requests",
hover_data=["boro_cd", "CD Name"],
title="Number of 311 requests per Community District by population",
trendline="ols",
)
fig.show()
Let’s take a look at the statistical summary, via the statsmodels
package, following Plotly’s example:
trend_results = px.get_trendline_results(fig).iloc[0, 0]
trend_results.summary()
Dep. Variable: | y | R-squared: | 0.469 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.460 |
Method: | Least Squares | F-statistic: | 50.39 |
Date: | Mon, 04 Sep 2023 | Prob (F-statistic): | 2.18e-09 |
Time: | 21:54:17 | Log-Likelihood: | -523.81 |
No. Observations: | 59 | AIC: | 1052. |
Df Residuals: | 57 | BIC: | 1056. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 2692.4746 | 773.994 | 3.479 | 0.001 | 1142.578 | 4242.371 |
x1 | 0.0379 | 0.005 | 7.099 | 0.000 | 0.027 | 0.049 |
Omnibus: | 0.047 | Durbin-Watson: | 1.999 |
---|---|---|---|
Prob(Omnibus): | 0.977 | Jarque-Bera (JB): | 0.080 |
Skew: | -0.052 | Prob(JB): | 0.961 |
Kurtosis: | 2.853 | Cond. No. | 4.88e+05 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
“In general, the higher the R-squared, the better the model fits your data.”
Map complaint counts by CD#
We’ll follow this example, using community district GIS data.
Jump ahead to the map
First, let’s take a look at the GeoJSON data. We’re looking for what we can match our boro_cd
column up to. One way to inspect it:
Open Chrome
Install JSON Viewer
Open https://data.cityofnewyork.us/resource/jp9i-3b7y.geojson
Load the GeoJSON data using the requests package (nothing to do with 311 requests):
import requests
response = requests.get("https://data.cityofnewyork.us/resource/jp9i-3b7y.geojson")
shapes = response.json()
print("loaded")
# intentionally not outputting the data here since it's large
loaded
This is equivalent to the use of urlopen()
and json.load()
in the Plotly examples.
The structure looks something like:
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "MultiPolygon",
"coordinates": [
[
[
[-73.8718461029101, 40.843760777855834],
…
]
]
]
},
"properties": {
"boro_cd": "206",
"shape_leng": "35875.7117328",
"shape_area": "42664311.5086"
}
},
…
]
}
Peek at the properties
of one of the features
a.k.a. shapes a.k.a. Community Districts:
shapes["features"][0]["properties"]
{'boro_cd': '308',
'shape_leng': '38232.8866494',
'shape_area': '45603787.0874'}
Notes:
boro_cd
is the property we’re looking for. We’ll specify this as thefeatureidkey
.response.json()
turns JSON data into nested Python objects:shapes
is a dictionary,features
is a list beneath it, etc.
def plot_nyc(df):
fig = px.choropleth_mapbox(
df,
locations="boro_cd", # column name to match on
color="requests_per_capita", # column name for values
geojson=shapes,
featureidkey="properties.boro_cd", # GeoJSON property to match on
hover_data=["CD Name"],
center={"lat": 40.71, "lon": -73.98},
zoom=9,
mapbox_style="carto-positron",
height=600,
title="Requests per capita across Community Districts",
)
fig.show()
Wrapping this Plotly code in a function to:
Save space on subsequent slides
Make the code reusable for plotting different DataFrames
plot_nyc(districts)