Class 3: Data visualization#

“Data visualization”, “chart”, “graph”, and will be used interchangeably.

Today’s goal: Visualizing requests per community district#

This should help us better understand trends across the city.

Start by importing necessary packages#

import pandas as pd
import plotly.express as px

Populations#

Load the data:

population = pd.read_csv("https://data.cityofnewyork.us/api/views/xi7c-iiu2/rows.csv")
population.head()
Borough CD Number CD Name 1970 Population 1980 Population 1990 Population 2000 Population 2010 Population
0 Bronx 1 Melrose, Mott Haven, Port Morris 138557 78441 77214 82159 91497
1 Bronx 2 Hunts Point, Longwood 99493 34399 39443 46824 52246
2 Bronx 3 Morrisania, Crotona Park East 150636 53635 57162 68574 79762
3 Bronx 4 Highbridge, Concourse Village 144207 114312 119962 139563 146441
4 Bronx 5 University Hts., Fordham, Mt. Hope 121807 107995 118435 128313 128200

Adapting the basic histogram example:

fig = px.histogram(
    population,
    x="Borough",
    title="Number of community districts in each borough",
)
fig.show()

Histogram values are equivalent to doing a .groupby().size().

fig = px.histogram(
    population,
    x="2010 Population",
    title="Distribution of Community District populations, 2010",
    nbins=20,
)
fig.show()

Data from where we left off last class#

Derived dataset containing count of complaints per community district.

districts = pd.read_csv(
    "https://storage.googleapis.com/python-public-policy2/data/community_district_311.csv.zip"
)
districts.head()
Community Board num_311_requests boro_cd Borough CD Number CD Name 1970 Population 1980 Population 1990 Population 2000 Population 2010 Population
0 12 MANHATTAN 14110 112 Manhattan 12 Washington Heights, Inwood 180561 179941 198192 208414 190020
1 05 QUEENS 12487 405 Queens 5 Ridgewood, Glendale, Maspeth 161022 150142 149126 165911 169190
2 12 QUEENS 12228 412 Queens 12 Jamaica, St. Albans, Hollis 206639 189383 201293 223602 225919
3 01 BROOKLYN 11863 301 Brooklyn 1 Williamsburg, Greenpoint 179390 142942 155972 160338 173083
4 03 BROOKLYN 11615 303 Brooklyn 3 Bedford Stuyvesant 203380 133379 138696 143867 152985

Looking at raw volume is probably less useful than density.

Calculate 311 requests per capita#

Divide request count by 2010 population to get requests per capita

districts["requests_per_capita"] = districts["num_311_requests"] / districts["2010 Population"]

districts.head()
Community Board num_311_requests boro_cd Borough CD Number CD Name 1970 Population 1980 Population 1990 Population 2000 Population 2010 Population requests_per_capita
0 12 MANHATTAN 14110 112 Manhattan 12 Washington Heights, Inwood 180561 179941 198192 208414 190020 0.074255
1 05 QUEENS 12487 405 Queens 5 Ridgewood, Glendale, Maspeth 161022 150142 149126 165911 169190 0.073805
2 12 QUEENS 12228 412 Queens 12 Jamaica, St. Albans, Hollis 206639 189383 201293 223602 225919 0.054126
3 01 BROOKLYN 11863 301 Brooklyn 1 Williamsburg, Greenpoint 179390 142942 155972 160338 173083 0.068539
4 03 BROOKLYN 11615 303 Brooklyn 3 Bedford Stuyvesant 203380 133379 138696 143867 152985 0.075922

Let’s create a simplified new dataframe that only include the columns we care about and in a better order.

columns = [
    "boro_cd",
    "Borough",
    "CD Name",
    "2010 Population",
    "num_311_requests",
    "requests_per_capita",
]
cd_data = districts[columns]

cd_data
boro_cd Borough CD Name 2010 Population num_311_requests requests_per_capita
0 112 Manhattan Washington Heights, Inwood 190020 14110 0.074255
1 405 Queens Ridgewood, Glendale, Maspeth 169190 12487 0.073805
2 412 Queens Jamaica, St. Albans, Hollis 225919 12228 0.054126
3 301 Brooklyn Williamsburg, Greenpoint 173083 11863 0.068539
4 303 Brooklyn Bedford Stuyvesant 152985 11615 0.075922
5 501 Staten Island Stapleton, Port Richmond 175756 11438 0.065079
6 407 Queens Flushing, Bay Terrace 247354 11210 0.045320
7 305 Brooklyn East New York, Starrett City 182896 10862 0.059389
8 204 Bronx Highbridge, Concourse Village 146441 10628 0.072575
9 401 Queens Astoria, Long Island City 191105 10410 0.054473
10 317 Brooklyn East Flatbush, Rugby, Farragut 155252 10208 0.065751
11 314 Brooklyn Flatbush, Midwood 160664 10179 0.063356
12 207 Bronx Bedford Park, Norwood, Fordham 139286 9841 0.070653
13 318 Brooklyn Canarsie, Flatlands 193543 9717 0.050206
14 409 Queens Woodhaven, Richmond Hill 143317 9599 0.066977
15 413 Queens Queens Village, Rosedale 188593 9459 0.050156
16 212 Bronx Wakefield, Williamsbridge 152344 9412 0.061781
17 311 Brooklyn Bensonhurst, Bath Beach 181981 9309 0.051154
18 205 Bronx University Hts., Fordham, Mt. Hope 128200 9094 0.070936
19 103 Manhattan Lower East Side, Chinatown 163277 8905 0.054539
20 408 Queens Fresh Meadows, Briarwood 151107 8661 0.057317
21 304 Brooklyn Bushwick 112634 8639 0.076700
22 110 Manhattan Central Harlem 115723 8592 0.074246
23 503 Staten Island Tottenville, Woodrow, Great Kills 160209 8524 0.053206
24 315 Brooklyn Sheepshead Bay, Gerritsen Beach 159650 8508 0.053292
25 410 Queens Ozone Park, Howard Beach 122396 8333 0.068082
26 312 Brooklyn Borough Park, Ocean Parkway 191382 8214 0.042919
27 107 Manhattan West Side, Upper West Side 209084 8141 0.038937
28 209 Bronx Soundview, Parkchester 172298 8092 0.046965
29 310 Brooklyn Bay Ridge, Dyker Heights 124491 7808 0.062719
30 308 Brooklyn Crown Heights North 96317 7797 0.080951
31 302 Brooklyn Brooklyn Heights, Fort Greene 99617 7747 0.077768
32 309 Brooklyn Crown Heights South, Wingate 98429 7571 0.076918
33 306 Brooklyn Park Slope, Carroll Gardens 104709 7373 0.070414
34 502 Staten Island New Springville, South Beach 132003 7059 0.053476
35 403 Queens Jackson Heights, North Corona 171576 6799 0.039627
36 109 Manhattan Manhattanville, Hamilton Heights 110193 6792 0.061637
37 104 Manhattan Chelsea, Clinton 103245 6765 0.065524
38 105 Manhattan Midtown Business District 51673 6599 0.127707
39 108 Manhattan Upper East Side 219920 6579 0.029915
40 102 Manhattan Greenwich Village, Soho 90016 6514 0.072365
41 211 Bronx Pelham Pkwy, Morris Park, Laconia 113232 6448 0.056945
42 307 Brooklyn Sunset Park, Windsor Terrace 126230 6364 0.050416
43 402 Queens Sunnyside, Woodside 113200 6142 0.054258
44 404 Queens Elmhurst, South Corona 172598 5824 0.033743
45 208 Bronx Riverdale, Kingsbridge, Marble Hill 101731 5753 0.056551
46 411 Queens Bayside, Douglaston, Little Neck 116431 5748 0.049368
47 206 Bronx East Tremont, Belmont 83268 5746 0.069006
48 210 Bronx Throgs Nk., Co-op City, Pelham Bay 120392 5719 0.047503
49 111 Manhattan East Harlem 120511 5714 0.047415
50 414 Queens The Rockaways, Broad Channel 114978 5284 0.045957
51 203 Bronx Morrisania, Crotona Park East 79762 5262 0.065971
52 106 Manhattan Stuyvesant Town, Turtle Bay 142745 5158 0.036134
53 316 Brooklyn Brownsville, Ocean Hill 86468 5100 0.058981
54 201 Bronx Melrose, Mott Haven, Port Morris 91497 4915 0.053718
55 406 Queens Forest Hills, Rego Park 113257 4607 0.040677
56 313 Brooklyn Coney Island, Brighton Beach 104278 3840 0.036825
57 101 Manhattan Battery Park City, Tribeca 60978 3623 0.059415
58 202 Bronx Hunts Point, Longwood 52246 3470 0.066417

Let’s check out which Community Districts have the highest complaints per capita

cd_data.sort_values("requests_per_capita", ascending=False).head(10)
boro_cd Borough CD Name 2010 Population num_311_requests requests_per_capita
38 105 Manhattan Midtown Business District 51673 6599 0.127707
30 308 Brooklyn Crown Heights North 96317 7797 0.080951
31 302 Brooklyn Brooklyn Heights, Fort Greene 99617 7747 0.077768
32 309 Brooklyn Crown Heights South, Wingate 98429 7571 0.076918
21 304 Brooklyn Bushwick 112634 8639 0.076700
4 303 Brooklyn Bedford Stuyvesant 152985 11615 0.075922
0 112 Manhattan Washington Heights, Inwood 190020 14110 0.074255
22 110 Manhattan Central Harlem 115723 8592 0.074246
1 405 Queens Ridgewood, Glendale, Maspeth 169190 12487 0.073805
8 204 Bronx Highbridge, Concourse Village 146441 10628 0.072575

While Inwood (112) had the highest number of complaints, it ranks further down on the list for requests per capita. Midtown may also be an outlier, based on it’s low residential population.

# cd_data.to_csv("data/311_community_districts.csv", index=False)

In-class exercise#

How does the per-capita distribution compare to that of the raw counts?

fig = px.histogram(districts, x="requests_per_capita", height=200)
fig.show()
fig = px.histogram(districts, x="num_311_requests", height=200)
fig.show()

Let’s improve the formatting (based on the .histogram() documentation):

fig = px.histogram(
    districts,
    x="requests_per_capita",
    title="Volume of 311 requests, 2018-2019",
    labels={"requests_per_capita": "311 requests per capita"},
)

# y-axis needs to be done separately, since it's derived
fig.update_layout(yaxis_title_text="Number of community districts")
fig.show()

Scatterplot#

fig = px.scatter(
    districts,
    x="2010 Population",
    y="num_311_requests",
    title="Number of 311 requests per Community District by population",
)

fig.show()

Add a trendline:

fig = px.scatter(
    districts,
    x="2010 Population",
    y="num_311_requests",
    title="Number of 311 requests per Community District by population",
    trendline="ols",
)

fig.show()

Let’s take a look at the statistical summary, via the statsmodels package, following Plotly’s example:

trend_results = px.get_trendline_results(fig).iloc[0, 0]
trend_results.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.469
Model: OLS Adj. R-squared: 0.460
Method: Least Squares F-statistic: 50.39
Date: Mon, 18 Mar 2024 Prob (F-statistic): 2.18e-09
Time: 17:43:34 Log-Likelihood: -523.81
No. Observations: 59 AIC: 1052.
Df Residuals: 57 BIC: 1056.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2692.4746 773.994 3.479 0.001 1142.578 4242.371
x1 0.0379 0.005 7.099 0.000 0.027 0.049
Omnibus: 0.047 Durbin-Watson: 1.999
Prob(Omnibus): 0.977 Jarque-Bera (JB): 0.080
Skew: -0.052 Prob(JB): 0.961
Kurtosis: 2.853 Cond. No. 4.88e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.88e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

“In general, the higher the R-squared, the better the model fits your data.”

Map complaint counts by CD#

We’ll follow this example, using community district GIS data.

Jump ahead to the map

First, let’s take a look at the GeoJSON data. We’re looking for what we can match our boro_cd column up to. One way to inspect it:

  1. Open Chrome

  2. Install JSON Viewer

  3. Open https://data.cityofnewyork.us/resource/jp9i-3b7y.geojson

Load the GeoJSON data using the requests package (nothing to do with 311 requests):

import requests

response = requests.get("https://data.cityofnewyork.us/resource/jp9i-3b7y.geojson")
shapes = response.json()
print("loaded")

# intentionally not outputting the data here since it's large
loaded

This is equivalent to the use of urlopen() and json.load() in the Plotly examples.

The structure looks something like:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [
          [
            [
              [-73.8718461029101, 40.843760777855834],
              …
            ]
          ]
        ]
      },
      "properties": {
        "boro_cd": "206",
        "shape_leng": "35875.7117328",
        "shape_area": "42664311.5086"
      }
    },
    …
  ]
}

Peek at the properties of one of the features a.k.a. shapes a.k.a. Community Districts:

shapes["features"][0]["properties"]
{'boro_cd': '308',
 'shape_leng': '38232.8866493',
 'shape_area': '45603787.1384'}

Notes:

  • boro_cd is the property we’re looking for. We’ll specify this as the featureidkey.

  • response.json() turns JSON data into nested Python objects: shapes is a dictionary, features is a list beneath it, etc.

def plot_nyc(df):
    fig = px.choropleth_mapbox(
        df,
        locations="boro_cd",  # column name to match on
        color="requests_per_capita",  # column name for values
        geojson=shapes,
        featureidkey="properties.boro_cd",  # GeoJSON property to match on
        center={"lat": 40.71, "lon": -73.98},
        zoom=9,
        mapbox_style="carto-positron",
        height=600,
        title="Requests per capita across Community Districts",
    )

    fig.show()

Wrapping this Plotly code in a function to:

  • Save space on subsequent slides

  • Make the code reusable for plotting different DataFrames

plot_nyc(districts)

Midtown, as an outlier, is skewing our results. Let’s exclude it.

no_midtown = districts[districts["boro_cd"] != 105]
plot_nyc(no_midtown)

Fun fact (for a certain kind of person): What the Mapbox zoom level means

Chart hygiene#

  • Always include a title

  • Make sure you label dependent and independent variables (X and Y axes)

  • Consider whether you are working with continuous vs. discrete values

  • If you’re trying to show more than three variables at once (e.g. X axis, Y axis, and color), try simplifying

What visualization should I use?#

Rudimentary guidelines:

What do you want to do?

Chart type

Show changes over time

Line chart

Compare values for categorical data

Bar chart

Compare two numeric variables

Scatter plot

Count things / show distribution across a range

Histogram

Show geographic trends

Map (choropleth, hexbin, bubble, etc.)

The Data Design Standards goes into more detail.

Pivoting#

FYI: Pandas supports reshaping DataFrames through pivoting, like spreadsheets do.

Homework 3#

Final Project#

In real/ideal world, start with specific question and find data to answer it:

project flow

Source: Big Data and Social Science

Data needed often doesn’t exist or is hard (or impossible) to find/access

project flow

Final Project