Lecture 3: Data reshaping and visualization#

“Data visualization”, “chart”, “graph”, and will be used interchangeably.

Please sign attendance sheet; close devices

Ensure the visualizations render properly across VSCode, Jupyter Book, etc. You can ignore this.

import plotly.io as pio

pio.renderers.default = "colab+notebook_connected+plotly_mimetype"

Example from first class#

import plotly.express as px

df = px.data.tips()
fig = px.scatter(df, x="total_bill", y="tip", trendline="ols")
fig.show()

This includes a trendline. Let’s take a look at the statistical summary, via the statsmodels package, following Plotly’s example:

trend_results = px.get_trendline_results(fig).iloc[0, 0]
trend_results.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.457
Model: OLS Adj. R-squared: 0.454
Method: Least Squares F-statistic: 203.4
Date: Tue, 14 Apr 2026 Prob (F-statistic): 6.69e-34
Time: 17:28:13 Log-Likelihood: -350.54
No. Observations: 244 AIC: 705.1
Df Residuals: 242 BIC: 712.1
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.9203 0.160 5.761 0.000 0.606 1.235
x1 0.1050 0.007 14.260 0.000 0.091 0.120
Omnibus: 20.185 Durbin-Watson: 1.811
Prob(Omnibus): 0.000 Jarque-Bera (JB): 37.750
Skew: 0.443 Prob(JB): 6.35e-09
Kurtosis: 4.711 Cond. No. 53.0


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

“In general, the higher the R-squared, the better the model fits your data.”

In-class exercise#

Using NYC parks data, make a histogram of parks by borough. Pair with a neighbor.

Fertility rates demo#

What should we look at?

Reshaping#

Like pivot tables in spreadsheets

Mapping#

Let’s make a map. What should it be?

Choropleth maps#

Geospatial data#

To make a choropleth map, we need shapes. We’ll use country boundaries as GeoJSON from Natural Earth.

https://raw.githubusercontent.com/nvkelso/natural-earth-vector/refs/heads/master/geojson/ne_50m_admin_0_countries.geojson

The structure looks something like:

{
  "type": "FeatureCollection",
  "name": "ne_50m_admin_0_countries",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "NAME": "Fiji",
        "ADM0_ISO": "FJI",
        ...
      },
      "bbox": [
        -180,
        -18.28799,
        180,
        -16.020882
      ],
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [
          [
            [
              [180, -16.067133],
              [180, -16.555217],
              ...
            ]
          ]
        ]
      }
    },
    ...
  ]
}

ADM0_ISO is the property we’re looking for. We’ll specify this as the featureidkey.

Fun fact (for a certain kind of person): What the zoom level means

Chart hygiene#

  • Always include a title

  • Make sure you label dependent and independent variables (X and Y axes)

  • Consider whether you are working with continuous vs. discrete values

  • If you’re trying to show more than three variables at once (e.g. X axis, Y axis, and color), try simplifying

What visualization should I use?#

Rudimentary guidelines:

What do you want to do?

Chart type

Show changes over time

Line chart

Compare values for categorical data

Bar chart

Compare two numeric variables

Scatter plot

Count things / show distribution across a range

Histogram

Show geographic trends

Map (choropleth, hexbin, bubble, etc.)

The Data Design Standards goes into more detail.

Homework 3#

Final Project#

In real/ideal world, start with specific question and find data to answer it:

project flow

Source: Big Data and Social Science

Data needed often doesn’t exist or is hard (or impossible) to find/access

project flow

Final Project