Lecture 3: Data reshaping and visualization

“Data visualization”, “chart”, and “graph” will be used interchangeably.

Please sign attendance sheet; close devices

The following setup ensures the visualizations render properly across VSCode, Jupyter Book, etc. You can ignore it.

import plotly.io as pio

pio.renderers.default = "colab+notebook_connected+plotly_mimetype"

Example from first class #

import plotly.express as px

df = px.data.tips()
fig = px.scatter(df, x="total_bill", y="tip", trendline="ols")
fig.show()

This includes a trendline. Let’s take a look at the statistical summary, via the statsmodels package, following Plotly’s example:

trend_results = px.get_trendline_results(fig).iloc[0, 0]
trend_results.summary()

OLS Regression Results
Dep. Variable:	y	R-squared:	0.457
Model:	OLS	Adj. R-squared:	0.454
Method:	Least Squares	F-statistic:	203.4
Date:	Tue, 14 Apr 2026	Prob (F-statistic):	6.69e-34
Time:	17:28:13	Log-Likelihood:	-350.54
No. Observations:	244	AIC:	705.1
Df Residuals:	242	BIC:	712.1
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	0.9203	0.160	5.761	0.000	0.606	1.235
x1	0.1050	0.007	14.260	0.000	0.091	0.120

Omnibus:	20.185	Durbin-Watson:	1.811
Prob(Omnibus):	0.000	Jarque-Bera (JB):	37.750
Skew:	0.443	Prob(JB):	6.35e-09
Kurtosis:	4.711	Cond. No.	53.0

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

“In general, the higher the R-squared, the better the model fits your data.”
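In the summary above, an R-squared of 0.457 means the fitted line accounts for roughly 46% of the variation in tips. To make that number less of a black box, here is a minimal sketch that computes R-squared by hand with NumPy. The points are made up for illustration (they are not the tips dataset), and the variable names are just placeholders.

```python
import numpy as np

# Made-up points for illustration; not the tips dataset.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line (ordinary least squares with one predictor).
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# R-squared = 1 - (unexplained variation / total variation).
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))
```

With the statsmodels results object from earlier, the same value should be available as `trend_results.rsquared`, and the coefficients as `trend_results.params`.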

In-class exercise#

Using NYC parks data, make a histogram of parks by borough. Pair with a neighbor.

Fertility rates demo#

What should we look at?

Reshaping #

Like pivot tables in spreadsheets
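A minimal sketch of reshaping in pandas, assuming a small table with one column per year. The values and column names are invented for illustration and are not real statistics. `melt` goes from wide to long, and `pivot` goes back, much like building a pivot table in a spreadsheet.

```python
import pandas as pd

# Illustrative values only (not real statistics).
wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        "2000": [2.1, 3.4],
        "2010": [1.9, 3.0],
    }
)

# Wide -> long: one row per (country, year) pair.
long = wide.melt(id_vars="country", var_name="year", value_name="rate")

# Long -> wide again: one row per country, one column per year.
back_to_wide = long.pivot(index="country", columns="year", values="rate")

print(long)
print(back_to_wide)
```

Long data is usually what Plotly Express wants (one column to map to each of x, y, and color), while wide data is often easier to read as a table.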

Mapping#

Let’s make a map. What should it be?

Choropleth maps #

Geospatial data#

To make a choropleth map, we need shapes. We’ll use country boundaries as GeoJSON from Natural Earth.

https://raw.githubusercontent.com/nvkelso/natural-earth-vector/refs/heads/master/geojson/ne_50m_admin_0_countries.geojson

The structure looks something like:

{
  "type": "FeatureCollection",
  "name": "ne_50m_admin_0_countries",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "NAME": "Fiji",
        "ADM0_ISO": "FJI",
        ...
      },
      "bbox": [
        -180,
        -18.28799,
        180,
        -16.020882
      ],
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [
          [
            [
              [180, -16.067133],
              [180, -16.555217],
              ...
            ]
          ]
        ]
      }
    },
    ...
  ]
}

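ADM0_ISO, a three-letter country code (e.g. FJI), is the property we’re looking for. We’ll specify this as the featureidkey, written as "properties.ADM0_ISO" since it sits inside each feature’s properties.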
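Before drawing the map, it can help to check that the codes in your data actually match the codes in the shapes. The sketch below uses a trimmed-down stand-in for the Natural Earth file (same structure, geometry omitted) and a hypothetical DataFrame with made-up `iso3` and `value` columns.

```python
import pandas as pd

# A trimmed-down stand-in for the Natural Earth file: same structure,
# geometry omitted, and only two features.
countries = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"NAME": "Fiji", "ADM0_ISO": "FJI"}},
        {"type": "Feature", "properties": {"NAME": "Canada", "ADM0_ISO": "CAN"}},
    ],
}

# Hypothetical data to map; `iso3` and `value` are made-up column names.
df = pd.DataFrame({"iso3": ["FJI", "CAN", "XXX"], "value": [1.0, 2.0, 3.0]})

# Collect the IDs that featureidkey="properties.ADM0_ISO" would look up.
feature_ids = {f["properties"]["ADM0_ISO"] for f in countries["features"]}

# Rows whose code has no matching shape would not be drawn on the map.
unmatched = df[~df["iso3"].isin(feature_ids)]
print(unmatched)
```

With the full GeoJSON loaded, the map itself would then be something along the lines of `px.choropleth(df, geojson=countries, locations="iso3", featureidkey="properties.ADM0_ISO", color="value")`.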

Fun fact (for a certain kind of person): What the zoom level means

Chart hygiene#

Always include a title
Make sure you label dependent and independent variables (X and Y axes)
Consider whether you are working with continuous vs. discrete values
If you’re trying to show more than three variables at once (e.g. X axis, Y axis, and color), try simplifying

What visualization should I use?#

Rudimentary guidelines:

What do you want to do?	Chart type
Show changes over time	Line chart
Compare values for categorical data	Bar chart
Compare two numeric variables	Scatter plot
Count things / show distribution across a range	Histogram
Show geographic trends	Map (choropleth, hexbin, bubble, etc.)

The Data Design Standards goes into more detail.

Conditionals review#

If there’s time

Pure (“Purr”) Python#

Example: Make a function that checks if the given name is one of my cats

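# name = input("Name: ")
name = "Wilbur"


def test_cats(word):
    """Return True if `word` is one of my cats' names, ignoring case."""
    print(word)

    # if / elif / else: only the first branch whose condition is true runs.
    if word.lower() == "blondie" or word.lower() == "wilbur":
        return True
    # elif otherthing:  # placeholder for any further condition
    #     stuff
    else:
        return False

    # Versus two separate `if` statements, where each condition is checked
    # independently, so both bodies can run:
    #
    # if word.lower() == "blondie" or word.lower() == "wilbur":
    #     something
    # if otherthing:
    #     stuff


test_cats(name)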

Wilbur

True

False or name

'Wilbur'

name == ("blondie" or "wilbur")

False

"blondie" or "wilbur"

'blondie'
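The outputs above show why `name == ("blondie" or "wilbur")` doesn’t do what it looks like: `or` returns its first truthy operand, so the parentheses collapse to `"blondie"` before the comparison happens. A minimal sketch of one alternative is a membership test with `in`. The names `CAT_NAMES` and `is_my_cat` are made up for this example.

```python
# Hypothetical names for illustration.
CAT_NAMES = {"blondie", "wilbur"}


def is_my_cat(word):
    """Return True if `word` matches a cat name, ignoring case."""
    return word.lower() in CAT_NAMES


print(is_my_cat("Wilbur"))  # True
print(is_my_cat("Rex"))  # False

# The pitfall: ("blondie" or "wilbur") is just "blondie".
print("wilbur" == ("blondie" or "wilbur"))  # False
```

Using a set also makes it easy to add another cat later without growing a chain of `or` comparisons.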

Pandas#

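Boolean operators* are different in pandas, since you’re working with full columns instead of single values: conditions are combined with & and | rather than and and or, and each condition needs its own parentheses.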

*You may see these referred to as “bitwise operators”, though that name isn’t quite accurate.

A sample dataset from Plotly:

import plotly

medals = plotly.data.medals_wide()
medals

	nation	gold	silver	bronze
0	South Korea	24	13	11
1	China	10	15	8
2	Canada	9	12	12

medals["gold"] >= 10

0     True
1     True
2    False
Name: gold, dtype: bool

medals["silver"] >= 14

0    False
1     True
2    False
Name: silver, dtype: bool

(medals["gold"] >= 10) & (medals["silver"] >= 14)

0    False
1     True
2    False
dtype: bool

medals[(medals["gold"] >= 10) & (medals["silver"] >= 14)]

	nation	gold	silver	bronze
1	China	10	15	8

Refactored:

high_gold = medals["gold"] >= 10
high_silver = medals["silver"] >= 14

medals[high_gold & high_silver]

	nation	gold	silver	bronze
1	China	10	15	8
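The same pattern extends to "or" (`|`) and "not" (`~`). The sketch below rebuilds the three medals rows by hand so it runs without Plotly, and also shows what happens if you reach for Python’s `and` by mistake. The names `either` and `neither` are just placeholders.

```python
import pandas as pd

# The same three rows as the medals table above, typed in by hand.
medals = pd.DataFrame(
    {
        "nation": ["South Korea", "China", "Canada"],
        "gold": [24, 10, 9],
        "silver": [13, 15, 12],
        "bronze": [11, 8, 12],
    }
)

high_gold = medals["gold"] >= 10
high_silver = medals["silver"] >= 14

either = medals[high_gold | high_silver]  # at least one condition holds
neither = medals[~(high_gold | high_silver)]  # no condition holds

# Python's `and` needs a single True/False, so it fails on a whole column.
try:
    high_gold and high_silver
except ValueError as error:
    print("ValueError:", error)

print(either["nation"].tolist())  # ['South Korea', 'China']
print(neither["nation"].tolist())  # ['Canada']
```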

Homework 3 #

Final Project#

In the real/ideal world, start with a specific question and find data to answer it:

[Figure: project flow]

Source: Big Data and Social Science

Data needed often doesn’t exist or is hard (or impossible) to find/access