Class 4: Dates and time series analysis#

Dates and time series analysis#

From Wikipedia:

A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.

What are time series you’ve seen / worked with?

Is the 311 data a time series?

As an example, we’ll look at the frequency of 311 requests:

  • Over time

  • By day of the week

  • By hour

Importing necessary packages#

import pandas as pd
import plotly.express as px

Data preparation#

Load 311 data.

requests = pd.read_csv(
    "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip"
)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_27164/461306798.py:1: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(
requests[["Created Date", "Closed Date"]]
Created Date Closed Date
0 08/01/2018 12:05:13 AM 08/01/2018 12:05:13 AM
1 08/01/2018 12:06:05 AM 08/01/2018 12:06:05 AM
2 08/01/2018 12:06:16 AM 08/03/2018 02:03:55 PM
3 08/01/2018 12:06:29 AM 08/01/2018 02:54:24 AM
4 08/01/2018 12:06:51 AM 08/01/2018 04:54:26 AM
... ... ...
499953 08/24/2019 01:46:09 AM NaN
499954 08/24/2019 01:49:49 AM NaN
499955 08/24/2019 01:56:35 AM NaN
499956 08/24/2019 01:56:40 AM NaN
499957 08/24/2019 01:57:58 AM NaN

499958 rows × 2 columns

Get the minimum and maximum Created Date:

requests["Created Date"].min()
'01/01/2019 01:00:00 PM'
requests["Created Date"].max()
'12/31/2018 12:59:37 AM'

Anything you notice about that?

The dates are stored as strings (objects):

requests.dtypes
Unique Key                          int64
Created Date                       object
Closed Date                        object
Agency                             object
Agency Name                        object
Complaint Type                     object
Descriptor                         object
Location Type                      object
Incident Zip                       object
Incident Address                   object
Street Name                        object
Cross Street 1                     object
Cross Street 2                     object
Intersection Street 1              object
Intersection Street 2              object
Address Type                       object
City                               object
Landmark                           object
Facility Type                      object
Status                             object
Due Date                           object
Resolution Description             object
Resolution Action Updated Date     object
Community Board                    object
BBL                               float64
Borough                            object
X Coordinate (State Plane)        float64
Y Coordinate (State Plane)        float64
Open Data Channel Type             object
Park Facility Name                 object
Park Borough                       object
Vehicle Type                       object
Taxi Company Borough               object
Taxi Pick Up Location              object
Bridge Highway Name                object
Bridge Highway Direction           object
Road Ramp                          object
Bridge Highway Segment             object
Latitude                          float64
Longitude                         float64
Location                           object
dtype: object

Convert columns to timestamps using pandas’ to_datetime()#

requests["Created Date"] = pd.to_datetime(requests["Created Date"], format="%m/%d/%Y %I:%M:%S %p")
requests["Closed Date"] = pd.to_datetime(requests["Closed Date"], format="%m/%d/%Y %I:%M:%S %p")
requests[["Created Date", "Closed Date"]]
Created Date Closed Date
0 2018-08-01 00:05:13 2018-08-01 00:05:13
1 2018-08-01 00:06:05 2018-08-01 00:06:05
2 2018-08-01 00:06:16 2018-08-03 14:03:55
3 2018-08-01 00:06:29 2018-08-01 02:54:24
4 2018-08-01 00:06:51 2018-08-01 04:54:26
... ... ...
499953 2019-08-24 01:46:09 NaT
499954 2019-08-24 01:49:49 NaT
499955 2019-08-24 01:56:35 NaT
499956 2019-08-24 01:56:40 NaT
499957 2019-08-24 01:57:58 NaT

499958 rows × 2 columns

More about the format string. If you don’t provide one, it will take much longer to convert.

Check data types and confirm they are now datetime:

requests.dtypes
Unique Key                                 int64
Created Date                      datetime64[ns]
Closed Date                       datetime64[ns]
Agency                                    object
Agency Name                               object
Complaint Type                            object
Descriptor                                object
Location Type                             object
Incident Zip                              object
Incident Address                          object
Street Name                               object
Cross Street 1                            object
Cross Street 2                            object
Intersection Street 1                     object
Intersection Street 2                     object
Address Type                              object
City                                      object
Landmark                                  object
Facility Type                             object
Status                                    object
Due Date                                  object
Resolution Description                    object
Resolution Action Updated Date            object
Community Board                           object
BBL                                      float64
Borough                                   object
X Coordinate (State Plane)               float64
Y Coordinate (State Plane)               float64
Open Data Channel Type                    object
Park Facility Name                        object
Park Borough                              object
Vehicle Type                              object
Taxi Company Borough                      object
Taxi Pick Up Location                     object
Bridge Highway Name                       object
Bridge Highway Direction                  object
Road Ramp                                 object
Bridge Highway Segment                    object
Latitude                                 float64
Longitude                                float64
Location                                  object
dtype: object
requests["Created Date"].min()
Timestamp('2018-08-01 00:05:13')
requests["Created Date"].max()
Timestamp('2019-08-24 01:57:58')

In-class exercise#

Let’s do up through Step 2 of Homework 4.

Noise complaints per day#

noise = requests[requests["Complaint Type"] == "Noise - Residential"]
noise_per_day = noise.resample("D", on="Created Date").size().reset_index(name="count_requests")

noise_per_day
Created Date count_requests
0 2018-08-01 50
1 2018-08-02 49
2 2018-08-03 65
3 2018-08-04 162
4 2018-08-05 191
... ... ...
384 2019-08-20 60
385 2019-08-21 46
386 2019-08-22 62
387 2019-08-23 109
388 2019-08-24 33

389 rows × 2 columns

Resampling#

Once you have a column with datetime objects, pandas can manipulate them directly. From the User Guide:

resample() is a time-based groupby

.resample('D', on='Created Date')

The 'D' is the offset alias, i.e. the desired frequency.

fig = px.line(
    noise_per_day,
    x="Created Date",
    y="count_requests",
    title="Noise complaints per day",
)
fig.show()

How about a rolling average?

noise_per_day_rolling = noise_per_day.rolling("7D", on="Created Date").mean()

fig = px.line(
    noise_per_day_rolling,
    x="Created Date",
    y="count_requests",
    title="7-day rolling average of noise complaints per day",
)
fig.show()

Let’s try weekly:

noise_per_week = noise.resample("W", on="Created Date").size().reset_index(name="count_requests")

fig = px.line(
    noise_per_week,
    x="Created Date",
    y="count_requests",
    title="Noise complaints per week",
)
fig.show()

Resampling with other grouping#

noise.resample("W", on="Created Date").size()

# can be rewrritten as

noise.groupby([pd.Grouper(key="Created Date", freq="W")]).size()
Created Date
2018-08-05     517
2018-08-12     569
2018-08-19     677
2018-08-26     856
2018-09-02     783
2018-09-09     770
2018-09-16     785
2018-09-23     793
2018-09-30     842
2018-10-07     725
2018-10-14     687
2018-10-21     715
2018-10-28     714
2018-11-04     722
2018-11-11     651
2018-11-18     619
2018-11-25     713
2018-12-02     644
2018-12-09     671
2018-12-16     645
2018-12-23     543
2018-12-30     748
2019-01-06     784
2019-01-13     615
2019-01-20     591
2019-01-27     658
2019-02-03     673
2019-02-10     597
2019-02-17     645
2019-02-24     700
2019-03-03     624
2019-03-10     638
2019-03-17     721
2019-03-24     666
2019-03-31     743
2019-04-07     819
2019-04-14     815
2019-04-21     748
2019-04-28     801
2019-05-05     867
2019-05-12     876
2019-05-19     962
2019-05-26    1048
2019-06-02    1197
2019-06-09     989
2019-06-16    1074
2019-06-23    1009
2019-06-30     726
2019-07-07     903
2019-07-14     719
2019-07-21     565
2019-07-28     674
2019-08-04     609
2019-08-11     741
2019-08-18     752
2019-08-25     373
Freq: W-SUN, dtype: int64

This means you can add other columns to group by:

noise.groupby([pd.Grouper(key="Created Date", freq="W"), "Borough"]).size()
Created Date  Borough      
2018-08-05    BRONX            134
              BROOKLYN         157
              MANHATTAN         90
              QUEENS           112
              STATEN ISLAND     23
                              ... 
2019-08-25    BROOKLYN         126
              MANHATTAN         69
              QUEENS            75
              STATEN ISLAND     16
              Unspecified        2
Length: 310, dtype: int64

In-class exercise#

Let’s do up through Step 5 of Homework 4.

Does the frequency of noise complaints vary by day of the week?#

Get the day of the week for each 311 request#

Add column using the time/date component.

noise_per_day["weekday_name"] = noise_per_day["Created Date"].dt.day_name()
noise_per_day["weekday"] = noise_per_day["Created Date"].dt.weekday

noise_per_day[["Created Date", "weekday_name", "weekday"]]
Created Date weekday_name weekday
0 2018-08-01 Wednesday 2
1 2018-08-02 Thursday 3
2 2018-08-03 Friday 4
3 2018-08-04 Saturday 5
4 2018-08-05 Sunday 6
... ... ... ...
384 2019-08-20 Tuesday 1
385 2019-08-21 Wednesday 2
386 2019-08-22 Thursday 3
387 2019-08-23 Friday 4
388 2019-08-24 Saturday 5

389 rows × 3 columns

Find the median count of 311 requests per weekday#

# having the (numeric) `weekday` first ensures they're in order
columns = ["weekday", "weekday_name"]
noise_weekday = (
    noise_per_day.groupby(columns)["count_requests"].median().reset_index(name="median_requests")
)

noise_weekday
weekday weekday_name median_requests
0 0 Monday 76.0
1 1 Tuesday 65.0
2 2 Wednesday 62.0
3 3 Thursday 66.0
4 4 Friday 93.0
5 5 Saturday 160.5
6 6 Sunday 170.0
fig = px.bar(
    noise_weekday,
    x="weekday_name",
    y="median_requests",
    title="Noise complaints per day of week",
)
fig.show()

What about by time of day?#

Get count of noise complaints per individual date and hour:

noise_per_date_hour = (
    noise.resample("H", on="Created Date").size().reset_index(name="count_requests")
)

# create a column for the hour number, so we can group on it
noise_per_date_hour["hour"] = noise_per_date_hour["Created Date"].dt.hour

noise_per_date_hour
Created Date count_requests hour
0 2018-08-01 00:00:00 6 0
1 2018-08-01 01:00:00 3 1
2 2018-08-01 02:00:00 4 2
3 2018-08-01 03:00:00 1 3
4 2018-08-01 04:00:00 0 4
... ... ... ...
9309 2019-08-23 21:00:00 10 21
9310 2019-08-23 22:00:00 24 22
9311 2019-08-23 23:00:00 20 23
9312 2019-08-24 00:00:00 18 0
9313 2019-08-24 01:00:00 15 1

9314 rows × 3 columns

Get the median count of complaints per hour:

noise_hour = (
    noise_per_date_hour.groupby("hour")["count_requests"]
    .median()
    .reset_index(name="median_requests")
)
noise_hour
hour median_requests
0 0 7.0
1 1 5.0
2 2 3.0
3 3 2.0
4 4 1.0
5 5 1.0
6 6 1.0
7 7 1.0
8 8 1.0
9 9 1.0
10 10 1.5
11 11 1.0
12 12 2.0
13 13 2.0
14 14 2.0
15 15 2.0
16 16 2.0
17 17 3.0
18 18 4.0
19 19 5.0
20 20 6.0
21 21 7.0
22 22 11.0
23 23 10.0
fig = px.line(noise_hour, x="hour", y="median_requests", title="Noise complaints per hour")
fig.show()

Resampling vs. time/date components#

  • Use resampling when you want to work with dates as continuous values, e.g. points in time

  • Use date components when you want to work with dates as categorical values, e.g. month number, day of week

Which 311 complaints take the longest to resolve?#

# calculate the amount of time that passed between Created Date and Closed Date

requests["resolution_duration"] = requests["Closed Date"] - requests["Created Date"]

# print head to check results

requests[["Closed Date", "Created Date", "resolution_duration"]].head()
Closed Date Created Date resolution_duration
0 2018-08-01 00:05:13 2018-08-01 00:05:13 0 days 00:00:00
1 2018-08-01 00:06:05 2018-08-01 00:06:05 0 days 00:00:00
2 2018-08-03 14:03:55 2018-08-01 00:06:16 2 days 13:57:39
3 2018-08-01 02:54:24 2018-08-01 00:06:29 0 days 02:47:55
4 2018-08-01 04:54:26 2018-08-01 00:06:51 0 days 04:47:35
requests.dtypes
Unique Key                                  int64
Created Date                       datetime64[ns]
Closed Date                        datetime64[ns]
Agency                                     object
Agency Name                                object
Complaint Type                             object
Descriptor                                 object
Location Type                              object
Incident Zip                               object
Incident Address                           object
Street Name                                object
Cross Street 1                             object
Cross Street 2                             object
Intersection Street 1                      object
Intersection Street 2                      object
Address Type                               object
City                                       object
Landmark                                   object
Facility Type                              object
Status                                     object
Due Date                                   object
Resolution Description                     object
Resolution Action Updated Date             object
Community Board                            object
BBL                                       float64
Borough                                    object
X Coordinate (State Plane)                float64
Y Coordinate (State Plane)                float64
Open Data Channel Type                     object
Park Facility Name                         object
Park Borough                               object
Vehicle Type                               object
Taxi Company Borough                       object
Taxi Pick Up Location                      object
Bridge Highway Name                        object
Bridge Highway Direction                   object
Road Ramp                                  object
Bridge Highway Segment                     object
Latitude                                  float64
Longitude                                 float64
Location                                   object
resolution_duration               timedelta64[ns]
dtype: object
requests["resolution_duration"].median()
Timedelta('1 days 05:19:58')

Ignore empty values:

df_clean = requests.dropna(subset=["resolution_duration"])
median_durations = df_clean.groupby("Complaint Type")["resolution_duration"].median()

median_durations.nlargest(15).reset_index(name="median_duration")
Complaint Type median_duration
0 For Hire Vehicle Complaint 98 days 15:08:41
1 Taxi Complaint 98 days 09:37:53
2 Graffiti 89 days 11:29:38
3 New Tree Request 83 days 16:56:30
4 Food Establishment 60 days 15:45:06
5 Facades 55 days 21:04:26
6 Sustainability Enforcement 47 days 12:47:23
7 Radioactive Material 46 days 01:01:18
8 Drinking Water 35 days 00:24:20
9 Bike Rack Condition 33 days 22:22:32
10 Highway Sign - Missing 32 days 01:38:15.500000
11 School Maintenance 28 days 20:56:12.500000
12 Vacant Lot 27 days 23:44:30
13 Transportation Provider Complaint 24 days 20:15:41
14 Plumbing 20 days 11:44:44
median_durations.nsmallest(15).reset_index(name="median_duration")
Complaint Type median_duration
0 BEST/Site Safety 0 days 00:00:00
1 Construction Safety Enforcement 0 days 00:00:00
2 Derelict Vehicles 0 days 00:00:00
3 Miscellaneous Categories 0 days 00:00:00
4 Quality of Life 0 days 00:00:00
5 Street Light Condition 0 days 00:00:00
6 Taxi Report 0 days 00:00:34
7 Benefit Card Replacement 0 days 00:00:39
8 For Hire Vehicle Report 0 days 00:00:42
9 Advocate-Personal Exemptions 0 days 00:05:39
10 Advocate-Co-opCondo Abatement 0 days 00:07:23
11 Advocate-Prop Refunds/Credits 0 days 00:07:41
12 Advocate - Other 0 days 00:08:37.500000
13 Taxpayer Advocate Inquiry 0 days 00:08:51
14 DPR Internal 0 days 00:25:41

Filtering timestamps#

Noise complaints over New Year’s.

after_dec_31 = requests["Created Date"] >= "2018-12-31"
before_jan_2 = requests["Created Date"] < "2019-01-02"
residential_only = requests["Complaint Type"] == "Noise - Residential"

display_columns = ["Created Date", "Complaint Type"]
requests[after_dec_31 & before_jan_2 & residential_only][display_columns]
Created Date Complaint Type
201471 2018-12-31 00:03:37 Noise - Residential
201473 2018-12-31 00:05:12 Noise - Residential
201475 2018-12-31 00:07:00 Noise - Residential
201478 2018-12-31 00:11:18 Noise - Residential
201481 2018-12-31 00:19:44 Noise - Residential
... ... ...
203413 2019-01-01 22:59:22 Noise - Residential
203415 2019-01-01 23:03:03 Noise - Residential
203416 2019-01-01 23:04:27 Noise - Residential
203420 2019-01-01 23:10:46 Noise - Residential
203435 2019-01-01 23:56:22 Noise - Residential

321 rows × 2 columns

Conditionals review#

If there’s time

Pure (“Purr”) Python#

Example: Make a function that checks if the given name is one of my cats

# name = input("Name: ")
name = "Wilbur"
def test_cats(word):
    print(word)

    if word.lower() == "blondie" or word.lower() == "wilbur":
        return True
    elif otherthing:
        stuff
    else:
        return False

    # versus

    if word.lower() == "blondie" or word.lower() == "wilbur":
        something
    if otherthing:
        stuff


test_cats(name)
Wilbur
True
False or name
'Wilbur'
name == ("blondie" or "wilbur")
False
"blondie" or "wilbur"
'blondie'

Pandas#

Comparison operators* are different, since you’re working with full columns instead of single values.

*You may see these referred to as “bitwise operators”, though that name isn’t quite accurate.

Another sample dataset from Plotly:

import plotly

medals = plotly.data.medals_wide()
medals
nation gold silver bronze
0 South Korea 24 13 11
1 China 10 15 8
2 Canada 9 12 12
medals["gold"] >= 10
0     True
1     True
2    False
Name: gold, dtype: bool
medals["silver"] >= 14
0    False
1     True
2    False
Name: silver, dtype: bool
(medals["gold"] >= 10) & (medals["silver"] >= 14)
0    False
1     True
2    False
dtype: bool
medals[(medals["gold"] >= 10) & (medals["silver"] >= 14)]
nation gold silver bronze
1 China 10 15 8

Refactored:

high_gold = medals["gold"] >= 10
high_silver = medals["silver"] >= 14

medals[high_gold & high_silver]
nation gold silver bronze
1 China 10 15 8

Final Project#

  1. Check feedback on your proposals, even if you got 100%.

  2. Once you start