Class 4: Dates and time series analysis#

Dates and time series analysis#

From Wikipedia:

A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.

What are time series you’ve seen / worked with?

Is the 311 data a time series?

As an example, we’ll look at the frequency of 311 requests:

  • Over time

  • By day of the week

  • By hour

Importing necessary packages#

import pandas as pd
import plotly.express as px

Data preparation#

Load 311 data.

requests = pd.read_csv(
    "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip"
)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_22779/461306798.py:1: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
  requests = pd.read_csv(
requests[["Created Date", "Closed Date"]]
Created Date Closed Date
0 08/01/2018 12:05:13 AM 08/01/2018 12:05:13 AM
1 08/01/2018 12:06:05 AM 08/01/2018 12:06:05 AM
2 08/01/2018 12:06:16 AM 08/03/2018 02:03:55 PM
3 08/01/2018 12:06:29 AM 08/01/2018 02:54:24 AM
4 08/01/2018 12:06:51 AM 08/01/2018 04:54:26 AM
... ... ...
499953 08/24/2019 01:46:09 AM NaN
499954 08/24/2019 01:49:49 AM NaN
499955 08/24/2019 01:56:35 AM NaN
499956 08/24/2019 01:56:40 AM NaN
499957 08/24/2019 01:57:58 AM NaN

499958 rows × 2 columns

Get the minimum and maximum Created Date:

requests["Created Date"].min()
'01/01/2019 01:00:00 PM'
requests["Created Date"].max()
'12/31/2018 12:59:37 AM'

Anything you notice about that?

The dates are stored as strings (objects):

requests.dtypes
Unique Key                          int64
Created Date                       object
Closed Date                        object
Agency                             object
Agency Name                        object
Complaint Type                     object
Descriptor                         object
Location Type                      object
Incident Zip                       object
Incident Address                   object
Street Name                        object
Cross Street 1                     object
Cross Street 2                     object
Intersection Street 1              object
Intersection Street 2              object
Address Type                       object
City                               object
Landmark                           object
Facility Type                      object
Status                             object
Due Date                           object
Resolution Description             object
Resolution Action Updated Date     object
Community Board                    object
BBL                               float64
Borough                            object
X Coordinate (State Plane)        float64
Y Coordinate (State Plane)        float64
Open Data Channel Type             object
Park Facility Name                 object
Park Borough                       object
Vehicle Type                       object
Taxi Company Borough               object
Taxi Pick Up Location              object
Bridge Highway Name                object
Bridge Highway Direction           object
Road Ramp                          object
Bridge Highway Segment             object
Latitude                          float64
Longitude                         float64
Location                           object
dtype: object

Convert columns to timestamps using pandas’ to_datetime()#

requests["Created Date"] = pd.to_datetime(requests["Created Date"], format="%m/%d/%Y %I:%M:%S %p")
requests["Closed Date"] = pd.to_datetime(requests["Closed Date"], format="%m/%d/%Y %I:%M:%S %p")
requests[["Created Date", "Closed Date"]]
Created Date Closed Date
0 2018-08-01 00:05:13 2018-08-01 00:05:13
1 2018-08-01 00:06:05 2018-08-01 00:06:05
2 2018-08-01 00:06:16 2018-08-03 14:03:55
3 2018-08-01 00:06:29 2018-08-01 02:54:24
4 2018-08-01 00:06:51 2018-08-01 04:54:26
... ... ...
499953 2019-08-24 01:46:09 NaT
499954 2019-08-24 01:49:49 NaT
499955 2019-08-24 01:56:35 NaT
499956 2019-08-24 01:56:40 NaT
499957 2019-08-24 01:57:58 NaT

499958 rows × 2 columns

More about the format string. If you don’t provide one, it will take much longer to convert.

Check data types and confirm they are now datetime:

requests.dtypes
Unique Key                                 int64
Created Date                      datetime64[ns]
Closed Date                       datetime64[ns]
Agency                                    object
Agency Name                               object
Complaint Type                            object
Descriptor                                object
Location Type                             object
Incident Zip                              object
Incident Address                          object
Street Name                               object
Cross Street 1                            object
Cross Street 2                            object
Intersection Street 1                     object
Intersection Street 2                     object
Address Type                              object
City                                      object
Landmark                                  object
Facility Type                             object
Status                                    object
Due Date                                  object
Resolution Description                    object
Resolution Action Updated Date            object
Community Board                           object
BBL                                      float64
Borough                                   object
X Coordinate (State Plane)               float64
Y Coordinate (State Plane)               float64
Open Data Channel Type                    object
Park Facility Name                        object
Park Borough                              object
Vehicle Type                              object
Taxi Company Borough                      object
Taxi Pick Up Location                     object
Bridge Highway Name                       object
Bridge Highway Direction                  object
Road Ramp                                 object
Bridge Highway Segment                    object
Latitude                                 float64
Longitude                                float64
Location                                  object
dtype: object
requests["Created Date"].min()
Timestamp('2018-08-01 00:05:13')
requests["Created Date"].max()
Timestamp('2019-08-24 01:57:58')

In-class exercise#

Let’s do up through Step 2 of Homework 4.

Noise complaints per day#

noise = requests[requests["Complaint Type"] == "Noise - Residential"]
noise_per_day = noise.resample("D", on="Created Date").size().reset_index(name="count_requests")

noise_per_day
Created Date count_requests
0 2018-08-01 50
1 2018-08-02 49
2 2018-08-03 65
3 2018-08-04 162
4 2018-08-05 191
... ... ...
384 2019-08-20 60
385 2019-08-21 46
386 2019-08-22 62
387 2019-08-23 109
388 2019-08-24 33

389 rows × 2 columns

Resampling#

Once you have a column with datetime objects, pandas can manipulate them directly. From the User Guide:

resample() is a time-based groupby

.resample('D', on='Created Date')

The 'D' is the offset alias, i.e. the desired frequency.

fig = px.line(
    noise_per_day,
    x="Created Date",
    y="count_requests",
    title="Noise complaints per day",
)
fig.show()