Lecture 4: Dates and time series analysis#
Please sign attendance sheet
Dates and time series analysis#
From Wikipedia:
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.
What are time series you’ve seen / worked with?
Is the 311 data a time series?
Importing necessary packages#
import pandas as pd
import plotly.express as px
Data preparation#
Load 311 data.
requests = pd.read_csv(
"https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample_clean.csv.zip"
)
/var/folders/kr/nx0m1j811kz5vy8c87ffchzr0000gn/T/ipykernel_39361/461306798.py:1: DtypeWarning: Columns (8,20,31,34) have mixed types. Specify dtype option on import or set low_memory=False.
requests = pd.read_csv(
requests[["Created Date", "Closed Date"]]
Created Date | Closed Date | |
---|---|---|
0 | 08/01/2018 12:05:13 AM | 08/01/2018 12:05:13 AM |
1 | 08/01/2018 12:06:05 AM | 08/01/2018 12:06:05 AM |
2 | 08/01/2018 12:06:16 AM | 08/03/2018 02:03:55 PM |
3 | 08/01/2018 12:06:29 AM | 08/01/2018 02:54:24 AM |
4 | 08/01/2018 12:06:51 AM | 08/01/2018 04:54:26 AM |
... | ... | ... |
499953 | 08/24/2019 01:46:09 AM | NaN |
499954 | 08/24/2019 01:49:49 AM | NaN |
499955 | 08/24/2019 01:56:35 AM | NaN |
499956 | 08/24/2019 01:56:40 AM | NaN |
499957 | 08/24/2019 01:57:58 AM | NaN |
499958 rows × 2 columns
Get the minimum and maximum Created Date:
requests["Created Date"].min()
'01/01/2019 01:00:00 PM'
requests["Created Date"].max()
'12/31/2018 12:59:37 AM'
Anything you notice about that?
The dates are stored as strings (object
s):
requests.dtypes
Unique Key int64
Created Date object
Closed Date object
Agency object
Agency Name object
Complaint Type object
Descriptor object
Location Type object
Incident Zip object
Incident Address object
Street Name object
Cross Street 1 object
Cross Street 2 object
Intersection Street 1 object
Intersection Street 2 object
Address Type object
City object
Landmark object
Facility Type object
Status object
Due Date object
Resolution Description object
Resolution Action Updated Date object
Community Board object
BBL float64
Borough object
X Coordinate (State Plane) float64
Y Coordinate (State Plane) float64
Open Data Channel Type object
Park Facility Name object
Park Borough object
Vehicle Type object
Taxi Company Borough object
Taxi Pick Up Location object
Bridge Highway Name object
Bridge Highway Direction object
Road Ramp object
Bridge Highway Segment object
Latitude float64
Longitude float64
Location object
dtype: object
Convert columns to timestamps using pandas’ to_datetime()
#
requests["Created Date"] = pd.to_datetime(requests["Created Date"], format="%m/%d/%Y %I:%M:%S %p")
requests["Closed Date"] = pd.to_datetime(requests["Closed Date"], format="%m/%d/%Y %I:%M:%S %p")
requests[["Created Date", "Closed Date"]]
Created Date | Closed Date | |
---|---|---|
0 | 2018-08-01 00:05:13 | 2018-08-01 00:05:13 |
1 | 2018-08-01 00:06:05 | 2018-08-01 00:06:05 |
2 | 2018-08-01 00:06:16 | 2018-08-03 14:03:55 |
3 | 2018-08-01 00:06:29 | 2018-08-01 02:54:24 |
4 | 2018-08-01 00:06:51 | 2018-08-01 04:54:26 |
... | ... | ... |
499953 | 2019-08-24 01:46:09 | NaT |
499954 | 2019-08-24 01:49:49 | NaT |
499955 | 2019-08-24 01:56:35 | NaT |
499956 | 2019-08-24 01:56:40 | NaT |
499957 | 2019-08-24 01:57:58 | NaT |
499958 rows × 2 columns
More about the format
string. If you don’t provide one, it will take much longer to convert.
Check data types and confirm they are now datetime:
requests.dtypes
Unique Key int64
Created Date datetime64[ns]
Closed Date datetime64[ns]
Agency object
Agency Name object
Complaint Type object
Descriptor object
Location Type object
Incident Zip object
Incident Address object
Street Name object
Cross Street 1 object
Cross Street 2 object
Intersection Street 1 object
Intersection Street 2 object
Address Type object
City object
Landmark object
Facility Type object
Status object
Due Date object
Resolution Description object
Resolution Action Updated Date object
Community Board object
BBL float64
Borough object
X Coordinate (State Plane) float64
Y Coordinate (State Plane) float64
Open Data Channel Type object
Park Facility Name object
Park Borough object
Vehicle Type object
Taxi Company Borough object
Taxi Pick Up Location object
Bridge Highway Name object
Bridge Highway Direction object
Road Ramp object
Bridge Highway Segment object
Latitude float64
Longitude float64
Location object
dtype: object
requests["Created Date"].min()
Timestamp('2018-08-01 00:05:13')
requests["Created Date"].max()
Timestamp('2019-08-24 01:57:58')
In-class exercise#
Do Homework 4 through Step 2.
Noise complaints per day#
How would you do this, conceptually?
is_noise = requests["Complaint Type"] == "Noise - Residential"
noise = requests[is_noise]
noise_per_day = noise.resample("D", on="Created Date").size().reset_index(name="count_requests")
noise_per_day
Created Date | count_requests | |
---|---|---|
0 | 2018-08-01 | 50 |
1 | 2018-08-02 | 49 |
2 | 2018-08-03 | 65 |
3 | 2018-08-04 | 162 |
4 | 2018-08-05 | 191 |
... | ... | ... |
384 | 2019-08-20 | 60 |
385 | 2019-08-21 | 46 |
386 | 2019-08-22 | 62 |
387 | 2019-08-23 | 109 |
388 | 2019-08-24 | 33 |
389 rows × 2 columns
Resampling#
Once you have a column with datetime objects, pandas can manipulate them directly. From the User Guide:
resample()
is a time-basedgroupby
.resample('D', on='Created Date')
The 'D'
is the offset alias, i.e. the desired frequency.
fig = px.line(
noise_per_day,
x="Created Date",
y="count_requests",
title="Noise complaints per day",
)
fig.show()