Dates and Times in R Without Losing Your Sanity
Working with dates and times in R can be frustrating! This isn’t R’s fault; dates and times are naturally complicated. One must consider time zones, leap years, leap seconds, Daylight Savings, hundreds of potential date and time formats, and other quirky complexities. The goal of this article is to give you the tools and knowledge to deal with dates and times in R so you can avoid common mistakes, saving your hair and extending your lifespan.
Dates
The easiest way to make a date in R is to use as.Date(x)
where x
is some date in yyyy-mm-dd format. For example,
mydate <- as.Date("2018-01-01")
mydate
## [1] "2018-01-01"
class(mydate)
## [1] "Date"
Similarly, we can make a vector of dates by passing a vector of characters to as.Date()
.
mydates <- as.Date(c("2018-01-01", "2018-05-31"))
mydates
## [1] "2018-01-01" "2018-05-31"
Perhaps the hardest, most common issue that arises when dealing with dates is that they often come in weird or cryptic formats. In these cases, you’ll need to tell as.Date()
the format of the raw date values so that R can convert them into the canonical yyyy-mm-dd format. For example
# year month day without spaces
ugh <- c("20180205", "20170303")
as.Date(ugh, format = "%Y%m%d")
## [1] "2018-02-05" "2017-03-03"
# month day year with slashes
really <- c("2/5/18", "3/3/18")
as.Date(really, format = "%m/%d/%y")
## [1] "2018-02-05" "2018-03-03"
# day month year with underscores
whywouldsomeonedothis <- c("5_2_18", "3_3_18")
as.Date(whywouldsomeonedothis, format = "%d_%m_%y")
## [1] "2018-02-05" "2018-03-03"
# month day year with full month names
annoying <- c("February 5, 2018", "March 3, 2018")
as.Date(annoying, format = "%B %d, %Y")
## [1] "2018-02-05" "2018-03-03"
You might be wondering how I figured out those format strings. The key here is to read R’s documentation for strptime()
(see ?strptime
). In it, you’ll find all the special conversion specifications that tell R what to look for. Here’s a snippet from the docs
Let’s pause and consider what a date in R actually is. Consider
as.numeric(as.Date("2018-01-01"))
## [1] 17532
That’s right – you can cast a date to numeric. That’s because internally, R stores a date as number of days since January 1st, 1970.
Now that we know how to make dates in R, let’s learn how to do stuff with them.
Add or subtract days from a date
mydate <- as.Date("2018-01-01")
mydate + 1
## [1] "2018-01-02"
mydate - 7
## [1] "2017-12-25"
mydate + c(0,1,2,3)
## [1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04"
Get the time between two dates
d1 <- as.Date("2017-01-01")
d2 <- as.Date("2018-01-01")
d2 - d1
## Time difference of 365 days
difftime(d2, d1, units = "day")
## Time difference of 365 days
difftime(d2, d1, units = "hour")
## Time difference of 8760 hours
Determine the weekday of a date
mydate <- as.Date("2018-01-01")
weekdays(mydate)
## [1] "Monday"
Change the format of a date
mydate <- Sys.Date() # Get today's date
myprettydate <- format(mydate, format = "%B %d, %Y")
mydate
## [1] "2020-06-16"
myprettydate
## [1] "June 16, 2020"
Note that myprettydate
is type character, not Date.
class(mydate)
## [1] "Date"
class(myprettydate)
## [1] "character"
lubridate
The lubridate package in R let’s us do a lot of other cool things with dates. Let’s give it a go.
# install.packages(lubridate)
library(lubridate)
Extract the day of month and month of year
day(mydate)
## [1] 16
month(mydate)
## [1] 6
Adding and Subtracting Months
Conceptually this is a tricky topic. What does it mean to “add one month” to a date? For example, I think we’d all agree that one month after 2018-01-01 is 2018-02-01. But what’s one month after 2018-01-31?
Base R provides a very handy seq.Date()
method which can be used to create a sequence of dates that differ by days, months, years, etc. For example if we create a sequence of dates that differ by one month starting from 2018-01-31, we get
seq.Date(as.Date("2018-01-31"), length.out = 3, by = "month")
## [1] "2018-01-31" "2018-03-03" "2018-03-31"
In this case R is literally adding one to the month component of each date so that the resulting (attempted) dates are 2018-01-31, 2018-02-31, 2018-03-31, and 2018-04-31. Obviously dates like 2018-02-31 and 2018-04-31 don’t exist, so R counts the difference between the day component and the last valid date in each month, and then adds that amount to the last valid month to get a valid date. For example, 2018-02-31 is (in a sense) three days after 2018-02-28, so R resolves the date to 2018-03-03.
Alternatively, lubridate provides functionality for adding months to a date that behaves differently.
as.Date("2018-01-31") %m+% months(c(0,1,2))
## [1] "2018-01-31" "2018-02-28" "2018-03-31"
In this case, 2018-02-31 resolves to 2018-02-28 so that the month component of the result is truly one month ahead of 2018-01-31.
Datetimes
Now that we’ve discussed dates, let’s move on to the hard stuff – datetimes. Let’s start by making a datetime from scratch.
mydatetime <- as.POSIXct("2018-01-01 05:30:00")
mydatetime
## [1] "2018-01-01 05:30:00 CST"
Notice that R prints mydatetime
with the timezone CST. That’s because my operating system was set up with the America/Chicago timezone (I’m based in New Orleans) and R is assuming it’s the timezone I want. (Use Sys.timezone()
to see what timezone your OS is using.) If we want to make a datetime with a specific timezone, we can use the tz
argument of as.POSIXct()
. For example
dtTokyo <- as.POSIXct("2018-01-01 5:00:00", tz = "Asia/Tokyo")
dtCalifornia <- as.POSIXct("2018-01-01 03:30:00", tz = "America/Los_Angeles")
dtSydney <- as.POSIXct("2018-01-01 22:30:00", tz = "Australia/Sydney")
For a list of timezones you can plug into the tz
argument, see this Wikipedia article.
Note that dtCalifornia
and dtSydney
are actually “the same” meaning, if you were in LA at 3:30 on 2018-01-01 and you called your friend in Sydney, your friend’s clock would show 22:30 (10:30 PM).
dtCalifornia == dtSydney
## [1] TRUE
In R, a vector of datetimes must all have the same timezone attribute. So, if you want to store the above datetimes together and retain the timezone information, the best approach is to store them in a dataframe with a DateTime column and a Timezone column like so:
df <- data.frame(
DateTime = c(dtTokyo, dtCalifornia, dtSydney),
Timezone = c(attr(dtTokyo, "tzone"), attr(dtCalifornia, "tzone"), attr(dtSydney, "tzone")),
stringsAsFactors = F
)
df
## DateTime Timezone
## 1 2018-01-01 05:00:00 Asia/Tokyo
## 2 2018-01-01 20:30:00 America/Los_Angeles
## 3 2018-01-01 20:30:00 Australia/Sydney
Notice that R actually converts the DateTime column to “America/Chicago” time (my local timezone). While this might be nice for me, my colleagues in California and New York might not appreciate it. Instead, a better approach is to store the datetimes in Coordinated Universal Time (UTC). UTC is like a reference point for which all other timezone representations are based upon.
df$DateTime
## [1] "2018-01-01 05:00:00 JST" "2018-01-01 20:30:00 JST"
## [3] "2018-01-01 20:30:00 JST"
attr(df$DateTime, "tzone") <- "UTC"
df$DateTime
## [1] "2017-12-31 20:00:00 UTC" "2018-01-01 11:30:00 UTC"
## [3] "2018-01-01 11:30:00 UTC"
The important takeaway here is that, we are NOT altering the datetime values. We are simply altering the way those datetimes are displayed. With that said, the timezone information is still important. For example, suppose we want to add a column to our dataframe called Date indicating the date of each instance.
df$Date <- as.Date(df$DateTime)
df
## DateTime Timezone Date
## 1 2017-12-31 20:00:00 Asia/Tokyo 2017-12-31
## 2 2018-01-01 11:30:00 America/Los_Angeles 2018-01-01
## 3 2018-01-01 11:30:00 Australia/Sydney 2018-01-01
The above is probably NOT what we want. While those dates are correct in reference to UTC time, the 1st row for Tokyo shows Date 2017-12-31. However, in Tokyo, that date would actually be recognized as 2018-01-01. Fortunately as.Date()
has a tz
parameter where we can specify a timezone, although it’s not vectorized so we have to apply it row-by-row.
dats <- mapply(FUN = function(datetime, timezone) as.Date(datetime, tz = timezone), df$DateTime, df$Timezone)
dats <- as.Date(dats, origin = "1970-01-01")
df$Date <- dats
df
## DateTime Timezone Date
## 1 2017-12-31 20:00:00 Asia/Tokyo 2018-01-01
## 2 2018-01-01 11:30:00 America/Los_Angeles 2018-01-01
## 3 2018-01-01 11:30:00 Australia/Sydney 2018-01-01
dats <- as.Date(dats, origin = "1970-01-01")
is necessary here because mapply()
resolves the dates to their numeric values. In practice, I would typically do this with the data.table package which is much faster for large datasets.
library(data.table)
dt <- as.data.table(df)
dt[, Date := as.Date(DateTime, tz = Timezone[1L]), by = Timezone]
dt
## DateTime Timezone Date
## 1: 2017-12-31 20:00:00 Asia/Tokyo 2018-01-01
## 2: 2018-01-01 11:30:00 America/Los_Angeles 2018-01-01
## 3: 2018-01-01 11:30:00 Australia/Sydney 2018-01-01
Daylight savings can be a pain when working with datetimes. Consider this weird behavior… In case you weren’t aware, the beginning of Daylight Savings in US Central skips 2AM.
seq.POSIXt(from = as.POSIXct("2018-03-11 01:00:00", tz = "America/Chicago"), length.out = 6, by = "15 mins")
## [1] "2018-03-11 01:00:00 CST" "2018-03-11 01:15:00 CST"
## [3] "2018-03-11 01:30:00 CST" "2018-03-11 01:45:00 CST"
## [5] "2018-03-11 03:00:00 CDT" "2018-03-11 03:15:00 CDT"
Similarly the end of Daylight Savings repeats 1AM.
seq.POSIXt(from = as.POSIXct("2018-11-04 01:00:00", tz = "America/Chicago"), length.out = 6, by = "15 mins")
## [1] "2018-11-04 01:00:00 CDT" "2018-11-04 01:15:00 CDT"
## [3] "2018-11-04 01:30:00 CDT" "2018-11-04 01:45:00 CDT"
## [5] "2018-11-04 01:00:00 CST" "2018-11-04 01:15:00 CST"
With these things in mind, take note of how R handles the following:
as.POSIXct("2018-03-11 02:15:00", tz = "America/Chicago") # Turns into 12:00AM
## [1] "2018-03-11 CST"
as.POSIXct("2018-11-04 01:15:00", tz = "America/Chicago")
## [1] "2018-11-04 01:15:00 CDT"
as.POSIXct("2018-11-04 01:15:00", tz = "America/Chicago") + 60*60 # Add 1 hour (60*60 seconds)
## [1] "2018-11-04 01:15:00 CST"
To make things more confusing, not all regions observe Daylight Savings. Some regions do a half-hour rollback and others do a 45-minute rollback. Due to the wacky and inconsistent behavior of datetimes, it’s often easiest to work with UTC if possible. (UTC does not observe daylight savings.)
Lastly, let’s take a look at some common and useful tasks involving datetimes.
Formatting (see ?strptime
)
mydatetime <- Sys.time()
format(mydatetime, "%Y-%m-%d, %I:%M %p", usetz = T)
## [1] "2020-06-16, 04:59 PM CDT"
format(mydatetime, "%I:%M %p", usetz = T)
## [1] "04:59 PM CDT"
format(mydatetime, "%I:%M:%S %p", usetz = T)
## [1] "04:59:41 PM CDT"
Adding time
mydatetime <- as.POSIXct("2018-11-04 00:00:00", tz = "America/Chicago")
# add 60 seconds
mydatetime + 60
## [1] "2018-11-04 00:01:00 CDT"
# add 24 hours (this is the end of Daylight Savings)
mydatetime + 24*60*60
## [1] "2018-11-04 23:00:00 CST"
# add 1 day
mydatetime + lubridate::days(1)
## [1] "2018-11-05 CST"
Note that leap seconds exist, but R’s POSIXct class ignores them.
Create a sequence of datetimes
mydatetime <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")
seq.POSIXt(from = mydatetime, length.out = 3, by = "5 mins")
## [1] "2018-01-01 00:00:00 UTC" "2018-01-01 00:05:00 UTC"
## [3] "2018-01-01 00:10:00 UTC"
Rounding datetimes
library(lubridate)
mydatetime <- as.POSIXct("2018-01-01 12:35:00", tz = "UTC")
# nearest hour
round_date(mydatetime, unit = "hour")
## [1] "2018-01-01 13:00:00 UTC"
# nearest hour not ahead
floor_date(mydatetime, unit = "hour")
## [1] "2018-01-01 12:00:00 UTC"
# nearest hour not behind
ceiling_date(mydatetime, unit = "hour")
## [1] "2018-01-01 13:00:00 UTC"
In closing, R has the tools you need to do just about anything involving dates and times. However, it’s your responsibility to think about things like leap years, Daylight Savings, timezones etc. In order to avoid common traps, use UTC and the lubridate package often. Cheers!