R – Data.Table Rolling Joins

Rolling joins are commonly used for analyzing data involving time. A simple example – suppose you have a table of product sales and a table of commercials. You might want to associate each product sale with the most recent commercial that aired prior to the sale. In this case, you cannot do a basic join between the sales table and the commercials table because each sale was NOT tracked with a commercialID attribute. Instead, you need to generate a mapping between sales and commercials based on logic involving their dates.

sales<-data.table(saleID=c("S1","S2","S3","S4","S5"), 
                  saleDate=as.Date(c("2014-2-20","2014-5-1","2014-6-15","2014-7-1","2014-12-31")))
sales
   saleID   saleDate
1:     S1 2014-02-20
2:     S2 2014-05-01
3:     S3 2014-06-15
4:     S4 2014-07-01
5:     S5 2014-12-31

commercials<-data.table(commercialID=c("C1","C2","C3","C4"), 
                        commercialDate=as.Date(c("2014-1-1","2014-4-1","2014-7-1","2014-9-15")))
commercials
   commercialID commercialDate
1:           C1     2014-01-01
2:           C2     2014-04-01
3:           C3     2014-07-01
4:           C4     2014-09-15

setkey(sales,"saleDate")
setkey(commercials,"commercialDate")

Note: If you do a rolling join involving a multi-column key, the roll column should be the last keyed column.

Before we answer the problem stated above, let’s analyze the behavior of the default rolling join in R’s data.table package.

sales[commercials, roll=TRUE]
     saleDate saleID commercialID
1: 2014-01-01     NA           C1
2: 2014-04-01     S1           C2
3: 2014-07-01     S4           C3
4: 2014-09-15     S4           C4

What just occurred is best described graphically.

roll1

data.table is associating each commercial with the most recent sale prior to the commercial date (and including the commercial date). In other words, the most recent sale prior to each commercial is said to roll forwards, and the saleDate is mapped to the commercialDate. Notice that sale S4 was the most recent sale prior to commercial C3 and C4, so S4 appears twice in the resultant table.

If you don’t want to lose the original saleDate column, you may prefer to create a designated rollDate column for your rolling join. I’m going to use this convention going forward.

sales[, rollDate:=saleDate] #Add a column, rollDate equal to saleDate
commercials[, rollDate:=commercialDate] #Add a column, rollDate equal to commercialDate

setkey(sales, "rollDate")
setkey(commercials, "rollDate")

sales[commercials, roll=TRUE]
     rollDate saleID   saleDate commercialID commercialDate
1: 2014-01-01     NA       <NA>           C1     2014-01-01
2: 2014-04-01     S1 2014-02-20           C2     2014-04-01
3: 2014-07-01     S4 2014-07-01           C3     2014-07-01
4: 2014-09-15     S4 2014-07-01           C4     2014-09-15

Now let’s see what happens when we join the commercials to the sales.

commercials[sales, roll=TRUE]
     rollDate commercialID commercialDate saleID   saleDate
1: 2014-02-20           C1     2014-01-01     S1 2014-02-20
2: 2014-05-01           C2     2014-04-01     S2 2014-05-01
3: 2014-06-15           C2     2014-04-01     S3 2014-06-15
4: 2014-07-01           C3     2014-07-01     S4 2014-07-01
5: 2014-12-31           C4     2014-09-15     S5 2014-12-31

What it’s doing…

roll2

Here data.table is associating each sale with the most recent commercial prior to the saleDate. Did you notice this is the solution to our originally stated problem?

data.table also allows backward rolling joins by setting the roll argument to -Inf

sales[commercials, roll=-Inf]
     rollDate saleID   saleDate commercialID commercialDate
1: 2014-01-01     S1 2014-02-20           C1     2014-01-01
2: 2014-04-01     S2 2014-05-01           C2     2014-04-01
3: 2014-07-01     S4 2014-07-01           C3     2014-07-01
4: 2014-09-15     S5 2014-12-31           C4     2014-09-15

commercials[sales, roll=-Inf]
     rollDate commercialID commercialDate saleID   saleDate
1: 2014-02-20           C2     2014-04-01     S1 2014-02-20
2: 2014-05-01           C3     2014-07-01     S2 2014-05-01
3: 2014-06-15           C3     2014-07-01     S3 2014-06-15
4: 2014-07-01           C3     2014-07-01     S4 2014-07-01
5: 2014-12-31           NA           <NA>     S5 2014-12-31

The above tools should provide you with the ability to solve the majority of your rolling join needs. However, I encourage you to read the docs for data.table as there are some other cool features for rolling joins. Specifically, you can

  • provide a positive/negative value to the roll argument, which will perform a forward/backward rolling join, only associating dates within n days of each other
  • roll dates to their nearest neighbor as opposed to a one-way (forward or backward) rolling join
  • specify the behavior of the end points in a rolling join (see the rollends argument)



Challenge
Remove sale S1 from the sales table.

sales<-sales[saleID!="S1"]

Now build a table that has each commercial and all sales that occurred after (and on) the commercialDate, but before the next commercial. In other words, try to build this table that will allow you to analyze the performance of each commercial.

   commercialID saleID   saleDate commercialDate
1:           C1     NA       <NA>     2014-01-01
2:           C2     S2 2014-05-01     2014-04-01
3:           C2     S3 2014-06-15     2014-04-01
4:           C3     S4 2014-07-01     2014-07-01
5:           C4     S5 2014-12-31     2014-09-15
Posted Saturday, July 26th, 2014 under R.

2 comments

  1. Great explanation of a really complicated, but powerful function in data.table. The graphics did a very nice job of visualizing what is going on. Thank you!

Leave a Reply