R – Data.Table Rolling Joins

Rolling joins are commonly used for analyzing data involving time. A simple example – suppose you have a table of product sales and a table of commercials. You might want to associate each product sale with the most recent commercial that aired prior to the sale. In this case, you cannot do a basic join between the sales table and the commercials table because each sale was NOT tracked with a commercialID attribute. Instead, you need to generate a mapping between sales and commercials based on logic involving their dates.

   saleID   saleDate
1:     S1 2014-02-20
2:     S2 2014-05-01
3:     S3 2014-06-15
4:     S4 2014-07-01
5:     S5 2014-12-31

   commercialID commercialDate
1:           C1     2014-01-01
2:           C2     2014-04-01
3:           C3     2014-07-01
4:           C4     2014-09-15


Note: If you do a rolling join involving a multi-column key, the roll column should be the last keyed column.

Before we answer the problem stated above, let’s analyze the behavior of the default rolling join in R’s data.table package.

sales[commercials, roll=TRUE]
     saleDate saleID commercialID
1: 2014-01-01     NA           C1
2: 2014-04-01     S1           C2
3: 2014-07-01     S4           C3
4: 2014-09-15     S4           C4

What just occurred is best described graphically.


data.table is associating each commercial with the most recent sale prior to the commercial date (and including the commercial date). In other words, the most recent sale prior to each commercial is said to roll forwards, and the saleDate is mapped to the commercialDate. Notice that sale S4 was the most recent sale prior to commercial C3 and C4, so S4 appears twice in the resultant table.

If you don’t want to lose the original saleDate column, you may prefer to create a designated rollDate column for your rolling join. I’m going to use this convention going forward.

sales[, rollDate:=saleDate] #Add a column, rollDate equal to saleDate
commercials[, rollDate:=commercialDate] #Add a column, rollDate equal to commercialDate

setkey(sales, "rollDate")
setkey(commercials, "rollDate")

sales[commercials, roll=TRUE]
     rollDate saleID   saleDate commercialID commercialDate
1: 2014-01-01     NA       <NA>           C1     2014-01-01
2: 2014-04-01     S1 2014-02-20           C2     2014-04-01
3: 2014-07-01     S4 2014-07-01           C3     2014-07-01
4: 2014-09-15     S4 2014-07-01           C4     2014-09-15

Now let’s see what happens when we join the commercials to the sales.

commercials[sales, roll=TRUE]
     rollDate commercialID commercialDate saleID   saleDate
1: 2014-02-20           C1     2014-01-01     S1 2014-02-20
2: 2014-05-01           C2     2014-04-01     S2 2014-05-01
3: 2014-06-15           C2     2014-04-01     S3 2014-06-15
4: 2014-07-01           C3     2014-07-01     S4 2014-07-01
5: 2014-12-31           C4     2014-09-15     S5 2014-12-31

What it’s doing…


Here data.table is associating each sale with the most recent commercial prior to the saleDate. Did you notice this is the solution to our originally stated problem?

data.table also allows backward rolling joins by setting the roll argument to -Inf

sales[commercials, roll=-Inf]
     rollDate saleID   saleDate commercialID commercialDate
1: 2014-01-01     S1 2014-02-20           C1     2014-01-01
2: 2014-04-01     S2 2014-05-01           C2     2014-04-01
3: 2014-07-01     S4 2014-07-01           C3     2014-07-01
4: 2014-09-15     S5 2014-12-31           C4     2014-09-15

commercials[sales, roll=-Inf]
     rollDate commercialID commercialDate saleID   saleDate
1: 2014-02-20           C2     2014-04-01     S1 2014-02-20
2: 2014-05-01           C3     2014-07-01     S2 2014-05-01
3: 2014-06-15           C3     2014-07-01     S3 2014-06-15
4: 2014-07-01           C3     2014-07-01     S4 2014-07-01
5: 2014-12-31           NA           <NA>     S5 2014-12-31

The above tools should provide you with the ability to solve the majority of your rolling join needs. However, I encourage you to read the docs for data.table as there are some other cool features for rolling joins. Specifically, you can

  • provide a positive/negative value to the roll argument, which will perform a forward/backward rolling join, only associating dates within n days of each other
  • roll dates to their nearest neighbor as opposed to a one-way (forward or backward) rolling join
  • specify the behavior of the end points in a rolling join (see the rollends argument)

Remove sale S1 from the sales table.


Now build a table that has each commercial and all sales that occurred after (and on) the commercialDate, but before the next commercial. In other words, try to build this table that will allow you to analyze the performance of each commercial.

   commercialID saleID   saleDate commercialDate
1:           C1     NA       <NA>     2014-01-01
2:           C2     S2 2014-05-01     2014-04-01
3:           C2     S3 2014-06-15     2014-04-01
4:           C3     S4 2014-07-01     2014-07-01
5:           C4     S5 2014-12-31     2014-09-15
Posted Saturday, July 26th, 2014 under R.


  1. Great explanation of a really complicated, but powerful function in data.table. The graphics did a very nice job of visualizing what is going on. Thank you!

  2. Jesús Loredo says:

    Una explicación muy agradable, en especial por las animaciones, muhas gracias.

    Llegue aquí mientras me documentaba en google, y creo que leeré algunas entradas (empezando por Magic Behind Constructing a Decision Tree,)

    personalmente me encantaria ver material de Acturia y analisis multivariante,

    Felicidades por el Blog, espero no lo abandones,

    Saludos desde México,

  3. Great, intuitive example to demonstrate this highly useful feature of data.table! Kudos to you!

  4. Superquant says:

    Fantastic example of something extremely useful in data.table.

  5. Excellent! This was extremely helpful, especially the visuals. Thank you!

  6. Wordsmyth says:

    Again, a picture is woth more than 1k words! Thanks.

  7. Dmitry says:

    Nice job, Ben! Visuals are of great help!

  8. Nishant says:

    Nice explanation Ben, thanks

  9. Beyond awesome. The use of the “rollDate” copies of the index variables make it much, much easier to see what whent on when looking at the result DT. Thanks.

Leave a Reply