R’s rpart package provides a powerful framework for growing classification and regression trees. To see how it works, let’s get started with a minimal example.
Motivating Problem
First let’s define a problem. There’s a common scam amongst motorists whereby a person will slam on his brakes in heavy traffic with the intention of being rear-ended. The person will then file an insurance claim for personal injury and damage to his vehicle, alleging that the other driver was at fault.
Rolling joins are commonly used for analyzing data involving time. A simple example – suppose you have a table of product sales and a table of commercials. You might want to associate each product sale with the most recent commercial that aired prior to the sale. In this case, you cannot do a basic join between the sales table and the commercials table because each sale was not tracked with a CommercialId attribute.
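As a rough sketch of the sales and commercials scenario, here is how a rolling join might look with data.table. The table contents and the column names (SaleId, SaleDate, CommercialId, AirDate, JoinDate) are made up for illustration. Note that roll = TRUE also matches a commercial that aired on the same date as the sale, so a strictly "prior" match would need extra handling.

```r
library(data.table)

# Hypothetical data: names and values are invented for this example
sales <- data.table(
  SaleId = 1:3,
  SaleDate = as.Date(c("2014-01-05", "2014-01-20", "2014-02-10"))
)
commercials <- data.table(
  CommercialId = c("A", "B"),
  AirDate = as.Date(c("2014-01-01", "2014-01-15"))
)

# Copy each date into a shared join column so the original columns survive the join
sales[, JoinDate := SaleDate]
commercials[, JoinDate := AirDate]

# For each sale, roll forward the most recent commercial with JoinDate <= the sale's JoinDate
result <- commercials[sales, on = "JoinDate", roll = TRUE]
result[, .(SaleId, SaleDate, CommercialId, AirDate)]
```

Each row of result corresponds to one sale, paired with the latest commercial that aired on or before it.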
R’s data.table package provides fast methods for handling large tables of data with simple syntax. The following is an introduction to basic join operations using data.table.
Suppose you have two data.tables – a table of insurance policies
library(data.table)

policies <- data.table(
  PolicyNumber = c(1, 2, 3),
  EffectiveDate = as.Date(c("2012-1-1", "2012-1-1", "2012-7-1")),
  ExpirationDate = as.Date(c("2012-12-31", "2012-6-30", "2012-12-31"))
)
policies
##    PolicyNumber EffectiveDate ExpirationDate
## 1:            1    2012-01-01     2012-12-31
## 2:            2    2012-01-01     2012-06-30
## 3:            3    2012-07-01     2012-12-31

and a table of insurance claims.
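To give a feel for what a basic join between the two tables could look like, here is a minimal sketch. The claims table below is hypothetical (its columns ClaimNumber and ClaimCost and all of its values are assumptions, not the original data), and merge() is just one of several ways to join data.tables.

```r
library(data.table)

policies <- data.table(
  PolicyNumber = c(1, 2, 3),
  EffectiveDate = as.Date(c("2012-1-1", "2012-1-1", "2012-7-1")),
  ExpirationDate = as.Date(c("2012-12-31", "2012-6-30", "2012-12-31"))
)

# Hypothetical claims table: columns and values are invented for illustration
claims <- data.table(
  ClaimNumber = c(123, 124, 125),
  PolicyNumber = c(1, 1, 3),
  ClaimCost = c(100, 2400, 350)
)

# Inner join: keep only policies that have at least one claim
inner <- merge(policies, claims, by = "PolicyNumber")

# Left join: keep every policy, filling NA where no claim exists
left <- merge(policies, claims, by = "PolicyNumber", all.x = TRUE)
```

In this sketch, policy 2 has no claims, so it drops out of the inner join and appears with NA claim fields in the left join.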
A factor variable (commonly called a categorical variable outside of R) is a variable that takes on a limited set of values. For example, a vector that stores days of the week {Sunday, Monday, etc.} or colors from the set {Red, Blue, Green} should be encoded as a factor. By contrast, a vector of person names {Bill, Sue, Jane, …} should generally be designated as a character vector since there is an unlimited set of possible names a person can take.
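A short sketch of the distinction, using made-up values: days of the week are stored as a factor with a fixed set of levels, while person names stay as a plain character vector.

```r
# Days of the week come from a fixed set, so encode them as a factor
days <- c("Monday", "Sunday", "Monday", "Tuesday")
week_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday")
days_factor <- factor(days, levels = week_levels)

levels(days_factor)      # all seven allowed values, even those not observed
table(days_factor)       # counts per level, including zeros
as.integer(days_factor)  # underlying integer codes into the levels

# Person names have no fixed set of values, so keep them as characters
person_names <- c("Bill", "Sue", "Jane")
is.character(person_names)
```

Because the levels are declared up front, summaries such as table() report every day of the week, not just the ones that happen to appear in the data.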
When I was in high school, I knew I wanted to pursue a career involving math. I did an internship working for some mechanical engineers at an oil platform consultant company, but I never witnessed my mentors do more than basic geometry or algebra. That’s when I started looking into actuarial science. It sounded like a more challenging and stimulating career for me. Problem was, I was having a hard time understanding exactly what an actuary does.
Before we get started I need to clarify something. Theoretical decision trees can have two or more branches protruding from a single node.
However, searching over splits with more than two branches can be computationally expensive, so most implementations of decision trees allow only binary splits.
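As one illustration of a binary-split implementation, the sketch below fits a tree with rpart on the kyphosis dataset that ships with the package (a stand-in chosen for convenience, not the dating example) and checks that every internal node has exactly two children. It relies on rpart's node numbering, in which the children of node n are numbered 2n and 2n + 1.

```r
library(rpart)

# kyphosis is a small example dataset bundled with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Row names of fit$frame are node numbers; leaves are marked "<leaf>" in the var column
nodes <- as.integer(rownames(fit$frame))
internal <- nodes[fit$frame$var != "<leaf>"]

# Every internal node n should have both children 2n and 2n + 1 present
all_binary <- all((2L * internal) %in% nodes & (2L * internal + 1L) %in% nodes)
all_binary
```

Calling print(fit) shows the same structure in text form, with each split producing a left and a right branch.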
Recall our example problem – Bill is a user on our online dating site and we want to build a decision tree that predicts whether he would message a certain woman. (If so, we say that she’s “date-worthy.”)