R – Introduction to Factors Tutorial

A factor variable (commonly called a categorical variable outside of R) is a variable that takes on a limited set of values. For example, days of the week {Sunday, Monday, etc.} or the set of colors {Red, Blue, Green} should be a factor. By contrast, a vector of person names {Bill, Sue, Jane, etc.} should generally be designated as a character vector since there is an unlimited set of possible names a person can take.

Let’s make this clearer with some examples.

First create a vector of colors.

colors<-c("blue", "red", "green")
class(colors)
[1] "character"

By default, R creates a character vector. To make it a factor, we need to convert it explicitly with the factor() function.

colors<-factor(c("blue", "red", "green"))
class(colors)
[1] "factor"

Now, notice the output when we print the colors vector.

colors
[1] blue  red   green
Levels: blue green red

R displays our vector of colors with something additional – Levels.

The levels of a factor are the possible values that the variable can take. By default, when you create a factor, R sets the levels to the unique values found in the vector and sorts them (alphabetically, for character data). This is why the levels above read blue, green, red even though the data were entered as blue, red, green.
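To see what R actually stores, here is a minimal sketch using only base R functions (levels(), nlevels() and as.integer()). It recreates the colors factor from above and inspects both the levels and the integer codes behind each element.

```r
# Recreate the factor from the example above
colors <- factor(c("blue", "red", "green"))

# The levels are the sorted unique values
levels(colors)       # "blue" "green" "red"
nlevels(colors)      # 3

# Internally, each element is stored as an integer index into the levels
as.integer(colors)   # 1 3 2
```

The integer codes explain why the order of the levels matters: "red" is stored as 3 because it is the third level, not because of where it appears in the data.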

Suppose you are a teacher and you give your students a survey to evaluate your performance. One of your evaluation criteria might be “Your teacher had strong knowledge of the subject” with possible answers {Strongly Agree, Agree, Disagree, Strongly Disagree}. Suppose the student responses were {Agree, Agree, Strongly Agree, Disagree, Agree}.

We can create a vector of these responses by

responses<-factor(c("Agree", "Agree", "Strongly Agree", "Disagree", "Agree"))
responses
[1] Agree          Agree          Strongly Agree Disagree       Agree
Levels: Agree Disagree Strongly Agree

Luckily for you, none of the students strongly disagreed with the statement. However, since it was an option in the survey, it should be included in the levels of the factor. You can correct this by passing the full set of levels explicitly through the levels argument of factor().

responses<-factor(c("Agree", "Agree", "Strongly Agree", "Disagree", "Agree"), 
                  levels=c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree"))
responses
[1] Agree          Agree          Strongly Agree Disagree       Agree
Levels: Strongly Agree Agree Disagree Strongly Disagree
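A quick way to confirm that the unused level is really there is to tabulate the factor. The sketch below uses base R's table() and droplevels(); the variable name counts is introduced only for illustration.

```r
responses <- factor(
  c("Agree", "Agree", "Strongly Agree", "Disagree", "Agree"),
  levels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")
)

# table() counts every level, including levels with no observations
counts <- table(responses)
names(counts)      # "Strongly Agree" "Agree" "Disagree" "Strongly Disagree"
as.vector(counts)  # 1 3 1 0

# droplevels() removes unused levels again if they are not wanted
levels(droplevels(responses))  # "Strongly Agree" "Agree" "Disagree"
```

The zero count for Strongly Disagree is the information that would be lost if the levels were left at their defaults.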

Now, with the extra information provided in the levels attribute we can do things like plot a bar graph of the students’ responses including responses that were not selected.

Using ggplot2…

library(ggplot2)
df<-data.frame(responses = responses) #create a data frame for ggplot
#drop=FALSE tells the plot not to drop unused levels (i.e. include Strongly Disagree as a bar in the plot)
ggplot(data=df, aes(x=responses)) + geom_bar() + scale_x_discrete(drop=FALSE)

[Figure: Survey Plot 1]

Also note that plots will display the values of a factor in the order of the levels of the factor. In the above example, since we explicitly defined the levels as c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree"), our bar plot displayed the bars in that order. We can plot the response options in the reverse order by resetting the levels of the responses vector.

responses<-factor(responses, 
                  levels=c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree"))
df<-data.frame(responses = responses)
ggplot(data=df, aes(x=responses))+geom_bar()+scale_x_discrete(drop=FALSE)

[Figure: Survey Plot 2]
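Retyping the labels in reverse is easy to get wrong for longer scales. As an alternative sketch, the existing level order can be reversed with base R's rev(); the variable name reversed is introduced here only for illustration.

```r
responses <- factor(
  c("Agree", "Agree", "Strongly Agree", "Disagree", "Agree"),
  levels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")
)

# Reverse the existing level order instead of retyping the labels
reversed <- factor(responses, levels = rev(levels(responses)))
levels(reversed)  # "Strongly Disagree" "Disagree" "Agree" "Strongly Agree"

# The data values themselves are unchanged; only the level order differs
identical(as.character(reversed), as.character(responses))  # TRUE
```

Note that assigning levels(responses) <- rev(levels(responses)) is not equivalent: that form relabels the data (every Strongly Agree would become Strongly Disagree) rather than reordering the levels.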

There’s an important distinction to make between the two examples of factors we discussed above. The responses {Strongly Agree, Agree, Disagree, Strongly Disagree} have an inherent order, whereas the colors {red, blue, green} do not. To see whether a factor is ordered, use the is.ordered() function.

colors<-factor(c("blue", "red", "green"))
is.ordered(colors)
[1] FALSE

responses<-factor(c("Agree", "Agree", "Strongly Agree", "Disagree", "Agree"), 
                  levels=c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree"))
is.ordered(responses)
[1] FALSE

So how do we inform R that our vector of responses should be ordered? Use the ordered argument of the factor() function.

responses<-factor(c("Agree", "Agree", "Strongly Agree", "Disagree", "Agree"), 
                  levels=c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree"),
                  ordered=TRUE)
is.ordered(responses)
[1] TRUE
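One practical consequence of an ordered factor is that comparison operators become meaningful. The following is a small sketch in base R, assuming the same responses vector as above:

```r
responses <- factor(
  c("Agree", "Agree", "Strongly Agree", "Disagree", "Agree"),
  levels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree"),
  ordered = TRUE
)

# Comparisons follow the level order, not alphabetical order
responses < "Disagree"        # TRUE TRUE TRUE FALSE TRUE

# Smallest and largest observed levels
as.character(min(responses))  # "Strongly Agree"
as.character(max(responses))  # "Disagree"
```

Here "smaller" simply means "earlier in the levels", so Strongly Agree is the minimum because it was listed first. The same comparison on an unordered factor returns NA with a warning.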

Printing the responses vector gives

responses
[1] Agree          Agree          Strongly Agree Disagree       Agree
Levels: Strongly Agree < Agree < Disagree < Strongly Disagree

Notice that the levels are printed with “<” signs, indicating that they have a meaningful order. Certain models, such as decision trees, can take this into account. For example, consider a model that uses the responses of your student survey to predict whether a student failed your class. If you feed the model your vector of responses to the statement “Your teacher had strong knowledge of the subject” as an unordered factor, the model might consider every possible way of splitting the response levels into two groups:

{Agree, Strongly Disagree} vs {Strongly Agree, Disagree}
{Strongly Agree, Disagree, Strongly Disagree} vs {Agree}
.
.
.
{Strongly Disagree} vs {Strongly Agree, Agree, Disagree}

If you feed the model your vector of responses as an ordered factor, the model may only look at the following combinations of responses:

{Strongly Agree, Agree, Disagree} vs {Strongly Disagree}
{Strongly Agree, Agree} vs {Disagree, Strongly Disagree}
{Strongly Agree} vs {Agree, Disagree, Strongly Disagree}

The latter approach is much more reasonable for your classification model, because every candidate split respects the natural order of the responses.
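The difference in the number of candidate splits can be worked out directly. The sketch below only counts and lists the splits; it does not call any modeling library, and how a particular tree implementation treats ordered factors is not something it demonstrates. The names lvls, k, n_unordered and n_ordered are introduced for illustration.

```r
lvls <- c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")
k <- length(lvls)

# Unordered: any non-empty proper subset vs. its complement,
# counting each pair of groups once
n_unordered <- 2^(k - 1) - 1   # 7

# Ordered: only cut points between adjacent levels
n_ordered <- k - 1             # 3

# List the ordered splits explicitly
for (i in seq_len(k - 1)) {
  left  <- paste(lvls[1:i], collapse = ", ")
  right <- paste(lvls[(i + 1):k], collapse = ", ")
  cat("{", left, "} vs {", right, "}\n", sep = "")
}
```

With four levels the gap is small (7 versus 3), but the unordered count roughly doubles with each additional level while the ordered count grows by one.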