Spatial Data Analysis in R

In this tutorial you will learn how to

  • define geometries (points, lines, polygons)
  • plot those geometries
  • execute spatial joins (which points are contained in a polygon?)
  • get the distance between a set of points
  • do all of the above within the context of geospatial data (e.g. cities, roads, counties)

Important This tutorial is based on sf version 0.5-3 and ggplot2 version

The Basics

To get started we need to learn how to define and operate on abstract geometries without a coordinate reference system (CRS). In other words, we need to learn how to deal with points, lines and polygons on your vanilla XY plane as opposed to longitude and latitude coordinates on a globe. To do this, we will make heavy use of the Simple Feature (sf) package from Edzer Pebesma.

Simple Feature Geometries (sfg)

We can define a point with the coordinates (x,y) = (1,1) as follows

pt1 <- st_point(c(1,1))
pt1  # POINT(1 1)

We could also include a Z coordinate by doing

st_point(c(1,1,1))  # POINTZ(1 1 1)

and/or an M coordinate by doing

st_point(c(1,1,1,20))  # POINTZM(1 1 1 20)
st_point(c(1,1,20), dim = "XYM")  # POINTM(1 1 20)

Here X, Y, and Z represent spatial coordinates while M represents an associated measurement such as time. Inspecting the class of pt1 we see

class(pt1)  # "XY" "POINT" "sfg"  

Our point object inherits from multiple classes – “XY”, “POINT”, and “sfg”. Here sfg stands for Simple Feature Geometry. We can build other Simple Feature Geometries such as


line1 <- st_linestring(x=rbind(c(4,1), c(5,2), c(6,2)))
class(line1)  # "XY" "LINESTRING" "sfg"


poly1 <- st_polygon(x=list(rbind(c(1,1), c(3,1), c(3,2), c(1,2), c(.5,.5), c(1,1))))
class(poly1)  # "XY" "POLYGON" "sfg"

POLYGON (with holes)

poly2 <- st_polygon(x=list(
  rbind(c(1,3), c(3,3), c(3,4), c(1,4), c(.5,2.5), c(1,3)), # outer polygon
  rbind(c(1.5, 3.5), c(1.7, 3.5), c(1.7, 3.7), c(1.5, 3.7), c(1.5, 3.5)),  # hole1
  rbind(c(2.5, 3.5), c(2.7, 3.5), c(2.7, 3.7), c(2.5, 3.5))  # hole2
class(poly2)  # "XY" "POLYGON" "sfg"

You can also define a Simple Feature Geometry that includes multiple points, linestrings, polygons, or a mixture of all three.


multipt <- st_multipoint(rbind(c(1,1), c(2,2), c(1.5,1.5), c(2,1)))
class(multipt)  # "XY" "MULTIPOINT" "sfg"

GEOMETRYCOLLECTION (a point, line, and polygon)

geomcoll <- st_geometrycollection(x = list(pt1, line1, poly2))
class(geomcoll)  # "XY" "GEOMETRYCOLLECTION" "sfg"

Notice that printing the objects returns a description of its coordinates. This textual description is called Well Known Text (WKT).

We can do lots of cool things with geometries like

create new point which is pt1 shifted one unit up

pt1 + c(0, 1)

get the area of polygon1


determine if pt1 lies inside poly2

st_within(pt1, poly1, sparse = F)

determine the distance between two points

st_distance(st_point(c(1,1)), st_point(c(2,2)))

Simple Feature List-Column (sfc)

Suppose we shoot bullets at a target. We’ll want to analyze where each bullet hit the target. For this we can use a Simple Feature List-Column (sfc). For example

shot1 <- st_point(c(0.4, 0.3))
shot2 <- st_point(c(0.8, 0.1))
shot3 <- st_point(c(0.6, 0.55))
shotsSFC <- st_sfc(shot1, shot2, shot3)

Now we can do cool stuff like

get the distance between every pair of bullets


determine the minimum bounding box that contains all points


You might be wondering why we created a Simple Feature List-Column instead of a MULTIPOINT object. The reason is because MULTIPOINT is meant to represent one “thing” which happens to have multiple points. In this case, we can consider each shot to be a separate “thing”, so it make sense to represent them as multiple POINT objects within a Simple Feature List-Column. To make this more clear, suppose we use a shotgun (which sprays many beads for each shot) instead of a rifle (one bead per shot). Here we’d represent each shot as a MULTIPOINT, but still organize the collection of shots together as a Simple Feature List-Column.

shot1 <- st_multipoint(rbind(c(0.4, 0.3), c(0.45, 0.32), c(0.42, 0.35)))
shot2 <- st_multipoint(rbind(c(0.8, 0.1)))
shot3 <- st_multipoint(rbind(c(0.93, 0.55), c(0.97, 0.52), c(0.95, 0.48), c(1.05, 0.29)))
shotsSFC <- st_sfc(shot1, shot2, shot3)


Simple Feature (sf)

The next layer up from Simple Feature List-Column is the Simple Feature class. Consider our example of shooting a gun at target. With each shot there is associated data we’d like to retain. For example, the shot number and the type of gun used. ShotNumber and GunType are the features associated with the points representing the bullet locations. This concept of having a geometry associated with a set of features is the underlying concept behind the Simple Feature class type.

shotsDF <- data.frame(
  ShotNumber = c(1, 2, 3, 4, 5),
  GunType = c("rifle", "rifle", "shotgun", "shotgun", "shotgun")
shot1 <- st_point(c(0.2, 0.75))
shot2 <- st_point(c(1.1, 0.9))
shot3 <- st_multipoint(rbind(c(0.4, 0.3), c(0.45, 0.32), c(0.42, 0.35)))
shot4 <- st_multipoint(rbind(c(0.8, 0.1)))
shot5 <- st_multipoint(rbind(c(0.93, 0.55), c(0.97, 0.52), c(0.95, 0.48), c(1.05, 0.48)))
shotsSFG <- st_sfc(shot1, shot2, shot3, shot4, shot5)
shotsSF <- st_sf(shotsDF, shotsSFG)


Notice that when we print shotsSF it looks much like a data.frame. In fact, class(shotsSF) returns “sf”, “data.frame” meaning shotsSF is an extension of a data.frame with additional properties defined for Simple Feature objects. The upshot of this is that we can use it much like we use a normal data.frame

subset the rows


combine rows

rbind(shotsSF, shotsSF)

reference columns


Simple Feature is the class type you will find yourself using most frequently when working with spatial data. A major benefit to this is that ggplot2 has a native geom_sf() method for plotting your Simple Feature objects. For example

mypoints <- data.frame(x = c(1,2,3), y = c(1,2,3), foo = c("a", "b", "a"))
mypoints <- st_as_sf(mypoints, coords = c("x", "y"))
ggplot()+geom_sf(data = mypoints, aes(color = foo))+labs(title = "My simple plot")

(Note that geom_sf() is brand new, so it still has some bugs which are being worked out.)

Here we define a square target to illustrate some of the cool things you can do with sf objects.

target <- st_polygon(list(rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0))))
plot(shotsSF, add=T, col=shotsSF$ShotNumber)

Which shots hit the target at least a little bit?

st_intersects(target, shotsSF, sparse = F)

Which shots landed entirely within the target?

st_contains(target, shotsSF, sparse = F)

For each shot, how close was the center of the target to the nearest bead?

st_distance(shotsSF, st_centroid(target))

For each shot, how close was the center of the target to the center of the beads?

st_distance(st_centroid(shotsSF), st_centroid(target))

Data with a Coordinate Reference System (CRS)

For geospatial data, we can mimmick the concepts above but we’ll need to include something extra – a Coordinate Reference System (CRS). The CRS will inform sf to treat x and y coordinates as longitude and latitude, and help with operations like plotting and measuring the distance between points. CRS is a deep, technical topic and I won’t pretend to be an expert on it. If you’re interested in understanding it better, I’ll defer you to the Wikipedia article for Spatial Reference Systems.

Geospatial data is typically stored as a “shape file” which is really a collection of files, each with the same prefix but with differing file extensions which store different data elements. As an example, the sf package comes with a shape file representing the counties in North Carolina. You can see them by navigating to

system.file("shape", package="sf")  # For me, /Library/Frameworks/R.framework/Versions/3.4/Resources/library/sf/shape
list.files(system.file("shape", package="sf"))

To load a shape file, we can do

nc <- st_read(system.file("shape/nc.shp", package="sf"))

Notice that the resulting object is a Simple Feature where each row in the data.frame represents a county, and the geometry used to represent counties is a MULTIPOLYGON. This is because some counties in NC have disjoint sections broken apart by water.

Heres a plot of the data


Also take note of the CRS. It’s signified by the SRID 4267.


Suppose we want to plot a few cities on the map. First we can build a table of some major cities in NC

cities <- data.frame(
  City = c("Charlotte", "Raleigh", "Greensboro"),
  Population = c(842051, 458880, 287027),
  lon = c(-80.8431, -78.6382, -79.7920),
  lat = c(35.2271, 35.7796, 36.0726)

Convert to sf object

citiesSF <- st_as_sf(cities, coords = c("lon", "lat"), crs = 4267)

Plot North Carolina counties with major cities overlaid

  geom_sf(data = nc)+
  geom_sf(data = citiesSF, aes(size = Population), color = "red")

For more details, be sure to check out the sf vignettes, starting with this one.