Walter W. Zhang

dplyr and data.table in R

29 Dec 2016

I have created a introductory comparison script for R’s dplyr (v0.5.0) and data.table (v1.9.6) packages in a previous teaching assignment, and here is a polished version of it. The script can be downloaded here.

Let’s first load the packages:

library("dplyr")
library("data.table")

And generate some sample data!

CreateExampleData <- function() {
    set.seed(45L)
    DT  <- data.table(V1 = c(1L,2L),
    V2 = LETTERS[1:3],
    V3 = round(rnorm(4),4),
    V4 = 1:12 )
    return(DT)
}
DT <- CreateExampleData()
DF <- data.frame(CreateExampleData())

We use the data.frame object in R for dplyr, but we can use the data.table object for dplyr as well. Note that none of dplyr operations are done in reference. Also, %>% in dpylr is similar to the pipe operator, and it sends output of one command to the first parameter of the other.

General Usage

The following are some examples with data.table and dplyr & data.frame. I will show the data.table operation’s dplyr equivalent with a data.frame object. The parallel commands are done in order. Note that some of the dplyr code are multiple lines long, and this is done to aid the readability of the code.

Manipulating rows

With data.table:

DT[3:5,]
DT[ V2 == "A" ]

With dplyr:

DF[ 3:5, ]
DF %>% filter(V2 == "A")

dplyr works on data.table objects too!

DT %>% filter(V2 == "A")

Manipulating Columns

With data.table:

DT[, .(V2,V3)]
DT[, .(Aggregate = sum(V1), Sd.V3 = sd(V3))]

With dplyr:

DF %>% select(V2, V3)
DF %>% summarize(Aggregate = sum(V1), Sd.V3 = sd(V3))

Manipulating Columns with Groups

With data.table:

DT[5:9, .(V4.Sum = sum(V4)), by = V1]
DT[, .(V4.Sum = sum(V4)), by = sign(V1-1)]

With dplyr:

DF[5:9,] %>% group_by(V1) %>% summarize(V4.Sum = sum(V4))
DF %>% group_by(sign = sign(V1-1)) %>%
    summarize(V4.Sum = sum(V4))

Adding/Updating Columns by Reference

We use := in data.table:

DT[, c("V1","V2") := list(round(exp(V1),2), LETTERS[4:6])]
DT[, ':=' (V1 = round(exp(V1),2), V2 = LETTERS[4:6])][]

With dplyr: (Note dplyr does not update by reference)

DF <- DF %>%
    mutate(V1 = round(exp(V1),2), V2 = rep(LETTERS[4:6],4))
DF <- DF %>%
    mutate(V1 = round(exp(V1),2), V2 = rep(LETTERS[4:6],4))
DF

Indexing and keys

With a new data set:

DT <- CreateExampleData()
DF <- data.frame(CreateExampleData())

With data.table:

setkey(DT,V2)
DT["A"]
DT["A", mult ="first"]
DT["A", mult ="last"]
DT[c("A","D"), nomatch = 0]
DT[c("A","D"), nomatch = NA]
setkey(DT,V1,V2)
DT[.(2, c("A","C")), sum(V4), by=.EACHI]

With dplyr:

# There does not seem to be a direct setkey() equivalent in dplyr
DF %>% filter(V2 == "A")
DF %>% filter(V2 == "A") %>% filter(row_number()== 1)
DF %>% filter(V2 == "A") %>% filter(row_number()== n())
DF %>% filter(V2 %in% c("A", "D"))
# There does not seem to be direct nomatch = NA equivalent in dplyr
DF %>% group_by(V1, V2) %>%
    filter(V1 == 2 & V2 %in% c("A", "C")) %>%
    summarize(sum(V4))

Advanced data table operations

With data.table:

setkey(DT,V2)
DT[.N-1]                  # Extracts the penultimate row
DT[, print(.SD), by=V2]   # See what .SD contains
DT[, .SD[c(1,.N)], by=V2] # Pick first and last row in V2 groups
DT[, lapply(.SD, sum), by=V2]

With dplyr:

DF %>% arrange(V2, V1) %>% filter(row_number()== n()-1)
for (V2_item in unique(DF$V2)){
    print(DF %>% group_by(V2) %>%
        filter(V2 == V2_item) %>%
        ungroup() %>%
        select(V1, V3, V4) %>%
        arrange(V1))
} # Indirect implementation here
DF %>% group_by(V2) %>%
    filter(row_number()== 1 | row_number()== n()) %>%
    arrange(V2, V1) %>%
    select(V2, V1, V3, V4)
DF %>% group_by(V2) %>%
    summarize(sum(V1), sum(V3), sum(V4))

dplyr seems be significantly more verbose for the “Advanced Table Operations”, but the code is much easier to read.

Chaining

With a new data set:

DT <- CreateExampleData()
DF <- data.frame(CreateExampleData())

With data.table:

setkey(DT,V2)
DT[, .(V4.Sum = sum(V4)), by=V2][V4.Sum > 25]

With dplyr:

DF %>% group_by(V2) %>%
    summarize(V4.Sum = sum(V4)) %>%
    filter(V4.Sum > 25)

Chaining is much more “natural” in dplyr from the pipe operator, but the data.table operations is also very clear.

Joins

With a new data set:

set.seed(45L)
A <- data.table(store = rep(c("A","B","C"), each=5),
                week  = 1:5,
                price = round(runif(15,2,4),2) )
B   <- data.table(store = rep(c("B","C","D"), each=6),
                week  = 1:6,
                units = round(runif(18,0,25),0) )
C   <- data.table(store = c("B","C","D"),
                size  = c(100,200,300),
                region = c("Boston","Chicago","Houston") )
A1 <- data.frame(A)
B1 <- data.frame(B)
C1 <- data.frame(C)

With data.table:

setkey(A, store, week)
setkey(B, store, week)
setkey(C, store)
A[B] # Right outer join
B[A] # Left outer join
A[B, nomatch=0L] # Inner join
A[C[, .(store,region)], nomatch=0L] # Inner join on subset
A[C, .(store,week,price,size,region), nomatch=0L]

With dplyr:

A %>% right_join(B)
A %>% left_join(B, on = c(store, week)) %>%
  select(store, week, units, price)
A %>% inner_join(B)
C %>% select(store, region) %>% inner_join(A) %>%
  select(store, week, price, region)
A %>% inner_join(C) %>% select(store,week,price,size,region)

dplyr joins on default on columns with the same name, and there is also no rolling join equivalent in dplyr.

The Shift Function

The new shift function can replace code for lagging.

X  = data.table(store = rep(c("A","B"), each=10),
                week  = 1:10,
                price = round(runif(20,2,4),2) )
setkey(X, store, week)

With the Lag Code:

X[, lag_price := X[J(store,week-1)][, price]]       # First lag
X[, lag_price_3 := X[J(store,week-3)][, price]]     # Third lag

With the Shift function:

X[, lag_price_alt := shift(price, 1), by = store]   # First lag
X[, lag_price_alt_3 := shift(price, 3), by = store] # Third lag
X[]

X[, c("lag_price","lag_price_3", "lag_price_alt", "lag_price_alt_3") := NULL]
X = X[!c(3,4,5,8,11,12,15,19,20)]

Lag with missing periods:

X[, lag_price := X[J(store,week-1), roll=TRUE][, price]][]

Lag with missing periods with Shift Function: (This is a bit verbose and there is probably an easier way)

X[, lag_price_shift :=  setnames(X[,.(shift(week, 1, fill=0), shift(price, 1)),
                                by = store, roll = TRUE],
                                'V2', 'price')[,price]][]

Last Words

data.table is very terse when compared to dplyr. However, dplyr is much easier to read than data.table, and this is due to its ubiquitous use of the pipe operator.

dplyr is compatable with data.table so we can ideally get the best of two worlds. For example:

DT <- CreateExampleData()
DT[, .(V4.Sum = sum(V4)), by=V2] %>% filter(V4.Sum > 25)
DT[, .(V4.Sd = sd(V4)), by=V2] %>% select(V4.Sd, V2)

Some helpful references:


comments powered by Disqus