I have created a introductory comparison script for R’s dplyr
(v0.5.0) and data.table
(v1.9.6) packages in a previous teaching assignment, and here is a polished version of it. The script can be downloaded here.
Let’s first load the packages:
library("dplyr")
library("data.table")
And generate some sample data!
CreateExampleData <- function() {
set.seed(45L)
DT <- data.table(V1 = c(1L,2L),
V2 = LETTERS[1:3],
V3 = round(rnorm(4),4),
V4 = 1:12 )
return(DT)
}
DT <- CreateExampleData()
DF <- data.frame(CreateExampleData())
We use the data.frame
object in R for dplyr, but we can use the data.table
object for dplyr as well. Note that none of dplyr operations are done in reference. Also, %>% in dpylr is similar to the pipe operator, and it sends output of one command to the first parameter of the other.
The following are some examples with data.table and dplyr & data.frame. I will show the data.table operation’s dplyr equivalent with a data.frame object. The parallel commands are done in order. Note that some of the dplyr code are multiple lines long, and this is done to aid the readability of the code.
With data.table:
DT[3:5,]
DT[ V2 == "A" ]
With dplyr:
DF[ 3:5, ]
DF %>% filter(V2 == "A")
dplyr works on data.table objects too!
DT %>% filter(V2 == "A")
With data.table:
DT[, .(V2,V3)]
DT[, .(Aggregate = sum(V1), Sd.V3 = sd(V3))]
With dplyr:
DF %>% select(V2, V3)
DF %>% summarize(Aggregate = sum(V1), Sd.V3 = sd(V3))
With data.table:
DT[5:9, .(V4.Sum = sum(V4)), by = V1]
DT[, .(V4.Sum = sum(V4)), by = sign(V1-1)]
With dplyr:
DF[5:9,] %>% group_by(V1) %>% summarize(V4.Sum = sum(V4))
DF %>% group_by(sign = sign(V1-1)) %>%
summarize(V4.Sum = sum(V4))
We use := in data.table:
DT[, c("V1","V2") := list(round(exp(V1),2), LETTERS[4:6])]
DT[, ':=' (V1 = round(exp(V1),2), V2 = LETTERS[4:6])][]
With dplyr: (Note dplyr does not update by reference)
DF <- DF %>%
mutate(V1 = round(exp(V1),2), V2 = rep(LETTERS[4:6],4))
DF <- DF %>%
mutate(V1 = round(exp(V1),2), V2 = rep(LETTERS[4:6],4))
DF
With a new data set:
DT <- CreateExampleData()
DF <- data.frame(CreateExampleData())
With data.table:
setkey(DT,V2)
DT["A"]
DT["A", mult ="first"]
DT["A", mult ="last"]
DT[c("A","D"), nomatch = 0]
DT[c("A","D"), nomatch = NA]
setkey(DT,V1,V2)
DT[.(2, c("A","C")), sum(V4), by=.EACHI]
With dplyr:
# There does not seem to be a direct setkey() equivalent in dplyr
DF %>% filter(V2 == "A")
DF %>% filter(V2 == "A") %>% filter(row_number()== 1)
DF %>% filter(V2 == "A") %>% filter(row_number()== n())
DF %>% filter(V2 %in% c("A", "D"))
# There does not seem to be direct nomatch = NA equivalent in dplyr
DF %>% group_by(V1, V2) %>%
filter(V1 == 2 & V2 %in% c("A", "C")) %>%
summarize(sum(V4))
With data.table:
setkey(DT,V2)
DT[.N-1] # Extracts the penultimate row
DT[, print(.SD), by=V2] # See what .SD contains
DT[, .SD[c(1,.N)], by=V2] # Pick first and last row in V2 groups
DT[, lapply(.SD, sum), by=V2]
With dplyr:
DF %>% arrange(V2, V1) %>% filter(row_number()== n()-1)
for (V2_item in unique(DF$V2)){
print(DF %>% group_by(V2) %>%
filter(V2 == V2_item) %>%
ungroup() %>%
select(V1, V3, V4) %>%
arrange(V1))
} # Indirect implementation here
DF %>% group_by(V2) %>%
filter(row_number()== 1 | row_number()== n()) %>%
arrange(V2, V1) %>%
select(V2, V1, V3, V4)
DF %>% group_by(V2) %>%
summarize(sum(V1), sum(V3), sum(V4))
dplyr seems be significantly more verbose for the “Advanced Table Operations”, but the code is much easier to read.
With a new data set:
DT <- CreateExampleData()
DF <- data.frame(CreateExampleData())
With data.table:
setkey(DT,V2)
DT[, .(V4.Sum = sum(V4)), by=V2][V4.Sum > 25]
With dplyr:
DF %>% group_by(V2) %>%
summarize(V4.Sum = sum(V4)) %>%
filter(V4.Sum > 25)
Chaining is much more “natural” in dplyr from the pipe operator, but the data.table operations is also very clear.
With a new data set:
set.seed(45L)
A <- data.table(store = rep(c("A","B","C"), each=5),
week = 1:5,
price = round(runif(15,2,4),2) )
B <- data.table(store = rep(c("B","C","D"), each=6),
week = 1:6,
units = round(runif(18,0,25),0) )
C <- data.table(store = c("B","C","D"),
size = c(100,200,300),
region = c("Boston","Chicago","Houston") )
A1 <- data.frame(A)
B1 <- data.frame(B)
C1 <- data.frame(C)
With data.table:
setkey(A, store, week)
setkey(B, store, week)
setkey(C, store)
A[B] # Right outer join
B[A] # Left outer join
A[B, nomatch=0L] # Inner join
A[C[, .(store,region)], nomatch=0L] # Inner join on subset
A[C, .(store,week,price,size,region), nomatch=0L]
With dplyr:
A %>% right_join(B)
A %>% left_join(B, on = c(store, week)) %>%
select(store, week, units, price)
A %>% inner_join(B)
C %>% select(store, region) %>% inner_join(A) %>%
select(store, week, price, region)
A %>% inner_join(C) %>% select(store,week,price,size,region)
dplyr joins on default on columns with the same name, and there is also no rolling join equivalent in dplyr.
The new shift function can replace code for lagging.
X = data.table(store = rep(c("A","B"), each=10),
week = 1:10,
price = round(runif(20,2,4),2) )
setkey(X, store, week)
With the Lag Code:
X[, lag_price := X[J(store,week-1)][, price]] # First lag
X[, lag_price_3 := X[J(store,week-3)][, price]] # Third lag
With the Shift function:
X[, lag_price_alt := shift(price, 1), by = store] # First lag
X[, lag_price_alt_3 := shift(price, 3), by = store] # Third lag
X[]
X[, c("lag_price","lag_price_3", "lag_price_alt", "lag_price_alt_3") := NULL]
X = X[!c(3,4,5,8,11,12,15,19,20)]
Lag with missing periods:
X[, lag_price := X[J(store,week-1), roll=TRUE][, price]][]
Lag with missing periods with Shift Function: (This is a bit verbose and there is probably an easier way)
X[, lag_price_shift := setnames(X[,.(shift(week, 1, fill=0), shift(price, 1)),
by = store, roll = TRUE],
'V2', 'price')[,price]][]
data.table is very terse when compared to dplyr. However, dplyr is much easier to read than data.table, and this is due to its ubiquitous use of the pipe operator.
dplyr is compatable with data.table so we can ideally get the best of two worlds. For example:
DT <- CreateExampleData()
DT[, .(V4.Sum = sum(V4)), by=V2] %>% filter(V4.Sum > 25)
DT[, .(V4.Sd = sd(V4)), by=V2] %>% select(V4.Sd, V2)
Some helpful references: