causalKNN Bootstrap
In this vignette, we show how to use the bootstrap options in the causalKNN package to get a bagged estimate of the conditional average treatment effect (CATE). This vignette extends the introductory vignette.
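The examples below assume the causalKNN package has already been loaded:
library(causalKNN) # load the package (assumed installed)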
With causal KNN, we can bootstrap both the search for the optimal K and the treatment effect estimation procedure. We will use the following simulated data set throughout.
set.seed(1234L)
n <- 5000 # Number of Observations in training set
p <- 21 # Number of Features
key <- 1:n # Individual indices
bootstrap_iterations <- 5L # Number of bootstrap iterations
n_test <- 501 # Number of Observations in test set
key_test <- 1:n_test
X <- cbind(matrix(rnorm(p/3*n), n, p/3),
           matrix(5*runif(p/3*n), n, p/3),
           matrix(rexp(p/3*n), n, p/3))
X_test <- matrix(0, n_test, p)
X_test[,1] <- rnorm(n_test)
X_test[,p/3+1] <- 5*runif(n_test)
TE <- pmin(abs(X[,1] * X[,p/3+1] + 1), 4)
W <- rbinom(n, 1, 0.5)
Y <- TE * W + 5 * X[,2] + X[,3] * pmax(X[,5], 0) + 5 * sin(X[,4]) + X[,8] + rnorm(n)
k <- 160L # Number of NN to search
distance_metric <- "euclidean" # Distance metric
keep_dist <- FALSE # Drop the distance matrices
threads <- 1L # Number of threads to use
knn_index_list <- knn_index_mat(X, W, k,
                                distance_metric = distance_metric,
                                keep_dist = keep_dist,
                                threads = threads)
N_step <- 20 # Number of steps in the K grid
K_step <- 5 # Step size between candidate K values (also the starting value)
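With these settings, the search evaluates a grid of candidate K values running from K_step up to N_step * K_step. As a quick sketch (assuming this construction, which is consistent with the KNN_100 upper bound reported below), the implied grid is:
# Implied grid of candidate K values: 5, 10, ..., 100 (an assumption based on
# the K_step / N_step settings above, not a function from the package)
seq(K_step, N_step * K_step, by = K_step)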
We first bootstrap the keys and then supply the bootstrapped keys to knn_optimal_k via the bootstrap_keys argument.
boot_search_list <- lapply(1:bootstrap_iterations, function(bootstrap_index)
{
  boot_key <- sample(key, length(key), replace = TRUE)
  knn_optimal_k(X, W, Y, key, N_step, K_step, knn_index_list,
                bootstrap_keys = boot_key)
})
We then average the RMSE values across bootstrap iterations to find the optimal K value.
rmse_avg_vec <- rowMeans(sapply(boot_search_list, function(x) x$rmse_vec))
rmse_avg_vec[which.min(rmse_avg_vec)]
#> KNN_100
#> 214.5634
plot(boot_search_list[[1]]$k_vec, rmse_avg_vec, type = "l", ylab = "RMSE", xlab = "K")
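Since k_vec and rmse_avg_vec are aligned (as in the plot above), we can also recover the optimal K value itself rather than its RMSE; this is a small convenience sketch, not part of the package output:
# Candidate K with the smallest bootstrap-averaged RMSE (100 in this example)
boot_search_list[[1]]$k_vec[which.min(rmse_avg_vec)]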
In this case, the bootstrapped optimal K search procedure gives us an optimal K value of \(100\). Note that this is the upper bound of our K grid. In production, we should rerun the procedure with a larger K grid to ensure we have reached the true optimal K value. For this vignette, we will use the K value of \(100\) for demonstrative purposes.
bagged_optimal_K <- 100L
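If we wanted to extend the search beyond this boundary, one option (a sketch with arbitrary illustrative settings, not run here) is to rebuild the neighbor index with a larger k and widen the K grid:
# Sketch only: widen the K grid so the optimum need not sit at the boundary.
# k must cover the largest K we want to evaluate; 400, 40, and 10 are
# arbitrary illustrative choices.
k_wide <- 400L
knn_index_list_wide <- knn_index_mat(X, W, k_wide,
                                     distance_metric = distance_metric,
                                     keep_dist = keep_dist,
                                     threads = threads)
wide_search <- knn_optimal_k(X, W, Y, key, 40, 10, knn_index_list_wide)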
We can also use the bootstrap procedure to estimate the CATE in both the direct causal KNN and the TEP procedures.
We first need to compute the test set K nearest neighbor matrix.
knn_index_test_list <- knn_index_mat(X, W, k,
                                     DF_test = X_test,
                                     distance_metric = distance_metric,
                                     keep_dist = keep_dist,
                                     threads = threads)
Then, we can average our direct causal KNN estimates across bootstrap iterations.
boot_direct_list <- lapply(1:bootstrap_iterations, function(bootstrap_index)
{
  boot_key <- sample(key, length(key), replace = TRUE)
  causalknn_treatment_effect(W, Y, key, bagged_optimal_K,
                             knn_index_test_list,
                             key_test = key_test,
                             bootstrap_keys = boot_key)
})
direct_CATE_mat <- sapply(boot_direct_list, function(x) x$treatment_effect)
direct_CATE_DF <- data.frame(key = boot_direct_list[[1]]$key,
                             CATE = rowMeans(direct_CATE_mat))
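Because the data are simulated, we can sanity-check the bagged direct estimates against the true treatment effects on the test set. This is an illustrative check, not part of the package, and it assumes the returned key column indexes the test observations as constructed above:
# True simulated treatment effect for the test set, using the same formula as
# the training data generating process above
TE_test <- pmin(abs(X_test[, 1] * X_test[, p/3 + 1] + 1), 4)
# Root mean squared error of the bagged direct CATE estimates
sqrt(mean((direct_CATE_DF$CATE - TE_test[direct_CATE_DF$key])^2))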
We will bootstrap our causal KNN estimates and then, for each bootstrap iteration, project them using ridge regression (the TEP step). We then average the ridge predictions to obtain our bootstrap-averaged causal KNN TEP estimates.
boot_TEP_list <- lapply(1:bootstrap_iterations, function(bootstrap_index)
{
  boot_key <- sample(key, length(key), replace = TRUE)
  causal_KNN_DF <- causalknn_treatment_effect(W, Y, key, bagged_optimal_K,
                                              knn_index_list,
                                              bootstrap_keys = boot_key)
  TEP_results_list <- tep_enet(X, causal_KNN_DF$treatment_effect,
                               X_test, alpha = 0)
  list(causal_KNN_DF = causal_KNN_DF, TEP_results_list = TEP_results_list)
})
TEP_CATE_mat <- sapply(boot_TEP_list, function(x) x$TEP_results_list$test_TE)
TEP_CATE_DF <- data.frame(key = key_test,
                          CATE = rowMeans(TEP_CATE_mat))
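As a final illustrative comparison (assuming both data frames share the same test keys), we can see how closely the bagged TEP estimates track the bagged direct estimates:
# Merge the two sets of bagged CATE estimates on the test keys and compare
comparison_DF <- merge(direct_CATE_DF, TEP_CATE_DF, by = "key",
                       suffixes = c("_direct", "_TEP"))
cor(comparison_DF$CATE_direct, comparison_DF$CATE_TEP)
plot(comparison_DF$CATE_direct, comparison_DF$CATE_TEP,
     xlab = "Direct causal KNN CATE", ylab = "TEP CATE")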