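The function knockoff.filter is a wrapper around several simpler functions that

1. construct the knockoff variables (functions with the prefix create),
2. compute the test statistics W (functions with the prefix stat),
3. compute the rejection threshold for variable selection (knockoff.threshold).

These functions may be called directly if desired. The purpose of this vignette is to illustrate the flexibility of this package with some examples.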
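As a point of reference for the manual steps, here is a minimal sketch of the wrapper being called with custom components. The problem sizes, the helper names gaussian_knockoffs and lasso_stat, and the choice of method="equi" (picked only to avoid solving an SDP in a small demo) are assumptions made for illustration, not settings taken from this vignette.

```r
library(knockoff)
set.seed(1234)

# Small illustrative problem (sizes are assumptions, chosen to run quickly)
n = 200; p = 20; k = 5; amplitude = 10
mu = rep(0, p)
Sigma = toeplitz(0.10^(0:(p - 1)))
X = matrix(rnorm(n * p), n) %*% chol(Sigma)
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero) / sqrt(n)
y = factor(rbinom(n, prob = plogis(X %*% beta), size = 1),
           levels = c(0, 1), labels = c("A", "B"))

# Hypothetical helper: knockoff construction with known mu and Sigma
gaussian_knockoffs = function(X) create.gaussian(X, mu, Sigma, method = "equi")

# Hypothetical helper: lasso coefficient-difference statistic, 10-fold CV
lasso_stat = function(X, X_k, y) {
  stat.glmnet_coefdiff(X, X_k, y, nfolds = 10, family = "binomial")
}

# The wrapper runs construction, statistics and thresholding in one call
result = knockoff.filter(X, y, knockoffs = gaussian_knockoffs,
                         statistic = lasso_stat, fdr = 0.2, offset = 1)
print(result$selected)
print(result$threshold)
```

Passing the components as functions keeps the wrapper convenient while still allowing each stage to be swapped out.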
set.seed(1234)
library(knockoff)
Let us begin by creating some synthetic data. For simplicity, we will use synthetic data constructed from a generalized linear model such that the response only depends on a small fraction of the variables.
# Problem parameters
n = 1000          # number of observations
p = 1000          # number of variables
k = 60            # number of variables with nonzero coefficients
amplitude = 7.5   # signal amplitude (for noise level = 1)

# Generate the variables from a multivariate normal distribution
mu = rep(0,p)
rho = 0.10
Sigma = toeplitz(rho^(0:(p-1)))
X = matrix(rnorm(n*p),n) %*% chol(Sigma)

# Generate the response from a logistic model and encode it as a factor.
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero) / sqrt(n)
invlogit = function(x) exp(x) / (1+exp(x))
y.sample = function(x) rbinom(n, prob=invlogit(x %*% beta), size=1)
y = factor(y.sample(X), levels=c(0,1), labels=c("A","B"))
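Multiplying standard normal draws by chol(Sigma) gives rows with covariance Sigma, because chol returns an upper triangular factor R with t(R) %*% R equal to Sigma. The following base-R sketch checks this on a small example; the names p_small, Sigma_small and X_small and the sample size are assumptions made for illustration.

```r
set.seed(1)

# Small Toeplitz covariance with the same structure as in the text
p_small = 5
rho = 0.10
Sigma_small = toeplitz(rho^(0:(p_small - 1)))

# chol() returns upper triangular R such that t(R) %*% R == Sigma_small
R = chol(Sigma_small)
print(max(abs(crossprod(R) - Sigma_small)))

# Rows of Z are N(0, I); rows of Z %*% R are then N(0, Sigma_small)
Z = matrix(rnorm(20000 * p_small), 20000)
X_small = Z %*% R

# The empirical covariance should be close to Sigma_small
print(round(cov(X_small), 2))
```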
Instead of using knockoff.filter directly, we can run the filter manually by calling its main components one by one.
The first step is to generate the knockoff variables for the true Gaussian distribution of the variables.
X_k = create.gaussian(X, mu, Sigma)
Then, we compute the knockoff statistics using 10-fold cross-validated lasso
W = stat.glmnet_coefdiff(X, X_k, y, nfolds=10, family="binomial")
Now we can compute the rejection threshold
thres = knockoff.threshold(W, fdr=0.2, offset=1)
The final step is to select the variables
selected = which(W >= thres)
print(selected)
## integer(0)
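An empty selection like this one is a legitimate outcome of the thresholding rule. The base-R sketch below implements one reading of the knockoff+ rule: take the smallest t among the nonzero values of |W| for which (offset + #{W <= -t}) / max(1, #{W >= t}) is at most the target FDR, and return Inf if no such t exists. The function threshold_sketch and the toy statistics are assumptions for illustration; this is not the package's own implementation.

```r
# Hypothetical helper illustrating the knockoff+ thresholding rule
threshold_sketch = function(W, fdr = 0.2, offset = 1) {
  # Candidate thresholds: the distinct nonzero magnitudes of W, ascending
  candidates = sort(unique(abs(W[W != 0])))
  for (t in candidates) {
    # Estimated proportion of false discoveries at threshold t
    fdp_hat = (offset + sum(W <= -t)) / max(1, sum(W >= t))
    if (fdp_hat <= fdr) return(t)
  }
  Inf  # no threshold achieves the target level
}

# Toy statistics with many large positive values: a finite threshold exists
W_strong = c(5, 4, 3, -1, 2, 0.5, 6, 7, 8, 9)
t_strong = threshold_sketch(W_strong, fdr = 0.2, offset = 1)
print(t_strong)                    # 2
print(which(W_strong >= t_strong)) # 1 2 3 5 7 8 9 10

# Toy statistics with too few positive values: the threshold is Inf
W_weak = c(3, 2, 1, -0.5)
t_weak = threshold_sketch(W_weak, fdr = 0.2, offset = 1)
print(which(W_weak >= t_weak))     # integer(0)
```

With offset=1 and fdr=0.2 the ratio can only drop to 0.2 when at least five statistics lie at or above the threshold, so weak signals lead to no selections at all.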
The false discovery proportion is
fdp = function(selected) sum(beta[selected] == 0) / max(1, length(selected))
fdp(selected)
## [1] 0
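To make the definition concrete, here is a small worked example on a toy coefficient vector. The helpers fdp_toy and power_toy take beta as an explicit argument and are assumptions for illustration; the power helper in particular is not defined in this vignette.

```r
# Toy ground truth: variables 1, 4 and 7 are the true signals
beta_toy = c(1.5, 0, 0, 2.0, 0, 0, 0.7, 0)

# Hypothetical selection: two true signals (1, 4) and one null (3)
selected_toy = c(1, 3, 4)

# Fraction of selected variables that are null (0 if nothing is selected)
fdp_toy = function(selected, beta) {
  sum(beta[selected] == 0) / max(1, length(selected))
}

# Fraction of true signals that were selected
power_toy = function(selected, beta) {
  sum(beta[selected] != 0) / max(1, sum(beta != 0))
}

print(fdp_toy(selected_toy, beta_toy))    # 1/3
print(power_toy(selected_toy, beta_toy))  # 2/3
print(fdp_toy(integer(0), beta_toy))      # 0: empty selection, no false discoveries
```

The max(1, ...) in the denominator is what makes the FDP of an empty selection equal to 0 rather than NaN.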
We show how to manually run the knockoff filter multiple times and compute average quantities. This is particularly useful to estimate the FDR (or the power) for a particular configuration of the knockoff filter on artificial problems.
# Optimize the parameters needed for generating Gaussian knockoffs,
# by solving an SDP to minimize correlations with the original variables.
# This calculation requires only the model parameters mu and Sigma,
# not the observed variables X. Therefore, there is no reason to perform it
# more than once for our simulation.
diag_s = create.solve_asdp(Sigma)

# Compute the fdp over 20 iterations
nIterations = 20
fdp_list = sapply(1:nIterations, function(it) {
  # Run the knockoff filter manually, using the pre-computed value of diag_s
  X_k = create.gaussian(X, mu, Sigma, diag_s=diag_s)
  W = stat.glmnet_lambdasmax(X, X_k, y, family="binomial")
  t = knockoff.threshold(W, fdr=0.2, offset=1)
  selected = which(W >= t)
  # Compute and store the fdp
  fdp(selected)
})
# Estimate the FDR
mean(fdp_list)
## [1] 0.09537065
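The reason the knockoff parameters can be computed once is that they are a function of Sigma alone. The base-R sketch below makes this concrete with the simpler equicorrelated construction, which is a different choice from the approximate SDP used in the simulation: for a correlation matrix, every entry of the diagonal is set to min(2 * lambda_min(Sigma), 1). The names Sigma_demo and s_equi and the value rho_demo = 0.5 are assumptions for illustration; the package provides create.solve_equi for this construction.

```r
# Illustrative Toeplitz correlation matrix (rho_demo is an assumption)
p_small = 50
rho_demo = 0.5
Sigma_demo = toeplitz(rho_demo^(0:(p_small - 1)))

# Equicorrelated choice: the same value s for every variable
lambda_min = min(eigen(Sigma_demo, symmetric = TRUE, only.values = TRUE)$values)
s_equi = rep(min(2 * lambda_min, 1), p_small)
print(s_equi[1])

# Valid knockoffs require 2 * Sigma - diag(s) to be positive semidefinite
G_check = 2 * Sigma_demo - diag(s_equi)
min_eig = min(eigen(G_check, symmetric = TRUE, only.values = TRUE)$values)
print(min_eig >= -1e-8)
```

No data matrix appears anywhere in this computation, which is why the result can be reused across simulation runs.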
If you want to see some basic usage of the knockoff filter, see the introductory vignette. If you want to see how to use knockoffs for Fixed-X variables, see the Fixed-X vignette.