fRLR: Fit Repeated Linear Regressions

library(fRLR)

Introduction

This R package fits repeated linear regressions: sets of linear regressions that share some common terms.

An Example

Let's start with the simplest situation: we want to fit a set of regressions which differ in only one variable. Specifically, denote the response variable as \(y\); the regressions are as follows.

\[ \begin{array}{ll} y&\sim x_1 + cov_1 + cov_2+\ldots+cov_m\\ y&\sim x_2 + cov_1 +cov_2+\ldots+cov_m\\ &\;\;\vdots\\ y&\sim x_n + cov_1 +cov_2+\ldots+cov_m\\ \end{array} \]

where the covariates \(cov_i,\; i=1,\ldots, m\) are shared by all of the regressions.

Ideas

Intuitively, we can accomplish this task with a simple loop (here X holds the candidate variables \(x_i\) as columns and COV holds the shared covariates):

model = vector(mode = 'list', length = n)
for (i in 1:n)
{
  # fit the i-th regression: y on X[, i] plus the shared covariates
  model[[i]] = lm(y ~ X[, i] + COV)
}

However, this is inefficient for our purpose. Recall that in linear regression the main goal is to estimate the coefficient vector \(\beta\), whose least-squares estimate is

\[ \hat\beta = (X'X)^{-1}X'Y \]

where \(X\) is the design matrix and \(Y\) is the vector of observed responses.
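As a quick sanity check, here is a minimal sketch (with made-up data) confirming that this closed-form estimate matches the coefficients returned by lm():

set.seed(1)
X = cbind(1, matrix(rnorm(20), 10, 2))   # design matrix with an intercept column
Y = rnorm(10)
beta_hat = solve(t(X) %*% X, t(X) %*% Y) # (X'X)^{-1} X'Y
all.equal(as.vector(beta_hat), unname(coef(lm(Y ~ X[, -1]))))  # should be TRUE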

Across these regressions much of the design matrix is identical, and the larger \(m\) is, the larger the identical part. So the idea is to reduce the computational cost by separating out the shared part of the design matrix.

Method

For the above example, the design matrix can be written as \(X=(x, cov)\), where \(x\) is the varying variable and \(cov\) collects the shared covariates. If we include an intercept, it too is shared among these regressions, so it can be absorbed into \(cov\) naturally. Then we have

\[ X'X= \left[ \begin{array}{cc} x'x & x'cov \\ cov'x & cov'cov \end{array} \right]= \left[ \begin{array}{cc} a& v'\\ v& B \end{array} \right] \]

The Woodbury formula tells us

\[ (A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} \]
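This identity is easy to check numerically; the following throwaway sketch uses small random matrices:

set.seed(1)
A = diag(3) + crossprod(matrix(rnorm(9), 3, 3))  # a well-conditioned 3x3 matrix
U = matrix(rnorm(6), 3, 2)
C = diag(2)
V = matrix(rnorm(6), 2, 3)
lhs = solve(A + U %*% C %*% V)
rhs = solve(A) - solve(A) %*% U %*% solve(solve(C) + V %*% solve(A) %*% U) %*% V %*% solve(A)
all.equal(lhs, rhs)  # should be TRUE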

Let

\[ A=\left[ \begin{array}{ll} a&O\\ O&B \end{array}\right],\; U=\left[ \begin{array}{ll} 1 & 0\\ O & v \end{array} \right],\; V= \left[ \begin{array}{ll} 0& v'\\ 1& O \end{array} \right] \]

and \(C=I_{2\times 2}\). One can check that \(A+UCV=X'X\), so applying the Woodbury formula gives

\[ (X'X)^{-1}=(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} \]

where

\[ A^{-1}=\left[ \begin{array}{cc} a^{-1}&O\\ O&B^{-1} \end{array} \right] \]

Carrying out the calculation and simplifying, we obtain

\[ (X'X)^{-1}=\left[ \begin{array}{cc} \frac{1}{a-v'B^{-1}v} & -\frac{v'B^{-1}}{a-v'B^{-1}v}\\ -\frac{B^{-1}v}{a-v'B^{-1}v} & B^{-1}+\frac{B^{-1}vv'B^{-1}}{a-v'B^{-1}v} \end{array} \right] \]

Notice that the matrix \(B\) (and hence \(B^{-1}\)) is the same for every regression; the only terms that change from one regression to the next are \(a=x'x\) and \(v=cov'x\), which are very cheap to compute. So theoretically we can reduce the computational cost significantly.
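Here is a small numerical check of this blockwise inverse (a sketch; x plays the role of the varying variable and cov holds the shared covariates, including the intercept column):

set.seed(1)
x = rnorm(10)
cov = cbind(1, matrix(rnorm(20), 10, 2))  # intercept plus two shared covariates
X = cbind(x, cov)
a = sum(x * x)                     # a = x'x (scalar)
v = crossprod(cov, x)              # v = cov'x
Binv = solve(crossprod(cov))       # B^{-1}, shared by all regressions
s = drop(a - t(v) %*% Binv %*% v)  # Schur complement a - v'B^{-1}v
blockinv = rbind(cbind(1/s, -t(v) %*% Binv / s),
                 cbind(-Binv %*% v / s, Binv + Binv %*% v %*% t(v) %*% Binv / s))
all.equal(unname(blockinv), unname(solve(crossprod(X))))  # should be TRUE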

Test

Now let's test the functions in this package on two simulated examples.

## use the fRLR package
set.seed(123)
X = matrix(rnorm(50), 10, 5)    # 10 observations, 5 varying variables
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)  # 4 shared covariates
frlr1(X, Y, COV)
#>   r r.p.value
#> 1 0 0.4380128
#> 2 1 0.7791076
#> 3 2 0.2212869
#> 4 3 0.9495018
#> 5 4 0.6729983

## use a simple loop
res = matrix(nrow = 0, ncol = 2)
for (i in 1:ncol(X))
{
  # refit the full regression of Y on X[, i] plus the shared covariates
  mat = cbind(X[, i], COV)
  df = as.data.frame(mat)
  model = lm(Y ~ ., data = df)
  # record the variable index and the p-value of its coefficient
  tmp = c(i, summary(model)$coefficients[2, 4])
  res = rbind(res, tmp)
}
res
res
#>     [,1]      [,2]
#> tmp    1 0.4380128
#> tmp    2 0.7791076
#> tmp    3 0.2212869
#> tmp    4 0.9495018
#> tmp    5 0.6729983

As the output above shows, the p-values for the variable of interest in each regression agree between the two methods.
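Rather than comparing the printed tables by eye, we can check the agreement programmatically (assuming, as the printed output above suggests, that frlr1() returns a data frame with a r.p.value column):

out = frlr1(X, Y, COV)
all.equal(out$r.p.value, unname(res[, 2]))  # should be TRUE when rows are in the same order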

Similarly, we can test another example, in which each regression contains two varying variables:

set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)

# regression i includes the pair (X[, idx1[i]], X[, idx2[i]])
idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)

frlr2(X, idx1, idx2, Y, COV)
#>    r1 r2 r1.p.value  r2.p.value
#> 1   1  2 0.53021406 0.895719578
#> 2   2  3 0.01812006 0.009833047
#> 3   3  4 0.29895922 0.963995969
#> 4   4  5 0.91749181 0.712075464
#> 5   1  3 0.33761507 0.210331456
#> 6   1  4 0.51074586 0.966484642
#> 7   1  5 0.12479380 0.152802911
#> 8   2  4 0.79302893 0.902402294
#> 9   2  5 0.73153760 0.663392258
#> 10  3  5 0.32367303 0.877154122

res = matrix(nrow = 0, ncol = 4)
for (i in 1:length(idx1))
{
  # refit the full regression of Y on the pair of varying variables plus COV
  mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
  df = as.data.frame(mat)
  model = lm(Y ~ ., data = df)
  # record the pair and the p-values of its two coefficients
  tmp = c(idx1[i], idx2[i], summary(model)$coefficients[2, 4], summary(model)$coefficients[3, 4])
  res = rbind(res, tmp)
}

Again, the two methods give the same results.
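We can verify this programmatically as well; since the row order of the frlr2() output need not match the loop's (the column names below are taken from the printed output above), we match rows on the (r1, r2) pair first:

out = frlr2(X, idx1, idx2, Y, COV)
res_df = data.frame(r1 = res[, 1], r2 = res[, 2], p1 = res[, 3], p2 = res[, 4])
m = merge(unique(out), res_df, by = c("r1", "r2"))
all.equal(m$r1.p.value, m$p1)  # should be TRUE
all.equal(m$r2.p.value, m$p2)  # should be TRUE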

Computation Performance

The main aim of this method is to reduce computation cost, so let's compare its speed with that of the simple-loop method.

We obtain the following time costs for \(99\times 100/2=4950\) linear regressions.

set.seed(123)
n = 100
X = matrix(rnorm(10*n), 10, n)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)

id = combn(n, 2)  # all pairs of the n variables
idx1 = id[1, ]
idx2 = id[2, ]

system.time(frlr2(X, idx1, idx2, Y, COV))
#>    user  system elapsed 
#>   2.048   0.109   0.965

simpleLoop <- function()
{
  res = matrix(nrow = 0, ncol = 4)
  for (i in 1:length(idx1))
  {
    # refit the full regression for each pair of varying variables
    mat = cbind(X[, idx1[i]], X[, idx2[i]], COV)
    df = as.data.frame(mat)
    model = lm(Y ~ ., data = df)
    tmp = c(idx1[i], idx2[i], summary(model)$coefficients[2, 4], summary(model)$coefficients[3, 4])
    res = rbind(res, tmp)
  }
  invisible(res)
}

system.time(simpleLoop())
#>    user  system elapsed 
#>   9.275   0.011   9.413
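
Note that system.time() measures a single run, which can be noisy. For a steadier comparison one could repeat the timings, for example with the microbenchmark package (assumed to be installed; a sketch):

microbenchmark::microbenchmark(
  fRLR = frlr2(X, idx1, idx2, Y, COV),
  loop = simpleLoop(),
  times = 5
)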