Simulate Correlated Variables

Lisa DeBruine

2021-09-13

library(ggplot2)
library(dplyr)
library(tidyr)
library(faux)

The rnorm_multi() function makes multiple normally distributed vectors with specified parameters and relationships.

Quick example

For example, the following creates a sample that has 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and C with r = 0.5, and B and C correlate with r = 0.25.

dat <- rnorm_multi(n = 100, 
                  mu = c(0, 20, 20),
                  sd = c(1, 5, 5),
                  r = c(0.5, 0.5, 0.25), 
                  varnames = c("A", "B", "C"),
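                  empirical = FALSE)

| n   | var | A    | B    | C    | mean  | sd   |
|-----|-----|------|------|------|-------|------|
| 100 | A   | 1.00 | 0.49 | 0.51 | -0.04 | 1.04 |
| 100 | B   | 0.49 | 1.00 | 0.19 | 19.95 | 4.91 |
| 100 | C   | 0.51 | 0.19 | 1.00 | 19.64 | 4.61 |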

Table: Sample stats
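The means, SDs and correlations in the quick example jointly describe the population the sample is drawn from. As a minimal base-R sketch of how those pieces fit together, each covariance is the correlation multiplied by the two SDs. The object names (`vars`, `sds`, `rmat`, `sigma`) are illustrative assumptions and are not part of faux.

```r
# Illustrative names only; this is not a claim about faux internals.
vars <- c("A", "B", "C")
sds  <- c(1, 5, 5)

# Correlation matrix for r = c(0.5, 0.5, 0.25), i.e. A-B, A-C, B-C
rmat <- matrix(c(1,   0.5,  0.5,
                 0.5, 1,    0.25,
                 0.5, 0.25, 1),
               nrow = 3, dimnames = list(vars, vars))

# cov(i, j) = r(i, j) * sd(i) * sd(j)
sigma <- diag(sds) %*% rmat %*% diag(sds)
dimnames(sigma) <- list(vars, vars)

sigma["A", "B"]  # 0.5 * 1 * 5  = 2.5
sigma["B", "C"]  # 0.25 * 5 * 5 = 6.25
```

The diagonal of `sigma` holds the variances (1, 25, 25), which is why B and C vary so much more than A in the sample table.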

Specify correlations

You can specify the correlations in one of four ways:

One Number

If you want all the pairs to have the same correlation, just specify a single number.

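bvn <- rnorm_multi(100, 5, 0, 1, .3, varnames = letters[1:5])

| n   | var | a    | b    | c    | d    | e    | mean | sd   |
|-----|-----|------|------|------|------|------|------|------|
| 100 | a   | 1.00 | 0.18 | 0.29 | 0.33 | 0.31 | 0.04 | 1.03 |
| 100 | b   | 0.18 | 1.00 | 0.18 | 0.33 | 0.30 | 0.13 | 1.06 |
| 100 | c   | 0.29 | 0.18 | 1.00 | 0.14 | 0.20 | 0.07 | 0.99 |
| 100 | d   | 0.33 | 0.33 | 0.14 | 1.00 | 0.28 | 0.15 | 1.06 |
| 100 | e   | 0.31 | 0.30 | 0.20 | 0.28 | 1.00 | 0.03 | 1.03 |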

Table: Sample stats from a single rho

Matrix

If you already have a correlation matrix, such as the output of cor(), you can pass it directly as the r argument.

cmat <- cor(iris[,1:4])
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
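                  varnames = colnames(cmat))

| n   | var          | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | mean | sd   |
|-----|--------------|--------------|-------------|--------------|-------------|------|------|
| 100 | Sepal.Length | 1.00         | -0.24       | 0.87         | 0.82        | 0.09 | 0.98 |
| 100 | Sepal.Width  | -0.24        | 1.00        | -0.58        | -0.52       | 0.07 | 1.08 |
| 100 | Petal.Length | 0.87         | -0.58       | 1.00         | 0.96        | 0.04 | 1.03 |
| 100 | Petal.Width  | 0.82         | -0.52       | 0.96         | 1.00        | 0.05 | 1.04 |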

Table: Sample stats from a correlation matrix

Vector (vars*vars)

You can specify your correlation matrix by hand as a vector of length vars*vars, listing every cell of the matrix, including the 1s on the diagonal.

cmat <- c(1, .3, .5,
          .3, 1, 0,
          .5, 0, 1)
bvn <- rnorm_multi(100, 3, 0, 1, cmat, 
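                  varnames = c("first", "second", "third"))

| n   | var    | first | second | third | mean  | sd   |
|-----|--------|-------|--------|-------|-------|------|
| 100 | first  | 1.00  | 0.31   | 0.48  | 0.05  | 1.02 |
| 100 | second | 0.31  | 1.00   | 0.01  | -0.14 | 0.86 |
| 100 | third  | 0.48  | 0.01   | 1.00  | 0.02  | 1.12 |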

Table: Sample stats from a vars*vars vector
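When typing a vars*vars vector by hand, it is easy to mistype one of the mirrored cells. The following base-R sketch reshapes the vector into a matrix and checks that it is symmetric with 1s on the diagonal before it is used. The names `cvec`, `vnames` and `cmat_check` are illustrative and not part of faux.

```r
# Illustrative sanity check; faux may run its own validation.
cvec <- c(1, .3, .5,
          .3, 1, 0,
          .5, 0, 1)

vnames <- c("first", "second", "third")
cmat_check <- matrix(cvec, nrow = 3, dimnames = list(vnames, vnames))

isSymmetric(cmat_check)        # mirrored cells agree
all(diag(cmat_check) == 1)     # each variable correlates 1 with itself
cmat_check["first", "third"]   # 0.5
```

Because a valid correlation matrix is symmetric, it does not matter whether the vector is read row by row or column by column.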

Vector (vars*(vars-1)/2)

You can specify your correlation matrix by hand as a vector of length vars*(vars-1)/2, listing only the upper-right triangle row by row and skipping the diagonal and the duplicate lower-left values.

rho1_2 <- .3
rho1_3 <- .5
rho1_4 <- .5
rho2_3 <- .2
rho2_4 <- 0
rho3_4 <- -.3
cmat <- c(rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4)
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
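                  varnames = letters[1:4])

| n   | var | a    | b     | c     | d     | mean  | sd   |
|-----|-----|------|-------|-------|-------|-------|------|
| 100 | a   | 1.00 | 0.29  | 0.61  | 0.41  | -0.10 | 1.06 |
| 100 | b   | 0.29 | 1.00  | 0.23  | -0.03 | 0.09  | 1.14 |
| 100 | c   | 0.61 | 0.23  | 1.00  | -0.28 | 0.08  | 1.17 |
| 100 | d   | 0.41 | -0.03 | -0.28 | 1.00  | -0.12 | 0.97 |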

Table: Sample stats from a (vars*(vars-1)/2) vector
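To see which cell each element of the short vector ends up in, it can help to expand it into the full matrix. This is a base-R sketch under the ordering used in the example above (rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4); the names `tri` and `full` are illustrative and not part of faux.

```r
# Illustrative expansion of a triangle vector into a full correlation matrix.
k   <- 4
tri <- c(.3, .5, .5, .2, 0, -.3)  # rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4

full <- diag(k)  # start from 1s on the diagonal

# R fills the lower triangle column by column: (2,1), (3,1), (4,1), (3,2), ...
# which matches the upper triangle read row by row.
full[lower.tri(full)] <- tri

# Mirror the lower triangle into the upper triangle
full[upper.tri(full)] <- t(full)[upper.tri(full)]
dimnames(full) <- list(letters[1:k], letters[1:k])

full["a", "b"]  # 0.3
full["c", "d"]  # -0.3
```

The vector length follows from the shape: 4 variables give 4*3/2 = 6 unique pairs.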

empirical

If you want your samples to have the exact correlations, means, and SDs you entered, set empirical to TRUE.

bvn <- rnorm_multi(100, 5, 0, 1, .3, 
                  varnames = letters[1:5], 
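                  empirical = TRUE)

| n   | var | a   | b   | c   | d   | e   | mean | sd |
|-----|-----|-----|-----|-----|-----|-----|------|----|
| 100 | a   | 1.0 | 0.3 | 0.3 | 0.3 | 0.3 | 0    | 1  |
| 100 | b   | 0.3 | 1.0 | 0.3 | 0.3 | 0.3 | 0    | 1  |
| 100 | c   | 0.3 | 0.3 | 1.0 | 0.3 | 0.3 | 0    | 1  |
| 100 | d   | 0.3 | 0.3 | 0.3 | 1.0 | 0.3 | 0    | 1  |
| 100 | e   | 0.3 | 0.3 | 0.3 | 0.3 | 1.0 | 0    | 1  |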

Table: Sample stats with empirical = TRUE
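The difference between population and sample parameters can also be seen outside faux. MASS::mvrnorm() has an argument of the same name, so the sketch below uses it as an analogy; it is not meant as a description of how rnorm_multi() works internally. The seed and object names are arbitrary.

```r
# Analogy using MASS::mvrnorm(); called with MASS:: so that loading MASS
# does not mask dplyr::select().
set.seed(1)  # arbitrary seed

# Population correlation matrix: 0.3 everywhere off the diagonal
sigma <- matrix(0.3, nrow = 5, ncol = 5)
diag(sigma) <- 1

x_pop <- MASS::mvrnorm(n = 100, mu = rep(0, 5), Sigma = sigma, empirical = FALSE)
x_emp <- MASS::mvrnorm(n = 100, mu = rep(0, 5), Sigma = sigma, empirical = TRUE)

# Largest deviation of the sample correlations from 0.3
max(abs(cor(x_pop) - sigma))  # sampling error, varies with the seed
max(abs(cor(x_emp) - sigma))  # zero up to floating-point error
```

Exact sample statistics are convenient for demonstrations, but they remove the sampling variability that a power simulation usually needs, so empirical = FALSE is typically the better choice there.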

Pre-existing variables

Use rnorm_pre() to create a vector with a specified correlation to one or more pre-existing variables. The following code creates a new column called B with a mean of 10, an SD of 2, and a correlation of r = 0.5 with the A column.

dat <- rnorm_multi(varnames = "A") %>%
  mutate(B = rnorm_pre(A, mu = 10, sd = 2, r = 0.5))

| n   | var | A    | B    | mean  | sd   |
|-----|-----|------|------|-------|------|
| 100 | A   | 1.00 | 0.37 | -0.03 | 1.10 |
| 100 | B   | 0.37 | 1.00 | 10.02 | 2.28 |

Set empirical = TRUE to return a vector with the exact specified parameters.

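dat$C <- rnorm_pre(dat$A, mu = 10, sd = 2, r = 0.5, empirical = TRUE)

| n   | var | A    | B    | C    | mean  | sd   |
|-----|-----|------|------|------|-------|------|
| 100 | A   | 1.00 | 0.37 | 0.50 | -0.03 | 1.10 |
| 100 | B   | 0.37 | 1.00 | 0.15 | 10.02 | 2.28 |
| 100 | C   | 0.50 | 0.15 | 1.00 | 10.00 | 2.00 |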
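One way to build a vector with an exact sample correlation to an existing one is to mix the standardised existing vector with noise that has been made uncorrelated with it. The base-R sketch below illustrates that idea only; it is an assumption-laden illustration, not a description of how rnorm_pre() is implemented, and the helper `z()` and the other names are hypothetical.

```r
set.seed(2)  # arbitrary seed, for reproducibility
n <- 100
a <- rnorm(n)  # stands in for the pre-existing variable
r <- 0.5       # target correlation

# Hypothetical helper: standardise to sample mean 0 and sample SD 1
z <- function(x) (x - mean(x)) / sd(x)

# Noise with the linear effect of a removed, so it is exactly
# uncorrelated with a in this sample
noise <- rnorm(n)
e <- unname(resid(lm(noise ~ a)))

# Mix the two standardised parts, then rescale to mean 10 and SD 2
b <- 10 + 2 * (r * z(a) + sqrt(1 - r^2) * z(e))

c(r = cor(a, b), mean = mean(b), sd = sd(b))
```

The weights r and sqrt(1 - r^2) are chosen so that the mixture has variance r^2 + (1 - r^2) = 1 before rescaling.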

You can also specify correlations with more than one vector by passing a data frame that contains only the continuous columns as the first argument and setting r to a vector with one correlation per column.

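dat$D <- rnorm_pre(dat, r = c(.1, .2, .3), empirical = TRUE)

| n   | var | A    | B    | C    | D   | mean  | sd   |
|-----|-----|------|------|------|-----|-------|------|
| 100 | A   | 1.00 | 0.37 | 0.50 | 0.1 | -0.03 | 1.10 |
| 100 | B   | 0.37 | 1.00 | 0.15 | 0.2 | 10.02 | 2.28 |
| 100 | C   | 0.50 | 0.15 | 1.00 | 0.3 | 10.00 | 2.00 |
| 100 | D   | 0.10 | 0.20 | 0.30 | 1.0 | 0.00  | 1.00 |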

Not all correlation patterns are possible, so you’ll get a warning if the correlations you ask for cannot all hold at once.

dat$E <- rnorm_pre(dat, r = .9)
#> Warning in rnorm_pre(dat, r = 0.9): Correlations are impossible.
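A set of correlations is only possible if the full correlation matrix is positive definite, meaning that all of its eigenvalues are greater than zero. The base-R sketch below shows that check on a simplified case with two uncorrelated existing variables and one new variable. `is_possible()` and `with_r()` are hypothetical helpers, not faux functions, and this is not necessarily the check that rnorm_pre() runs.

```r
# Hypothetical helper: TRUE if the matrix is positive definite
is_possible <- function(cmat) {
  min(eigen(cmat, symmetric = TRUE, only.values = TRUE)$values) > 0
}

# Hypothetical helper: two uncorrelated variables (x, y) plus a new
# variable that correlates r with each of them
with_r <- function(r) {
  matrix(c(1, 0, r,
           0, 1, r,
           r, r, 1), nrow = 3)
}

is_possible(with_r(0.5))  # TRUE
is_possible(with_r(0.9))  # FALSE
```

In this simplified case the squared correlations with the two uncorrelated variables cannot sum to more than 1, and 0.9^2 + 0.9^2 = 1.62 does.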