library(ggplot2)
library(dplyr)
library(tidyr)
library(faux)
The rnorm_multi() function makes multiple normally distributed vectors with specified parameters and relationships.
For example, the following creates a sample that has 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and C with r = 0.5, and B and C correlate with r = 0.25.
dat <- rnorm_multi(n = 100,
                   mu = c(0, 20, 20),
                   sd = c(1, 5, 5),
                   r = c(0.5, 0.5, 0.25),
                   varnames = c("A", "B", "C"),
                   empirical = FALSE)
n | var | A | B | C | mean | sd |
---|---|---|---|---|---|---|
100 | A | 1.00 | 0.49 | 0.51 | -0.04 | 1.04 |
100 | B | 0.49 | 1.00 | 0.19 | 19.95 | 4.91 |
100 | C | 0.51 | 0.19 | 1.00 | 19.64 | 4.61 |
Table: Sample stats
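A table like the one above can be rebuilt for any numeric data frame with base R. The sketch below uses a hypothetical helper, sample_stats(), which is not part of faux, and the built-in iris data as a stand-in for simulated data:

```r
# Hypothetical helper (not part of faux): one row per variable with
# n, the correlations with every variable, the mean and the SD.
sample_stats <- function(data, digits = 2) {
  data.frame(
    n = nrow(data),
    var = names(data),
    round(cor(data), digits),
    mean = round(colMeans(data), digits),
    sd = round(sapply(data, sd), digits),
    row.names = NULL
  )
}

stats <- sample_stats(iris[, 1:4])
print(stats)
```

With empirical = FALSE, the values in such a table will differ a little from the population parameters you specified, because they describe one random sample.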
You can specify the correlations in one of four ways:
If you want all the pairs to have the same correlation, just specify a single number.
bvn <- rnorm_multi(100, 5, 0, 1, .3, varnames = letters[1:5])
n | var | a | b | c | d | e | mean | sd |
---|---|---|---|---|---|---|---|---|
100 | a | 1.00 | 0.18 | 0.29 | 0.33 | 0.31 | 0.04 | 1.03 |
100 | b | 0.18 | 1.00 | 0.18 | 0.33 | 0.30 | 0.13 | 1.06 |
100 | c | 0.29 | 0.18 | 1.00 | 0.14 | 0.20 | 0.07 | 0.99 |
100 | d | 0.33 | 0.33 | 0.14 | 1.00 | 0.28 | 0.15 | 1.06 |
100 | e | 0.31 | 0.30 | 0.20 | 0.28 | 1.00 | 0.03 | 1.03 |
Table: Sample stats from a single rho
If you already have a correlation matrix, such as the output of cor(), you can specify the simulated data with that.
cmat <- cor(iris[,1:4])
bvn <- rnorm_multi(100, 4, 0, 1, cmat,
                   varnames = colnames(cmat))
n | var | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | mean | sd |
---|---|---|---|---|---|---|---|
100 | Sepal.Length | 1.00 | -0.24 | 0.87 | 0.82 | 0.09 | 0.98 |
100 | Sepal.Width | -0.24 | 1.00 | -0.58 | -0.52 | 0.07 | 1.08 |
100 | Petal.Length | 0.87 | -0.58 | 1.00 | 0.96 | 0.04 | 1.03 |
100 | Petal.Width | 0.82 | -0.52 | 0.96 | 1.00 | 0.05 | 1.04 |
Table: Sample stats from a correlation matrix
You can specify your correlation matrix by hand as a vars*vars length vector, which will include the correlations of 1 down the diagonal.
cmat <- c( 1, .3, .5,
          .3,  1,  0,
          .5,  0,  1)
bvn <- rnorm_multi(100, 3, 0, 1, cmat,
                   varnames = c("first", "second", "third"))
n | var | first | second | third | mean | sd |
---|---|---|---|---|---|---|
100 | first | 1.00 | 0.31 | 0.48 | 0.05 | 1.02 |
100 | second | 0.31 | 1.00 | 0.01 | -0.14 | 0.86 |
100 | third | 0.48 | 0.01 | 1.00 | 0.02 | 1.12 |
Table: Sample stats from a vars*vars vector
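When typing a vars*vars vector by hand, it is easy to mistype one of the duplicated values. One possible sanity check, sketched here in base R, is to reshape the vector into a matrix and confirm that it is symmetric with 1s on the diagonal. This check is a suggestion, not something rnorm_multi() is documented here to require; cvec is just an illustrative name.

```r
# The same nine values, reshaped into a 3 x 3 matrix row by row.
cvec <- c( 1, .3, .5,
          .3,  1,  0,
          .5,  0,  1)
cmat <- matrix(cvec, nrow = 3, byrow = TRUE)

# A valid correlation matrix is symmetric with 1s on the diagonal;
# stopifnot() raises an error if either condition fails.
stopifnot(isSymmetric(cmat), all(diag(cmat) == 1))
```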
You can specify your correlation matrix by hand as a vars*(vars-1)/2 length vector, skipping the diagonal and lower left duplicate values.
rho1_2 <- .3
rho1_3 <- .5
rho1_4 <- .5
rho2_3 <- .2
rho2_4 <- 0
rho3_4 <- -.3
cmat <- c(rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4)
bvn <- rnorm_multi(100, 4, 0, 1, cmat,
                   varnames = letters[1:4])
n | var | a | b | c | d | mean | sd |
---|---|---|---|---|---|---|---|
100 | a | 1.00 | 0.29 | 0.61 | 0.41 | -0.10 | 1.06 |
100 | b | 0.29 | 1.00 | 0.23 | -0.03 | 0.09 | 1.14 |
100 | c | 0.61 | 0.23 | 1.00 | -0.28 | 0.08 | 1.17 |
100 | d | 0.41 | -0.03 | -0.28 | 1.00 | -0.12 | 0.97 |
Table: Sample stats from a (vars*(vars-1)/2) vector
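To see which full matrix a vars*(vars-1)/2 vector stands for, you can expand it yourself. The base R sketch below assumes the ordering used above (rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4); the names tri and full are illustrative, and this is not necessarily how faux does the conversion internally.

```r
vars <- 4
tri <- c(.3, .5, .5, .2, 0, -.3)  # rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4
stopifnot(length(tri) == vars * (vars - 1) / 2)

full <- diag(vars)  # start with 1s on the diagonal

# lower.tri() indexes column by column: (2,1), (3,1), (4,1), (3,2), (4,2), (4,3),
# which matches the row-by-row order of the upper triangle used above.
full[lower.tri(full)] <- tri

# Mirror the lower triangle into the upper triangle.
full[upper.tri(full)] <- t(full)[upper.tri(full)]
print(full)
```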
If you want your samples to have the exact correlations, means, and SDs you entered, set empirical to TRUE.
bvn <- rnorm_multi(100, 5, 0, 1, .3,
                   varnames = letters[1:5],
                   empirical = TRUE)
n | var | a | b | c | d | e | mean | sd |
---|---|---|---|---|---|---|---|---|
100 | a | 1.0 | 0.3 | 0.3 | 0.3 | 0.3 | 0 | 1 |
100 | b | 0.3 | 1.0 | 0.3 | 0.3 | 0.3 | 0 | 1 |
100 | c | 0.3 | 0.3 | 1.0 | 0.3 | 0.3 | 0 | 1 |
100 | d | 0.3 | 0.3 | 0.3 | 1.0 | 0.3 | 0 | 1 |
100 | e | 0.3 | 0.3 | 0.3 | 0.3 | 1.0 | 0 | 1 |
Table: Sample stats with empirical = TRUE
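The difference between the two settings can be illustrated without faux. The sketch below uses MASS::mvrnorm(), which has an empirical argument with the same meaning; using it here is an assumption made for illustration, not a statement about how rnorm_multi() is implemented.

```r
library(MASS)  # for mvrnorm(); an assumption, the text above uses faux

set.seed(1)
sigma <- matrix(.3, nrow = 5, ncol = 5)
diag(sigma) <- 1  # SDs of 1, so covariance and correlation matrices coincide

# Parameters describe the population; the sample only approximates them.
sampled <- mvrnorm(100, mu = rep(0, 5), Sigma = sigma, empirical = FALSE)

# Parameters describe the sample itself; they are reproduced exactly.
exact <- mvrnorm(100, mu = rep(0, 5), Sigma = sigma, empirical = TRUE)

round(cor(sampled)[1, 2], 2)  # varies from sample to sample
round(cor(exact)[1, 2], 2)    # equals the requested correlation
```

Exact sample parameters are convenient for demonstrations, but for simulation studies of sampling variability you usually want empirical = FALSE.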
Use rnorm_pre() to create a vector with a specified correlation to one or more pre-existing variables. The following code creates a new column called B with a mean of 10, SD of 2 and a correlation of r = 0.5 to the A column.
dat <- rnorm_multi(varnames = "A") %>%
  mutate(B = rnorm_pre(A, mu = 10, sd = 2, r = 0.5))
n | var | A | B | mean | sd |
---|---|---|---|---|---|
100 | A | 1.00 | 0.37 | -0.03 | 1.10 |
100 | B | 0.37 | 1.00 | 10.02 | 2.28 |
Set empirical = TRUE to return a vector with the exact specified parameters.
dat$C <- rnorm_pre(dat$A, mu = 10, sd = 2, r = 0.5, empirical = TRUE)
n | var | A | B | C | mean | sd |
---|---|---|---|---|---|---|
100 | A | 1.00 | 0.37 | 0.50 | -0.03 | 1.10 |
100 | B | 0.37 | 1.00 | 0.15 | 10.02 | 2.28 |
100 | C | 0.50 | 0.15 | 1.00 | 10.00 | 2.00 |
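For intuition about how a new vector can hit an exact sample correlation with an existing one, here is a minimal base R sketch. It is one standard construction (mixing the standardised original with noise that is uncorrelated with it) and is not claimed to be the method rnorm_pre() uses; all variable names are illustrative.

```r
set.seed(8675309)
x <- rnorm(100)
r <- 0.5

# Standardise x, then keep only the part of fresh noise that is
# uncorrelated with x (the residuals of a regression on x).
zx <- (x - mean(x)) / sd(x)
e  <- resid(lm(rnorm(100) ~ zx))
ze <- e / sd(e)

# Mix the two standardised parts so the sample correlation with x is r,
# then rescale to the requested mean (10) and SD (2).
y <- 10 + 2 * (r * zx + sqrt(1 - r^2) * ze)

c(r = cor(x, y), mean = mean(y), sd = sd(y))
```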
You can also specify correlations to more than one vector by setting the first argument to a data frame containing only the continuous columns and r to the correlation with each column.
dat$D <- rnorm_pre(dat, r = c(.1, .2, .3), empirical = TRUE)
n | var | A | B | C | D | mean | sd |
---|---|---|---|---|---|---|---|
100 | A | 1.00 | 0.37 | 0.50 | 0.1 | -0.03 | 1.10 |
100 | B | 0.37 | 1.00 | 0.15 | 0.2 | 10.02 | 2.28 |
100 | C | 0.50 | 0.15 | 1.00 | 0.3 | 10.00 | 2.00 |
100 | D | 0.10 | 0.20 | 0.30 | 1.0 | 0.00 | 1.00 |
Not all correlation patterns are possible, so you’ll get a warning if the correlations you ask for are impossible.
dat$E <- rnorm_pre(dat, r = .9)
#> Warning in rnorm_pre(dat, r = 0.9): Correlations are impossible.
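One way to see why a request like r = .9 with every existing column cannot be satisfied is to assemble the full correlation matrix it implies and inspect its eigenvalues: a set of correlations is only achievable if no eigenvalue is negative. The base R sketch below uses the rounded correlations among A, B, C and D from the table above; it illustrates the general criterion and is not a description of the check rnorm_pre() performs.

```r
# Correlations among A-D as shown in the table above (rounded to 2 decimals).
existing <- matrix(c(1.00, 0.37, 0.50, 0.10,
                     0.37, 1.00, 0.15, 0.20,
                     0.50, 0.15, 1.00, 0.30,
                     0.10, 0.20, 0.30, 1.00), nrow = 4, byrow = TRUE)

# Add a fifth variable that correlates r = .9 with each existing column.
requested <- rbind(cbind(existing, 0.9), c(rep(0.9, 4), 1))

# A negative smallest eigenvalue means no data set can have these correlations.
min_eigen <- min(eigen(requested, symmetric = TRUE, only.values = TRUE)$values)
min_eigen < 0
```

Intuitively, A through D are only weakly to moderately related to each other, so no single variable can correlate .9 with all four at once.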