PCAmatchR matches cases to controls based on genotype principal components (PC). In order to produce more genetically similar matches, a weighted Mahalanobis distance metric (Kidd et al. (1987)) is used to determine matches. Weights are equal to the percent variance explained by each PC.
The release version of PCAmatchR can be installed from CRAN:
install.packages("PCAmatchR")
The development version of the PCAmatchR package can be installed from the GitHub repository by using the devtools package:
::install_github("machiela-lab/PCAmatchR") devtools
PCAmatchR depends on the optmatch package which must be manually installed from CRAN:
install.packages("optmatch")
Following installation, attach the PCAmatchR and optmatch packages with:
library(PCAmatchR)
library(optmatch)
setMaxProblemSize(size = Inf) # optmatch option to allow for large matching problems. See optmatch documentation for full description.
Here we perform a hypothetical example of case-control matching using the Phase 3 data release of 1000 Genomes Project, which contains genotype data from 2,504 individuals from 26 distinct populations (available at https://www.cog-genomics.org/plink/2.0/resources).
Within PCAmatchR, we include sample data containing information about population, gender, and the first 20 principal components and eigenvalues calculated from 2,504 individuals in the Phase 3 data release of the 1000 Genomes Project. The example principal component analysis was conducted with PLINK using a set of ancestry informative SNPs (Yu et al. (2008)). The data files are contained within:
PCs_1000G
, A sample dataset containing information about population, gender, and the first 20 principal components calculated from 2,504 individuals in the Phase 3 data release of the 1000 Genomes Project.
eigenvalues_1000G
, A sample dataset containing the first 20 eigenvalues calculated from 2,504 individuals in the Phase 3 data release of the 1000 Genomes Project.
eigenvalues_all_1000G
, A sample dataset containing all eigenvalues calculated from 2,504 individuals in the Phase 3 data release of the 1000 Genomes Project.
# Load required packages
<- c("PCAmatchR", "optmatch")
loadedPackages invisible(lapply(loadedPackages, require, character.only = TRUE))
# Create PC data frame
<- as.data.frame(PCs_1000G[,c(1,5:24)])
pcs
# Create eigenvalues vector
<- c(eigenvalues_1000G)$eigen_values
eigen_vals
# Create full eigenvalues vector
<- c(eigenvalues_all_1000G)$eigen_values all_eigen_vals
For this example analysis, we select individuals from the ESN (Esan in Nigeria) population as cases (N=99), while all remaining samples are used as the control population (N=2,405):
<- as.data.frame(PCs_1000G[,1:4])
covariate_data$case <- ifelse(covariate_data$pop=="ESN", c(1), c(0)) covariate_data
Matching is performed using match_maker
. Within this example, cases and controls are 1:2 matched on the first 20 PCs:
<- match_maker(PC = pcs,
match_maker_outputeigen_value = eigen_vals,
data = covariate_data,
ids = c("sample"),
case_control="case",
num_controls = 2,
eigen_sum = sum(all_eigen_vals))
Derived matches are contained within the matches
object. The weighted Mahalanobis distance metric between case and control pairs is detailed within the match_distance
variable:
<- match_maker_output$matches
PCA_matches$match_distance PCA_matches
The all_eigen_vals
file is not needed to run match_maker
. The user can directly supply the scalar sum of all eigen values to the match_maker
function through the eigen_sum
input. In the above example, the sum of all the eigenvalues was 2722.856. This value is used to weight each eigen value to determine the percentage of variance explained.
If using PLINK to perform the PCA (e.g., --pca
flag), there are multiple options to calculate the sum of all eigen values instead of having to generate all PCs:
--make-rel
in PLINK. The diagonal of this matrix is equal to the eigen values. Summing the diagonal will obtain the integer to input into eigen_sum
. This approach may not be feasible when working with a large number of cases and controls, due to the file size.--ibc
in PLINK. This will output a file with 6 columns. Column 4 (Fhat1
), gives the variance-standardized relationship minus 1. Add 1 to each entry in this column, and then sum all entries. This will result in the total sum of the eigen values (i.e., the total variance explained). This value can then be used as the input to eigen_sum
.The distance between matches can be visualized using plot_maker
:
plot_maker(data=match_maker_output,
x_var="PC1",
y_var="PC2",
case_control="case",
line=F,
xlim = c(0.025,0.04),
ylim = c(-0.008,0.00))
The function further allows for connections between matches:
plot_maker(data=match_maker_output,
x_var="PC1",
y_var="PC2",
case_control="case",
line=T,
xlim = c(0.025,0.04),
ylim = c(-0.008,0.00))