Getting Started With prewas

Introduction to prewas

prewas is a tool to standardize the pre-processing of your genomic data before performing a bacterial genome-wide association study (bGWAS). Currently, prewas can pre-process single nucleotide variants (SNVs) and small insertions and deletions (indels).

prewas creates a variant matrix (where each row is a variants, each column is a sample, and the entries are presence - 1 - or absence - 0 - of the variants) that can be used as input for bGWAS tools, such as hogwash or treeWAS.

When creating the binary variant matrix, prewas can perform 3 pre-processing steps including:

dealing with multiallelic variants,
(optional) dealing with variants in overlapping genes, and
choosing a reference allele.

These 3 steps in the prewas workflow are shown below. (B) shows how prewas handles multiallelic sites, (C) shows options for choosing a reference allele and (D) shows how prewas can group variants into genes. A detailed workflow indicating required file inputs can be seen below in the “Detailed prewas workflow” section.

prewas workflow

1. Dealing with multiallelic variants

A multiallelic site is a site with more than two alleles present at a locus. Most bGWAS methods require a binary input (variants presence or absence) - so multiallelic sites don’t fit into this mold.

One could consider the reference allele 0 and any alternative allele as 1. However, each alternative allele could confer a different functional impact on the expressed protein. Thus, lumping the alternative alleles into one category will take away the ability to study these functional differences.

prewas uses the multi-line format to represent mutliallelic variants. Below are examples of how a triallelic site (T, A, and G) can be represented in a single line of a variant matrix compared to multiple lines of a variant matrix.

Single-line format:

T -> A, G

Multi-line format:

T->A
T->G

2. Dealing with variants in overlapping genes

A variants in an overlapping gene could have a different impact on those two genes. Therefore, when a gff is provided, prewas will output a binary variant matrix where a variants in x overlapping genes will be represented on x lines. When a gff is provided, prewas will also output a gene-based variant matrix indicating presence or absence of at least 1 variants in that gene, and variants in overlapping genes will be assigned to both genes.

3. Choosing a reference allele

A reference allele, the allele that will be denoted 0 in a binary matrix could be:

the allele in a reference genome
the major allele at each position
the ancestral allele at each position

Choosing a reference allele can be particularly important when doing gene-based analysis and therefore aggregating variants by gene. We suggest referencing to the ancestral allele for evolutionary interpretability. In cases where ancestral reconstruction is not feasible (e.g. computational intensity) or low confidence, we suggest referencing to the major allele instead of referencing to an arbitrary reference genome because using the major allele leads to less masking of variation at the gene level.

Reference to the major allele to avoid masking variation at the gene level

When ancestral reconstruction is not used (anc = FALSE), prewas will use the major allele as the reference allele.

Usage

At minimum, prewas requires a .vcf, but can also take in a phylogenetic tree, the name of the outgroup in that tree (if any), and a .gff for use with gene-based analysis.

Below, we’ll explore some usage examples using toy data built into the package.

library(prewas)
vcf = prewas::vcf
gff = prewas::gff
tree = prewas::tree
outgroup = prewas::outgroup

Minimal prewas run:

Inputs:

required vcf file

Specifications:

No ancestral reconstruction

results_minimal = prewas(dna = vcf, 
                         anc = FALSE)
#> Loading required package: vcfR
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.9.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****

Will output a list containing:

allele_mat: An allele matrix, created from the vcf where each multiallelic site will be on its own line. The rowname will be the position of the variants in the vcf file. If the position is triallelic, there will be two rows containing the same information. The rows will be labeled “pos” and “pos.1”. If the position is quadallelic, there will be three rows containing the same information. The rows will be labeled “pos”, “pos.1”, and “pos.2”.
bin_mat: A binary matrix, the same dimensions as the allele matrix and with corresponding row names, where 0 is the reference allele and 1 indicates a variants. The reference allele is the major allele (because anc = FALSE).
ar_results: Will indicate the allele used as the reference allele (in this case, the major allele). The number of rows is the length of the number of rows of bin_mat.
dup: An index that identifies duplicated rows. If the index is unique (appears once), that means it is not a multiallelic site. If the index appears more than once, that means the row was replicated x times, where x is the number of alternative alleles.

table(results_minimal$dup)
#> 
#>   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
#>   1   1   1   1   2   2   1   1   1   1   1   1   1   1   1   1   1   1   2   1 
#>  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
#>   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1 
#>  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
#>   2   1   1   1   1   1   1   2   2   1   1   1   2   2   1   1   1   2   1   1 
#>  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
#>   2   1   1   1   1   1   1   1   2   1   1   1   1   1   2   1   1   1   1   1 
#>  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
#>   1   1   1   1   2   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1 
#> 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   3   1   1 
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
#>   1   1   2   2   1   1   1   2   1   1   2   1   1   1   2   1   1   1   1   1 
#> 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 
#>   1   2   2   1   1   1   2   1   1   1   1   2   1   1   1   1   1   1   1   1 
#> 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
#>   1   1   1   1   2   1   1   1   1   2   2   1   1   1   1   1   1   1   1   1 
#> 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1 
#> 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 
#>   1   1   1   1   1   1   1   1   1   1   1   2   1   2   1   1   2   1   1   1 
#> 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
#>   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
#> 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
#>   1   1   1   1   2   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1 
#> 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
#>   1   1   1   1   1   2   1   1   1   1   2   1   1   2   1   1   1   1   2   1 
#> 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
#>   2   1   1   1   1   1   2   1   1   1   2   2   1   1   1   1   1   2   1   1 
#> 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
#>   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   2   2   1   1 
#> 321 322 323 324 325 326 
#>   1   1   1   1   1   1

gene_mat: NULL; Because no information is provided about genes (that is, a gff file), there will be no gene matrix generated. This also means variants that appear in more than 1 overlapping gene will not be split into multiple lines.
tree: NULL; Because ancestral reconstruction is not needed no tree was generated.

Maximal prewas run:

Inputs:

required vcf file
tree
string of the outgroup
gff file

Specifications:

Conduct ancestral reconstruction

results_maximal = prewas(dna = vcf, 
                         tree = tree, 
                         outgroup = outgroup, 
                         gff = gff,
                         anc = TRUE)

Will output:

allele_mat: Nearly identical to results_minimal, but now with the column corresponding to the outgroup removed.

dim(results_maximal$allele_mat)
#> [1] 360  13

head(results_maximal$allele_mat,10)
#>      t7  t3  t10 t6  t8  t2  t5  t9  t11 t13 t4  t14 t12
#> 1    "A" "A" "A" "A" "A" "A" "A" "A" "G" "A" "A" "A" "A"
#> 8    "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "A" "T" "T"
#> 10   "G" "G" "G" "G" "G" "G" "G" "G" "G" "A" "A" "A" "A"
#> 12   "G" "G" "G" "G" "G" "G" "G" "G" "T" "G" "G" "G" "G"
#> 18   "C" "C" "C" "C" "G" "C" "A" "A" "C" "C" "C" "C" "C"
#> 18.1 "C" "C" "C" "C" "G" "C" "A" "A" "C" "C" "C" "C" "C"
#> 21   "A" "A" "A" "A" "A" "A" "A" "C" "T" "A" "A" "A" "A"
#> 21.1 "A" "A" "A" "A" "A" "A" "A" "C" "T" "A" "A" "A" "A"
#> 22   "A" "A" "A" "A" "T" "T" "T" "T" "T" "T" "T" "T" "T"
#> 25   "G" "G" "G" "G" "T" "G" "G" "G" "G" "G" "G" "G" "G"

bin_mat: A binary matrix, where 0 is the reference allele and 1 indicates a variants. The reference allele is the ancestral allele (because anc = TRUE). The dimensions do not match the allele_mat, because variants in overlapping genes are represented on multiple lines. Position and locus tag name is provided in the rowname. Again, the column corresponding to the outgroup is removed.

dim(results_maximal$bin_mat)
#> [1] 1016   13

head(results_maximal$allele_mat)
#>      t7  t3  t10 t6  t8  t2  t5  t9  t11 t13 t4  t14 t12
#> 1    "A" "A" "A" "A" "A" "A" "A" "A" "G" "A" "A" "A" "A"
#> 8    "T" "T" "T" "T" "T" "T" "T" "T" "T" "T" "A" "T" "T"
#> 10   "G" "G" "G" "G" "G" "G" "G" "G" "G" "A" "A" "A" "A"
#> 12   "G" "G" "G" "G" "G" "G" "G" "G" "T" "G" "G" "G" "G"
#> 18   "C" "C" "C" "C" "G" "C" "A" "A" "C" "C" "C" "C" "C"
#> 18.1 "C" "C" "C" "C" "G" "C" "A" "A" "C" "C" "C" "C" "C"

head(results_maximal$bin_mat, 10)
#>           t7 t3 t10 t6 t8 t2 t5 t9 t11 t13 t4 t14 t12
#> 1|gene_1   0  0   0  0  0  0  0  0   1   0  0   0   0
#> 1|gene_2   0  0   0  0  0  0  0  0   1   0  0   0   0
#> 1|gene_3   0  0   0  0  0  0  0  0   1   0  0   0   0
#> 8|gene_8   0  0   0  0  0  0  0  0   0   0  1   0   0
#> 8|gene_80  0  0   0  0  0  0  0  0   0   0  1   0   0
#> 10|gene_1  1  1   1  1  1  1  1  1   1   0  0   0   0
#> 10|gene_2  1  1   1  1  1  1  1  1   1   0  0   0   0
#> 10|gene_3  1  1   1  1  1  1  1  1   1   0  0   0   0
#> 12|gene_1  0  0   0  0  0  0  0  0   1   0  0   0   0
#> 12|gene_2  0  0   0  0  0  0  0  0   1   0  0   0   0

ar_results: Will indicate the allele used as the reference allele (in this case, the ancestral allele). Because ancestral reconstruction was used, the probability of the ancestral alleles is also reported. The number of rows is the same as the number of rows of allele_mat.

head(results_maximal$ar_results)
#>      ancestral_allele       probability
#> 1                   A 0.992960136389704
#> 8                   T 0.996302603423344
#> 10                  A 0.504233270410623
#> 12                  G 0.992960136389704
#> 18                  C 0.990759106744756
#> 18.1                C 0.990759106744756

dup: Same format as results_minimal. Multiple indices indicates multiallelic site split, but not overlapping genes split.
gene_mat: Because a gff file was provided, a gene matrix is generated. Each row is a gene, not a variants.

head(results_maximal$gene_mat)
#>         t7 t3 t10 t6 t8 t2 t5 t9 t11 t13 t4 t14 t12
#> gene_1   1  1   1  1  1  1  1  1   1   1  1   1   1
#> gene_2   1  1   1  1  1  1  1  1   1   1  1   1   1
#> gene_3   1  1   1  1  1  1  1  1   1   1  1   1   1
#> gene_8   1  1   1  1  1  1  1  1   1   0  1   1   1
#> gene_80  1  1   1  1  1  1  1  1   0   0  1   0   0
#> gene_12  0  1   1  0  0  0  0  0   1   0  0   0   0

tree: Returns the phylo object that was provided after it has been rooted to the provided outgroup and the outgroup was removed.

results_maximal$tree
#> 
#> Phylogenetic tree with 13 tips and 12 internal nodes.
#> 
#> Tip labels:
#>   t7, t3, t10, t6, t8, t2, ...
#> 
#> Rooted; includes branch lengths.

No tree?

If you do not have a tree, a neighbor-joining tree will be generated and ancestral reconstruction can be conducted. We recommend using more sophisticated methods for tree building (e.g. maximum likelihood or Bayesian methods) but for an initial analysis, we have provided the option to create a neighbor-joining tree.

results_no_tree = prewas(dna = vcf, 
                         gff = gff,
                         anc = TRUE)
#> [1] "Start building tree"
#> optimize edge weights:  -2420.974 --> -2415.557 
#> optimize base frequencies:  -2415.557 --> -2415.352 
#> optimize rate matrix:  -2415.352 --> -2413.979 
#> optimize edge weights:  -2413.979 --> -2413.974 
#> optimize topology:  -2413.974 --> -2413.974 
#> 0 
#> [1] "Ratchet iteration  1 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  2 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  3 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  4 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  5 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  6 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  7 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  8 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  9 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  10 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  11 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  12 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  13 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  14 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  15 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  16 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  17 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  18 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  19 , best pscore so far: -2413.97385163831"
#> [1] "Ratchet iteration  20 , best pscore so far: -2413.97385163831"
#> optimize base frequencies:  -2413.974 --> -2413.742 
#> optimize rate matrix:  -2413.742 --> -2413.699 
#> optimize edge weights:  -2413.699 --> -2413.699 
#> optimize topology:  -2413.699 --> -2413.699 
#> 0 
#> optimize base frequencies:  -2413.699 --> -2413.692 
#> optimize rate matrix:  -2413.692 --> -2413.691 
#> optimize edge weights:  -2413.691 --> -2413.691 
#> optimize base frequencies:  -2413.691 --> -2413.691 
#> optimize rate matrix:  -2413.691 --> -2413.691 
#> optimize edge weights:  -2413.691 --> -2413.691 
#> optimize base frequencies:  -2413.691 --> -2413.691 
#> optimize rate matrix:  -2413.691 --> -2413.691 
#> optimize edge weights:  -2413.691 --> -2413.691 
#> [1] "Finished building tree"

results_no_tree returns similar results to results_maximal but with one critical difference in the tree object: tree: this phylogenetic tree was built so that ancestral reconstruction could be performed for the variants referencing step. Note that this tree has one more tip than results_maximal$tree because an outgroup was provided with results_maximal which was dropped after being used to root that tree. For results_no_tree the generated tree was midpoint rooted & no outgroup was provided so no tips were dropped.

results_no_tree$tree
#> 
#> Phylogenetic tree with 14 tips and 13 internal nodes.
#> 
#> Tip labels:
#>   t7, t3, t10, t6, t8, t2, ...
#> 
#> Rooted; includes branch lengths.

Want to filter variants by predicted functional impact prior to your gene-based analysis?

To filter variants by predicted functional impact prior to a gene-based analysis, you can provide prewas with a multivcf with SnpEff annotations. To merge individual SnpEff vcf files into a single multivcf, you can use bcftools merge.

# from the command line
bcftools merge -i ANN:join -m both -o out_multivcf.vcf -O *.vcf.gz

results_snpeff = prewas(dna = snpeff_vcf, anc = FALSE)
names(results_snpeff$gene_mat)
#> [1] "gene_mat_all"      "gene_mat_modifier" "gene_mat_high"    
#> [4] "gene_mat_moderate" "gene_mat_low"      "gene_mat_custom"

The default is to output the same variant binary matrix, and 5 different binary gene matrices: one for each snpeff impact (low, moderate, high, modifier), and one with all of the variants (all). There is also a custom output that will be NULL with the default options. If you would like to create a custom grouping (e.g. moderate and high), you can provide a vector to snpeff_grouping (e.g. c('MODERATE','HIGH')). In this case, the custom output will keep those specific variants types and no others when grouping by gene.

results_snpeff_custom = prewas(dna = snpeff_vcf, snpeff_grouping = c('MODERATE','HIGH'), anc = FALSE)
names(results_snpeff_custom$gene_mat)
#> [1] "gene_mat_all"      "gene_mat_modifier" "gene_mat_high"    
#> [4] "gene_mat_moderate" "gene_mat_low"      "gene_mat_custom"
head(results_snpeff_custom$gene_mat$gene_mat_custom)
#>                              s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15
#> dnaA|MODERATE-HIGH            0  0  0  0  0  0  0  0  0   0   0   1   0   0   0
#> serS|MODERATE-HIGH            0  0  0  0  0  0  0  0  0   0   0   0   0   0   0
#> mecR1|MODERATE-HIGH           0  0  0  0  0  0  0  0  0   0   0   0   0   0   0
#> hsdR1|MODERATE-HIGH           0  0  0  0  0  0  0  0  0   0   1   0   0   0   0
#> USA300HOU_0034|MODERATE-HIGH  0  0  0  0  0  0  0  0  0   0   1   0   0   0   0
#>                              s16 s17 s18 s19 s20 s21 s22 s23 s24 s25 s26 s27
#> dnaA|MODERATE-HIGH             0   0   0   0   0   0   0   0   0   0   0   0
#> serS|MODERATE-HIGH             0   0   0   0   0   0   0   0   0   0   1   0
#> mecR1|MODERATE-HIGH            0   1   0   1   0   1   0   0   0   0   0   0
#> hsdR1|MODERATE-HIGH            0   1   0   1   0   1   0   0   0   0   0   0
#> USA300HOU_0034|MODERATE-HIGH   0   0   0   0   0   0   0   0   0   0   0   0
#>                              s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39
#> dnaA|MODERATE-HIGH             0   0   0   0   0   0   0   0   0   1   0   0
#> serS|MODERATE-HIGH             1   0   0   0   0   0   0   0   0   0   0   0
#> mecR1|MODERATE-HIGH            0   0   0   0   0   0   0   0   0   0   0   0
#> hsdR1|MODERATE-HIGH            0   1   0   0   1   0   1   0   0   0   0   0
#> USA300HOU_0034|MODERATE-HIGH   0   1   0   0   1   0   1   0   0   0   0   0
#>                              s40 s41 s42 s43 s44 s45 s46 s47 s48 s49
#> dnaA|MODERATE-HIGH             0   0   0   0   0   0   0   1   0   0
#> serS|MODERATE-HIGH             0   0   0   0   0   0   0   0   0   0
#> mecR1|MODERATE-HIGH            0   0   0   0   0   0   0   0   0   0
#> hsdR1|MODERATE-HIGH            0   0   0   0   0   0   0   0   0   0
#> USA300HOU_0034|MODERATE-HIGH   0   0   0   0   0   0   0   0   0   0

Want to group by minor allele?

If you want to group by minor allele to increase power by grouping rarer alleles together. This feature re-collapses multiallelic sites for each locus. This feature is analagous to grouping by gene, but now by site instead.

If you are additionally using SnpEff annotations, you can appropriately group variants with the indicated annotation by site.

results_ma = prewas(dna = vcf, grp_nonref = TRUE, anc = FALSE)

Getting Started With prewas

Stephanie Thiede, Zena Lapp, Katie Saund

2021-04-01

Introduction to prewas

1. Dealing with multiallelic variants

Single-line format:

Multi-line format:

2. Dealing with variants in overlapping genes

3. Choosing a reference allele

Outputs:

Usage

Minimal prewas run:

Inputs:

Specifications:

Will output a list containing:

Maximal prewas run:

Inputs:

Specifications:

Will output:

No tree?

Want to filter variants by predicted functional impact prior to your gene-based analysis?

Want to group by minor allele?

Detailed prewas workflow