The sim_reg
function performs inference under the 'Similarity regression' model [1] conditional on supplied binary genotypes and ontological term sets.
The vignette 'Similarity Regression - Introduction' shows a simple application based on simulated data. This vignette demonstrates how to set the prior on phi
, for example in order to include prior information about likely phenotype of a disease. The information is supplied to the inference procedure as a parameter called term_weights
and should be a numeric vector of relative weights for terms included in the sample space of phi
(by default the set of all terms present amongst the terms in x
and their ancestors).
library(ontologyIndex)
library(ontologySimilarity)
library(SimReg)
data(hpo)
set.seed(1)
terms <- get_ancestors(hpo, c(hpo$id[match(c("Abnormality of thrombocytes","Hearing abnormality"),
hpo$name)], sample(hpo$id, size=50)))
To help illustrate the ideas, we'll consider a scenario where there is some evidence for an association between a phenotype - which in this example we'll set as 'Hearing abnormality' - and a binary genotype, and where there is also a prior expectation that the 'characteristic phenotype' of the genotype would involve hearing abnormality. For instance, the genotype might depend on variants in a gene orthologous to one known to harbour variants associated with 'Hearing abnormality' in a model organism. We apply the inference procedure to the data with and without the prior and observe the effect on the inferred probability of association.
hearing_abnormality <- hpo$id[match("Hearing abnormality", hpo$name)]
genotypes <- c(rep(TRUE, 2), rep(FALSE, 98))
#give all subjects 5 random terms and add 'hearing abnormality' for those with y_i=TRUE
phenotypes <- lapply(genotypes, function(y_i) minimal_set(hpo, c(
if (y_i) hearing_abnormality else character(0), sample(terms, size=5))))
So there are three cases with the rare variant (i.e. having y_i = TRUE) and all of them have the 'Hearing abnormality' HPO term.
An application of yields:
sim_reg(ontology=hpo, x=phenotypes, y=genotypes)
## Probability of association: 0.17
## --------------------------------------------------------------------------------
## Name p
## Abnormality of cardiovascular system physiology 0.65
## Hearing abnormality 0.47
## Abnormality of the ear 0.23
## Abnormality of the metaphyses 0.05
## Abnormality of long bone morphology 0.05
## Abnormality of the cardiovascular system 0.05
## Abnormality of upper limb metaphysis 0.05
## Thickened cortex of long bones 0.03
## Abnormal appendicular skeleton morphology 0.02
## Abnormality of the humeral metaphyses 0.02
## --------------------------------------------------------------------------------
We now consider constructing the term_weights
parameter to capture our knowledge about the gene from the model organism. We can either explicitly create the term_weights
vector of prior weights, i.e. by assigning higher weights to terms which involve some kind of hearing problem. For example, we could set the prior weight of all terms which have the word 'hearing' in to ten times that of terms which don't.
term_weights <- ifelse(grepl(x=hpo$name, ignore=TRUE, pattern="hearing"), 10, 1)
names(term_weights) <- hpo$id
Note: one must set the names
of the term_weights
vector, as sim_reg
will use it.
If the prior knowledge of the phenotype/phenotype of the model organism has been ontologically encoded (for example, it may be available as MPO terms from the Mouse Genome Informatics (MGI) website, http://www.informatics.jax.org/, [2]), another option is to use a phenotypic similarity function to obtain the numeric vector of weights for inclusion of terms in phi
[1]. This may be more convenient, particularly when dealing with large numbers of genes. In the SimReg paper [1], the vector is set using the Resnik-based [3] similarities of terms to the terms in the 'literature phenotype'. In order to calculate the similarities based on Resnik's similarity measure, we must first compute an 'information content' for the terms, equal to the negative log frequency. The frequencies can be calculated with respect to different collections of phenotypes. Here, we will calculate it with respect to the frequencies of terms within our collection, phenotypes
, by calling the function exported by the ontologyIndex
package, get_term_info_content
. Note, it could also be calculated with respect to the frequency of the term amongst the ontological annotation of OMIM diseases (available from the HPO website, http://human-phenotype-ontology.github.io [4]). The function get_term_set_to_term_sims
in the package ontologySimilarity
can then be used to calculate the similarities between the terms in the sample space of phi
and the literature_phenotype
. It calculates a matrix of similarities between the individual terms in the literature phenotype and terms in the sample space. Let's say the phenotype of the model organism in our example contains abnormalities of the thrombocytes and hearing abnormality.
thrombocytes <- hpo$id[match("Abnormality of thrombocytes", hpo$name)]
literature_phenotype <- c(hearing_abnormality, thrombocytes)
info <- get_term_info_content(hpo, phenotypes)
term_weights_resnik <- apply(get_term_set_to_term_sims(
ontology=hpo, information_content=info, terms=names(info),
term_sim_method="resnik", term_sets=list(literature_phenotype)), 2, mean)
This can then be passed to sim_reg
through the term_weights
parameter.
sim_reg(
ontology=hpo,
x=phenotypes,
y=genotypes,
term_weights=term_weights_resnik
)
## Probability of association: 0.28
## --------------------------------------------------------------------------------
## Name p
## Hearing abnormality 0.65
## Abnormality of cardiovascular system physiology 0.48
## Abnormality of the ear 0.21
## Abnormality of the cardiovascular system 0.04
## Abnormality of the metaphyses 0.03
## Abnormality of long bone morphology 0.03
## Abnormality of upper limb metaphysis 0.03
## Thickened cortex of long bones 0.02
## Abnormal appendicular skeleton morphology 0.02
## Abnormality of the humeral metaphyses 0.01
## --------------------------------------------------------------------------------
Note that including the term_weights
parameter has increased the mean posterior value of gamma
.
Often the binary genotype relates to a particular gene, and for many genes ontologically encoded phenotypes are available either in the form of HPO encoded OMIM annotations [4] or MPO annotations [2]. For a given set of subjects with HPO-coded phenotypes, it may be useful to apply the inference gene-by-gene, taking the binary genotype y
to indicate the presence of a rare variant in each particular gene for each case. Thus, we may wish to systematically create informative prior distributions for phi
for all genes. This can be done by downloading the file called 'ALL_SOURCES_TYPICAL_FEATURES_genes_to_phenotype.txt' from the HPO website, and running the following code yielding a list of term sets (i.e. character vectors of HPO term IDs).
annotation_df <- read.table(header=FALSE, skip=1, sep="\t",
file="ALL_SOURCES_TYPICAL_FEATURES_genes_to_phenotype.txt", stringsAsFactors=FALSE, quote="")
hpo_by_gene <- lapply(split(f=annotation_df[,2], x=annotation_df[,4]),
function(trms) minimal_set(hpo, intersect(trms, hpo$id)))