The package slgf
implements the suspected latent grouping factor (SLGF) methodology of Metzger and Franck (2019). In this vignette, we first provide a brief theoretical and mathematical review of the method. Next, through the use of real observational data and a designed experiment, we illustrate the usage of slgf
in the context of two common linear model layouts: one-way analysis of variance (ANOVA) and a two-way unreplicated layout. These examples will illustrate the user specification of the models, group-based regression and variance structure, SLGF selection, and regression effect prior specification.
ms.slgf()
Computes posterior model probabilities for each user-specified model based on the SLGF grouping methodology. This package requires a data frame containing a response variable and at least one categorical factor with three or more levels; in many cases this data frame may also contain additional covariates, either continuous or categorical.
extract.hats()
Returns the concentrated maximum likelihood estimates for regression coefficients, variance(s), and \(g\) from a user-specified model.
groupings()
Computes the unique groupings available for a given \(K\).
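For example, to preview the candidate grouping schemes for a factor with \(K=4\) levels (assuming, per the description above, that groupings takes the number of levels as its argument):
groupings(4)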
The function ms.slgf requires several inputs to compute and output posterior model probabilities for all models and model classes of interest. The user begins with a data frame containing a continuous response, at least one categorical predictor, and any other covariates of interest. This data frame must not contain column names with the character string group, as this specific string is used by ms.slgf to create the group-based regression effect structure. The user must first identify a suspected latent grouping factor, usually by plotting the data and noting a latent structure within the levels of a categorical predictor, as illustrated in Section 3.1. The user indicates, via the arguments response and lgf, character strings corresponding to the response and the suspected latent grouping factor variable names, respectively.
Next the user determines the model classes they wish to evaluate. We note the distinction between these model classes and the R class of a variable. The argument usermodels is a list in which each element contains a string of R class formula or character. The user also specifies which classes should also be considered in a heteroscedastic context via the argument het, a vector with an indicator 1 or 0 corresponding to each model class specified in usermodels. Together, the arguments usermodels and het create the full set of model classes considered.
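For instance, a hypothetical specification with a response y, in which only the second model class is also evaluated heteroscedastically, might look like:
usermodels <- list("y~1", "y~group")   # two model classes as formula-style strings
het <- c(0, 1)                         # evaluate only the second class with group-based variances
This yields length(usermodels)+sum(het) = 3 model classes in total.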
Next the user chooses a prior to place on the regression effects. As described in Section 3.2, prior="flat" (the default) implements a constant prior, and prior="zs" imposes the Zellner-Siow mixture \(g\)-prior.
Finally, the user must specify the minimum number of levels of the SLGF that can comprise a group via the argument min.levels, which defaults to 1. Because the number of possible grouping schemes increases exponentially with \(K\), the number of levels of the SLGF, the user can reduce the number of candidate models by increasing min.levels, limiting the computational burden. Additionally, when considering data with limited degrees of freedom, increasing min.levels can also ensure estimability of the parametrized usermodels; see Section 4.2 for more detail.
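To make this growth concrete: because the methodology splits the SLGF levels into two groups, a \(K\)-level SLGF admits \(2^{K-1}-1\) distinct two-group schemes. With \(K=5\), that is \(2^{4}-1=15\) candidate schemes per group-based model class; raising min.levels to 2 removes the five schemes that isolate a single level, leaving \(\binom{5}{2}=10\).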
For a thorough review of the model specification see Metzger and Franck (2019). We must define a model flexible enough to account for all possible model classes. We begin with the model
\(\boldsymbol{Y}=X\boldsymbol{\beta}+\boldsymbol{\varepsilon}\),
where \(\boldsymbol{Y}\) is the \(N\times 1\) vector of responses, \(X\) is the design matrix, \(\boldsymbol{\beta}\) is the vector of regression coefficients, and \(\boldsymbol{\varepsilon}\sim\text{N}(\boldsymbol{0},\sigma^2 I_N)\) is the error vector.
We must augment our notation to account for several additional components: the full SLGF with \(K\) degrees of freedom or a 2-degree-of-freedom group effect; interactions with the SLGF or group effect; other effects of interest unrelated to the SLGF; and potential group-based heteroscedasticity. Thus we let \(X=(\boldsymbol{1}|W|V|U)\) and \(\boldsymbol{\beta}=(\alpha,\boldsymbol{\nu},\boldsymbol{\tau},\boldsymbol{\rho})\) to obtain
\[\begin{equation} \boldsymbol{Y} = \boldsymbol{1} \alpha + W\boldsymbol{\nu} + V\boldsymbol{\tau} + U\boldsymbol{\rho} + \boldsymbol{\varepsilon} \end{equation}\]
where \(\alpha\) is an intercept, \(\boldsymbol{\nu}\) contains the SLGF or group effect, \(\boldsymbol{\tau}\) contains other effects of interest unrelated to the SLGF, \(\boldsymbol{\rho}\) contains interactions with the SLGF or group effect, and \(W\), \(V\), and \(U\) are the corresponding design submatrices.
Not all linear model classes will incorporate each term; for example, the one-way ANOVA example of Section 4.1 contains only a single categorical predictor, so \(\boldsymbol{\tau}:=\boldsymbol{0}\) and \(\boldsymbol{\rho}:=\boldsymbol{0}\). When the SLGF is modeled as a group effect, denote this effect as \(\tilde{\boldsymbol{\nu}}\); when there is an interaction involving the group effect, denote it as \(\tilde{\boldsymbol{\rho}}\).
In heteroscedastic contexts, let \(\sigma^2_1\) and \(\sigma^2_2\) represent the error variances of groups 1 and 2, respectively. Let \(\tilde{\Sigma}\) denote the covariance matrix where the \(i\)th diagonal element is \(\sigma^2_1\) if \(y_i\) belongs to group 1, or \(\sigma^2_2\) if \(y_i\) belongs to group 2.
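As a small sketch of this covariance structure (the group memberships and variances below are hypothetical):
grp   <- c(1, 1, 1, 2, 2)        # hypothetical group membership of each observation
sigsq <- c(0.059, 0.012)         # hypothetical values of sigma^2_1 and sigma^2_2
Sigma <- diag(sigsq[grp])        # the i-th diagonal element matches y_i's group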
In the interest of objectivity, we implement noninformative priors on the regression effects and error variance(s). For homoscedastic models,
\(P(\boldsymbol{\beta},\sigma^2|m)\propto \frac{1}{\sigma^2},\)
and for heteroscedastic models,
\(P(\boldsymbol{\beta},\sigma^2_1, \sigma^2_2|m)\propto \frac{1}{\sigma^2_1\cdot\sigma^2_2}.\)
The user implements these priors with the argument prior="flat".
However, in contexts with limited data, such as the two-way unreplicated layout of Section 4.2, we recommend the Zellner-Siow mixture \(g\)-prior (Liang et al. 2008), which reduces the minimal training sample size necessary for the computation of the fractional Bayes factor. For homoscedastic models,
\(P(\alpha, \sigma^2|m)\propto \frac{1}{\sigma^2}\) and \(\boldsymbol{\beta}_{-\alpha}|\Sigma,g,m \sim \text{N}(\boldsymbol{0},\,g(X^T\Sigma X)^{-1});\)
for heteroscedastic models,
\(P(\alpha, \sigma^2_1, \sigma^2_2|m)\propto \frac{1}{\sigma^2_1\cdot\sigma^2_2}\) and \(\boldsymbol{\beta}_{-\alpha}|\tilde{\Sigma},g,m \sim \text{N}(\boldsymbol{0},\,g(X^T\tilde{\Sigma} X)^{-1});\)
and in both cases,
\(g\sim \texttt{IG}\big(\frac{1}{2},\, \frac{N}{2}\big).\)
The user implements these priors with the argument prior="zs".
We invoke a fractional Bayes factor approach to compute well-defined posterior model probabilities for each model; for a thorough review and justification see O’Hagan (1995) and Metzger and Franck (2019).
Let \(\mathcal{M}\) represent the full set of models under consideration, spanning all classes and grouping schemes of interest. Denote by \(\boldsymbol{\theta}\) the full set of unknown parameters associated with a model \(m\in \mathcal{M}\) and by \(\pi(\boldsymbol{\theta})\) the joint prior on these parameters. A fractional Bayes factor is a ratio of two fractional marginal likelihoods, where a fractional marginal likelihood is defined as
\(q^b(\boldsymbol{Y}|m)=\frac{\int P(\boldsymbol{Y}|\boldsymbol{\theta},m)\pi(\boldsymbol{\theta})d\boldsymbol{\theta}}{\int P(\boldsymbol{Y}|\boldsymbol{\theta},m)^b\pi(\boldsymbol{\theta})d\boldsymbol{\theta}}\)
for some fractional exponent \(b\). We exclusively implement \(b=\frac{m_0}{N}\), where \(m_0\) is the minimal training sample size required for \(P(\boldsymbol{Y}|m)\) to be a proper distribution.
Thus slgf must approximate the integrals \(\int P(\boldsymbol{Y}|\boldsymbol{\theta},m)\pi(\boldsymbol{\theta})d\boldsymbol{\theta}\) and \(\int P(\boldsymbol{Y}|\boldsymbol{\theta},m)^b\pi(\boldsymbol{\theta})d\boldsymbol{\theta}\) for all \(m\in \mathcal{M}\). In the case of noninformative regression priors, \(\boldsymbol{\beta}\) is integrated analytically, and \(\sigma^2\) or \(\sigma^2_1, \sigma^2_2\) are integrated using a Laplace approximation after a log-variance transformation. In the Zellner-Siow mixture \(g\)-prior case, \(\alpha\) and \(\boldsymbol{\beta}_{-\alpha}\) are integrated analytically, and \(\sigma^2\) or \(\sigma^2_1, \sigma^2_2\) and \(g\) are again integrated using a Laplace approximation with a log-variance transformation. Let \(\boldsymbol{\tilde{\theta}}\) represent the set of unknown parameters after the regression effects \(\boldsymbol{\beta}\) have been integrated out. Then for dimensionality \(d=2\) in the noninformative prior case and \(d=3\) in the Zellner-Siow mixture \(g\)-prior case,
\(\log\big(\int P(\boldsymbol{Y}|\boldsymbol{\tilde{\theta}},m)\pi(\boldsymbol{\tilde{\theta}})d\boldsymbol{\tilde{\theta}}\big)\approx \frac{d}{2}\log(2\pi)-\frac{1}{2}\log|-H^{\star}|+\log(P(\boldsymbol{Y}|\boldsymbol{\tilde{\theta}}^{\star},m))\)
where \(\boldsymbol{\tilde{\theta}}^{\star}\) is the mode of \(P(\boldsymbol{Y}|\boldsymbol{\tilde{\theta}},m)\pi(\boldsymbol{\tilde{\theta}})\) and \(H^{\star}\) is the Hessian matrix evaluated at \(\boldsymbol{\tilde{\theta}}^{\star}\). These values are computed with the functions optim and numDeriv::hessian, respectively. We make a similar computation for \(\int P(\boldsymbol{Y}|\boldsymbol{\tilde{\theta}},m)^b\pi(\boldsymbol{\tilde{\theta}})d\boldsymbol{\tilde{\theta}}\) to compute the fractional marginal likelihood \(q^b(\boldsymbol{Y}|m)\) for all \(m\in \mathcal{M}\), which is well defined for both homoscedastic and heteroscedastic models. Once log-fractional marginal likelihoods have been computed for all models, we subtract the maximum from this set, rescaling the log-fractional marginal likelihoods to have a maximum of 0; each value is then exponentiated, yielding fractional marginal likelihoods with a maximum of 1 and avoiding numerical underflow when computing posterior model probabilities.
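The following toy sketch mirrors this Laplace computation for a \(\text{N}(0,\sigma^2)\) likelihood under the flat prior above; y and log.kernel are illustrative stand-ins, not package internals:
# after the transformation ls = log(sigma^2), the 1/sigma^2 prior cancels
# the Jacobian, so the log integrand is simply the log likelihood in ls
set.seed(1)
y <- rnorm(20)
log.kernel <- function(ls) sum(dnorm(y, mean = 0, sd = exp(ls / 2), log = TRUE))
opt <- optim(par = 0, fn = log.kernel, method = "BFGS",
             control = list(fnscale = -1))       # fnscale = -1 maximizes
H <- numDeriv::hessian(log.kernel, opt$par)      # Hessian at the mode
d <- length(opt$par)
logML <- (d / 2) * log(2 * pi) - 0.5 * log(det(-H)) + opt$value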
Model priors are imposed uniformly by model class; for classes containing multiple models, the prior on each class is divided uniformly among the models it contains.
We finally compute posterior model probabilities for each model:
\(P(m^{\prime}|\boldsymbol{Y})=\frac{P(\boldsymbol{Y}|m^{\prime})P(m^{\prime})}{\underset{m\in\mathcal{M}}{\sum}P(\boldsymbol{Y}|m)P(m)}.\)
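Schematically, with hypothetical log fractional marginal likelihoods and model priors, this computation reduces to:
logq  <- c(-105.2, -103.8, -110.4)   # hypothetical log fractional marginal likelihoods
prior <- c(1/2, 1/4, 1/4)            # uniform over classes; split within a two-model class
q     <- exp(logq - max(logq))       # rescale so the maximum is 1, avoiding underflow
post  <- q * prior / sum(q * prior)  # posterior model probabilities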
First consider the data of O'Brien and Heft (1995), who studied olfactory function (\(y\)-axis) by age, where age was divided into five categories (\(x\)-axis). Load the dataset:
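library(slgf)
data(smell)   # the O'Brien and Heft (1995) data, assumed to ship with slgf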
A simple boxplot suggests potential heteroscedasticity, with latent grouping structure 1,2,3:4,5. We propose the alternative notation {1,2,3}{4,5} to avoid ambiguity with the ":" symbol used with strings of class formula in R.
boxplot(smell$olf ~ smell$agecat, pch = 16, xlab = "Age Category",
        ylab = "Olfactory Score", xaxt = "n",
        col = c("gray30", "gray30", "gray30", "white", "white"),
        border = c("black", "black", "black", "red", "red"),
        main = "O'Brien and Heft (1995) Smell Data")
axis(1, labels = c("1\n(n=38)", "2\n(n=35)", "3\n(n=21)", "4\n(n=43)",
                   "5\n(n=42)"), at = 1:5, lwd.ticks = 0)
We thus consider agecat as the suspected latent grouping factor. The apparent latent grouping scheme we observe is denoted {1,2,3}{4,5} (or equivalently, {4,5}{1,2,3}), but all possible grouping schemes are considered. The means may also differ by level, but this is more difficult to distinguish from the plot. Thus we first consider the following model and heteroscedasticity structures:
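smell.models <- list("olf~1", "olf~agecat", "olf~group")  # global mean; means by age level; group-based means
smell.het <- c(0, 0, 1)   # also consider the group-based means model with group-based variances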
By specifying smell.models and smell.het as done above, we consider the third model specified with group-based heteroscedasticity. This elicits four model classes: a homoscedastic global mean, homoscedastic age-level means, homoscedastic group-based means, and group-based means with group-based heteroscedasticity. Finally, we note that with a relatively large sample size, we prefer the use of noninformative priors via prior="flat", and we specify min.levels=1, as we have no prior information on the number of levels of agecat that may be grouped together and we wish to consider a comprehensive set of candidate models. With the model specification list argument usermodels and the heteroscedasticity vector argument het, the number of unique classes can be obtained as length(usermodels)+sum(het).
smell.out <- ms.slgf(dataf=smell, response="olf", lgf="agecat", usermodels=smell.models,
prior="flat", het=smell.het, min.levels=1)
Note that by specifying the argument min.levels=1, we consider all possible grouping schemes in which each group contains at least one level of the SLGF. We could specify min.levels=2 to consider a smaller model space with at least two levels per group, but min.levels=3 or greater is not possible with five levels of the SLGF, since both groups cannot then contain three or more levels. The output smell.out is a list of class slgf, with six elements when prior="flat" and seven when prior="zs"; see the help file for full details.
We summarize the five most probable models of 32 considered:
smell.out$results[1:5, c(1:3,6,11)]
#> Model Scheme Variance Fmodprob Class
#> 1 olf~group {4,5}{1,2,3} Heterosk 0.99987407 olf~group, Heterosk
#> 2 olf~group {1,2}{3,4,5} Heterosk 0.00012589 olf~group, Heterosk
#> 3 olf~agecat None Homosk 0.00000003 olf~agecat, Homosk
#> 4 olf~group {5}{1,2,3,4} Heterosk 0.00000000 olf~group, Heterosk
#> 5 olf~group {4,5}{1,2,3} Homosk 0.00000000 olf~group, Homosk
We strongly favor the model with group-based mean effects and variances via scheme {4,5}{1,2,3}.
The function extract.hats provides the estimates of the coefficient(s) and variance(s) for a given model.index, as well as \(g\) if prior="zs".
smell.hats <- extract.hats(smell.out, model.index=1)
print(smell.hats$`sigsq.{4,5}`)
#> [1] 0.01211023
print(smell.hats$`sigsq.{1,2,3}`)
#> [1] 0.05868753
That is, we compute \(\hat{\sigma}^2_{\text{1,2,3}}=\) 0.05869 and \(\hat{\sigma}^2_{\text{4,5}}=\) 0.01211. Let us also consider the case where het=c(1,1,1); that is, we include two additional classes: group-based variances with a single global mean, and group-based variances with means by agecat.
smell.out <- ms.slgf(dataf=smell, response="olf", lgf="agecat",
usermodels=smell.models, prior="flat",
het=c(1,1,1), min.levels=1)
smell.out$results[1:5, c(1:3,6,11)]
#> Model Scheme Variance Fmodprob Class
#> 1 olf~group {4,5}{1,2,3} Heterosk 0.54204570 olf~group, Heterosk
#> 2 olf~agecat {4,5}{1,2,3} Heterosk 0.44946817 olf~agecat, Heterosk
#> 3 olf~agecat {1,2}{3,4,5} Heterosk 0.00840126 olf~agecat, Heterosk
#> 4 olf~group {1,2}{3,4,5} Heterosk 0.00007605 olf~group, Heterosk
#> 5 olf~agecat {1,3}{2,4,5} Heterosk 0.00000566 olf~agecat, Heterosk
Now the most probable models are a bit less conclusive, as the model with distinct category means and group-based heteroscedasticity under scheme {4,5}{1,2,3} also accounts for a meaningful amount of posterior model probability. We can easily summarize the scheme and class probabilities, which strongly favor scheme {4,5}{1,2,3} and moderately favor the group-based means and variances model class:
smell.out$scheme.Probs
#> Scheme.Prob
#> {4,5}{1,2,3} 0.99151391
#> {1,2}{3,4,5} 0.00847731
#> {1,3}{2,4,5} 0.00000566
#> {2,3}{1,4,5} 0.00000219
#> {1}{2,3,4,5} 0.00000047
#> {5}{1,2,3,4} 0.00000025
#> {2}{1,3,4,5} 0.00000018
#> None 0.00000002
#> {1,4}{2,3,5} 0.00000000
#> {1,5}{2,3,4} 0.00000000
#> {2,4}{1,3,5} 0.00000000
#> {2,5}{1,3,4} 0.00000000
#> {3,4}{1,2,5} 0.00000000
#> {3,5}{1,2,4} 0.00000000
#> {3}{1,2,4,5} 0.00000000
#> {4}{1,2,3,5} 0.00000000
smell.out$class.Probs
#> Class.Prob
#> olf~group, Heterosk 0.54212175
#> olf~agecat, Heterosk 0.45787818
#> olf~1, Heterosk 0.00000004
#> olf~agecat, Homosk 0.00000002
#> olf~1, Homosk 0.00000000
#> olf~group, Homosk 0.00000000
Next we analyze a two-way unreplicated layout. Consider the data analyzed by Franck, Nielsen, and Osborne (2013), where six dogs with lymphoma were studied. Two individual samples were taken from healthy and tumor tissue within each dog, and the copy number variation was measured for each sample. We arrange dogs into rows and tissue types into columns of a two-way layout. We first plot the data to determine whether a latent grouping structure underlies the data:
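# lymphoma is assumed to ship with slgf as the 6 x 2 layout described
# above, with dogs in rows and tissue types in columns
data(lymphoma)
matplot(t(lymphoma), type = "b", pch = 16, lty = 1, col = 1:6, xaxt = "n",
        xlab = "Tissue Type", ylab = "Copy Number Variation",
        main = "Franck et al. (2013) Lymphoma Data")
axis(1, at = 1:2, labels = c("Tissue 1", "Tissue 2"))
legend("bottomright", legend = paste("Dog", 1:6), col = 1:6, lty = 1, pch = 16)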
We strongly suspect that dogs 1, 2, and 5 behave distinctly from dogs 3, 4, and 6. The telltale non-parallel lines suggest an underlying interaction, but with only a single observation in each cell, we lack the degrees of freedom to fit a standard row-by-column interaction term. Instead, we let dog represent the SLGF, exclude the tissue (column) main effect, and parametrize a group-by-tissue interaction term, isomorphic to fitting distinct tissue effects by group. Thus we consider four reasonable model classes: a dog and tissue effect, a dog effect with a group-by-tissue interaction, and the heteroscedastic counterparts of each class. We accomplish this with the arguments usermodels=list("gene~dog+tissue", "gene~dog+group:tissue") and het=c(1,1). Additionally, we must specify min.levels=2 or min.levels=3 to ensure sufficient degrees of freedom to estimate the group:tissue effect. Here we consider min.levels=2 in the interest of assessing a more complete set of candidate models.
Because of the limited amount of data, our choice of prior is more impactful in this case. We impose prior="zs" to utilize the Zellner-Siow mixture \(g\)-prior, as the fractional Bayes factor exponent would otherwise require a prohibitively high proportion of the data for model training.
We first put the two-way layout into a data.frame format compatible with ms.slgf, and then implement this approach.
lymphoma.tall <- maketall(lymphoma)
lymphoma.tall <- data.frame("gene"=lymphoma.tall[,1], "dog"=lymphoma.tall[,2],
"tissue"=lymphoma.tall[,3])
lymphoma.models <- list("gene~dog+tissue", "gene~dog+group:tissue")
lymphoma.out <- ms.slgf(dataf=lymphoma.tall, response="gene", lgf="dog",
usermodels=lymphoma.models, prior="zs",
het=c(1,1), min.levels=2)
Note the : operator in the usermodels syntax, which, unlike *, does not automatically include the main effects group and tissue; these are not both estimable here. As expected, we conclude with high probability that scheme {1,2,5}{3,4,6} underlies the data. The five most probable models are given by:
lymphoma.out$results[1:5, c(1:3,6)]
#> Model Scheme Variance Fmodprob
#> 1 gene~dog+group:tissue {1,2,5}{3,4,6} Homosk 0.73643491
#> 2 gene~dog+group:tissue {1,2,5}{3,4,6} Heterosk 0.24833051
#> 3 gene~dog+tissue None Homosk 0.00311082
#> 4 gene~dog+group:tissue {1,2}{3,4,5,6} Heterosk 0.00211760
#> 5 gene~dog+tissue {1,2,5}{3,4,6} Heterosk 0.00109052
while the class probabilities and five highest scheme probabilities are:
lymphoma.out$class.Probs
#> Class.Prob
#> gene~dog+group:tissue, Homosk 0.74081898
#> gene~dog+group:tissue, Heterosk 0.25326343
#> gene~dog+tissue, Homosk 0.00311082
#> gene~dog+tissue, Heterosk 0.00280677
head(lymphoma.out$scheme.Probs, 5)
#> Scheme.Prob
#> {1,2,5}{3,4,6} 0.98585594
#> None 0.00311082
#> {1,2}{3,4,5,6} 0.00288222
#> {1,5}{2,3,4,6} 0.00145899
#> {3,6}{1,2,4,5} 0.00081783
The group.datafs element of lymphoma.out contains data.frames associated with each model and grouping scheme. We first determine which element of lymphoma.out$group.datafs contains the data.frame of interest via the column dataf.Index:
lymphoma.out$results[1,c(1:3,6,8)]
#> Model Scheme Variance Fmodprob dataf.Index
#> 1 gene~dog+group:tissue {1,2,5}{3,4,6} Homosk 0.7364349 18
This tells us that element 18 of lymphoma.out$group.datafs contains the data.frame with the {1,2,5}{3,4,6} group effect:
lymphoma.out$group.datafs[[18]]
#> gene dog tissue group
#> 1 9.3278 1 1 1,2,5
#> 2 9.2168 1 2 1,2,5
#> 3 9.5108 2 1 1,2,5
#> 4 9.3942 2 2 1,2,5
#> 5 8.7535 3 1 3,4,6
#> 6 9.4158 3 2 3,4,6
#> 7 8.6372 4 1 3,4,6
#> 8 9.2480 4 2 3,4,6
#> 9 9.4981 5 1 1,2,5
#> 10 9.4626 5 2 1,2,5
#> 11 8.7322 6 1 3,4,6
#> 12 9.3439 6 2 3,4,6
Franck, Christopher T., Dahlia M. Nielsen, and Jason A. Osborne. 2013. "A Method for Detecting Hidden Additivity in Two-Factor Unreplicated Experiments." Computational Statistics & Data Analysis 67 (Supplement C): 95–104. https://doi.org/10.1016/j.csda.2013.05.002.
Liang, Feng, Rui Paulo, German Molina, Merlise A. Clyde, and Jim O. Berger. 2008. "Mixtures of g Priors for Bayesian Variable Selection." Journal of the American Statistical Association 103 (481): 410–23. https://doi.org/10.1198/016214507000001337.
Metzger, Thomas A., and Christopher T. Franck. 2019. “Detection of latent heteroscedasticity and group-based regression effects in linear models via Bayesian model selection.” arXiv E-Prints, March.
O'Brien, R. G., and M. W. Heft. 1995. "New Discrimination Indexes and Models for Studying Sensory Functioning in Aging." Journal of Applied Statistics 22: 9–27.
O’Hagan, Anthony. 1995. “Fractional Bayes Factors for Model Comparison.” Journal of the Royal Statistical Society. Series B (Methodological) 57 (1): 99–138. http://www.jstor.org/stable/2346088.