NeuroImage

Volume 46, Issue 4, 15 July 2009, Pages 1004-1017

Bayesian model selection for group studies

https://doi.org/10.1016/j.neuroimage.2009.03.025

Abstract

Bayesian model selection (BMS) is a powerful method for determining the most likely among a set of competing hypotheses about the mechanisms that generated observed data. BMS has recently found widespread application in neuroimaging, particularly in the context of dynamic causal modelling (DCM). However, so far, combining BMS results from several subjects has relied on simple (fixed effects) metrics, e.g. the group Bayes factor (GBF), that do not account for group heterogeneity or outliers. In this paper, we compare the GBF with two random effects methods for BMS at the between-subject or group level. These methods provide inference on model space from classical and Bayesian perspectives, respectively. First, a classical (frequentist) approach uses the log model evidence as a subject-specific summary statistic. This enables one to use analysis of variance to test for differences in log-evidences over models, relative to inter-subject differences. We then consider the same problem in Bayesian terms and describe a novel hierarchical model, which is optimised to furnish a probability density on the models themselves. This new variational Bayes method rests on treating the model as a random variable and estimating the parameters of a Dirichlet distribution which describes the probabilities for all models considered. These probabilities then define a multinomial distribution over model space, allowing one to compute how likely it is that a specific model generated the data of a randomly chosen subject, as well as the exceedance probability of one model being more likely than any other model. Using empirical and synthetic data, we show that optimising a conditional density of the model probabilities, given the log-evidences for each model over subjects, is more informative and appropriate than both the GBF and frequentist tests of the log-evidences. In particular, we found that the hierarchical Bayesian approach is considerably more robust than either of the other approaches in the presence of outliers. We expect that this new random effects method will prove useful for a wide range of group studies, not only in the context of DCM, but also for other modelling endeavours, e.g. comparing different source reconstruction methods for EEG/MEG or selecting among competing computational models of learning and decision-making.

Introduction

Model comparison and selection is central to the scientific process, in that it allows one to evaluate different hypotheses about the way data are caused (Pitt and Myung, 2002). Nearly all scientific reporting rests upon some form of model comparison, which represents a probabilistic statement about the beliefs in one hypothesis relative to some other(s), given observations or data. The fundamental Neyman–Pearson lemma states that the best statistic upon which to base model selection is simply the probability of observing the data under one model, divided by the probability under another model (Neyman and Pearson, 1933). This is known as the likelihood ratio. In a classical (frequentist) setting, the distribution of the log-likelihood ratio, under the null hypothesis that there is no difference between models, can be computed relatively easily for some models. Common examples include Wilks' lambda for linear multivariate models and the F- and t-statistics for univariate models. In a Bayesian setting, the equivalent of the likelihood ratio is the ratio of model evidences, commonly known as a Bayes factor (Kass and Raftery, 1995). An important property of Bayes factors is that they can deal with both nested and non-nested models. In contrast, frequentist model comparison can be seen as a special case of Bayes factors where, under certain hierarchical restrictions on the models, their null distribution is readily available.
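To make the link between log-evidences and Bayes factors concrete, the following is a minimal Python sketch; the log-evidence values are hypothetical and chosen purely for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical log-evidences for two models fitted to one subject's data
# (illustrative values only).
log_ev_m1 = -310.2
log_ev_m2 = -315.8

# The Bayes factor BF_12 is the ratio of model evidences; working in
# log-space avoids numerical under/overflow for large |log-evidences|.
log_bf_12 = log_ev_m1 - log_ev_m2
bf_12 = np.exp(log_bf_12)

print(f"log BF_12 = {log_bf_12:.1f}, BF_12 = {bf_12:.1f}")
# A log Bayes factor above about 3 (BF > ~20) is conventionally read as
# strong evidence in favour of the first model (cf. Kass and Raftery, 1995).
```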

In this paper, we will consider the general case of how to use the model evidence for analyses at the group level, without putting any constraints on the models compared. These models can be nonlinear, possibly dynamic and, critically, do not necessarily bear a hierarchical relationship to each other, i.e. they are not necessarily nested. The application domain we have in mind is the comparison of dynamic causal models (DCMs) for fMRI or electrophysiological data (Friston et al., 2003, Stephan et al., 2007a) that have been inverted for each subject. However, the theoretical framework described in this paper can be applied to any model, for example when comparing different source reconstruction methods for EEG/MEG or selecting among competing computational models of learning and decision-making.

This paper is structured as follows. First, to ensure this paper is self-contained, particularly for readers without an in-depth knowledge of Bayesian statistics, we summarise the concept of log-evidence as a measure of model goodness and review commonly used approximations to it, i.e. the Akaike Information Criterion (AIC; Akaike, 1974), the Bayesian Information Criterion (BIC; Schwarz, 1978), and the negative free-energy (F). These approximations, which are described in Appendix A, differ in how they trade off model fit against model complexity. Given any of these approximations to the log-evidence, we then consider model comparison at the group level. We address this issue both from a classical and Bayesian perspective. First, in a frequentist setting, we consider classical inference on the log-evidences themselves by treating them as summary statistics that reflect the evidence for each model for a given subject. Subsequently, using a hierarchical model and variational Bayes (VB), we describe a novel technique for inference on the conditional density of the models per se, given data (or log-evidences) from all subjects. This rests on treating the model as a random variable and estimating the parameters of a Dirichlet distribution, which describes the probabilities for all models considered. These probabilities then define a multinomial distribution over model space, allowing one to compute how likely it is that a specific model generated the data of a subject chosen at random.
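As a concrete illustration of the two information criteria mentioned above, the sketch below computes AIC and BIC in their log-evidence-approximation form (accuracy minus complexity), following the convention common in the DCM literature; the function names and the illustrative inputs are our own assumptions.

```python
import numpy as np

def aic(log_lik, k):
    """AIC as a log-evidence approximation: the maximised log-likelihood
    penalised by the number of free parameters k."""
    return log_lik - k

def bic(log_lik, k, n):
    """BIC as a log-evidence approximation: the complexity penalty scales
    with the log of the number of data points n, so BIC penalises extra
    parameters more heavily than AIC once n exceeds e^2 (about 7.4)."""
    return log_lik - 0.5 * k * np.log(n)

# Illustrative comparison of two hypothetical models on the same data:
# the second fits better but pays a larger complexity penalty.
print(aic(log_lik=-450.0, k=12), bic(log_lik=-450.0, k=12, n=200))
print(aic(log_lik=-445.0, k=20), bic(log_lik=-445.0, k=20, n=200))
```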

We compare and contrast these random effects approaches to the conventional use of the group Bayes factor (GBF), an approach for model comparison at the between-subject level that has been used extensively in previous group studies in neuroimaging. For example, the GBF has been used frequently to decide between competing dynamic causal models fitted to fMRI (Acs and Greenlee, 2008, Allen et al., 2008, Grol et al., 2007, Heim et al., 2008, Kumar et al., 2007, Leff et al., 2008, Smith et al., 2006, Stephan et al., 2007b, Stephan et al., 2007c, Summerfield and Koechlin, 2008) and EEG data (Garrido et al., 2007, Garrido et al., 2008). While the GBF is a simple and straightforward index for model comparison at the group level, it assumes that all the subjects' data are generated by the same model (i.e. a fixed effects approach) and can be influenced adversely by violations of this assumption.
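In log-space, the GBF is simply the sum of subject-wise log Bayes factors, which also exposes its fixed effects weakness: a single extreme subject can dominate the sum. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical (subjects x models) log-evidence matrix; values are
# illustrative only. Two subjects mildly favour model 1, one outlier
# strongly favours model 2.
log_ev = np.array([
    [-300.1, -305.4],
    [-280.7, -284.2],
    [-340.3, -310.1],   # outlying subject
])

# Fixed effects: the GBF for model 1 vs model 2 is the product of the
# subject-wise Bayes factors, i.e. the sum of log-evidence differences.
log_gbf_12 = np.sum(log_ev[:, 0] - log_ev[:, 1])
print(f"log GBF(1 vs 2) = {log_gbf_12:.1f}")
# Result: -21.4. The single outlier flips the group-level conclusion in
# favour of model 2, even though most subjects favour model 1.
```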

The novel Bayesian framework presented in this paper does not suffer from these shortcomings: it can quantify the probability that a particular model generated the data for any randomly selected subject, relative to other models, and it is robust to the presence of outliers. In the analyses below, we illustrate the advantages of this new approach using synthetic and empirical data. We show that computing a conditional density of the model probabilities, given the log-evidences for all subjects, can be superior to both the GBF and frequentist tests applied to the log-evidences. In particular, we found that our Bayesian approach is markedly more robust than either of the other approaches in the presence of outlying subjects.
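For orientation, the following Python sketch implements the variational updates for the hierarchical Dirichlet-multinomial model summarised above, together with a sampling-based estimate of exceedance probabilities. The function names are ours and the code is a simplified illustration under the paper's update equations; a reference implementation is distributed with the SPM software.

```python
import numpy as np
from scipy.special import digamma

def vb_bms(log_ev, alpha0=None, tol=1e-6, max_iter=1000):
    """Variational Bayes for random effects BMS.

    log_ev : (N subjects x K models) array of log model evidences.
    Returns the Dirichlet parameters alpha and the expected model
    probabilities r_k = alpha_k / sum(alpha).
    """
    n, k = log_ev.shape
    alpha = np.ones(k) if alpha0 is None else np.asarray(alpha0, float).copy()
    alpha0_vec = alpha.copy()
    for _ in range(max_iter):
        # Posterior probability g_nk that model k generated subject n's
        # data, given the current Dirichlet parameters.
        log_u = log_ev + digamma(alpha) - digamma(alpha.sum())
        log_u -= log_u.max(axis=1, keepdims=True)   # numerical stability
        g = np.exp(log_u)
        g /= g.sum(axis=1, keepdims=True)
        # Update the Dirichlet counts with the expected number of
        # subjects assigned to each model.
        alpha_new = alpha0_vec + g.sum(axis=0)
        if np.linalg.norm(alpha_new - alpha) < tol:
            return alpha_new, alpha_new / alpha_new.sum()
        alpha = alpha_new
    return alpha, alpha / alpha.sum()

def exceedance_prob(alpha, n_samples=100_000, rng=None):
    """Probability that each model is more likely than any other,
    estimated by sampling from the Dirichlet posterior."""
    rng = np.random.default_rng(rng)
    samples = rng.dirichlet(alpha, size=n_samples)
    winners = samples.argmax(axis=1)
    return np.bincount(winners, minlength=len(alpha)) / n_samples
```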

Section snippets

The model evidence and its approximations

The model evidence p(y|m) is the probability of obtaining observed data y given a particular model m. It can be considered the holy grail of any model inversion and is necessary to compare different models or hypotheses. The evidence for some models can be computed relatively easily (e.g., for linear models); however, in general, computing the model evidence entails integrating out any dependency on the model parameters ϑ:

p(y|m) = ∫ p(y|ϑ, m) p(ϑ|m) dϑ
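To make this integral concrete, consider a toy conjugate example where both a naive Monte Carlo estimate and the exact evidence are available. This is purely illustrative (the model, prior, and numbers are our own assumptions), not a practical scheme for DCMs.

```python
import numpy as np
from scipy.stats import norm

# Toy model: y ~ N(theta, 1) with prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
y = 0.8                                      # a single observed data point

# Naive Monte Carlo: average the likelihood over samples from the prior.
theta = rng.normal(0.0, 1.0, size=100_000)   # samples from p(theta|m)
evidence_mc = norm.pdf(y, loc=theta, scale=1.0).mean()

# For this conjugate model the integral has a closed form:
# p(y|m) = N(y; 0, prior variance + noise variance) = N(y; 0, 2).
evidence_exact = norm.pdf(y, loc=0.0, scale=np.sqrt(2.0))
print(evidence_mc, evidence_exact)           # the two agree closely
```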

In many cases, this integration is analytically intractable.

Results

In what follows, we compare classical inference, the GBF (fixed effects) and inference on model space (random effects) using both synthetic and real data. These data have been previously published and have been analysed in various ways, including group level model inference using GBFs (Stephan et al., 2007b, Stephan et al., 2007c, Stephan et al., 2008).

Discussion

In this paper, we have introduced a novel approach for model selection at the group level. Provisional experience suggests that this approach represents a more powerful way of quantifying one's belief that a particular model is more likely than any other at the group level, relative to the conventional GBF. Critically, this variational Bayesian approach rests on treating the model switches m_i as a random variable, within a full hierarchical model for multi-subject data (see Fig. 1), and thus

Acknowledgments

This work was funded by the Wellcome Trust (KES, WDP, RJM, KJF) and the University Research Priority Program “Foundations of Human Social Behaviour” at the University of Zurich (KES). JD is funded by a Marie Curie Fellowship. We are very grateful to Marcia Bennett for helping prepare this manuscript, to the FIL Methods Group, particularly Justin Chumbley, for useful discussions, and to Jon Roiser and Dominik Bach for helpful comments on practical applications. Finally, we would like to thank the
