Technical Report No. 9815, Department of Statistics, University of Toronto
Markov Chain Sampling Methods for Dirichlet Process Mixture Models
Radford M. Neal
Department of Statistics and Department of Computer Science University of Toronto, Toronto, Ontario, Canada
http://www.cs.utoronto.ca/~radford/, firstname.lastname@example.org
1 September 1998
Abstract. Markov chain methods for sampling from the posterior distribution of a
Dirichlet process mixture model are reviewed, and two new classes of methods are presented. One new approach is to make Metropolis-Hastings updates of the indicators specifying which mixture component is associated with each observation, perhaps supplemented with a partial form of Gibbs sampling. The other new approach extends Gibbs sampling for these indicators by using a set of auxiliary parameters. These methods are simple to implement and are more efficient than previous ways of handling general Dirichlet process mixture models with non-conjugate priors.
1 Introduction

Modeling a distribution as a mixture of simpler distributions is useful both as a nonparametric density estimation method and as a way of identifying latent classes that can explain the dependencies observed between variables. Mixtures with a countably infinite number of components can reasonably be handled in a Bayesian framework, by employing a prior distribution for mixing proportions, such as a Dirichlet process, that leads to a few of these components dominating. Use of countably infinite mixtures bypasses the need to determine the "correct" number of components in a finite mixture model, a task which is fraught with technical difficulties. In many contexts, a countably infinite mixture is also a more realistic model than a mixture with a small number of components.

Use of Dirichlet process mixture models has become computationally feasible with the development of Markov chain methods for sampling from the posterior distribution of the parameters of the component distributions and/or of the associations of mixture components with observations. Methods based on Gibbs sampling can easily be implemented for models based on conjugate prior distributions, but when non-conjugate priors are used, as is appropriate in many contexts, straightforward Gibbs sampling requires that an often difficult numerical integration be performed. West, Muller, and Escobar (1994) use a Monte Carlo approximation to this integral, but the error from using such an approximation is likely to be large in many contexts. MacEachern and Muller (1998) have devised an exact approach to handling non-conjugate priors that utilizes a mapping from a set of auxiliary parameters to the set of parameters currently in use. Their "no gaps" and "complete" algorithms based on this approach are widely applicable, but somewhat inefficient. Walker and Damien (1998) apply a rather different auxiliary variable method to some Dirichlet process mixture models, but their method appears to be unsuitable for general use, as it again requires the computation of a difficult integral.

In this paper, I review this past work, and present two new approaches to Markov chain sampling. A very simple method for handling non-conjugate priors is to use Metropolis-Hastings updates with the conditional prior as the proposal distribution. A variation of this method may sometimes sample more efficiently, particularly when combined with a partial form of Gibbs sampling. Another class of methods uses Gibbs sampling in a space with auxiliary parameters. The simplest method of this type is very similar to the "no gaps" algorithm of MacEachern and Muller, but is more efficient. This approach also yields an algorithm that resembles use of a Monte Carlo approximation to the necessary integrals, but which does not suffer from any approximation error. I conclude with a demonstration of the methods on a simple problem.
2 Dirichlet process mixture models

Dirichlet process mixture models¹ go back to Antoniak (1974) and Ferguson (1983). They have recently been developed as practical methods by Escobar and West (1995), MacEachern and Muller (1998), and others. The basic model applies to data y_1, …, y_n, which we regard as exchangeable, or equivalently, as being independently drawn from some unknown distribution. The y_i may be multivariate, with components that may be real-valued or categorical. We model the distribution from which the y_i are drawn as a mixture of distributions of the form F(θ), with the mixing distribution over θ being G. We let the prior for this mixing distribution be a Dirichlet process (Ferguson 1973), with concentration parameter α and base distribution G_0 (ie, with base measure αG_0). This gives the following model:

    y_i | θ_i  ~  F(θ_i)
    θ_i | G    ~  G                                                             (1)
    G          ~  D(G_0, α)
Often, F and G_0 will depend on additional hyperparameters not mentioned above, which, along with α, may be given priors at a higher level. The computational methods discussed in this paper extend easily to these more complex models, as briefly discussed in Section 7.
¹ Sometimes also called "mixture of Dirichlet process models", apparently because of Antoniak's (1974) characterization of their posterior distributions. Since models are not usually named for the properties of their posterior distributions, this terminology is avoided here.
Since realizations of the Dirichlet process are discrete with probability one, these models can be viewed as countably infinite mixtures, as pointed out by Ferguson (1983). This is also apparent when we integrate over G in model (1), to obtain a representation of the prior distribution of the θ_i in terms of successive conditional distributions of the following form (Blackwell and MacQueen 1973):

    θ_i | θ_1, …, θ_{i-1}  ~  (1/(i-1+α)) Σ_{j<i} δ(θ_j)  +  (α/(i-1+α)) G_0    (2)
Here, δ(θ) is the distribution concentrated at the single point θ. Equivalent models can also be obtained by taking the limit as K goes to infinity of finite mixture models with K components having the following form:

    y_i | c_i, φ   ~  F(φ_{c_i})
    c_i | p        ~  Discrete(p_1, …, p_K)                                     (3)
    φ_c            ~  G_0
    p_1, …, p_K    ~  Dirichlet(α/K, …, α/K)

Here, c_i indicates which "latent class" is associated with observation y_i, with the numbering of the c_i being of no significance. For each class, c, the parameters φ_c determine the distribution of observations from that class. The mixing proportions for the classes, p_c, are given a symmetric Dirichlet prior, with concentration parameter written as α/K, so that it approaches zero as K goes to infinity. By integrating over the mixing proportions, p_c, we can write the prior for the c_i as the product of conditional probabilities of the following form:

    P(c_i = c | c_1, …, c_{i-1})  =  (n_{i,c} + α/K) / (i - 1 + α)              (4)

where n_{i,c} is the number of c_j for j < i that are equal to c. If we now let K go to infinity, we find that the conditional probabilities defining the prior for the c_i reach the following limits:²
    P(c_i = c | c_1, …, c_{i-1})                  →  n_{i,c} / (i - 1 + α)
    P(c_i ≠ c_j for all j < i | c_1, …, c_{i-1})  →  α / (i - 1 + α)            (5)

² Some readers may be disturbed by the failure of countable additivity for these limiting probabilities, but the limiting distribution of the observable quantities and the limiting forms of the algorithms based on this model are both well defined as K goes to infinity.

Since the c_i are significant only in so far as they are or are not equal to other c_j, the above probabilities are all that are needed to define the model. If we now let θ_i = φ_{c_i}, we can see that the limit of model (3) as K → ∞ is equivalent to the Dirichlet process mixture model (1), due to the correspondence between the conditional probabilities for the θ_i in equation (2) and those implied by (5). I have previously used this limiting process to define a model which (unknown to me at the time) is equivalent to a Dirichlet process mixture (Neal 1992). This view is useful
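To make the limiting prior (5) concrete, here is a minimal Python sketch that draws class indicators c_1, …, c_n sequentially: each new observation joins an existing class c with probability n_{i,c}/(i-1+α), or starts a new class with probability α/(i-1+α). The function name and the choice of integer labels are mine, for illustration only.

```python
import random

def sample_indicators(n, alpha, rng=random.Random(1)):
    """Draw c_1..c_n from the limiting prior (5): an existing class c is
    chosen with probability n_c/(i-1+alpha), a new class with
    probability alpha/(i-1+alpha)."""
    c = []
    counts = {}          # number of previous indicators equal to each label
    next_label = 0
    for i in range(1, n + 1):
        u = rng.random() * (i - 1 + alpha)
        if u < alpha or not counts:      # start a new class
            label = next_label
            next_label += 1
        else:                            # join class c with prob. proportional to its count
            u -= alpha
            for label, n_c in counts.items():
                u -= n_c
                if u < 0:
                    break
        counts[label] = counts.get(label, 0) + 1
        c.append(label)
    return c
```

Because labels are assigned in order of first appearance, the indicators returned are a sample from the prior over partitions only; their numerical values carry no further meaning, matching the remark above that the c_i matter only through equality.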
in deriving algorithms for sampling from the posterior distribution for Dirichlet process mixture models. Conversely, an algorithm for Dirichlet process mixture models will usually have a counterpart for finite mixture models. This is the case for the algorithms discussed in this paper, though I do not give details of the algorithms for finite mixtures. Yet another way of formulating a model equivalent to a Dirichlet process mixture is in terms of the prior probability that two observations come from the same mixture component (equal to 1/(1+α) in the models above). This approach has been used by Anderson (1990, Chapter 3) in formulating a model for use as a psychological theory of human category learning.
3 Gibbs sampling when conjugate priors are used

Exact computation of posterior expectations for a Dirichlet process mixture model is infeasible when there are more than a few observations. However, such expectations can be estimated using Monte Carlo methods. For example, the predictive distribution for a new observation, y_{n+1}, can be estimated by (1/T) Σ_{t=1}^T F(y_{n+1}, θ_{n+1}^{(t)}), where the points θ_{n+1}^{(t)} are drawn from the distribution (n+α)^{-1} Σ_{i=1}^n δ(θ_i^{(t)}) + α (n+α)^{-1} G_0 (see equation (2)), where θ_1^{(t)}, …, θ_n^{(t)} is the t'th point in a sample from the posterior distribution of the θ_i.

We can sample from the posterior distribution of θ_1, …, θ_n by simulating a Markov chain that has this as its equilibrium distribution. The simplest such methods are based on Gibbs sampling, which when conjugate priors are used can be done in three ways. The most direct approach to sampling for model (1) is to repeatedly draw values for each θ_i from its conditional distribution given both the data and the θ_j for j ≠ i (written as θ_{-i}). This conditional distribution is obtained by combining the likelihood, written F(y_i, θ_i), and the prior conditional on θ_{-i}, which is

    θ_i | θ_{-i}  ~  (1/(n-1+α)) Σ_{j≠i} δ(θ_j)  +  (α/(n-1+α)) G_0             (6)

This can be derived from equation (2) by imagining that i is the last observation, as we may, since the observations are exchangeable. When combined with the likelihood, this yields the following conditional distribution for use in Gibbs sampling:

    θ_i | θ_{-i}, y_i  ~  Σ_{j≠i} q_{i,j} δ(θ_j)  +  r_i H_i                    (7)

Here, H_i is the posterior distribution for θ based on the prior G_0 and the single observation y_i, with likelihood F(y_i, θ). The values of the q_{i,j} and of r_i are defined as

    q_{i,j}  =  b F(y_i, θ_j)                                                   (8)
    r_i      =  b α ∫ F(y_i, θ) dG_0(θ)                                         (9)

where b is such that Σ_{j≠i} q_{i,j} + r_i = 1. For this Gibbs sampling method to be feasible, computing the integral defining r_i and sampling from H_i must be feasible operations. This will generally be so when G_0 is the conjugate prior for the likelihood given by F.
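As a concrete instance of the conditional distribution (7), here is a minimal Python sketch of one Gibbs scan for a conjugate normal model: F(y | θ) = N(θ, σ²) with known σ², and G_0 = N(0, τ²). For this pair, the integral in (9) is the N(0, σ² + τ²) density at y_i, and H_i is a normal distribution with closed-form mean and variance. The model choice, the constants, and the function names are my own illustrative assumptions, not from the paper.

```python
import math, random

SIGMA2, TAU2, ALPHA = 1.0, 4.0, 1.0   # illustrative values, not from the paper

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gibbs_scan_alg1(theta, y, rng=random.Random(0)):
    """One scan of the direct Gibbs sampler: draw each theta_i from (7),
    with q_{i,j} proportional to F(y_i, theta_j) and r_i proportional to
    alpha * integral of F(y_i, theta) dG0(theta)."""
    n = len(y)
    for i in range(n):
        q = [norm_pdf(y[i], theta[j], SIGMA2) if j != i else 0.0 for j in range(n)]
        r = ALPHA * norm_pdf(y[i], 0.0, SIGMA2 + TAU2)   # closed form of (9)
        u = rng.random() * (sum(q) + r)
        for j in range(n):
            u -= q[j]
            if u < 0:
                theta[i] = theta[j]       # copy an existing parameter value
                break
        else:
            # draw from H_i, the posterior given y_i alone (conjugate update)
            post_var = 1.0 / (1.0 / SIGMA2 + 1.0 / TAU2)
            theta[i] = rng.gauss(post_var * y[i] / SIGMA2, math.sqrt(post_var))
    return theta
```

Normalization by b is handled implicitly by sampling proportionally to the unnormalized weights.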
We may summarize this method as follows:

Algorithm 1: Let the state of the Markov chain consist of θ_1, …, θ_n. Repeatedly sample as follows:

    For i = 1, …, n: Draw a new value from θ_i | θ_{-i}, y_i as defined by equation (7).

This algorithm is used by Escobar (1994) and by Escobar and West (1995). It produces an ergodic Markov chain, but convergence to the posterior distribution may be rather slow, and sampling thereafter may be inefficient. The problem is that there are often groups of observations that with high probability are associated with the same θ. Since the algorithm cannot change the θ for more than one observation simultaneously, changes to the θ values for observations in such a group can occur only rarely, as they require passage through a low-probability intermediate state in which observations in the group do not all have the same θ value.

This problem is avoided if Gibbs sampling is instead applied to the model formulated as in (3), with the mixing proportions, p_c, integrated out. When K is finite, each Gibbs sampling scan consists of picking a new value for each c_i from its conditional distribution given y_i, the φ_c, and the c_j for j ≠ i (written as c_{-i}), and then picking a new value for each φ_c from its conditional distribution given the y_i for which c_i = c. The required conditional probabilities for c_i can easily be computed:

    P(c_i = c | c_{-i}, y_i, φ)  =  b F(y_i, φ_c) (n_{-i,c} + α/K) / (n - 1 + α)   (10)

where n_{-i,c} is the number of c_j for j ≠ i that are equal to c, and b is the appropriate normalizing constant. The last factor is derived from equation (4) by imagining that i is the last observation. (Note that the denominator n - 1 + α could be absorbed into b, but here and later it is retained for clarity.) The conditional distribution for φ_c will also be easy to sample from when the priors used are conjugate, and even when Gibbs sampling for φ_c is difficult, one may simply substitute some other update that leaves the required distribution invariant.
Note that when a new value is chosen for φ_c, the values of θ_i = φ_{c_i} will change simultaneously for all observations associated with component c. When K goes to infinity, we cannot, of course, explicitly represent the infinite number of φ_c. We instead represent, and do Gibbs sampling for, only those φ_c that are currently associated with some observation. Gibbs sampling for the c_i is based on the following conditional probabilities (with φ here being the set of φ_c currently associated with at least one observation):

    If c = c_j for some j ≠ i:  P(c_i = c | c_{-i}, y_i, φ)  =  b (n_{-i,c} / (n-1+α)) F(y_i, φ_c)
    P(c_i ≠ c_j for all j ≠ i | c_{-i}, y_i, φ)  =  b (α / (n-1+α)) ∫ F(y_i, φ) dG_0(φ)          (11)

Here, b is the appropriate normalizing constant that makes the above probabilities sum to one. The numerical values of the c_i are arbitrary, as long as they faithfully represent whether or not c_i = c_j; they may be chosen for programming convenience, or to facilitate the display of mixture components in some desired order. When Gibbs sampling for c_i
chooses a value not equal to any other c_j, a value for φ_{c_i} is chosen from H_i, the posterior distribution of φ based on the prior G_0 and the single observation y_i. We can summarize this second Gibbs sampling method as follows:

Algorithm 2: Let the state of the Markov chain consist of c_1, …, c_n and φ = (φ_c : c ∈ {c_1, …, c_n}). Repeatedly sample as follows:

    For i = 1, …, n: If the present value of c_i is associated with no other observation (ie, n_{-i,c_i} = 0), remove φ_{c_i} from the state. Draw a new value for c_i from c_i | c_{-i}, y_i, φ as defined by equation (11). If the new c_i is not associated with any other observation, draw a value for φ_{c_i} from H_i and add it to the state.

    For all c ∈ {c_1, …, c_n}: Draw a new value from φ_c | y_i s.t. c_i = c.

This is essentially the method used by Bush and MacEachern (1996) and by West, Muller, and Escobar (1994). As was the case for the first Gibbs sampling method, this approach is feasible if we can compute ∫ F(y_i, φ) dG_0(φ) and sample from H_i, as will generally be the case when G_0 is the conjugate prior. Finally, in a conjugate context, we can often integrate analytically over the φ_c, eliminating them from the algorithm. The state of the Markov chain then consists only of the c_i, which we update by Gibbs sampling using the following conditional probabilities:
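The update of Algorithm 2 can be sketched in Python for the same conjugate normal model as before: F(y | φ) = N(φ, σ²), G_0 = N(0, τ²). The integral in (11) and the draws from H_i and from φ_c's posterior all have closed forms for this pair. All names, constants, and the labeling scheme are mine, for illustration under these assumptions.

```python
import math, random

SIGMA2, TAU2, ALPHA = 1.0, 4.0, 1.0   # illustrative values, not from the paper

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def alg2_scan(c, phi, y, rng=random.Random(0)):
    """One scan of the second Gibbs sampler: update each c_i via (11),
    then each phi_c given the observations assigned to it."""
    n = len(y)
    for i in range(n):
        counts = {}
        for j in range(n):
            if j != i:
                counts[c[j]] = counts.get(c[j], 0) + 1
        if c[i] not in counts:            # singleton: remove phi_{c_i} from the state
            del phi[c[i]]
        labels = list(counts)             # weights from (11); 1/(n-1+alpha) cancels in b
        w = [counts[l] * norm_pdf(y[i], phi[l], SIGMA2) for l in labels]
        w_new = ALPHA * norm_pdf(y[i], 0.0, SIGMA2 + TAU2)
        u = rng.random() * (sum(w) + w_new)
        for l, wl in zip(labels, w):
            u -= wl
            if u < 0:
                c[i] = l
                break
        else:                             # new component: draw its phi from H_i
            c[i] = max(phi, default=-1) + 1
            pv = 1.0 / (1.0 / SIGMA2 + 1.0 / TAU2)
            phi[c[i]] = rng.gauss(pv * y[i] / SIGMA2, math.sqrt(pv))
    # update each phi_c from its posterior given the y_i with c_i = c
    for l in set(c):
        ys = [y[i] for i in range(n) if c[i] == l]
        pv = 1.0 / (len(ys) / SIGMA2 + 1.0 / TAU2)
        phi[l] = rng.gauss(pv * sum(ys) / SIGMA2, math.sqrt(pv))
    return c, phi
```

Note how the whole group assigned to a component moves together when its φ_c is resampled at the end of the scan, which is exactly what Algorithm 1 cannot do.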
    If c = c_j for some j ≠ i:  P(c_i = c | c_{-i}, y_i)  =  b (n_{-i,c} / (n-1+α)) ∫ F(y_i, φ) dH_{-i,c}(φ)
    P(c_i ≠ c_j for all j ≠ i | c_{-i}, y_i)  =  b (α / (n-1+α)) ∫ F(y_i, φ) dG_0(φ)                  (12)

Here, H_{-i,c} is the posterior distribution of φ based on the prior G_0 and all observations y_j for which j ≠ i and c_j = c. This third Gibbs sampling method can be summarized as follows:

Algorithm 3: Let the state of the Markov chain consist of c_1, …, c_n. Repeatedly sample as follows:

    For i = 1, …, n: Draw a new value from c_i | c_{-i}, y_i as defined by equation (12).

This algorithm is presented by MacEachern (1994) for mixtures of normals and by myself (Neal 1992) for models of categorical data.
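For the conjugate normal model used in the earlier sketches (F(y | φ) = N(φ, σ²), G_0 = N(0, τ²)), both integrals in (12) are posterior-predictive normal densities, so the collapsed sampler of Algorithm 3 can be sketched directly. Function names and constants are again my own illustrative choices.

```python
import math, random

SIGMA2, TAU2, ALPHA = 1.0, 4.0, 1.0   # illustrative values, not from the paper

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predictive(y_i, ys):
    """Integral of F(y_i, phi) dH(phi), where H is the posterior for
    phi ~ N(0, TAU2) given observations ys from N(phi, SIGMA2).
    With ys empty, H is just G0, giving the second integral in (12)."""
    pv = 1.0 / (len(ys) / SIGMA2 + 1.0 / TAU2)
    pm = pv * sum(ys) / SIGMA2
    return norm_pdf(y_i, pm, SIGMA2 + pv)

def alg3_scan(c, y, rng=random.Random(0)):
    """One scan of Algorithm 3: Gibbs sampling for the c_i only, using (12)."""
    n = len(y)
    for i in range(n):
        groups = {}
        for j in range(n):
            if j != i:
                groups.setdefault(c[j], []).append(y[j])
        labels = list(groups)
        w = [len(groups[l]) * predictive(y[i], groups[l]) for l in labels]
        w_new = ALPHA * predictive(y[i], [])
        u = rng.random() * (sum(w) + w_new)
        for l, wl in zip(labels, w):
            u -= wl
            if u < 0:
                c[i] = l
                break
        else:
            c[i] = max(c) + 1   # any label not used by the c_j for j != i
    return c
```

Since the φ_c are integrated out, the state is just the vector of indicators, and no parameter values ever need to be stored.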
4 Existing methods for handling non-conjugate priors

Algorithms 1 to 3 above cannot easily be applied to models where G_0 is not the conjugate prior for F, as the integrals in equations (9), (11), and (12) will usually not be analytically tractable. Sampling from H_i may also be hard when the prior is not conjugate. West, Muller, and Escobar (1994) suggest using either numerical quadrature or a Monte Carlo approximation to evaluate the required integral. If ∫ F(y_i, θ) dG_0(θ) is approximated by an average over m values for θ drawn from G_0, one could also approximate a draw from H_i, if required, by drawing from among these m points with probabilities proportional to their likelihoods, given by F(y_i, θ). Though their paper is not explicit, it appears that West, Muller, and Escobar's non-conjugate example uses this approach with m = 1 (see MacEachern and Muller 1998).

Unfortunately, this approach is potentially quite inaccurate. Often, H_i, the posterior based on y_i alone, will be considerably more concentrated than the prior, G_0, particularly when y_i is multidimensional. If a small to moderate number of points are drawn from G_0, it may be that none are typical of H_i. Consequently, the probability of choosing c_i to be a new component can be much lower than it would be if the exact probabilities of equation (11) were used. The consequence of this is not just slower convergence, since on the rare occasions when c_i is in fact set to a new component, with an appropriate θ typical of H_i, this new component is likely to be discarded in the very next Gibbs sampling iteration, leading to the wrong stationary distribution. This problem shows that the usual Gibbs sampling procedure of forgetting the current value of a variable before sampling from its conditional distribution will have to be modified in any valid scheme that uses values for θ drawn from G_0.

MacEachern and Muller (1998) present a framework that does allow auxiliary values for θ drawn from G_0 to be used to define a valid Markov chain sampler. I will explain their idea as an extension of Algorithm 2 of Section 3. There, the numerical values of the c_i were regarded as significant only in so far as they indicate which observations are associated with the same component. MacEachern and Muller consider more specific schemes for assigning distributions to the c_i, which serve to map from a collection of values for φ_c to values for the θ_i. Many such schemes will produce the same distribution for the θ_i, but lead to different sampling algorithms.
The "no gaps" algorithm of MacEachern and Muller arises when the c_i for i = 1, …, n are required to cover the set of integers from 1 to k, with k being the number of distinct c_i, but are not otherwise constrained. By considering Gibbs sampling in this representation, they derive the following algorithm:

Algorithm 4: Let the state of the Markov chain consist of c_1, …, c_n and φ = (φ_c : c ∈ {c_1, …, c_n}). Repeatedly sample as follows:

    For i = 1, …, n: Let k⁻ be the number of distinct c_j for j ≠ i, and let these c_j have values in {1, …, k⁻}. If c_i ≠ c_j for all j ≠ i, then with probability k⁻/(k⁻ + 1) do nothing, leaving c_i unchanged. Otherwise, label c_i as k⁻ + 1 if c_i ≠ c_j for all j ≠ i, or draw a value for φ_{k⁻+1} from G_0 if c_i = c_j for some j ≠ i. Then draw a new value for c_i from {1, …, k⁻ + 1} using the following probabilities:

        P(c_i = c | c_{-i}, y_i, φ_1, …, φ_{k⁻+1})  =  b n_{-i,c} F(y_i, φ_c)         if 1 ≤ c ≤ k⁻
                                                    =  b [α/(k⁻ + 1)] F(y_i, φ_c)    if c = k⁻ + 1

    where b is the appropriate normalizing constant. Change the state to contain only those φ_c that are now associated with an observation.

    For all c ∈ {c_1, …, c_n}: Draw a new value from φ_c | y_i s.t. c_i = c, or perform some other update to φ_c that leaves this distribution invariant.
This algorithm can be applied to any model for which we can sample from G_0 and compute F(y_i, φ), regardless of whether G_0 is the conjugate prior for F. However, there is a puzzling inefficiency in the algorithm's mechanism for setting c_i to a value different from all other c_j (ie, for assigning an observation to a newly-created mixture component). The probability of such a change is reduced from what one might expect by a factor of k⁻ + 1, with a corresponding reduction in the probability of the opposite change. As will be seen in Section 6, a similar algorithm without this inefficiency is possible. MacEachern and Muller have also developed an algorithm based on a "complete" scheme for mapping from the φ_c to the θ_i. It requires maintaining n values for φ, which may be inefficient when k is much less than n. The approach that will be presented in Section 6 allows more control over the number of auxiliary parameter values used.

Another approach to handling non-conjugate priors has recently been devised by Walker and Damien (1998). Their method avoids the integrals needed for Gibbs sampling, but requires instead that the probability under G_0 of the set of all θ for which F(y_i, θ) > u be computable, and that one be able to sample from G_0 restricted to this set. Although these operations are feasible for some models, they will in general be quite difficult, especially when θ is multidimensional.

Finally, Green and Richardson (1998) have developed a Markov chain sampling method based on splitting and merging components that is applicable to non-conjugate models. Their method is considerably more complex than the others discussed in this paper, since it attempts to solve the more difficult problem of obtaining good performance in situations where the other methods tend to become trapped in local modes that aren't easily escaped with incremental changes. Discussion of this issue is beyond the scope of this paper.
5 Metropolis-Hastings updates and partial Gibbs sampling
Perhaps the simplest way of handling non-conjugate priors is by using the Metropolis-Hastings algorithm (Hastings 1970) to update the c_i, using the conditional prior as the proposal distribution. Recall that the Metropolis-Hastings algorithm for sampling from a distribution with density π(x) using proposals with density g(x*|x) updates the state x as follows:

    Draw a candidate state, x*, according to the density g(x*|x). Compute the acceptance probability

        a(x*, x)  =  min[1, (g(x | x*) π(x*)) / (g(x* | x) π(x))]               (13)

    With probability a(x*, x), set the new state, x′, to x*. Otherwise, let x′ be the same as x.

This update from x to x′ leaves π invariant. When x is multidimensional, proposal distributions that change only one component of x are often used. Updates based on several such proposals, along with updates of other types, can be combined in order to construct an ergodic Markov chain that will converge to π.
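The generic update (13) can be written as a small Python function; everything here (names, the example target and proposal in the usage note) is my own illustration of the standard algorithm.

```python
import random

def mh_update(x, pi, g_sample, g_density, rng=random.Random(0)):
    """One Metropolis-Hastings update per (13): propose x* ~ g(.|x), accept
    with probability min[1, g(x|x*) pi(x*) / (g(x*|x) pi(x))]."""
    x_star = g_sample(x)
    a = min(1.0, (g_density(x, x_star) * pi(x_star)) /
                 (g_density(x_star, x) * pi(x)))
    return x_star if rng.random() < a else x
```

For a symmetric proposal, such as a random walk, g cancels from the ratio, leaving the familiar min[1, π(x*)/π(x)]; that is also the cancellation exploited below when the conditional prior is used as the proposal for c_i.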
This approach can be applied to model (3) for finite K, with the p_c integrated out, using Metropolis-Hastings updates for each c_i in turn, along with Gibbs sampling or other updates for the φ_c. When updating just c_i, we can ignore those factors in the posterior distribution that do not involve c_i. What remains is the product of the likelihood for observation i, F(y_i, φ_{c_i}), and the conditional prior for c_i given the other c_j, which is

    P(c_i = c | c_{-i})  =  (n_{-i,c} + α/K) / (n - 1 + α)                      (14)

where, as before, n_{-i,c} is the number of c_j for j ≠ i that are equal to c. This can be obtained from equation (4) by imagining that i is the last observation. If we now choose to use this conditional prior for c_i as the proposal distribution, we find that this factor cancels when computing the acceptance probability of equation (13), leaving

    a(c_i*, c_i)  =  min[1, F(y_i, φ_{c_i*}) / F(y_i, φ_{c_i})]                 (15)
This approach continues to work as we let K → ∞ in order to produce an algorithm for a Dirichlet process mixture model. The conditional prior for c_i becomes

    If c = c_j for some j ≠ i:  P(c_i = c | c_{-i})  =  n_{-i,c} / (n - 1 + α)
    P(c_i ≠ c_j for all j ≠ i | c_{-i})  =  α / (n - 1 + α)                     (16)

If we use this as the proposal distribution for an update to c_i, we will need to draw an associated value for φ from G_0 if the candidate, c_i*, is not in {c_1, …, c_n}. Note that if the current c_i is not equal to any other c_j, the probability of choosing c_i* to be the same as the current c_i is zero (ie, when c_i* is chosen to be different from the other c_j it will always be a new component, not the current c_i, even when that also differs from the other c_j). (The method would be valid even if a new component were not created in this situation, but this is the behaviour obtained by taking the K → ∞ limit of the algorithm for finite K.) We might wish to perform more than one such Metropolis-Hastings update for each of the c_i. With this elaboration, the algorithm can be summarized as follows:
Algorithm 5: Let the state of the Markov chain consist of c_1, …, c_n and φ = (φ_c : c ∈ {c_1, …, c_n}). Repeatedly sample as follows:

    For i = 1, …, n, repeat the following update of c_i R times: Draw a candidate, c_i*, from the conditional prior for c_i given by equation (16). If a c_i* not in {c_1, …, c_n} is proposed, choose a value for φ_{c_i*} from G_0. Compute the acceptance probability, a(c_i*, c_i), as in equation (15), and set the new value of c_i to c_i* with this probability. Otherwise let the new value of c_i be the same as the old value.

    For all c ∈ {c_1, …, c_n}: Draw a new value from φ_c | y_i s.t. c_i = c, or perform some other update to φ_c that leaves this distribution invariant.
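Because this update needs only evaluations of F and draws from G_0, it works with non-conjugate priors. Here is a minimal Python sketch of the R proposals for one c_i; the likelihood F, the base distribution, the constants, and the function names are my own illustrative assumptions (any computable F and samplable G_0 would do).

```python
import math, random

SIGMA2, ALPHA, R = 1.0, 1.0, 4    # illustrative values; R proposals per c_i

def F(y_i, phi):                  # unnormalized likelihood; any computable F works
    return math.exp(-(y_i - phi) ** 2 / (2 * SIGMA2))

def sample_G0(rng):               # G0 need not be conjugate to F
    return rng.uniform(-10.0, 10.0)

def alg5_update_c(i, c, phi, y, rng):
    """R Metropolis-Hastings updates of c_i, proposing from the conditional
    prior (16) and accepting per (15)."""
    n = len(y)
    for _ in range(R):
        u = rng.random() * (n - 1 + ALPHA)
        if u < ALPHA:                       # propose a brand-new component
            c_star = max(phi) + 1
            phi_star = sample_G0(rng)
        else:                               # propose c_j uniformly over j != i,
            others = [c[j] for j in range(n) if j != i]
            c_star = others[int(u - ALPHA)] # so P(c* = c) = n_{-i,c}/(n-1+alpha)
            phi_star = phi[c_star]
        a = min(1.0, F(y[i], phi_star) / F(y[i], phi[c[i]]))
        if rng.random() < a:
            old = c[i]
            c[i] = c_star
            phi[c_star] = phi_star
            if old not in c:                # drop parameters no longer in use
                del phi[old]
```

As noted above, reusing the F(y_i, φ_{c_i}) value across the R repetitions would save likelihood evaluations; the sketch recomputes it for clarity.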
If R is greater than one, it is possible to save computation time by reusing values of F that were previously computed. An evaluation of F can also be omitted when c_i* turns out to be the same as c_i. The number of evaluations of F required to update one c_i is thus no more than R + 1. For comparison, the number of evaluations of F needed for Gibbs sampling and the "no gaps" algorithm is approximately equal to one plus the number of distinct c_j for j ≠ i.

If the updates for the φ_c in the last step of Algorithm 5 are omitted, the result is equivalent to the following:

Algorithm 6: Let the state of the Markov chain consist of θ_1, …, θ_n. Repeatedly sample as follows:

    For i = 1, …, n, repeat the following update of θ_i R times: Draw a candidate, θ_i*, from the following distribution:

        (1/(n-1+α)) Σ_{j≠i} δ(θ_j)  +  (α/(n-1+α)) G_0

    Compute the acceptance probability

        a(θ_i*, θ_i)  =  min[1, F(y_i, θ_i*) / F(y_i, θ_i)]

    Set the new value of θ_i to θ_i* with this probability; otherwise let the new value of θ_i be the same as the old value.

This might have been justified directly as a Metropolis-Hastings algorithm, but the fact that the proposal distribution for θ_i is not continuous introduces conceptual, or at least notational, difficulties. Note that this algorithm suffers from the same problem of not being able to change several θ_i simultaneously as was discussed for Algorithm 1.

The behaviour of the Metropolis-Hastings methods (Algorithms 5 and 6) differs substantially from that of the corresponding Gibbs sampling methods (Algorithms 2 and 1) and the "no gaps" method (Algorithm 4). These other methods consider all mixture components when deciding on a new value for c_i, whereas the Metropolis-Hastings method is more likely to consider changing c_i to a component associated with many observations than to a component associated with few observations. Also, the probability that the Metropolis-Hastings method will consider changing c_i to a newly created component is proportional to α. (Of course, the probability of actually making such a change depends on F; for all methods here the issue is whether such a change is even considered.) It is difficult to say which behaviour is better. Algorithm 5 does appear to perform adequately in practice, but since small values of α (around one) are often used, one might wonder whether an algorithm that could consider the creation of a new component more often might be more efficient.

We can produce such an algorithm by modifying the proposal distribution for updates to the c_i. In particular, whenever c_i = c_j for some j ≠ i, we can propose changing c_i to a newly created component, with associated φ drawn from G_0. In order to allow the reverse change, the proposal distribution for "singleton" c_i that are not equal to any c_j with j ≠ i will be confined to those components that are associated with other observations, with probabilities proportional to n_{-i,c}.
Note that when the current c_i is not a singleton, the probability of proposing a new component is a factor of (n - 1 + α)/α greater than under the conditional prior, while when c_i is a singleton, the probability of proposing any existing component is a factor of (n - 1 + α)/(n - 1) greater than its conditional prior. The probability of accepting a proposal must be adjusted by the ratio of these factors.

On their own, these updates are sufficient to produce a Markov chain that is ergodic, but such a chain would often sample inefficiently, since it can change an observation from one existing component to another only by passing through a possibly unlikely state in which that observation is a singleton. Such changes can be made more likely by combining these Metropolis-Hastings updates with partial Gibbs sampling updates, which are applied only to those observations that are not singletons, and which are allowed to change c_i for such an observation only to a component associated with some other observation. In other words, these updates perform Gibbs sampling for the posterior distribution conditional on the set of components that are associated with observations not changing. No difficult integrations are required for this partial Gibbs sampling operation. Combining the modified Metropolis-Hastings updates, the partial Gibbs sampling updates, and the usual updates to φ_c for c ∈ {c_1, …, c_n} produces the following algorithm:

Algorithm 7: Let the state of the Markov chain consist of c_1, …, c_n and φ = (φ_c : c ∈ {c_1, …, c_n}). Repeatedly sample as follows:

    For i = 1, …, n, update c_i as follows: If c_i is not a singleton (ie, c_i = c_j for some j ≠ i), let c_i* be a newly-created component, with φ_{c_i*} drawn from G_0. Set the new c_i to this c_i* with probability

        a(c_i*, c_i)  =  min[1, (α/(n-1)) · F(y_i, φ_{c_i*}) / F(y_i, φ_{c_i})]

    Otherwise, when c_i is a singleton, draw c_i* from c_{-i}, choosing c_i* = c with probability n_{-i,c}/(n-1). Set the new c_i to this c_i* with probability

        a(c_i*, c_i)  =  min[1, ((n-1)/α) · F(y_i, φ_{c_i*}) / F(y_i, φ_{c_i})]

    If the new c_i is not set to c_i*, it is the same as the old c_i.

    For i = 1, …, n: If c_i is a singleton (ie, c_i ≠ c_j for all j ≠ i), do nothing. Otherwise, choose a new value for c_i from {c_1, …, c_n} using the following probabilities:

        P(c_i = c | c_{-i}, y_i, φ, c_i ∈ {c_1, …, c_n})  =  b (n_{-i,c}/(n-1)) F(y_i, φ_c)

    where b is the appropriate normalizing constant.

    For all c ∈ {c_1, …, c_n}: Draw a new value from φ_c | y_i s.t. c_i = c, or perform some other update to φ_c that leaves this distribution invariant.
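The two kinds of update in Algorithm 7 can be sketched in Python; as with the earlier sketches, the likelihood F, the base distribution, and all names are my own illustrative assumptions.

```python
import math, random

SIGMA2, ALPHA = 1.0, 1.0   # illustrative values, not from the paper

def F(y_i, phi):
    return math.exp(-(y_i - phi) ** 2 / (2 * SIGMA2))

def sample_G0(rng):
    return rng.uniform(-10.0, 10.0)

def alg7_mh_step(i, c, phi, y, rng):
    """Modified M-H update: non-singletons propose a fresh component drawn
    from G0; singletons propose an existing component, with the acceptance
    probabilities adjusted by the proposal-ratio factors given above."""
    n = len(y)
    counts = {}
    for j in range(n):
        if j != i:
            counts[c[j]] = counts.get(c[j], 0) + 1
    if c[i] in counts:                          # not a singleton: propose new
        c_star, phi_star = max(phi) + 1, sample_G0(rng)
        a = min(1.0, (ALPHA / (n - 1)) * F(y[i], phi_star) / F(y[i], phi[c[i]]))
    else:                                       # singleton: propose existing
        others = [c[j] for j in range(n) if j != i]
        c_star = others[rng.randrange(n - 1)]   # P(c* = c) = n_{-i,c}/(n-1)
        phi_star = phi[c_star]
        a = min(1.0, ((n - 1) / ALPHA) * F(y[i], phi_star) / F(y[i], phi[c[i]]))
    if rng.random() < a:
        old = c[i]
        c[i] = c_star
        phi[c_star] = phi_star
        if old not in c:
            del phi[old]

def alg7_partial_gibbs(i, c, phi, y, rng):
    """Partial Gibbs step: only non-singletons move, and only to components
    associated with other observations, so no integral over G0 is needed."""
    n = len(y)
    counts = {}
    for j in range(n):
        if j != i:
            counts[c[j]] = counts.get(c[j], 0) + 1
    if c[i] not in counts:
        return                                  # singleton: do nothing
    labels = list(counts)
    w = [counts[l] * F(y[i], phi[l]) for l in labels]
    u = rng.random() * sum(w)
    for l, wl in zip(labels, w):
        u -= wl
        if u < 0:
            c[i] = l
            return
```

The M-H step handles creation and destruction of components; the partial Gibbs step moves observations between existing components without passing through a singleton state.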
6 Gibbs sampling with auxiliary parameters
In this section, I show how models with non-conjugate priors can be handled by applying Gibbs sampling to a state that has been extended by the addition of auxiliary parameters. This approach is similar to that of MacEachern and Muller (1998), but differs in that the auxiliary parameters are regarded as existing only temporarily; this allows more flexibility in constructing algorithms.
φc ci θi
Figure 1: Representing the conditional prior distribution for a new observation using auxiliary parameters. The component for the new observation is chosen from among the four components associated with other observations plus three possible new components, with parameters, 5 6 7 , drawn independently from G0 . The probabilities used for this choice are shown at the top. The dashed arrows illustrate the possibilities of choosing an existing component, or a new component that uses one of the auxiliary parameters. The basic idea of auxiliary variable methods is that we can sample from a distribution for x by sampling from some distribution for (x y), with respect to which the marginal distribution of x is . We can extend this idea to accommodate auxiliary variables that are created and discarded during the Markov chain simulation. The permanent state of the Markov chain will be x, but a variable y will be introduced temporarily during an update of the following form: 1) Draw a value for y from its conditional distribution given x, as de ned by . 2) Perform some update of (x y) that leaves invariant. 3) Discard y, leaving only the value of x. It is easy to see that this update for x will leave invariant as long as is the marginal distribution of x under . We can combine several such updates, which may involve di erent auxiliary variables, along with other updates that leave invariant, to construct a Markov chain that will converge to . We can use this technique to update the c for a Dirichlet process mixture model without having to integrate with respect G0 . The permanent state of the Markov chain will consist of the c and the , as in Algorithm 2, but when c is updated, we will introduce temporary auxiliary variables that represent possible values for the parameters of components that are not associated with any other observations. We then update c by Gibbs sampling with respect to the distribution that includes these auxiliary parameters. 
Since the observations y_i are exchangeable, and the component labels c_i are arbitrary, we can assume that we are updating c_i for the last observation, and that the c_j for other observations have values in the set {1, ..., k^-}, where k^- is the number of distinct c_j for j ≠ i. We can now visualize the conditional prior distribution for c_i given the other c_j in terms of m auxiliary components and their associated parameters. The probability of c_i being equal to a c in {1, ..., k^-} will be n_{-i,c}/(n - 1 + α), where n_{-i,c} is the number of times c occurs among the c_j for j ≠ i. The probability of c_i having some other value will be α/(n - 1 + α), which we will split equally among the m auxiliary components we have introduced. Figure 1 illustrates this setup for m = 3.
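As a concrete check on these conditional prior probabilities, here is a small sketch (function name and data layout are my own; labels for the other observations are assumed already relabelled to {1, ..., k^-}):

```python
from collections import Counter

def conditional_prior(c_others, alpha, m):
    # Conditional prior probabilities for c_i given the other labels,
    # with m auxiliary components representing potential new components.
    n = len(c_others) + 1            # total number of observations
    counts = Counter(c_others)       # n_{-i,c} for each existing component c
    k_minus = len(counts)
    probs = {}
    for c in range(1, k_minus + 1):
        probs[c] = counts[c] / (n - 1 + alpha)        # existing component c
    for c in range(k_minus + 1, k_minus + m + 1):
        probs[c] = (alpha / m) / (n - 1 + alpha)      # each auxiliary component
    return probs

# The setup of Figure 1: four existing components, m = 3 auxiliary ones.
probs = conditional_prior([1, 2, 2, 3, 4, 4, 4], alpha=1.0, m=3)
print(sum(probs.values()))  # the probabilities sum to 1
```

Note that the total probability α/(n - 1 + α) of a new component is split equally over the m auxiliary slots, so the distribution over (existing component vs. new component) is unchanged by the choice of m.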
This representation of the prior gives rise to a corresponding representation of the posterior, which also includes these auxiliary parameters. The first step in using this representation to update c_i is to sample from the conditional distribution of these auxiliary parameters given the current value of c_i and the rest of the state. If c_i = c_j for some j ≠ i, the auxiliary parameters have no connection with the rest of the state, or the observations, and are simply drawn independently from G0. If c_i ≠ c_j for all j ≠ i (i.e., c_i is a singleton), then it must be associated with one of the m auxiliary parameters. Technically, we should select which auxiliary parameter it is associated with at random, but since it turns out to make no difference, we can just let c_i be the first of these auxiliary components. The corresponding value for φ must of course be equal to the existing φ_{c_i}. The φ values for the other auxiliary components (if any; there are none if m = 1) are again drawn independently from G0.

We now perform a Gibbs sampling update for c_i in this representation of the posterior distribution. Since c_i must be either one of the components associated with other observations or one of the auxiliary components that were introduced, we can easily do Gibbs sampling by evaluating the relative probabilities of these possibilities. Once a new value for c_i has been chosen, we discard all φ values that are not now associated with an observation. This algorithm can be summarized as follows:

Algorithm 8: Let the state of the Markov chain consist of c_1, ..., c_n and φ = (φ_c : c ∈ {c_1, ..., c_n}). Repeatedly sample as follows:

For i = 1, ..., n: Let k^- be the number of distinct c_j for j ≠ i, and let h = k^- + m. Label these c_j with values in {1, ..., k^-}. If c_i = c_j for some j ≠ i, draw values independently from G0 for those φ_c for which k^- < c ≤ h. If c_i ≠ c_j for all j ≠ i, let c_i have the label k^- + 1, and draw values independently from G0 for those φ_c for which k^- + 1 < c ≤ h.
Draw a new value for c_i from {1, ..., h} using the following probabilities:

    P(c_i = c | c_{-i}, y_i, φ_1, ..., φ_h) = b (n_{-i,c} / (n - 1 + α)) F(y_i, φ_c),     for 1 ≤ c ≤ k^-
    P(c_i = c | c_{-i}, y_i, φ_1, ..., φ_h) = b ((α/m) / (n - 1 + α)) F(y_i, φ_c),        for k^- < c ≤ h

where n_{-i,c} is the number of c_j for j ≠ i that are equal to c, and b is the appropriate normalizing constant. Change the state to contain only those φ_c that are now associated with one or more observations.

For all c ∈ {c_1, ..., c_n}: Draw a new value from φ_c | y_i such that c_i = c, or perform some other update to φ_c that leaves this distribution invariant.

Note that the relabellings of the c_j above are conceptual; they may or may not require any actual computation, depending on the data structures used. When m = 1, Algorithm 8 closely resembles Algorithm 4, the "no gaps" algorithm of MacEachern and Muller (1998). The difference is that the probability of changing c_i from a component shared with other observations to a new singleton component is approximately k^- + 1 times greater with Algorithm 8, and the same is true for the reverse
change. When α is small, this seems to be a clear benefit, since the probabilities for other changes are affected only slightly. At the other extreme, as m → ∞, Algorithm 8 approaches the behaviour of Algorithm 2, since the m (or m - 1) values for φ_c drawn from G0 effectively produce a Monte Carlo approximation to the integral computed in Algorithm 2. However, the equilibrium distribution of the Markov chain defined by Algorithm 8 is exactly correct for any value of m, unlike the situation when a Monte Carlo approximation is used to implement Algorithm 2.
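One sweep of the label updates of Algorithm 8 might be sketched as follows, under the Gaussian setup used in the demonstration of Section 8 (F(y, φ) = N(y; φ, 0.1²), G0 = N(0, 1)). The φ updates at the end of each iteration are omitted for brevity, and the fresh-label bookkeeping is one arbitrary choice among many, since the relabellings are only conceptual:

```python
import math
import random

SIGMA = 0.1  # component standard deviation in the demonstration's F

def F(y, phi):
    # Likelihood F(y_i, phi_c): N(y; phi, SIGMA^2) density.
    return math.exp(-0.5 * ((y - phi) / SIGMA) ** 2) / (SIGMA * math.sqrt(2 * math.pi))

def draw_G0():
    # Draw a component parameter from the prior G0 = N(0, 1).
    return random.gauss(0.0, 1.0)

def update_labels(y, c, phi, alpha, m):
    # One sweep over i = 1..n of the c_i updates of Algorithm 8.
    # c: list of labels; phi: dict mapping each label in use to its parameter.
    n = len(y)
    for i in range(n):
        counts = {}
        for j in range(n):
            if j != i:
                counts[c[j]] = counts.get(c[j], 0) + 1   # n_{-i,c}
        # Auxiliary parameters: if c_i is a singleton, the first auxiliary
        # component keeps its existing phi; the rest are drawn from G0.
        aux = [draw_G0() for _ in range(m)]
        if c[i] not in counts:
            aux[0] = phi[c[i]]
        choices, weights = [], []
        for cj, cnt in counts.items():                   # existing components
            choices.append(('old', cj))
            weights.append(cnt / (n - 1 + alpha) * F(y[i], phi[cj]))
        for a in aux:                                    # auxiliary components
            choices.append(('new', a))
            weights.append((alpha / m) / (n - 1 + alpha) * F(y[i], a))
        kind, val = random.choices(choices, weights=weights)[0]
        if kind == 'old':
            c[i] = val
        else:
            c[i] = max(phi) + 1                          # a fresh label
            phi[c[i]] = val
        # Discard phi values no longer associated with any observation.
        used = set(c)
        for key in list(phi):
            if key not in used:
                del phi[key]
    return c, phi

random.seed(1)
y = [-1.48, -1.40, -1.16, -1.08, -1.02, 0.14, 0.51, 0.53, 0.78]
c = [0] * len(y)
phi = {0: 0.0}
for _ in range(50):
    c, phi = update_labels(y, c, phi, alpha=1.0, m=2)
print(len(set(c)))  # number of components currently in use
```

After every update the retained φ values are exactly those associated with at least one observation, matching the "change the state" step of the algorithm.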
7 Updates for hyperparameters

For many problems, it is necessary to extend the model to incorporate uncertainty regarding the value of α, or regarding the values of other hyperparameters that determine F and G0. These hyperparameters can be included in the Markov chain simulation, as is briefly discussed here.

The conditional distribution of α given the other parameters depends only on k, the number of distinct c_i. It can be updated by some Metropolis-Hastings method, or by the methods discussed by Escobar and West (1995). If F depends on hyperparameters, the conditional density for these hyperparameters given the current θ_i will be proportional to their prior density times the likelihood, ∏_{i=1}^n F(y_i, θ_i). If G0 depends on hyperparameters, the conditional density for these given the current c_i and φ will be proportional to their prior density times ∏_c G0(φ_c), where the product is over the values of c that occur in {c_1, ..., c_n}. Note that each such c occurs only once in this product, even if it is associated with more than one observation. The difficulty of performing Gibbs sampling or other updates for these hyperparameters will depend on the detailed forms of these conditional distributions, but no issues special to Dirichlet process mixture models are involved.

One subtlety does arise when algorithms employing auxiliary parameters are used. If φ values not associated with any observation are retained in the state, the conditional distribution for the hyperparameters of G0 given the rest of the state will include factors of G0(φ_c) for these φ values as well as for the φ values associated with observations. Since this will tend to slow convergence, it is desirable to discard all unused φ values, regenerating them from G0 as needed, as is done for the algorithms in this paper.
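For the update of α given k, for instance, one simple possibility is a random-walk Metropolis update on log α, using the fact (Antoniak 1974) that the likelihood of α given k distinct components among n observations is proportional to α^k Γ(α) / Γ(α + n). The Exponential(1) prior and the step size below are purely assumptions of this sketch:

```python
import math
import random

def log_post_alpha(alpha, k, n):
    # log of an assumed Exponential(1) prior times the likelihood
    # P(k | alpha, n) ∝ alpha^k * Gamma(alpha) / Gamma(alpha + n).
    return -alpha + k * math.log(alpha) + math.lgamma(alpha) - math.lgamma(alpha + n)

def update_alpha(alpha, k, n, step=0.5):
    # Random-walk Metropolis on log(alpha); the Jacobian of the log
    # transform contributes log(prop) - log(alpha) to the acceptance ratio.
    prop = alpha * math.exp(step * random.gauss(0.0, 1.0))
    log_accept = (log_post_alpha(prop, k, n) - log_post_alpha(alpha, k, n)
                  + math.log(prop) - math.log(alpha))
    if math.log(random.random()) < log_accept:
        return prop
    return alpha

random.seed(2)
alpha = 1.0
draws = []
for _ in range(5000):
    alpha = update_alpha(alpha, k=3, n=9)
    draws.append(alpha)
print(sum(draws) / len(draws))  # posterior mean of alpha given k = 3, n = 9
```

In a full sampler this update would be interleaved with the label and φ updates, with k recomputed from the current labels at each iteration.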
8 A demonstration
I tested the performance of Algorithms 4 through 8 on the following data (y_1, ..., y_9):
-1.48  -1.40  -1.16  -1.08  -1.02  +0.14  +0.51  +0.53  +0.78
A Dirichlet process mixture model was used, with the component distributions having the form F(θ) = N(θ, 0.1^2), the prior being G0 = N(0, 1), and the Dirichlet process concentration parameter being α = 1. Although G0 is in fact conjugate to F, the algorithms for non-conjugate priors were used.
                                        Time per iteration   Autocorrelation   Autocorrelation
                                         in microseconds      time for k       time for θ_1
Alg. 4 ("no gaps")                              7.6               13.7               8.5
Alg. 5 (Metropolis-Hastings, R = 4)             8.6                8.1              10.2
Alg. 6 (M-H, R = 4, no φ update)                8.3               19.4              64.1
Alg. 7 (mod M-H & partial Gibbs)                8.0                6.9               5.3
Alg. 8 (auxiliary Gibbs, m = 1)                 7.9                5.2               5.6
Alg. 8 (auxiliary Gibbs, m = 2)                 8.8                3.7               4.7
Alg. 8 (m = 30, approximates Alg. 2)           38.0                2.0               2.8
Table 1: Performance of the algorithms tested.

A state from close to the posterior distribution was found by applying 100 iterations of Algorithm 5 with R = 5. This state was then used to initialize the Markov chain for each of the algorithms, which were all run for 20000 subsequent iterations (one iteration being one application of the operations in the descriptions given earlier). The performance of each algorithm was judged by the computation time per iteration and by the "autocorrelation time" for two quantities: k, the number of distinct c_i, and θ_1, the parameter associated with y_1. The autocorrelation time for a quantity, defined as one plus twice the sum of the autocorrelations at lags one and up, is the factor by which the sample size is effectively reduced when estimating the expectation of that quantity, as compared to an estimate based on points drawn independently from the posterior distribution (see Ripley 1987, Section 6.3). It was estimated using autocorrelation estimates from the 20000 iterations.

The Metropolis-Hastings methods (Algorithms 5 and 6) were run with R, the number of updates for each c_i, set to 4. This makes the computation time per iteration approximately equal to that for the other methods tested. Gibbs sampling with auxiliary parameters (Algorithm 8) was tested with m = 1 and m = 2. It was also run with m = 30, even though this is clearly too large, because with a large value of m, this algorithm approximates the behaviour of Algorithm 2 (apart, of course, from computation time). This lets us see how much the autocorrelation times for the algorithms are increased over what is possible when the prior is conjugate.

The results are shown in Table 1. They confirm that Algorithm 8 with m = 1 is superior to the "no gaps" method. Setting m = 2 decreases autocorrelation times further, more than offsetting the slight increase in computation time per iteration.
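The autocorrelation time used above can be estimated directly from the chain output. The following naive sketch truncates the sum at a fixed maximum lag (an assumption of this sketch; more careful estimators choose the truncation adaptively):

```python
import random

def autocorrelation_time(xs, max_lag=100):
    # One plus twice the sum of the autocorrelations at lags 1 and up,
    # truncated at max_lag (see Ripley 1987, Section 6.3).
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    tau = 1.0
    for lag in range(1, max_lag + 1):
        acf = sum((xs[t] - mean) * (xs[t + lag] - mean)
                  for t in range(n - lag)) / (n * var)
        tau += 2.0 * acf
    return tau

# An i.i.d. sequence should have autocorrelation time near 1, meaning
# no effective reduction in sample size.
random.seed(3)
iid = [random.gauss(0.0, 1.0) for _ in range(20000)]
tau_iid = autocorrelation_time(iid)
print(tau_iid)  # close to 1
```

A chain with autocorrelation time τ thus yields roughly 20000/τ effectively independent points from a run of 20000 iterations, which is how the table's entries translate into efficiency comparisons.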
The simple Metropolis-Hastings method (Algorithm 5) performs about as well as the "no gaps" method. The combination of Metropolis-Hastings and partial Gibbs sampling of Algorithm 7 performs about as well as Algorithm 8 with m = 1. As expected, performance is much worse when updates for the φ are omitted, as in Algorithm 6. The results for Algorithm 8 with m = 30 show that there is a cost to using algorithms that do not rely on the prior being conjugate, but this cost is small enough to be tolerable when a non-conjugate prior is a more realistic expression of prior beliefs. Of course, the relative performance of the methods may be different from what is seen here when they are applied to more realistic problems, in which the number of observations is often
greater, and updates are also done for various hyperparameters. The methods tested here are implemented in my software for flexible Bayesian modeling (version of 1998-09-01), available from my web page, which can be used to experiment with some more complex models.
This research was supported by the Natural Sciences and Engineering Research Council of Canada.
References

Anderson, J. R. (1990) The Adaptive Character of Thought, Hillsdale, NJ: Erlbaum.

Antoniak, C. E. (1974) "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems", Annals of Statistics, vol. 2, pp. 1152-1174.

Blackwell, D. and MacQueen, J. B. (1973) "Ferguson distributions via Polya urn schemes", Annals of Statistics, vol. 1, pp. 353-355.

Bush, C. A. and MacEachern, S. N. (1996) "A semiparametric Bayesian model for randomised block designs", Biometrika, vol. 83, pp. 275-285.

Escobar, M. D. (1994) "Estimating normal means with a Dirichlet process prior", Journal of the American Statistical Association, vol. 89, pp. 268-277.

Escobar, M. D. and West, M. (1995) "Bayesian density estimation and inference using mixtures", Journal of the American Statistical Association, vol. 90, pp. 577-588.

Ferguson, T. S. (1973) "A Bayesian analysis of some nonparametric problems", Annals of Statistics, vol. 1, pp. 209-230.

Ferguson, T. S. (1983) "Bayesian density estimation by mixtures of normal distributions", in H. Rizvi and J. Rustagi (editors) Recent Advances in Statistics, pp. 287-303, New York: Academic Press.

Green, P. J. and Richardson, S. (1998) "Modelling heterogeneity with and without the Dirichlet process", draft manuscript.

Hastings, W. K. (1970) "Monte Carlo sampling methods using Markov chains and their applications", Biometrika, vol. 57, pp. 97-109.

MacEachern, S. N. (1994) "Estimating normal means with a conjugate style Dirichlet process prior", Communications in Statistics: Simulation and Computation, vol. 23, pp. 727-741.

MacEachern, S. N. and Muller, P. (1998) "Estimating mixture of Dirichlet process models", Journal of Computational and Graphical Statistics, vol. 7, pp. 223-238.

Neal, R. M. (1992) "Bayesian mixture modeling", in C. R. Smith, G. J. Erickson, and P. O. Neudorfer (editors) Maximum Entropy and Bayesian Methods: Proceedings of the 11th International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, Seattle, 1991, pp. 197-211, Dordrecht: Kluwer Academic Publishers.

Ripley, B. D. (1987) Stochastic Simulation, New York: Wiley.

Walker, S. and Damien, P. (1998) "Sampling methods for Bayesian nonparametric inference involving stochastic processes", draft manuscript.

West, M., Muller, P., and Escobar, M. D. (1994) "Hierarchical priors and mixture models, with application in regression and density estimation", in P. R. Freeman and A. F. M. Smith (editors) Aspects of Uncertainty, pp. 363-386, New York: John Wiley.