Bayesian spatial scan statistic adjusted for overdispersion and spatial correlation.
Deepak Agrawal, (Yahoo! Research), email@example.com
The spatial scan statistic has become the method of choice for detecting spatial clustering after adjusting for inhomogeneity. It is particularly suitable for applications where the goal is to find the actual location of spatial clusters or ``hotspots'', as opposed to testing for global clustering. The method has been extremely successful and has found applications in areas as diverse as biosurveillance, forestry, criminology and psychology. It proceeds by scanning the study region using all possible spatial sub-regions that conform to some geometric shape (e.g., circle, rectangle, ellipsoid). Each sub-region is assigned a discrepancy measure based on a likelihood ratio test that compares the intensity inside the sub-region with the intensity outside. The sub-region with the maximum discrepancy is generally declared to be a ``hotspot'', provided it is statistically significant. The significance test relies on an expensive randomization procedure which computes a Monte Carlo p-value by repeatedly (approximately 10,000 times) generating realizations under the null hypothesis of no spatial clustering.
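As a rough illustration of the classical procedure described above, the following Python sketch scans all axis-aligned rectangles on a small grid, scores each with a Poisson log likelihood ratio, and computes a Monte Carlo p-value for the maximum. The grid layout, the function names (`poisson_llr`, `max_scan`, `monte_carlo_p_value`), the toy data, and the small number of replicates are assumptions made for this example only; it is not the implementation used in the paper.

```python
import numpy as np

def poisson_llr(c_in, b_in, c_tot, b_tot):
    """Poisson log likelihood ratio for one sub-region (inside vs. outside rate).

    c_* are observed counts, b_* are baselines (e.g. population at risk).
    Returns 0 unless the estimated rate inside exceeds the rate outside.
    """
    c_out, b_out = c_tot - c_in, b_tot - b_in
    if b_out <= 0 or c_in * b_out <= c_out * b_in:
        return 0.0
    llr = c_in * np.log(c_in / b_in) - c_tot * np.log(c_tot / b_tot)
    if c_out > 0:
        llr += c_out * np.log(c_out / b_out)
    return float(llr)

def max_scan(counts, baseline):
    """Brute-force scan over all axis-aligned rectangles of the grid."""
    n_rows, n_cols = counts.shape
    c_tot, b_tot = counts.sum(), baseline.sum()
    best, best_rect = 0.0, None
    for r0 in range(n_rows):
        for r1 in range(r0 + 1, n_rows + 1):
            for c0 in range(n_cols):
                for c1 in range(c0 + 1, n_cols + 1):
                    llr = poisson_llr(counts[r0:r1, c0:c1].sum(),
                                      baseline[r0:r1, c0:c1].sum(),
                                      c_tot, b_tot)
                    if llr > best:
                        best, best_rect = llr, (r0, r1, c0, c1)
    return best, best_rect

def monte_carlo_p_value(counts, baseline, n_sim=999, seed=0):
    """Monte Carlo p-value of the maximum LLR under the no-clustering null.

    Null replicates keep the total count fixed and spread it over cells in
    proportion to the baseline.
    """
    rng = np.random.default_rng(seed)
    observed, rect = max_scan(counts, baseline)
    probs = (baseline / baseline.sum()).ravel()
    total = int(counts.sum())
    exceed = 0
    for _ in range(n_sim):
        sim = rng.multinomial(total, probs).reshape(counts.shape)
        if max_scan(sim, baseline)[0] >= observed:
            exceed += 1
    return observed, rect, (exceed + 1) / (n_sim + 1)

# Toy data (hypothetical): flat baseline with a planted 2x2 cluster.
baseline = np.ones((5, 5))
counts = np.full((5, 5), 10)
counts[1:3, 1:3] = 40
observed, rect, p_value = monte_carlo_p_value(counts, baseline, n_sim=99)
print(rect, round(observed, 2), p_value)
```

Even on this tiny grid the randomization step repeats the full scan once per replicate, which is the cost the Bayesian approach aims to avoid.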
In this paper, we propose a Bayesian solution to the problem. A Bayesian solution has several advantages in this scenario. First, hotspot detection is based on the posterior probabilities of the models corresponding to each sub-region, so there is no need to conduct the randomization procedure. This gain in computational efficiency comes at the cost of a slightly more expensive discrepancy calculation for each sub-region, in which a simple closed-form likelihood maximization is replaced by a numerical integration routine. Second, whereas the classical approach generally detects multiple hotspots using a conservative test, detecting multiple hotspots in the Bayesian framework is automatic and requires no additional machinery. Finally, the Bayesian setting provides a natural framework for incorporating any prior knowledge available about the hotspots. To the best of our knowledge, no rigorous work in a Bayesian framework exists in the statistics literature. Recently, a Bayesian solution to the problem was proposed by Neill et al. (NIPS 2005) in the machine learning literature. However, their solution makes strong assumptions about the priors of the sub-regions. Moreover, their framework cannot adjust for additional characteristics such as overdispersion and spatial correlation. Such adjustments are potentially useful in biosurveillance, where the analyst might not be interested in investigating clusters that arise solely from routine overdispersion relative to the usual Poisson or Bernoulli model. Adjusting for such routine characteristics in the baseline model can potentially reduce false positives and enhance the disease monitoring systems used in public health.
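To make the first advantage concrete, the sketch below shows how posterior model probabilities follow from Bayes factors once a prior over models is fixed, with no randomization step. The function name `posterior_model_probs`, the prior mass `prior_null` on the no-hotspot model, and the uniform prior over sub-regions are assumptions chosen for illustration; the abstract does not specify the priors actually used.

```python
import numpy as np

def posterior_model_probs(log_bf, prior_null=0.5):
    """Posterior probabilities of the null model and each sub-region model.

    log_bf[k] is the log Bayes factor of sub-region model k against the
    null model. Assumed prior: mass prior_null on the null, the remainder
    split uniformly over the sub-region models.
    Returns an array whose entry 0 is the null and entries 1.. the sub-regions.
    """
    log_bf = np.asarray(log_bf, dtype=float)
    k = log_bf.size
    log_prior = np.concatenate(([np.log(prior_null)],
                                np.full(k, np.log((1.0 - prior_null) / k))))
    # The Bayes factor of the null against itself is 1, i.e. log BF = 0.
    log_post = log_prior + np.concatenate(([0.0], log_bf))
    log_post -= log_post.max()          # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical log Bayes factors for three candidate sub-regions.
probs = posterior_model_probs([0.2, 6.0, 5.5])
print(probs.round(3))
```

Because every sub-region receives its own posterior probability, several sub-regions with high probability can be reported together, which is one way to read the remark that multiple hotspots are handled automatically.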
Our contributions are in two directions. First, we propose a modeling framework using a point process formulation. We propose the use of a Cox process to relax the usual assumptions of a Poisson process. The Cox process assumes that, conditional on a latent error process, the data come from a Poisson process. Marginalizing over the latent process makes it possible to adjust for features such as overdispersion that might be present in the data. For instance, assuming a gamma distribution for the latent process gives rise to the negative binomial distribution, which has been widely used to model overdispersion in count data. Other possibilities include a conditionally autoregressive (CAR) process, which is widely used to model spatial correlation in epidemiology. Second, we provide a Bayesian solution to the problem in our proposed framework. Our solution does not depend on eliciting data-based priors for each sub-region as in Neill et al. (NIPS 2005). The main computational bottleneck in the Bayesian procedure is the computation of a Bayes factor for each sub-region. For the usual Poisson model proposed by Kulldorff (1997), this reduces to computing a 2-dimensional integral for each sub-region, which is done efficiently and accurately using a Laplace approximation. For a negative binomial model, the same strategy works, with the 2-dimensional integral replaced by a 3-dimensional one. For models such as CAR that are multivariate in nature, one needs to compute a high-dimensional integral. The Laplace approximation does not provide accurate answers in this scenario, and one must resort to computationally intensive procedures such as MCMC. However, the computations are amenable to parallel computing and could be performed efficiently in a cluster computing environment for reasonably sized datasets. We illustrate the efficacy of our procedure on datasets that have been previously analyzed in the literature.
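The statement above that gamma mixing of a Poisson yields a negative binomial can be checked numerically. The following sketch draws a gamma-distributed latent rate for each observation, draws a Poisson count given that rate, and compares the empirical distribution with the negative binomial probability mass function. The parameterization by a mean `mu` and shape `k`, and the helper names `nb_pmf` and `simulate_gamma_poisson`, are assumptions for this example rather than notation from the paper.

```python
import math
import numpy as np

def nb_pmf(y, mu, k):
    """Negative binomial pmf with mean mu and shape k (variance mu + mu**2 / k)."""
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))
    return math.exp(log_p)

def simulate_gamma_poisson(mu, k, size, seed=0):
    """Counts that are Poisson given a latent Gamma(shape=k, scale=mu/k) rate."""
    rng = np.random.default_rng(seed)
    lam = rng.gamma(shape=k, scale=mu / k, size=size)
    return rng.poisson(lam)

mu, k = 4.0, 2.0
y = simulate_gamma_poisson(mu, k, size=200_000)

# Empirical frequencies of 0..9 versus the negative binomial pmf.
empirical = np.bincount(y, minlength=10)[:10] / y.size
exact = np.array([nb_pmf(v, mu, k) for v in range(10)])
max_gap = float(np.abs(empirical - exact).max())

# The variance exceeds the mean, which is the overdispersion being modelled.
print(y.mean(), y.var(), mu + mu**2 / k, max_gap)
```

With these settings the marginal variance is `mu + mu**2 / k`, larger than the Poisson variance `mu`; smaller values of `k` correspond to stronger overdispersion.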