A Simple, Versatile, Data-adaptive Approach for Alerting Based on Temporal Biosurveillance Data
Howard S. Burkom, (The Johns Hopkins University Applied Physics Laboratory), Howard.Burkom@jhuapl.edu, and
Sean Patrick Murphy, (The Johns Hopkins University Applied Physics Laboratory), Sean.Murphy@jhuapl.edu
This effort describes a simple yet versatile method for automated data classification that addresses the problem of selecting appropriate alerting algorithms for biosurveillance data based on limited data history. This method is applicable to the univariate time series that result from syndromic classification of clinical records and also from nonclinical data such as filtered counts of over-the-counter remedy sales. Intended beneficiaries are local public health monitors using their own data streams as well as large system developers managing many disparate data types.
Numerous recent papers have presented and evaluated algorithms for biosurveillance-related anomaly detection. However, authors of these papers are rarely able to share their datasets and can often publish only limited information describing them, so these papers do little to help a health monitor decide whether a published method will work well on the data at hand. Accentuating this problem, health monitors at 2005-2006 conferences and workshops related to automated health surveillance repeatedly expressed the need for modifiable case definitions and syndromic filters, which yield time series whose behavior cannot be modeled in advance. Impromptu case definition changes may alter both the scale of a series and its cyclic or seasonal behavior. We demonstrate that mismatched algorithms and data can result in significant, systematic loss of sensitivity at practical false alarm rates. Therefore, the automated selection of suitable alerting methods is necessary.
Published alerting methods intended to control for trends and other systematic data behavior have used various regression-based models and other approaches such as wavelets and LMS filters [1-3]. The success of these methods depends on the presence of the day-of-week effects, annual trends, or other features that they are designed to model. A common approach to removing such expected features is the Phase I/Phase II paradigm of the statistical process control community. Control chart parameters, model features such as regression or filter coefficients, and sometimes alerting thresholds are calculated from a set of historic baseline data assumed to be representative of the data to be monitored. The inferred quantities are then applied for prospective surveillance. Because many time series monitored for biosurveillance are nonstationary, baseline-inferred parameters and thresholds may produce unexpected and uneven detection performance. Autoregressive methods and adaptive regression models and filters have been applied, as in [4, 5], to address this obstacle. Our approach uses prediction by generalized exponential smoothing, which we implement as a form of Holt-Winters (H-W) forecasting [6]. A comparison of this approach to nonadaptive and adaptive regression models yielded favorable results in [7] on multiple time series of two common types. The current effort discusses the automated selection of smoothing coefficients for H-W forecasting and the application of the H-W residuals in control charts. Smoothing coefficients yielding reliable daily forecasts may be obtained from a fairly small representative data sample (as little as 2 months). We compute simple discriminants based on the scale, variability, overall trending, and day-of-week effects in the sample data to select from limited combinations of smoothing coefficients. The selection process may be wholly or partially overridden by user specifications based on knowledge of similar data.
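As a rough illustration of how H-W residuals can feed a control chart, the following minimal Python sketch computes one-step-ahead additive Holt-Winters forecasts with a 7-day seasonal cycle and scales each residual by the variability of recent residuals. It is a textbook-style sketch under stated assumptions, not the authors' implementation: the function names, the default coefficients (alpha, beta, gamma), the 28-day residual baseline, the two-week initialization, and the 0.5 standard deviation floor are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def holt_winters_forecasts(y, alpha=0.4, beta=0.0, gamma=0.15, period=7):
    """One-step-ahead additive Holt-Winters forecasts.

    forecasts[t] is the prediction of y[t] made from y[:t]. The first
    2 * period entries are None because those days initialize the model.
    The coefficient defaults are illustrative assumptions only.
    """
    if len(y) < 2 * period + 1:
        raise ValueError("need more than two full seasonal cycles")
    first, second = y[:period], y[period:2 * period]
    level = mean(second)
    trend = (mean(second) - mean(first)) / period
    # Initial day-of-week offsets: average deviation from each cycle's mean.
    season = [
        ((first[i] - mean(first)) + (second[i] - mean(second))) / 2
        for i in range(period)
    ]
    forecasts = [None] * (2 * period)
    for t in range(2 * period, len(y)):
        s = season[t % period]
        forecasts.append(level + trend + s)
        # Update level, trend, and seasonal offset with the new observation.
        new_level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % period] = gamma * (y[t] - new_level) + (1 - gamma) * s
        level = new_level
    return forecasts


def residual_zscores(y, forecasts, baseline=28, min_sd=0.5):
    """Scale each forecast residual by the std of the preceding residuals.

    min_sd is an assumed floor that avoids division by a near-zero
    standard deviation when the series is sparse or nearly constant.
    """
    resid = [None if f is None else obs - f for obs, f in zip(y, forecasts)]
    scores = [None] * len(y)
    for t in range(len(y)):
        window = [r for r in resid[max(0, t - baseline):t] if r is not None]
        if resid[t] is None or len(window) < baseline:
            continue  # not enough history to scale this residual
        scores[t] = resid[t] / max(pstdev(window), min_sd)
    return scores
```

A day would be flagged when its score exceeds a chosen threshold. In this sketch the coefficients are fixed by hand; the data-driven selection of coefficients described above is not reproduced here.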
The principal advantages of H-W forecasting are its capability to adapt to short-term trends without substantial model-fitting and its stability with respect to the selected smoothing coefficients. Additionally, this approach does not involve convergence problems, can be implemented in a spreadsheet, and can handle both rich and sparse data streams. Our particular H-W adaptations account for day-of-week and holiday effects, avoid numerical problems resulting from ongoing or temporary (due to data dropouts) sparseness, and avoid inappropriate training on unexpected outliers. We evaluate the derived control charts and compare them to other alerting algorithms using receiver operating characteristic (ROC) curves based on realistic outbreak signals added to authentic data.
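The ROC-style evaluation mentioned above can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: the function name roc_points, its inputs, and the rule that an outbreak counts as detected when any of its days exceeds the threshold are assumptions made for this example.

```python
def roc_points(background_scores, outbreak_runs, thresholds):
    """Return (threshold, false_alarm_rate, detection_probability) tuples.

    background_scores: daily alerting scores on data with no added signal.
    outbreak_runs: one list of daily scores per simulated outbreak; a run
    counts as detected if any of its scores exceeds the threshold (an
    assumed detection rule).
    """
    points = []
    for h in thresholds:
        false_alarm_rate = sum(s > h for s in background_scores) / len(background_scores)
        detected = sum(max(run) > h for run in outbreak_runs)
        points.append((h, false_alarm_rate, detected / len(outbreak_runs)))
    return points
```

Sweeping the threshold over a range of values and plotting detection probability against false alarm rate gives one curve per alerting algorithm, so that algorithms can be compared at the same false alarm rate.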
(1) Brillman JC, Burr T, Forslund D, Joyce E, Picard R and Umland E. Modeling emergency department visit patterns for infectious disease complaints: results and application to disease surveillance, BMC Medical Informatics and Decision Making 2005, 5:4, pp 1-14 http://www.biomedcentral.com/content/pdf/1472-6947-5-4.pdf
(2) Goldenberg A, Shmueli G, et al, Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales, Proc. Natl. Acad. Sci. USA, Vol. 99, Issue 8, 5237-5240, April 16, 2002
(3) Najmi AH, Magruder SF. An adaptive prediction and detection algorithm for multistream syndromic surveillance. BMC Med Inform Decis Mak. 2005 Oct 12; 5:33.
(4) Reis BY, Mandl KD. Time series modeling for syndromic surveillance. BMC Medical Informatics and Decision Making 2003, 3:2
(5) Burkom, H.S., Development, Adaptation, and Assessment of Alerting Algorithms for Biosurveillance, Johns Hopkins APL Technical Digest 24, 4: 335-342.
(6) Chatfield C. The Holt-Winters Forecasting Procedure. Applied Statistics 1978; 27: 264-279
(7) Burkom, H., Murphy, S.P., and Shmueli G, Automated Time Series Forecasting for Biosurveillance, accepted for 2007 publication in Statistics in Medicine.