The pace of modern data collection means that we can begin to explore a whole new field of questions. Instead of making static predictions, we can now ask whether the data continue to bear out our predictions and, if not, in what way they are evolving.
The focus of this paper is the discovery of anomalies in multivariate time series data. In particular, we are interested in data sets that are highly multivariate and include a mixture of continuous, categorical and structured nominal variables. We are also interested in systems that display complex in-control behaviour, for example strong seasonality, which may reduce the effectiveness of other outbreak detection methods.
The method examined in this paper was initially proposed by Sparks and Okugami (InterStat, 2010) and has three main components:
1. The modelling of existing behaviour.
2. The application of exponentially weighted moving average (EWMA) smoothing to both observed and expected counts in order to build in temporal memory.
3. The growing and pruning of decision trees in order to find regions of the multivariate space where observed counts deviate strongly from expected counts. The pruning procedure is chosen to control the false alarm rate.
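The smoothing in step 2 can be sketched as follows. This is only a minimal illustration, not the implementation of Sparks and Okugami: the function names `ewma` and `smoothed_deviation`, the default weight `alpha=0.2`, the initialisation at the first value, and the use of a simple difference as the deviation measure are all assumptions made here for the example.

```python
def ewma(values, alpha=0.2):
    """Exponentially weighted moving average of a sequence.

    alpha is the weight given to the newest value (assumed default 0.2).
    The first smoothed value is initialised to the first observation,
    which is one common convention and an assumption of this sketch.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    smoothed = []
    state = None
    for value in values:
        if state is None:
            state = value
        else:
            state = alpha * value + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


def smoothed_deviation(observed, expected, alpha=0.2):
    """Difference between smoothed observed and smoothed expected counts.

    Both series are smoothed with the same weight so that they carry the
    same temporal memory. The simple difference used here is a hypothetical
    choice of deviation measure for illustration only.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    obs_smoothed = ewma(observed, alpha)
    exp_smoothed = ewma(expected, alpha)
    return [o - e for o, e in zip(obs_smoothed, exp_smoothed)]


# Daily counts for one cell of the multivariate space (made-up numbers).
observed = [10, 20, 40]
expected = [10, 10, 10]
deviations = smoothed_deviation(observed, expected, alpha=0.5)
```

A small weight gives long memory and makes the smoothed counts sensitive to small persistent shifts, while a large weight reacts faster to abrupt changes. In a scheme of the kind described above, smoothed deviations such as these would be computed for each cell of the multivariate space before any tree is grown.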
We will explore the advantages of this method and describe in detail its application to the surveillance of patient volumes at a number of Emergency Departments. While predictive tools are already being implemented to help forecast Emergency Department patient volumes, early detection of any changes would help authorities to manage limited health resources and to communicate risk effectively and in a timely fashion. The data in this example include all of the variable types listed above and present a number of modelling challenges, including strong seasonality, day-of-the-week effects and many interacting variables.
For a recent and broad review of methods for disease outbreak detection, see Unkel et al. (2010); most of the methods therein, however, are univariate. For the benchmark work in spatio-temporal surveillance, see Kulldorff's work (2001 and 2007) on scan statistics, which is nevertheless difficult to extend to higher dimensions. This paper offers a multivariate approach that controls the false alarm rate over all dimensions.
Kulldorff, M. et al. Multivariate spatial scan statistics for disease surveillance. Stat. Med., 2007;26:1824-1833.
Kulldorff, M. Prospective time-periodic geographical disease surveillance using a scan statistic. J. Roy. Statistical Society, 2001;A164:61-72.
Sparks, R.S., Okugami, C. Surveillance trees: early detection of unusually high number of vehicle crashes, InterStat, January 2010, http://interstat.statjournals.net/YEAR/2010/abstracts/1001002.php
Unkel, S. et al. A review of statistical methods for the prospective detection of infectious disease outbreaks. Technical Report. 2010. http://stats-www.open.ac.uk
Keywords: Anomaly detection; Surveillance; Health informatics; Multivariate time series
Biography: Sarah Bolt is a research scientist at CSIRO in the Division of Mathematics, Informatics and Statistics. She graduated from the Australian National University in 2007 with a Bachelor of Philosophy (Honours) degree in Computational Mathematics and joined CSIRO in 2010 in the Graduate Fellowship program. Sarah is interested in the statistical analysis and computational challenges of large data sets and is currently involved in the area of spatio-temporal modelling.