Avoiding Overfit by Restricted Model Search in Tree-Based EEG Classification
Jan Klaschka
Dept. of Nonlinear Modelling, Inst. of Comp. Science, Academy of Sciences, Prague, Czech Republic

This work follows up on previous studies [2, 3], carried out within a broader project aimed at preventing drivers' microsleeps and the traffic accidents resulting from them. In the papers cited, electroencephalography (EEG) frequency spectra of a group of experimental subjects (persons) were analyzed in order to find sufficiently accurate classifiers discriminating somnolence (sleepiness) from other brain states. The classifiers considered were complex models whose building blocks were classification forests constructed by the Random Forests method (see [1]). Since EEG signals are highly individual, a separate classifier must be tailored to each subject. Studies [2, 3] have shown, however, that a model (classification forest) based exclusively on the subject's own data may be improved when combined (through a weighted average of votes for classes) with a model derived from the data of a well-selected subset S of other subjects. In [3], different strategies for searching for a suitable set S were compared experimentally, and a “winning” strategy was chosen.
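The combination of two forests through a weighted average of class votes can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the implementation used in [2, 3]: the function name combine_votes, the mixing parameter weight, and the toy vote fractions are all hypothetical, and each forest is assumed to report, for every sample, the fraction of its trees voting for each class.

```python
import numpy as np


def combine_votes(own_votes, pooled_votes, weight):
    """Weighted average of per-class vote fractions from two forests.

    own_votes, pooled_votes: arrays of shape (n_samples, n_classes) holding
        the fraction of trees that voted for each class (assumed input format).
    weight: hypothetical mixing parameter in [0, 1] given to the forest
        built from the subject's own data; the forest built from the
        subset S of other subjects receives 1 - weight.
    Returns the combined vote fractions and the predicted class indices.
    """
    own_votes = np.asarray(own_votes, dtype=float)
    pooled_votes = np.asarray(pooled_votes, dtype=float)
    if own_votes.shape != pooled_votes.shape:
        raise ValueError("vote arrays must have the same shape")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    combined = weight * own_votes + (1.0 - weight) * pooled_votes
    return combined, combined.argmax(axis=1)


# Toy vote fractions for two samples and two classes (made-up numbers).
own = np.array([[0.6, 0.4], [0.45, 0.55]])
pooled = np.array([[0.3, 0.7], [0.2, 0.8]])

combined, labels = combine_votes(own, pooled, weight=0.5)
own_only, own_labels = combine_votes(own, pooled, weight=1.0)
```

With weight=1.0 the result reduces to the subject's own forest, so the weight controls how much the data of other subjects may change the classification of a given sample.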

The starting point of the present study was a strange and undesirable behavior of the best models from [3], observed when the size of the forests (number of trees) was varied (originally, the default of 500 trees per forest was used): some of the models deteriorated (i.e., the jackknife estimates of the true misclassification errors increased) with growing forest size. Such phenomena often result from overfitting due to an overly extensive model search, but the tendency to overfit is not, as a rule, a property of the components of the models, i.e., of the forests themselves [1]. This gave rise to the hypothesis that combining bigger forests is more prone to overfitting than combining smaller ones. If so, the bigger the forests being combined, the more the model search should be restricted (i.e., the smaller the number of candidate models should be kept). The hypothesis is supported by a computational experiment whose results will be reported: after restrictions were applied to the size of set S (and also to the values of some numerical parameters of the models), the trend of model deterioration with growing forest size vanished or, at least, was considerably attenuated.
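To give a feeling for how strongly a bound on the size of S limits the number of candidate models, the following sketch counts the candidates in an exhaustive search. The exhaustive enumeration, the function name count_candidates, and the numbers used (10 other subjects, 11 candidate weight values) are assumptions made for illustration only; the search strategies actually compared in [3] are not reproduced here.

```python
from math import comb


def count_candidates(n_other_subjects, max_subset_size, n_weight_values):
    """Count candidate models in a hypothetical exhaustive search.

    A candidate is a non-empty subset S of the other subjects with at most
    max_subset_size members, paired with one of n_weight_values settings
    of a numerical parameter (e.g. the weight of the subject's own forest).
    """
    n_subsets = sum(
        comb(n_other_subjects, size)
        for size in range(1, max_subset_size + 1)
    )
    return n_subsets * n_weight_values


# Unrestricted search: any non-empty subset of 10 other subjects.
unrestricted = count_candidates(10, 10, 11)  # (2**10 - 1) * 11 = 11253

# Restricted search: at most 2 subjects in S, same weight grid.
restricted = count_candidates(10, 2, 11)     # (10 + 45) * 11 = 605

# Restricting the parameter values as well: only 3 weight values.
more_restricted = count_candidates(10, 2, 3)  # 55 * 3 = 165
```

The fewer candidates are compared on the same error estimates, the less room there is for a model to be selected because of a chance fluctuation of those estimates.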

References:

[1] Breiman, L. (2001) Random Forests. Machine Learning 45, 5-32.

[2] Klaschka, J. (2007) Combining Individual and Global Tree-Based Models in EEG Classification. Bulletin of the International Statistical Institute 56th Session. Proceedings, 1-4, International Statistical Institute, Lisboa.

[3] Klaschka, J. (2008) Classification of Heterogeneous EEG Data by Combining Random Forests. In: Mizuta, M., Nakano, J. (eds.), Proceedings of IASC 2008, 888-896, Japanese Society of Computational Statistics, Tokyo, ISBN 978-4-9904445-1-8.

Keywords: model search; electroencephalography; classification trees and forests; random forests

Biography: Jan Klaschka, born 1954, is a researcher affiliated with the Institute of Computer Science, Academy of Sciences of the Czech Republic, and a part-time lecturer in biostatistics at the First Faculty of Medicine, Charles University in Prague. His research interests include biomedical statistics, computational statistics, and classification trees and forests. He is a member of the International Association for Statistical Computing and a past member of the Board of Directors of its European Regional Section (elected for the period 2004–2008).