Use of Machine Learning for Automated Survey Coding
Frederic R. Clarke, Steven Brooker
Australian Bureau of Statistics, Belconnen, ACT, Australia

Survey coding involves assigning a symbolic code from a predefined set to the response provided to an open-ended survey question. The code set is derived from a classification (such as the ANZSCO standard for occupation) that typically contains a structured hierarchy of mutually exclusive concepts and associated code values. A natural language response to a survey question is assigned a code value by matching key words or phrases in the response against the concept descriptions. When this process is done by a computer with no direct human involvement, it is referred to as automated coding (or autocoding).

The Australian Bureau of Statistics (ABS) is developing an approach that treats the autocoding problem as an instance of multiclass text classification using supervised machine learning techniques. This is intended to overcome the limitations of existing ABS autocoders, in which digitised textual responses are compared verbatim with candidate key phrases in a large static index. In the machine learning approach, a software learning system constructs a model of the association between survey responses and codes using a training set of precoded examples. Each response is considered a text document, and each code a document category to which these documents can be assigned. The learned model is then a classification function that can be applied to the coding of new responses.
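To make the contrast between verbatim index lookup and a learned classification function concrete, the following is a minimal sketch, not the ABS implementation. It assumes bag-of-words features, a one-vs-rest linear SVM trained by subgradient descent on the regularised hinge loss, and a tiny invented training set. The code labels (`TEACHER`, `NURSE`, `DRIVER`), the function names and the hyperparameters are all hypothetical.

```python
from collections import Counter

# Hypothetical precoded examples; the labels are placeholders,
# not real classification codes.
TRAINING = [
    ("primary school teacher", "TEACHER"),
    ("teacher at secondary school", "TEACHER"),
    ("registered nurse", "NURSE"),
    ("nurse in hospital ward", "NURSE"),
    ("truck driver", "DRIVER"),
    ("delivery driver for courier", "DRIVER"),
]

# Crude stand-in for a static index: exact phrase lookup only.
INDEX = dict(TRAINING)


def featurize(text):
    """Bag-of-words term counts as a sparse dict."""
    return dict(Counter(text.lower().split()))


def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(t, 0.0) * v for t, v in x.items())


def train_binary_svm(examples, epochs=20, lr=0.1, lam=0.01):
    """Linear SVM (no bias term) via subgradient descent on the
    L2-regularised hinge loss. `examples` holds (features, y), y in {+1, -1}."""
    w = {}
    for _ in range(epochs):
        for x, y in examples:
            margin = y * dot(w, x)
            # Regularisation step: shrink every weight slightly.
            for t in w:
                w[t] *= 1.0 - lr * lam
            # Hinge-loss step: only examples inside the margin move the weights.
            if margin < 1.0:
                for t, v in x.items():
                    w[t] = w.get(t, 0.0) + lr * y * v
    return w


def train_one_vs_rest(labelled):
    """Train one binary SVM per code: that code versus all others."""
    feats = [(featurize(text), code) for text, code in labelled]
    codes = sorted({code for _, code in labelled})
    return {
        code: train_binary_svm([(x, 1 if c == code else -1) for x, c in feats])
        for code in codes
    }


def predict(models, text):
    """Assign the code whose classifier gives the highest score."""
    x = featurize(text)
    return max(models, key=lambda code: dot(models[code], x))


models = train_one_vs_rest(TRAINING)

response = "high school teacher"          # not present in the index
print(INDEX.get(response))                # exact lookup finds nothing
print(predict(models, response))          # learned model still assigns a code
```

The unseen word "high" contributes nothing to any score, so the decision rests on the words the model has already associated with a code. A production coder would need far more training data, a bias term or calibrated scores, and a threshold below which responses are referred to a human coder.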

This paper provides background on statistical machine learning for text classification, and outlines the work done on survey coding in the ABS with a particular class of machine learning algorithms: support vector machines (SVMs). Different techniques for text preprocessing, vector weight construction, and multilevel category representation and scoring are evaluated in terms of coder performance and accuracy. The technical challenges associated with using SVM coders for large surveys such as the Australian Census of Population are discussed.
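As an illustration of what text preprocessing and vector weight construction can involve, here is a minimal sketch under stated assumptions: lowercasing, alphabetic tokenisation, a small hypothetical stopword list, and smoothed TF-IDF weights with L2 normalisation. This is one common weighting scheme, chosen for illustration; it is not claimed to be the one evaluated in the paper, and the function names are invented.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "at", "for", "and"}


def preprocess(text):
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_vectors(responses):
    """Return one sparse, L2-normalised TF-IDF vector (dict) per response."""
    docs = [preprocess(r) for r in responses]
    n = len(docs)
    # Document frequency: number of responses containing each term.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Smoothed inverse document frequency keeps every weight positive.
        vec = {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({t: v / norm for t, v in vec.items()} if norm else vec)
    return vectors


responses = [
    "Primary school teacher",
    "High-school teacher",
    "Registered nurse at the hospital",
]
vectors = tfidf_vectors(responses)
for response, vec in zip(responses, vectors):
    print(response, sorted(vec.items()))
```

In the first response, "primary" appears in only one response and so receives a larger weight than "teacher", which appears in two. Rarer terms are treated as more discriminating between codes.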

Keywords: Machine learning; Support vector machine; Automated coding; Survey processing

Biography: Ric Clarke is the Director for Enterprise Architecture in the Australian Bureau of Statistics. Part of his role involves the investigation of strategic technologies for future ABS capabilities. Ric joined the ABS in 2008 after a 17-year career with the Department of Defence in Canberra that spanned mathematical and technology research, communications analysis, software development, strategic planning and project management. Originally a physicist, he also has post-graduate qualifications in information technology.