In classical data analysis, each individual takes a single “value” on each descriptive variable. Symbolic Data Analysis generalizes this framework by allowing each individual to take a finite set, an interval, or a distribution on each variable. Interval-valued variables are a particular case of histogram-valued variables in which each histogram consists of a single interval with probability one.
Concepts and methods of classical statistics have been adapted to these types of variables; methods for histogram-valued variables are frequently generalizations of their counterparts for interval-valued variables.
The first linear regression model for histogram-valued variables was a generalization of the first model proposed for interval-valued variables by Billard and Diday (2007); for the latter, other models have been proposed by Lima Neto and de Carvalho (2010). However, those models present some limitations: first, they are based on differences between real values and do not appropriately quantify the closeness between intervals; second, the elements estimated by the models may fail to build an interval; third, the most recent model imposes positivity constraints on the coefficients, thereby forcing a positive linear relationship. These limitations prevent a generalization of these models to histogram-valued variables; alternative models should therefore be developed. Our goal is to propose a linear regression model for histogram-valued variables that allows histograms to be estimated from other histograms without forcing a positive linear relationship.
Using the representation of histograms as inverse cumulative distribution functions, or quantile functions (Irpino and Verde (2006)), we propose a linear regression model for histogram-valued variables that includes both the quantile functions of the histogram-valued variables and the quantile functions of the respective symmetric histogram-valued variables. Estimation of the model requires solving a quadratic optimization problem subject to non-negativity constraints on the unknowns; the error measure uses the Mallows distance. As in classical analysis, the model is associated with a goodness-of-fit measure.
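The ingredients named above can be illustrated with a minimal Python sketch. It is not the estimation procedure itself, and it rests on assumptions not stated in this abstract: values are taken to be uniformly distributed within each subinterval of a histogram, the quantile function of the symmetric histogram is taken to be -Psi(1 - t), and the squared Mallows distance is approximated by a midpoint rule. The function names and the data are hypothetical.

```python
def quantile(bounds, probs, t):
    """Quantile function of a histogram at level t in [0, 1].

    bounds: increasing subinterval limits, len(probs) + 1 values.
    probs:  probability of each subinterval, summing to one.
    Assumption: values are uniformly distributed within each subinterval.
    """
    cum = 0.0
    for i, p in enumerate(probs):
        if p > 0 and t <= cum + p:
            frac = (t - cum) / p  # relative position inside subinterval i
            return bounds[i] + frac * (bounds[i + 1] - bounds[i])
        cum += p
    return bounds[-1]  # guards against rounding when t is close to 1


def symmetric_quantile(bounds, probs, t):
    """Quantile function of the symmetric histogram (assumed: -Psi(1 - t))."""
    return -quantile(bounds, probs, 1.0 - t)


def mallows_sq(h1, h2, n=1000):
    """Squared Mallows distance between two histograms (midpoint rule)."""
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        diff = quantile(*h1, t) - quantile(*h2, t)
        total += diff * diff
    return total / n


# Hypothetical data: two intervals written as one-bin histograms.
x = ([1.0, 3.0], [1.0])
y = ([2.0, 6.0], [1.0])
print(quantile(*x, 0.5))   # midpoint of [1, 3]
print(mallows_sq(x, y))    # approximately 13/3
```

Multiplying a quantile function by a negative coefficient would give a non-increasing function, which is not a quantile function. The symmetric term remains non-decreasing, so a non-negative combination of both kinds of terms stays a valid quantile function while still being able to express an inverse relationship; this is presumably why the constraints fall on the coefficients rather than on the direction of the relationship.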
Simulation results obtained with the new linear regression model are presented and discussed for the particular case of interval-valued variables.
Billard, L., Diday, E. (2007): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.
Bock, H.-H., Diday, E. (2000): Analysis of Symbolic Data. Springer, Berlin-Heidelberg.
Irpino, A., Verde, R. (2006): A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In: Classification, Data Science and Classification, Proc. Sixth Conference of the International Federation of Classification Societies, Springer, Berlin, 185-192.
Lima Neto, E.A., de Carvalho, F.A.T. (2010): Constrained linear regression models for symbolic interval-valued variables. Computational Statistics and Data Analysis, 54, 333-347.
Keywords: Symbolic Data Analysis; Linear Regression; Histogram-valued variables; Imprecise data
Biography: Sónia Dias holds a BSc (2000) and an MSc (2005) in Mathematics from the University of Minho, Portugal. She is currently in the 2nd year of her PhD in Applied Mathematics at the University of Porto, Portugal. Her PhD aims to develop novel models and data analysis methods for histogram-valued variables. This work falls within the framework of Symbolic Data Analysis, a recent and rapidly developing discipline that aims to extend classical data analysis to more complex data.
Sónia Dias has been a lecturer at the Instituto Politécnico de Viana do Castelo for the last 10 years, teaching mathematics in Engineering courses.