Dimensionality reduction by clustering of variables while setting aside atypical variables


Clustering of variables is one approach to reducing the dimensionality of a dataset. However, every variable is usually assigned to one of the clusters, including scattered variables that carry atypical information or noise. Such variables can obscure the interpretation of the latent variables associated with the clusters, or even give rise to artificial clusters. We propose two strategies to address this problem. The first is a "K+1" strategy, which introduces an additional group of variables, called the "noise cluster" for simplicity. The second is based on the definition of sparse latent variables. Both strategies yield refined clusters from which more relevant latent variables can be identified.
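
As a rough illustration of the "K+1" idea, the Python sketch below assigns each variable to the latent variable it correlates with most strongly, and sets it aside in a noise cluster when even that correlation is weak. The threshold `rho`, the use of absolute correlation, the label `-1` for the noise cluster, and the function name `assign_with_noise_cluster` are assumptions made for this example only. The abstract does not state the actual assignment criterion, which may differ.

```python
import numpy as np


def assign_with_noise_cluster(X, latent, rho=0.3):
    """Assign each column of X to one of K latent variables or to a noise cluster.

    X      : (n, p) array, one column per variable.
    latent : (n, K) array, one column per cluster latent variable.
    rho    : hypothetical threshold on absolute correlation (an assumption).

    Returns a length-p integer array: the index of the best-matching latent
    variable, or -1 (the "noise cluster") when no absolute correlation
    reaches rho.
    """
    X = np.asarray(X, dtype=float)
    latent = np.asarray(latent, dtype=float)

    # Standardize columns so that cross-products become correlations.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ls = (latent - latent.mean(axis=0)) / latent.std(axis=0)

    # (p, K) matrix of correlations between variables and latent variables.
    corr = Xs.T @ Ls / X.shape[0]
    abs_corr = np.abs(corr)

    best = abs_corr.argmax(axis=1)      # closest latent variable per variable
    strength = abs_corr.max(axis=1)     # how close it actually is

    # Variables that are weakly related to every latent variable are set aside.
    return np.where(strength >= rho, best, -1)
```

In a full clustering procedure, a step of this kind would presumably alternate with re-estimating the latent variables from the variables kept in each cluster. The sketch covers the assignment step only.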

DOI Code: 10.1285/i20705948v9n1p134

Keywords: dimensionality reduction; clustering of variables; noise cluster; sparse latent variables




This work is licensed under a Creative Commons Attribution - NonCommercial - NoDerivs 3.0 Italy License.