{"title": "Visualizing Group Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 452, "page_last": 458, "abstract": null, "full_text": "Visualizing Group Structure* \n\nMarcus Held, Jan Puzicha, and Joachim M. Buhmann \n\nemail: {heldjanjb}.cs.uni-bonn.de. \n\nVVVVVV: http://vvv-dbv.cs.uni-bonn.de \n\nInstitut fur Informatik III, \n\nRomerstraBe 164, D-53117 Bonn, Germany \n\nAbstract \n\nCluster analysis is a fundamental principle in exploratory data \nanalysis, providing the user with a description of the group struc(cid:173)\nture of given data. A key problem in this context is the interpreta(cid:173)\ntion and visualization of clustering solutions in high- dimensional \nor abstract data spaces. In particular, probabilistic descriptions \nof the group structure, essential to capture inter-cluster relation(cid:173)\nships, are hardly assessable by simple inspection ofthe probabilistic \nassignment variables. VVe present a novel approach to the visual(cid:173)\nization of group structure. It is based on a statistical model of the \nobject assignments which have been observed or estimated by a \nprobabilistic clustering procedure. The objects or data points are \nembedded in a low dimensional Euclidean space by approximating \nthe observed data statistics with a Gaussian mixture model. The \nalgorithm provides a new approach to the visualization of the inher(cid:173)\nent structure for a broad variety of data types, e.g. histogram data, \nproximity data and co-occurrence data. To demonstrate the power \nof the approach, histograms of textured images are visualized as an \nexample of a large-scale data mining application. \n\n1 \n\nIntroduction \n\nClustering and visualization are key issues in exploratory data analysis and are \nfundamental principles of many unsupervised learning schemes. For a given data \nset, the aim of any clustering approach is to extract a description of the inherent \ngroup structure. 
*This work has been supported by the German Research Foundation (DFG) under grant #BU 914/3-1, by the German Israel Foundation for Science and Research Development (GIF) under grant #1-0403-001.06/95, and by the Federal Ministry for Education, Science and Technology (BMBF #01 M 3021 A/4). \n\nThe object space is partitioned into groups such that each partition is as homogeneous as possible and any two partitions are maximally heterogeneous. For several reasons it is useful to deal with probabilistic partitioning approaches: \n\n1. The data generation process itself might be stochastic, resulting in overlapping partitions. Thus, a probabilistic group description is adequate and provides additional information about the inter-cluster relations. \n\n2. The number of clusters might be chosen too large. Forcing the algorithm to a hard clustering solution then creates artificial structure not supported by the data. On the other hand, superfluous clusters can be identified by a probabilistic group description. \n\n3. There exists theoretical and empirical evidence that probabilistic assignments avoid over-fitting phenomena [7]. \n\nSeveral well-known clustering schemes result in fuzzy cluster assignments: for the most common type of vector-valued data, heuristic fuzzy clustering methods were suggested [4, 5]. In a more principled way, deterministic annealing algorithms provide fuzzy clustering solutions for a given cost function with a rigorous statistical foundation and have been developed for vectorial [9], proximity [6] and histogram data [8]. In mixture model approaches the assignments of objects to groups are interpreted as missing data. Their conditional expectations given the data and the estimated cluster parameters are computed during the E-step of the corresponding EM algorithm and can be understood as assignment probabilities. 
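Such E-step assignment probabilities can be sketched in a few lines of NumPy. This is a minimal toy illustration of how fuzzy assignments arise — isotropic Gaussians with uniform cluster priors; the function name and the sample data are our own construction, not one of the cited algorithms:

```python
import numpy as np

def e_step_posteriors(X, means, var):
    """E-step of an isotropic Gaussian mixture with uniform cluster priors:
    returns q[i, v] = P(object i was generated by cluster v | x_i)."""
    # squared Euclidean distance of every point to every cluster mean
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_q = -d2 / (2.0 * var)
    log_q -= log_q.max(axis=1, keepdims=True)   # stabilize the softmax
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)     # each row sums to one

# two well-separated toy clusters in the plane
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])
q = e_step_posteriors(X, means=np.array([[0.0, 0.0], [5.0, 5.0]]), var=1.0)
```

The rows of q are exactly the kind of fuzzy assignments q_iv that the visualization framework below takes as its input.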
\n\nThe aim of this contribution is to develop a generic framework to visualize such probabilities as distances in a low-dimensional Euclidean space. Especially in high-dimensional or abstract object spaces, the interpretation of fuzzy group structure is rather difficult, as humans do not perform very well in interpreting probabilities. It is, therefore, a key issue to make an interpretation of the cluster structure more feasible. In contrast to multidimensional scaling (MDS), where objects are embedded in low-dimensional Euclidean spaces by preserving the original inter-object distances [3], our approach yields a mixture model in low dimensions, where the probabilities for assigning objects to clusters are maximally preserved. The proposed approach is similar in spirit to data visualization methods like projection pursuit clustering, GTM [1], simultaneous clustering and embedding [6], and hierarchical latent variable models [2]. It also aims at visualizing high-dimensional data. But while the other methods try to model the data itself by a low-dimensional generator model, we seek to model the inferred probabilistic grouping structure. As a consequence, the framework is generic in the sense that it is applicable to any probabilistic or fuzzy group description. \n\nThe key idea is to interpret a given probabilistic group description as an observation of an underlying random process. We estimate a low-dimensional statistical model by maximum likelihood inference, which provides the visualization. To our knowledge the proposed algorithm provides the first solution to the visualization of distributional data, where the observations of an object consist of a histogram of measured features. Such data is common in data mining applications like image retrieval, where image similarity is often based on histograms of color or texture features. 
Moreover, our method is applicable to proximity and co-occurrence data. \n\n2 Visualizing Probabilistic Group Structure \n\nLet a set of N (abstract) objects O = {o_1, ..., o_N} be given which have been partitioned into K groups or clusters. Let the fuzzy assignment of object o_i to cluster C_v be given by q_iv ∈ [0,1], where we assume Σ_{v=1}^K q_iv = 1 to enable a probabilistic interpretation. We assume that there exists an underlying \"true\" assignment of objects to clusters, which we encode by Boolean variables M_iv denoting whether object o_i belongs to (has been generated by) cluster C_v. We thus interpret q_iv as an empirical estimate of the probability P(M_iv = 1). For notational simplicity, we summarize the assignment variables in matrices Q = (q_iv) and M = (M_iv). \n\nThe key idea for visualizing group structure is to exploit a low-dimensional statistical model which \"explains\" the observed q_iv. The parameters are estimated by maximum likelihood inference and provide a natural data visualization. Gaussian mixture models in low dimensions (typically d = 2 or d = 3) are often appropriate, but the scheme could easily be extended to other classes, e.g. hierarchical models. To define the Gaussian mixture model, we first introduce a set of prototypes Y = {y_1, ..., y_K} ⊂ R^d representing the K clusters, and a set of vector-valued object parameters X = {x_1, ..., x_N} ⊂ R^d. To model the assignment probabilities, the prototypes Y and the data points X are chosen such that the resulting assignment probabilities are maximally similar to the given frequencies Q. For the Gaussian mixture model we have \n\nm_iv = exp(-β ||x_i - y_v||^2) / Σ_{μ=1}^K exp(-β ||x_i - y_μ||^2) .   (1) \n\nNote that the probability distribution is invariant under translation and rotation of the complete parameter sets X, Y. In addition, the scale parameter β could be dropped, since a change of β only results in a rescaling of the prototypes Y and the data points X. 
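As a sketch only — our own NumPy rendering of this model, with all function names, the toy data, and the plain gradient-ascent loop our invention rather than the authors' implementation — the Gaussian mixture assignment probabilities m_iv and a simple maximum-likelihood fit of X and Y to observed fuzzy assignments Q can be written as:

```python
import numpy as np

def assignment_probs(X, Y, beta):
    """m[i, v] = exp(-beta ||x_i - y_v||^2) / sum_mu exp(-beta ||x_i - y_mu||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    m = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))  # shifted for stability
    return m / m.sum(axis=1, keepdims=True)

def log_likelihood(Q, X, Y, beta):
    """L_Q(X, Y) = sum_i sum_v q_iv log m_iv."""
    return float((Q * np.log(assignment_probs(Q.shape and X, Y, beta))).sum())

def gradients(Q, X, Y, beta):
    """Analytic gradients of L_Q with respect to the points X and prototypes Y."""
    M = assignment_probs(X, Y, beta)
    gX = -2.0 * beta * (M @ Y - Q @ Y)                         # dL/dx_i
    D = M - Q
    gY = -2.0 * beta * (D.T @ X - D.sum(axis=0)[:, None] * Y)  # dL/dy_a
    return gX, gY

# fit embedding coordinates X and prototypes Y to observed fuzzy assignments Q
rng = np.random.default_rng(1)
Q = rng.dirichlet(np.ones(3), size=40)        # 40 objects, 3 clusters, rows sum to 1
X, Y, beta = rng.normal(size=(40, 2)), rng.normal(size=(3, 2)), 1.0
before = log_likelihood(Q, X, Y, beta)
for _ in range(300):                          # plain gradient ascent on L_Q
    gX, gY = gradients(Q, X, Y, beta)
    X += 0.02 * gX
    Y += 0.02 * gY
after = log_likelihood(Q, X, Y, beta)
```

The resulting two-dimensional X and Y can be plotted directly; more robust optimizers than this fixed-step loop are of course possible, which is why the text below only requires that the gradients be usable "for any gradient descent scheme".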
For the observation Q the log-likelihood is given by^1 \n\nL_Q(X, Y) = Σ_{i=1}^N Σ_{v=1}^K q_iv log m_iv .   (2) \n\nIt is worth noting that when the q_iv = ⟨M_iv⟩_{P^true} are estimates obtained from a factorial distribution, i.e. P^true(M) = Π_{i=1}^N Σ_{v=1}^K M_iv q_iv, then maximizing (2) is identical to minimizing the Kullback-Leibler (KL-)divergence D_KL(P^true || P) = Σ_M P^true log(P^true / P). In that case the similarity to the recent approach of Hofmann et al. [6], proposed as the minimization of D_KL(P || P^true), becomes apparent. Compared to [6] the roles of P and P^true are interchanged. From an information-theoretic viewpoint D_KL(P^true || P) is the better choice, as it quantifies the coding inefficiency of assuming the distribution P when the true distribution is P^true. Note that the choice of the KL-divergence as a distortion measure for distributions follows intrinsically from the likelihood principle. Maximum likelihood estimates are derived by differentiation: \n\n∂L_Q/∂x_i = Σ_{v=1}^K (q_iv / m_iv) ∂m_iv/∂x_i = -2β Σ_{v=1}^K q_iv (Σ_{μ=1}^K m_iμ y_μ - y_v) ,   (3) \n\n∂L_Q/∂y_α = Σ_{i=1}^N Σ_{v=1}^K (q_iv / m_iv) ∂m_iv/∂y_α = -2β Σ_{i=1}^N Σ_{v=1}^K q_iv (m_iα - δ_αv)(x_i - y_α) = -2β Σ_{i=1}^N (m_iα - q_iα)(x_i - y_α) .   (4) \n\nThe gradients can be used for any gradient descent scheme. In the experiments, we used (3)-(4) in conjunction with a simple gradient descent technique, which has been observed to be efficient and reliable up to a few hundred objects. From (4) an explicit formula for the prototypes may be recovered: \n\ny_α = Σ_{i=1}^N (m_iα - q_iα) x_i / Σ_{i=1}^N (m_iα - q_iα) ,   (5) \n\nwhich can be interpreted as an alternative centroid rule. The position of the prototypes is dominated by objects with a large deviation between modeled and measured assignment probabilities. Note that (5) should not be used as an iterative equation, as the corresponding fixed point is not contractive. \n\n^1 Here, it is implicitly assumed that all q_iv have been estimated based on the same amount of information. \n\nFigure 1: Visualization of two-dimensional artificial data. Original data generated by the mixture model with β = 1.0 and 5 prototypes. Crosses denote the data points x_i, circles the prototypes y_α. The embedding prototypes are plotted as squares, while the embedding data points are diamonds. The contours are given by f(x) = max_α (exp(-β ||x - y_α||^2) / Σ_{μ=1}^K exp(-β ||x - y_μ||^2)). For visualization purposes the embedding is translated and rotated into the correct position. \n\n3 Results \n\nAs a first experiment we discuss the approximately recoverable case, where we sample from (1) to generate artificial two-dimensional data and infer the positions of the sample points and of the prototypes by the visualizing group structure approach (see Fig. 1). Due to iso-contour lines in the generator density and in the visualization density, not all data positions are recovered exactly. We would like to emphasize that the complete information available on the grouping structure of the data is preserved, since the mean KL-divergence is quite small (≈ 2.1·10^-5). It is worth mentioning that the rank order of the assignments of objects o_i to clusters C_α is completely preserved. \n\nFigure 2: Embedding of the VisTex database with MDS. \n\nFor many image retrieval systems image similarity has been defined as similarity of occurring feature coefficients, e.g. colors or texture features. In [7], a novel statistical mixture model for distributional data, the probabilistic histogram clustering (ACM), has been proposed, which we applied to extract the group structure inherent in image databases based on histograms of textured image features. The ACM explains the observed data by the following generative model: 
\n\n1. select an object o_i ∈ O with probability p_i, \n2. choose a cluster C_α according to the cluster membership M_iα of o_i, \n3. sample a feature v_j ∈ V from the cluster-conditional distribution q_{j|α}. \n\nThis generative model is formalized by \n\nP(o_i, v_j | M, p, q) = p_i Σ_{α=1}^K M_iα q_{j|α} .   (6) \n\nThe parameters are estimated by maximum likelihood inference. The assignments M_iα are treated as unobserved data in an (annealed) EM procedure, which provides a probabilistic group description. For the details we refer to [7]. \n\nIn the experiments, texture features are extracted by a bank of 12 Gabor filters with 3 scales and 4 orientations. Different Gabor channels are assumed to be independently distributed, which results in a concatenated histogram of the empirically measured channel distributions. Each channel was discretized into 40 bins, resulting in a 480-dimensional histogram representing one image. For the experiments two different databases were used. \n\nIn Fig. 3 a probabilistic K = 10 cluster solution for 160 images containing different textures taken from the Brodatz album is visualized. The clustering algorithm produces 8 well-separated clusters, while the two clusters in the middle region exhibit substantial overlap. A close inspection of these two clusters indicates that the fuzziness of the assignments in this area is plausible, as the textures in this area have similar frequency components in common. \n\nThe result for a more complex database of 220 textured images taken from the MIT VisTex image database, with a large range of uniformly and non-uniformly textured images, is depicted in Fig. 4. This plot indicates that the proposed approach provides a structured view on image databases. 
Especially the upper left cluster yields some insight into the clustering solution, as this cluster consists of a large range of non-uniformly textured images, enabling the user to decide that a higher number of clusters might yield a better solution. The visualization approach fits naturally into an interactive scenario, where the user can interactively choose data points to focus the examination on certain areas of interest in the clustering solution. \n\nFor comparison, we present in Fig. 2 a multidimensional scaling (Sammon's mapping [3]) solution for the VisTex database. A detailed inspection of this plot indicates that the embedding is locally quite satisfactory, while no global structure of the database is visible. This is explained by the fact that Sammon's mapping only tries to preserve the inter-object distances, while our novel approach first extracts group structure in a high-dimensional feature space and then embeds this group structure in a low-dimensional Euclidean space. While MDS completely neglects the grouping structure, our approach does not insist on the exact inter-object distances. \n\n4 Conclusion \n\nIn this contribution, a generic framework for the low-dimensional visualization of probabilistic group structure was presented. The effectiveness of this approach was demonstrated by experiments on artificial data as well as on databases of textured images. While we have focused on histogram data, the generality of the approach makes it feasible to visualize a broad range of different data types, e.g. vectorial, proximity or co-occurrence data. Thus, it is useful in a broad variety of applications, ranging from image and document retrieval over the analysis of marketing data to the inspection of protein data. 
We believe that this technique provides the user with substantial insight into the validity of clustering solutions, making the inspection and interpretation of large databases more practicable. \n\nA natural extension of the proposed approach leads to the visualization of hierarchical cluster structures by a hierarchy of visualization plots. \n\nReferences \n\n[1] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10(1):215-234, 1998. \n\n[2] C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualization. Technical Report NCRG/96/028, Neural Computing Research Group, Dept. of Computer Science & Applied Mathematics, Aston University, 1998. \n\n[3] T. F. Cox and M. A. A. Cox. Multidimensional Scaling, volume 59 of Monographs on Statistics and Applied Probability. Chapman & Hall, London, New York, 1994. \n\n[4] J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detecting well-separated clusters. Journal of Cybernetics, 3:32-57, 1973. \n\n[5] I. Gath and A. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:773-781, 1989. \n\n[6] T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1-25, 1997. \n\n[7] T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In Advances in Neural Information Processing Systems 11. MIT Press, 1999. \n\n[8] F. C. N. Pereira, N. Z. Tishby, and L. Lee. Distributional clustering of English words. In 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, pages 183-190, 1993. \n\n[9] K. Rose, E. Gurewitz, and G. C. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(9):589-594, September 1990. 
\n\nFigure 3: Visualization of a probabilistic grouping structure inferred for a database of 160 Brodatz textures. A mean KL-divergence of 0.031 is obtained. \n\nFigure 4: Visualization of a probabilistic grouping structure inferred for 220 images of the VisTex database. A mean KL-divergence of 0.0018 is obtained. \n", "award": [], "sourceid": 1552, "authors": [{"given_name": "Marcus", "family_name": "Held", "institution": null}, {"given_name": "Jan", "family_name": "Puzicha", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}