Abstract
Clustering categorical data poses unique challenges: because the categories have no natural order or spacing, selecting a suitable similarity measure is difficult. Many existing techniques require the user to specify input parameters that are hard to estimate. Moreover, many techniques are limited to detecting clusters in the full-dimensional data space. Only a few methods exist for subspace clustering, and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameter. The relevance of each cluster is validated by its contribution to compressing the data. By optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is specifically designed to detect subspace clusters on categorical data that may overlap in objects and/or attributes; i.e., objects can be assigned to different clusters in different subspaces, and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
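The compression idea in the abstract can be illustrated with a minimal sketch. This is not ROCAT's coding scheme, which is not given here. The cost model is a hypothetical simplification: each attribute is coded with an ideal code for its empirical distribution, cells inside a candidate subspace cluster get their own code, and declaring a cluster costs one flag bit per object and per attribute. Code-table costs are ignored. The names `entropy_bits` and `description_length` and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Total bits to encode `values` with an ideal code for their empirical distribution."""
    n = len(values)
    return -sum(c * math.log2(c / n) for c in Counter(values).values())


def description_length(data, cluster_rows, cluster_attrs):
    """Bits needed for `data` when one candidate subspace cluster is coded separately.

    Hypothetical cost model (an assumption, not the paper's): cells inside the
    cluster use the cluster's own per-attribute code, all other cells use a
    per-attribute code over the remaining objects, and declaring the cluster
    costs one flag bit per object plus one flag bit per attribute.
    """
    n, d = len(data), len(data[0])
    rows, attrs = set(cluster_rows), set(cluster_attrs)
    bits = 0.0
    for a in range(d):
        in_subspace = a in attrs
        inside = [data[r][a] for r in range(n) if in_subspace and r in rows]
        outside = [data[r][a] for r in range(n) if not (in_subspace and r in rows)]
        bits += entropy_bits(inside) + entropy_bits(outside)
    model_bits = (n + d) if rows and attrs else 0
    return bits + model_bits


# Toy data: objects 0-7 agree on attributes 0 and 1; attribute 2 is uninformative.
cluster = [("a", "x", "pqrs"[i % 4]) for i in range(8)]
noise = [("bcde"[i % 4], "wyzv"[i % 4], "pqrs"[i % 4]) for i in range(8)]
data = cluster + noise

baseline = description_length(data, [], [])                # no cluster declared
with_cluster = description_length(data, range(8), [0, 1])  # homogeneous block
with_noise = description_length(data, range(8, 16), [2])   # irrelevant block

print(baseline, with_cluster, with_noise)
```

Under these assumptions, the homogeneous block lowers the total description length while the irrelevant block raises it. That is the sense in which a cluster's relevance can be judged by its contribution to compression, with the flag bits playing the role of model complexity.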
Original language | English |
---|---|
Title of host publication | ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD) |
Pages | 213--222 |
Publication status | Published - 2014 |
Externally published | Yes |
Keywords
- Relevant Subspace Clustering
- Categorical Data
- Minimum Description Length
Profiles
- Thai Son Mai
- School of Electronics, Electrical Engineering and Computer Science - Senior Lecturer
Person: Academic