Relevant overlapping subspace clusters on categorical data

Xiao He, Jing Feng, Bettina Konte, Thai Son Mai, Claudia Plant

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

7 Citations (Scopus)


Clustering categorical data poses some unique challenges: due to missing order and spacing among the categories, selecting a suitable similarity measure is a difficult task. Many existing techniques require the user to specify input parameters which are difficult to estimate. Moreover, many techniques are limited to detecting clusters in the full-dimensional data space. Only a few methods exist for subspace clustering, and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameter. The relevance of each cluster is validated by its contribution to compressing the data. Optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is especially designed to detect subspace clusters on categorical data which may overlap in objects and/or attributes; i.e., objects can be assigned to different clusters in different subspaces, and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
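The idea of validating a cluster by its contribution to compression can be illustrated with a minimal sketch. This is not ROCAT's actual coding scheme, which the abstract does not specify: it assumes a simple entropy-based code per attribute, and the function names and the `model_penalty_bits` parameter (a stand-in for the cost of describing the cluster itself) are hypothetical. A candidate subspace cluster counts as relevant here only if coding its cells separately saves more bits than the model penalty costs.

```python
import math
from collections import Counter


def column_cost_bits(values):
    """Bits needed to code a categorical column with an optimal code
    for its empirical distribution, i.e. n * H(values)."""
    n = len(values)
    counts = Counter(values)
    return -sum(c * math.log2(c / n) for c in counts.values())


def data_cost_bits(rows, attrs):
    """Total coding cost of the given attributes over the given rows."""
    return sum(column_cost_bits([r[a] for r in rows]) for a in attrs)


def compression_gain_bits(rows, cluster_ids, cluster_attrs, model_penalty_bits):
    """Bits saved when the cells (cluster_ids x cluster_attrs) are coded
    with their own distribution instead of the global one.
    model_penalty_bits is an assumed, hypothetical model-complexity cost."""
    inside = [rows[i] for i in cluster_ids]
    outside = [r for i, r in enumerate(rows) if i not in cluster_ids]
    baseline = data_cost_bits(rows, cluster_attrs)
    with_cluster = (
        data_cost_bits(inside, cluster_attrs)
        + data_cost_bits(outside, cluster_attrs)
        + model_penalty_bits
    )
    return baseline - with_cluster


# Toy data: objects 0-3 agree on attributes 0 and 1; attribute 2 is noise.
rows = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "x", "r"), ("a", "x", "s"),
    ("b", "y", "p"), ("c", "z", "q"), ("b", "z", "r"), ("c", "y", "s"),
]
objects = {0, 1, 2, 3}

gain_subspace = compression_gain_bits(rows, objects, (0, 1), model_penalty_bits=10.0)
gain_noise = compression_gain_bits(rows, objects, (2,), model_penalty_bits=10.0)
print(gain_subspace, gain_noise)
```

On this toy data the gain is positive for the subspace of attributes 0 and 1 and negative for the noise attribute, so only the former would be kept under this simplified criterion. The same comparison also hints at why redundant or overlapping clusters are rejected unless they pay for their own description cost.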
Original language: English
Title of host publication: ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD)
Publication status: Published - 2014
Externally published: Yes


  • Relevant Subspace Clustering
  • Categorical Data
  • Minimum Description Length
