Interpretable and Reconfigurable Clustering of Document Datasets by Deriving Word-based Rules

Vipin Balachandran, Deepak Padmanabhan, Deepak Khemani

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that necessitate reconfigurability and high interpretability of clusters and outline the problem of generating clusterings with interpretable and reconfigurable cluster models. We develop two clustering algorithms, RGC and RGC-D, toward this goal. They generate clusters with associated rules that are composed of conditions on word occurrences or nonoccurrences. The two approaches differ in the complexity of the rule format: RGC employs disjunctions and conjunctions in rule generation, whereas RGC-D rules are simple disjunctions of conditions signifying the presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. Rules of the latter kind are easier to interpret, whereas rules of the former kind lead to more accurate clustering. We show that our approaches outperform the unsupervised decision tree approach for rule-generating clustering, and also an approach we provide for generating interpretable models for general clusterings, both by significant margins. We empirically show that the purity and F-measure losses incurred to achieve interpretability can be as little as 3% and 5%, respectively, using the algorithms presented herein.
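To make the rule format concrete, here is a minimal sketch of how a word-based rule could select a cluster's documents. It is not the RGC or RGC-D algorithm: it shows only rule evaluation, not rule generation. The names `Condition`, `satisfies`, and `cluster_of`, the whitespace tokenization, the encoding of rules as a disjunction of conjunctions, and the toy documents are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """A single test on a document: a word occurs, or does not occur."""
    word: str
    present: bool = True  # False means the word must NOT occur

    def holds(self, words: set[str]) -> bool:
        return (self.word in words) == self.present


# Assumed encoding: a rule is an OR over conjunctions (ANDs) of conditions.
Rule = list[list[Condition]]


def satisfies(doc: str, rule: Rule) -> bool:
    """True if at least one conjunction has all of its conditions met."""
    words = set(doc.lower().split())  # naive whitespace tokenization
    return any(all(c.holds(words) for c in conj) for conj in rule)


def cluster_of(docs: list[str], rule: Rule) -> list[str]:
    """The cluster is exactly the set of documents that satisfy the rule."""
    return [d for d in docs if satisfies(d, rule)]


docs = [
    "the striker scored a late goal",
    "the court ruled on the appeal",
    "goal disallowed after court appeal",
]

# Simple disjunction of word-presence conditions: "goal" OR "striker".
simple_rule = [[Condition("goal")], [Condition("striker")]]

# Rule mixing conjunction and nonoccurrence:
# ("goal" AND NOT "court") OR "striker".
mixed_rule = [
    [Condition("goal"), Condition("court", present=False)],
    [Condition("striker")],
]

sports_simple = cluster_of(docs, simple_rule)
sports_mixed = cluster_of(docs, mixed_rule)
```

In this toy data, the simple disjunction also captures the third document, which mentions "goal" in a legal context, while the mixed rule excludes it. This mirrors the trade-off stated in the abstract: simpler rules are easier to read, and richer rules can separate documents more precisely.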
Original language: English
Pages (from-to): 475-503
Number of pages: 29
Journal: Knowledge and Information Systems
Volume: 32
Issue number: 3
Publication status: Published - 2012
