Lianyu Hu

Ph.D. Thesis

Lianyu Hu, Clustering Categorical Data Based on Hypothesis Test, 2025.
Supervised by Prof. Zengyou He.

Abstract: Clustering analysis is a fundamental area of machine learning and data mining, crucial for uncovering latent structures and identifying key patterns across disciplines. Categorical data, widely encountered in applications such as survey responses, credit records, event logs, and gene expression analysis, consists of discrete rather than continuous features, with attributes that often lack comparability. As a result, general-purpose clustering methods designed for numerical data are often unsuitable for categorical data. While significant progress has been made in categorical data clustering, existing methods focus primarily on maximizing clustering accuracy and often overlook the credibility and interpretability of the results. Two key issues stand out. Regarding credibility, clustering algorithms may produce invalid partitions on random data or on data without intrinsic cluster structure, and the partitions they produce may lack statistical significance, leading to false positives. Regarding interpretability, the clustering process often lacks a reliable explanation, and cluster assignments typically come without a rational justification for why a sample belongs to its cluster. To address these issues, this dissertation adopts a statistical inference perspective, reformulating clustering problems as hypothesis testing tasks at different stages. For each task, falsifiable null and alternative hypotheses are established, corresponding test statistics are constructed, and p-values are derived, providing principled and interpretable solutions.

M.S. Thesis

Lianyu Hu, Clustering based on Space Transformation, 2019.
Supervised by Prof. Caiming Zhong.

Abstract: Clustering based on Space Transformation (CST) is a novel unified clustering framework. It enhances the effectiveness and universality of a clustering algorithm across various datasets, and it avoids the problem of selecting a suitable clustering algorithm and corresponding parameters for an unknown dataset. CST uses a new similarity matrix obtained from the original Euclidean distance matrix of a dataset through a nonlinear mapping that is explicit and interpretable for clustering algorithms. In the transformed similarity space, more evidence of cluster structure is extracted, which allows traditional clustering algorithms to produce more accurate results. Surprisingly, CST is also robust to outliers. In this thesis, Normalized Cut is used to demonstrate the effectiveness and universality of CST. As a starting point, I studied two kinds of CST and tried to identify what is essential to validity in clustering. To choose a specific new similarity from a candidate set of CSTs, a newly proposed internal validity index is used, so designing a good internal validity index is the key to my studies. Finally, several mechanisms underlying clusters and clustering are incorporated, leading to a new internal validity index called CVDD. Experimental results show that CVDD outperforms several classic internal validity indices and can cope with challenging datasets, such as those with non-spherical clusters, density-separated clusters, or outliers.