Clustering and Class Discovery
Find the inherent structure of gene expression data by clustering genes, samples, or both.
Click the desired algorithm:
- Hierarchical clustering (Eisen et al., 1998) groups elements based on how close they are to one another. The result is a tree structure, referred to as a dendrogram. This is a common and valuable approach; however, it is highly sensitive to the metric used to assess distance and requires you to define clusters subjectively based on the dendrogram (Brunet et al., 2004).
- K-means clustering (MacQueen, 1967) groups elements into a specified number of clusters, k, which can be useful when you know or suspect the number of clusters in the data. The algorithm randomly selects a data point as the initial center of each of the k clusters and assigns every data point to the nearest cluster center. It then iterates: it recalculates each cluster center as the mean of the cluster's members and reassigns all data points to the closest center, repeating until the centers stop moving between consecutive iterations, which yields k stable clusters.
- Non-negative matrix factorization (NMF) (Brunet et al., 2004) clusters the data by breaking it down into metagenes or metasamples, each of which represents a group of genes or samples, respectively. NMF extracts features that may more accurately correspond to biological processes.
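
The K-means procedure described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the K-means module. The function name `kmeans`, its parameters (`max_iter`, `tol`, `seed`), the choice of Euclidean distance, and the small `profiles` data set are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster `points` (a list of equal-length numeric tuples) into k groups.

    Returns (centers, labels). Names and defaults are illustrative only.
    """
    rng = random.Random(seed)
    # Step 1: pick k distinct data points at random as the initial centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)

    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center
        # (Euclidean distance is an assumption of this sketch).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]

        # Step 3: recompute each center as the mean of its current members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])

        # Step 4: stop once no center moves more than `tol` between iterations.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break

    return centers, labels


# Hypothetical two-condition expression profiles for six genes.
profiles = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
            (8.0, 8.2), (7.9, 8.1), (8.3, 7.7)]
centers, labels = kmeans(profiles, k=2)
print(labels)
```

With two well-separated groups such as these, the first three profiles should end up sharing one label and the last three the other. Which group receives label 0 depends on the random initial centers, so results on real data can vary between runs with different seeds.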
References
Brunet, J.-P., Tamayo, P., Golub, T.R., and Mesirov, J.P. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. USA 101(12):4164–4169.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95(25):14863–14868.
MacQueen, J.B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, Berkeley, California. pp. 281–297.