K-means Clustering

Cluster genes and/or samples into a specified number (k) of clusters. The result is k clusters, each built around a center that starts at a randomly selected data point.

Before you begin

Gene expression data must be in a GCT or RES file.
Example file: all_aml_test.gct.
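
As a rough illustration of what a GCT file contains, the sketch below parses the tab-delimited GCT 1.2 layout (a version line, a line with row and column counts, a header row starting with Name and Description, then one row per gene). The function name read_gct and the tiny inline dataset are hypothetical and are not taken from all_aml_test.gct; RES files use a different layout and are not handled here.

```python
import io

def read_gct(handle):
    """Parse a GCT 1.2 stream into (sample_names, {gene_name: [values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected GCT version line: {version!r}")
    # Second line: number of data rows, then number of sample columns.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\r\n").split("\t")
    samples = header[2:]  # columns 0 and 1 are Name and Description
    genes = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate trailing blank lines
        fields = line.rstrip("\r\n").split("\t")
        genes[fields[0]] = [float(v) for v in fields[2:]]
    if len(genes) != n_rows or len(samples) != n_cols:
        raise ValueError("row/column counts do not match the header")
    return samples, genes

# Made-up two-gene, three-sample dataset for illustration only.
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "geneA\tna\t10\t20\t30\n"
    "geneB\tna\t5\t5\t5\n"
)
samples, genes = read_gct(io.StringIO(example))
```

Genes map to rows and samples map to columns, which is the orientation the later steps assume.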

learn more:
file formats

Step 1: PreprocessDataset

Preprocess gene expression data to remove platform noise and genes that have little variation. Although researchers generally preprocess data before clustering, skip this step if doing so would remove relevant biological information.

Open module in the GenePattern window.
Open module with example data in the GenePattern window.

Considerations

PreprocessDataset can preprocess the data in one or more ways (in this order):
1. Set threshold and ceiling values. Any value lower than the threshold is reset to the threshold, and any value higher than the ceiling is reset to the ceiling.
2. Convert each expression value to the log base 2 of the value.
3. Remove a gene (row) if a given number of its sample values are less than a given threshold.
4. Remove genes (rows) that do not have a minimum fold change or expression variation.
5. Discretize or normalize the data.
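
As a rough illustration of operations 1 and 4 in the list above, the following sketch clamps values to a threshold and ceiling and then drops genes with little variation. It is not the PreprocessDataset implementation: the function names, the parameter names (floor, ceiling, min_fold, min_delta), the numeric values, and the choice to measure variation as max/min fold change plus max-min difference are all assumptions made for the example.

```python
def clamp(values, floor, ceiling):
    """Reset values below the floor to the floor and above the ceiling to the ceiling."""
    return [min(max(v, floor), ceiling) for v in values]

def passes_variation_filter(values, min_fold, min_delta):
    """Assumed variation test: max/min fold change and max-min difference."""
    lo, hi = min(values), max(values)
    return hi / lo >= min_fold and hi - lo >= min_delta

def preprocess(genes, floor=20.0, ceiling=20000.0, min_fold=3.0, min_delta=100.0):
    """Clamp every gene (row), then keep only the genes that vary enough."""
    if floor <= 0:
        raise ValueError("floor must be positive so fold change is defined")
    kept = {}
    for name, values in genes.items():
        clamped = clamp(values, floor, ceiling)
        if passes_variation_filter(clamped, min_fold, min_delta):
            kept[name] = clamped
    return kept

# Made-up expression values, one list per gene.
genes = {
    "geneA": [5.0, 400.0, 90.0],      # varies a lot: kept (5.0 is raised to 20.0)
    "geneB": [100.0, 110.0, 105.0],   # nearly flat: removed
    "geneC": [50.0, 30000.0, 200.0],  # 30000.0 is clipped to the ceiling, then kept
}
filtered = preprocess(genes)
```

Clamping before filtering matters here: a positive floor keeps the fold-change ratio well defined even when raw values are zero or negative.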
When using ratios to compare gene expression between samples, convert each value to its log base 2 to bring up- and down-regulated genes to the same scale. For example, ratios of 2 and 0.5, which indicate two-fold changes for up- and down-regulated expression, respectively, are converted to +1 and -1.
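
The effect of the log transform on ratios can be checked with a few lines of Python. The ratio values below are made up for the example.

```python
import math

# Expression ratios: values above 1 are up-regulated, below 1 down-regulated.
ratios = [2.0, 0.5, 4.0, 0.25, 1.0]

# After log2, equal fold changes in either direction have equal magnitude.
log_ratios = [math.log2(r) for r in ratios]
print(log_ratios)  # [1.0, -1.0, 2.0, -2.0, 0.0]
```

On the raw scale, down-regulated ratios are squeezed between 0 and 1 while up-regulated ratios are unbounded; on the log scale both are symmetric around 0.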
If you did not generate the expression data, check whether preprocessing steps have already been taken before running the PreprocessDataset module.

learn more:
PreprocessDataset

Step 2: KMeansClustering

Run k-means clustering on genes (rows) or samples (columns). The module creates a GCT file for each cluster and a GCT file that organizes all of the expression data by cluster.
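
For readers who want to see the idea behind the module, here is a minimal sketch of the standard k-means procedure (Lloyd's algorithm) in plain Python. It is not the KMeansClustering module's code: the function names, the seed and max_iter parameters, the squared Euclidean distance, and the sample data are assumptions made for the example. To cluster samples instead of genes, pass the columns as rows.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(rows, k, seed=0, max_iter=100):
    """Return (labels, centers) for the given rows."""
    rng = random.Random(seed)
    # Start from k randomly selected data points.
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean_point(members)
    return labels, centers

# Made-up data: two low-expression rows and two high-expression rows.
rows = [[1.0, 1.2, 0.9], [1.1, 0.8, 1.0], [9.0, 9.5, 8.8], [8.7, 9.1, 9.3]]
labels, centers = kmeans(rows, k=2)

# Group row indices by cluster, analogous to one output per cluster.
clusters = {c: [i for i, lab in enumerate(labels) if lab == c] for c in range(2)}
```

Because the starting centers are chosen at random, different seeds can give different clusterings on less clearly separated data.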

learn more:
KMeansClustering

Step 3: HeatMapViewer

For an overview of the results, use a heatmap to display the expression data organized by cluster.

Considerations

The HeatMapViewer displays gene expression data as a heat map, which makes it easier to see patterns in the numeric data. Gene names are row labels and sample names are column labels.
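
The idea of turning numbers into a visual scale can be sketched with a text-only heat map, where each value is mapped to a character of increasing density. This is only an illustration and does not reproduce HeatMapViewer's color scheme: the function name text_heatmap, the shade characters, the global min-max scaling, and the data are assumptions made for the example.

```python
def text_heatmap(gene_names, sample_names, rows, shades=".:-=+*#%@"):
    """Render rows as text: gene names label rows, sample names label columns."""
    flat = [v for row in rows for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant data
    width = max(len(n) for n in gene_names)
    lines = [" " * width + " " + " ".join(sample_names)]
    for name, row in zip(gene_names, rows):
        cells = []
        for value, sample in zip(row, sample_names):
            # Scale the value to an index into the shade characters.
            idx = int((value - lo) / span * (len(shades) - 1))
            cells.append(shades[idx] * len(sample))
        lines.append(name.ljust(width) + " " + " ".join(cells))
    return "\n".join(lines)

# Made-up data: geneA rises across samples, geneB falls.
heatmap = text_heatmap(
    ["geneA", "geneB"],
    ["S1", "S2", "S3"],
    [[0.0, 5.0, 10.0], [10.0, 5.0, 0.0]],
)
print(heatmap)
```

Even in this crude form, the opposite trends of the two genes are easier to spot than in the raw numbers.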

learn more:
HeatMapViewer