K-means Clustering
Cluster genes and/or samples into a specified number (k) of clusters. The result is k clusters, each built around a center that starts from a randomly selected data point.
Step 1: PreprocessDataset
Preprocess gene expression data
to remove platform noise and genes that have little variation.
Researchers generally preprocess data before clustering; however, if doing so would remove relevant biological information, skip this step.
Considerations
- PreprocessDataset can preprocess the data in one or more ways (in this order):
  - Set threshold and ceiling values. Any value lower than the threshold is reset
    to the threshold, and any value higher than the ceiling is reset to the ceiling.
  - Convert each expression value to the log base 2 of the value.
  - Remove genes (rows) if a given number of their sample values are less than
    a given threshold.
  - Remove genes (rows) that do not have a minimum fold change or expression
    variation.
  - Discretize or normalize the data.
- When using ratios to compare gene expression between samples,
convert values to log base 2 of the value to
bring up- and down-regulated genes to the same scale.
For example, ratios of 2 and 0.5, which indicate two-fold changes for up- and
down-regulated expression, respectively, are converted to +1 and -1.
- If you did not generate the expression data,
check whether preprocessing steps have already been taken before
running the PreprocessDataset module.
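
As a rough illustration of the preprocessing order described above, the following Python sketch applies a threshold and ceiling, converts values to log base 2, and then filters out low-variation genes. It is not the PreprocessDataset implementation: the function name, parameter names, and default values are assumptions made for this example, and the sample-count filter and the discretize/normalize step are omitted.

```python
import math


def preprocess(rows, floor=20.0, ceiling=20000.0, min_log2_range=1.0):
    """Clamp, log2-transform, and variation-filter a {gene: [values]} table.

    All parameter names and defaults are hypothetical, chosen for illustration.
    The floor must be positive so that log2 is defined.
    """
    kept = {}
    for gene, values in rows.items():
        # Reset values below the threshold or above the ceiling.
        clamped = [min(max(v, floor), ceiling) for v in values]
        # Convert each expression value to log base 2.
        logged = [math.log2(v) for v in clamped]
        # Drop genes whose log2 range is too small, that is, genes with
        # less than a 2**min_log2_range fold change across samples.
        if max(logged) - min(logged) < min_log2_range:
            continue
        kept[gene] = logged
    return kept


# Made-up expression values: one row per gene, one value per sample.
expression = {
    "geneA": [10.0, 100.0, 1000.0],   # varies strongly, kept
    "geneB": [50.0, 55.0, 60.0],      # nearly flat, removed
    "geneC": [0.5, 40000.0, 200.0],   # clamped to [20, 20000], kept
}
filtered = preprocess(expression)
print(sorted(filtered))                    # ['geneA', 'geneC']

# Ratios of 2 and 0.5 land on the same scale after the log2 transform.
print([math.log2(r) for r in (2, 0.5)])    # [1.0, -1.0]
```

Filtering on the log2 range is one way to express a minimum fold change once the data are already log-transformed; a different filter order or criterion would give different results.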
Step 2: KMeansClustering
Run k-means clustering on genes (rows) or samples (columns). The module creates
a GCT file for each cluster and a GCT file that organizes all of the expression data by cluster.
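
To make the clustering step concrete, here is a minimal sketch of the standard k-means procedure (random initial centers, then alternating assignment and center updates) in plain Python. It is an illustration under stated assumptions, not the KMeansClustering module's code: the function names, the squared Euclidean distance, the fixed seed, and the sample data are all hypothetical, and GCT file output is not reproduced.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, seed=0):
    """Return (labels, centers) for a list of numeric profiles."""
    rng = random.Random(seed)
    # Start each center at a randomly selected data point.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def group_by_cluster(names, labels):
    """Collect row names by cluster label, similar to one list per cluster."""
    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(label, []).append(name)
    return groups


# Made-up gene profiles (rows) across two samples (columns).
names = ["g1", "g2", "g3", "g4", "g5", "g6"]
profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(profiles, k=2)
print(group_by_cluster(names, labels))
```

To cluster samples instead of genes, the same function could be applied to the transposed table, so that each sample becomes one profile.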
Step 3: HeatMapViewer
For an overview of the results, use a heatmap to display
the expression data organized by cluster.
Considerations
- The HeatMapViewer displays gene expression data as a heat map, which makes it easier to see patterns in the numeric data.
Gene names are row labels and sample names are column labels.
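
The idea behind a heat map can be sketched in a few lines of Python: each value is mapped to a shade by its position between the smallest and largest value, with gene names labeling rows and sample names labeling columns. This text-based rendering is only a toy stand-in for the HeatMapViewer; the function name, the character ramp, and the data are assumptions for illustration.

```python
SHADES = " .:-=+*#%@"  # low values on the left, high values on the right


def text_heatmap(genes, samples, matrix):
    """Render a matrix as text, one shaded cell per expression value."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    span = (hi - lo) or 1.0  # avoid division by zero for constant data
    width = max(len(g) for g in genes)
    # Header line: sample names as column labels.
    lines = [" " * width + " " + " ".join(samples)]
    for gene, row in zip(genes, matrix):
        cells = []
        for name, value in zip(samples, row):
            index = int((value - lo) / span * (len(SHADES) - 1))
            # Repeat the shade so the cell is as wide as its column label.
            cells.append(SHADES[index] * len(name))
        # Gene name as the row label, followed by the shaded cells.
        lines.append(gene.ljust(width) + " " + " ".join(cells))
    return "\n".join(lines)


# Made-up data, already ordered by cluster.
genes = ["g1", "g2"]
samples = ["s1", "s2"]
matrix = [[0, 9], [9, 0]]
heatmap = text_heatmap(genes, samples, matrix)
print(heatmap)
```

Ordering the rows by cluster before rendering is what makes blocks of similar shading visible.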