KNN Class Prediction: One Data Set

Before you begin

A gene expression data set consists of two files:

GCT or RES file that contains gene expression data.
Example file: all_aml_train.gct.
CLS file that identifies the class of each sample in the gene expression data.
Example file: all_aml_train.cls.

learn more:
file formats

Step 1: PreprocessDataset

Preprocess gene expression data to remove platform noise and genes that have little variation. Note: If preprocessing the data removes relevant biological information, skip this step.

Open module in the GenePattern window.
Open module with example data in the GenePattern window.

Considerations

PreprocessDataset can preprocess the data in one or more ways (in this order):
1. Set threshold and ceiling values. Any value lower than the threshold is reset to the threshold, and any value higher than the ceiling is reset to the ceiling.
2. Convert each expression value to the log base 2 of the value.
3. Remove genes (rows) if a given number of their sample values are less than a given threshold.
4. Remove genes (rows) that do not have a minimum fold change or expression variation.
5. Discretize or normalize the data.
When using ratios to compare gene expression between samples, convert each value to its log base 2 to bring up- and down-regulated genes to the same scale. For example, ratios of 2 and 0.5, which indicate two-fold changes for up- and down-regulated expression, respectively, are converted to +1 and -1.
If you did not generate the expression data, check whether preprocessing steps have already been taken before running the PreprocessDataset module.
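
As a rough illustration of steps 1 and 4 above, and of the log base 2 conversion of ratios, the following Python sketch clamps values to a threshold and ceiling and then drops genes with little variation. It is not the PreprocessDataset implementation: the function name `preprocess`, its parameter names, the default values, and the sample data are all hypothetical, and steps 2, 3, and 5 are omitted.

```python
import math

def preprocess(rows, floor=20.0, ceiling=16000.0, min_fold=3.0, min_delta=100.0):
    """Clamp expression values, then drop low-variation genes.

    rows maps a gene name to its list of expression values (one per sample).
    All parameter names and defaults are hypothetical; floor must be positive.
    """
    kept = {}
    for gene, values in rows.items():
        # Step 1: reset values below the threshold or above the ceiling.
        clamped = [min(max(v, floor), ceiling) for v in values]
        # Step 4: keep the gene only if it varies enough across samples.
        lo, hi = min(clamped), max(clamped)
        if hi / lo >= min_fold and hi - lo >= min_delta:
            kept[gene] = clamped
    return kept

# Made-up data: one nearly flat gene and one highly variable gene.
data = {"flat": [100, 110, 105], "variable": [10, 500, 20000]}
filtered = preprocess(data)  # only "variable" survives, clamped to [20, 16000]

# Ratios of 2 and 0.5 land symmetrically around zero on the log2 scale.
log_ratios = [math.log2(r) for r in (2, 0.5)]  # [1.0, -1.0]
```

The filter here requires both a minimum fold change and a minimum absolute difference; whether the module combines the two tests in this way is an assumption of the sketch.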

learn more:
PreprocessDataset

Step 2: KNNXValidation

KNNXValidation runs the KNN class prediction method iteratively against the known data set. For each iteration, it leaves one sample out, builds the classifier using the remaining samples, and then tests the classifier on the sample left out. It creates two files:

a prediction results file that shows the accuracy of the classifiers
a features results file, which lists all genes used in any classifier and the number of times that gene was used in a classifier
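
The leave-one-out procedure described above can be sketched in Python as follows. This is a minimal illustration, not the KNNXValidation implementation: it assumes Euclidean distance and a simple majority vote, uses every gene in every iteration (so it does not produce the per-classifier gene lists behind the features results file), and the function names, the value of `k`, and the sample data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, sample, k=3):
    """Predict a class by majority vote among the k nearest training samples."""
    # Rank training samples by Euclidean distance to the held-out sample.
    ranked = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], sample))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out(samples, labels, k=3):
    """Hold out each sample in turn, train on the rest, and predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(knn_predict(train_x, train_y, samples[i], k))
    return predictions

# Made-up two-gene expression profiles for six samples in two classes.
samples = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
predictions = leave_one_out(samples, labels, k=3)
```

Because each sample is predicted by a classifier that never saw it, comparing `predictions` with `labels` gives an estimate of how the method would perform on new samples.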

learn more:
KNNXValidation

Step 3: View results

To view the prediction results file (*.pred.odf), use the PredictionResultsViewer module. The viewer lists each sample with its actual and predicted class. Error rates for class predictions are averaged across all iterations.

To view the features results file (*.feat.odf), use the FeatureSummaryViewer module. The viewer lists each gene used in a class predictor and the number of times it was used in a predictor.

Considerations

The PredictionResultsViewer provides an absolute error rate (incorrect cases/total cases) and an ROC error rate (fraction of true positives versus the fraction of false positives). Use the ROC error rate for comparing results across data sets.
In the FeatureSummaryViewer, the most interesting genes are generally those used by all (or most) of the classifiers. To retrieve gene annotations from a variety of public databases, click the GeneCruiser menu item.
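
For reference, the quantities named above can be computed from the actual and predicted classes as in the following Python sketch. It is only an illustration: the function names and sample data are hypothetical, and how PredictionResultsViewer combines the true positive and false positive fractions into a single ROC error rate is not described here, so the sketch stops at the two fractions.

```python
def absolute_error_rate(actual, predicted):
    """Incorrect cases divided by total cases."""
    incorrect = sum(a != p for a, p in zip(actual, predicted))
    return incorrect / len(actual)

def positive_fractions(actual, predicted, positive):
    """Return (true positive fraction, false positive fraction) for one class.

    Assumes the data contain at least one sample inside and one sample
    outside the chosen positive class.
    """
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    n_pos = sum(a == positive for a in actual)
    n_neg = len(actual) - n_pos
    return true_pos / n_pos, false_pos / n_neg

# Made-up results: one ALL sample is misclassified as AML.
actual = ["ALL", "ALL", "AML", "AML"]
predicted = ["ALL", "AML", "AML", "AML"]
error = absolute_error_rate(actual, predicted)            # 0.25
tpf, fpf = positive_fractions(actual, predicted, "ALL")   # (0.5, 0.0)
```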

learn more:
PredictionResultsViewer
FeatureSummaryViewer

Building a classifier (model) file

KNNXValidation does not save the classifiers that it generates. Typically, an analyst builds a classifier using one data set and tests the classifier using a second data set. It is rare to build a classifier (model) file using one data set without having a second data set available for testing; however, it is possible. To build a classifier (model) file using one data set, run the KNN module and specify that data set as the training data set.

learn more:
KNN