KNN Class Prediction: Two Data Sets
To build and test classifiers using the k-nearest-neighbors (KNN) class prediction method and two gene expression data sets:
- First use the KNNXValidation module to determine the best parameter settings for the KNN class prediction method.
- Then use the KNN module to build KNN classifiers, test previously generated KNN classifiers, or classify unknown samples using previously generated KNN classifiers.
Before you begin
Use one data set to train the classifier and the other to test it. Each gene expression data set consists of two files: an expression data file and a class file that specifies the class of each sample.
Step 1: PreprocessDataset
Preprocess gene expression training data to remove platform noise and genes that have little variation. Note: If preprocessing the data removes relevant biological information, skip this step.
Do not preprocess the gene expression test data. The test data should contain all of the genes present in the training data.
Considerations
- PreprocessDataset can preprocess the data in one or more ways (in this order):
  - Set threshold and ceiling values. Any value lower than the threshold is reset to the threshold, and any value higher than the ceiling is reset to the ceiling.
  - Convert each expression value to the log base 2 of the value.
  - Remove genes (rows) if a given number of their sample values are less than a given threshold.
  - Remove genes (rows) that do not have a minimum fold change or expression variation.
  - Discretize or normalize the data.
- When using ratios to compare gene expression between samples, convert each value to its log base 2 to bring up- and down-regulated genes to the same scale. For example, ratios of 2 and 0.5, which indicate two-fold changes for up- and down-regulated expression, respectively, are converted to +1 and -1.
- If you did not generate the expression data,
check whether preprocessing steps have already been taken before
running the PreprocessDataset module.
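The threshold/ceiling, log base 2, and fold-change ideas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the PreprocessDataset implementation: the function names, the threshold of 20, the ceiling of 16000, and the max/min definition of fold change are all assumptions made for the example.

```python
import math

def clamp(value, threshold, ceiling):
    """Reset values below the threshold or above the ceiling to that bound."""
    return max(threshold, min(value, ceiling))

def passes_fold_change(row, min_fold):
    """Keep a gene (row) only if max/min across samples reaches min_fold.

    Assumes the row has already been clamped to positive values.
    """
    return max(row) / min(row) >= min_fold

# Hypothetical threshold and ceiling values, chosen only for illustration.
print(clamp(5, 20, 16000))      # 20
print(clamp(20000, 20, 16000))  # 16000

# Ratios of 2 and 0.5 land on a symmetric scale after the log2 transform.
log_ratios = [math.log2(r) for r in (2.0, 0.5)]
print(log_ratios)  # [1.0, -1.0]

# A gene that varies five-fold passes a three-fold filter; a two-fold one does not.
print(passes_fold_change([20, 100], 3))  # True
print(passes_fold_change([20, 40], 3))   # False
```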
Step 2: KNNXValidation
KNNXValidation runs KNN class prediction iteratively against a known data set. For each iteration, it leaves one sample out, builds the classifier using the remaining samples, and then tests the classifier on the sample left out. It creates two files:
- a prediction results file, which shows the accuracy of the classifiers
- a features results file, which lists every gene used in any classifier and the number of times that gene was used in a classifier
Choose the best parameter settings for the KNN class prediction method by running KNNXValidation with different parameter
values. For example, set the num features parameter to 10, 20 and 30. Choose the parameter values that generate the most accurate classifier.
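The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration, not the KNNXValidation module itself: it assumes Euclidean distance and an unweighted majority vote, it varies the number of neighbors (called k here) rather than the number of features, and the six two-gene samples are made up.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_samples, train_labels, query, k):
    """Majority vote among the k training samples closest to the query."""
    order = sorted(range(len(train_samples)),
                   key=lambda i: euclidean(train_samples[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(samples, labels, k):
    """Hold out each sample in turn, classify it from the rest, count hits."""
    correct = 0
    for i in range(len(samples)):
        rest_samples = samples[:i] + samples[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(rest_samples, rest_labels, samples[i], k) == labels[i]:
            correct += 1
    return correct / len(samples)

# Made-up data: three samples per class, two genes per sample.
samples = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
           [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
labels = ["A", "A", "A", "B", "B", "B"]

for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(samples, labels, k))
# 1 1.0
# 3 1.0
# 5 0.0
```

With k set to 5, every held-out sample is outvoted by the other class (two neighbors of its own class against three of the other), which is the kind of effect that comparing several parameter values is meant to expose.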
Step 3: View results
To view the prediction results file (*.pred.odf), use the PredictionResultsViewer module.
The viewer lists each sample with its actual and predicted class. Error rates for class predictions are averaged across all iterations.
To view the features results file (*.feat.odf), use the FeatureSummaryViewer module. The viewer lists each gene used in a class predictor and the number of times it was used in a predictor.
Considerations
- The PredictionResultsViewer provides an absolute error rate (incorrect
cases/total cases) and an ROC error rate (fraction of true positives
versus the fraction of false positives). Use the ROC error rate for comparing
results across data sets.
- In the FeatureSummaryViewer, the most interesting genes are generally
those used by all (or most) of the classifiers. To retrieve gene annotations
from a variety of public databases, click the GeneCruiser menu item.
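The quantities named above can be computed by hand from a list of actual and predicted classes. The sketch below is an assumption-laden illustration: it computes the absolute error rate as defined above and the two fractions that an ROC-based measure is built from, but it does not reproduce how the viewer combines them into a single ROC error rate. The function names, class labels, and gene names are hypothetical.

```python
from collections import Counter

def absolute_error_rate(actual, predicted):
    """Incorrect cases divided by total cases."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

def roc_fractions(actual, predicted, positive):
    """Return (true positive fraction, false positive fraction) for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp / (tp + fn), fp / (fp + tn)

actual    = ["A", "A", "A", "B", "B"]
predicted = ["A", "A", "B", "B", "A"]
print(absolute_error_rate(actual, predicted))  # 0.4
print(roc_fractions(actual, predicted, "A"))

# Counting how often each gene appears across cross-validation iterations.
features_per_iteration = [["g1", "g2"], ["g1", "g3"], ["g1", "g2"]]
counts = Counter(g for genes in features_per_iteration for g in genes)
used_by_all = [g for g, c in counts.items() if c == len(features_per_iteration)]
print(used_by_all)  # ['g1']
```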
Step 4: KNN
The KNN module builds and/or tests a classifier by running the KNN class prediction method:
- To build a classifier, specify the training data set. The module
creates a classifier (*.knn.model).
- To test a previously built classifier, specify the classifier (*.knn.model) and
the test data set. The module creates a
prediction results file (*.pred.odf) that assesses the accuracy of the predictor.
- To build and test a classifier, specify both the training and test
data sets. The module creates a classifier and a prediction results file.
Considerations
- When building a KNN classifier, use the "best parameter settings" for the KNN class prediction method as determined by running KNNXValidation (Step 2).
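The separation between building a classifier, saving it, and testing it later can be sketched as below. This is a toy under stated assumptions: the real *.knn.model format is not described here, so the example stores the model as JSON purely for illustration, and the function names, file name, and data are hypothetical.

```python
import json
import math
import os
import tempfile
from collections import Counter

def build_model(train_samples, train_labels, k):
    """For KNN, the 'model' is the stored training profiles plus k."""
    return {"k": k, "profiles": train_samples, "labels": train_labels}

def predict(model, sample):
    """Majority vote among the k stored profiles nearest to the sample."""
    nearest = sorted(
        (math.dist(profile, sample), label)
        for profile, label in zip(model["profiles"], model["labels"])
    )
    votes = Counter(label for _, label in nearest[: model["k"]])
    return votes.most_common(1)[0][0]

def evaluate_model(model, test_samples, test_labels):
    """Return (actual, predicted) pairs, one per test sample."""
    return [(actual, predict(model, s))
            for s, actual in zip(test_samples, test_labels)]

# Build from training data and save the model to a file.
model = build_model([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]],
                    ["A", "A", "B", "B"], k=3)
path = os.path.join(tempfile.mkdtemp(), "example_model.json")
with open(path, "w") as f:
    json.dump(model, f)

# Later: reload the saved model and test it on a separate data set.
with open(path) as f:
    reloaded = json.load(f)
results = evaluate_model(reloaded, [[0.9, 1.1], [4.9, 5.1]], ["A", "B"])
print(results)  # [('A', 'A'), ('B', 'B')]
```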
Step 5: View results
To view the prediction results file (*.pred.odf), use the PredictionResultsViewer module.
The viewer lists each sample with its actual and predicted class.
To view the model file (*.knn.model), click it. The model file contains the classifier (or model) created from the training data set.
Considerations
- The PredictionResultsViewer provides an absolute error rate (incorrect
cases/total cases) and an ROC error rate (fraction of true positives
versus the fraction of false positives). Use the ROC error rate for comparing
results across data sets.
- The model file lists the gene expression profiles that can be used to
classify unknown samples.
Step 6: Determine the class of an unknown sample
To classify unknown samples using the KNN module:
- Use the saved model filename parameter to specify a previously
generated classifier (*.knn.model file).
- Use the test filename parameter to specify an expression data set
that contains the unknown samples.
- The test class filename is a required parameter that specifies the
class of each sample in the expression data set. For the unknown samples,
create a class file that assigns some class (for example, "unknown") to each
sample.
The module uses the classifier to predict the class
of each unknown sample and creates a prediction results file. Use the PredictionResultsViewer module to view the prediction results (*.pred.odf) file:
- The viewer lists each sample with its actual and predicted class.
- Ignore the actual class, which was unknown.
- Ignore the error rates, which are computed as if the placeholder classes were the known classes.
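A placeholder class file for the unknown samples can be generated with a short script. The sketch below assumes the three-line CLS layout (a header with the sample count, the class count, and the number 1; a line of class names starting with "#"; and a line of per-sample class indices). That layout is not described on this page, so check it against the GenePattern file format documentation before relying on it. The function name is hypothetical.

```python
def placeholder_cls(num_samples, class_name="unknown"):
    """Build class-file text that assigns every sample to one placeholder class.

    Assumed layout (verify against the file format documentation):
      line 1: <number of samples> <number of classes> 1
      line 2: # <class names>
      line 3: one class index per sample
    """
    header = f"{num_samples} 1 1"
    names = f"# {class_name}"
    assignments = " ".join(["0"] * num_samples)
    return "\n".join([header, names, assignments]) + "\n"

print(placeholder_cls(3), end="")
# 3 1 1
# # unknown
# 0 0 0
```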