Class Prediction
Using a data set that contains known samples, create a model
(also referred to as a class predictor or classifier)
that can be used to predict the class of a previously unknown sample.
Click the desired algorithm
- CART (Breiman et al., 1984)
builds Classification And Regression Trees.
It works by recursively splitting the
feature space into a set of non-overlapping regions and then predicting
the most likely value of the dependent variable within each region. A
classification (or regression) tree represents the set of nested if-then conditions
used to predict a categorical dependent (or continuous dependent) variable
based on the observed values of the feature variables. CART is vulnerable to
overfitting and therefore not commonly used with microarray data.
- K-nearest-neighbors (KNN) classifies an unknown sample by assigning
it the phenotype label most frequently represented among the k nearest
known samples (Golub and Slonim et al., 1999). In GenePattern, an analyst can
select a weighting factor for the 'votes' of the nearest neighbors. For example, one
might weight the votes by the reciprocal of the distance between neighbors.
- Probabilistic Neural Network (PNN) calculates the probability
that an unknown sample belongs to a given set of known phenotype classes
(Lu et al., 2005; Specht, 1990). The contribution of each known sample
to the phenotype class of the unknown sample follows a Gaussian distribution.
PNN can be considered as a Gaussian-weighted KNN classifier - known samples
close to the unknown sample have a greater influence on the predicted class
of the unknown sample.
PNN is not on the GenePattern public server. The PNN modules require the Windows operating system. To use PNN, install the GenePattern server and the PNN modules on a Windows machine.
- Support Vector Machines (SVM) is designed for multiple class classification
(Rifkin et al., 2003). The algorithm creates a binary SVM classifier for
each class by computing a maximal margin hyperplane that separates the
given class from all other classes; that is, the hyperplane with maximal
distance to the nearest data point. The binary classifiers are then combined
into a multiclass classfier. For an unknown sample, the assigned class
is the one with the largest margin.
- Weighted Voting (Slonim et al., 2000) classifies an unknown sample
using a simple weighted voting scheme. Each gene in the classifier 'votes'
for the phenotype class of the unknown sample. A gene's vote is weighted
by how closely its expression correlates with the differentiation between
phenotype classes in the training data set.
References
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. 1984.
Classification and regression trees. Wadsworth & Brooks/Cole Advanced
Books & Software, Monterey, CA.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science 286:531-537.
Lu, J., Getz, G., Miska, E.A., Alvarez-Saavedra, E., Lamb, J., Peck, D., Sweet-Cordero, A., Ebert, B.L., Mak, R.H., Ferrando, A.A, Downing, J.R., Jacks, T., Horvitz, H.R., Golub, T.R. 2005. MicroRNA expression profiles classify human cancers. Nature 435:834-838.
Rifkin, R., Mukherjee, S., Tamayo, P., Ramaswamy, S., Yeang, C-H, Angelo, M., Reich, M., Poggio, T., Lander, E.S., Golub, T.R., Mesirov, J.P. 2003. An Analytical Method for Multiclass Molecular Cancer Classification. SIAM Review 45(4):706-723.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub, T.R., Lander, E.S. 2000. Class prediction and discovery using gene expression data. In Proceedings of the Fourth Annual International Conference on Computational Molecular Biology (RECOMB). ACM Press, New York. pp. 263-272.
Specht, D. F. 1990. Probabilistic Neural Networks. Neural Networks 3(1):109-118. Elsevier Science Ltd., St. Louis.