Differential Expression Analysis

Find genes that are significantly differentially expressed between classes of samples.

Before you begin

Gene expression data must be in a GCT or RES file.
Example file: all_aml_test.gct.
The class of each sample must be identified in a CLS file.
Example file: all_aml_test.cls.
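
To make the two inputs concrete, here is a minimal Python sketch that parses a tiny in-memory GCT and CLS example. It assumes the tab-delimited GCT version 1.2 layout (version line, dimensions line, header row, then one row per gene) and the three-line CLS layout (counts, class names, per-sample labels). The helper names read_gct and read_cls and the toy data are hypothetical; they are not part of GenePattern.

```python
import io


def read_gct(handle):
    """Parse a GCT v1.2 stream into (sample_names, {gene_name: [values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected GCT version line: {version!r}")
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[2:]  # the first two columns are Name and Description
    data = {}
    for line in handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        data[fields[0]] = [float(v) for v in fields[2:]]
    if len(data) != n_rows or len(samples) != n_cols:
        raise ValueError("row/column counts do not match the dimensions line")
    return samples, data


def read_cls(handle):
    """Parse a categorical CLS stream into (class_names, per_sample_labels)."""
    n_samples, n_classes = (int(x) for x in handle.readline().split()[:2])
    class_names = handle.readline().lstrip("#").split()
    labels = handle.readline().split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("label/class counts do not match the first line")
    return class_names, labels


# Toy data for illustration only (not the contents of all_aml_test.*).
GCT_TEXT = (
    "#1.2\n"
    "2\t4\n"
    "Name\tDescription\ts1\ts2\ts3\ts4\n"
    "geneA\tna\t100\t120\t900\t950\n"
    "geneB\tna\t50\t55\t52\t49\n"
)
CLS_TEXT = "4 2 1\n# ALL AML\n0 0 1 1\n"

samples, expression = read_gct(io.StringIO(GCT_TEXT))
class_names, labels = read_cls(io.StringIO(CLS_TEXT))
print(samples, class_names, labels)
```

The order of labels in the CLS file is assumed to match the order of sample columns in the GCT file.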

learn more:
file formats

Step 1: PreprocessDataset

Preprocess gene expression data to remove platform noise and genes that have little variation.

Open module in the GenePattern window.
Open module with example data in the GenePattern window.

Considerations

PreprocessDataset can preprocess the data in one or more ways (in this order):
1. Set threshold and ceiling values. Any value lower than the threshold is reset to the threshold, and any value higher than the ceiling is reset to the ceiling.
2. Convert each expression value to the log base 2 of the value.
3. Remove a gene (row) if a given number of its sample values are less than a given threshold.
4. Remove genes (rows) that do not have a minimum fold change or expression variation.
5. Discretize or normalize the data.
ComparativeMarkerSelection expects non-log-transformed data. Some calculations, such as Fold Change, produce incorrect results on log-transformed data.
If you did not generate the expression data, check whether preprocessing steps have already been taken before running the PreprocessDataset module.
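
The threshold/ceiling and variation-filter steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the module's implementation: the cutoff values are arbitrary, the function names are hypothetical, and the variation filter is assumed to keep a gene only when it meets both the minimum fold change (max/min) and the minimum difference (max - min). The last lines show why fold change should not be computed as a ratio of log-transformed values.

```python
import math


def clamp(values, floor, ceiling):
    """Reset values below the floor to the floor and above the ceiling to the ceiling."""
    return [min(max(v, floor), ceiling) for v in values]


def passes_variation_filter(values, min_fold, min_delta):
    """Keep a gene only if max/min and max - min both reach their minimums (assumed rule)."""
    lo, hi = min(values), max(values)
    return hi / lo >= min_fold and hi - lo >= min_delta


# Illustrative cutoffs chosen for this sketch.
FLOOR, CEILING = 20.0, 20000.0
MIN_FOLD, MIN_DELTA = 3.0, 100.0

raw = {
    "geneA": [5.0, 120.0, 900.0, 25000.0],
    "geneB": [50.0, 55.0, 52.0, 49.0],
}

# Clamp first: the floor guarantees a positive minimum for the max/min ratio.
clamped = {gene: clamp(values, FLOOR, CEILING) for gene, values in raw.items()}
kept = {
    gene: values
    for gene, values in clamped.items()
    if passes_variation_filter(values, MIN_FOLD, MIN_DELTA)
}
print(sorted(kept))

# Fold change on linear values versus a ratio of log2 values.
linear_fold = 1024.0 / 256.0                       # a true four-fold change
log_ratio = math.log2(1024.0) / math.log2(256.0)   # ratio of logs: 10 / 8
print(linear_fold, log_ratio)
```

In the toy data, geneB varies by only a few units across samples, so it fails the variation filter, while geneA survives after its extreme values are clamped.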

learn more:
PreprocessDataset

Step 2: ComparativeMarkerSelection

ComparativeMarkerSelection computes differential gene expression. For each gene, it uses a test statistic to calculate the difference in gene expression between classes and then computes a p-value to estimate the significance of the test statistic score.

Because testing tens of thousands of genes simultaneously increases the possibility of mistakenly identifying a non-marker gene as a marker gene (a false positive), ComparativeMarkerSelection corrects for multiple hypothesis testing by computing both false discovery rates (FDR) and family-wise error rates (FWER).
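
As a rough illustration of how the two kinds of correction differ, the Python sketch below adjusts a small list of p-values with the Benjamini-Hochberg procedure (an FDR method) and the Bonferroni correction (an FWER method). These two are used here as common representatives; the sketch does not claim to reproduce the exact statistics the module reports, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni(p_values):
    """Return Bonferroni adjusted p-values (each multiplied by the test count, capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up p-values for five genes.
p_values = [0.001, 0.008, 0.012, 0.041, 0.6]
fdr = benjamini_hochberg(p_values)
fwer = bonferroni(p_values)

n_fdr = sum(value <= 0.05 for value in fdr)
n_fwer = sum(value <= 0.05 for value in fwer)
print("genes passing FDR cutoff:", n_fdr)
print("genes passing FWER cutoff:", n_fwer)
```

With the same 0.05 cutoff, the FDR adjustment lets more genes through than the FWER adjustment, which is the sense in which FWER is the more conservative criterion.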

Considerations

If the data set includes at least 10 samples per class, use the default value of 1000 permutations to ensure accurate p-values. If the data set includes fewer than 10 samples in any class, permuting the samples cannot give an accurate p-value; specify 0 permutations to use asymptotic p-values instead.
If the data set includes more than two classes, use the phenotype test parameter to analyze each class against all others (one-versus-all) or all class pairs (all pairs).
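
The idea behind permutation p-values, and why they need enough samples per class, can be sketched as follows. This Python example is a simplified stand-in: it uses a plain difference of class means as the test statistic and a two-sided comparison, which is an assumption for illustration and not necessarily the statistic or scheme the module uses. The function names and data are hypothetical.

```python
import math
import random
from statistics import mean


def mean_difference(values, labels):
    """Difference between the class-0 mean and the class-1 mean for one gene."""
    class0 = [v for v, label in zip(values, labels) if label == 0]
    class1 = [v for v, label in zip(values, labels) if label == 1]
    return mean(class0) - mean(class1)


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Two-sided permutation p-value obtained by shuffling the class labels."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


labels = [0] * 10 + [1] * 10
separated = list(range(100, 110)) + list(range(200, 210))  # clear class difference
flat = list(range(100, 110)) + list(range(100, 110))       # identical class means

p_separated = permutation_p_value(separated, labels)
p_flat = permutation_p_value(flat, labels)
print(p_separated, p_flat)

# Number of distinct ways to split the samples into two equal classes.
splits_small = math.comb(6, 3)    # 3 samples per class
splits_large = math.comb(20, 10)  # 10 samples per class
print(splits_small, splits_large)
```

With 3 samples per class there are only 20 distinct splits, so a permutation p-value cannot resolve small probabilities; with 10 per class there are 184756, which leaves room for 1000 random permutations to be informative.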

learn more:
ComparativeMarkerSelection

Step 3: ComparativeMarkerSelectionViewer

Run the ComparativeMarkerSelectionViewer module to view the results. The viewer displays the test statistic score, its p-value, two FDR statistics, and three FWER statistics for each gene.

Considerations

Generally, researchers identify marker genes based on FDR rather than the more conservative FWER.
Often, marker genes are identified based on an FDR cutoff value of 0.05, which indicates that about 1 in 20 (5%) of the genes identified as marker genes are expected to be false positives. Select Edit>Filter Features>Custom Filter to filter results based on that criterion (or any other).
Select File>Save Derived Dataset to create a GCT file that contains a subset of the expression data.
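
In the viewer these two actions are menu commands, but the underlying logic is simple enough to sketch in Python: keep the genes whose FDR is at or below the cutoff, then write only those rows to a GCT-formatted text. The function names, the per-gene FDR values, and the expression values are all hypothetical, and the output assumes the tab-delimited GCT version 1.2 layout.

```python
def filter_by_fdr(results, cutoff=0.05):
    """Return the names of genes whose FDR is at or below the cutoff."""
    return [gene for gene, fdr in results if fdr <= cutoff]


def to_gct(samples, expression, genes):
    """Render the selected genes as GCT v1.2 text (Description set to 'na')."""
    lines = [
        "#1.2",
        f"{len(genes)}\t{len(samples)}",
        "\t".join(["Name", "Description"] + samples),
    ]
    for gene in genes:
        values = "\t".join(str(v) for v in expression[gene])
        lines.append(f"{gene}\tna\t{values}")
    return "\n".join(lines) + "\n"


# Made-up per-gene FDR values and expression data.
results = [("geneA", 0.004), ("geneB", 0.31), ("geneC", 0.05)]
samples = ["s1", "s2"]
expression = {
    "geneA": [100.0, 900.0],
    "geneB": [50.0, 52.0],
    "geneC": [30.0, 400.0],
}

markers = filter_by_fdr(results, cutoff=0.05)
derived_gct = to_gct(samples, expression, markers)
print(derived_gct)
```

The derived text contains only the rows for the selected genes, which is the kind of subset that can be passed on to downstream analyses.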

learn more:
ComparativeMarkerSelectionViewer

Reference

Gould, J., Getz, G., Monti, S., Reich, M., and Mesirov, J.P. 2006. Comparative gene marker selection suite. Bioinformatics 22(15):1924-1925.