Rankgene: a program to rank genes from expression data
Yang Su, T. M. Murali, Vladimir Pavlovic, Mike Schaffer, and Simon
Kasif
Computational Genomics Laboratory
Bioinformatics Graduate Program, Boston University
Update
Rankgene version 1.1 is now available. The new version
accepts expression data sets in two formats: a standard
Affymetrix-like gene expression data format and the original OC1
format. Check the README file in the package for details.
Introduction
Rankgene is a program for analyzing gene expression data, selecting
features, and ranking genes by the predictive power of each gene to
classify samples into functional or disease categories.
A paper describing Rankgene will be
published in Bioinformatics.
One useful feature of this program is that the user can choose among
eight different ranking criteria. Rankgene uses the six measures of
predictability adopted from the popular OC1
decision tree software developed at Johns Hopkins University. In
addition, we provide the traditional t-test and a novel, efficient
implementation of a one-dimensional support vector machine (SVM) as two
new options.
The Rankgene program can be used as a feature selection program to
select or rank genes based on their relative predictive power in
classification of gene expression data. The input to Rankgene is a
gene expression data set consisting of a set of samples, the expression
levels of all the genes across these samples, and the class label for
each sample. A typical example would be gene expression measurements
from normal and cancerous tissues. Rankgene analyzes the expression
values of each gene and ranks the genes by their capability to
distinguish between the classes.
Download and install
Click the link to download: rankgene-1.1.tar.gz
After downloading, run the following commands:
$ gunzip rankgene-1.1.tar.gz
$ tar -xvf rankgene-1.1.tar
$ cd rankgene-1.1
$ make
Now the rankgene program should be ready to run.
Supported operating systems and compilers
| Operating System | Compiler |
| Linux (Redhat 7.2/7.3) | gcc 2.96 |
| Linux (Redhat 8.0) | gcc 3.2 |
Note that the program should compile on most Linux/Unix systems.
Please let us know if you have problems or suggestions when compiling
the program.
Usage
Input
Rankgene accepts input files containing gene expression data in two
formats:
- Standard: This format is similar to the usual format used in gene
expression data sets. All lines are tab-delimited. The first row of
the file specifies the names of the first two columns followed by all
the sample names. Every other line corresponds to a gene. The line
contains the gene id, the gene name, and all the expression values for
that gene. You can indicate missing values either by
"?" or "NA". RankGene replaces missing values for a
gene with the average value of the other expression values for that
gene.
In this format, a separate file specifies the class labels. Each line
of this file contains a sample name and a class name, separated by a
tab. RankGene will ignore a sample whose class name is "NA".
The files in the data directory in the RankGene package
are examples of this format.
Note: We suggest that you use a format where the first column
in each line contains the gene id (gene accession number) and the
second column contains the gene name. RankGene recognises
some standard names for the gene name column (Gene Description and
Name) and the gene id column (Gene Accession Number, GID, and Image
Id). If you use different names for these columns, please let us know.
Note: RankGene does not accept data sets that contain information
apart from the gene expression values. For instance, some Affymetrix
data sets contain CALL values or p-values. Please remove these columns
from the data set before invoking RankGene.
- OC1: Each line corresponds to a sample or experimental condition.
The line contains the expression values of all the genes for that
sample followed by the class label of that sample. The elements of the
line are separated by commas. RankGene expects an expression value to
be a floating point number. Unlike the previous format, the class
label must be an integer.
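The two input formats described above can be related with a short script. The following is a minimal Python sketch, not part of RankGene: the function names `read_standard` and `to_oc1` and the tiny data set are hypothetical. It assumes that samples missing from the class file are ignored like "NA" samples, that the per-gene average is taken over the retained samples only, and that OC1 class labels may be numbered from 1; the text above does not specify these details.

```python
from statistics import mean

MISSING = {"?", "NA"}  # the two missing-value markers named in the text


def read_standard(expr_text, class_text):
    """Parse Standard-format text; return (sample names, class names, gene rows)."""
    classes = {}
    for line in class_text.splitlines():
        if line.strip():
            sample, cls = line.split("\t")[:2]
            classes[sample] = cls
    lines = [ln for ln in expr_text.splitlines() if ln.strip()]
    samples = lines[0].split("\t")[2:]  # first two columns are gene id and name
    # Assumption: a sample absent from the class file is dropped like an "NA" sample.
    keep = [j for j, s in enumerate(samples) if classes.get(s, "NA") != "NA"]
    rows = []
    for ln in lines[1:]:
        fields = ln.split("\t")
        raw = [fields[2:][j] for j in keep]
        known = [float(v) for v in raw if v not in MISSING]
        fill = mean(known) if known else 0.0  # assumption: 0.0 if every value is missing
        values = [fill if v in MISSING else float(v) for v in raw]
        rows.append((fields[0], fields[1], values))
    kept_samples = [samples[j] for j in keep]
    return kept_samples, [classes[s] for s in kept_samples], rows


def to_oc1(class_names, rows):
    """Emit one comma-separated line per sample, ending in an integer class label."""
    # Assumption: integer labels are assigned in sorted order, starting at 1.
    label_of = {c: i for i, c in enumerate(sorted(set(class_names)), start=1)}
    out = []
    for j, cls in enumerate(class_names):
        values = [str(vals[j]) for _, _, vals in rows]
        out.append(",".join(values + [str(label_of[cls])]))
    return out


# Hypothetical four-sample, two-gene data set.
EXPR = ("GID\tName\tS1\tS2\tS3\tS4\n"
        "G1\tgeneA\t1.0\t?\t3.0\t9.0\n"
        "G2\tgeneB\t2.5\t0.5\tNA\t7.0\n")
CLS = "S1\tALL\nS2\tAML\nS3\tALL\nS4\tNA\n"

samples, class_names, rows = read_standard(EXPR, CLS)
for line in to_oc1(class_names, rows):
    print(line)
```

In this sketch, sample S4 is dropped because its class name is "NA", and each missing value is replaced by the average of the remaining values of the same gene.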
Output
RankGene outputs a file containing the genes most predictive of the
sample classes based on the measure specified on the command line. The
number of genes is specified on the command line. The first few lines
of the output file contain some general information about RankGene,
the data contained in the input files, and the measure used. Each
succeeding line lists the index of a gene in the data file (indices
start at 1), the name of the gene, the id of the gene, and the value
of the measure for that gene. The genes are sorted in increasing order
of the value of the measure.
Command Line Options
RankGene accepts the following command-line options:
- -i : name of the file containing the input data.
- -c : name of the file containing the class labels
- -o : name of the file to output the results to.
- -n : number of genes to list in the output file. The default value
is 100.
- -m : the measure to use to rank genes. This number must
range between 1 and 8. The default value is 1. The correspondence
between this option and the measures is:
  1. Information gain.
  2. Twoing rule.
  3. Gini index.
  4. Sum minority.
  5. Max minority.
  6. Sum of variances.
  7. t-statistic.
  8. One dimensional SVM.
- -w : the weight parameter for the 1D SVM. This parameter
can be any positive number. Its default value is 1. This parameter is
used only when the -m option is 8.
- -R: specifies that the input file is in OC1 format. The class file
is ignored if this option is set.
- For example, to list the best 500 genes using the t-test and the input
files in the data directory:
./rankgene -m 7 -n 500 -o data/gene.list -i data/all-aml.txt -c data/all-aml-class.txt
- To do so for the one-dimensional SVM measure with a weight value of
10:
./rankgene -m 8 -w 10 -n 500 -o gene.list -i data/all-aml.txt -c data/all-aml-class.txt
Note:
- For the t-statistic, RankGene ranks genes in decreasing order of the
statistic. For each gene, it prints out two
values: the reciprocal of the t-statistic and the t-statistic
itself.
- For the one-dimensional SVM, RankGene also prints out the expression values
corresponding to the two support vectors and their corresponding classes.
Please refer to the README file in the same directory for more
information on usage.
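To produce one gene list per measure, the command line can be assembled in a small script. The following Python sketch is an illustration only: the helper name `rankgene_command`, the output file names, and the location `./rankgene` of the executable are assumptions. It uses only the options documented above (-m, -n, -o, -i, -c, -w) and takes the measure numbers 1 to 8 in the order listed for the -m option.

```python
import subprocess

MEASURES = {
    1: "Information gain", 2: "Twoing rule", 3: "Gini index",
    4: "Sum minority", 5: "Max minority", 6: "Sum of variances",
    7: "t-statistic", 8: "One dimensional SVM",
}


def rankgene_command(measure, data, classes, out, n_genes=100,
                     weight=None, exe="./rankgene"):
    """Build the argument list for one RankGene run (Standard input format)."""
    if measure not in MEASURES:
        raise ValueError("measure must be between 1 and 8")
    cmd = [exe, "-m", str(measure), "-n", str(n_genes),
           "-o", out, "-i", data, "-c", classes]
    # The -w weight is documented as used only by the 1D SVM (measure 8).
    if measure == 8 and weight is not None:
        cmd += ["-w", str(weight)]
    return cmd


def run_all(data, classes, n_genes=100, weight=None, dry_run=True):
    """Print (and, if dry_run is False, execute) one command per measure."""
    commands = []
    for m in MEASURES:
        out = f"gene-m{m}.list"  # hypothetical output file name
        cmd = rankgene_command(m, data, classes, out, n_genes, weight)
        commands.append(cmd)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


commands = run_all("data/all-aml.txt", "data/all-aml-class.txt",
                   n_genes=500, weight=10)
```

With `dry_run=True` the sketch only prints the commands, so it can be checked before anything is executed.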
License
Rankgene is available without a fee for all non-profit and academic
institutions.
Commercial users are required to obtain a license for Rankgene.
The license is required for any use of Rankgene in a profit-making
enterprise, and it gives the company all rights to any discoveries or
inventions made with Rankgene. There is a modest fee for this license.
Please contact Prof. Simon Kasif for details.
Rankgene is still under active development, so we would appreciate
comments from users.
Contact
If you have any questions, please contact Simon Kasif.
Rankgene supports eight measures for quantifying a gene's ability to
distinguish between classes. In the formulae below, k is the
total number of classes; n is the total number of expression
values; n_l (resp., n_r) is the
number of values in the left (resp., right) partition;
l_i (resp., r_i) is the number
of values that belong to class i in the left (resp., right)
partition; and c_i is the class of the ith sample.
- Information gain
- Twoing rule
- Sum minority
- Max Minority
- Gini index
- Sum of variances
- t-statistic: Rankgene
sorts the genes in decreasing order of the absolute value of the
t-statistic for each gene.
- One dimensional support vector machine (SVM): We
train an SVM on each gene's expression values. The gene's measure is
the function optimised by the SVM training algorithm. Standard SVM
training algorithms run in O(n^3) time, where
n is the number of training samples. We have developed and
implemented an algorithm for training one-dimensional SVMs with linear
kernels that runs in O(n log n) time. You can read about the details of this algorithm in this paper.
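As an illustration of ranking by the t-statistic, here is a minimal Python sketch. The text above does not say which form of the t-statistic Rankgene uses, so the sketch assumes Welch's two-sample statistic (unequal variances) and two classes; the function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t-statistic (assumed form; needs >= 2 values per class)."""
    # Note: raises ZeroDivisionError if both classes have zero variance.
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_by_abs_t(genes, labels):
    """Return (gene, t) pairs sorted by decreasing absolute t-statistic."""
    first, second = sorted(set(labels))  # assumes exactly two classes
    scores = []
    for name, values in genes.items():
        a = [v for v, c in zip(values, labels) if c == first]
        b = [v for v, c in zip(values, labels) if c == second]
        scores.append((name, welch_t(a, b)))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical data: three genes measured on six samples in two classes.
labels = [1, 1, 1, 2, 2, 2]
genes = {
    "geneA": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # well separated
    "geneB": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],  # no separation
    "geneC": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],  # weak separation
}
ranking = rank_by_abs_t(genes, labels)
for name, t in ranking:
    print(f"{name}\t{t:.3f}")
```

The well-separated gene comes first and the gene with identical class means comes last, which is the ordering by absolute value described for this measure.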
Each of the first six measures attempts to capture the best possible
reduction in uncertainty (analogous to an increase in predictability)
that we can obtain by dividing the full range of expression of a given
gene into two intervals (up-regulated, down-regulated).
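For reference, the six split-based measures are commonly defined in the decision tree literature as follows. This is a sketch of those standard definitions, not a statement of Rankgene's exact formulae; constant factors and sign conventions vary between implementations. Here k is the number of classes, n the number of values, n_l and n_r the sizes of the left and right partitions, l_i and r_i the number of class-i values in each partition, and c_j the (numeric) class of sample j. The shorthand p_i^l, p_i^r, H, m_l and m_r is introduced only for this block.

```latex
\begin{align*}
p_i^{l} &= \frac{l_i}{n_l}, \qquad p_i^{r} = \frac{r_i}{n_r}, \qquad
H(q) = -\sum_{i=1}^{k} q_i \log_2 q_i \\[4pt]
\text{Information gain} &= H\!\left(\frac{l_1 + r_1}{n}, \dots, \frac{l_k + r_k}{n}\right)
  - \frac{n_l}{n}\, H(p^{l}) - \frac{n_r}{n}\, H(p^{r}) \\[4pt]
\text{Twoing rule} &= \frac{n_l}{n} \cdot \frac{n_r}{n}
  \left( \sum_{i=1}^{k} \left| p_i^{l} - p_i^{r} \right| \right)^{2} \\[4pt]
\text{Gini index} &= \frac{n_l}{n} \left( 1 - \sum_{i=1}^{k} \bigl(p_i^{l}\bigr)^{2} \right)
  + \frac{n_r}{n} \left( 1 - \sum_{i=1}^{k} \bigl(p_i^{r}\bigr)^{2} \right) \\[4pt]
m_l &= n_l - \max_i l_i, \qquad m_r = n_r - \max_i r_i \\[4pt]
\text{Sum minority} &= m_l + m_r, \qquad \text{Max minority} = \max(m_l, m_r) \\[4pt]
\text{Sum of variances} &= \sum_{j \in \text{left}} \bigl(c_j - \bar{c}_l\bigr)^{2}
  + \sum_{j \in \text{right}} \bigl(c_j - \bar{c}_r\bigr)^{2},
  \qquad \bar{c}_l = \frac{1}{n_l} \sum_{j \in \text{left}} c_j, \quad
  \bar{c}_r = \frac{1}{n_r} \sum_{j \in \text{right}} c_j
\end{align*}
```

Information gain and the twoing rule are larger for better splits, while the Gini index, the two minority measures, and the sum of variances are smaller for better splits.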
For example, sum-minority is a simple rule where, for a given threshold,
we test the error obtained by predicting all samples below the
threshold to be in class one (e.g., normal) and above the threshold to be
in class two (e.g., cancer). The sum-minority rule counts the minority
class samples below and above the threshold as errors.
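The sum-minority rule can be sketched in a few lines of Python. This is an illustration under assumptions, not Rankgene's implementation: candidate thresholds are taken as midpoints between consecutive distinct expression values, and the function names and data are hypothetical.

```python
from collections import Counter


def minority(labels):
    """Number of samples that do not belong to the majority class."""
    if not labels:
        return 0
    return len(labels) - max(Counter(labels).values())


def best_sum_minority(values, labels):
    """Scan all thresholds; return (threshold, errors) with the fewest errors."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_errors = None, len(pairs) + 1
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold fits between equal values
        left = [c for _, c in pairs[:k]]
        right = [c for _, c in pairs[k:]]
        errors = minority(left) + minority(right)
        if errors < best_errors:  # the first of several equally good splits is kept
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_errors = errors
    return best_threshold, best_errors


# Hypothetical expression values of one gene in six samples.
values = [0.25, 0.5, 1.0, 1.5, 2.0, 2.5]
labels = ["normal", "normal", "cancer", "normal", "cancer", "cancer"]
threshold, errors = best_sum_minority(values, labels)
print(threshold, errors)
```

For this toy gene, the best threshold leaves a single misclassified sample: the "normal" sample with value 1.5 falls on the "cancer" side.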
For information gain, we use the reduction in class entropy that results
from partitioning the samples into two ranges (below/above
a single threshold) as a measure of predictability.
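The entropy reduction for a single gene can be computed as in the following Python sketch. It assumes base-2 entropy and candidate thresholds at midpoints between consecutive distinct values; the function names and data are hypothetical and the code is not taken from Rankgene.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Class entropy in bits of a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_information_gain(values, labels):
    """Return (threshold, gain) for the split with the largest entropy reduction."""
    pairs = sorted(zip(values, labels))
    classes = [c for _, c in pairs]
    n = len(pairs)
    parent = entropy(classes)
    best_threshold, best_gain = None, 0.0
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold fits between equal values
        left, right = classes[:k], classes[k:]
        # Entropy after the split, weighted by the size of each range.
        after = (k / n) * entropy(left) + ((n - k) / n) * entropy(right)
        gain = parent - after
        if gain > best_gain:
            best_threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best_gain = gain
    return best_threshold, best_gain


# Hypothetical gene that separates the two classes perfectly.
values = [1.0, 2.0, 3.0, 4.0]
labels = [1, 1, 2, 2]
threshold, gain = best_information_gain(values, labels)
print(threshold, gain)
```

With two balanced classes the entropy before the split is 1 bit, so a perfectly separating threshold yields the maximum possible gain of 1 bit.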
Comparisons of the measures
We have implemented some techniques for comparing and contrasting the
lists of predictive genes computed by each measure. We provide links
to the comparison results for some publicly available data sets.
Last modified: Mon Nov 18 10:01:00 EST 2002