| GEMS performs biclustering (also referred to as co-clustering, two-way clustering, projective clustering, or block clustering) on microarray data to detect gene expression biclusters. The genes in these biclusters are candidates for being functionally related or co-regulated by common transcription factors. The sample subsets selected by the algorithm can potentially suggest sub-classes of diseases and can serve as a diagnostic tool.
If you use GEMS in your research, please acknowledge the following paper: |
| Genes are expressed differently across a range of conditions or cell types. Co-expressed genes are more likely to have similar functions. However, genes rarely exhibit similar expression patterns across a wide range of conditions. Biclustering of gene expression data is a promising methodology for identifying gene groups that show a coherent expression profile across a subset of conditions. A gene expression bicluster is defined by such a coherent gene group together with its subset of conditions. |
| Mining subsets of genes within subsets of samples is a non-trivial task. We have proposed a heuristic algorithm (GEMS pdf) based on the Gibbs sampling paradigm to solve this problem.
In the GEMS algorithm, the mathematical criteria that define a gene expression bicluster are shown below: |
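The criteria themselves are not reproduced in this text. As a hedged reconstruction based only on the parameter descriptions elsewhere on this page, a bicluster can be written as a gene set $G$ and a sample set $S$ satisfying the conditions below. The symbols $x_{gs}$ (expression value of gene $g$ in sample $s$), $n$ (total number of samples), and $u_g$ (weighting score of gene $g$, equal to 1 when no weighting file is given) are assumptions introduced here, and the published definition may differ in detail.

```latex
\begin{align*}
|S| &\ge \alpha\, n, \\
\max_{s \in S} x_{gs} - \min_{s \in S} x_{gs} &< w \quad \text{for every } g \in G, \\
\mathrm{score}(G, S) &= \sum_{g \in G} u_g .
\end{align*}
```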
| The Gene Expression Mining Server uses the GEMS algorithm to extract one or more biclusters from user-uploaded gene expression data. Users can specify different criteria and parameters, and the server will try to find the biclusters with the highest weighting scores or numbers of genes. The results are sent to users by email.
In addition to bicluster extraction, users may want to perform a permutation test on the given expression data to see how significant a bicluster score is. The GEMS server lets users choose whether ordinary bicluster extraction or a permutation test should be performed. Users may query and download the report files from the website. The uploaded files and all reports are cleared 48 hours after task completion. |
|
For every bicluster extracted, three files are generated. These files can be downloaded separately, or users can download a zipped package containing all of them.
|
| A valid email address is required to submit a job, for two reasons. First, the completion notification and the reports are sent to users by email. Second, it prevents malicious users from accessing others' data by guessing the accession number. | |
| The format of the expression data file is similar to the one commonly used for gene expression data sets. The file is tab-delimited plain text. The first row specifies the names of the first two columns, followed by all the sample names. Every row from the second onward corresponds to a gene and contains the gene identifier, a description, and all the expression values for that gene. The current version of the server accepts an array file with up to 50,000 genes and 512 samples. A sample array file can be downloaded here. Below is a simple example of the format.
|
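As an illustration of the layout described above, the following minimal Python sketch reads such a file. The function name `read_expression_file` and the three-sample data are hypothetical and made up for this example; they are not the sample file distributed with GEMS.

```python
import csv
import io

def read_expression_file(handle):
    """Parse a tab-delimited array file: one header row, then one gene per row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_names = header[2:]          # first two columns: identifier, description
    genes = []
    for row in reader:
        if not row:
            continue                   # skip blank lines
        gene_id, description = row[0], row[1]
        values = []
        for cell in row[2:]:
            try:
                values.append(float(cell))
            except ValueError:
                values.append(None)    # non-numeric cells are treated as missing
        genes.append((gene_id, description, values))
    return sample_names, genes

# Hypothetical three-sample file, invented for illustration only.
example = (
    "ID\tDescription\tS1\tS2\tS3\n"
    "g1\tgene one\t1.0\t2.5\tNA\n"
    "g2\tgene two\t0.3\t0.4\t0.5\n"
)
sample_names, genes = read_expression_file(io.StringIO(example))
print(sample_names)
print(genes[0])
```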
| A bicluster containing only a few conditions will be over-specific. Users should specify the minimal number of conditions included in an expression bicluster. This is the alpha value specified in the mathematical criteria. The value should be between 0 and 1; however, if a larger positive number is given, it is used as the minimum number of conditions required in a bicluster. | |
| A gene is defined as consistently expressed across a range of conditions if its expression values across the condition subset fall within a small range. That is, the difference between the maximal and minimal values of the gene over the sample subset must be less than a width constraint. This is the w value specified in the mathematical criteria. The value should be a positive real number. | |
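The width test described above can be sketched in a few lines of Python. The function name `is_consistent` and the example values are assumptions for illustration, and missing values are not handled in this sketch.

```python
def is_consistent(values, selected, w):
    """True if max - min of the gene's values over the selected samples is below w."""
    subset = [values[i] for i in selected]
    return max(subset) - min(subset) < w

gene = [1.0, 1.2, 5.0, 0.9]          # hypothetical expression values for one gene
narrow = is_consistent(gene, [0, 1, 3], w=0.5)      # range is about 0.3
wide = is_consistent(gene, [0, 1, 2, 3], w=0.5)     # range is about 4.1
print(narrow, wide)
```

Dropping the third sample is what brings this gene within the width constraint, which is why the choice of sample subset and the choice of genes have to be searched together.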
| Every gene can be assigned a weighting score. The score of an expression bicluster is the sum of the weighting scores of the genes included in it. GEMS will try to find the biclusters with the highest scores. If no weighting file is uploaded, the weighting of every gene is 1.
Gene weighting is useful in several situations. In a file containing both cDNA array data and protein array data, cDNA records usually outnumber protein records; assigning different weightings helps keep the results from being biased toward the cDNA side. Furthermore, if the weighting scores are based on a functional category, the biclusters with the highest scores can be considered the most likely to be related to that category. |
|
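The scoring rule above amounts to a weighted count of genes. A minimal sketch follows; the function name `bicluster_score`, the gene names, and the weights are hypothetical, and giving a default weight of 1 to genes absent from the weighting table is an assumption of this sketch.

```python
def bicluster_score(gene_ids, weights=None):
    """Sum the weighting scores of the genes in a bicluster (default weight 1)."""
    weights = weights or {}
    return sum(weights.get(gene, 1.0) for gene in gene_ids)

members = ["cdna_1", "cdna_2", "cdna_3", "protein_1"]
unweighted = bicluster_score(members)
# Hypothetical weights that down-weight the more numerous cDNA records.
weighted = bicluster_score(
    members, {"cdna_1": 0.5, "cdna_2": 0.5, "cdna_3": 0.5, "protein_1": 2.0}
)
print(unweighted, weighted)
```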
| The weighting file is a plain text file. The first line is a header. Every other line corresponds to a gene. Each weighting score should be a non-negative real number.
A sample weighting file can be downloaded here. |
|
| In a semi-supervised analysis, the labels of some conditions are known and the others are unknown. Associating unknown samples with known ones can help with classification. The current version of GEMS allows users to specify a set of samples in the array file as seeds. These seed samples are kept in the bicluster throughout the Gibbs sampling iterations. | |
| The class labeling file is a plain text file. The first line is a header. Every other line corresponds to a sample. The class is 1 if the sample is a seed (forced to stay in biclusters), and 0 if the sample is a candidate. Any sample with a class label other than 0 or 1 is ignored during bicluster mining.
A sample class labeling file can be downloaded here. The output sample vector file (1=selected, 0=unselected) can be used as the class labeling file for a subsequent search. In this scenario, the first column of the file, which contains the sample names, is ignored. In this way, a bicluster extracted under stricter criteria can be expanded into a larger bicluster when the size or width constraint is increased. |
|
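The labeling rules above can be sketched as follows. The tab delimiter, the two-column layout (sample name, class), and the function name `split_by_class` are assumptions made for this illustration; the page only states that the file is plain text with a header line.

```python
def split_by_class(lines):
    """Split samples into seeds (class 1) and candidates (class 0); ignore the rest."""
    seeds, candidates = [], []
    for line in lines[1:]:                      # first line is a header
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue                            # skip blank or malformed lines
        name, label = parts[0], parts[1].strip()
        if label == "1":
            seeds.append(name)
        elif label == "0":
            candidates.append(name)
        # any other label: the sample is ignored during mining
    return seeds, candidates

# Hypothetical labeling file contents.
label_lines = ["Sample\tClass", "s1\t1", "s2\t0", "s3\t2", "s4\t0"]
seeds, candidates = split_by_class(label_lines)
print(seeds, candidates)
```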
| An expression bicluster that contains only one or two genes provides little information. Users can set a lower limit on the sum of weighting scores (or the number of genes) that a bicluster must reach to be reported. The value should be a non-negative real number. | |
| The speed of the program is determined by the length of the lag period of the Gibbs sampling iteration. A longer lag period takes more time but is more likely to reach a globally optimal result. The running speed is an integer between 1 and 9, where 9 is the fastest and 1 is the slowest. The default speed is 5. | |
| By default, the seed of the random number generator is fixed to ensure that previous results are reproducible. However, users can specify a different seed to simulate multiple starts of the heuristic search. | |
| Sometimes there are missing values in the array file. GEMS treats all non-numeric values as missing. It can also be useful to designate specific numeric values as missing data. For example, in a binary array file (only 0 or 1), declaring 0 to be missing causes a submatrix containing only 1s to be extracted.
All values that represent missing data can be entered as a comma-separated list, for example: "99.9, 999.9, -1234.5". |
|
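The missing-value handling above can be sketched as follows. The function names are hypothetical, and comparing markers by numeric value (rather than by their text) is an assumption of this sketch.

```python
def parse_missing_markers(text):
    """Turn a comma-separated string such as '99.9, 999.9' into a set of floats."""
    return {float(token) for token in text.split(",") if token.strip()}

def parse_cell(cell, missing_markers):
    """Return the numeric value of a cell, or None if it counts as missing."""
    try:
        value = float(cell)
    except ValueError:
        return None                      # non-numeric text is always missing
    return None if value in missing_markers else value

markers = parse_missing_markers("99.9, 999.9, -1234.5")
row = [parse_cell(c, markers) for c in ["1.5", "99.9", "NA", "-1234.5", "0"]]
print(row)
```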
| In an ordinary bicluster extraction task, users can specify the number of biclusters desired, and GEMS will try to extract multiple biclusters. However, if no more biclusters meet the criteria, the number of biclusters reported may be below the desired number, especially when only unique biclusters are requested. The default setting is to find a single expression bicluster.
After bicluster extraction, users can further choose to perform permutation tests. The expression values in each row are randomly shuffled (except in the columns corresponding to seed samples), and each iteration generates one bicluster, even if its number of genes is zero. Users can specify the number of permutations. GEMS outputs a file containing the number of genes (and the sum of weighting scores) in the biclusters found in the permuted data. |
|
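The row-wise shuffling described above can be sketched in Python. The function names are hypothetical. The empirical p-value helper is not something this page says GEMS computes; it is included only as one common way a user might summarize the permutation output, under that stated assumption.

```python
import random

def permute_rows(matrix, seed_columns=(), rng=None):
    """Shuffle each row independently, leaving seed-sample columns in place."""
    rng = rng or random.Random()
    fixed = set(seed_columns)
    free = [j for j in range(len(matrix[0])) if j not in fixed]
    permuted = []
    for row in matrix:
        shuffled = [row[j] for j in free]
        rng.shuffle(shuffled)
        new_row = list(row)
        for j, value in zip(free, shuffled):
            new_row[j] = value
        permuted.append(new_row)
    return permuted

def empirical_p_value(observed, permuted_scores):
    """Share of permuted scores at least as large as the observed score."""
    hits = sum(1 for score in permuted_scores if score >= observed)
    return (hits + 1) / (len(permuted_scores) + 1)

data = [[1, 2, 3, 4], [5, 6, 7, 8]]      # hypothetical expression matrix
shuffled = permute_rows(data, seed_columns=[0], rng=random.Random(0))
p_value = empirical_p_value(10, [1, 2, 3, 12])
print(shuffled, p_value)
```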
| If multiple biclusters are to be extracted from an array file, the algorithm may find the same bicluster more than once. Such repeats provide no additional information. Users may choose to report only unique biclusters, or to report all biclusters, including redundant ones, for statistical use. | |
| If multiple biclusters are to be extracted from an array file, we may want to obtain non-overlapping biclusters. In that case, previously extracted biclusters should be masked. Three masking methods are available. The first masks the selected genes only on the selected conditions. The second masks the selected conditions. The third masks the selected genes.
Masked genes or conditions are excluded from the search space and will not be included in subsequently extracted biclusters. Please note that seed samples are never masked. |
|
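The three masking methods can be pictured as operations on a boolean grid of searchable cells. The sketch below is an illustration only: the function name `apply_mask` and the grid representation are assumptions, and so is the choice to protect seed columns in methods 1 and 2 while method 3 removes whole gene rows.

```python
def apply_mask(available, genes, conditions, method, seed_columns=()):
    """Return a copy of `available` with an extracted bicluster masked out.

    available[i][j] is True while cell (gene i, condition j) is searchable.
    """
    n_rows, n_cols = len(available), len(available[0])
    protected = set(seed_columns)
    if method == 1:      # selected genes, on the selected conditions only
        cells = [(i, j) for i in genes for j in conditions if j not in protected]
    elif method == 2:    # selected conditions, for every gene
        cells = [(i, j) for i in range(n_rows)
                 for j in conditions if j not in protected]
    elif method == 3:    # selected genes, for every condition
        cells = [(i, j) for i in genes for j in range(n_cols)]
    else:
        raise ValueError("method must be 1, 2, or 3")
    masked = [list(row) for row in available]
    for i, j in cells:
        masked[i][j] = False
    return masked

grid = [[True] * 3 for _ in range(3)]
m1 = apply_mask(grid, genes=[0], conditions=[1, 2], method=1)
m2 = apply_mask(grid, genes=[0], conditions=[1, 2], method=2, seed_columns=[2])
m3 = apply_mask(grid, genes=[0], conditions=[1, 2], method=3)
print(m1, m2, m3, sep="\n")
```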
| mRNA expression values below some threshold may be considered random noise rather than signal. Users can set a threshold to filter out genes that have low expression values in all samples. Currently the threshold can be either 50 or 100. | |
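The filter described above keeps a gene as long as at least one sample reaches the threshold. A minimal sketch, with a hypothetical function name and data; whether the comparison is strict or inclusive is not stated on this page, so the inclusive form used here is an assumption.

```python
def filter_low_expression(genes, threshold):
    """Keep genes that have at least one value at or above the threshold."""
    return {
        gene: values
        for gene, values in genes.items()
        if any(value >= threshold for value in values)
    }

expression = {"a": [10, 20, 30], "b": [10, 120, 30], "c": [60, 70, 80]}
kept_50 = filter_low_expression(expression, 50)
kept_100 = filter_low_expression(expression, 100)
print(sorted(kept_50), sorted(kept_100))
```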
| Different genes typically have a wide range of expression levels and variances. It may be desirable to normalize all genes to a fixed range or to equal variance for better analysis. Users can choose to project the expression values of each gene onto a fixed range between 0 and 1, or to normalize the values of each gene to a distribution with mean=0 and variance=1. | |
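The two per-gene normalization options can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch rather than documented GEMS behavior: the use of the population (not sample) standard deviation, and mapping a constant gene to all zeros.

```python
import statistics

def scale_to_unit_range(values):
    """Project one gene's values linearly onto the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)       # constant gene: no spread to rescale
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Shift and scale one gene's values to mean 0 and variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

unit = scale_to_unit_range([2.0, 4.0, 6.0])
z = standardize([1.0, 2.0, 3.0])
print(unit, z)
```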
| GEMS is open source software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. GEMS is still under active development, so we would appreciate comments from users. If you have any comments or questions, please contact Chang-Jiun Wu.
If you use GEMS in your research, please acknowledge the following paper: |
|
| This work is supported in part by NSF grants DBI-0239435 and ITR-048715 and NHGRI grant #1R33HG002850-01A1. | |