PLNT4610/PLNT7690 Bioinformatics
Lecture 12, part 3 of 3

C. Grouping genes with similar expression patterns

Mining gene array data is still a science under development. The most obvious approach is to classify genes into groups based on similarity in expression pattern. At the outset, even the criteria for similarity are subjective. A set of genes that behaves identically when comparing, for example, a "wild type" and a "mutant" in a given tissue, might show little similarity of expression if different stages of development were compared. One approach may be to simply test as many conditions as possible, and include all of them in the calculation of similarity. This has the drawback that we may "miss the trees for the forest". In the end, it seems that there may be no single approach to grouping genes by function. Rather, different groupings may be valid, depending on the biological context.

There are dozens of approaches for downstream analysis of microarray data. We will focus on two of them below.

1. Cluster analysis

Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein (1998)
Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA Vol. 95, Issue 25, 14863-14868.


Clustering genes into groups based on expression patterns requires a way to do pairwise comparisons of expression patterns. In a typical experiment, expression of any gene G will be measured in N different conditions, such as treatment, cell type, or time after treatment. The variation in G over N conditions is given by

$$\Phi_G = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(G_i - G_{offset}\right)^2}$$

where G_i is the log-transformed expression ratio of gene G in condition i. In experiments in which the expression is normalized to the mean expression for gene G, G_offset is the mean of G, which makes \Phi_G the standard deviation for G. However, in most gene array experiments, G_offset is set to 0, the log of a fluorescence ratio of 1, meaning the ratio that would be seen if no change was observed from one condition to the next. The similarity of gene expression patterns for any two genes X and Y can be expressed as a correlation coefficient

$$S(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i - X_{offset}}{\Phi_X}\right)\left(\frac{Y_i - Y_{offset}}{\Phi_Y}\right)$$
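As a minimal Python sketch of this similarity measure, assuming the definition used by Eisen et al. (1998): each profile is scaled by its root-mean-square deviation from the offset, and the scaled profiles are multiplied condition by condition and averaged. The function names and the example log ratios are hypothetical, chosen only for illustration.

```python
import math

def phi(g, offset=0.0):
    """Root-mean-square deviation of the values in g from the offset."""
    return math.sqrt(sum((v - offset) ** 2 for v in g) / len(g))

def similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Similarity S(X, Y) of two profiles measured over the same N conditions."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same N conditions")
    phi_x, phi_y = phi(x, x_offset), phi(y, y_offset)
    if phi_x == 0 or phi_y == 0:
        raise ValueError("a profile with no variation has no defined similarity")
    total = sum((a - x_offset) * (b - y_offset) for a, b in zip(x, y))
    return total / (len(x) * phi_x * phi_y)

# Hypothetical log ratios at four timepoints (offset 0 = no change)
gene_a = [1.0, 2.0, -1.0, 0.5]
gene_b = [2.0, 4.0, -2.0, 1.0]     # same shape as gene_a, twice the amplitude
gene_c = [-1.0, -2.0, 1.0, -0.5]   # mirror image of gene_a

print(similarity(gene_a, gene_b))  # 1.0
print(similarity(gene_a, gene_c))  # -1.0
```

With both offsets set to the profile means, the same function gives the ordinary Pearson correlation. With the offsets left at 0, a gene that is consistently up-regulated and one that is consistently down-regulated score as dissimilar even if their curves have the same shape.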

S(X,Y) can be used in any clustering method. For example, Eisen et al. have used the method of Sokal and Michener, similar to UPGMA, to produce dendrograms that cluster sets of genes together.
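To illustrate how a similarity score can drive clustering, here is a rough sketch of average-linkage agglomeration in Python. It is not Eisen et al.'s implementation: it assumes an offset of 0, scores two clusters by the mean of all pairwise gene similarities (UPGMA-style), and uses a naive search that is only practical for small gene sets. The gene names and values are made up.

```python
import math
from itertools import combinations

def similarity(x, y):
    """Uncentered correlation (offset 0) between two log-ratio profiles."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def average_linkage(profiles):
    """Repeatedly join the two most similar clusters.

    Returns the tree as nested tuples, plus the list of merges
    (left subtree, right subtree, similarity at which they were joined).
    """
    # Each cluster is (subtree, list of member gene names).
    clusters = [(name, [name]) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i][1] for b in clusters[j][1]]
            s = sum(similarity(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
            if best is None or s > best[0]:
                best = (s, i, j)
        s, i, j = best
        left, right = clusters[i], clusters[j]
        merges.append((left[0], right[0], s))
        merged = ((left[0], right[0]), left[1] + right[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0], merges

# Hypothetical log ratios for four genes at three timepoints
profiles = {
    "g1": [1.0, 2.0, -1.0],
    "g2": [2.0, 4.0, -2.2],
    "g3": [-1.0, -2.0, 1.5],
    "g4": [-2.0, -4.0, 2.0],
}
tree, merges = average_linkage(profiles)
print(tree)  # (('g1', 'g2'), ('g3', 'g4'))
```

The nested tuples play the role of the dendrogram: genes joined early (at high similarity) sit deep in the tree, and the last merge joins the two least similar groups.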

Eisen et al. examined 8600 human genes in cells grown in the presence or absence of serum. Genes whose expression changed by a factor of 3.0 or more in at least 2 timepoints were subjected to cluster analysis.
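A filter of this kind could be sketched as follows. The sketch assumes the data are log2 ratios and that a change "by a factor of 3.0" counts in either direction (up or down); both are assumptions, and the function name and sample values are hypothetical.

```python
import math

def passes_filter(log2_ratios, fold=3.0, min_timepoints=2):
    """True if the ratio differs from 1 by at least `fold` at enough timepoints."""
    threshold = math.log2(fold)  # a 3-fold change is about 1.585 on a log2 scale
    hits = sum(1 for r in log2_ratios if abs(r) >= threshold)
    return hits >= min_timepoints

# Hypothetical log2 ratios over four timepoints
genes = {
    "induced":   [0.1, 1.7, 2.0, 0.3],     # 3-fold or more at two timepoints
    "repressed": [-1.6, -2.0, 0.0, 0.0],   # 3-fold or more (down) at two timepoints
    "one_spike": [0.1, 1.7, 1.0, 0.3],     # 3-fold at only one timepoint
}
selected = [name for name, ratios in genes.items() if passes_filter(ratios)]
print(selected)  # ['induced', 'repressed']
```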
 
In the figure, each gene is represented in a row, while each condition (time) is represented in a column. Genes have been clustered so that genes with the most similar expression patterns are in nearby rows. The hierarchical relationships among genes in clusters are indicated by the tree.

GREEN - strong down-regulation at a given timepoint.

RED - strong up-regulation at a given timepoint.

BLACK - little or no difference between serum-treated and serum-starved cells.

Gene clusters: A - cholesterol biosynthesis; B - cell cycle; C - immediate-early response; D - signaling and angiogenesis; E - wound healing and tissue remodeling



2. Self-organizing maps

Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 96(6):2907-2912.

Törönen P, Kolehmainen M, Wong G, Castrén E (1999) Analysis of gene expression data using self-organizing maps. FEBS Letters 451(2):142-146.


Tamayo and coworkers have applied self-organizing maps (SOMs) to grouping gene expression data. In Figure 1, they illustrate very simple X,Y data, with the raw datapoints shown as black dots. Such a dataset might represent, for example, a measurement on a wild-type gene on the X-axis and a measurement on a mutant gene on the Y-axis. Each additional measurement adds a dimension, so a timecourse with n timepoints would be represented in n dimensions. Just looking at the datapoints, it appears that there are distinct groups.

The goal is to find a set of X,Y points that most closely approximate the mean value for each group of points. SOMs begin by arbitrarily creating a set of nodes (N) with randomly assigned values. In the example, a set of six nodes (1-6) is randomly placed in the X,Y space.

For each iteration of the algorithm, a datapoint P is chosen, and the position of each node is changed to move it closer to P. The closer a node N is to point P, the greater the distance it is moved towards P. This process is continued for thousands of iterations, until the total change is lower than some threshold.

The net result is that all nodes will be moved many times, but each node will "come to rest" in the vicinity of the set of datapoints to which it is closest.
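The procedure can be sketched in Python roughly as follows. This is an illustration under several assumptions, not the authors' code: nodes sit on a small rectangular grid, the node closest to P is moved the most, its grid neighbors are moved by an amount that falls off with grid distance, and both the step size and the neighborhood shrink over a fixed number of iterations (rather than stopping at a change threshold). The decay schedules, function names and test data are all hypothetical.

```python
import math
import random

def nearest_node(nodes, p):
    """Index of the node whose position is closest to datapoint p."""
    return min(range(len(nodes)), key=lambda n: math.dist(nodes[n], p))

def train_som(points, rows, cols, iterations=2000, seed=0):
    """Fit a rows x cols SOM to the datapoints; returns the final node positions."""
    rng = random.Random(seed)
    dim = len(points[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Start every node at a random position inside the range of the data.
    lo = [min(p[k] for p in points) for k in range(dim)]
    hi = [max(p[k] for p in points) for k in range(dim)]
    nodes = [[rng.uniform(lo[k], hi[k]) for k in range(dim)] for _ in grid]
    sigma0 = max(rows, cols) / 2.0
    for i in range(iterations):
        frac = i / iterations
        alpha = 0.5 * (1.0 - frac)            # step size decays toward 0
        sigma = sigma0 * (1.0 - frac) + 0.1   # neighborhood radius shrinks
        p = rng.choice(points)                # pick a datapoint P at random
        wr, wc = grid[nearest_node(nodes, p)]
        for n, (r, c) in enumerate(grid):
            d2 = (r - wr) ** 2 + (c - wc) ** 2            # squared grid distance to winner
            tau = alpha * math.exp(-d2 / (2.0 * sigma ** 2))
            nodes[n] = [v + tau * (pk - v) for v, pk in zip(nodes[n], p)]
    return nodes

# Hypothetical X,Y data: three loose groups of 30 points each
data_rng = random.Random(1)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3))
          for cx, cy in centers for _ in range(30)]

nodes = train_som(points, rows=2, cols=3)   # six nodes, as in the example
for n, (x, y) in enumerate(nodes, start=1):
    print(f"node {n}: ({x:.2f}, {y:.2f})")
```

After training, each datapoint can be assigned to a group by calling `nearest_node(nodes, p)`. The same code works unchanged for timecourses, since a profile with n timepoints is simply a point in n dimensions.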

For example, Tamayo et al. studied 6000 human genes in the myeloid leukemia cell line HL-60, in response to the phorbol ester PMA, which stimulates macrophage differentiation. 567 genes were shown to change significantly with the addition of PMA. Expression data were modeled onto a 3 x 4 array in which each node in the array had a randomly generated timecourse curve. Each iteration consisted of selecting an actual timecourse curve for a human gene, and modifying all 12 randomized curves to fit that timecourse. The curves most closely matching the data were modified to strongly resemble the data. Curves that were less closely related to the data to begin with underwent less modification. The 12 resultant curves are shown below:

The authors point out that, "An SOM based on a rectangular grid is analogous to an entomologist's specimen drawer, with adjacent compartments holding similar insects."

Timecourses from different treatments can be combined into a single SOM. The SOM below was constructed using four different timecourses from treated cells (left to right): HL-60 cells + PMA; U937 cells + PMA; NB4 cells + ATRA (all-trans-retinoic acid); Jurkat cells + PMA.


 

One of the potential drawbacks of SOMs is that an arbitrary number of categories must be chosen at the beginning of the analysis.

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada


 