PLNT4610/PLNT7690
Bioinformatics Lecture 12, part 3 of 3
There are dozens of approaches for downstream analysis of microarray data. We will focus on two of them below.
Clustering genes into groups based on expression patterns requires a way to do pairwise comparisons of expression patterns. In a typical experiment, the expression of any gene G will be measured in N different conditions, such as treatment, cell type, or time after treatment. Following Eisen et al., the variation in G over the N conditions is given by

$$\Phi_G = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(G_i - G_{offset}\right)^2}$$

where G_i is the expression measurement for gene G in condition i. In experiments in which the expression is normalized to the mean expression for gene G, G_offset is the mean of G, which makes Φ_G the standard deviation of G. However, in most gene array experiments, G_offset is set to 0, the log of a fluorescence ratio of 1, meaning the ratio that would be seen if no change was observed from one condition to the next. The similarity of gene expression patterns for any two genes X and Y can be expressed as a correlation coefficient

$$S(X,Y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i - X_{offset}}{\Phi_X}\right)\left(\frac{Y_i - Y_{offset}}{\Phi_Y}\right)$$
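As an illustration, here is a minimal Python sketch of the similarity score, assuming the form used by Eisen et al., in which each profile is scaled by its root-mean-square deviation from the chosen offset. The function names `phi` and `similarity` and the example values are hypothetical, chosen only for this sketch.

```python
import math

def phi(values, offset=0.0):
    """Root-mean-square deviation of the measurements from the offset."""
    return math.sqrt(sum((v - offset) ** 2 for v in values) / len(values))

def similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Similarity S(X, Y) between two expression profiles of equal length."""
    if len(x) != len(y):
        raise ValueError("both profiles must cover the same N conditions")
    phi_x, phi_y = phi(x, x_offset), phi(y, y_offset)
    if phi_x == 0 or phi_y == 0:
        raise ValueError("similarity is undefined for a profile with no variation")
    n = len(x)
    total = sum((xi - x_offset) * (yi - y_offset) for xi, yi in zip(x, y))
    return total / (n * phi_x * phi_y)

if __name__ == "__main__":
    rising = [1.0, 2.0, 3.0]    # made-up log ratios
    falling = [3.0, 2.0, 1.0]
    # Offset 0: both genes are induced throughout, so they score as similar.
    print(similarity(rising, falling))
    # Offset = mean of each gene: the score becomes the Pearson correlation.
    print(similarity(rising, falling, x_offset=2.0, y_offset=2.0))
```

With an offset of 0 the two made-up profiles score about 0.71, because both lie above the "no change" level in every condition. With the offset set to each gene's mean, the score is about -1, because the two shapes are mirror images. The choice of offset therefore determines what "similar" means.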
S(X,Y) can be used in any clustering method. For example, Eisen et al. used the method of Sokal and Michener, similar to UPGMA, to produce dendrograms that cluster sets of genes together.
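To make the clustering step concrete, the following is a small UPGMA-style average-linkage sketch in Python. It is not a reproduction of the implementation used by Eisen et al. It assumes an offset of 0 for the similarity score and defines the similarity between two clusters as the mean of all pairwise gene similarities. The function names and the toy profiles are hypothetical.

```python
import math
from itertools import combinations

def similarity(x, y):
    """Uncentered correlation of two profiles (offset fixed at 0)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def average_linkage(profiles):
    """Repeatedly join the two most similar clusters.

    profiles maps gene name -> list of measurements.
    Returns the dendrogram as nested tuples of gene names.
    """
    # Each cluster is a pair: (subtree, list of member gene names).
    clusters = [(name, [name]) for name in profiles]

    def mean_similarity(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(similarity(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: mean_similarity(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

if __name__ == "__main__":
    toy = {
        "a": [1.0, 2.0, 3.0],
        "b": [1.0, 2.0, 3.1],
        "c": [3.0, 1.0, -2.0],
        "d": [3.0, 1.0, -2.2],
    }
    print(average_linkage(toy))
```

The nested tuples play the role of the dendrogram: genes joined early sit deepest in the nesting. This brute-force version recomputes every cluster-to-cluster similarity on each pass, which is fine for a handful of genes but far too slow for thousands.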
Eisen et al. examined 8600 human genes in cells grown in the presence or absence of serum. Genes whose expression changed by a factor of 3.0 or more in at least 2 timepoints were subjected to cluster analysis. In the figure, each gene is represented in a row, while each condition (time) is represented in a column. Genes have been clustered so that genes with the most similar expression patterns are in nearby rows. The hierarchical relationships among genes in clusters are indicated by the tree.
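The filtering step mentioned above (a change of 3-fold or more in at least 2 timepoints) can be sketched in a few lines of Python. This sketch assumes that each measurement is stored as a log2 ratio relative to a reference, and that increases and decreases both count as change. The function names and example data are hypothetical.

```python
import math

def changed_enough(log2_ratios, min_fold=3.0, min_points=2):
    """True if the gene changes by min_fold or more in at least min_points timepoints."""
    threshold = math.log2(min_fold)   # a 3-fold change is about 1.585 on the log2 scale
    hits = sum(1 for r in log2_ratios if abs(r) >= threshold)
    return hits >= min_points

def select_genes(profiles, min_fold=3.0, min_points=2):
    """Keep only the genes that pass the fold-change filter."""
    return {gene: ratios for gene, ratios in profiles.items()
            if changed_enough(ratios, min_fold, min_points)}

if __name__ == "__main__":
    profiles = {
        "g1": [0.1, 1.7, -1.6, 0.2],   # passes: two timepoints beyond 3-fold
        "g2": [0.1, 1.7, -1.5, 0.2],   # fails: only one timepoint beyond 3-fold
    }
    print(sorted(select_genes(profiles)))
```

Filtering before clustering keeps the similarity matrix small and removes genes whose profiles are mostly measurement noise.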
Törönen P, Kolehmainen M, Wong G, Castrén E (1999) Analysis of gene expression data using self-organizing maps. FEBS Letters 21;451(2):142-146.

Tamayo and coworkers have applied Self-Organizing Maps (SOMs) to the grouping of gene expression data. In their Figure 1, they illustrate a very simple X,Y dataset, with the raw datapoints shown as black dots. Such a dataset might represent, for example, a measurement on a wild-type gene on the X-axis, and a measurement on a mutant gene on the Y-axis. By extension, a timecourse with n timepoints would be represented in n dimensions. Just by looking at the datapoints, one can see what appear to be distinct groups.
The goal is to find sets of X,Y points that most closely approximate the mean value for each group of points. SOMs begin by arbitrarily creating a set of nodes (N) with randomly assigned values. In the example, a set of six nodes (1-6) is randomly placed in the X,Y space.
For each iteration of the algorithm, a datapoint P is chosen, and the position of each node is changed to move it closer to P. The closer a node N is to point P, the greater the distance it is moved towards P. This process is continued for thousands of iterations, until the total change is lower than some threshold.
The net result is that all nodes will be moved many times, but each node will "come to rest" in the vicinity of the set of datapoints to which it is closest.
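The iteration described above can be sketched as a conventional SOM in Python. Several details here are assumptions rather than the settings of Tamayo et al.: the nodes sit on a rectangular grid, "closeness" is measured on that grid from the node nearest to P, the neighbourhood is Gaussian, the learning rate and neighbourhood radius shrink linearly, and training runs for a fixed number of iterations instead of stopping at a change threshold. All function names are hypothetical.

```python
import math
import random

def nearest_node(nodes, point):
    """Index of the node closest to the point in data space."""
    return min(range(len(nodes)), key=lambda k: math.dist(nodes[k], point))

def train_som(data, rows=2, cols=3, iterations=3000, seed=0):
    """Fit a rows x cols SOM to the datapoints and return the node positions."""
    rng = random.Random(seed)
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]
    # Nodes start at random positions inside the bounding box of the data.
    nodes = [[rng.uniform(lows[d], highs[d]) for d in range(dims)]
             for _ in range(rows * cols)]
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    for t in range(iterations):
        progress = t / iterations
        rate = 0.5 * (1.0 - progress) + 0.01 * progress     # assumed schedule
        radius = 1.5 * (1.0 - progress) + 0.1 * progress    # assumed schedule
        point = rng.choice(data)                            # datapoint P
        winner = grid[nearest_node(nodes, point)]
        for k, node in enumerate(nodes):
            # Nodes near the winner on the grid are moved furthest towards P.
            grid_dist2 = (grid[k][0] - winner[0]) ** 2 + (grid[k][1] - winner[1]) ** 2
            influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
            for d in range(dims):
                node[d] += rate * influence * (point[d] - node[d])
    return nodes

def quantization_error(data, nodes):
    """Mean distance from each datapoint to its nearest node."""
    return sum(math.dist(p, nodes[nearest_node(nodes, p)]) for p in data) / len(data)

if __name__ == "__main__":
    rng = random.Random(1)
    centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]       # made-up groups
    data = [(cx + rng.gauss(0, 0.2), cy + rng.gauss(0, 0.2))
            for cx, cy in centers for _ in range(20)]
    start = train_som(data, iterations=0, seed=2)           # untrained nodes
    trained = train_som(data, seed=2)
    print(quantization_error(data, start), quantization_error(data, trained))
```

The same code works for timecourses: each datapoint is simply a list of n values instead of an X,Y pair. Comparing the mean distance to the nearest node before and after training gives a quick check that the nodes have settled near the groups of datapoints.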
For example, Tamayo et al. studied 6000 human genes in the myeloid leukemia cell line HL-60, in response to the phorbol ester PMA, which stimulates macrophage differentiation. 567 genes were shown to change significantly with the addition of PMA. Expression data were modeled onto a 3 x 4 array in which each node in the array had a randomly generated timecourse curve. Each iteration consisted of selecting an actual timecourse curve for a human gene, and modifying all 12 randomized curves to fit that timecourse. The curves most closely matching the data were modified to strongly resemble the data. Curves that were less closely related to the data to begin with underwent less modification. The 12 resultant curves are shown below:
The authors point out that, "An SOM based on a rectangular grid is analogous to an entomologist's specimen drawer, with adjacent compartments holding similar insects."
Timecourses from different treatments can be combined into a single SOM. The SOM below was constructed using four different timecourses from treated cells (left to right): HL-60 cells + PMA; U937 cells + PMA; NB4 cells + ATRA (all-trans-retinoic acid); Jurkat cells + PMA.

One of the potential
drawbacks of SOMs is that an arbitrary number of
categories must be chosen at the beginning of the analysis.
Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada