November 24, 2011
MICROARRAYS
MICROARRAY LINKS
Bibliography on Microarray Data Analysis [http://www.nslij-genetics.org/microarray/]
A. Microarray Technology
  1. How do microarrays work?
  2. Types of experiments
  3. Types of data
  4. What are we trying to learn from microarrays?
B. Experimental design and normalization
  1. Sources of experimental variation
  2. Normalization
C. Grouping genes with similar expression patterns
  1. Cluster analysis
  2. Self-organizing maps
A. Microarray Technology
1. How do microarrays work?
In microarray hybridization, oligonucleotides of 60 - 70 bases in length are synthesized directly on a glass slide. Each array on the slide may contain 10,000 or more oligonucleotide spots, each representing a gene. Labeled cDNA is made from an mRNA population, so the labeled cDNA molecules represent the original mRNA population. The slides are then hybridized with this labeled cDNA. The amount of hybridization to a given spot represents the amount of mRNA present for the corresponding gene.
Microarray technology measures mRNA levels for thousands of genes
- in any cell or tissue type
- at any point in development
- in response to any stimulus
Microarrays consist of thousands of oligonucleotides spotted onto microscope slides. Sequences are chosen from EST collections or genomic sequences, so the sequences, and usually the identities, of the genes in the array are known.
In a microarray experiment, an mRNA population is isolated from cells. The population is labeled by synthesizing complementary DNA (cDNA) using reverse transcriptase and labeled nucleotides. The resulting cDNA population is then hybridized to the array.
gene x - strongly expressed; high abundance transcript
gene y - moderately expressed; medium abundance transcript
gene z - weakly expressed; low abundance transcript
Each transcript base pairs with the complementary DNA for its corresponding gene on the array.
Signal strength is proportional to the abundance of each mRNA.
WARNING! Each one of these steps contributes to experimental variation.
a. Microarrays
Each gene on an array is represented as a single-stranded oligonucleotide ranging in size from 25 to 70 nt. Smaller oligonucleotides tend to bind probe less efficiently, but they have greater specificity. Beyond about 70 nt, there is little increase in signal. Oligonucleotides are synthesized de novo, so there is no chance of contamination from other sequences, as there is with cDNAs.
b. Labeled cDNA
This is probably the biggest single source of experimental variation. Microarray experiments typically attempt to compare gene expression levels in different tissues or conditions, or at different times after a treatment. RNA is extracted from each tissue, condition, or treatment, and the RNA samples are diluted so that each sample contains the same concentration of RNA. To create a single-stranded probe, RNA is added to a reaction mix containing oligo dT primers (which can base pair with the polyA tail on mRNA), reverse transcriptase (an RNA-dependent DNA polymerase), and labeled nucleotides. Commonly, labeled nucleotides are tagged either with fluorescent labels such as Cy3 and Cy5, or with digoxigenin (DIG), which can be detected using chemiluminescent detection. In principle, for every mRNA molecule in the original RNA population, a single-stranded labeled cDNA will be produced, complementary to the mRNA. The higher the concentration of a particular mRNA, the more cDNA will be present.
c. Hybridization and washing
Incorporation of label into each probe is quantified, and probes are diluted so that all are at an equal concentration. Usually, a duplicate filter or microarray is prepared for each probe to be assayed. cDNAs are hybridized separately with each array. Filter arrays are incubated with labeled cDNA and washed in much the same way as is done for Southern or Northern blotting. For glass microarrays, hybridization is done under a coverslip, and slides are washed by dipping into wash solutions. Commercially produced arrays come in cassettes, in which hybridization, washing, and detection are done.
d. Data acquisition
Hybridized probe is detected by fluorescence in a slide reader using confocal laser microscopy. The raw intensity of each spot is measured by a CCD camera, and the data are acquired as a TIFF image.
2. Types of experiments
Single label experiment
The simplest type of microarray experiment is the single label experiment. Duplicate arrays are hybridized with probes made using a single label. To allow comparison between treatments, controls must be included in the probes and on the arrays to act as hybridization standards.

From Schena M, Shalon D, Heller R, Chai A, Brown PO, Davis RW (1996) Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc. Natl. Acad. Sci. USA 93(20):10614-10619.
Expression of human genes was measured in RNA populations from cells grown at 37°C (-Heat shock) or 43°C (+Heat shock). White boxes: genes whose expression changes with heat shock. Red boxes: genes activated by heat shock. Green boxes: genes suppressed by heat shock.
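The role of hybridization standards in a single label experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spot names, the intensity values, and the choice of rescaling every array so that its control spots average to a common target are all hypothetical and are not taken from the experiment cited above.

```python
from statistics import mean

def scale_to_controls(intensities, control_ids, target_control_mean):
    """Rescale one array so that its control spots average to a shared target."""
    control_mean = mean(intensities[spot] for spot in control_ids)
    factor = target_control_mean / control_mean
    return {spot: value * factor for spot, value in intensities.items()}

# Hypothetical raw spot intensities (arbitrary units) for two duplicate arrays.
array_minus_hs = {"ctrl1": 1000.0, "ctrl2": 2000.0, "geneA": 500.0}
array_plus_hs = {"ctrl1": 2000.0, "ctrl2": 4000.0, "geneA": 3000.0}
controls = ["ctrl1", "ctrl2"]

# Bring both arrays onto the same scale using the control spots.
norm_minus = scale_to_controls(array_minus_hs, controls, 1500.0)
norm_plus = scale_to_controls(array_plus_hs, controls, 1500.0)

# Only after rescaling is the ratio between arrays meaningful.
fold_change = norm_plus["geneA"] / norm_minus["geneA"]
print(fold_change)
```

In this made-up case the second array is twice as bright overall, so the raw six-fold difference for geneA shrinks to a three-fold difference once the controls are matched.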
Double label experiments
Another approach to comparing expression between two conditions is the double label experiment. For example, in work from Patrick Brown's lab at Stanford, cDNA probes were made from yeast cells grown in the presence of either galactose or glucose. To distinguish between signals from the two probes, nucleotides tagged with different fluorescent dyes, either Cy3 or Cy5, were added during reverse transcription. Cy3 has emission maxima at 565 and 615 nm, while Cy5 has an emission peak at 670 nm. Replicate experiments were done in which the dyes were switched. By scanning the arrays twice, once for Cy3 and once for Cy5, a composite image can be generated in which the ratio of the two dyes, and hence the ratio of transcripts in the two growth conditions, can be measured. In pseudocolor images, spots representing genes that are more strongly expressed in the presence of galactose are shown in green, and spots representing genes more strongly expressed in the presence of glucose are shown in red.
http://www.pnas.org/cgi/content/full/94/24/13057/F1
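The ratio measurement and the dye switch described above can be illustrated with a short Python sketch. It assumes background-subtracted spot intensities and a hypothetical dye assignment (galactose cDNA in Cy5 on the forward array, in Cy3 on the swapped array); the numbers are invented, and averaging log2 ratios over the pair is one common convention rather than the method used in the cited work.

```python
import math

def log2_ratio(numerator, denominator):
    """Log2 of an intensity ratio; +1 means a two-fold excess of the numerator."""
    return math.log2(numerator / denominator)

def dye_swap_log_ratio(forward, swapped):
    """Average log2(galactose / glucose) over a dye-swap pair.

    forward: (cy5, cy3) with galactose cDNA in Cy5 and glucose cDNA in Cy3.
    swapped: (cy5, cy3) with the dye assignment reversed.
    """
    m_forward = log2_ratio(forward[0], forward[1])  # galactose is in Cy5
    m_swapped = log2_ratio(swapped[1], swapped[0])  # galactose is now in Cy3
    return (m_forward + m_swapped) / 2.0

# Hypothetical intensities for one spot on the two arrays.
forward_spot = (4000.0, 1000.0)
swapped_spot = (500.0, 4000.0)

m = dye_swap_log_ratio(forward_spot, swapped_spot)
print(m)
```

Working on the log scale makes induction and repression symmetric around zero, and averaging over the swapped pair cancels any bias that belongs to a dye rather than to a growth condition.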
3. Types of data
Microarray studies tend to generate two different types of data. Studies in which two or more conditions are compared at a single time generate discrete state data. Often, however, it is critical to follow the expression of a gene over time after a treatment. In timecourse experiments, the expression of each gene in response to two or more treatments is measured over time. For example, in the timecourse at right, the solid blue and red dashed curves might represent the expression levels for a gene in response to two different drugs.

There is a whole family of problems in normalization of data and controlling for components of experimental variation.
To put things into perspective, if the experiment were repeated 4 times, the timecourse above would represent 2 treatments x 6 timepoints x 4 replicates = 48 labeled cDNA populations hybridized to 48 duplicate arrays to generate the data. Although the data for each replicate are averaged, there is often a great deal of variation in the results, which can potentially negate any meaning. Therefore, extraordinary measures must be taken to minimize experimental variation at each step in the procedure, to minimize the overall variation.
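The bookkeeping behind that count can be made explicit with a small Python sketch that enumerates one labeled cDNA population (and one array) per combination. The treatment names and sampling times are placeholders chosen for illustration; only the counts 2, 6 and 4 come from the text.

```python
from itertools import product

treatments = ["drug_A", "drug_B"]      # hypothetical treatment names
timepoints_h = [0, 1, 2, 4, 8, 24]     # hypothetical sampling times, in hours
replicates = [1, 2, 3, 4]

# One labeled cDNA population, and one array, per combination.
samples = [
    {"treatment": t, "time_h": h, "replicate": r}
    for t, h, r in product(treatments, timepoints_h, replicates)
]

n_arrays = len(samples)
print(n_arrays)  # 48
```

Writing out the sample sheet like this before starting makes it obvious how quickly the number of hybridizations grows when a treatment, timepoint, or replicate is added.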
4. What are we trying to learn from microarrays?
The primary goal of microarray experiments is to generate expression information for every gene in the array, under some set of conditions. Expression may be studied in
- different tissues
- different developmental stages
- different genotypes
- different treatments
- different times after a treatment.
The kind of results that are sought in microarray experiments can be illustrated as follows:

In the example, timecourse data are generated for each gene in an array. The raw data consist of a series of expression curves for timecourses, or histograms where other types of treatments are being compared. The goal is usually to find which groups of genes have the most similar expression patterns. In the example, two genes in the array (hatched background) show a gradual induction over the period of the timecourse. Two other genes (shaded background) show a biphasic response with two distinct periods of strong expression.
Key questions:
- Which genes are expressed differentially between condition A and condition B?
- How can genes be grouped according to similarities in expression patterns?
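As a rough illustration of what "similar expression patterns" can mean in practice, the Python sketch below scores pairs of timecourse profiles by their Pearson correlation. The choice of Pearson correlation as the similarity measure, the gene names, and the expression values are all assumptions made for the example; they are not taken from the figure.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical expression levels at six timepoints.
profiles = {
    "gene1": [1, 2, 3, 4, 5, 6],     # gradual induction
    "gene2": [2, 4, 6, 8, 10, 12],   # gradual induction, larger amplitude
    "gene3": [1, 6, 2, 1, 6, 2],     # biphasic
    "gene4": [2, 9, 3, 2, 9, 3],     # biphasic, larger amplitude
}

r_gradual = pearson(profiles["gene1"], profiles["gene2"])
r_biphasic = pearson(profiles["gene3"], profiles["gene4"])
r_mixed = pearson(profiles["gene1"], profiles["gene3"])
print(r_gradual, r_biphasic, r_mixed)
```

Correlation ignores differences in amplitude, so the two gradually induced genes pair up with each other, as do the two biphasic genes, while a gradual and a biphasic profile score close to zero.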
B. Experimental design and normalization
It is critical to realize that every experimental step in a procedure contributes to the final experimental error. Therefore, one should conceptualize the data as a set of observations, each with a measurable amount of variation. In the figure, error bars represent the standard error of each measurement. The goal can then be restated as that of setting up the experiment in such a way as to minimize the final standard error in the observations. For timepoints at which there is little true difference, a difference can only be detected when the standard error for both treatments is small. For timepoints at which the differences are large, higher standard errors will still allow the detection of the difference between two treatments.
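The interplay between the size of a difference and the standard error can be sketched in Python. The rule used here (call two means separated when they differ by more than k combined standard errors, with k = 2) is an arbitrary illustrative threshold, not a formal statistical test, and the replicate values are invented.

```python
import math
from statistics import mean, stdev

def sem(values):
    """Standard error of the mean for a set of replicate measurements."""
    return stdev(values) / math.sqrt(len(values))

def separated(a, b, k=2.0):
    """Rough check: do the means differ by more than k combined standard errors?"""
    difference = abs(mean(a) - mean(b))
    combined = math.sqrt(sem(a) ** 2 + sem(b) ** 2)
    return difference > k * combined

# Small true difference (0.5), tight replicates: detectable.
tight_a = [10.0, 10.1, 9.9, 10.0]
tight_b = [10.5, 10.6, 10.4, 10.5]

# Same small difference, noisy replicates: lost in the error bars.
noisy_a = [9.0, 11.0, 10.5, 9.5]
noisy_b = [9.5, 11.5, 11.0, 10.0]

# Large difference (5.0), equally noisy replicates: still detectable.
large_b = [14.0, 16.0, 15.5, 14.5]

print(separated(tight_a, tight_b))  # True
print(separated(noisy_a, noisy_b))  # False
print(separated(noisy_a, large_b))  # True
```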
1. Sources of experimental variation
Making a list of factors that contribute to experimental error is essentially the same as making a list of steps in the microarray experiment. However, several points are worth highlighting.
- Treatments
  - Experimental conditions
  - Tissue preparation
- Targets
  - RNA isolation - use identical amounts of tissue and identical extraction methods; use the minimum number of steps; measure the amount of RNA and normalize the concentration
  - Labeling - measure incorporation of label and normalize samples to the same concentration
  - Amount - add the same amount of label to each hybridization
- Arrays
  - PCR products - amplify directly from bacterial cells, rather than isolated plasmids; add the same amount of product to each spot on the filter
  - Uniformity of spotting - use an arraying tool for filter arrays or a robot for microarrays
  - Treatment of filters or slides
- Hybridization and washing
  - Long hybridization, to ensure that hybridization goes to completion.

    In any hybridization experiment, the time required for hybridization to go to completion is inversely proportional to the concentration of the target cDNA. As illustrated, high abundance transcripts will hybridize to completion in very short times, so the signal should be roughly the same regardless of how long hybridization is done. For moderately abundant transcripts, it takes longer for hybridization to proceed to completion, so the amount of that mRNA will be underestimated unless a long hybridization time is used. Finally, for low abundance transcripts, the hybridization curve will still be in the linear phase after a long time. For example, at the time indicated by the dotted line on the X-axis, the moderately abundant transcript would be estimated with only a small error, while the abundance of the rare transcript would probably be greatly underestimated. (See http://www.umanitoba.ca/afs/plant_science/courses/PLNT3140/l14/cot.html for more details on reassociation kinetics.)
  - Washing - For genes that are members of multigene families, hybridization results could vary depending on hybridization and washing stringency. Hybridization under low stringency conditions might allow cross-hybridization between members of a gene family, and all members would be expected to give roughly the same signal. Hybridization under high stringency conditions would allow for more discrimination between genes, because each transcript would only hybridize with its own corresponding gene on the array.
- Data acquisition
  - Image acquisition - The acquisition of the image data carries built-in sources of variation similar to those of hybridization. Within a certain intensity range, the amount of signal detected is linearly proportional to the time of exposure. For microarrays, data acquisition is done by scanning the slide in a confocal laser scanner. Data are saved as a TIFF image, where the intensity of a given pixel is proportional to the amount of signal coming from that part of the filter or slide. For highly abundant transcripts, beyond a certain amount of signal there may be little increase in intensity per unit time, and the spot will be saturated in the image. Moderately expressed genes may yield signal within the linear range of the detector. For rare transcripts, it may not be possible to expose long enough for signal from the transcript to stand out from the background. It is important to recognize that these errors of detection are compounded on top of the errors associated with hybridization time!
  - Spot and background detection - Software has to delineate each spot in the array, and to choose areas outside of spots for background estimation. Spot diameter can vary, and spot morphology may be irregular, rather than perfectly circular.
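The way hybridization time and detector saturation compound, as described in the hybridization and image acquisition points of the list above, can be illustrated with a toy Python model. Everything in it is an assumption made for illustration: hybridization is treated as a simple pseudo-first-order approach to completion, 1 - exp(-k * c * t), the rate constant, gain, and concentrations are in arbitrary units, and the detector is assumed to clip at a 16-bit maximum of 65535.

```python
import math

def fraction_hybridized(conc, hours, k=1.0):
    """Fraction of completion after `hours` in the assumed pseudo-first-order model."""
    return 1.0 - math.exp(-k * conc * hours)

def half_time(conc, k=1.0):
    """Time to reach half completion; inversely proportional to concentration."""
    return math.log(2.0) / (k * conc)

def detected_signal(conc, hours, gain=100000.0, saturation=65535.0):
    """Signal proportional to hybridized target, clipped at the detector maximum."""
    true_signal = gain * conc * fraction_hybridized(conc, hours)
    return min(true_signal, saturation)

# Hypothetical transcript concentrations (arbitrary units), 16 h hybridization.
transcripts = {"high": 1.0, "moderate": 0.1, "rare": 0.001}
for name, conc in transcripts.items():
    done = fraction_hybridized(conc, 16.0)
    signal = detected_signal(conc, 16.0)
    print(f"{name}: {done:.3f} of completion, detected signal {signal:.1f}")
```

In this toy model the high abundance transcript has hybridized to completion but is clipped by the detector, the moderately abundant transcript is underestimated by roughly a fifth, and the rare transcript has reached only a few percent of completion, so the measured ratios between the three are compressed at both ends.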
BIOLOGICAL REPLICATES ARE THE SINGLE MOST EFFECTIVE WAY TO GET GOOD GENE EXPRESSION RESULTS!
In the next section we will see that there is an almost endless list of ways to massage the data. The most heroic analytical methods are no substitute for the simple step of doing several biological replicates.
- In each biological replicate, the entire experiment is repeated: the different treatments of a batch of cells, plants or animals, the sampling of different tissues from different conditions, and the extraction of RNA.
- The RNA samples from different biological replicates are NOT mixed for a single hybridization. Rather, a separate labeling and hybridization is carried out for EACH REPLICATE.
- Technical replicates, in which the same RNA sample is labeled and hybridized more than once, only control for differences in handling. Biological replicates include all sources of biological and experimental variation. Therefore, they are more realistic.
- As the number of biological replicates increases, the total experimental variation decreases.
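One way to read the last point is in terms of the standard error of the mean, which shrinks with the square root of the number of replicates. The short Python sketch below assumes independent replicates with the same spread; the standard deviation of 1.0 is a placeholder, not a measured value.

```python
import math

def standard_error(sd, n_replicates):
    """Standard error of the mean of n independent replicates with spread sd."""
    return sd / math.sqrt(n_replicates)

sd = 1.0  # hypothetical replicate-to-replicate standard deviation
for n in (2, 4, 8, 16):
    print(n, round(standard_error(sd, n), 3))
```

Because of the square root, halving the standard error takes four times as many replicates, so the first few biological replicates buy the most improvement.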
Gene chips are getting cheaper all the time, often costing less than $100 per chip. The excuse that you can't do biological replicates because they are too expensive no longer holds.
Estimated sample size requirements for example data set
|             | FDR = 0.10 | FDR = 0.05 | FDR = 0.01 |
| Power = 0.5 | 3 / 3      | 3 / 3      | 5 / 5      |
| Power = 0.6 | 3 / 3      | 3 / 4      | 7 / 6      |
| Power = 0.7 | 3 / 4      | 5 / 5      | 10 / 9     |
| Power = 0.8 | 4 / 6      | 9 / 8      | 20 / 14    |
| Power = 0.9 | 13 / 11    | 30 / 16    | 75 / 27    |
Power is the fraction of truly differentially expressed genes (true positives) that are detected. FDR is the false discovery rate, i.e. the fraction of false positives among the genes called differentially expressed. The numbers on either side of the slash are sample-size estimates (i.e. numbers of biological replicates) made using the sample-size estimation methods described in Ref. [8] and Ref. [10], respectively.
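The two quantities in the table can be made concrete with a short Python sketch that computes them from counts of true positives, false positives, and false negatives. The counts are invented for illustration and have no connection to the example data set behind the table.

```python
def power(true_pos, false_neg):
    """Fraction of truly differentially expressed genes that were detected."""
    return true_pos / (true_pos + false_neg)

def false_discovery_rate(true_pos, false_pos):
    """Fraction of the genes called significant that are false positives."""
    called = true_pos + false_pos
    return false_pos / called if called else 0.0

# Hypothetical outcome: 100 genes truly change; 90 are detected, 10 are missed,
# and 10 unchanged genes are called significant by mistake.
tp, fn, fp = 90, 10, 10
print(power(tp, fn))                 # 0.9
print(false_discovery_rate(tp, fp))  # 0.1
```

Moving down and to the right in the table means asking for both more of the true changes to be found and fewer mistaken calls among them, which is why the required number of replicates climbs so steeply.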
Agilent - 10 Pitfalls of Microarray Analysis
Jorstad TS, Langaas M, Bones AM (2007) Understanding sample size: what determines the required number of microarrays for an experiment? Trends in Plant Science 12(2):46-50. ISSN 1360-1385. DOI: 10.1016/j.tplants.2007.01.001
Knapen D, Vergauwen L, Laukens K, Blust R (2009) Best practices for hybridization design in two-color microarray analysis. Trends in Biotechnology 27:406-414
Simon, S. Myths & Truths About Microarray Expression Profiling