Greg Butler: Bioinformatics Algorithms Course

COMP 691 R Bioinformatics Algorithms

Lecture Contents and Reading List

For each area to be studied, I plan to provide

an introduction to the genomics and biology involved,
a set of readings on the major algorithms and analysis techniques,
comparative studies of algorithms, and
links to the computer science literature for the algorithm design principles involved.

Sequence Analysis

The first set of web pages explains how a sequencer works and how a sequencing project is organized. The second web page is a very good tutorial. You should know the Smith-Waterman, FASTA, and BLAST algorithms, as well as how the scoring matrix represents the "theory of evolution". The next references give information on BLAST. The NCBI web pages are very detailed. Note that the statistical analysis that underlies BLAST is extremely important, because it gives you a level of confidence in the results. Reference 5 is an example of pattern finding in sequences, in this case to classify the "family" to which a protein belongs. The last four references deal with multiple alignment of sequences.

Basic biotechnology behind sequencers. Read Genomics1, Genomics2, Genomics3, and the output of the ABI sequencer.
A Tutorial on Searching Sequence Databases and Sequence Scoring Methods
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research 25: 3389-3402. An extremely good online guide and tutorial is available at NCBI.
S.F. Altschul, The statistics of sequence similarity scores.
Hofmann, K; Bucher, P; Falquet, L and Bairoch, A (1999). The PROSITE database, its status in 1999. Nucl. Acids Res. 27, 215-219. See the PROSITE web site. See also the information on patterns and motifs, including how to construct them, in the Stanford Biochem 218 slides.
Thompson J.D., Higgins D.G., Gibson T.J.; "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice."; Nucleic Acids Res. 22:4673-4680(1994). See the help at the clustalw web server for more information.
B Morgenstern, K Frech, A Dress, and T Werner, DIALIGN: finding local similarities by multiple sequence alignment, Bioinformatics 1998 14: 290-294.
B Morgenstern, DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment, Bioinformatics 1999 15: 211-218. See the DIALIGN web page.
J D Thompson, F Plewniak, and O Poch, A comprehensive comparison of multiple sequence alignment programs, Nucleic Acids Res. 1999 27: 2682-2690.
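
As a concrete reference point for the Smith-Waterman algorithm mentioned above, here is a minimal sketch of local alignment by dynamic programming. It is an illustration only: the function name smith_waterman and the scoring values (match = 2, mismatch = -1, linear gap = -2) are assumptions made for this example. Real tools use a substitution matrix such as PAM or BLOSUM together with affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not taken from the readings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, smith_waterman("XXGATTACAYY", "ZZGATTACAWW") finds the shared segment GATTACA with score 14 under these assumed scores. The zero in the maximum is what makes the alignment local: any prefix with a negative score is discarded rather than carried forward.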

Secondary Structure Prediction

Secondary Structure Prediction methods and links provides a good overview and links to servers.
X. Zhang, J.P. Mesirov, D.L. Waltz, Hybrid system for protein secondary structure prediction, Journal of Molecular Biology, 225 (1992) 1049-1063.
A.A. Salamov and V.V. Solovyev, Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments, Journal of Molecular Biology, 247 (1995) 11-15.
B. Rost and C. Sander, Prediction of protein secondary structure at better than 70% accuracy, Journal of Molecular Biology, 232 (1993) 584-599.
S. Salzberg and S. Cost, Predicting protein secondary structure with a nearest-neighbor algorithm, Journal of Molecular Biology, 227 (1992) 371-374.
J.U. Bowie, R. Luthy, D. Eisenberg, A method to identify protein sequences that fold into a known three-dimensional structure, Science 253 (1991) 164-170.
R. King and M.J.E. Sternberg, Machine learning approach for the prediction of protein secondary structure, Journal of Molecular Biology, 216 (1990) 441-457.

Gene Expression Analysis

The first two papers discuss the steps in using microarrays for comparative gene expression. The next three papers look at the issue of data normalization. The sixth paper discusses the statistical problems in analyzing microarray data. The remaining papers present clustering approaches with applications to gene expression analysis.

Jeremy Buhler, Anatomy of a Comparative Gene Expression Study.
Michael B. Eisen and Patrick O. Brown, DNA Arrays for Analysis of Gene Expression, Methods in Enzymology, vol. 303 (1999) pp. 179-205.
Yee Hwa Yang, Sandrine Dudoit, Percy Luu and Terry Speed, Normalization for cDNA Microarray Data. SPIE BiOS 2001, San Jose, California, January 2001.
Johannes Schuchhardt, Dieter Beule, Arif Malik, Eryc Wolski, Holger Eickhoff, Hans Lehrach, and Hanspeter Herzel, Normalization strategies for cDNA microarrays, Nucleic Acids Res. 2000 28: e47.
Alexander Zien, Thomas Aigner, Ralf Zimmer, and Thomas Lengauer, Centralization: a new method for the normalization of gene expression data, Bioinformatics 2001 17: 323S-331S.
Sandrine Dudoit, Yee Hwa Yang, Matt Callow and Terry Speed, Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Technical report #578, August 2000.
A. Brazma and J. Vilo, Minireview: Gene Expression Data Analysis. FEBS Letters 480 (2000) 17-24.
J. Zhu and M. Q. Zhang, Cluster, Function and Promoter: Analysis of Yeast Expression Array, Pacific Symposium on Biocomputing 5:476-487 (2000).
J. Vilo, A. Brazma, I. Jonassen, A. Robinson, and E. Ukkonen, Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data. ISMB'2000 August 2000. AAAI press. pp. 384-394.
Pavlidis, P., Grundy W.N. (2000) Combining microarray expression data and phylogenetic profiles to learn gene functional categories using support vector machines. Technical report, Columbia University Department of Computer Science.
R. Sasik, T. Hwa, N. Iranfar, and W.F. Loomis, Percolation Clustering: A Novel Algorithm Applied to the Clustering of Gene Expression Patterns in Dictyostelium Development, Pacific Symposium on Biocomputing 6:335-347 (2001).
Ka Yee Yeung, David R. Haynor and Walter L. Ruzzo, Validating Clustering for Gene Expression Data, Technical Report UW-CSE-00-01-01, January, 2000. Also appeared as Bioinformatics, 2001 v 17 #4: 309-318. Supplementary Web Site
Laura Lazzeroni and Art Owen, Plaid Models for Gene Expression Data, Technical Report, Stanford University, March 2000.
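
To make the normalization issue concrete, here is a minimal sketch of one very simple scheme: global median centering of log2 ratios on a two-colour array. It is not the method of any particular paper listed above, which describe more refined approaches. The function name and the sample intensities are assumptions made for this illustration.

```python
import math
from statistics import median


def median_center_log_ratios(red, green):
    """Return log2(red/green) per spot, shifted so the array median is 0.

    red, green: equal-length sequences of positive spot intensities.
    Global median centering assumes most genes are not differentially
    expressed, so the typical log ratio should be zero.
    """
    if len(red) != len(green):
        raise ValueError("red and green must have the same number of spots")
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)  # estimate of the overall dye/labelling bias
    return [m - shift for m in log_ratios]


# Hypothetical intensities: the red channel is uniformly brighter.
red = [200.0, 400.0, 800.0]
green = [100.0, 100.0, 100.0]
normalized = median_center_log_ratios(red, green)
```

Here the raw log2 ratios are 1, 2, and 3, so subtracting their median gives -1, 0, and 1. A single global shift cannot correct a bias that varies with spot intensity or position on the array, which is one reason the papers above propose other normalization strategies.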

Good people and resources in the area are

Terry Speed at Berkeley.
Walter L. Ruzzo at University of Washington.
Alvis Brazma at European Bioinformatics Institute.
Large-Scale Gene Expression and Microarray Links and Resources web page maintained by Alan Robinson at EBI.

3D Structure Prediction

R. Srinivasan and G.D. Rose, LINUS: A hierarchic procedure to predict the fold of a protein, Proteins: Structure, Function, and Genetics 22 (1995) 81-99. LINUS home page
S. Lemieux, S. Oldziej and F. Major, Nucleic Acids : Qualitative Modeling, in The Encyclopedia of Computational Chemistry, P. Schleyer et al (editors), John Wiley & Sons: Chichester, 1998.
J.R. Gunn, Sampling protein conformations using segment libraries and a genetic algorithm, J. Chem. Phys. 106 (1997) 4270-4281.
Liisa Holm and Chris Sander, Protein structure comparison by alignment of distance matrices, Journal of Molecular Biology 233 (1993) 123-138. The Dali server

General Information Sources

Bioinformatics
Nucleic Acids Research
FEBS Letters
International Conference on Intelligent Systems in Molecular Biology: ISMB 2001
Pacific Symposium on Biocomputing
The International Society for Computational Biology

Biochem 218 Computational Molecular Biology course at Stanford.

Last modified on May 1, 2003 by gregb@cs.concordia.ca