Computational Molecular Biology
(Offered as a Special Topics course in Operations Research)
47-863 Course Home Page
Minisemester 4, Spring 2001
R. Ravi, instructor.
Graduate School of Industrial Administration
Carnegie Mellon University,
Pittsburgh, PA 15213 USA.
Office: 233 B, Posner. Tel: (412) 268 3694. Fax: (412) 268 7357. Email: ravi@cmu.edu.
Office hours: By appointment.
Class Presentations Proposed Schedule:
I have now prepared a list of papers for the students to present, along with a tentative assignment and dates here.
Contents:
Course Description
Place: Wean Hall 4601
Time : Wed and Fri 1.30-3.20 pm
This is a one mini course offered in the second half of the Spring
2001 semester at GSIA. The course will start with a quick introduction
to background material in biology and summarize some traditional
methods in computational biology such as sequence alignment and
applications of dynamic programming. We will then look at some of the
underlying statistical models of sequence evolution and relate this to
scoring matrices and significance scores attached to sequence searches
used by popular programs such as BLAST. The next topic covered in the
course will be the many applications of Hidden Markov Models in
Computational Biology. The second half of the course will focus on
more recent high-throughput analysis of large datasets in biology
including computational analysis of gene expressions data from DNA
arrays, related problems in biological chips or arrays, modeling
biological pathways, and time permitting, an introduction to selected
approaches to the protein-folding problem.
The classes will mainly be lecture based, with possibly several
invited speakers from professors in campus working on closely related
topics, towards the end of the course. No background in molecular
biology is necessary but some familiarity and interest in this area
will help.
Grades will be based on one-page summaries of required reading papers
to be turned in at the beginning of most classes and a class
presentation of a recent research topic (based on a couple of papers)
towards the end of the course.
The intended audience for this course is advanced level undergraduates
and graduate students with curiosity in learning about current
research themes in Computational Molecular Biology and possessing a
basic background in algorithm design, discrete probability and
discrete mathematics.
Class Schedule and Resources for Lectures
Postscript files can be read using Ghostscript/GSview
(Aladdin Ghostscript 6.01
/
GSview 3.0);
PDF files can be read using Adobe Acrobat Reader
(Adobe Acrobat Reader with Search 4.05).
- Class 1, March 14: The first scheduled day of class (Wednesday, March 14th) coincides
with a day-long
symposium on Computational Molecular Biology, organized by
Prof. Durand. Consequently, there will be no class meeting that day,
but all registered for the class are encouraged to preregister for the
symposium and attend as many of the talks as possible to get a glimpse
of some of the topics that will be discussed in this course.
The first class meeting will be on Friday, March 16th, in WeH 4601 at
1.30 pm, and a one to two page summary of your insights/impressions from the
Workshop on Wednesday will be due in class.
If you did not hand in anything in Class 2, be sure to email me a summary by Monday, March 19.
- Class 2, March 16: Introduction to Molecular Biology and Recombinant DNA. Resources:
- Class 3, March 21: Classical Computational Molecular Biology: Illustration using a gene-hunting process.
- HW due in class: Write a one to two-page summary (total) of what you can learn about the folowing acronyms popular in Computational Biology: RH, STS, SBH, FASTA, GRAIL. If it represents a biological process, describe it in a couple of sentences and write a line about the computational problem centered or arising from it. If it is a program, describe what problem it solves and if you understand, what the underlying method(s) is.
- Reminder: There will be a short in-class quiz on the material from last class, mostly based on material from the first of the two links from the last class resources (Primer on Molecular genetics) and somewhat on the second one (HGP Overview).
- Class venue: The class will continue for the rest of the semester in the originally announced room: Wean Hall 4601.
- Class 4, March 23: Illustration using a gene-hunting process (completed). Guest lecture by Rob Knight, Princeton U.
- Class 5, April 4: CANCELLED.
- Class 5, April 6: Introduction to sequence comparison algorithms and programs.
- Class 6, April 9: (Makeup class for April 4) How BLAST works, Introductory ideas in statistics to understand the values output by BLAST.
- Class 7, April 11: More statistical theory to understand BLAST parameters and outputs.
- HW due in class today: Recall the Moment Generating Function of a discrete random variable Y defined as m(t) = \sum_y { p(y) exp(ty) }.
Suppose (A1) Y takes at least one positive value and (A2) the mean value of Y is negative, then show that there is a unique positive solution t+ to the equation: m(t) = 1. [ t=0 is the other solution. ]
- Class 8, April 13: Continuation of our discussion on BLAST.
- Class 9, April 18: Wrap-up discussion on BLAST. Some of the examples we discuss are in the Wash. U BLAST site .
- HW due in class today: Compute the mean (and variance) of the random variable that is the maximum of n identically and exponentially distributed random variables with parameter $\lambda$. (Probability density function of each r.v. is $f(x) = \lambda e^{-\lambda x}$)
- Extra credit problem (due before end of class -- Do NOT consult the literature; Turn in a solution only if you solved it by yourself) : Given a sequence of $n$ characters, each with a score, find an algorithm with running time $O(n)$ to find all the maximal scoring segments defined as follows: First find a segment (contiguous subsequence) with maximum total sum of scores [Assume it is minimal so that it contains no zero-sum prefixes or zero-sm suffixes]. Record this segment and delete this from the sequence and recursively find the maximum scoring segments on both the remaining subsequences. Stop when the maximum scoring segment has zero score (null subsequence).
Hint: Use a less procedural definition of a segment that will appear in the list of output segments as follows: A segment is maximal iff it obeys the following two properties: (P1) Every proper subsegment has smaller total score and (P2) No proper supersegment has property (P1).
- Class 10, April 20: An introduction to Hidden Markov Models.
- Class 11, April 25: Class presentations: Konstantin Andreev and Kinman Au.
- Class 12, April 27: Presentations: Prof. Christos Faloutsos and Akira Hasegawa.
- Symmetric PAM-1 HW due in class today: Consider a simple, symmetric model of amino-acid transition probabilities with equal probabilty of staying the same (=0.99) and of changing to another amino-acid (=0.01/19) for every amino-acid. Also assume that each amino-acid is equally likley to occur in random protein sequences (each protein locus is equally likely to be any of the 20 amino-acids independent of other loci). Using these probabilities, compute the rough value of N so that the PAM-N similarity scores arising from them have ratio of diagonal to off diagonal scores being -12 (an approximate value observed for a commonly used PAM matrix).
- Class 13, May 2: Class presentations: Venkatesh Natarajan and Bjarni Halldorsson. Last day to turn in Extra Credit problem.
- Class 14, May 4: Class presentations: Abraham Flaxman and Amitabh Sinha.
Resources