NBIC Pattern Recognition



a biennial course

Lecturers: dr. ir. Dick de Ridder
dr. Perry Moerland
dr. Lodewyk Wessels
Contact: dr. ir. Dick de Ridder
e-mail: d.deridder@tudelft.nl
telephone: +31 15 2785114

Objectives

After having followed this course, a student should have an overview of basic pattern recognition techniques and be able to recognize what method is most applicable to classification problems (s)he encounters in bioinformatics applications.

Time and location

The next course will be given January 17-21, 2011, at the Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, Delft, The Netherlands (Building 36). Travel directions can be found here. The lectures will take place in the Snijderzaal (room LB01.010), on the first floor of the low-rise building (after entering, turn left twice after the "Servicepunt" and take the stairs).

Target audience

The course is aimed at PhD students with a background in bioinformatics, computer science or a related field; a working knowledge of basic statistics and linear algebra is assumed. Preparation material on statistics and linear algebra will be distributed before the course, to be studied by students missing the required background.

Description

Many problems in bioinformatics require classification: prediction of the class to which a certain object (e.g. a gene, protein, cell, patient, ...) belongs. This calls for algorithms that can assign the most likely label (discrete output) to an object, given one or more measurements on that object. For most interesting problems, the underlying physics are too complex to explicitly formulate such an algorithm. In such cases, a machine learning approach is taken: an algorithm is constructed, with parameters that are tuned based on an available dataset of training examples. The algorithm should predict the labels for these examples as well as possible, yet still generalize, i.e. perform well on objects not seen before. Some examples of classification problems in bioinformatics are gene finding (sequence in, gene presence out), diagnostics (microarray data in, diagnosis out) and data integration (measurements in, probability of interaction out).
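As a minimal sketch of this idea (not taken from the course material, which uses Matlab toolboxes), the Python example below tunes the parameters of a very simple classifier on training examples and then checks how well it generalizes to objects it has not seen. The synthetic two-class data, the nearest-mean classifier and all function names are assumptions chosen purely for illustration.

```python
import random


def make_data(n_per_class, rng):
    """Synthetic data (an assumption): two measurements per object,
    drawn from one unit-variance Gaussian per class."""
    data = []
    for label, centre in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n_per_class):
            x = [rng.gauss(c, 1.0) for c in centre]
            data.append((x, label))
    return data


def train_nearest_mean(examples):
    """Tune the parameters (one mean vector per class) on the training set."""
    sums, counts = {}, {}
    for x, label in examples:
        s = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            s[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in s] for label, s in sums.items()}


def predict(means, x):
    """Assign the label of the closest class mean (squared Euclidean distance)."""
    return min(
        means,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, means[label])),
    )


def accuracy(means, examples):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(means, x) == label for x, label in examples) / len(examples)


rng = random.Random(0)          # fixed seed so the run is repeatable
data = make_data(100, rng)      # 100 objects per class
rng.shuffle(data)
train, test = data[:140], data[140:]   # the test objects are never used for tuning

means = train_nearest_mean(train)
train_acc = accuracy(means, train)
test_acc = accuracy(means, test)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

The accuracy on the held-out test set, rather than on the training set, is the estimate of how well the classifier generalizes.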

In this course, we will introduce basic techniques from the fields of pattern recognition and machine learning to solve such problems. We will introduce the pattern recognition pipeline: measuring, feature extraction and selection, classification and evaluation. The first two days will introduce the basic classification problem and a number of classic approaches to solve it. Next, methods for selecting or extracting informative features from a large set of measurements will be introduced. This will be followed by an introduction to a number of unsupervised techniques, which make it possible to find natural groupings or probabilistic descriptions of (unlabeled) data. The course will end with a cursory introduction to a number of more intricate classifiers (artificial neural networks and support vector machines) and an overview of approaches to solve the generalization problem. For a large number of the methods discussed, we will turn to recent bioinformatics literature for examples.

Registration

You can register for this course by filling out the NBIC enrolment form. The registration deadline is December 17; the maximum number of participants is 40, so register soon to be sure of a course seat! Should the course be overbooked, PhD students in the BioRange programme will be given access first.

The course is free for most PhD students; more information about the course fees can be found on the enrolment page. The fee includes:

  • Course material: handouts and a lab course manual will be handed out at the start of the course. Software required for the lab course (Matlab toolboxes) will be made available online.
  • Catering: coffee, tea, soft drinks and lunch will be provided. Drinks will be organized in the afternoon of Tuesday January 18.

Information about hotel accommodation in Delft during this week can be found here. Participants who need accommodation have to book (and pay for) it themselves; it is not included in the course fee.

Material

The course material is available here, including a self-evaluation and some primers for preparation, handouts of the slides, a lab course manual and the required data and Matlab toolboxes. Note that there is no need to print out material; slide handouts and the lab course manual will be handed out in a folder to participants at the start of the course.

Examination

Certificates of participation will be handed out at the end of the course. PhD students requiring a certificate of successful completion should perform additional work. The student will analyse a biological dataset (preferably one from his/her own practice) using the tools provided in the course, and write a small report (5-10 pages) on the results. If the student has no dataset available, one will be provided. The report will have to be handed in no later than three weeks after the course has finished (i.e. by February 11, 2011). We will strictly adhere to this deadline; if you require an extension, you should contact us well in advance. The report will be graded "fail" or "pass", with one possible resubmission.

Format

The course will be given in the week of January 17-21, 2011. Preparation material on statistics and linear algebra will be distributed before the course, to be studied by students missing the required background. Finally, 2-3 days will have to be spent on the report to be handed in. Each day will have roughly the following layout:
9.30 - 12.30 Lectures
12.30 - 13.30 Lunch
13.30 - 17.30 Hands-on computer lab course

Tentative schedule

1. Monday 17-1-2011: Introduction
   Lecturer: Dick de Ridder
   Subjects: Introduction to pattern recognition: measurements, features, classification. Supervised vs. unsupervised learning, relation to regression. Bayesian framework: risk, cost; evaluation: ROCs, cross-validation. Density estimation: histograms, nearest neighbour, Parzen, Gaussian Bayesian classification.
2. Tuesday 18-1-2011: Classifiers
   Lecturer: Perry Moerland
   Subjects: Parametric classifiers: (D)LDA, (D)QDA. Nonparametric classifiers: k-NN, Parzen. Discriminant analysis: LDA, logistic regression. Decision trees and random forests.
3. Wednesday 19-1-2011: Feature selection and extraction
   Lecturer: Lodewyk Wessels
   Subjects: Feature selection: criteria, search algorithms (forward, backward, branch & bound). Sparse classifiers: Ridge, LASSO. Feature extraction: PCA, Fisher. Embeddings: MDS, LLE, Isomap.
4. Thursday 20-1-2011: Clustering and HMMs
   Lecturer: Perry Moerland
   Subjects: Hierarchical clustering. Agglomerative clustering. Model-based clustering: mixtures-of-Gaussians, EM. Hidden Markov models.
5. Friday 21-1-2011: Selected advanced topics
   Lecturer: Dick de Ridder
   Subjects: Artificial neural networks. Support vector machines. Classifier ensembles. Complexity and regularisation.

For more information about the course programme, please contact Dick de Ridder; for more information about registration or logistics, please contact Celia van Gelder.