There are numerous possible subjects for an MSc thesis project or internship in the Delft Bioinformatics Lab. The following is just a short list to give some flavour of the possible research areas. For inspiration on other possibilities, you can also browse a list of finished projects.
Single-molecule protein sequencing
Proteins are vital in all biological systems, as they are the working machinery of cells. Human cells contain more than 20,000 protein species, all expressed at different levels. Medical scientists read the amino acid sequences of proteins to analyze the protein expression profiles of human cells, and biologists do so to chart protein-protein interaction maps. Complete mapping, however, has not been achieved, since current sequencing techniques have intrinsic limitations.
The Chirlmin Joo lab at the Dept. of Bionanoscience is developing a single-molecule protein sequencer. This new sequencer will enable researchers to identify proteins with high fidelity using only a small quantity of sample (a few femtomoles). While DNA consists of only 4 nucleotides (A, G, C, T), proteins are composed of 20 amino acids (Figure 1). Since it is impossible to detect each individual amino acid at the single-molecule level, we are developing a protein sequencing method based on a fingerprinting approach (patent pending).
The project has two parts:
- Biological: labeling of proteins; reading protein sequences
- Computational: data analysis; sequence prediction
We are looking for a bachelor's or master's student who will work on the computational part, especially the sequence prediction. We currently use two methods: cross-correlation and the Smith-Waterman (dynamic programming) algorithm. The student will perform the following tasks:
- Test the robustness of the two methods
- Investigate other algorithms
- Develop an algorithm to reduce computing time
- Contribute to a scientific article as a leading author
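As an illustration of the dynamic programming method mentioned above, the following is a minimal sketch of Smith-Waterman local alignment in Python. The scoring values (match, mismatch, gap) and the idea of aligning short "fingerprint" strings over a reduced alphabet are assumptions made for this example only; they are not the lab's actual parameters or data format.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not tuned parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical fingerprints: a reference read and a shorter observed fragment.
score, ref_part, obs_part = smith_waterman("CKKCKC", "KKCK")
```

Filling the score matrix takes time proportional to the product of the two sequence lengths, which is one reason why reducing computing time is listed as a task when many reference proteins have to be scanned.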
For more information see http://www.chirlmin.org or e-mail dr. Chirlmin Joo (firstname.lastname@example.org),
dr. Margreet Docter (email@example.com), Jetty van Ginkel (firstname.lastname@example.org) or Dick de Ridder (email@example.com).
Systematic assessment of the PARADIGM method for integrating multiple genomic data types
High-throughput genomic data are frequently used to cluster and subtype cancer patients. Such patient clusters often relate to, e.g., different clinical subtypes of tumours or differences in expected survival. Clusters based on one genomic data type, e.g. gene expression, may differ from clusters based on another, such as copy number data. To unify such different clusterings, recent approaches have started to explore clusterings based on multiple genomic data types at once. One such approach is PARADIGM (Vaske et al., Bioinformatics, 2010). Interestingly, PARADIGM does not only produce patient clusters. It also models genomic modalities for which it is not given supervised data, such as protein activation, or even concepts as high-level as pathway activation. Modelling these so-called hidden variables may ultimately contribute to more meaningful patient clusterings. Unfortunately, the PARADIGM paper does not provide any assessment of the quality of the inferred hidden states. A lack of understanding of this part of the method also makes the main results of PARADIGM, the patient clusters, more difficult to interpret.
We would like to conduct a systematic evaluation of the hidden state modelling performed by PARADIGM. This evaluation may be based on several different strategies, such as:
1) Leaving out the supervised data, e.g. gene expression or copy number, for some genes. Then assessing to what extent PARADIGM successfully completes this missing information.
2) Obtaining additional data types for modalities that PARADIGM models, and again assessing how well PARADIGM predicts this data type. For example, PARADIGM models protein activation, without any supervised training data. However, measurements on protein activation do exist, and can be used to evaluate PARADIGM. Alternatively, we could add the protein activation data to the supervised training data of PARADIGM, and assess how much the resulting patient clusterings change as a result of replacing the inferred protein activations by actual measurements.
3) Generating synthetic data, i.e. simulating complete pathways and generating 'fake' high-throughput genomic data from these simulations. This has the advantage that we have complete knowledge of the genomic state of the 'tumour', and thus know exactly what PARADIGM should predict.
Successfully conducting this analysis may provide a valuable contribution to both the field of tumour subtyping, and more generally machine learning involving hidden state modelling.
Supervisors: Sander Canisius, Lodewyk Wessels; Bioinformatics and Statistics, The Netherlands Cancer Institute, Amsterdam; http://bioinformatics.nki.nl/
Contact: Lodewyk Wessels.
Influenza virus epitope mapping for antibody development
Crucell is a global biopharmaceutical company dedicated to bringing meaningful innovation to global health. Crucell focuses on the research & development, production and marketing of vaccines and antibodies against infectious disease worldwide. One target is the influenza virus. This virus constantly mutates its glycoproteins in order to mask itself against the human immune system. One of these highly variable proteins on the virus surface is hemagglutinin (HA). In recent years, Crucell scientists have discovered broadly neutralizing antibodies against HA and revealed their mechanism of action. These antibodies target highly conserved epitopes on the HA surface.
We would like to explore the possibility of mapping epitopes of HA antibodies by analyzing HA structure and conservation. For example, assuming that an antibody binds the H1, H5 and H7 subtypes, we would like to identify possible antibody epitopes that are common to these subtypes but differ on subtypes against which the antibody is not active. This approach could indicate sites for follow-up mutagenesis experiments.
A large amount of both sequence and structure data for HA is accessible. There are around 50 HA crystal structures in the Protein Data Bank, and 12,000 non-redundant HA sequences are stored in the NCBI flu database. Both databases will have to be used during the project. The goal of the project is to develop a tool that could help in the epitope mapping process. The project will consist of the following milestones:
1) Identification of HA surface
2) Iteration through possible antibody epitopes on the HA surface.
3) Excluding glycosylated epitopes
4) Calculation of the conservation index of the epitopes for given subtypes or strains.
5) Scoring of the best candidates and presentation on HA structure
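Milestone 4 could, for instance, be prototyped as sketched below. The project description does not define the conservation index, so this example assumes one possible choice: one minus the normalised Shannon entropy of each alignment column, averaged over the positions of a candidate epitope. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_conservation(column):
    """Conservation of one alignment column: 1 - entropy / log2(20).

    Gaps ('-') are ignored; a column with only gaps scores 0.0.
    This definition is an assumption, not the project's prescribed index.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(20)


def epitope_conservation(aligned_seqs, positions):
    """Mean column conservation over the alignment positions of an epitope."""
    columns = [[seq[p] for seq in aligned_seqs] for p in positions]
    return sum(column_conservation(col) for col in columns) / len(positions)


# Toy alignment of four equally long sequences (hypothetical data).
alignment = ["ACD", "ACE", "ACD", "ACE"]
conserved = epitope_conservation(alignment, [0, 1])  # fully conserved columns
variable = epitope_conservation(alignment, [2])      # column with D/E variation
```

Restricting `aligned_seqs` to the sequences of selected subtypes or strains gives a per-subtype index, which is what the comparison between bound and unbound subtypes would need.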
Once the tool is developed, it will be tested on a number of publicly available and in-house datasets.
The candidate should have good programming skills, preferably in Perl (or C++), and an interest in protein structure. Attention to detail and problem solving are the primary personal skills required.
Contact: Dick de Ridder or Jaroslaw Juraszek, Scientist Antibody Discovery; Jaroslaw.Juraszek [at] crucell.com.
Identification of driver events based on recurrent mutations
When performing high-throughput experiments on a set of tumor samples or cell lines, one typically wants to identify which genes are recurrently mutated in many samples and how significant this is. A few methods have been proposed, based on different statistics and different ways to compute a background mutation rate (Sjöblom et al., 2007; Getz et al., 2007, Comment; Rubin and Green, 2007, Comment; Dees et al., 2012). However, there seems to be no standard method: depending on the experiments, researchers use ad hoc methods combined with manual curation.
We have developed a method that compares the observed mutations with a constant mutation rate estimated from all mutations in the dataset. We would like to investigate alternative ways to define the background mutation rate. In addition, most genes have a low mutation frequency, so it is not enough to investigate this question at the gene level. If several genes from the same pathway are frequently mutated as a group, we want to be able to detect this as well. Consequently, we need to define and assess methods to identify significantly mutated pathways. Finally, investigating the position of the mutations in the context of functional domains in the associated proteins can give some functional information about the consequences of the mutations.
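To make the idea of testing against a constant background rate concrete, here is a minimal sketch. It is not the group's actual method: it assumes that mutations at each site occur independently with one global rate (total mutations divided by total sequenced sites), and it scores each gene with an upper-tail binomial probability. The gene names, lengths and counts are hypothetical.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X < k)."""
    if k <= 0:
        return 1.0
    lower = sum(math.exp(binom_log_pmf(i, n, p)) for i in range(k))
    # Clamp small negative values caused by floating-point rounding.
    return max(0.0, 1.0 - lower)


def gene_p_values(mutations, lengths, n_samples):
    """Per-gene p-values under a single constant background mutation rate."""
    total_sites = sum(lengths.values()) * n_samples
    rate = sum(mutations.values()) / total_sites
    return {gene: upper_tail(mutations[gene], lengths[gene] * n_samples, rate)
            for gene in mutations}


# Hypothetical counts: two genes of equal length, sequenced in 10 samples.
p_values = gene_p_values({"GENE_A": 10, "GENE_B": 1},
                         {"GENE_A": 1000, "GENE_B": 1000},
                         n_samples=10)
```

Alternative background definitions, such as rates that depend on sequence context or on the sample, would replace the single `rate` value by a per-gene or per-site expectation. The p-values would also still need correction for multiple testing across genes.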
We propose to develop methods to identify significant mutations at the domain, gene and pathway levels following the suggested plan:
- Identification of significantly mutated genes
  - Propose alternative ways to define the background mutation rate
  - Compare the proposed methods with existing ones
- Identification of significantly mutated pathways
  - Update the methods to make them applicable to pathways
- Identification of significantly mutated protein domains
  - Map mutations on functional domains
  - Update the methods previously developed
  - Compare the domain frequency of the mutations with existing functional scores
We are looking for a motivated student with a strong background in computational biology, statistics and the R programming language to work on this topic. There will be ample opportunity to bring forward your own ideas. The project will be carried out at the Netherlands Cancer Institute (NKI-AVL) in Amsterdam and the expected duration of the project will be 6 months. During this time a report has to be written about the work performed. Supervision will be performed by Magali Michaut and Lodewyk Wessels (Bioinformatics and Statistics group).
For further information contact: Magali Michaut (m.michaut [at] nki.nl) or Lodewyk Wessels (l.wessels [at] nki.nl).
Single nucleotide polymorphisms (SNPs) constitute the majority of differences between the genetic codes of individuals, and are considered to be an important factor in many diseases. Recently developed technology makes it possible to measure hundreds of thousands of SNPs in large cohorts, allowing for genome-wide association (GWA) studies on disease phenotypes such as cancer, dementia, diabetes, etc. Contrary to what was previously thought, SNPs do not primarily affect phenotype by changing protein amino acid sequences; evidence is emerging that many SNPs have subtle regulatory influences.
The common disease-common variant hypothesis states that complex (non-monogenic) diseases may be caused by a number of such SNPs, each of which by itself explains the disease only marginally. This hampers GWA studies, as these SNPs will have to be found among many false positives, requiring studies to have sample sizes of over 20,000 individuals for most complex traits/diseases. Therefore, there is a need for a thorough, unbiased method of prioritizing SNPs for follow-up studies and validation. Such a method should be based on evidence of SNP effect on phenotype and of biological functionality, thereby complementing the hypothesis-free GWA approach.
In this project, we aim to develop such an integrated method of SNP prioritization, and to quantify the effect of SNPs based on their genomic properties (in exons, introns, upstream regions, ...) by performing a meta-analysis of recently published reports on GWA studies and studies linking genotype to gene expression.
This project calls for a good background in machine learning, good programming skills and good communication skills.
Contact: Dick de Ridder
Preliminary project descriptions
A problem that becomes more pressing as more genome-wide data on various levels of cellular organisation (genome, transcriptome, proteome, metabolome) become available is how "natural" models for each of these levels can be connected. Should models at each level be summarised (e.g. integrated over time) to use as an input to other levels, or is (stochastic) simulation a better tool? This will of course depend on the question to be answered.
Contact: Dick de Ridder
Using enzyme characteristics to improve specificity prediction
We have recently developed a method to predict enzyme specificity based on reaction information in BRENDA. In this system, the only data used to predict the function of an enzyme is the set of reactions it is known to catalyze. It is likely that performance can be increased by also taking into account characteristics of the enzymes (e.g. homology/similarity to other enzymes, domains, structural features) or of the binding (e.g. predicted protein-ligand binding).
Contact: Dick de Ridder
Structured output prediction for enzyme specificity prediction
We have recently developed a method to predict enzyme specificity based on reaction information in BRENDA. I am interested to learn whether kernel-based methods for structured output prediction, recently developed in the field of machine learning, can be applied to improve this system (or alternative formulations). Such predictors could have chemical compound graphs as outputs.
Contact: Dick de Ridder
A genomic mutation rate map
In a recently started project, we are interested in finding single-nucleotide polymorphisms (SNPs) and insertions and deletions (indels) in sequences of entire genomes of evolutionarily engineered micro-organisms. To attach a Bayesian measure of reliability to such calls, I would like to have an indication of the prior probability of a SNP or indel at each position on the genome, i.e. an estimate of the mutation rate for each base pair. Such an estimate would have to be based on basic biochemistry and perhaps some model of genomic instability (using local sequence characteristics).
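The role of such a per-position prior can be illustrated with Bayes' rule. The sketch below is only an assumption about how the prior might be combined with sequencing evidence: `lik_variant` and `lik_reference` stand for the probability of the observed reads given that the position is, or is not, mutated, and both would have to come from a separate (hypothetical) read error model.

```python
def variant_posterior(prior, lik_variant, lik_reference):
    """Posterior probability of a variant at one position via Bayes' rule.

    prior         -- prior probability of a mutation at this position
    lik_variant   -- P(observed reads | variant present)   (assumed given)
    lik_reference -- P(observed reads | no variant present) (assumed given)
    """
    numerator = prior * lik_variant
    denominator = numerator + (1.0 - prior) * lik_reference
    return numerator / denominator if denominator > 0 else 0.0


# With the same read evidence, a position with a low prior mutation rate
# yields a much less reliable call than one with an uninformative prior.
flat = variant_posterior(0.5, 0.9, 0.1)
rare = variant_posterior(1e-6, 0.9, 0.001)
```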
Contact: Dick de Ridder