Note: This page will be discontinued. Please visit my new webpage HERE and update your bookmarks.
Ever wondered how we found out that humans share more than 90% of their DNA with chimpanzees? Or how the DNA sequence of a cell is determined in the first place? Have you ever thought about the tree of life and tracing human origins? Are you interested in learning how computers are being used to design drugs and personalized medicine for fighting cancers and viruses? How can we design artificial life?
If you are a mathematician or a computer scientist who is intrigued by these or similar questions, you will appreciate Bioinformatics.
Bioinformatics is an interdisciplinary scientific field that develops mathematical formulations, algorithms and software for producing, storing, retrieving, organizing and analyzing biological data. Let’s see what this means.
First, let’s talk about biological data and what it means. Biology has progressed to the point that experts now say the 21st century belongs to Biology. This has been made possible by tremendous advances in the extraction of data from biological systems through technologies such as sequencing, electron microscopy, X-ray crystallography, NMR and many others. As an example, the number of organisms with sequenced genomes, i.e., organisms whose DNA is now known through sequencing, is growing exponentially. Most biological data is large, noisy and hard or even impossible for humans to interpret without assistance from computers. However, this data is extremely useful. For example, knowing the DNA sequence of an organism allows one to find out, among other things, its evolutionary history, its genes, its disease susceptibility and its genetic response to its environment.
The major problem in Biology is to move from raw data to useful information. This is where computer science comes into play. Solving biological problems through computer science is called Bioinformatics or computational biology. As mentioned earlier, these problems include the storage and retrieval of biological data (because it’s huge!), its organization and, most importantly, its analysis (because we want to make sense of it!). Research in Bioinformatics has produced algorithms and software that convert raw data from biological systems into useful information. For example, sequencing technologies such as next generation sequencing generate a very large number of very short fragments, called reads (about 500-1000 base pairs each), from the very long genome string (about 3.2 billion base pairs long for human). These reads then need to be mapped and aligned to produce the whole genome. You can think of this as a huge jigsaw puzzle. We can use short read mapping and alignment algorithms to solve this problem. Once the genome is known, we can use phylogenetic tools to construct phylogenetic trees, like the tree of life, that tell us about the evolutionary origins of an organism and its relationships with other life. We can also use sequencing information and differential expression algorithms to find out what makes a certain variety of crop resistant to drought. Techniques in computational proteomics and structural Bioinformatics can allow us to design new proteins with novel functions to solve global problems such as pollution, energy shortages and diseases. And the best part of it all: Bioinformatics can help in finding out how life works!
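To make the jigsaw puzzle analogy concrete, here is a minimal sketch of read mapping against a known reference. It is a toy seed-and-verify scheme, not the algorithm of any particular mapping tool: the function names, the tiny genome string, the seed length `k` and the mismatch limit are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k, max_mismatches=1):
    """Return (start, mismatches) for the best placement of the read, or None."""
    best = None
    # Every k-mer of the read is tried as a seed (exact match in the index).
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            start = hit - offset
            if start < 0 or start + len(read) > len(genome):
                continue  # the read would hang off the end of the genome
            # Verify the whole read at this candidate position.
            target = genome[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, target))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


# Toy data (hypothetical): a 29-letter "genome" and three reads.
genome = "ACGTTGCATGCCGATAGCTAGGCTTAACG"
index = build_kmer_index(genome, k=4)
reads = ["TGCATGCC", "GATAGCTT", "TTTTTTTT"]
placements = {read: map_read(read, genome, index, k=4) for read in reads}
```

The first read matches the genome exactly, the second is placed with one mismatch, and the third has no seed in the index and stays unmapped. Real mappers use compressed indexes and gapped alignment so that this idea scales to billions of base pairs.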
Reasons for a computer scientist to be engaged in Bioinformatics research
This is definitely a good time to be a computer scientist and to work in Bioinformatics. Here are some of the reasons why a computer scientist should consider working (or why we work) in this area:
Below is a listing of some of the projects we are working on.
You can think of the biological cell as a chemical factory. In this factory, the genome acts as a pseudo-template for the generation of proteins. These proteins are the workhorses of the cellular factory and are, with a tip of the hat to DNA and RNA, the most important macromolecules in the cell. Almost all cellular processes directly involve proteins. This can be appreciated by considering that about 50% of the dry weight of the human body is protein and about 90% of the dry weight of a red blood cell is a single protein called hemoglobin, which is responsible for the transport of oxygen to cells. Keeping true to our chemical factory analogy, these protein workers interact with each other to perform different cellular functions. These interactions between proteins are made possible through different physicochemical phenomena such as hydrogen and covalent bonds, van der Waals forces, the hydrophobic effect, etc.
When two protein molecules bind to each other to form a complex, they do so at specific interfaces or binding sites. The study of protein interfaces and binding sites is a very important domain of research in Bioinformatics. Information about the interfaces between proteins can be used not only in understanding protein function but can also be directly employed in drug design and protein engineering. However, the experimental determination of protein interfaces is cumbersome, expensive and not possible in some cases with today’s technology. As a consequence, the computational prediction of protein interfaces from sequence and structure has emerged as a very active research area.
We work on the development of machine learning based methods for predicting protein interactions, interfaces and binding sites. The basic approach is to use data from known protein complexes to train a machine learning model to predict whether and how two previously unseen proteins bind to each other. Below, we describe the methods developed in this lab together with my collaborators for predicting protein interactions and interfaces.
We are currently engaged in research on predicting interactions between host and pathogen proteins. Such interactions form the basis of pathogen-borne infectious diseases.
To read more about our work on host pathogen protein interactions, the interested reader is referred to this webpage.
PAIRpred (Partner Aware Interacting Residue PREDictor) is a partner-specific protein-protein interaction site predictor that predicts whether a pair of residues from two different proteins interact or not. It differs from most existing interaction site predictors in that it uses information about the interaction partner of a protein in making its predictions, whereas most other methods produce partner-independent predictions. It employs a Support Vector Machine (SVM) to generate interaction propensity scores for a pair of residues from sequence information alone or in conjunction with structure-based features. PAIRpred offers state-of-the-art prediction accuracy. More details about how PAIRpred works and its performance evaluation are available here. We are also focusing on predicting interfaces between different proteins from Flaviviridae (a virus family that includes Dengue, Hepatitis C, West Nile, etc.). This work has been published in Proteins.
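To illustrate what a kernel over pairs of residues can look like, the sketch below builds a symmetric pairwise kernel from a base kernel on single residues. This is one common construction for learning on pairs and is shown only as an assumption: the feature vectors, the RBF base kernel, the `gamma` value and the function names are hypothetical, and the features and kernel actually used by PAIRpred are described in the publication rather than reproduced here.

```python
import math


def rbf(x, y, gamma=0.5):
    """Base kernel between two residue feature vectors (Gaussian / RBF)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)


def pairwise_kernel(pair1, pair2, base=rbf):
    """Similarity between two residue pairs (a, b) and (c, d).

    Two pairs are similar if a resembles c and b resembles d,
    or if a resembles d and b resembles c.
    """
    (a, b), (c, d) = pair1, pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)


# Hypothetical 3-dimensional feature vectors for four residues.
a, b = (0.1, 0.9, 0.0), (0.8, 0.2, 0.5)
c, d = (0.2, 0.8, 0.1), (0.7, 0.1, 0.6)

k_ab_cd = pairwise_kernel((a, b), (c, d))
k_ba_cd = pairwise_kernel((b, a), (c, d))  # same value: the order within a pair is ignored
```

A kernel of this kind can be handed to any SVM implementation that accepts a precomputed kernel matrix, with one row and column per residue pair in the training set.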
Shown below are the interactions (orange and red lines) between human ISG15 protein and the NS1 protein from Influenza A virus predicted using PAIRpred. These predictions agree well with experimental findings and have been used to infer why Influenza A infects only humans and non-human primates.
Currently, we are working on developing a webserver for PAIRpred and improving its prediction accuracy for complexes involving large binding-associated conformational changes.
Calmodulin is a very important protein that is almost perfectly conserved across all higher organisms. It is involved in a large number of critical biological functions ranging from neuronal spiking to breathing, and it interacts with a large number of other proteins. Together with Dr. Asa Ben-Hur, Dr. Minhas developed a sequence-based predictor of binding sites on proteins that bind Calmodulin using machine learning. The interesting part of this work was that the annotated binding sites of Calmodulin binding proteins were imprecise and spanned an area larger than the true binding site. Consequently, we developed a novel multiple instance learning method for learning from such noisy and imprecise data. Our algorithm is called MI-1 and is currently the state-of-the-art predictor of binding interfaces in Calmodulin binding proteins. This work has been published in Bioinformatics. The web server for this application can be accessed here.
Shown below is an example prediction generated by MI-1. It shows the interaction of Bordetella pertussis adenylyl cyclase toxin (PDB entry: 1yrt) with Calmodulin. The binding of adenylyl cyclase to Calmodulin is one of the two mechanisms that allow the whooping cough bacterium to colonize the respiratory tract. The colors on adenylyl cyclase (from blue to red) indicate the sequence-only predictions generated by MI-1. The maximum propensity of interaction occurs at W242 (shown in stick form, encircled in green), which is known to be an interaction site of adenylyl cyclase with Calmodulin. This protein was not part of the training set of MI-1.
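The multiple instance idea can be sketched as follows: an imprecisely annotated region is treated as a bag of candidate windows, at least one of which is assumed to contain the true binding site, so the bag is scored by its best window. This is a generic illustration of that assumption and not the MI-1 algorithm itself. The scoring function, the per-residue weights, the window width and the sequence are all hypothetical.

```python
def candidate_windows(sequence, start, end, width):
    """All (position, window) pairs of the given width inside the region [start, end)."""
    return [(i, sequence[i:i + width]) for i in range(start, end - width + 1)]


def instance_score(window, weights):
    """Score one window; here simply a sum of per-residue weights (toy model)."""
    return sum(weights.get(residue, 0.0) for residue in window)


def bag_score(sequence, start, end, width, weights):
    """Score the annotated region by its best window: (best_position, best_score)."""
    scored = [(pos, instance_score(win, weights))
              for pos, win in candidate_windows(sequence, start, end, width)]
    return max(scored, key=lambda item: item[1])


# Hypothetical weights favouring bulky hydrophobic and basic residues.
weights = {"W": 2.0, "F": 1.5, "I": 1.0, "L": 1.0, "K": 0.5, "R": 0.5}
sequence = "GSGSKRWFIALGSGS"

# The annotation says "somewhere in residues 2..12"; the best window localizes it.
best_position, best_score = bag_score(sequence, start=2, end=13, width=4, weights=weights)
best_window = sequence[best_position:best_position + 4]
```

During training, a method built on this assumption only needs the label of the whole annotated region, and the maximization step decides which window inside it is treated as the positive example.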
Metabolomics is the study of the unique chemical fingerprints that are left behind by different cellular processes. RAMClust is a technique that uses clustering to group the features resulting from Liquid Chromatography and tandem Mass Spectrometry analysis of metabolomic samples such as urine, serum, etc. This work has been published in Analytical Chemistry and is being used at the Metabolomics and Proteomics Facility at Colorado State University. We have also filed a provisional patent application for this algorithm through CSU Ventures.
We are beginning our work in the area of protein design and the study of molecular dynamics. Currently we have configured the GROMACS package on our local machines together with PyRosetta. Below is a very short (10 ns) simulation of the protein 1AKL that we have obtained using GROMACS. Stay tuned for more news in this area!
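As a rough illustration of grouping mass spectrometry features by clustering, the sketch below scores two features as similar when they elute at nearly the same retention time and their intensities rise and fall together across samples, and then groups features whose similarity exceeds a threshold. The similarity function, the `sigma_t` and `sigma_r` parameters, the threshold and the toy data are assumptions made for this example and should not be read as the published RAMClust definition.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant intensity profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def similarity(f, g, sigma_t=1.0, sigma_r=0.5):
    """Product of a retention-time term and a correlation term, each in [0, 1]."""
    dt = f["rt"] - g["rt"]
    c = pearson(f["intensity"], g["intensity"])
    time_term = math.exp(-dt ** 2 / (2 * sigma_t ** 2))
    corr_term = math.exp(-(1 - c) ** 2 / (2 * sigma_r ** 2))
    return time_term * corr_term


def group_features(features, threshold=0.5):
    """Group features linked by similarity >= threshold (connected components)."""
    labels = list(range(len(features)))
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if similarity(features[i], features[j]) >= threshold:
                old, new = labels[j], labels[i]
                labels = [new if label == old else label for label in labels]
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    return sorted(groups.values())


# Toy features (hypothetical): retention time in seconds, intensity in four samples.
features = [
    {"rt": 120.0, "intensity": [10.0, 20.0, 30.0, 40.0]},
    {"rt": 120.3, "intensity": [5.0, 10.0, 15.0, 20.5]},
    {"rt": 300.0, "intensity": [40.0, 30.0, 20.0, 10.0]},
    {"rt": 300.2, "intensity": [8.0, 6.0, 4.2, 2.0]},
]
groups = group_features(features)
```

Features 0 and 1 end up in one group and features 2 and 3 in another, which is the behaviour one would hope for if each group corresponds to signals from a single compound.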