Bioinformatics NSF REU


Research Projects

Bioinformatics-based Identification of Transmission-blocking Malaria Vaccine Candidates
Mentor: Dr. Kim Williamson, Department of Biology

http://www.luc.edu/biology/williamson.shtml

Malaria remains a major global health problem and is estimated to be responsible for the deaths of one to two million people a year. The recent completion of the genome of Plasmodium falciparum, the parasite that causes the most virulent human malaria, and the development of microarray and mass spectrometry techniques to determine whole-cell gene expression patterns have provided a large database for the identification of new vaccine and drug targets. This provides an opportunity for upper-level undergraduates to apply their basic biology and bioinformatics training and participate in the evaluation of genes as candidates for malaria control strategies.

The focus of this project will be to identify genes that are potential targets for strategies to block the transmission of P. falciparum parasites. To be transmitted from one person to another, mature sexual stage parasites must be picked up by a mosquito and differentiate into sporozoites that, once introduced into a human during a blood meal, can establish a new infection. During development in the mosquito midgut, the parasite is extracellular and directly exposed to the external environment for more than 24 hours. At no other time in its life cycle does the parasite live outside a host cell for more than an hour. The prediction, therefore, is that genes that are midgut stage-specific and encode membrane-associated proteins could have products exposed on the surface of the parasites as they develop extracellularly, making them targets for malaria transmission-blocking strategies. A variety of datasets are available that can be searched to identify malaria transmission-blocking candidates.

Large cDNA-based microarrays representing known and predicted proteins have been used to analyze mRNA levels throughout the parasite’s infectious life cycle. The data are publicly accessible online. Recently, proteins that are specifically expressed in male or female gametocytes were determined by mass spectrometry, and P. falciparum homologs have been identified in a related parasite, P. berghei, that causes malaria in mice but not humans. Characteristic features of membrane proteins previously identified in P. falciparum or other organisms, such as secretory signal sequences, repeated amino acid motifs, predicted transmembrane domains or membrane anchor addition signals, can also be determined from the primary genomic sequence data. Students will systematically search and cross-reference these databases to identify malaria transmission-blocking candidates.
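The cross-referencing step can be pictured as a set intersection over gene lists drawn from each dataset. The following is a minimal sketch only: the gene identifiers and the three example lists are hypothetical placeholders, not data from the project, and real datasets would need identifier mapping before they could be compared this way.

```python
# Hypothetical gene identifiers, for illustration only.
midgut_stage_specific = {"gene_A", "gene_B", "gene_C"}
membrane_associated = {"gene_B", "gene_C", "gene_D"}
has_berghei_homolog = {"gene_C", "gene_D", "gene_E"}


def cross_reference(*gene_sets):
    """Return, in sorted order, the genes present in every supplied set."""
    if not gene_sets:
        return []
    return sorted(set.intersection(*(set(s) for s in gene_sets)))


# Genes that satisfy both criteria named in the project description.
candidates = cross_reference(midgut_stage_specific, membrane_associated)

# Adding a third dataset narrows the list further.
strict_candidates = cross_reference(
    midgut_stage_specific, membrane_associated, has_berghei_homolog
)
```

Keeping each criterion as its own set makes it easy to add or drop a dataset and see how the candidate list changes.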


Discovery and in silico construction of consensus repetitive DNA elements from sequence-tag connectors (STC)
Mentor: Dr. Howard Laten, Department of Biology
http://www.luc.edu/biology/laten.shtml

We have been characterizing and evaluating the evolutionary history of retrotransposons and endogenous retroviruses in higher plant genomes. With the exceptions of Arabidopsis and rice, most plant genome projects are in early draft stages, and chromosomal regions containing repetitive sequences are not well represented in long contigs. Many retroelement families are present in plant genomes in copy numbers in excess of 1,000 and are well represented in BAC-end or STC sequences. Using the DNA at the termini of soybean retroelements we have previously characterized as queries in BLAST searches of the genome survey sequence (GSS) database at NCBI (primarily STC sequences), we have identified representatives of over two dozen repetitive retroelement families not previously found in soybean or in other legumes. Using reiterative BLAST searches of the GSS database with these initial hits as “seeds”, long, robust contigs representing full-length consensus sequences have been generated for a handful of these elements with the contig-building software in the Lasergene 6.1 suite. The lab provides opportunities to generate several additional full-length retroelements from unextended seeds and to discover new repetitive DNAs in soybean and other incompletely sequenced plant and animal genomes. The evolutionary history of the new elements can be addressed using PAUP and MEGA3, both available on the lab computers.
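The idea of growing a contig outward from a seed can be illustrated with a toy example. This sketch is not the lab's procedure: it replaces BLAST and the Lasergene contig builder with exact suffix/prefix matching, extends in one direction only, and uses made-up reads. The function names and the min_overlap parameter are assumptions introduced for illustration.

```python
def extend_right(contig, reads, min_overlap=4):
    """Append the longest extension offered by any read whose prefix
    exactly matches a suffix of the contig (at least min_overlap bases)."""
    best_extension = ""
    for read in reads:
        longest = min(len(read), len(contig))
        for k in range(longest, min_overlap - 1, -1):
            if contig.endswith(read[:k]):
                extension = read[k:]
                if len(extension) > len(best_extension):
                    best_extension = extension
                break  # longest overlap for this read found
    return contig + best_extension


def grow_contig(seed, reads, min_overlap=4, max_rounds=100):
    """Repeat the extension step until no read lengthens the contig."""
    contig = seed
    for _ in range(max_rounds):
        longer = extend_right(contig, reads, min_overlap)
        if longer == contig:
            break
        contig = longer
    return contig


# Made-up reads standing in for STC sequences.
reads = ["ACGTAC", "GTACGGA", "CGGATT"]
contig = grow_contig("ACGTAC", reads)
```

Each round plays the role of one reiterated search: the current contig end is the query, and any overlapping read supplies new sequence for the next round.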


Characterizing the Glycogen Interactome
Mentor: Dr. Miguel Ballicora, Department of Chemistry
http://www.luc.edu/chemistry/facultystaff/faculty_ballicora.shtml

We have preliminary evidence from the photosynthetic cyanobacterium Anabaena suggesting that enzymes involved in glycogen metabolism physically interact with each other. We hypothesized that this type of interaction is necessary to modulate the synthesis and/or degradation pathways. We are extending this investigation to E. coli and evaluating the interaction of the enzymes of glycogen metabolism with their subcellular environment (other proteins and polysaccharides). Glycogen synthase has a high affinity for glycogen and they co-precipitate when centrifuged at high speed. In order to explore whether other proteins bind to glycogen or to glycogen synthase, glycogen will be incubated with E. coli extracts and ultracentrifuged. The proteins that bind to glycogen will be analyzed by SDS-PAGE, proteolysis, HPLC and MALDI-TOF MS. Students will focus on the identification of a subset of these proteins using MS and peptide fingerprinting. The ultimate goal is to describe all the proteins that interact with glycogen directly or indirectly (“glycogen interactome”).

In this proposal, since the E. coli genome is known, students will focus on a few candidates and will try to identify them with bioinformatics tools based on the sequences of short peptides obtained through mass spectrometry. Once the proteins are identified, a phylogenetic analysis will be performed to improve functional prediction in cases where the identified proteins are products of gene duplications.
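Identifying a protein from short peptide sequences amounts, in its simplest form, to finding which known protein sequences contain those peptides. The sketch below assumes exact matching and uses invented protein names and sequences; a real analysis would search the full E. coli proteome and tolerate sequencing ambiguity.

```python
def find_candidates(peptides, proteome):
    """Map each protein name to the peptides found in its sequence."""
    matches = {}
    for name, sequence in proteome.items():
        found = [p for p in peptides if p in sequence]
        if found:
            matches[name] = found
    return matches


def rank_candidates(matches):
    """Order proteins by number of matching peptides, most first."""
    return sorted(matches, key=lambda name: (-len(matches[name]), name))


# Invented sequences, for illustration only.
proteome = {
    "protA": "MKTAYIAKQRQISFVK",
    "protB": "MSHHWGYGKHNGPEHW",
    "protC": "MGSSHHHHHHSSGLVPR",
}
peptides = ["AYIAK", "QISFVK", "HNGPEHW"]

matches = find_candidates(peptides, proteome)
ranking = rank_candidates(matches)
```

A protein supported by several independent peptides is a stronger candidate than one matched by a single short peptide, which is why the ranking counts matches.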

Simulation of Conformational Changes in Proteins
Mentor: Dr. Ken W. Olsen, Department of Chemistry
http://www.luc.edu/chemistry/facultystaff/faculty_olsen.shtml

The focus of the REU research in the Olsen laboratory will be to determine the pathways for the conformational changes in several proteins. In addition to participating in exciting scientific research, the REU student will gain experience with molecular mechanics calculations and several molecular graphics programs. The specific targets currently planned for the next three years are adenylate kinase, HIV protease and periplasmic binding proteins. Each of these undergoes a significant conformational change when the protein binds its ligand, and the structures of both conformations are known. The students will use the conjugate peak refinement method, as implemented in the molecular modeling program CHARMM, to determine the minimum energy pathway for each conformational change. They will then analyze the resulting pathways using the molecular graphics program VMD. The three known conformations of adenylate kinase are shown below.

The three target proteins are each important in their own right. Adenylate kinase is an important enzyme in energy metabolism that converts AMP and ATP to two molecules of ADP and is found in all organisms. It is a typical example of a family of proteins called NMP kinases, which includes guanylate kinase. Comparison of the structures of this family of enzymes shows that they are similar, indicating that the results for adenylate kinase may be applicable to other members of the family. HIV protease is an aspartic protease required for processing the HIV polyprotein precursor. Inhibition of this enzyme is a vital part of AIDS treatment. The periplasmic binding proteins are homologous bacterial transport proteins that undergo similar conformational changes. These results will allow us to assess the effects of sequence changes on the pathways for conformational transitions.


Investigating genomic sequences for the coevolution of viruses and their hosts
Mentor: Dr. Catherine Putonti, Departments of Biology and Computer Science
http://www.luc.edu/biology/putonti.shtml

The interactions between pathogenic organisms and their hosts have been observed for centuries, as pathogenic organisms are the causative agents of much of the world’s illness and mortality. In just the past few decades, many new diseases, or diseases once thought to be in decline, have emerged, posing an additional threat. The coevolution of pathogen and host is a likely factor not only in the emergence of infectious diseases but also in their persistence within the population. As the pathogenic organism evolves increased pathogenicity, the host must also evolve in order to defend itself and, ideally, clear the invading pathogen. Thus, the genomes of both the pathogen and host species are expected to change over time as the two coexist.

Host-specific adaptations have been observed at the sequence level (e.g. acquisition of host factors, mimicry of host protein structures) for numerous bacterial organisms as well as a few of the viral organisms with large genomes. Smaller, compact viral organisms, however, cannot accommodate such adaptations. Thus we pose the question: Are host-specific adaptations occurring within such organisms at the sequence level? It is these small viruses, such as West Nile virus, Ebola virus, Dengue virus, and Influenza, to name just a few, that pose some of the biggest threats to global health. The ability to recognize such adaptations could provide us with the opportunity to predict which pathogens will “jump” hosts, leaving their normal reservoir to successfully propagate within a new host.


Automated Tools for Iterative BLAST searches
Mentor: Dr. Chandra Sekharan, Department of Computer Science
http://www.cs.luc.edu/~chandra

Generating nearest neighbors of flanking sequences and consensus sequences of repetitive DNAs is usually done manually and takes a considerable amount of time, typically several tens of hours. Searching for these sequences in many genomes is made harder by the fact that the regions containing these DNAs are often the last to be covered by sequencing projects, and even nearly completed projects like the human genome have not fully described this class of DNAs. Students will focus, however, on the many incompletely sequenced plant genomes that are represented by large numbers of sequence-tagged connectors deposited in the genome survey sequence (GSS) database. Repeated BLAST searches on multiple databases are time-consuming if each query is submitted to the various databases manually. Students will help design software tools to make the process as automated as possible. GenBank has a C-based Application Programming Interface that developers can use to query various databases. The research project will use two methodologies to support the development of these software tools.

The first approach uses a tightly-coupled hardware architecture consisting of a networked cluster of machines with distributed memory. The computer science department at Loyola has a Gigabit network of anywhere from 5 to 60 computers available for this purpose. The second strategy uses resources available on the Internet and presents a distributed-database solution. This strategy requires no specialized infrastructure. The software developed in the first approach is expected to perform better than the second. However, computing over distributed databases has three advantages: (i) less setup time, (ii) suitability for people or facilities with limited computer hardware resources, and (iii) access to up-to-date databases.

The collection of BLAST programs is computationally expensive but easily parallelizable. Several methodologies exist for parallelizing BLAST, including multithreading, query partitioning, and database partitioning. Using the parallel programming library called Message Passing Interface (MPI), mpi-BLAST splits a database into several partitions so that each compute node in a cluster searches a unique portion of the database. Database partitioning offers two advantages for improving the performance of BLAST on large datasets. First, current sequence databases are larger than the physical memory on most computers, forcing BLAST searches to use hard disk input/output. Partitioning the database permits each node to search a smaller portion of the database, which can eliminate most disk I/O when each portion fits in memory. Because disk reads and writes can be up to 10,000 times slower than memory reads and writes, this would dramatically improve performance. mpi-BLAST needs a setup phase during which the sequence databases are formatted, partitioned, and distributed among computers on the high-speed network. BLAST queries are processed upon completion of the setup phase. Setup is accomplished by executing mpiformatdb, an MPI wrapper for the standard BLAST database formatting program. BLAST queries are executed by running mpiblast, an MPI wrapper for the standard blastall program that comes with the NCBI distribution. The mpiblast wrapper executes blastall on each individual node. Once all cluster nodes have completed the search, a master node aggregates the search results, and the result files are stored in a database such as Oracle 10g.
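The database-partitioning idea can be sketched without MPI or BLAST. In the toy example below, the database is split round-robin into one partition per node, each partition is searched independently, and a master step merges the hits. Everything here is an assumption for illustration: the shared 3-mer count is a stand-in for BLAST scoring, the sequences are made up, and the "nodes" are simulated by a loop in a single process rather than by separate machines.

```python
def partition(records, n_nodes):
    """Split the database round-robin into n_nodes partitions."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def kmer_score(query, subject, k=3):
    """Toy similarity score: number of distinct k-mers shared."""
    q = {query[i:i + k] for i in range(len(query) - k + 1)}
    s = {subject[i:i + k] for i in range(len(subject) - k + 1)}
    return len(q & s)


def search_partition(query, part):
    """Work done by one node: score the query against its partition."""
    return [(name, kmer_score(query, seq)) for name, seq in part]


def merge(per_node_results, top=3):
    """Work done by the master: combine, filter, and rank all hits."""
    hits = [h for result in per_node_results for h in result if h[1] > 0]
    return sorted(hits, key=lambda h: (-h[1], h[0]))[:top]


# Made-up database of (name, sequence) records.
database = [
    ("seq1", "ACGTACGT"),
    ("seq2", "TTTTTTTT"),
    ("seq3", "ACGTTTTT"),
    ("seq4", "GGGACGTA"),
]
query = "ACGTAC"

partitioned = merge([search_partition(query, p) for p in partition(database, 2)])
unpartitioned = merge([search_partition(query, database)])
```

Because every record is scored independently of the others, the merged result is the same however the database is divided, which is what makes this form of parallelism straightforward.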

While mpi-BLAST may be the best and most efficient approach among tightly-coupled machines, the research would also explore a distributed systems approach. The basic idea here is for a user to have a browser interface that automates the task of querying multiple databases (NR, HTGS, and GSS) on the World Wide Web and performing iterated BLAST searches to get the nearest neighbors and consensus sequences, as though the databases were present locally. This kind of tool is useful for researchers who do not have access to multi-processor systems or the expertise to support such systems. We also see this as an independent stream of computer science research, namely, virtual database systems.
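The control flow of an automated, iterated search over several databases might look like the sketch below. The search function is passed in as a parameter so that the loop itself does not depend on any particular service; the fake_search stub and its canned hits are hypothetical and exist only so the example runs. No real NCBI interface is called here.

```python
def iterated_search(seed, databases, search, max_rounds=10):
    """Query every database with the seed, then re-query with each new
    hit, until a round produces no unseen hits or max_rounds is reached."""
    seen = {seed}
    frontier = [seed]
    for _ in range(max_rounds):
        new_hits = []
        for query in frontier:
            for db in databases:
                for hit in search(db, query):
                    if hit not in seen:
                        seen.add(hit)
                        new_hits.append(hit)
        if not new_hits:
            break
        frontier = new_hits
    return seen - {seed}


# Hypothetical canned results standing in for remote BLAST searches.
FAKE_HITS = {
    ("GSS", "seed"): ["s1", "s2"],
    ("NR", "s1"): ["s3"],
    ("HTGS", "s3"): ["s1"],
}


def fake_search(db, query):
    """Stub search: look up canned hits for (database, query)."""
    return FAKE_HITS.get((db, query), [])


hits = iterated_search("seed", ["NR", "HTGS", "GSS"], fake_search)
```

Tracking the set of sequences already seen keeps the loop from resubmitting the same query, which matters when each submission is a slow remote request.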


Integrating Biopython and Web Services
Mentor: Dr. George K. Thiruvathukal, Department of Computer Science
http://www.cs.luc.edu/gkt

Biopython is part of a collection of tools and algorithms to support DNA, RNA, and protein sequence analysis, and it allows users to interact with sequences and other bioinformatics data structures at a very high level. In addition to what Windows-based graphical tools offer, Biopython provides the capability of using an intuitive scripting language (Python) to drive the algorithms via workflows. In contrast with the Dynamic Programming project, which has independent merit, the emphasis of this project is to leverage an existing framework for analyzing and aligning sequences and to expose the Biopython capabilities through a collection of web services. While Biopython represents a powerful framework for bioinformatics, its potential is limited by being tied to a single programming language and by not being integrated with other languages and tools in common use. Web services, built atop Microsoft's .Net platform or J2EE, are widely considered to be the future of most software applications and allow for unprecedented integration into problem-solving environments and tools for scientific computing. Students will do extensive programming using Python, Microsoft's .Net platform, Java 2 Enterprise Edition (J2EE), and a selection of open-source software development tools.


Application of Multivariate Techniques and Hidden Markov Models
Mentor: Timothy E. O’Brien, Department of Statistics
http://www.math.luc.edu/about/people/ft_faculty/tobrien

Students will develop improved scoring methods, substitution matrices (BLOSUM, etc.) and gap penalties for pairwise DNA sequence alignment. Distances between sequences are approximated using multivariate discrimination, classification and grouping techniques such as tree and clustering methods, and these methods will be reviewed and extended to handle amino acid (protein) sequences. First- and second-order Markov models and hidden Markov model (HMM) methods are proving useful in tackling alignment problems, and these techniques will also be applied to protein families and to the simultaneous alignment of more than two sequences.
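To show how a scoring scheme and a gap penalty combine into an alignment score, here is a minimal sketch of global pairwise alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The match, mismatch and gap values are arbitrary assumptions chosen for illustration; a substitution matrix such as BLOSUM would replace the simple match/mismatch rule, and the linear gap penalty is the simplest possible choice.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b under a simple
    match/mismatch scheme with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


identical = global_alignment_score("ACGT", "ACGT")
one_deletion = global_alignment_score("ACGT", "AGT")
```

Changing the gap or mismatch values changes which alignment is considered best, which is why the choice of penalties and substitution scores is itself a research question.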


Additional Faculty Projects


Projects are also available with the following faculty:

Liping Tong, http://www.math.luc.edu/~ltong/