
The central theme of the Center is the multiscale analysis of cellular networks. This theme is manifested in three Driving Biological Projects (DBPs) that target broad areas of basic biology research, including (a) tackling the issue of biomolecular interaction directly, at the structural and physicochemical level, (b) constructing a context-specific map of cellular interactions, and (c) using such a map to dissect complex diseases.
The biological questions posed by the DBPs will generate the requirements to drive the biomedical computation research that will be carried out by the Center:
- Computational Sciences: basic research in computer science will address a number of computational and algorithmic challenges raised by the DBPs. Research activities will include the development of (a) Machine Learning (ML) algorithms for evidence integration, classification, and inference, (b) natural language processing algorithms, and (c) software engineering methodologies and frameworks for the assembly of the Center's repository-based software platform.
- Computational Biology and Biomedical Informatics: novel algorithms geared towards specific biomedical applications will be developed using both knowledge-based and physics-based approaches and leveraging methods yielded by the Computational Sciences research. These algorithms will be combined with existing and new databases to build a modular and extensible bioinformatics platform (geWorkbench) and an associated software toolkit for the analysis of biomolecular interactions. This platform will extend a very successful project for bioinformatics component interoperability, caWorkbench, which is already playing an important integrative role within the caBIG biomedical community.
COMPUTATIONAL SCIENCES
Efforts in this area will provide critical expertise in the advancement of theoretical knowledge-based methods that will then be applied to the solution of specific biomedical problems. Research will be carried out by investigators in Columbia's School of Engineering and Applied Science (SEAS) and the Columbia University Medical Center (CUMC). Of the SEAS researchers, Christina Leslie and David Waltz are at the Center for Computational Learning Systems (CCLS); Rocco Servedio, Yechiam Yemini, Gail Kaiser and Kenneth Ross are in the Computer Science Department (CS); and Chris Wiggins is a faculty member in Applied Physics and Applied Math (APAM). Of the CUMC researchers, Andrea Califano, Carol Friedman, and Yves Lussier are faculty members in the Department of Biomedical Informatics (DBMI). Planned research projects follow three leading themes:
- Machine Learning (ML): CCLS specializes in ML theory, algorithms, and applications, and includes developers of two of the most important modern large-margin ML methods, Support Vector Machines (SVMs; Vapnik) and boosting (Freund). Participating investigators are broadly versed in all modern ML methods, including PAC learning, random forest learning, kernel-based methods, kNN methods, bottleneck methods, information-theoretic methods, clustering/module-discovery methods, and graphical methods. They have been innovators in the use of ranking rather than classification for more accurate prediction, learning in the absence of a "gold standard" for training, and training using weighted and uncertain evidence. Projects in ML are divided across three topics:
  - Protein Function, Structure, and Interactions: This includes the development of algorithms for (a) evidence integration (Peer) and design of SVM kernels (Bio-Kernels), and (b) identification and classification of pockets on protein structures (Pockets).
  - Reverse Engineering of Gene Regulatory Networks: This includes an information-theoretic algorithm (ARACNE) and two boosting algorithms (GeneClass and MEDUSA) that integrate sequence and expression data to learn regulatory interactions predictive of mRNA expression data.
  - Network-Theoretic Analyses: These include a graph-diffusion-based method for protein similarity analysis (RankProp), large-margin ML methods for inferring evolutionary mechanisms from biological network topologies (NetClass), and a parameter-free algorithm for organizing networks into modules (InfoMod).
- NLP and Ontologies: The lead investigators (Friedman & Lussier) are innovators and leaders in the use of Natural Language Processing (NLP) for extracting biological knowledge from text databases. Their research in new NLP methods directly impacts the Reverse Engineering and the Phenotypes projects. Ontologies will also be used to define complex biomedical informatics concepts and their relationships for component interoperability and interface design. The key effort here will be to bridge the gap between the NLP systems and the standard phenotype schemas and ontologies specified by the biological community. The NLP projects will build on MedLEE, which processes patient reports; GENIES, which captures biomolecular interactions from the literature; and BioMedLEE, which captures genotypic-phenotypic relations associated with the underlying causes and treatments of diseases.
- Large Scale Systems: CS systems researchers (Yemini, Kaiser, and Ross) are experts in interoperability, complex distributed systems, and database technologies. They are innovators in modern software engineering technologies including object-oriented languages, self-diagnosing and self-healing systems, and publish-subscribe (pub-sub) technology. Dr. Califano has been involved in a number of academic and industrial large-scale software development efforts, including the development of caWorkbench, which will constitute the foundation of the MAGNet Center bioinformatics platform. He leads Columbia University's activities in caBIG, the NCI-sponsored effort to establish a grid of interoperable bioinformatics services for cancer research. The main goals of this area are (1) to develop a formal Biomedical Informatics Structured ONtology (BISON) for the representation of bioinformatics data structures and data-structure transformations (algorithms, applications, tools); and (2) to develop a semantic layer (GeneTegrate) that captures all bioinformatics objects within an object-relationship graph model. This will simplify the discovery of important relationships by graph traversal.
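To make the idea of discovering relationships by graph traversal concrete, the following Python sketch runs a breadth-first search over a small object-relationship graph. It is only an illustration under stated assumptions: the node names, relation labels, and the `find_path` helper are hypothetical and do not describe the actual GeneTegrate data model or API.

```python
from collections import deque


def find_path(graph, start, goal):
    """Breadth-first search over an object-relationship graph.

    Returns the shortest chain of objects linking start to goal,
    or None when the two objects are not connected.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Each outgoing edge is a (relation label, target object) pair.
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical graph: every object and relation name below is made up.
graph = {
    "sequence:X": [
        ("sequence_of", "protein:X"),
        ("input_to", "component:sequence_analysis"),
    ],
    "protein:X": [("regulates", "gene:Y")],
    "component:sequence_analysis": [("produces", "annotation:X_domains")],
}

path = find_path(graph, "sequence:X", "gene:Y")
print(path)  # ['sequence:X', 'protein:X', 'gene:Y']
```

Storing data objects and software components as nodes of the same graph means a single traversal routine can answer both "which gene is linked to this sequence?" and "which component can consume this object?".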
COMPUTATIONAL BIOLOGY AND BIOMEDICAL INFORMATICS SCIENCES
Efforts in this area will target the development of novel algorithms geared towards specific biomedical applications using both knowledge-based and physics-based approaches. Research will be carried out by investigators in Columbia University Medical Center (CUMC) in collaboration with researchers from Columbia's School of Engineering and Applied Science (SEAS). All investigators are affiliated with the Center for Computational Biology and Bioinformatics (C2B2). Drs. Barry Honig and Burkhard Rost are in the Department of Biochemistry and Molecular Biophysics (CUMC); Drs. Andrea Califano, Carol Friedman, Andrey Rzhetsky, Yves Lussier, Paul Pavlidis, and Dennis Vitkup are in the Department of Biomedical Informatics (CUMC); Dr. Bussemaker is in the Department of Biological Sciences (SEAS); and Dr. Chris Wiggins is in the Department of Applied Physics and Applied Math (SEAS). Research projects are organized around four leading themes:
- Sequence and structure based annotation of protein function (specifically protein-protein interactions): In the context of the Northeast Structural Genomics Consortium (NESG), the Honig and Rost groups are clustering protein sequences into individual domain families and using structural information to annotate each of these clusters in terms of biological function. They are also developing methods for a new structure prediction pipeline, which will generate homology models once the structure of one or more members of a sequence cluster has been determined. Building on this ongoing research, MAGNet-specific activities will include the development of new sequence and structure-based approaches for functional annotation and protein-protein interaction analysis. The new algorithms will make use of evidence integration methods and will be integrated into the Center's software platform (geWorkbench), providing a unified suite of programs.
- Cellular interaction reverse engineering algorithms: By leveraging the core Computational Sciences methods, we will implement a variety of tools for the inference of molecular interactions in the cell. These include protein-DNA, protein-protein, and protein-mRNA interactions, as well as the interaction of small molecules with any of these macromolecular structures. In particular, we will implement algorithms for the reverse engineering of cellular interactions from experimental and literature data using regression, NLP, and information theory. These methods will be used (a) to create a cellular network knowledge base, (b) to identify regulators responsible for activating and deactivating specific interactions (e.g. a kinase activating the transcriptional interaction between a transcription factor and a target gene, via phosphorylation of the TF), and (c) to identify modular control structures conserved across distinct cellular states or types.
- Using cellular and molecular phenotypes for context filtering: Statements such as "gene Y is a transcriptional target of protein X" are not universally true. For instance, they may be true in yeast and Drosophila but not in mammalian cells. More importantly, when cells are organized into distinct cellular phenotypes (i.e. a distinct tissue or disease state), these statements may be true or false in a phenotype-dependent manner. Finally, at the molecular level, the transcriptional activation of gene Y by protein X may be contingent on protein X being activated by an acetylation or phosphorylation event. Hence, simple integration of the evidence across algorithms and databases will not be useful unless the molecular and cellular contexts are fully accounted for. These issues will be addressed using formal ontologies across the entire continuum, from the molecular, to the cellular, to the disease-related level.
- Software platform (geWorkbench): The methods, models, and data produced in the context of all MAGNet Center activities will be provided as interoperable, grid-enabled components of a state-of-the-art bioinformatics platform, geWorkbench. This will allow them (a) to be integrated with a variety of other existing bioinformatics modules for the analysis, visualization, and management of multiple data modalities and (b) to be assembled into complex bioinformatics workflows and biomedical applications using a simple yet powerful visual front-end and a scripting language.
We will define and use a Biomedical Informatics Structured Ontology (BISON) to create interoperable interfaces for geWorkbench components, and the GeneTegrate semantic layer to create a traversable object-relationship graph that supports the identification and management of all platform objects (e.g. a protein sequence or a software component for sequence analysis). Components that are data-intensive or computationally intensive will be wrapped as grid services.
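As a minimal illustration of the information-theoretic reverse engineering mentioned under the second theme, the following Python sketch scores gene pairs by mutual information and then drops the weakest edge in every fully connected triplet, a pruning step in the style of the data processing inequality. It is a simplified toy under stated assumptions, not the ARACNE implementation: the function names, the equal-frequency binning, the threshold, and the expression profiles are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def discretize(values, bins=3):
    """Equal-frequency (rank-based) binning of one expression profile."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels


def mutual_information(x, y):
    """Plug-in estimate of mutual information (bits) of two discrete profiles."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def infer_edges(profiles, threshold=0.5, bins=3):
    """Keep gene pairs whose MI exceeds threshold, then prune triangles."""
    binned = {g: discretize(v, bins) for g, v in profiles.items()}
    genes = sorted(binned)
    mi = {
        frozenset((a, b)): mutual_information(binned[a], binned[b])
        for a, b in combinations(genes, 2)
    }
    edges = {e for e, score in mi.items() if score > threshold}
    pruned = set(edges)
    # In each fully connected triplet, the weakest edge is treated as
    # indirect and removed; triplets are judged on the unpruned edge set.
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(e in edges for e in triangle):
            pruned.discard(min(triangle, key=lambda e: mi[e]))
    return pruned, mi


# Hypothetical expression profiles for three genes over twelve samples.
profiles = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "B": [1, 2, 3, 4, 5, 7, 6, 8, 9, 10, 11, 12],
    "C": [1, 2, 3, 4, 7, 8, 5, 6, 9, 10, 11, 12],
}
edges, mi = infer_edges(profiles, threshold=0.05, bins=2)
print(sorted(sorted(e) for e in edges))  # [['A', 'B'], ['B', 'C']]
```

In this toy data, A and C are related only through B, so the A-C edge has the lowest score of the three and is the one removed. A real analysis would need a more careful mutual information estimator and a statistically justified threshold.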

