The methods, models, and data produced by the Center's research activities will be provided as interoperable, grid-enabled components of a state-of-the-art bioinformatics platform that will allow them to be (a) integrated with a variety of other existing bioinformatics modules for the analysis, visualization, and management of multiple data modalities and (b) assembled into complex bioinformatics workflows and biomedical applications using a simple yet powerful visual front-end and a scripting language.
The new platform will be called geWorkbench (Genomic Workbench) and will be based on caWorkbench, an existing integrated genomics environment developed by the Center's investigators with funding from NCI, specifically the caBIG initiative. A key feature of geWorkbench will be its integration with GenePattern (a leading bioinformatics application that is also funded by caBIG), which will give geWorkbench users access to the advanced analysis modules available in GenePattern. We will develop and use a Biomedical Informatics Structured Ontology (BISON) to create interoperable interfaces for the components of both platforms, and we will use the GenePattern scripting language to allow their assembly into complex workflows. We will also use the GeneTegrate semantic layer to create a traversable object-relationship graph that supports the identification and management of all platform objects (e.g., a protein sequence or a software component for sequence analysis). Finally, components that are data- or computationally intensive will be wrapped as grid services using the Globus toolkit.

A team of experts familiar with large-scale commercial and academic software development will lead and coordinate the geWorkbench-related activities. Building on significant experience in interacting with the biomedical community, gained through participation in the caBIG project, the development effort will be both driven and tested by the broader biomedical community to ensure the usefulness of the geWorkbench tools and graphical user interfaces. The Center will run workshops and web-based seminars (webinars) for developers and end-users, and online documentation, including videos and multimedia materials, will be made available to the community for training purposes. Existing videoconferencing infrastructure, coupled with collaborative software available from the caBIG project, will be used for distance and multi-site training without requiring significant travel by the Center's personnel. The software development effort will proceed according to established software engineering principles, including:
- A rational Software Development Lifecycle based on UML tools and processes, including the creation of functional requirements, use cases, and entity relationship diagrams.
- The use of proven (community-based) software development methodologies, including (a) source code version control, (b) bug tracking and resolution, (c) mailing-list-based communities, and (d) extensive unit, system, and integration testing.
- An appropriate modification of the UML approach to allow rapid integration of software prototypes from individual Center projects, supporting the creation and management of an innovation pipeline.
- A community-centric approach to the extension of the formal ontology (BISON) that supports the component interoperability framework. This will be accomplished by relying on NCI's existing caDSR repository to deposit and revise BISON concepts and to obtain community feedback.
An important element of the software platform will be the development of specific software components that integrate algorithms and databases resulting from the Center's biomedical computation research activities. For instance, the interaction of two transcription factors may be identified from the literature (using NLP methods), from protein-protein structural interactions (using domain recognition motifs), from expression data (using information-theoretic or regression methods), or from databases (using ChIP-on-chip experimental assays). Each such clue will have an associated likelihood, and integrating the clues will improve our overall confidence in the interaction. Besides providing evidence about a specific interaction, this approach will allow clues obtained by one method to trigger conditional analysis via other methods. For instance, the identification of a specific transcription factor interaction by regression-based reverse-engineering may trigger the analysis of that protein-protein interaction at the structural level (assuming that both protein structures are known).
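
The evidence-integration idea described above can be illustrated with a minimal Python sketch. It is not the Center's method: the combination rule (a noisy-OR, which assumes the clues are independent), the `Clue` type, the source labels, the function names, and the 0.5 threshold are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Clue:
    """One piece of evidence for an interaction (hypothetical structure)."""
    source: str        # illustrative labels: "literature", "expression", "structure", "chip"
    likelihood: float  # probability in [0, 1] that the interaction is real


def combined_confidence(clues):
    """Combine clue likelihoods with a noisy-OR (assumes independent clues)."""
    return 1.0 - prod(1.0 - c.likelihood for c in clues)


def follow_up_analyses(clues, structures_known, threshold=0.5):
    """Return the follow-up analyses triggered by the clues seen so far."""
    strong_sources = {c.source for c in clues if c.likelihood >= threshold}
    already_run = {c.source for c in clues}
    triggered = []
    # A strong expression-based (e.g. regression) hit triggers structural
    # analysis, but only if both structures are known and it has not run yet.
    if ("expression" in strong_sources and structures_known
            and "structure" not in already_run):
        triggered.append("structure")
    return triggered


clues = [Clue("literature", 0.6), Clue("expression", 0.7)]
confidence = combined_confidence(clues)                          # 1 - 0.4 * 0.3
next_steps = follow_up_analyses(clues, structures_known=True)   # ["structure"]
```

With these illustrative inputs, two moderately reliable clues yield a higher combined confidence than either alone, and the expression-based clue triggers a structural analysis only when both structures are available. A real implementation would need to account for dependence between evidence sources, which the noisy-OR ignores.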

