High performance Integrated Virtual Environment

A framework for community-based development of standards for the harmonization of High-throughput Sequencing (HTS) computations and data formats, to promote interoperability and bioinformatics verification protocols.


Overview

A biocompute object is a record that includes all software arguments of the executable program, version information, and references to all the inputs, including the usability domain. While the uses of such an object are extremely varied and adaptable, in the domain of HTS computation and regulation it would allow for:
  • Harmonization of HTS analysis
  • Evaluation and validation of pipelines
  • Construction of novel pipelines through integration of multiple biocompute objects
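As a rough illustration of the record described above, the following is a minimal sketch in Python. The class name, field names, and example values (such as "example-aligner") are hypothetical and chosen only for illustration; they are not the actual biocompute object format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BiocomputeObject:
    """Hypothetical record layout; field names are illustrative only."""
    program: str           # name of the executable program
    version: str           # version information
    arguments: dict        # all software arguments
    input_refs: tuple      # references (identifiers) to all the inputs
    usability_domain: str  # intended scope of use

    def to_json(self) -> str:
        # Sorted keys give a stable serialization that is easy to compare.
        return json.dumps(asdict(self), sort_keys=True)


# Made-up example values, not taken from a real analysis.
bco = BiocomputeObject(
    program="example-aligner",
    version="1.0.0",
    arguments={"seed_length": 11, "mismatches": 2},
    input_refs=("reads:0001", "reference:0002"),
    usability_domain="Illustrative example only",
)
serialized = bco.to_json()
```

Keeping the record immutable and serializable is one possible design choice: it lets the same object be stored, compared, and read by both people and programs.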


Actual in vivo (in/with a living organism), in situ (in a specific location/environment), and in vitro (in glass/lab) experiments are unpredictable and extremely variable. While in silico (in a computer/computational) experiments can be as variable, this does not have to be the case. It is possible, and much easier in silico, to “freeze” an experimental instance and make it highly reproducible. This is our goal with biocompute objects and the ensuing database of validated biocompute objects.

An experimental instance in any discipline (physical, biological, or chemical) or any situation (in vivo, in situ, and in vitro) can be regarded as a “generalized scientific experiment” and treated in the same basic manner.

Note: The image has been adapted from a Pixabay image released under Creative Commons CC0. No attribution or permission is required.

For all instances, the validation protocols regard the experiment as a black box. You put something in the box (input) in a certain way (parameters) and get a reproducible result (output). If the same input and conditions reliably produce the expected results, then the instance is valid. We could even extend this analogy to the kitchen: to bake a loaf of bread (result/output) you need flour and water (inputs) and an oven at a certain temperature for a given amount of time (parameters). The "generalized experiment" figure above illustrates how such inputs and parameters also apply to in silico experiments.
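The black-box view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`run_instance`, `fingerprint`, `is_reproducible`) and the toy k-mer counting "experiment" are hypothetical and stand in for a real in silico analysis.

```python
import hashlib
import json


def run_instance(experiment, inputs, parameters):
    """Treat `experiment` as a black box: inputs + parameters -> output."""
    return experiment(inputs, **parameters)


def fingerprint(output) -> str:
    """Stable digest of a JSON-serializable output."""
    encoded = json.dumps(output, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def is_reproducible(experiment, inputs, parameters, expected, runs=3):
    """True if every repeated run yields exactly the expected output."""
    target = fingerprint(expected)
    return all(
        fingerprint(run_instance(experiment, inputs, parameters)) == target
        for _ in range(runs)
    )


def count_kmers(sequence, k):
    """Toy stand-in for an in silico experiment: count k-mers."""
    counts = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


expected = {"AC": 2, "CG": 1, "GT": 1, "TA": 1}
ok = is_reproducible(count_kmers, "ACGTAC", {"k": 2}, expected)
```

Comparing digests rather than raw outputs is one option when outputs are large; it assumes the output can be serialized deterministically.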


Biocompute object (experimental instance), biocompute object template (experimental protocol), and template library

A validated (and easily reproducible) biocompute object record stored in a database would address reproducibility and traceability issues that plague bioinformatic protocols. Parameters can be relaxed or strict, depending on how they are validated.

Our database of biocompute objects will be applicable to both federated and integrated systems, although the load within an integrated system, like HIVE, is greatly reduced. Integrated systems do not need to include the actual input and output data, because these data are either already on the system or easily introduced into the environment, and are accessible through unique system identifiers.

It is possible to join multiple experiments into complex pipelines by joining appropriate nodes of the resulting biocompute objects. From the perspective of validating these objects, a complex web of separate objects is still a single instance where ALL the input, output, and parameters still serve the same function as in a singular instance. In this way, it is possible to extend our validation process of a singular biocompute object to all components and combinations of an in silico protocol: singular algorithms, standalone tools, integrated applications, pipelines, and even whole workflows.
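One way to picture this joining of nodes is as function composition, where the output of one step becomes the input of the next and the joined result still behaves like a single instance. The sketch below is hypothetical: `trim`, `gc_content`, and `make_pipeline` are illustrative names, not part of HIVE or of any biocompute object specification.

```python
def trim(reads, min_length):
    """Toy step: drop reads shorter than `min_length`."""
    return [r for r in reads if len(r) >= min_length]


def gc_content(reads):
    """Toy step: fraction of G/C bases across all reads."""
    total = sum(len(r) for r in reads)
    gc = sum(r.count("G") + r.count("C") for r in reads)
    return gc / total if total else 0.0


def make_pipeline(steps):
    """Join steps so that each output node feeds the next input node.

    `steps` is a list of (function, parameters) pairs. The returned callable
    has the same shape as a single instance: one input in, one output out.
    """
    def pipeline(inputs):
        data = inputs
        for func, params in steps:
            data = func(data, **params)
        return data
    return pipeline


pipeline = make_pipeline([(trim, {"min_length": 4}), (gc_content, {})])
result = pipeline(["ACGT", "GG", "GGCC"])  # "GG" is trimmed before counting
```

Because the joined pipeline takes inputs and parameters and returns an output, the same validation applied to a single step can, in principle, be applied to the whole.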

It is also possible to use a validated biocompute object with well-characterized parameters to construct a template, from which other objects can be constructed. These objects can then become re-usable constructs for pipelines or batch computations.
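A template of this kind might be sketched as follows, assuming that a template fixes the program, version, and parameters of a validated object and leaves only the inputs open. The `Template` class, `template_from` helper, and all field names and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical template: fixed program, version, and parameters."""
    program: str
    version: str
    parameters: tuple  # (name, value) pairs, kept immutable

    def instantiate(self, input_refs):
        """Build a new object description for a different set of inputs."""
        return {
            "program": self.program,
            "version": self.version,
            "arguments": dict(self.parameters),
            "input_refs": list(input_refs),
        }


def template_from(validated_object):
    """Derive a template from a validated object by dropping its inputs."""
    return Template(
        program=validated_object["program"],
        version=validated_object["version"],
        parameters=tuple(sorted(validated_object["arguments"].items())),
    )


# Made-up validated object used only to demonstrate the idea.
validated = {
    "program": "example-aligner",
    "version": "1.0.0",
    "arguments": {"seed_length": 11, "mismatches": 2},
    "input_refs": ["reads:0001"],
}
template = template_from(validated)
batch = [template.instantiate([ref]) for ref in ("reads:0002", "reads:0003")]
```

The last line shows the batch use case: the same well-characterized parameters are reused across several inputs.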


Validation and the biocompute database

In the sciences, peer review serves as the primary validation for an experimental instance. Benchmark datasets exist and are also useful, but testing an experimental protocol as applied to regulatory bioinformatics requires more rigor.

Proper validation requires a set of test inputs, parameters, expected results, and defined limits of error and/or divergence. Using all of these elements we compare expected results to the actual results and then accept or disqualify the protocol. If any one of these elements is missing or bad then validation cannot occur (bad data = bad test = bad standards).
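The comparison described above can be sketched as a small Python function. It is a simplified illustration that assumes a single numeric result and an absolute error limit; the names `validate`, `REQUIRED`, and `mean_quality`, and the test values, are hypothetical.

```python
import math

# The elements that must all be present before validation can occur.
REQUIRED = ("inputs", "parameters", "expected", "tolerance")


def validate(protocol, test_case):
    """Accept (True) or disqualify (False) `protocol` for one test case.

    Raises ValueError if any required element is missing, since validation
    cannot occur without all of them.
    """
    missing = [key for key in REQUIRED if test_case.get(key) is None]
    if missing:
        raise ValueError("cannot validate, missing: " + ", ".join(missing))
    actual = protocol(test_case["inputs"], **test_case["parameters"])
    return math.isclose(
        actual,
        test_case["expected"],
        rel_tol=0.0,
        abs_tol=test_case["tolerance"],
    )


def mean_quality(scores, floor):
    """Toy protocol: mean of the quality scores at or above `floor`."""
    kept = [s for s in scores if s >= floor]
    return sum(kept) / len(kept)


case = {
    "inputs": [10, 30, 40],
    "parameters": {"floor": 20},
    "expected": 35.0,
    "tolerance": 0.5,
}
accepted = validate(mean_quality, case)
```

Real outputs are rarely a single number, so the comparison step would need to be replaced by whatever measure of divergence suits the protocol under test.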

Requirements and procedure for validating the scientific merit and interpretation of a submitted biocompute object:

  • References to publications where the underlying scientific method is discussed are provided
  • A description of the experimental protocol clearly defining: usability domain, parametric space, knowledge domain, error rates (if applicable), prerequisite datasets, and minimal requirements for an execution platform
  • A generated or synthesized in-silico input test set that is well-characterized
  • The results accumulated from an instance of the executed application
  • A detailed analysis of results to ensure the outputs' validity
  • Registration and creation of a biocompute record
  • All valid outcomes associated with the biocompute object
  • A template of the biocompute object for further uses

The validation proceeds as follows:

  • A mechanism will be set up for users to create biocompute objects. It will be possible to go one step further and enable direct submission from HIVE and, in the future, from other platforms such as Galaxy.
  • Users will submit the information needed to generate a biocompute object file that is both human- and machine-readable.
  • The biocompute database will have two sections: a validated and reviewed section of biocompute objects, and an unreviewed and/or partially validated section.

Currently, validation is performed manually by dedicated curators, but it may be possible to introduce crowd-sourcing and automatic and/or semi-automatic methods. Biocompute objects and biocompute object templates will be submitted and stored, and will be searchable after validation. Each one will be versioned and archived, while a utility for detection of duplicates will be inherent in the basic format of the database.
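Versioning and duplicate detection could, for example, be built on a content digest of each record. The sketch below is an assumption about one possible approach, not a description of the actual database: `BiocomputeStore`, its methods, and the identifiers "bco-1" and "bco-2" are all hypothetical.

```python
import hashlib
import json


class BiocomputeStore:
    """Hypothetical in-memory store; not the actual database design."""

    def __init__(self):
        self._by_digest = {}  # content digest -> (object id, version)
        self._versions = {}   # object id -> list of archived records

    @staticmethod
    def digest(record):
        # Canonical JSON, so that key order does not change the digest.
        canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def submit(self, object_id, record):
        """Archive `record` as a new version unless identical content exists.

        Returns (object_id, version, is_duplicate).
        """
        key = self.digest(record)
        if key in self._by_digest:
            existing_id, version = self._by_digest[key]
            return existing_id, version, True
        history = self._versions.setdefault(object_id, [])
        history.append(record)
        self._by_digest[key] = (object_id, len(history))
        return object_id, len(history), False


store = BiocomputeStore()
first = store.submit("bco-1", {"program": "example-aligner", "version": "1.0.0"})
second = store.submit("bco-1", {"program": "example-aligner", "version": "1.1.0"})
# Same content as `first` with keys in a different order: flagged as duplicate.
repeat = store.submit("bco-2", {"version": "1.0.0", "program": "example-aligner"})
```

A digest only catches byte-for-byte identical content after canonicalization; records that differ trivially (for example in whitespace inside a value) would still need curator review.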




Acknowledgements

    Sept. 24-25, 2014 Public Workshop Next Generation Sequencing Standards Organizing Committee members. Sponsor FDA. Committee Chair: Dr. Vahan Simonyan. HTS-standards project lead: Dr. Raja Mazumder.