BIGNASim database structure and analysis portal for nucleic acids simulation data

Frequently Asked Questions

To allow an easy but still representative data download, the data directly available from the web portal consist of 5,000 frames of dry, imaged trajectories. Downloadable trajectories are fully consistent with the pre-calculated analyses available on the website. Full trajectories are available on request.

In this project we have chosen two such database managers: Cassandra, to store trajectory data, and MongoDB, to store analysis results and metadata. The characteristics of each suit the purposes of the BIGNASim database. On the one hand, Cassandra is a column-oriented database that is especially efficient at retrieving key-value pairs, and a simple data structure is key to that retrieval efficiency. This makes it ideal for storing trajectory data, which consist of a uniform series of Cartesian coordinates that should be retrieved in well-defined data blocks. On the other hand, MongoDB is a document-oriented database, in which data need not follow a rigid schema. A single MongoDB document may hold anything from single values to 2D or 3D data, or even full-length trajectory videos. Additionally, MongoDB provides GridFS, a file-system-like layer for storing files in their original format. This flexibility, especially with respect to the structure of the stored documents, allows a common data structure to be used both in the database and in the analysis. The capacity of both systems scales horizontally, and they may coexist on the same computer equipment.
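
The contrast between the two storage models can be illustrated with a minimal in-memory sketch. It does not use Cassandra or MongoDB, and it is not the BIGNASim schema: the class names, the (simulation, frame) key, the "sim_demo" identifier and the document fields are all hypothetical, chosen only to show a uniform key-value layout next to a schema-free document layout.

```python
from typing import Any, Dict, List, Tuple


class TrajectoryStore:
    """Key-value style store: one uniform coordinate block per key.

    The (simulation id, frame number) key is an assumption made for
    this sketch, not the actual BIGNASim key layout.
    """

    def __init__(self) -> None:
        self._frames: Dict[Tuple[str, int], List[float]] = {}

    def put_frame(self, sim_id: str, frame: int, coords: List[float]) -> None:
        # Every value has the same simple shape: x1, y1, z1, x2, y2, z2, ...
        if len(coords) % 3 != 0:
            raise ValueError("coordinates must come in x, y, z triplets")
        self._frames[(sim_id, frame)] = list(coords)

    def get_block(self, sim_id: str, first: int, last: int) -> List[List[float]]:
        # Fetch a well-defined block of consecutive frames by key lookup.
        return [self._frames[(sim_id, f)] for f in range(first, last + 1)]


class DocumentStore:
    """Document style store: records are free-form dicts, no shared schema."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, Any]] = []

    def insert(self, doc: Dict[str, Any]) -> None:
        self._docs.append(dict(doc))

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        # Return every document whose fields match all given criteria.
        return [
            d for d in self._docs
            if all(d.get(key) == value for key, value in criteria.items())
        ]


def demo() -> Tuple[List[List[float]], List[Dict[str, Any]]]:
    traj = TrajectoryStore()
    for frame in range(3):
        # Two atoms per frame; the first atom moves along x.
        traj.put_frame("sim_demo", frame, [float(frame), 0.0, 0.0, 1.0, 1.0, 1.0])

    docs = DocumentStore()
    # Three documents with three different shapes live side by side.
    docs.insert({"sim_id": "sim_demo", "kind": "metadata", "n_atoms": 2})
    docs.insert({"sim_id": "sim_demo", "kind": "rmsd", "values": [0.0, 0.4, 0.7]})
    docs.insert({"sim_id": "sim_demo", "kind": "contact_map", "matrix": [[0, 1], [1, 0]]})

    return traj.get_block("sim_demo", 1, 2), docs.find(sim_id="sim_demo")


if __name__ == "__main__":
    block, found = demo()
    print(len(block), "frames;", [d["kind"] for d in found])
```

In this sketch the trajectory side accepts only one value shape and is read back in frame blocks, while the document side accepts a scalar, a 1D series and a 2D matrix without any schema change, which is the division of labour described above.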

The BIGNASim server offers the possibility to upload nucleic acid molecular dynamics trajectories through our user workspace. To upload a trajectory, a user must be registered in myBIGNASim as an uploader (see the help section). This service is still under development, and some of its functionality may not yet be enabled.

The number of simulations stored, the types of nucleic acid structures, the simulation lengths, etc. are summarized graphically in the statistics section of the BIGNASim portal. In any case, BIGNASim is already, by a wide margin, the largest database of nucleic acid trajectories, and we plan to continue extending it.

The nucleic acid simulations stored in this initial stage of the BIGNASim database come mostly from trajectories prepared during the development and validation of the parmBSC1 forcefield in our group. However, the database structure is open to growth through new simulations and analysis strategies: the system is limited neither in size, as the database managers scale horizontally, nor in complexity, as MongoDB offers a fully flexible data schema.

As explained in the previous question, the simulations stored in this initial stage of the BIGNASim database come mostly from trajectories prepared during the development and validation of the parmBSC1 forcefield in our group. As this forcefield was designed to work with DNA structures, most of the simulations stored in the database are of DNA molecules. However, the next step in the parmBSC1 forcefield project is to enable its use in RNA simulations, so BIGNASim will shortly store the complete set of RNA simulations used in this new benchmarking, covering most of the representative RNA motifs. To that end, we are already working on new RNA-specific analyses to integrate into the server, and we have already enriched our Nucleic Acids Ontology with new RNA-specific terms.

Note that BIGNASim is a large and complex system. It includes a production installation of two NoSQL databases, two web servers (one public, the other for internal communications), and a complex set of analysis software. A significant amount of disk space is also required to handle temporary data. It is almost impossible to design an installation procedure that covers the whole system and does not need a skilled system administrator. However, we are preparing a series of pre-packaged Docker images (DB and analysis site, analysis software), covering all analysis functionality, which would allow most users to set up a local installation, with limited capacity of course. We are still testing the procedure, but detailed information about download and installation will be available from the site. We understand that the BIGNASim infrastructure may be valuable not only for sharing final results, but also for performing in-house analysis, or simply for storing data in an ordered manner. Moreover, most major journals now require public sharing of results, as is customary with the PDB or sequence databases. BIGNASim will be an excellent place to do so.

The BIGNASim user workspace was designed only to keep users' data on a temporary basis, to facilitate the download of potentially large trajectories or meta-trajectories. It provides a link to recover this temporary workspace at any moment (within a browser session). Thus, browsing as an anonymous user is perfectly adequate for typical use of the server. However, myBIGNASim offers the possibility to register and work as a logged-in user, with a permanent workspace (subject to time and size limits) that is maintained and can be accessed from any machine at any moment. Users interested in uploading simulations must be registered in myBIGNASim.