Data automation in large-scale omics-based biomedical research

Posted: 10 April 2015 | Sven Nahnsen (Quantitative Biology Center (QBiC))

Medicine and life sciences research are being revolutionised by major technological developments that have taken place over the past decade. High-throughput sequencers can now generate high-quality nucleotide sequence data at a small fraction of the cost compared to only a few years back…

Similar developments have taken place in the field of modern mass spectrometry, driving comprehensive analysis of proteomes and metabolomes. Most notably, these technological breakthroughs allow a shift from the analysis of individual genes, transcripts and proteins to whole-genome, transcriptome and proteome studies. Such omics studies are increasingly applied in small-scale research, as well as in larger clinical studies.

As a result of this transition, researchers now generate an unprecedented amount of primary raw data in many different research projects. Biomedical research has clearly become a big data discipline³. This intense data production creates new challenges for researchers, including the management of petabytes of data and the need for high performance in data processing, analysis and interpretation. Bioinformatics and computational biology workflows need to be implemented in a way that guarantees short turnaround times in generating results. However, the technological advances and their (close to) mainstream applications are widening the gap between data production (increasingly easy) and data analysis (increasingly difficult). Manual data analysis is no longer an option; instead, dedicated software tools must be chained and executed automatically to achieve the throughput needed to keep pace with data production.

Large-scale and data-intensive research in biology and biomedicine will only be possible by implementing a high degree of automation in all steps of large-scale research projects. It will be a competitive advantage to develop automated methods and infrastructures for the management and analysis of the primary research data. This automation will deliver the needed throughput to keep the speed of data analysis on par with its generation. Furthermore, it is the only way to move towards the ambitious goal of reproducibility in modern omics-based biomedical research.

Importance of archiving

Automating data analysis and processing relies on stringent, machine-readable archiving of all experimental parameters along with the raw and processed data. While there is still a wide variety of experimental methods based on next-generation sequencing (NGS) or other omics technologies, the number of parameters, and especially the subjectivity in performing these experiments, is reduced compared to classical ‘small scale’ assays. Traditionally, scientists developed strong expertise in small parts of the workflow to achieve adequate results, which made reproducing those results and integrating them with other datasets rather difficult. In contrast, it is an intrinsic property of an omics experiment to keep very well defined technical parameters on the machine (sequencer or mass spectrometer) for analysing the entire set of analytes (e.g., genome or proteome).
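As a minimal sketch of what machine-readable archiving could look like, the following Python snippet stores a checksum of the raw data together with the experimental parameters in a single JSON record. The function name, the field names and the example values are assumptions made for illustration; they are not taken from any particular archive system.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_archive_record(raw_bytes: bytes, parameters: dict) -> str:
    """Return a JSON record that ties raw data (via checksum) to its parameters."""
    record = {
        # The checksum lets later workflows verify they read the archived file
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "size_bytes": len(raw_bytes),
        "archived_at": datetime.now(timezone.utc).isoformat(),
        "parameters": parameters,
    }
    # sort_keys gives a stable layout so records can be compared and parsed reliably
    return json.dumps(record, sort_keys=True, indent=2)


# Hypothetical example values, for illustration only
example_record = build_archive_record(
    b"@read1\nACGT\n+\nIIII\n",
    {"instrument": "sequencer", "protocol": "P", "chemistry_version": "v1"},
)
print(example_record)
```

Keeping the parameters next to a checksum of the data means a downstream workflow can confirm that it is processing exactly the file the parameters describe.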

Attempts towards an integrated (data generation and analysis) research infrastructure will require the in silico projection of the experimental processes in the laboratory. This projection (data models in database terms) will offer means to generically describe the entire process from the initial individual (for example, the patient) to the resulting biological interpretation of the omics data⁴.

Most importantly, the steps that form the foundation of these processes have to include the biological experiment itself (e.g., a time-course of treatment with chemical A in concentration c in k individuals of species S). Within this biological experiment, sample extraction (e.g., healthy or cancer liver tissue) and preparation (extraction of total DNA with a given protocol P) are the next steps before modelling the NGS experiment with its set of meta information (e.g., chemistry version, sequencing protocol, etc.). An excerpt of such an in silico projection is shown in Figure 1, where the upper panel shows the individual steps that are carried out as part of an experiment and the lower panel exemplifies how data models can be designed to host all relevant parameters for the individual steps.
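The chain described above, from biological experiment through individual, sample and preparation to sequencing run, can be sketched as a set of nested Python dataclasses. This is only an illustrative model under stated assumptions: the class and field names are hypothetical and do not reproduce the data model shown in Figure 1.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class SequencingRun:
    chemistry_version: str
    sequencing_protocol: str


@dataclass
class Preparation:
    analyte: str    # e.g. "total DNA"
    protocol: str   # e.g. "P"
    runs: List[SequencingRun] = field(default_factory=list)


@dataclass
class Sample:
    tissue: str     # e.g. "liver"
    condition: str  # e.g. "healthy" or "cancer"
    preparations: List[Preparation] = field(default_factory=list)


@dataclass
class Individual:
    species: str
    samples: List[Sample] = field(default_factory=list)


@dataclass
class BiologicalExperiment:
    treatment: str
    concentration: str
    individuals: List[Individual] = field(default_factory=list)


# Hypothetical example: one individual, one sample, one preparation, one run
run = SequencingRun(chemistry_version="v1", sequencing_protocol="paired-end")
prep = Preparation(analyte="total DNA", protocol="P", runs=[run])
sample = Sample(tissue="liver", condition="cancer", preparations=[prep])
experiment = BiologicalExperiment(
    treatment="chemical A",
    concentration="c",
    individuals=[Individual(species="S", samples=[sample])],
)

# asdict() flattens the whole hierarchy into plain dicts and lists,
# which can be serialised or loaded into a database
experiment_as_dict = asdict(experiment)
```

Because every level holds a list of the next level, the path from a sequencing run back to the individual it came from is always recoverable from the structure itself.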

Finally, after sequencing, the data is subjected to computational processing and analysis. Data processing includes all workflows that take the bulk raw data and generate easy-to-parse files or even human-interpretable spreadsheets. Examples of processed data are variant call format (VCF) files or matrices of transcript or protein expression measures (for RNA-Seq experiments these could be read counts), which are then subjected to further computational analysis. Analysis pipelines vary considerably from question to question and therefore depend on the meta information of the biological experiment. They also involve different ways to visualise and statistically assess the multi-dimensional data and to mine for biological findings.

While the automated documentation of laboratory processes is an active research domain (e.g., as demonstrated by Holzmuller-Laue, et al.⁵), it is both particularly challenging and absolutely essential for achieving an integrative and high-performance setup for modern high-throughput technologies.

Requirements for automated analysis

Automated data analysis can only be implemented if the system is able to access the raw and processed data with all parameters of the processing workflow (e.g., the version of reference databases, tools and their parameters). Furthermore, analysis workflows also need to process information of the biological experiment (information on treatment, organisms and tissue) to generate meaningful results. The data processing workflows, in turn, need to query all parameters from the sequencing run (adapter information for de-multiplexing, organism for read alignment, etc.), as well as preceding extraction and preparation steps (e.g., which enzymes or chemistries).
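The dependency described above can be illustrated with a short Python sketch in which a processing workflow resolves its parameters from archived metadata and refuses to start when anything is missing. The key names and the function are hypothetical; a real infrastructure would query them from its own metadata store.

```python
# Hypothetical set of parameters a read-processing workflow might need
REQUIRED_KEYS = ("adapter", "organism", "reference_version", "tool_version")


def resolve_workflow_parameters(run_metadata: dict, experiment_metadata: dict) -> dict:
    """Collect workflow parameters from archived metadata, failing if any are missing."""
    # Run-level values take precedence over experiment-level values
    merged = {**experiment_metadata, **run_metadata}
    missing = [key for key in REQUIRED_KEYS if key not in merged]
    if missing:
        # Without complete metadata the workflow cannot run unattended
        raise ValueError(f"cannot automate workflow, missing metadata: {missing}")
    return {key: merged[key] for key in REQUIRED_KEYS}


# Illustrative values only
params = resolve_workflow_parameters(
    run_metadata={"adapter": "ADAPTER_1", "tool_version": "1.0"},
    experiment_metadata={"organism": "S", "reference_version": "2015-01", "tissue": "liver"},
)
print(params)
```

Failing early on incomplete metadata is a deliberate choice in this sketch: a workflow that silently falls back to defaults would run, but its results could not be reproduced.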

With the current speed of technological development, the full benefit of these quantitative and comprehensive measurements can only be achieved by implementing scalable solutions that will ultimately allow entire processing and analysis workflows for omics-based biomedical research projects to be integrated and automated.

In the near future, such implementation will allow us to base major research questions (e.g. diagnosis and treatment of diseases) on big multi-omics data. It is now time to integrate disciplines, data and methods to bring this future even closer.

Biography

DR. SVEN NAHNSEN studied Biomathematics and Biotechnology at the Universities of Greifswald (Germany), Strasbourg (France) and Cambridge (UK). He completed his PhD in Computer Science at the University of Tübingen (Germany). His early research focused on bioinformatics analysis of modern high-throughput data, ranging from mass spectrometry-based metabolomics and proteomics to next-generation-sequencing experiments. Since 2012 he has been heading the Quantitative Biology Center (QBiC) in Tübingen, focusing on data management and integration of big data in biomedicine. He and his team have established an efficient bioinformatics infrastructure enabling high-performance data analysis and automated project management.

References

1. Mardis, E.R., A decade’s perspective on DNA sequencing technology. Nature, 2011. 470(7333): p. 198-203.
2. Watson, M., Illuminating the future of DNA sequencing. Genome Biol, 2014. 15(2): p. 108.
3. Marx, V., Biology: The big challenges of big data. Nature, 2013. 498(7453): p. 255-60.
4. Friedrich, A., et al., Intuitive Web-based Experimental Design for High-throughput Biomedical Data. Biomed Research International, 2015. Accepted, March 9, 2015.
5. Holzmuller-Laue, S., et al., Improved compliance by BPM-driven workflow automation. J Lab Autom, 2014. 19(6): p. 528-45.
