Overview |
Book Description
Data in the genomics field is growing rapidly. In just a few years, the genomic data hosted by organizations such as the National Institutes of Health (NIH) has exceeded 50 PB (50 million GB), and these organizations are turning to cloud infrastructure to make that data available to the research community. How do you adapt your analysis tools and protocols to access and analyze data at that scale in the cloud?
With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Geraldine Van der Auwera, longtime custodian of the GATK user community, and Brian O'Connor of the UC Santa Cruz Genomics Institute guide you through the process. You will learn by working with real data and genomics algorithms from the field.
This book covers: essential genomics and computing technology background; basic cloud computing operations; getting started with GATK, plus three major GATK Best Practices; automating analysis with scripted workflows written in WDL and run with Cromwell; scaling up workflow execution in the cloud, including parallelization and cost optimization; interactive analysis in the cloud using Jupyter notebooks; ensuring collaboration and computational reproducibility with Terra.

About the Author
Dr. Geraldine A. Van der Auwera is the head of outreach and communications for the Data Sciences Platform at the Broad Institute of MIT and Harvard.

Table of Contents
Foreword
Preface
1. Introduction
    The Promises and Challenges of Big Data in Biology and Life Sciences
    Infrastructure Challenges
    Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
    Cloud-Hosted Data and Compute Platforms for Research in the Life Sciences
    Standardization and Reuse of Infrastructure
    Being FAIR
    Wrap-Up and Next Steps
2. Genomics in a Nutshell: A Primer for Newcomers to the Field
    Introduction to Genomics
    The Gene as a Discrete Unit of Inheritance (Sort Of)
    The Central Dogma of Biology: DNA to RNA to Protein
    The Origins and Consequences of DNA Mutations
    Genomics as an Inventory of Variation in and Among Genomes
    The Challenge of Genomic Scale, by the Numbers
    Genomic Variation
    The Reference Genome as Common Framework
    Physical Classification of Variants
    Germline Variants Versus Somatic Alterations
    High-Throughput Sequencing Data Generation
    From Biological Sample to Huge Pile of Read Data
    Types of DNA Libraries: Choosing the Right Experimental Design
    Data Processing and Analysis
    Mapping Reads to the Reference Genome
    Variant Calling
    Data Quality and Sources of Error
    Functional Equivalence Pipeline Specification
    Wrap-Up and Next Steps
3. Computing Technology Basics for Life Scientists
    Basic Infrastructure Components and Performance Bottlenecks
    Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
    Levels of Compute Organization: Core, Node, Cluster, and Cloud
    Addressing Performance Bottlenecks
    Parallel Computing
    Parallelizing a Simple Analysis
    From Cores to Clusters and Clouds: Many Levels of Parallelism
    Trade-Offs of Parallelism: Speed, Efficiency, and Cost
    Pipelining for Parallelization and Automation
    Workflow Languages
    Popular Pipelining Languages for Genomics
    Workflow Management Systems
    Virtualization and the Cloud
    VMs and Containers
    Introducing the Cloud
    Categories of Research Use Cases for Cloud Services
    Wrap-Up and Next Steps
4. First Steps in the Cloud
    Setting Up Your Google Cloud Account and First Project
    Creating a Project
    Checking Your Billing Account and Activating Free Credits
    Running Basic Commands in Google Cloud Shell
    Logging in to the Cloud Shell VM
    Using gsutil to Access and Manage Files
    Pulling a Docker Image and Spinning Up the Container
    Mounting a Volume to Access the Filesystem from Within the Container
    Setting Up Your Own Custom VM
    Creating and Configuring Your VM Instance
    Logging into Your VM by Using SSH
    Checking Your Authentication
    Copying the Book Materials to Your VM
    Installing Docker on Your VM
    Setting Up the GATK Container Image
……
6. GATK Best Practices for Germline Short Variant Discovery
7. GATK Best Practices for Somatic Variant Discovery
8. Automating Analysis Execution with Workflows
9. Deciphering Real Genomics Workflows
10. Running Single Workflows at Scale with Pipelines API
11. Running Many Workflows Conveniently in Terra
12. Interactive Analysis in Jupyter Notebook
13. Assembling Your Own Workspace in Terra
14. Making a Fully Reproducible Paper
Glossary
Index |