Scalable Computational Bioinformatics
2021 - 2022
M, W, F, 1:30 - 2:20pm
Students should be comfortable using a command line interface and have previous experience programming in at least one language.
As the size of genomic cohorts continues to expand in size and complexity, many organizations are turning to cloud and high-performance computing (HPC) environments to alleviate computational load. While cloud computing promises elasticity and scalability, traditional bioinformatics tools are designed for single system and single node architectures and cannot effectively leverage cloud computing environments at scale and speed. This leads bioinformatics researchers to spend much of their time wrangling data, crafting complex algorithmic techniques, and single task pipelines which are slow to run and difficult to optimize. Researchers have now turned to scalable systems like Apache Spark.
Discusses a distributed programming paradigm, high level APIs, and scalable analytics platforms that simplify implementing algorithms for analyzing large genomic datasets. Discusses tools built on Apache Spark, enabling students to scale to thousands of cores, achieving a balance necessary for processing genomics data. Discusses how to solve some of these problems by bridging bioinformatics, data science, machine learning and the big data ecosystem. Enables students to leverage statistical methods of bioinformaticians and computational biologists in combination with best practices used by data engineers and data scientists across industry.
Learning ObjectivesUpon successfully completing this course, students will be able to:
- Develop just enough experience with Python to begin using the Apache Spark programming APIs including Spark SQL, Spark R, and PySpark
- Describe the Apache Spark architecture, the DataFrames API and SparkR, covering the fundamentals of the Apache Spark framework
- Describe processes of tuning Spark applications, developing best practices and avoiding many of the common pitfalls associated with developing Spark applications
- Use Delta Lake for ETL processing on data lakes including using Spark SQL to query from a data lake or Spark DataFrames
- Implement unified analysis of large-scale genomics data through distributed computation on and distributed storage of genotype data containing pre-packaged pipelines to align reads and detect and annotate variants in individual samples, parallelized using Apache Spark
- Develop code to analyze Aggregate genetic variants using the GATK’s GenotypeGVCF implemented on Apache Spark and extract, transform and load (ETL) genomic variant data into Spark DataFrames, enabling seamless manipulation, filtering, quality control and transformation between file formats
- Use Machine Learning fundamentals and Data Science techniques to analyze healthcare and genetics/genomics datasets as well as an overview of deep learning and how to scale it with Apache Spark
Methods of Assessment
This course is evaluated as follows:
- 50% Homework
- 50% Programming project that is presented in class