Analyzing Massive Healthcare Datasets using Apache Spark

November 15, 2018
Free for AMIA members; $50 for non-members.
Frank Austin Nothaft, PhD

The rapid decrease in the cost of sequencing a human genome has made "big data" sequencing studies (>10,000 samples) the new norm. However, working with datasets of this scale requires a new approach to analytics. Bioinformaticians need to harness the power of distributed computing to process these massive datasets for important clinical use cases. Recently, several prominent libraries like GATK4, ADAM, and Hail have taken advantage of Apache Spark, the leading analytics engine for big data processing, to achieve this goal. With Spark, bioinformaticians can write code in Scala, Java, Python, R, or SQL and run it in parallel across a cluster with hundreds to thousands of cores, making it easy to scale analytics.

In this talk, you will learn the basics of Apache Spark and how it supports large-scale healthcare informatics. We will also discuss examples of genomic analyses where Spark reduced latencies from hours to minutes. Finally, we will explore best practices for integrating other data sources (clinical measurements, imaging, etc.) with genomics data for popular healthcare use cases.

Learning Objectives

After participating in this activity, the learner should be better able to:

• Understand the basics of Apache Spark and how it works
• Recite examples of genomic analyses that can be run on Spark and explain how to significantly reduce latencies on large-scale analytics
• Describe how Apache Spark can be used to integrate other data sources (clinical measurements, imaging) with genomics data
• Discuss best practices for architecting scientific analyses on Apache Spark

Speaker Information

Frank Austin Nothaft, PhD
Go-To-Market Lead for Genomics at Databricks

Frank is the Go-To-Market Lead for Genomics at Databricks. Prior to joining Databricks, Frank was a lead developer on the Big Data Genomics/ADAM project at UC Berkeley, and worked at Broadcom Corporation on design automation techniques for industrial-scale wireless communication chips. Frank holds a PhD and a Master of Science in Computer Science from UC Berkeley, and a Bachelor of Science with Honors in Electrical Engineering from Stanford University.