Arvados: A Free Platform for Big Data Science

Thursday, March 10, 2016

2:30 pm

Clark Center, S360

Sponsored by:
Department of Genetics

Abstract: This talk will introduce the Arvados platform for data science. Arvados is a software system for managing compute clusters built around a scale-out content-addressed distributed file system (Arvados Keep) for storage, a cluster job queuing system designed for reproducibility (Arvados Crunch), and a user and group permission system for controlling and sharing access to those resources. Arvados provides web-based and command-line tools for transferring, managing, sharing, and computing on very large data sets.
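
To illustrate what "content-addressed" means here, the following is a minimal in-memory sketch, not Arvados Keep itself. The class name `ContentStore`, its methods, and the choice of SHA-256 are assumptions made for illustration; the abstract does not describe Keep's actual hash function, locator format, or API.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: each block is keyed by a hash of its bytes.

    Hypothetical illustration only; not the Arvados Keep API.
    """

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, so identical data
        # always maps to the same key and is stored only once.
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Re-hashing on read detects corruption: the key doubles as a checksum.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored block does not match its address")
        return data


store = ContentStore()
first = store.put(b"ACGT")
second = store.put(b"ACGT")  # same content, same address, no second copy
```

Because the address depends only on the data, a reference to a block also identifies exactly which bytes were used, which is one reason this style of storage pairs well with reproducible computation.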

In working with a diverse set of researchers, physicians, and patients who are all examining sequencing data, we have identified a need for a consistent naming scheme for parts of the genome. As an application within the Arvados platform, we invented tiling, a technique that divides the genome into about 10 million overlapping, variable-length sequences, or “tiles”, each with a unique 24-base tag at each end. We use examples from public data to show that tiling supports simple and consistent names, annotation, queries, machine learning, and clinical screening. We support tiling with Arvados Lightning, software that will scale to millions of genomes in a few racks of off-the-shelf hardware.
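
The tiling idea can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the Arvados Lightning implementation: the function name `tile`, the pre-chosen `tag_starts` offsets, and the 4-base tags in the demo are hypothetical (the abstract specifies 24-base tags and does not say how tag positions are selected).

```python
TAG_LEN = 24  # tag length given in the abstract


def tile(sequence, tag_starts, tag_len=TAG_LEN):
    """Split a sequence into tiles that each begin and end with a tag.

    tag_starts: sorted start offsets of the tags (hypothetical input;
    how tags are actually chosen is not described in the abstract).
    Adjacent tiles share one tag, so they overlap by tag_len bases.
    """
    tiles = []
    for left, right in zip(tag_starts, tag_starts[1:]):
        if right < left + tag_len:
            raise ValueError("tags must not overlap each other")
        # A tile runs from the start of one tag to the end of the next.
        tiles.append(sequence[left:right + tag_len])
    return tiles


# Demo with short 4-base tags so the example stays readable.
demo_sequence = "ACGTTTGACCAAGGCTTAGC"
demo_tiles = tile(demo_sequence, tag_starts=[0, 8, 16], tag_len=4)
```

In this demo the two tiles share the tag `CCAA`: it ends the first tile and begins the second. Since tags may sit at any distance from one another, tiles can vary in length, matching the "variable-length" description above.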

Bios: Alexander (Sasha) Wait Zaranek, PhD, is co-founder and Chief Scientist at Curoverse, a venture-backed company focused on building a free and open-source platform for storing, analyzing, and sharing biomedical data. Sasha works on open technologies that are part of the revolution that has reduced human DNA sequencing costs a million-fold since the completion of the Human Genome Project. A current research focus is the development of clinical-quality applications for processing massive data sets spanning millions of individuals across collaborating organizations, eventually encompassing exabytes of data. His contributions have led to highly cited publications in Science, Nature, the Lancet, and other leading scientific journals. Sasha is also a co-founder and Director of Informatics at the Harvard Personal Genome Project.

Jonathan Steffi is co-founder and leader of customer & business development at Curoverse, a federated data service for genomics & health. Before co-founding Curoverse, he spent several years in the biotechnology industry, including roles with Novartis Diagnostics, Amgen, and Accenture. Jonathan holds an MBA from Harvard Business School, an MEng focused on computational molecular biology from MIT, and undergraduate degrees in mathematics & computer science, also from MIT.

