Matei Zaharia
Associate Professor of Computer Science and, by courtesy, of Electrical Engineering
Bio
Homepage: https://cs.stanford.edu/~matei/
Academic Appointments
-
Associate Professor, Computer Science
-
Associate Professor (By courtesy), Electrical Engineering
2022-23 Courses
- Principles of Data-Intensive Systems
CS 245 (Win)
Independent Studies (12)
- Advanced Reading and Research
CS 499 (Aut, Win, Spr, Sum)
- Advanced Reading and Research
CS 499P (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390A (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390B (Win, Sum)
- Curricular Practical Training
CS 390C (Sum)
- Independent Project
CS 399 (Aut, Win, Spr)
- Independent Work
CS 199 (Aut, Win, Spr)
- Independent Work
CS 199P (Win, Spr)
- Part-time Curricular Practical Training
CS 390D (Aut, Win)
- Senior Project
CS 191 (Aut, Win, Spr)
- Supervised Undergraduate Research
CS 195 (Win)
- Writing Intensive Senior Research Project
CS 191W (Win, Spr)
Prior Year Courses
2021-22 Courses
- Machine Learning Systems Seminar
CS 528 (Aut, Win, Spr)
- Principles of Data-Intensive Systems
CS 245 (Win)
- Value of Data and AI
CS 320 (Win)
2020-21 Courses
- Principles of Data-Intensive Systems
CS 245 (Win)
- Value of Data and AI
CS 320 (Win)
2019-20 Courses
- Principles of Data-Intensive Systems
CS 245 (Win)
- Value of Data and AI
CS 320 (Win)
Stanford Advisees
-
Doctoral Dissertation Reader (AC)
Qian Li, Yilong Li, Yawen Wang
-
Orals Evaluator
Daniel Kang, Yilong Li
-
Doctoral Dissertation Co-Advisor (AC)
Lingjiao Chen, Daniel Kang, Omar Khattab
-
Master's Program Advisor
Victoria DiMelis, Rishab Gargeya, Kai Kato, Max Sobol Mark, Megan Worrel, Emily Yang, Eric Zhang
-
Doctoral (Program)
Jared Davis, Trevor Gale, Peter Kraft, Liana Patel, Deepti Raghavan, Keshav Santhanam, Gina Yuan
All Publications
-
Machine Learned Cellular Phenotypes Predict Outcome in Ischemic Cardiomyopathy.
Circulation research
2020
Abstract
RATIONALE: Susceptibility to ventricular arrhythmias (VT/VF) is difficult to predict in patients with ischemic cardiomyopathy either by clinical tools or by attempting to translate cellular mechanisms to the bedside. OBJECTIVE: To develop computational phenotypes of patients with ischemic cardiomyopathy, by training then interpreting machine learning (ML) of ventricular monophasic action potentials (MAPs) to reveal phenotypes that predict long-term outcomes. METHODS AND RESULTS: We recorded 5706 ventricular MAPs in 42 patients with coronary disease (CAD) and left ventricular ejection fraction (LVEF) ≤40% during steady-state pacing. Patients were randomly allocated to independent training and testing cohorts in a 70:30 ratio, repeated K=10 fold. Support vector machines (SVM) and convolutional neural networks (CNN) were trained to 2 endpoints: (i) sustained VT/VF or (ii) mortality at 3 years. SVM provided superior classification. For patient-level predictions, we computed personalized MAP scores as the proportion of MAP beats predicting each endpoint. Patient-level predictions in independent test cohorts yielded c-statistics of 0.90 for sustained VT/VF (95% CI: 0.76-1.00) and 0.91 for mortality (95% CI: 0.83-1.00) and were the most significant multivariate predictors. Interpreting trained SVM revealed MAP morphologies that, using in silico modeling, revealed higher L-type calcium current or sodium calcium exchanger as predominant phenotypes for VT/VF. CONCLUSIONS: Machine learning of action potential recordings in patients revealed novel phenotypes for long-term outcomes in ischemic cardiomyopathy. Such computational phenotypes provide an approach which may reveal cellular mechanisms for clinical outcomes and could be applied to other conditions.
View details for DOI 10.1161/CIRCRESAHA.120.317345
View details for PubMedID 33167779
-
DIFF: a relational interface for large-scale data explanation
VLDB JOURNAL
2020
View details for DOI 10.1007/s00778-020-00633-6
View details for Web of Science ID 000574078100002
-
Machine Learning to Classify Intracardiac Electrical Patterns during Atrial Fibrillation.
Circulation. Arrhythmia and electrophysiology
2020
Abstract
Background - Advances in ablation for atrial fibrillation (AF) continue to be hindered by ambiguities in mapping, even between experts. We hypothesized that convolutional neural networks (CNN) may enable objective analysis of intracardiac activation in AF, which could be applied clinically if CNN classifications could also be explained. Methods - We performed panoramic recording of bi-atrial electrical signals in AF. We used the Hilbert transform to produce 175,000 image grids in 35 patients, labeled for rotational activation by experts who showed consistency but with variability (kappa=0.79). In each patient, ablation terminated AF. A CNN was developed and trained on 100,000 AF image grids, validated on 25,000 grids, then tested on a separate 50,000 grids. Results - In the separate test cohort (50,000 grids), CNN reproducibly classified AF image grids into those with/without rotational sites with 95.0% accuracy (CI 94.8-95.2%). This accuracy exceeded that of support vector machines, traditional linear discriminant and k-nearest neighbor statistical analyses. To probe the CNN, we applied Gradient-weighted Class Activation Mapping which revealed that the decision logic closely mimicked rules used by experts (C-statistic 0.96). Conclusions - Convolutional neural networks improved the classification of intracardiac AF maps compared to other analyses, and agreed with expert evaluation. Novel explainability analyses revealed that the CNN operated using a decision logic similar to rules used by experts, even though these rules were not provided in training. We thus describe a scalable platform for robust comparisons of complex AF data from multiple systems, which may provide immediate clinical utility to guide ablation.
View details for DOI 10.1161/CIRCEP.119.008160
View details for PubMedID 32631100
-
Approximate Selection with Guarantees using Proxies
PROCEEDINGS OF THE VLDB ENDOWMENT
2020; 13 (11): 1990–2003
View details for DOI 10.14778/3407790.3407804
View details for Web of Science ID 000573965600014
-
PREDICTING SUDDEN CARDIAC DEATH BY MACHINE LEARNING OF VENTRICULAR ACTION POTENTIALS
ELSEVIER SCIENCE INC. 2020: 427
View details for Web of Science ID 000522979100416
-
Fleet: A Framework for Massively Parallel Streaming on FPGAs
ASSOC COMPUTING MACHINERY. 2020: 639–51
View details for DOI 10.1145/3373376.3378495
View details for Web of Science ID 000541369300041
-
BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics
PROCEEDINGS OF THE VLDB ENDOWMENT
2019; 13 (4): 533–46
View details for DOI 10.14778/3372716.3372725
View details for Web of Science ID 000573950100009
-
To Index or Not to Index: Optimizing Exact Maximum Inner Product Search
IEEE. 2019: 1250–61
View details for DOI 10.1109/ICDE.2019.00114
View details for Web of Science ID 000477731600107
-
Optimizing Data-Intensive Computations in Existing Libraries with Split Annotations
ASSOC COMPUTING MACHINERY. 2019: 291–305
View details for DOI 10.1145/3341301.3359652
View details for Web of Science ID 000524218600019
-
PipeDream: Generalized Pipeline Parallelism for DNN Training
ASSOC COMPUTING MACHINERY. 2019: 1–15
View details for DOI 10.1145/3341301.3359646
View details for Web of Science ID 000524218600001
-
TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions
ASSOC COMPUTING MACHINERY. 2019: 47–62
View details for DOI 10.1145/3341301.3359630
View details for Web of Science ID 000524218600004
-
From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers
USENIX ASSOC. 2019: 475–88
View details for Web of Science ID 000489756800033
-
DIFF: A Relational Interface for Large-Scale Data Explanation
PROCEEDINGS OF THE VLDB ENDOWMENT
2018; 12 (4): 419–32
View details for DOI 10.14778/3297753.3297761
View details for Web of Science ID 000497516500009
-
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark
ASSOC COMPUTING MACHINERY. 2018: 601–13
View details for DOI 10.1145/3183713.3190664
View details for Web of Science ID 000460373700041
-
MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis
ASSOC COMPUTING MACHINERY. 2018: 1285–1300
View details for DOI 10.1145/3183713.3196934
View details for Web of Science ID 000460373700086
-
NoScope: Optimizing Neural Network Queries over Video at Scale
PROCEEDINGS OF THE VLDB ENDOWMENT
2017; 10 (11): 1586–97
View details for Web of Science ID 000416492900036
-
Splinter: Practical Private Queries on Public Data
USENIX ASSOC. 2017: 299–313
View details for Web of Science ID 000427296400019
-
DIY Hosting for Online Privacy
ASSOC COMPUTING MACHINERY. 2017: 1–7
View details for DOI 10.1145/3152434.3152459
View details for Web of Science ID 000440700800001
-
Making Caches Work for Graph Analytics
IEEE. 2017: 293–302
View details for Web of Science ID 000428073700037
-
Apache Spark: A Unified Engine for Big Data Processing
COMMUNICATIONS OF THE ACM
2016; 59 (11): 56–65
View details for DOI 10.1145/2934664
View details for Web of Science ID 000387897700022
-
Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware
PROCEEDINGS OF THE VLDB ENDOWMENT
2016; 9 (14): 1707–18
View details for DOI 10.14778/3007328.3007336
View details for Web of Science ID 000386431500008
-
MLlib: Machine Learning in Apache Spark
JOURNAL OF MACHINE LEARNING RESEARCH
2016; 17
View details for Web of Science ID 000391480800001
-
GraphFrames: An Integrated API for Mixing Graph and Relational Queries
ASSOC COMPUTING MACHINERY. 2016
View details for DOI 10.1145/2960414.2960416
View details for Web of Science ID 000383740900002
-
FairRide: Near-Optimal, Fair Cache Sharing
USENIX ASSOC. 2016: 393–406
View details for Web of Science ID 000385264700026
-
SparkR: Scaling R Programs with Spark
ASSOC COMPUTING MACHINERY. 2016: 1099–1104
View details for DOI 10.1145/2882903.2903740
View details for Web of Science ID 000452538600074
-
Introduction to Spark 2.0 for Database Researchers
ASSOC COMPUTING MACHINERY. 2016: 2193–94
View details for DOI 10.1145/2882903.2912565
View details for Web of Science ID 000452538600169
-
Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2016
View details for Web of Science ID 000458973703002
-
Matrix Computations and Optimization in Apache Spark
ASSOC COMPUTING MACHINERY. 2016: 31–38
View details for DOI 10.1145/2939672.2939675
View details for Web of Science ID 000485529800009
-
Scaling Spark in the Real World: Performance and Usability
PROCEEDINGS OF THE VLDB ENDOWMENT
2015; 8 (12): 1840–43
View details for Web of Science ID 000386424800046
-
Vuvuzela: Scalable Private Messaging Resistant to Traffic Analysis
ASSOC COMPUTING MACHINERY. 2015: 137–52
View details for DOI 10.1145/2815400.2815417
View details for Web of Science ID 000494968800009
-
Spark SQL: Relational Data Processing in Spark
ASSOC COMPUTING MACHINERY. 2015: 1383–94
View details for DOI 10.1145/2723372.2742797
View details for Web of Science ID 000452535700109
-
Optimally designing games for behavioural research
PROCEEDINGS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES
2014; 470 (2167): 20130828
Abstract
Computer games can be motivating and engaging experiences that facilitate learning, leading to their increasing use in education and behavioural experiments. For these applications, it is often important to make inferences about the knowledge and cognitive processes of players based on their behaviour. However, designing games that provide useful behavioural data is a difficult task that typically requires significant trial and error. We address this issue by creating a new formal framework that extends optimal experiment design, used in statistics, to apply to game design. In this framework, we use Markov decision processes to model players' actions within a game, and then make inferences about the parameters of a cognitive model from these actions. Using a variety of concept learning games, we show that in practice, this method can predict which games will result in better estimates of the parameters of interest. The best games require only half as many players to attain the same level of precision.
View details for DOI 10.1098/rspa.2013.0828
View details for Web of Science ID 000336184600004
View details for PubMedID 25002821
View details for PubMedCentralID PMC4032552
-
A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples
GENOME RESEARCH
2014; 24 (7): 1180–92
Abstract
Unbiased next-generation sequencing (NGS) approaches enable comprehensive pathogen detection in the clinical microbiology laboratory and have numerous applications for public health surveillance, outbreak investigation, and the diagnosis of infectious diseases. However, practical deployment of the technology is hindered by the bioinformatics challenge of analyzing results accurately and in a clinically relevant timeframe. Here we describe SURPI ("sequence-based ultrarapid pathogen identification"), a computational pipeline for pathogen identification from complex metagenomic NGS data generated from clinical samples, and demonstrate use of the pipeline in the analysis of 237 clinical samples comprising more than 1.1 billion sequences. Deployable on both cloud-based and standalone servers, SURPI leverages two state-of-the-art aligners for accelerated analyses, SNAP and RAPSearch, which are as accurate as existing bioinformatics tools but orders of magnitude faster in performance. In fast mode, SURPI detects viruses and bacteria by scanning data sets of 7-500 million reads in 11 min to 5 h, while in comprehensive mode, all known microorganisms are identified, followed by de novo assembly and protein homology searches for divergent viruses in 50 min to 16 h. SURPI has also directly contributed to real-time microbial diagnosis in acutely ill patients, underscoring its potential key role in the development of unbiased NGS-based clinical assays in infectious diseases that demand rapid turnaround times.
View details for DOI 10.1101/gr.171934.113
View details for Web of Science ID 000338185000012
View details for PubMedID 24899342
View details for PubMedCentralID PMC4079973
-
Multi-Resource Fair Queueing for Packet Processing
ACM SIGCOMM COMPUTER COMMUNICATION REVIEW
2012; 42 (4): 1–12
View details for DOI 10.1145/2377677.2377679
View details for Web of Science ID 000309217600001
-
Managing Data Transfers in Computer Clusters with Orchestra
ACM SIGCOMM COMPUTER COMMUNICATION REVIEW
2011; 41 (4): 98–109
View details for DOI 10.1145/2043164.2018448
View details for Web of Science ID 000302124800009