Homepage of Christopher Re (Chris Re)

Christopher Ré
Email: chrismre at cs.stanford.edu
Office in Gates 433

Department of Computer Science
Stanford University
353 Serra Mall
Stanford, CA 94305-9025

I'm an assistant professor in the InfoLab and affiliated with the PPL and SAIL labs, and I work on the fundamentals of the next generation of data management systems (bio here). This means we work on databases, theory, and machine learning, and we worry about hardware trends. A major application of our work is to make it dramatically easier to build high-quality systems to process more of the world's dark data (sql databases, text, and images). Recently, we've shown that our systems can even exceed human volunteer quality in reading scientific journal articles (featured in Nature).
- New Tradeoffs for Systems. The next generation of data systems need to make fundamentally new tradeoffs. For example, we proved that many statistical algorithms can be run in parallel without locks (Hogwild! or SCD) or with lower precision. This leads to a fascinating systems tradeoff between statistical and hardware efficiency. These ideas have been picked up by web and enterprise companies for everything from recommendation to deep learning.
- New Programming Models. The DeepDive system demonstrates that one can build high quality applications that use machine learning without specifying an inference algorithm, which makes it usable by a wider range of people. Our goal for the last few years has been to dramatically reduce the time analyst spend specifying models, maintaining them, and collaboratively building models.
- New Database Engines. We're thinking about how these new workloads change how one would build a database. We're building a new database, EmptyHeaded, that extends our theoretical work on optimal join processing. Multiway join algorithms are asymptotically and empirically faster than traditional database engines—by orders of magnitude. We're using it to unify database querying, graph patterns, linear algebra and inference, RDF processing, and more.
To validate our ideas, we continue to build systems that we hope change the way people do science and improve society. This work is with great partners in areas including paleobiology (Nature), drug repurposing, genomics, material science, and the fight against human trafficking (60 minutes, Forbes, Scientific American, WSJ, BBC, and Wired). Our work is supporting investigations. In the past, we've worked with a neutrino telescope (IceCube Science cover and our modest contribution) and on economic indicators.

DeepDive is tons of fun (one pager). Our code is on github. Data is here. Twitter @HazyResearch sometimes.
News
Upcoming Meetings and Talks
A messy, incomplete log of old updates is here.

I am an assistant professor in Computer Science at Stanford University. I'm in the InfoLab and affiliated with the PPL and SAIL labs. My interests are theoretical and practical problems in data management. Details of my work can be found in my papers and somewhere on github. I believe that the future of computing is in data management. If you agree, send me a note!

I'm an assistant professor in the InfoLab and affiliated with the PPL and SAIL labs, and I work on the fundamentals of the next generation of data management systems (bio here). This means we work on databases, theory, and machine learning, and we worry about hardware trends. A major application of our work is to make it dramatically easier to build high-quality systems to process more of the world's dark data (sql databases, text, and images). Recently, we've shown that our systems can even exceed human volunteer quality in reading scientific journal articles (featured in Nature).
- New Tradeoffs for Systems. The next generation of data systems need to make fundamentally new tradeoffs. For example, we proved that many statistical algorithms can be run in parallel without locks (Hogwild! or SCD) or with lower precision. This leads to a fascinating systems tradeoff between statistical and hardware efficiency. These ideas have been picked up by web and enterprise companies for everything from recommendation to deep learning.
- New Programming Models. The DeepDive system demonstrates that one can build high quality applications that use machine learning without specifying an inference algorithm, which makes it usable by a wider range of people. Our goal for the last few years has been to dramatically reduce the time analyst spend specifying models, maintaining them, and collaboratively building models.
- New Database Engines. We're thinking about how these new workloads change how one would build a database. We're building a new database, EmptyHeaded, that extends our theoretical work on optimal join processing. Multiway join algorithms are asymptotically and empirically faster than traditional database engines—by orders of magnitude. We're using it to unify database querying, graph patterns, linear algebra and inference, RDF processing, and more.
To validate our ideas, we continue to build systems that we hope change the way people do science and improve society. This work is with great partners in areas including paleobiology (Nature), drug repurposing, genomics, material science, and the fight against human trafficking (60 minutes, Forbes, Scientific American, WSJ, BBC, and Wired). Our work is supporting investigations. In the past, we've worked with a neutrino telescope (IceCube Science cover and our modest contribution) and on economic indicators.

DeepDive is tons of fun (one pager). Our code is on github. Data is here. Twitter @HazyResearch sometimes.
News
Upcoming Meetings and Talks
A messy, incomplete log of old updates is here.

Manuscripts

Aggregations over Generalized Hypertree Decompositions. Manas Joglekar Rohan Puttagunta, and C. Ré.
EmptyHeaded: A Relational Engine for Graph Processing Christopher R. Aberger, Susan Tu, Kunle Olukotun, and C. Ré.
GYM: A Multiround Join Algorithm In MapReduce and Its Analysis Foto Afrati, Manas Joglekar, C. Ré, Semih Salihoglu, and Jeffrey D. Ullman.
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling Chris De Sa, Kunle Olukotun, C. Ré.

Index by year

2016

Materialization Optimizations for Feature Selection. Ce Zhang, Arun Kumar, and C. Ré. TODS 2016.
High Performance Parallel Stochastic Gradient Descent in Shared Memory. S. Sallinen, N. Satish, M. Smelyanskiy, S. Sury, C. Ré. IPDPS16.
It’s all a matter of degree: Using degree information to optimize multiway joins. Manas Joglekar and C. Ré. ICDT2016.
Weighted SGD for lp Regression with Randomized Preconditioning Jiyan Yang, Yin-Lam Chow, C. Ré, and Michael Mahoney. SODA16.
Old Techniques for New Join Algorithms: A Case Study in RDF Processing. Chris Aberger, Susan Tu, Kunle Olukotun, and C. Ré. DESWEB with ICDE16.
Dark Data: Are We Solving the Right Problems? M. Cafarella, I. Ilyas, M. Kornacker, Tim Kraska, C. Ré. ICDE 2016 (Panel).

2015

Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width Chris De Sa, Ce Zhang, Kunle Olukotun, C. Ré. NIPS15. Spotlight
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms Chris De Sa, Ce Zhang, Kunle Olukotun, and C. Ré. NIPS15
Asynchronous stochastic convex optimization. John C. Duchi, Sorathan Chaturapruek, and C. Ré. NIPS15.
Incremental Knowledge Base Construction Using DeepDive Jaeho Shin, Sen Wu, Feiran Wang, Ce Zhang, C. De Sa, C. Ré. VLDB15. Best of Issue
Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems Christopher De Sa, Kunle Olukotun, and C. Ré. ICML15.
A Resolution-based Framework for Joins: Worst-case and Beyond. Mahmoud Abo Khamis, Hung Q. Ngo, C. Ré, and Atri Rudra. PODS15. Best of Issue.
Exploiting Correlations for Expensive Predicate Evaluation Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, and C. Ré. SIGMOD15
A Demonstration of Data Labeling in Knowledge Base Construction. Jaeho Shin, Mike Cafarella, and C. Ré. VLDB15 (demo).
Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?. SIGMOD15 Panel.
Large-scale extraction of gene interactions from full text literature using DeepDive Emily Mallory, Ce Zhang, C. Ré., Russ Altman. Bioinformatics
Caffe con Troll: Shallow Ideas to Speed up Deep Learning. Firas Abuzaid, Stefan Hadjis, Ce Zhang, and C. Ré. DANAC15.
Join Processing for Graph Patterns: An Old Dog with New Tricks Dung Nguyen, LogicBlox, et al. GRADES15.
DunceCap: Compiling Worst-case Optimal Query Plans. Adam Perelman and C. Ré. Winner of SIGMOD Undergrad Research Competition.
DunceCap: Query Plans Using Generalized Hypertree Decompositions. Susan Tu and C. Ré. Winner of SIGMOD Undergrad Research Competition.
A Database Framework for Classifier Engineering. Benny Kimmelfeld and C. Ré. AMW 2015.
The Mobilize Center: an NIH big data to knowledge center to advance human movement research and improve mobility Ku et al. AMIA.

2014

Materialization Optimizations for Feature Selection. Ce Zhang, Arun Kumar, and C. Ré. SIGMOD 2014. Best Paper Award.
DimmWitted: A Study of Main-Memory Statistical Analytics. Ce Zhang and C. Ré. VLDB 2014.
An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. J. Liu, S. Wright, C. Ré, V. Bittorf, S. Sridhar. ICML 2014. (JMLR version)
Beyond Worst-case Analysis for Joins using Minesweeper. Hung Q. Ngo, Dung Nguyen, C. Ré, and Atri Rudra. PODS 2014. [Full]
Parallel Feature Selection Inspired by Group Testing. Y. Zhou et al. NIPS2014.
The Theory of Zeta Graphs with an Application to Random Networks. C. Ré. ICDT 2014. Invited to "Best of" Special Issue.
Transducing Markov Sequences Benny Kimelfeld and C. Ré. JACM 2014.
A Machine-compiled Macroevolutionary History of Phanerozoic Life. Shanan E. Peters, Ce Zhang, Miron Livny, and C. Ré. PloS ONE.
Using Social Media to Measure Labor Market Flow D. Antenucci, M. Cafarella, M. Levenstein, C. Ré, and M. Shapiro. NBER. Selected for NBER Digest
Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems Christopher De Sa, Kunle Olukotun, and C. Ré.
Preliminary version in Distributed Matrix Computation with NIPS14.
Feature Engineering for Knowledge Base Construction DeepDive Group. Data Engineering Bulletin.
Tradeoffs in Main-Memory Statistical Analytics: Impala to DimmWitted (Invited) V. Bittorf, M. Kornacker, C. Ré, C. Zhang. IMDM with VLDB14.
The Beckman Report on Database Research Mike Carey, AnHai Doan, et al. 2014.
Links between Join Processing and Convex Geometry, C. Ré. ICDT 2014 (Invited Abstract for Keynote) [slides].
Skew Strikes Back: New Developments in the Theory of Join Algorithms. Hung Ngo, C. Ré, and Atri Rudra. SIGMOD Rec. 2013.

2013

Towards High-Throughput Gibbs Sampling at Scale: A Study across Storage Managers. Ce Zhang and C. Ré. SIGMOD 2013.
An Approximate, Efficient LP Solver for LP Rounding. Srikrishna Sridhar, Victor Bittorf, Ji Liu, Ce Zhang, C. Ré, and Stephen J. Wright. NIPS 2013
Brainwash: A Data System for Feature Engineering. M. Anderson et al. CIDR Conference 2013 (Vision Track)
Understanding Tables in Context Using Standard NLP Toolkits Vidhya Govindaraju, Ce Zhang, and C. Ré. ACL 2013 (Short Paper)
Hazy: Making it Easier to Build and Maintain Big-data Analytics. Arun Kumar, Feng Niu, and C. Ré ACM Queue, 2013. Invited to CACM March 2013
Ringtail: Nowcasting Made Easy D. Antenucci, M.J. Cafarella, M.C. Levenstein, C. Ré, and M. Shapiro. WebDB 2013 with SIGMOD 2013
Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion B. Recht and C. Ré. Mathematical Programming Computation, 2013.
Ringtail: Nowcasting Made Easy. Dolan Antenucci, Erdong Li, Shaobo Liu, Michael J. Cafarella, and C. Ré. VLDB Demo 2013.
Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System P. Konda, A. Kumar, C. Ré, and V. Sashikanth. VLDB Demo 2013
GeoDeepDive: Statistical Inference using Familiar Data-Processing Languages. Ce Zhang, V. Govindaraju, J. Borchardt, T. Foltz, C. Ré, and S. Peters. SIGMOD 13 (demo).
Building an Entity-Centric Stream Filtering Test Collection for TREC 2102. J.R. Frank, M.Kleiman-Weiner, D. A. Roberts, F.Niu, Ce Zhang, C. Ré, and I. Soboroff. TREC 2013
Improvement in Fast Particle Track Reconstruction with Robust Statistics M. Wellons, IceCube Collaboration, B. Recht, and C. Ré. Nuclear Inst. and Methods in Physics Research, A.
Robust Statistics in IceCube Initial Muon Reconstruction. M. Wellons, IceCube Collaboration, B. Recht, and C. Ré. International Cosmic Ray Conference 2013.

2012

Factoring nonnegative matrices with linear programs. Victor Bittorf, Benjamin Recht, C. Ré, and Joel A. Tropp. NIPS 2012. Revised Version.

The MADlib Analytics Library or MAD Skills, the SQL. Joseph M. Hellerstein et al. PVLDB 2012

Probabilistic Management of OCR using an RDBMS Arun Kumar and C. Ré. PVLDB 2012. [Full Version]
Optimizing Statistical Information Extraction Programs Over Evolving Text Fei Chen, Xixuan Feng, C. Ré, and Min Wang. ICDE. [Full Version]
Understanding cardinality estimation using entropy maximization C. Ré and Dan Suciu. ACM Trans. Database Syst. Volume 37.
Towards a Unified Architecture for In-Database Analytics Aaron Feng, Arun Kumar, Benjamin Recht, and C. Ré SIGMOD 2012. [Full Version]
Worst-case Optimal Join Algorithms Hung Q. Ngo, Ely Porat, C. Ré, and Atri Rudra. PODS, 2012. Best Paper Award
Big Data versus the Crowd: Looking for Relationships in All the Right Places Ce Zhang, Feng Niu, C. Ré, and Jude Shavlik. ACL, 2012.
Toward a noncommutative arithmetic-geometric mean inequality B. Recht and C. Ré. COLT, 2012 [Full Version]
Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference F. Niu, Ce Zhang, C. Ré, and J. Shavlik. IJSWIS, Special Issue on Knowledge Extraction from the Web, 2012.
DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference F. Niu, C. Zhang, C. Ré, and J. Shavlik. VLDS, 2012.
Scaling Inference for Markov Logic via Dual Decomposition (Short). F. Niu, C. Zhang, C. Ré, and J. Shavlik. ICDM, 2012.

2011

Probabilistic Databases. Dan Suciu, Dan Olteanu, C. Ré, and Christoph Koch. Morgan Claypool's Synthesis Lectures, 2011
Incrementally maintaining classification using an RDBMS Mehmet Levent Koc and C. Ré. PVLDB Volume 4, 2011, p. 302-313
Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS F. Niu, C. Ré, A.Doan, and J.W. Shavlik. PVLDB 11, [Full Version]
Automatic Optimization for MapReduce ProgramsEaman Jahani, Michael J. Cafarella, and C. Ré. PVLDB 2011.
Queries and materialized views on probabilistic databases. Nilesh N. Dalvi, C. Re, and Dan Suciu. JCSS 2011.
Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion B. Recht and C. Ré. 2011.
Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent F. Niu, B. Recht, C. Ré, and S. J. Wright. NIPS, 2011. [Full Version]
Felix: Scaling Inference for Markov Logic with an Operator-based Approach. Feng Niu, Ce Zhang, C. Ré, and Jude Shavlik.

2010

Manimal: Relational Optimization for Data-Intensive Programs Michael J. Cafarella and C. Ré. WebDB, 2010.
Transducing Markov Sequences Benny Kimelfeld and C. Ré. PODS, 2010. Invited to Special Issue
Understanding Cardinality Estimation using Entropy Maximization C. Ré and Dan Suciu. PODS, 2010, Invited to Special Issue
Approximation Trade-Offs in a Markovian Stream Warehouse: An Empirical Study (Short) J. Letchner, C. Ré, M. Balazinska, and M. Philipose. ICDE.

2009 and older

C. Ré

Managing Large-Scale Probabilistic Databases

Winner of SIGMOD Jim Gray Thesis Award

Raghav Kaushik, C. Ré, and Dan Suciu

General Database Statistics Using Entropy Maximization

Talk

Katherine F. Moore, Vibhor Rastogi, C. Ré, and Dan Suciu

Query Containment of Tier-2 Queries over a Probabilistic Database

Julie Letchner, C. Ré, Magdalena Balazinska, and Matthai Philipose

Access Methods for Markovian Streams

Arvind Arasu, C. Ré, and Dan Suciu

Large-Scale Deduplication with Constraints Using Dedupalog

Talk

Selected as one of the best papers in ICDE 2009

Nilesh N. Dalvi, C. Ré, and Dan Suciu

Probabilistic databases: Diamonds in the dirt

Full Version

S. Manegold, I. Manolescu, L. Afanasiev, J. Feng, G. Gou, M. Hadjieleftheriou, S. Harizopoulos, P. Kalnis, K. Karanasos, D. Laurent, M. Lupu, N. Onose, C. Ré, V. Sans, P. Senellart, T. Wu, and D. Shasha

Repeatability & Workability Evaluation of SIGMOD 2009

Julie Letchner, C. Ré, Magdalena Balazinska, and Matthai Philipose

Lahar Demonstration: Warehousing Markovian Streams

C. Ré and Dan Suciu

The Trichotomy of HAVING Queries on a Probabilistic Database

C. Ré

Managing Probabilistic Data with Mystiq (Plenary Talk)

C. Ré, and Dan Suciu

Advances in Processing SQL Queries on Probabilistic Data

Ting-You Wang, C. Ré, and Dan Suciu

Implementing NOT EXISTS Predicates over a Probabilistic Database

Nodira Khoussainova, Evan Welbourne, Magdalena Balazinska, Gaetano Borriello, Garrett Cole, Julie Letchner, Yang Li, C. Ré, Dan Suciu, and Jordan Walke

A demonstration of Cascadia through a digital diary application

C. Ré, Julie Letchner, Magdalena Balazinska, and Dan Suciu

Event queries on correlated probabilistic streams

C. Ré, and Dan Suciu

Managing Probabilistic Data with MystiQ: The Can-Do, the Could-Do, and the Can't-Do

Julie Letchner, C. Ré, Magdalena Balazinska, and Matthai Philipose

Challenges for Event Queries over Markovian Streams

C. Ré, and Dan Suciu

Approximate lineage for probabilistic databases

Full Version

Talk

Magdalena Balazinska, C. Ré, and Dan Suciu

Systems aspects of probabilistic data management (Part I)

Talk

Magdalena Balazinska, C. Ré, and Dan Suciu

Systems aspects of probabilistic data management (Part II)

Talk

Michael J. Cafarella, C. Ré, Dan Suciu, and Oren Etzioni

Structured Querying of Web Text Data: A Technical Challenge

C. Re, and Dan Suciu

Management of data with uncertainties

C. Ré, Dan Suciu, and Val Tannen

Orderings on Annotated Collections

C. Ré, and Dan Suciu

Efficient Evaluation of HAVING Queries

Full Version

Talk

C. Ré, Nilesh N. Dalvi, and Dan Suciu

Efficient Top-k Query Evaluation on Probabilistic Data

Full Version

Talk

C. Re and Dan Suciu

Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization

Full Version

Talk

C. Ré

Applications of Probabilistic Constraints (General Exam Paper)

Eytan Adar and C. Ré

Managing Uncertainty in Social Networks

Giorgio Ghelli, C. Ré, and Jér^ome Sim'eon

XQuery!: An XML Query Language with Side Effects

C. Re, Jér^ome Sim'eon, and Mary F. Fern'andez

A Complete and Efficient Algebraic Compiler for XQuery

C. Ré, Nilesh N. Dalvi, and Dan Suciu

Query Evaluation on Probabilistic Databases

Chavdar Botev, Hubert Chao, Theodore Chao, Yim Cheng, Raymond Doyle, Sergey Grankin, Jon Guarino, Saikat Guha, Pei-Chen Lee, Dan Perry, C. Re, Ilya Rifkin, Tingyan Yuan, Dora Abdullah, Kathy Carpenter, David Gries, Dexter Kozen, Andrew C. Myers, David I. Schwartz, and Jayavel Shanmugasundaram

Supporting workflow in a course management system

Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, C. Ré, and Dan Suciu

MYSTIQ: a system for finding more answers by using probabilities

Nathan Bales, James Brinkley, E. Sally Lee, Shobhit Mathur, C. Re, and Dan Suciu

A Framework for XML-Based Integration of Data, Visualization and Analysis in a Biomedical Domain

C. Ré, Jim Brinkley, Kevin Hinshaw, and Dan Suciu

Distributed XQuery

Werner Vogels and C. Ré

WS-Membership - Failure Management in a Web-Services World

Werner Vogels, C. Ré, Robbert Renesse, and Kenneth P. Birman

A Collaborative Infrastructure for Scalable and Robust News Delivery

Christopher (Chris) Re is an assistant professor in the Department of Computer Science at Stanford University and a Robert N. Noyce Family Faculty Scholar. His work's goal is to enable users and developers to build applications that more deeply understand and exploit data. Chris received his PhD from the University of Washington in Seattle under the supervision of Dan Suciu. For his PhD work in probabilistic data management, Chris received the SIGMOD 2010 Jim Gray Dissertation Award. He then spent four wonderful years on the faculty of the University of Wisconsin, Madison, before moving to Stanford in 2013. He helped discover the first join algorithm with worst-case optimal running time, which won the best paper at PODS 2012. He also helped develop a framework for feature engineering that won the best paper at SIGMOD 2014. In addition, work from his group has been incorporated into scientific efforts including the IceCube neutrino detector and PaleoDeepDive, and into Cloudera's Impala and products from Oracle, Pivotal, and Microsoft's Adam. He received an NSF CAREER Award in 2011, an Alfred P. Sloan Fellowship in 2013, a Moore Data Driven Investigator Award in 2014, the VLDB early Career Award in 2015, and the MacArthur Foundation Fellowship in 2015.

Download as text file

NIH. Thank you to the NIH Big Data to Knowledge (BD2K) program for supporting our work on big data for mobility data led by Scott Delp.
DARPA. Thank you to DARPA Memex for supporting our work. We are really excited to be part of this program to use extraction for good!
NSF. Thank you to the NSF for supporting New Frontiers in Join Algorithms: Optimality, Noise, and Richer Languages which is joint with Atri Rudra and Hung Ngo!
AirForce. Thank you to the Air Force for supporting Mathematical Foundations of Secure Computing Clouds; this is the hard work of Jordan Ellenberg (Math), Ben Recht (CS), Tom Ristenpart (CS), Rob Nowak (EE), and Steve Wright (CS).
Oracle. Thank you for continuing support to support the Hazy Research group! This gift will be used to continue our work on feature engineering for structured analytics.
ONR Thank you to the Office of Naval Research for supporting Ben Recht, Steve Wright, and my proposal about An Architecture for Integrating Information and Simplifying Large-scale Statistical Data Analysis (Award No. N000141310129)
AmFam Thank you to American Family Insurance for their generous support of the Hazy group's research. We're very excited about the collaboration.
DDR&E Thank you to DDR&E, DARPA, and Raytheon for funding Ben Recht and my proposal about operator splitting for information fusion applications.
DARPA Thank you to DARPA (DEFT) for funding Jude Shavlik, Sriraam Natarajan (Wake Forest), and my proposal Creating Robust Relation Extractors and Anomaly Detectors via Probabilistic Logic-Based Reasoning and Learning.
NSF. Thank you to the NSF for funding an EAGER to work on 'extracting Dark Data' with Shanan Peters from UW Geoscience and Miron Livny from CS.
Google Research Award. Thank you to GOOGLE for supporting our proposal, GeoDeepDive: Machine Reading of Measurements.
Oracle and Oracle Labs. Thank you to the Oracle Labs and the Oracle Analytics team for their generous support of the Hazy group's work! We are really excited to learn what customers need from in-database analytics. This will help support Arun's work. He and I are both very excited!
Greenplum/EMC. Thank you to Greenplum/EMC for their generous support of the Hazy group's work! We are really excited to learn from this collaboration -- and to push some of Aaron and Arun's stuff in to MADlib!, an awesome open-source library for scalable in-database analytics.
Office of Naval Research. Thank you to the ONR for support of my work under award no. N000141210041! This funding will allow our group to embark on a theoretical investigation of the foundations of building a large-scale, easy-to-use data-analysis system.
NSF CAREER. I recently received the NSF CAREER award (IIS-1054009). Thank you to the NSF for their generous support of Hazy.
IceCube. The Hazy group is extremely excited to announce funding for an exploratory data analysis project. The goal of the project is to apply Hazy's ideas to the problem of detecting neutrinos from the Big Bang in collaboration with the IceCube Neutrino Detector and Wisconsin Institutes for Discovery.
LogicBlox The Hazy group is excited to collaborate with LogicBlox! Thank you, LogicBlox, for your generous research gift to support our ongoing work on Tuffy and Felix.
DARPA DARPA's Machine Reading Program has the goal of understanding information expressed as free-form text. We are building a scalable engine to process a probabilistic logic called Markov Logic to support this effort.
Thank You! The Hazy group would like to thank our sponsors in the past and coming year: The Microsoft Jim Gray Lab, DARPA/AFOSR via SRI, the NSF, Google, Johnson Controls Inc., the University of Wisconsin-Madison, the Office of Naval Research, and Physical Layer Systems. In addition, we would like to thank our collaborators at the Wisconsin Institutes for Discovery, HP Labs-China, LogicBlox, Greenplum, Oracle and IBM.

Current PhD Students

Chris Aberger Coadvisor: Kunle Olukotun
Xiao Cheng
Chris De Sa Coadvisor: Kunle Olukotun
Emily Mallory (Biomedical Informatics) Coadvisor: Russ Altman
Stefan Hadjis
Rohan Puttagunta
Alex Ratner
Jaeho Shin
Feiran Wang
Sen Wu

Current MS and Coterm Students

Dan Iter
Thomas Palomares
Susan Tu

Current Postdocs

Stephen Bach Coadvisor: Jure Leskovec
Madalina Fiterau Coadvisor: Scott Delp
Jason Fries Coadvisor: Scott Delp
Ioannis Mitliagkas Coadvisor: Lester Mackey
Theodoros Rekatsinas
Ce Zhang

Stanford Students and Postdocs who I regularly harass:

Manas Joglekar Advisor: Hector Garcia-Molina
Johannes Birgmeier Advisor: Gill Bejerano
Jiyan Yang Advisor: Michael Saunders (ICME) and Michael Mahoney (Berkeley)
Kun-Hsing Yu Advisor: Michael Snyder (BioE)

Alumni (Degree year, First Employment)

Ce Zhang (PhD 2015, Postdoc at Stanford)
Michael Fitzpatrick (MS 2015, Google)
Firas Abuzaid (MS 2015, MIT for PhD)
Zifei Shan (MS 2015, Lattice)
Adam Goldberg (BS 2015, Rubrik)
Adam Perelman (BS 2015, Good Eggs)
Srikrishna Sridhar (PhD 2014, GraphLab) Main Advisor: Stephen J. Wright
Victor Bittorf (MS 2014, Cloudera)
Vidhya Govindaraju (MS 2014, Oracle)
Mark Wellons (MS 2013, Amazon)
Arun Kumar (MS 2013, Wisconsin for PhD)
Feng Niu (PhD 2012, Google)
Xixi Luo (MS in Industrial Engineering 2012, Oracle)
Vinod Ramachandran (MS 2011, Oracle)
M. Levent Koc (MS 2011, Google)
Balaji Gopalan (MS 2010, Google)

We are working on two broad topics:

(1) DeepDive is a new type of system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing data systems. DeepDive's most popular use case is to transform the dark data of web pages, pdfs, and other databases into rich SQL-style databases. In turn, these databases can be used to support both SQL-style and predictive analytics. Recently, some DeepDive-based applications have exceeded the quality of human volunteer annotators in both precision and recall for complex scientific articles. Data produced by DeepDive is used by several law enforcement agencies and NGOs to fight human trafficking. The technical core of DeepDive is an engine that combines extraction, integration, and prediction into a single engine with probabilistic inference as its core operation. A one pager with key design highlights is here. PaleoDeepDive is featured in the July 2015 issue of Nature.

the data, the output of various tools, the input from users — including the program the developer writes — are observations from which the system statistically infers the answer

(2) Fundamentals of Data Processing. Almost all data processing systems have their intellectual roots in first order logic. The most computationally expensive (and most interesting) operation in such systems is the relational join. Recently, I helped discover the first join algorithm with optimal worst-case running time. This result uses a novel connection between logic, combinatorics, and geometry. We are using this connection to develop new attacks on classical problems in listing patterns in graphs and in statistical inference. Two threads have emerged:
- Demos, Examples, and Papers.
  - Worst-case Optimal Joins. We have posted a survey for SIGMOD record about recent advances in join algorithms. Our goal is to give a high-level view of the results for practitioners and applied researchers. We also managed to simplify the arguments. A full version of our join algorithm with worst-case optimal running time is here. The LogicBlox guys have their own commercial worst-case optimal algorithm. Our new system, EmptyHeaded is based on this theory.
  - Beyond Worst-case Joins. This work is our attempt to go beyond worst-case analysis for join algorithms. We (with Dung Nguyen) develop a new algorithm that we call Minesweeper based on these ideas. The main theoretical idea is to formalize the amount of work any algorithm spends certifying (using a set of propositional statements) that the output set is complete (and not, say, a proper subset). We call this set of propositions the certificate. We manage to establish a dichotomy theorem for this stronger notion of complexity: if a query is what Ron Fagin calls beta-acyclic, then Minesweeper runs in time linear in the certificate; if a query is beta-cyclic than on some instance any algorithm takes time that is super linear in the certificate. The results get sharper and more fun.
  - Almost to one algorithm to rule them all? We have a much better description of beyond worst-case optimality with a resolution framework and a host of new results for different indexing strategies. This paper supercedes many of the results in Minesweeper and in a much nicer way!. We also hope to connect more of geometry and resolution... but we'll see!
  - A first part of our attack on conditioning for combinatorial problems is in NIPS and on Arxiv.
  - It is not difficult to get me interested in a theory problem. Ask around the Infolab if you don't believe me.
Our goal is to understand the fundamentals of data processing systems.

CS145

here

CS145, Introduction to Databases. Fall 15, Fall 14.
CS346, Database System Implementation. Spring 15 Spring 14.
CS341, Project in Mining Massive Data Sets. Spring 15 Spring 16.
CS345 Advanced Database Systems. Winter 14.

Recent Service
- Editorial Boards
  - Transactions of Database Systems (Associate Editor) 2014-2017
  - Foundations and Trends (Associate Editor) 2014-2017
  - Springer Series in Data Science 2014-2017
- Conferences: Tutorial CoChair at VLDB16; SIGMOD 2013-2015,2017 (External 2016); PODS 2016; ICDT 2015; ACL 2015; CIDR 2015; VLDB 2014-2015; ICDE 2011-2013,2015; EDBT 2012.
- Journal Reviewer: TODS, JACM, CACM, TPDS, TODS, VLDBJ, TKDE,

News!

Students

Papers

Teaching/Service

Current PhD Students

Current MS and Coterm Students

Current Postdocs

Stanford Students and Postdocs who I regularly harass:

Alumni (Degree year, First Employment)