I'm an associate professor in the InfoLab affiliated with DAWN, Statistical Machine Learning Group, PPL, and SAIL (bio). I work on the foundations of the next generation of data analytics systems. These systems extend ideas from databases, machine learning, and theory, and our group is active in all areas. An application of our work is to make it dramatically easier to build machine learning systems to process dark data including text, images, and video. Our latest project is Snorkel, our code is on github, and there are blog posts about our work. Lab Twitter: @HazyResearch.
The DeepDive (one pager) project was commercialized as Lattice. As of 2017, Lattice is part of Apple.
News
- Latest crop of professors have started! Chris (Cornell), Ioannis (MILA), and Theo (Wisconsin). We miss them already!
- New work on data augmentation and why weak supervision is a critical problem in AI/ML and systems.
- Recent/Upcoming keynotes and talks: EDBT17, UAI17, ABS East 2017, Cornell, Alibaba, CMU (SiValley), SystemX, KBCOM (WSDM), ITBB18.
- Check out YellowFin by Jian and Ioannis, used to scale deep learning to 15 Petaflops with Intel. According to Twitter, it is "one of the most egregious puns I've seen in software." See the blog.
- Our course material from CS145 intro databases is available (send a note), and we'll continue to update it. We're aware of a handful of courses that are using these materials. Drop us a note, if you do!
- A messy, incomplete log of old updates is here.
Research Details
- New Tradeoffs for Machine Learning Systems. The next generation of data systems needs to make fundamentally new tradeoffs. For example, we proved that many statistical algorithms can be run in parallel without locks (Hogwild! or SCD) or with lower precision. This leads to a fascinating systems tradeoff between statistical and hardware efficiency. These ideas have been picked up by web and enterprise companies for everything from recommendation to deep learning. There are limits to the robustness of these algorithms; see our ICML 2016 best paper.
- New Programming Models. Our goal for the last few years has been to dramatically reduce the time analysts spend specifying models, maintaining them, and collaboratively building models. Our new effort for lightweight dark data extraction is Snorkel, which is built on the idea of weak supervision and data programming; see our blog or video. These systems do not use traditional hand-labeled training data, which removes a fundamental obstacle in using machine learning tools.
- New Database Engines. We're thinking about how these new workloads change how one would build a database. We're building a new database, EmptyHeaded, that extends our theoretical work on optimal join processing. Multiway join algorithms are asymptotically and empirically faster than traditional database engines—by orders of magnitude. We're using it to unify database querying, graph patterns, linear algebra and inference, RDF processing, and more soon.
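As a rough illustration of the weak supervision idea mentioned under New Programming Models, here is a minimal Python sketch. It is not Snorkel's API or its model: the labeling functions, their trigger phrases, and the `weak_label` helper are all hypothetical, and the votes are combined by a plain majority vote instead of the learned generative model that data programming uses.

```python
# Hypothetical label values for a binary relation-extraction task.
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1


def lf_keyword(text):
    """Vote POSITIVE when an assumed trigger word appears."""
    return POSITIVE if "causes" in text.lower() else ABSTAIN


def lf_negation(text):
    """Vote NEGATIVE when the sentence is explicitly negated."""
    return NEGATIVE if "does not" in text.lower() else ABSTAIN


def lf_unrelated(text):
    """Vote NEGATIVE when the sentence says the entities are unrelated."""
    return NEGATIVE if "unrelated" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_keyword, lf_negation, lf_unrelated]


def weak_label(text, lfs=LABELING_FUNCTIONS):
    """Combine noisy votes by unweighted majority (a stand-in for a learned model)."""
    total = sum(lf(text) for lf in lfs)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


if __name__ == "__main__":
    for sentence in ["Smoking causes cancer.", "The sky is blue."]:
        print(sentence, weak_label(sentence))
```

The point of the sketch is only that the training labels come from small programs written by a domain expert, not from hand annotation; how the votes are denoised is where the actual research lies.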
To validate our ideas, we continue to build systems that we hope change the way people do science and improve society. This work is with great partners in areas including paleobiology (Nature), drug repurposing, genomics, materials science, and the fight against human trafficking (60 Minutes, Forbes, Scientific American, WSJ, BBC, and Wired). Our work is supporting investigations. In the past, we've worked with a neutrino telescope (IceCube Science cover and our modest contribution) and on economic indicators.
I am an assistant professor in Computer Science at Stanford University. I'm in the InfoLab and affiliated with the PPL and SAIL labs. My interests are theoretical and practical problems in data management. Details of my work can be found in my papers and somewhere on github. I believe that the future of computing is in data management. If you agree, send me a note!
- Recurrence Width for Structured Dense Matrix Vector Multiplication Albert Gu, Rohan Puttagunta, C. Ré, Atri Rudra.
- Socratic Learning: Correcting Misspecified Generative Models using Discriminative Models P. Varma et al.
- Fonduer: Knowledge Base Construction from Richly Formatted Data Sen Wu et al.
Index by year
- Learning to Compose Domain-Specific Transformations for Data Augmentation A. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, C. Ré, NIPS2017.
- Inferring Generative Model Structure with Static Analysis Paroma Varma, Bryan He, Payal Bajaj, C. Ré, NIPS2017.
- Gaussian Quadrature for Kernel Features Tri Dao, Chris De Sa, C. Ré, NIPS2017. spotlight
- HoloClean: Holistic Data Repairs with Probabilistic Inference Theo Rekatsinas, Xu Chu, Ihab F. Ilyas, C. Ré. VLDB 17.
- Weighted SGD for lp regression with Randomized Preconditioning. Jiyan Yang, Yin-Lam Chow, C. Ré, and Michael Mahoney. JMLR 17.
- Learning the Structure of Generative Models without Labeled Data Stephen H. Bach, Bryan He, Alex Ratner, C. Ré. ICML 2017.
- Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent. C. De Sa, Matt Feldman, C. Ré, Kunle Olukotun. ISCA 2017.
- GYM: A Multiround Join Algorithm In MapReduce and Its Analysis Foto Afrati, Manas Joglekar, C. Ré, Semih Salihoglu, Jeffrey D. Ullman. ICDT 2017.
- SLiMFast: Guaranteed Results for Data Fusion and Source Reliability. Manas Joglekar, Theodoros Rekatsinas, H. Garcia-Molina, et al. SIGMOD 17.
- A Relational Format for Feature Engineering Benny Kimelfeld, C. Ré. PODS 2017. Best of PODS.
- Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded. Chris Aberger, Andy Lamb, Kunle Olukotun, C. Ré. VLDB17 (demo)
- Snorkel: Fast Training Set Generation for Information Extraction. Alex Ratner, Stephen Bach, Henry Ehrenberg, C. Ré. SIGMOD 17 (demo).
- Snorkel: A System for Lightweight Extraction Alex Ratner, Stephen Bach et al. CIDR 2017 (one pager)
- Flipper: A Systematic Approach to Debugging Training Sets. Paroma Varma, Dan Iter, C. De Sa and C. Ré. HILDA 2017
- Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features Kun-Hsing Yu et al. Nature Comms. 2016. Parasite Award
- Data Programming: Creating Large Training Sets, Quickly Alex Ratner, Chris De Sa, Sen Wu, Daniel Selsam, and C. Ré. NIPS 2016. video.
- Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much. Bryan He, C. De Sa, I. Mitliagkas, and C. Ré. NIPS 2016. video.
- Sub-sampled Newton Methods with Non-uniform Sampling Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, C. Ré, and Michael Mahoney. NIPS 2016.
- Cyclades: Conflict-free Asynchronous Machine Learning. Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, et al. NIPS 2016.
- Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling Chris De Sa, Kunle Olukotun, C. Ré. ICML 2016. Best Paper Award.
- EmptyHeaded: A Relational Engine for Graph Processing Christopher R. Aberger, Susan Tu, Kunle Olukotun, and C. Ré. SIGMOD 2016. Best Of.
- Aggregations over Generalized Hypertree Decompositions. Manas Joglekar, Rohan Puttagunta, and C. Ré. PODS 2016.
- Extracting Databases from Dark Data with DeepDive. Ce Zhang, Michael Cafarella, Feng Niu, C. Ré, Jaeho Shin. SIGMOD 2016 (Industrial Track).
- High Performance Parallel Stochastic Gradient Descent in Shared Memory. S. Sallinen, N. Satish, M. Smelyanskiy, S. Sury, C. Ré. IPDPS16.
- It’s all a matter of degree: Using degree information to optimize multiway joins. Manas Joglekar and C. Ré. ICDT2016. Best Of.
- Weighted SGD for lp Regression with Randomized Preconditioning Jiyan Yang, Yin-Lam Chow, C. Ré, and Michael Mahoney. SODA16.
- Asynchrony begets Momentum, with an Application to Deep Learning. Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, and C. Ré. Allerton 16.
- Incremental Knowledge Base Construction Using DeepDive Jaeho Shin, Sen Wu, Feiran Wang, Ce Zhang, C. De Sa, C. Ré. VLDBJ.
- Materialization Optimizations for Feature Selection. Ce Zhang, Arun Kumar, and C. Ré. TODS 2016.
- A Resolution-based Framework for Joins: Worst-case and Beyond. Mahmoud Abo Khamis, Hung Q. Ngo, C. Ré, and Atri Rudra. TODS 2016.
- DeepDive: Declarative Knowledge Base Construction. Chris De Sa, Alex Ratner, C. Ré, J. Shin, F.Wang, Sen Wu, Ce Zhang. SIGMOD Record 2016.
- Socratic Learning: Empowering the Generative Model Paroma Varma, Rose Yu, Dan Iter, C. De Sa, C. Ré. FiLM-NIPS 2016.
- Data Programming with DDLite: Putting Humans in a Different Part of the Loop. Henry Ehrenberg, J. Shin, A. Ratner, J. Fries, C. Ré. HILDA16
- Parallel SGD: When does Averaging Help? Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and C. Ré. OptML16
- Old Techniques for New Join Algorithms: A Case Study in RDF Processing. Chris Aberger, Susan Tu, Kunle Olukotun, and C. Ré. DESWEB. ICDE16.
- Wikipedia Knowledge Graph with DeepDive. Thomas Palomares, Youssef Ahres, Juhana Kangaspunta and C. Ré. In Wiki Workshop at ICWSM 2016.
- Dark Data: Are We Solving the Right Problems? M. Cafarella, I. Ilyas, M. Kornacker, Tim Kraska, C. Ré. ICDE 2016 (Panel).
- Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs. Stefan Hadjis, Ce Zhang, Ioannis Mitliagkas, and C. Ré.
- Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width Chris De Sa, Ce Zhang, Kunle Olukotun, C. Ré. NIPS15. Spotlight
- Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms Chris De Sa, Ce Zhang, Kunle Olukotun, and C. Ré. NIPS15
- Asynchronous stochastic convex optimization. John C. Duchi, Sorathan Chaturapruek, and C. Ré. NIPS15.
- Incremental Knowledge Base Construction Using DeepDive Jaeho Shin, Sen Wu, Feiran Wang, Ce Zhang, C. De Sa, C. Ré. VLDB15. Best of Issue
- Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems Christopher De Sa, Kunle Olukotun, and C. Ré. ICML15.
- A Resolution-based Framework for Joins: Worst-case and Beyond. Mahmoud Abo Khamis, Hung Q. Ngo, C. Ré, and Atri Rudra. PODS15. Best of Issue.
- Exploiting Correlations for Expensive Predicate Evaluation Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, and C. Ré. SIGMOD15
- A Demonstration of Data Labeling in Knowledge Base Construction. Jaeho Shin, Mike Cafarella, and C. Ré. VLDB15 (demo).
- Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?. SIGMOD15 Panel.
- Large-scale extraction of gene interactions from full text literature using DeepDive Emily Mallory, Ce Zhang, C. Ré., Russ Altman. Bioinformatics
- Caffe con Troll: Shallow Ideas to Speed up Deep Learning. Firas Abuzaid, Stefan Hadjis, Ce Zhang, and C. Ré. DANAC15.
- Join Processing for Graph Patterns: An Old Dog with New Tricks Dung Nguyen, LogicBlox, et al. GRADES15.
- DunceCap: Compiling Worst-case Optimal Query Plans. Adam Perelman and C. Ré. Winner of SIGMOD Undergrad Research Competition.
- DunceCap: Query Plans Using Generalized Hypertree Decompositions. Susan Tu and C. Ré. Winner of SIGMOD Undergrad Research Competition.
- A Database Framework for Classifier Engineering. Benny Kimelfeld and C. Ré. AMW 2015.
- The Mobilize Center: an NIH big data to knowledge center to advance human movement research and improve mobility Ku et al. AMIA.
- Materialization Optimizations for Feature Selection. Ce Zhang, Arun Kumar, and C. Ré. SIGMOD 2014. Best Paper Award.
- DimmWitted: A Study of Main-Memory Statistical Analytics. Ce Zhang and C. Ré. VLDB 2014.
- An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. J. Liu, S. Wright, C. Ré, V. Bittorf, S. Sridhar. ICML 2014. (JMLR version)
- Beyond Worst-case Analysis for Joins using Minesweeper. Hung Q. Ngo, Dung Nguyen, C. Ré, and Atri Rudra. PODS 2014. [Full]
- Parallel Feature Selection Inspired by Group Testing. Y. Zhou et al. NIPS2014.
- The Theory of Zeta Graphs with an Application to Random Networks. C. Ré. ICDT 2014. Invited to "Best of" Special Issue.
- Transducing Markov Sequences Benny Kimelfeld and C. Ré. JACM 2014.
- A Machine-compiled Macroevolutionary History of Phanerozoic Life. Shanan E. Peters, Ce Zhang, Miron Livny, and C. Ré. PloS ONE.
- Using Social Media to Measure Labor Market Flow D. Antenucci, M. Cafarella, M. Levenstein, C. Ré, and M. Shapiro. NBER. Selected for NBER Digest
- Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems Christopher De Sa, Kunle Olukotun, and C. Ré. Preliminary version in Distributed Matrix Computation with NIPS14.
- Feature Engineering for Knowledge Base Construction DeepDive Group. Data Engineering Bulletin.
- Tradeoffs in Main-Memory Statistical Analytics: Impala to DimmWitted (Invited) V. Bittorf, M. Kornacker, C. Ré, C. Zhang. IMDM with VLDB14.
- The Beckman Report on Database Research Mike Carey, AnHai Doan, et al. 2014.
- Links between Join Processing and Convex Geometry, C. Ré. ICDT 2014 (Invited Abstract for Keynote) [slides].
- Skew Strikes Back: New Developments in the Theory of Join Algorithms. Hung Ngo, C. Ré, and Atri Rudra. SIGMOD Rec. 2013.
- Towards High-Throughput Gibbs Sampling at Scale: A Study across Storage Managers. Ce Zhang and C. Ré. SIGMOD 2013.
- An Approximate, Efficient LP Solver for LP Rounding. Srikrishna Sridhar, Victor Bittorf, Ji Liu, Ce Zhang, C. Ré, and Stephen J. Wright. NIPS 2013
- Brainwash: A Data System for Feature Engineering. M. Anderson et al. CIDR Conference 2013 (Vision Track)
- Understanding Tables in Context Using Standard NLP Toolkits Vidhya Govindaraju, Ce Zhang, and C. Ré. ACL 2013 (Short Paper)
- Hazy: Making it Easier to Build and Maintain Big-data Analytics. Arun Kumar, Feng Niu, and C. Ré. ACM Queue, 2013. Invited to CACM March 2013
- Ringtail: Nowcasting Made Easy D. Antenucci, M.J. Cafarella, M.C. Levenstein, C. Ré, and M. Shapiro. WebDB 2013 with SIGMOD 2013
- Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion B. Recht and C. Ré. Mathematical Programming Computation, 2013.
- Ringtail: Nowcasting Made Easy. Dolan Antenucci, Erdong Li, Shaobo Liu, Michael J. Cafarella, and C. Ré. VLDB Demo 2013.
- Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System P. Konda, A. Kumar, C. Ré, and V. Sashikanth. VLDB Demo 2013
- GeoDeepDive: Statistical Inference using Familiar Data-Processing Languages. Ce Zhang, V. Govindaraju, J. Borchardt, T. Foltz, C. Ré, and S. Peters. SIGMOD 13 (demo).
- Building an Entity-Centric Stream Filtering Test Collection for TREC 2012. J.R. Frank, M. Kleiman-Weiner, D. A. Roberts, F. Niu, Ce Zhang, C. Ré, and I. Soboroff. TREC 2013
- Improvement in Fast Particle Track Reconstruction with Robust Statistics M. Wellons, IceCube Collaboration, B. Recht, and C. Ré. Nuclear Inst. and Methods in Physics Research, A.
- Robust Statistics in IceCube Initial Muon Reconstruction. M. Wellons, IceCube Collaboration, B. Recht, and C. Ré. International Cosmic Ray Conference 2013.
- Factoring nonnegative matrices with linear programs. Victor Bittorf, Benjamin Recht, C. Ré, and Joel A. Tropp. NIPS 2012. Revised Version.
- The MADlib Analytics Library or MAD Skills, the SQL. Joseph M. Hellerstein et al. PVLDB 2012
- Probabilistic Management of OCR using an RDBMS Arun Kumar and C. Ré. PVLDB 2012. [Full Version]
- Optimizing Statistical Information Extraction Programs Over Evolving Text Fei Chen, Xixuan Feng, C. Ré, and Min Wang. ICDE. [Full Version]
- Understanding cardinality estimation using entropy maximization C. Ré and Dan Suciu. ACM Trans. Database Syst. Volume 37.
- Towards a Unified Architecture for In-Database Analytics Aaron Feng, Arun Kumar, Benjamin Recht, and C. Ré. SIGMOD 2012. [Full Version]
- Worst-case Optimal Join Algorithms Hung Q. Ngo, Ely Porat, C. Ré, and Atri Rudra. PODS, 2012. Best Paper Award
- Big Data versus the Crowd: Looking for Relationships in All the Right Places Ce Zhang, Feng Niu, C. Ré, and Jude Shavlik. ACL, 2012.
- Toward a noncommutative arithmetic-geometric mean inequality B. Recht and C. Ré. COLT, 2012 [Full Version]
- Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference F. Niu, Ce Zhang, C. Ré, and J. Shavlik. IJSWIS, Special Issue on Knowledge Extraction from the Web, 2012.
- DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference F. Niu, C. Zhang, C. Ré, and J. Shavlik. VLDS, 2012.
- Scaling Inference for Markov Logic via Dual Decomposition (Short). F. Niu, C. Zhang, C. Ré, and J. Shavlik. ICDM, 2012.
- Probabilistic Databases. Dan Suciu, Dan Olteanu, C. Ré, and Christoph Koch. Morgan Claypool's Synthesis Lectures, 2011
- Incrementally maintaining classification using an RDBMS Mehmet Levent Koc and C. Ré. PVLDB Volume 4, 2011, p. 302-313
- Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS F. Niu, C. Ré, A.Doan, and J.W. Shavlik. PVLDB 11, [Full Version]
- Automatic Optimization for MapReduce Programs Eaman Jahani, Michael J. Cafarella, and C. Ré. PVLDB 2011.
- Queries and materialized views on probabilistic databases. Nilesh N. Dalvi, C. Ré, and Dan Suciu. JCSS 2011.
- Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion B. Recht and C. Ré. 2011.
- Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent F. Niu, B. Recht, C. Ré, and S. J. Wright. NIPS, 2011. [Full Version]
- Felix: Scaling Inference for Markov Logic with an Operator-based Approach. Feng Niu, Ce Zhang, C. Ré, and Jude Shavlik.
- Manimal: Relational Optimization for Data-Intensive Programs Michael J. Cafarella and C. Ré. WebDB, 2010.
- Transducing Markov Sequences Benny Kimelfeld and C. Ré. PODS, 2010. Invited to Special Issue
- Understanding Cardinality Estimation using Entropy Maximization C. Ré and Dan Suciu. PODS, 2010. Invited to Special Issue
- Approximation Trade-Offs in a Markovian Stream Warehouse: An Empirical Study (Short) J. Letchner, C. Ré, M. Balazinska, and M. Philipose. ICDE.
- C. Ré. Managing Large-Scale Probabilistic Databases. University of Washington, Seattle, 2009. Winner of SIGMOD Jim Gray Thesis Award
- Raghav Kaushik, C. Ré, and Dan Suciu. General Database Statistics Using Entropy Maximization. DBPL, 2009, p. 84-99 [Talk]
- Katherine F. Moore, Vibhor Rastogi, C. Ré, and Dan Suciu. Query Containment of Tier-2 Queries over a Probabilistic Database. Management of Uncertain Databases (MUD), 2009
- Julie Letchner, C. Ré, Magdalena Balazinska, and Matthai Philipose. Access Methods for Markovian Streams. ICDE, 2009, p. 246-257
- Arvind Arasu, C. Ré, and Dan Suciu. Large-Scale Deduplication with Constraints Using Dedupalog. ICDE, 2009, p. 952-963 [Talk] Selected as one of the best papers in ICDE 2009
- Nilesh N. Dalvi, C. Ré, and Dan Suciu. Probabilistic databases: Diamonds in the dirt. Commun. ACM Volume 52, 2009, p. 86-94 [Full Version]
- S. Manegold, I. Manolescu, L. Afanasiev, J. Feng, G. Gou, M. Hadjieleftheriou, S. Harizopoulos, P. Kalnis, K. Karanasos, D. Laurent, M. Lupu, N. Onose, C. Ré, V. Sans, P. Senellart, T. Wu, and D. Shasha. Repeatability & Workability Evaluation of SIGMOD 2009. SIGMOD Record Volume 38, 2009, p. 40-43
- Julie Letchner, C. Ré, Magdalena Balazinska, and Matthai Philipose. Lahar Demonstration: Warehousing Markovian Streams. PVLDB Volume 2, 2009, p. 1610-1613
- C. Ré and Dan Suciu. The Trichotomy of HAVING Queries on a Probabilistic Database. VLDB Journal 2009
- C. Ré. Managing Probabilistic Data with Mystiq (Plenary Talk). Dagstuhl Seminar 08421: Uncertainty Management in Information Systems, 2008
- C. Ré and Dan Suciu. Advances in Processing SQL Queries on Probabilistic Data. Invited Abstract in INFORMS 2008, Simulation, 2008
- Ting-You Wang, C. Ré, and Dan Suciu. Implementing NOT EXISTS Predicates over a Probabilistic Database. QDB/MUD, 2008, p. 73-86
- Nodira Khoussainova, Evan Welbourne, Magdalena Balazinska, Gaetano Borriello, Garrett Cole, Julie Letchner, Yang Li, C. Ré, Dan Suciu, and Jordan Walke. A demonstration of Cascadia through a digital diary application. SIGMOD Conference, 2008, p. 1319-1322
- C. Ré, Julie Letchner, Magdalena Balazinska, and Dan Suciu. Event queries on correlated probabilistic streams. SIGMOD Conference, 2008, p. 715-728
- C. Ré and Dan Suciu. Managing Probabilistic Data with MystiQ: The Can-Do, the Could-Do, and the Can't-Do. SUM, 2008, p. 5-18
- Julie Letchner, C. Ré, Magdalena Balazinska, and Matthai Philipose. Challenges for Event Queries over Markovian Streams. IEEE Internet Computing Volume 12, 2008, p. 30-36
- C. Ré and Dan Suciu. Approximate lineage for probabilistic databases. PVLDB Volume 1, 2008, p. 797-808 [Full Version][Talk] The version above corrects an error in the statement of lemma 3.7.
- Magdalena Balazinska, C. Ré, and Dan Suciu. Systems aspects of probabilistic data management (Part I). PVLDB Volume 1, 2008, p. 1520-1521 [Talk]
- Magdalena Balazinska, C. Ré, and Dan Suciu. Systems aspects of probabilistic data management (Part II). PVLDB Volume 1, 2008, p. 1520-1521 [Talk]
- Michael J. Cafarella, C. Ré, Dan Suciu, and Oren Etzioni. Structured Querying of Web Text Data: A Technical Challenge. CIDR, 2007, p. 225-234
- C. Ré and Dan Suciu. Management of data with uncertainties. CIKM, 2007, p. 3-8
- C. Ré, Dan Suciu, and Val Tannen. Orderings on Annotated Collections. Liber Amicorum in honor of Jan Paredaens 60th Birthday, 2007
- C. Ré and Dan Suciu. Efficient Evaluation of HAVING Queries. DBPL, 2007, p. 186-200 [Full Version][Talk]
- C. Ré, Nilesh N. Dalvi, and Dan Suciu. Efficient Top-k Query Evaluation on Probabilistic Data. ICDE, 2007, p. 886-895 [Full Version][Talk]
- C. Ré and Dan Suciu. Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization. VLDB, 2007, p. 51-62 [Full Version][Talk]
- C. Ré. Applications of Probabilistic Constraints (General Exam Paper). University of Washington TR#2007-03-03, 2007
- Eytan Adar and C. Ré. Managing Uncertainty in Social Networks. IEEE Data Eng. Bull. Volume 30, 2007, p. 15-22
- Giorgio Ghelli, C. Ré, and Jérôme Siméon. XQuery!: An XML Query Language with Side Effects. EDBT Workshops, 2006, p. 178-191
- C. Ré, Jérôme Siméon, and Mary F. Fernández. A Complete and Efficient Algebraic Compiler for XQuery. ICDE, 2006, p. 14
- C. Ré, Nilesh N. Dalvi, and Dan Suciu. Query Evaluation on Probabilistic Databases. IEEE Data Eng. Bull. Volume 29, 2006, p. 25-31
- Chavdar Botev, Hubert Chao, Theodore Chao, Yim Cheng, Raymond Doyle, Sergey Grankin, Jon Guarino, Saikat Guha, Pei-Chen Lee, Dan Perry, C. Ré, Ilya Rifkin, Tingyan Yuan, Dora Abdullah, Kathy Carpenter, David Gries, Dexter Kozen, Andrew C. Myers, David I. Schwartz, and Jayavel Shanmugasundaram. Supporting workflow in a course management system. SIGCSE, 2005, p. 262-266
- Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, C. Ré, and Dan Suciu. MYSTIQ: a system for finding more answers by using probabilities. SIGMOD Conference, 2005, p. 891-893
- Nathan Bales, James Brinkley, E. Sally Lee, Shobhit Mathur, C. Ré, and Dan Suciu. A Framework for XML-Based Integration of Data, Visualization and Analysis in a Biomedical Domain. XSym, 2005, p. 207-221
- C. Ré, Jim Brinkley, Kevin Hinshaw, and Dan Suciu. Distributed XQuery. Workshop on Information Integration on the Web (IIWeb), 2004, p. 116-121
- Werner Vogels and C. Ré. WS-Membership - Failure Management in a Web-Services World. WWW (Alternate Paper Tracks), 2003
- Werner Vogels, C. Ré, Robbert Renesse, and Kenneth P. Birman. A Collaborative Infrastructure for Scalable and Robust News Delivery. ICDCS Workshops, 2002, p. 655-659
Christopher (Chris) Ré is an associate professor in the Department of Computer Science at Stanford University in the InfoLab who is affiliated with the Statistical Machine Learning Group, Pervasive Parallelism Lab, and Stanford AI Lab. His work's goal is to enable users and developers to build applications that more deeply understand and exploit data. His contributions span database theory, database systems, and machine learning, and his work has won best paper at a premier venue in each area, respectively, at PODS 2012, SIGMOD 2014, and ICML 2016. In addition, work from his group has been incorporated into major scientific and humanitarian efforts, including the IceCube neutrino detector, PaleoDeepDive and MEMEX in the fight against human trafficking, and into commercial products from major web and enterprise companies. He cofounded a company, based on his research, that was acquired by Apple in 2017. He received a SIGMOD Dissertation Award in 2010, an NSF CAREER Award in 2011, an Alfred P. Sloan Fellowship in 2013, a Moore Data Driven Investigator Award in 2014, the VLDB Early Career Award in 2015, the MacArthur Foundation Fellowship in 2015, and an Okawa Research Grant in 2016.
Current PhD Students
- Chris Aberger Coadvisor: Kunle Olukotun
- Xiao Cheng
- Tri Dao Coadvisor: Stefano Ermon
- Emily Mallory (Biomedical Informatics) Principal advisor: Russ Altman
- Albert Gu
- Braden Hancock
- Bryan He
- Alex Ratner
- Paroma Varma
- Sen Wu
- Peng Xu Coadvisor: Michael Mahoney
- Jian Zhang
Current MS and Coterm Students
- Payal Balaji
- Ines Chami
Current Postdocs
- Stephen Bach Coadvisor: Jure Leskovec
- Jared Dunnmon
- Madalina Fiterau Coadvisor: Scott Delp
- Jason Fries Coadvisor: Scott Delp
- Fred Sala
- Virginia Smith
PhD and Postdoc Alumni (Degree year, First Employment)
- Chris De Sa (PhD 2017, Asst. Professor at Cornell) Coadvisor: Kunle Olukotun
- Ioannis Mitliagkas (Postdoc 2017, Asst. Professor at Montréal) Coadvisor: Lester Mackey
- Theodoros Rekatsinas (Postdoc 2017, Asst. Professor at Wisconsin)
- Jaeho Shin (PhD 2016, Lattice)
- Jiyan Yang (PhD 2016, Facebook) Advisor: Michael Saunders (ICME) and Michael Mahoney (Berkeley)
- Kun-Hsing Yu (PhD 2016, Harvard Postdoc) Advisor: Michael Snyder (BioE)
- Manas Joglekar (PhD 2016, Google) Advisor: Hector Garcia-Molina
- Ce Zhang (PhD 2015, Postdoc 2016, Asst. Professor at ETH)
- Srikrishna Sridhar (PhD 2014, GraphLab) Main Advisor: Stephen J. Wright
- Feng Niu (PhD 2012, Google, Lattice Cofounder)
MS Alumni (Degree year, First Employment)
- Henry Ehrenberg (MS 2017, Facebook)
- Andy Lamb (CoTerm MS 2017, Google)
- Rohan Puttagunta (MS 2016, Facebook)
- Thomas Palomares (MS 2016, Startup)
- Susan Tu (CoTerm MS 2016, Stripe)
- Feiran Wang (MS 2016, LinkedIn)
- Michael Fitzpatrick (MS 2015, Google)
- Firas Abuzaid (MS 2015, MIT for PhD)
- Zifei Shan (MS 2015, Lattice)
- Adam Goldberg (BS 2015, Rubrik)
- Adam Perelman (BS 2015, Good Eggs)
- Victor Bittorf (MS 2014, Cloudera)
- Vidhya Govindaraju (MS 2014, Oracle)
- Mark Wellons (MS 2013, Amazon)
- Arun Kumar (MS 2013, Wisconsin for PhD, Asst. Professor UCSD)
- Xixi Luo (MS in Industrial Engineering 2012, Oracle)
- Vinod Ramachandran (MS 2011, Oracle)
- M. Levent Koc (MS 2011, Google)
- Balaji Gopalan (MS 2010, Google)
We are working on two broad topics:
- (1) DeepDive is a new type of system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing data systems. DeepDive's most popular use case is to transform the dark data of web pages, PDFs, and other databases into rich SQL-style databases. In turn, these databases can be used to support both SQL-style and predictive analytics. Recently, some DeepDive-based applications have exceeded the quality of human volunteer annotators in both precision and recall for complex scientific articles. Data produced by DeepDive is used by several law enforcement agencies and NGOs to fight human trafficking. The technical core of DeepDive is a single engine that combines extraction, integration, and prediction, with probabilistic inference as its core operation. A one pager with key design highlights is here. PaleoDeepDive is featured in the July 2015 issue of Nature.
- (2) Fundamentals of Data Processing. Almost all data processing systems have their intellectual roots in first-order logic. The most computationally expensive (and most interesting) operation in such systems is the relational join. Recently, I helped discover the first join algorithm with optimal worst-case running time. This result uses a novel connection between logic, combinatorics, and geometry. We are using this connection to develop new attacks on classical problems in listing patterns in graphs and in statistical inference. Two threads have emerged:
- The first theme is that these new worst-case-optimal algorithms are fundamentally different from the algorithms used in (most of) today's data processing systems. Although our algorithm is optimal in the worst case, commercial relational database engines have been tuned to work well on real data sets by smart people for about four decades. So a difficult question is how to translate these insights into real data processing systems.
- The second theme is that we may need new techniques to get theoretical results strong enough to guide practice. As a result, I've started thinking about "beyond worst-case analysis" and things like conditioning for combinatorial problems to hopefully build theory that can inform practice to a greater extent. The first papers have just been posted.
- Demos, Examples, and Papers.
- Worst-case Optimal Joins. We have posted a survey for SIGMOD Record about recent advances in join algorithms. Our goal is to give a high-level view of the results for practitioners and applied researchers. We also managed to simplify the arguments. A full version of our join algorithm with worst-case optimal running time is here. The LogicBlox guys have their own commercial worst-case optimal algorithm. Our new system, EmptyHeaded, is based on this theory.
- Beyond Worst-case Joins. This work is our attempt to go beyond worst-case analysis for join algorithms. We (with Dung Nguyen) develop a new algorithm that we call Minesweeper based on these ideas. The main theoretical idea is to formalize the amount of work any algorithm spends certifying (using a set of propositional statements) that the output set is complete (and not, say, a proper subset). We call this set of propositions the certificate. We manage to establish a dichotomy theorem for this stronger notion of complexity: if a query is what Ron Fagin calls beta-acyclic, then Minesweeper runs in time linear in the certificate; if a query is beta-cyclic, then on some instance any algorithm takes time that is superlinear in the certificate. The results get sharper and more fun.
- Almost one algorithm to rule them all? We have a much better description of beyond worst-case optimality with a resolution framework and a host of new results for different indexing strategies. This paper supersedes many of the results in Minesweeper, and in a much nicer way! We also hope to connect more of geometry and resolution... but we'll see!
- A first part of our attack on conditioning for combinatorial problems is in NIPS and on arXiv.
- It is not difficult to get me interested in a theory problem. Ask around the InfoLab if you don't believe me.
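For a concrete picture of the join discussion above, here is a minimal sketch of the triangle query evaluated one attribute at a time, with a set intersection at each step. This is only an illustration of the style of algorithm, not the algorithm from the papers above and not what EmptyHeaded or LogicBlox implement; the names `index`, `intersect`, and `triangle_join` are hypothetical, and the relations are assumed to be small in-memory lists of pairs.

```python
def index(pairs):
    """Map each first component to the set of second components paired with it."""
    idx = {}
    for x, y in pairs:
        idx.setdefault(x, set()).add(y)
    return idx


def intersect(s, t):
    """Intersect two collections by scanning the smaller one.

    Works on sets and on dicts (where it intersects the keys).
    """
    if len(s) > len(t):
        s, t = t, s
    return {x for x in s if x in t}


def triangle_join(R, S, T):
    """Return sorted (a, b, c) with (a, b) in R, (b, c) in S, and (a, c) in T."""
    R_ab, S_bc, T_ac = index(R), index(S), index(T)
    out = []
    # Bind a: it must occur as a first component in both R and T.
    for a in intersect(R_ab, T_ac):
        # Bind b: it must follow a in R and start at least one tuple of S.
        for b in intersect(R_ab[a], S_bc):
            # Bind c: it must follow b in S and follow a in T.
            for c in intersect(S_bc[b], T_ac[a]):
                out.append((a, b, c))
    return sorted(out)


# Tiny made-up instance, for illustration only.
R = [(1, 2), (1, 3), (2, 3)]
S = [(2, 3), (3, 4)]
T = [(1, 3), (1, 4), (2, 4)]
print(triangle_join(R, S, T))  # [(1, 2, 3), (1, 3, 4), (2, 3, 4)]
```

The contrast with a traditional plan is that no pairwise intermediate result (say, all of R joined with S) is ever materialized; every candidate value is checked against all relations that mention it before the next attribute is bound.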
DeepDive is our attempt to understand a new type of database system. Our new approach can be summarized as follows: the data, the output of various tools, the input from users — including the program the developer writes — are observations from which the system statistically infers the answer. This view is a radical departure from traditional data processing systems, which assume that the data is one-hundred percent correct. A key problem in DeepDive is that the system needs to consider many possible interpretations for each data item. In turn, we need to explore a huge number of combinations during probabilistic inference, which is one of the core technical challenges.
Our goal is to acquire more sources of data for DeepDive to understand, and to understand them more deeply, so as to change the way that science and industry operate.
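To illustrate why exploring combinations of interpretations is expensive, here is a minimal sketch that computes exact marginal probabilities over a handful of Boolean variables by enumerating every possible world. It is a toy under stated assumptions, not DeepDive's engine or API: the function `marginals`, the weighted-factor encoding, and the example weights are all hypothetical.

```python
import itertools
import math


def marginals(num_vars, factors):
    """Exact P(x_i = True) for Boolean variables, by brute-force enumeration.

    factors is a list of (weight, fn) pairs; fn takes a tuple of booleans
    (one per variable) and returns True when the factor is satisfied.
    A world's unnormalized probability is exp(sum of satisfied weights).
    """
    totals = [0.0] * num_vars
    z = 0.0  # normalizing constant: total score over all worlds
    for world in itertools.product([False, True], repeat=num_vars):
        score = math.exp(sum(w for w, fn in factors if fn(world)))
        z += score
        for i, value in enumerate(world):
            if value:
                totals[i] += score
    return [t / z for t in totals]


# Hypothetical example: one mention with two candidate interpretations.
# x0 = "mention refers to entity A", x1 = "mention refers to entity B".
factors = [
    (2.0, lambda w: w[0]),            # evidence in favor of A
    (0.5, lambda w: w[1]),            # weaker evidence in favor of B
    (-3.0, lambda w: w[0] and w[1]),  # penalty for choosing both
]
print(marginals(2, factors))
```

The loop visits 2^n worlds for n variables, so enumeration is hopeless beyond a few dozen variables; that blow-up is the reason inference over many interpretations is a core technical challenge.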
- CS145, Introduction to Databases. Fall 16, Fall 15, Fall 14.
- CS346, Database System Implementation. Spring 15, Spring 14.
- CS341, Project in Mining Massive Data Sets. Spring 15, Spring 16.
- CS345, Advanced Database Systems. Winter 14.
- Our course material from CS145 intro databases is here, and we'll continue to update it throughout the year. We're aware of a handful of courses that are using these materials; drop us a note if you do!
- Recent Service
- Demo Chair, ICDE 2018
- Editorial Boards
- Transactions on Database Systems (Associate Editor) 2014-2017
- Foundations and Trends (Associate Editor) 2014-2017
- Springer Series in Data Science 2014-2017
- Conferences: ICML 2017; Tutorial CoChair at VLDB16; SIGMOD 2013-2015,2017 (External 2016); PODS 2016; ICDT 2015; ACL 2015; CIDR 2015,2017; VLDB 2014-2015; ICDE 2011-2013,2015; EDBT 2012.
- Journal Reviewer: TODS, JACM, CACM, TPDS, VLDBJ, TKDE.
- Honors and Awards
- Okawa Research Grant, 2016
- ICML Best Paper Award, 2016
- Distinguished Lectures, ONR and FDA, 2016
- CACM Research Highlight for DeepDive, 2016
- MacArthur Foundation Fellowship, 2015
- VLDB Early Career Award, 2015 (talk video)
- Kavli Fellow, NAS, Frontiers of Science, 2015 (unable to attend)
- Gordon & Betty Moore Data-Driven Discovery Award, 2014
- SIGMOD Best Paper Award, 2014
- National Bureau of Economic Research, NBER Digest Highlight, 2014
- Alfred P. Sloan Research Fellowship, 2013
- Robert N. Noyce Faculty Fellowship, 2013
- PODS Best Paper Award, 2012
- NSF CAREER Award, 2011
- ACM SIGMOD Jim Gray Dissertation Award, 2010
- "Best of" Special Issue Paper Awards: Nature Comms 2016 (Research Parasite for Kun); SIGMOD 2016 (TODS); ICML 2016 (IJCAI Best of AI); ICDT 2016 (TOCS); VLDB 2015 (VLDBJ & CACM Research Highlight); PODS 2015 (TODS); SIGMOD 2014 (JACM); ICDT 2014 (TOCS), declined; PODS 2012; PODS 2010, two papers, JACM and TODS; ICDE 2009 (TKDE), declined.