-
DeepDive is a new type of system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing data systems. DeepDive's most popular use case is to transform the dark data of web pages, PDFs, and other databases into rich SQL-style databases. In turn, these databases can be used to support both SQL-style and predictive analytics. Recently, some DeepDive-based applications have exceeded the quality of human volunteer annotators in both precision and recall on complex scientific articles. Data produced by DeepDive is used by several law enforcement agencies and NGOs to fight human trafficking. The technical core of DeepDive is a single engine that combines extraction, integration, and prediction, with probabilistic inference as its core operation. A one-pager with key design highlights is here. PaleoDeepDive is featured in the July 2015 issue of Nature.
-
News
- Elated that my group's work was honored by a MacArthur Foundation Fellowship. So excited for what's next!
- Ce's thesis describes DeepDive.
-
MEMEX. DeepDive helps power the MEMEX project in the fight against human trafficking. The project was recently featured on 60 Minutes, Forbes, Scientific American, Wall St. Journal, BBC, and Wired. It's supporting investigations.
- NIPS15
- Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width by Chris De Sa explains a notion of width that allows one to bound mixing times for factor graphs. (Spotlight)
- Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms by Chris De Sa et al. derives results for low precision and non-convex Hogwild! (asynchronous) style algorithms.
- Asynchronous stochastic convex optimization by John C. Duchi and Tum Chaturapruek explores the limits of asynchrony for convex optimization. As John puts it, "Nothing Really Matters".
- SODA16
- Weighted SGD for lp Regression with Randomized Preconditioning by Jiyan, Yin Lam Chow, Michael Mahoney, and me looks at some preconditioning methods to speed up SGD in theory and practice.
- VLDB15
- Incremental Knowledge Base Construction Using DeepDive is our latest description of DeepDive.
- A Demonstration of Data Labeling in Knowledge Base Construction describes Jaeho's Mindtagger tool, which has really been our secret sauce to build DeepDive applications with high quality.
- Honored to receive the VLDB Early Career Award for scalable analytics. Talk video.
-
Data. We're giving away data! Big, marked-up datasets.
- Manuscripts. In the end, it all goes into DeepDive...
- Joins and Graph Processing. Frank explains our joins work very nicely.
- EmptyHeaded: A Relational Algebra for Graph Processing by Chris Aberger discusses how to use SIMD hardware to support our worst-case optimal join algorithms to find graph patterns. It's fast!
- Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
- It’s all a matter of degree: Using degree information to optimize multiway joins by Manas Joglekar discusses one technique to use degree information to go faster (asymptotically!).
- Rohan and Manas extend our new join algorithms to message passing and, in turn, to fast matrix multiplication. A step toward the vision of unifying relational and linear algebra systems using GHDs.
-
Upcoming Meetings and Talks
- CS Retreat. Oct 2-3.
- Moore DDD Event. Oct 7-9.
- Accenture. Oct 14.
- ONR. Oct 27-29.
- GaTech Colloquium. Nov 20.
- NIPS. Dec 12. Nonconvex Optimization Workshop.
- NIPS. Dec 12. Machine learning systems.
- Chile. Jan 15.
- USC ML. Jan 26.
-
Code
- Our stuff is on GitHub.
- DeepDive is available. Components have their own pages. Elementary: Gibbs sampling on factor graphs over terabytes of data in files, Accumulo, or HBase! Now with BUGS support! Tuffy, which uses an RDBMS to process Markov Logic, is updated.
- Hogwild! SVMs, logistic regression, matrix factorization, and other convex goodness without locking. Specialized versions of trace-norm regularization called Jellyfish and non-negative matrix factorization called HottTopix.
- Code for more projects is here and in MADlib, a product from Oracle, and in Cloudera's Impala.
-
Application Overview Videos (See our YouTube channel, HazyResearch)
- GeoDeepDive. With Shanan Peters (UW Geoscience) and Miron Livny (Condor), we are combining Macrostrat with DeepDive to (hopefully!) deliver value for geoscientists. One key challenge is extracting all the measurement information reported in the literature, which is buried in the dark data of text, graphs, and figures. There is a demo video and a new video about quality that is higher than that of the volunteers who have been at this for the last decade. This is all powered by DeepDive. Thank you to the National Science Foundation and Google for supporting this work.
- IceCube. Mark Wellons, Ben Recht, and I have done some work with the IceCube Neutrino Detector. Mark's code now runs in the detector at the South Pole and is used on over 250 million events per day. More details are in this video, this video, this paper at the International Cosmic Ray Conference 2013, this paper, and a more recent writeup accepted to NIM A and described here. Thank you to the IceCube Collaboration and UW Graduate School for their support of our work! IceCube (and Mark) got the cover of Science! Awesome!
- There are also videos about some of the technical portions of these projects: Matrix Factorization, Seismic Data Interpolation, and a nowcasting framework (now called Ringtail).
- A messy, incomplete log of old updates is here.
Slides for EDBT/ICDT keynote on Joins and Convex Geometry
Our code is on GitHub. Twitter: @HazyResearch.