Concept also introduced by Shomit Ghose, Data-X Advisor
This topic relates to projects that make predictions when not all of the relevant data is available. For example, the goal may be to predict a feature like “voting preference”, and some training data may exist based on name, age, sex, and zip code. However, other macro data may be available for various zip codes that is not tied directly to the target person. A method of probabilistically joining new features can be used as part of the prediction. For example, there may be a relationship between voting preference and income and another relationship between zip code and income, so estimates can be sharpened in a probabilistic manner.
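One way to picture the income example is as a marginalization over the shared variable. The following is a minimal sketch, not a prescribed method: the zip codes, income brackets, preference labels "A" and "B", and every probability are made-up illustrative values, and the function name vote_given_zip is hypothetical. It also assumes that voting preference is conditionally independent of zip code once income is known, which real data may not satisfy.

```python
# All numbers below are invented purely for illustration.

# P(income bracket | zip code), e.g. from macro data published per zip code.
P_INCOME_GIVEN_ZIP = {
    "94704": {"low": 0.5, "mid": 0.3, "high": 0.2},
    "94027": {"low": 0.1, "mid": 0.2, "high": 0.7},
}

# P(voting preference | income bracket), e.g. from a separate survey.
P_VOTE_GIVEN_INCOME = {
    "low": {"A": 0.6, "B": 0.4},
    "mid": {"A": 0.5, "B": 0.5},
    "high": {"A": 0.3, "B": 0.7},
}


def vote_given_zip(zip_code):
    """Estimate P(vote | zip) = sum over income of P(vote | income) * P(income | zip).

    Assumption: vote is conditionally independent of zip code given income.
    """
    result = {}
    for income, p_income in P_INCOME_GIVEN_ZIP[zip_code].items():
        for vote, p_vote in P_VOTE_GIVEN_INCOME[income].items():
            result[vote] = result.get(vote, 0.0) + p_vote * p_income
    return result


if __name__ == "__main__":
    for z in P_INCOME_GIVEN_ZIP:
        print(z, vote_given_zip(z))
```

The resulting distribution could then serve as an extra feature, or as a prior, alongside the person-level training data (name, age, sex, zip code), which is one possible reading of how estimates might be sharpened.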
Topic: Probabilistic Joins of Disparate Data Sets
Big Data is vast in volume and also vast in variety, being drawn from a seemingly infinite set of sources. But Big Data’s full benefit can only be gained if arbitrary data from arbitrary sources can be stitched together in statistically valid ways. If data from different sources cannot be combined through common keys, as is traditionally done in database applications, it must instead be stitched together using probability-driven connections. For example, how do you tie a data set that includes gender to one that includes location if there is no common key? What statistical truths can be applied to tie these two data sets together? Probability-driven data joins promise to enable combining and correlating data drawn from different organizations and different endpoints to yield unique insights.