Concept: Inferred Information via Probabilistic Joins

Introduced by Shomit Ghose, Data-X Advisor

This topic relates to projects that make predictions when not all of the relevant data is directly available. For example, the goal may be to predict a feature such as “voting preference,” and some training data may exist based on name, age, sex, and zip code. However, there may also be macro-level data available for various zip codes that is not tied directly to the target person. A method of probabilistically joining new features can be used as part of the prediction. For example, there may be one relationship between voting preference and income, and another relationship between zip code and income; combining the two probabilistically can sharpen the estimates.
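As a minimal sketch of this chaining idea, suppose P(income | zip) has been estimated from public macro data and P(vote | income) from a separate labeled data set. The two can be combined into an estimate of P(vote | zip). All numbers below are illustrative placeholders, not real data, and the chain assumes voting preference depends on zip code only through income:

    # Sketch: chaining two estimated relationships to infer a feature that
    # is never directly observed for the target person.

    # P(income bracket | zip code), e.g. estimated from public macro data.
    income_given_zip = {
        "94704": {"low": 0.5, "mid": 0.3, "high": 0.2},
        "94025": {"low": 0.1, "mid": 0.3, "high": 0.6},
    }

    # P(voting preference | income bracket), e.g. estimated from a separate
    # training set that never mentions zip code.
    vote_given_income = {
        "low":  {"party_a": 0.6, "party_b": 0.4},
        "mid":  {"party_a": 0.5, "party_b": 0.5},
        "high": {"party_a": 0.3, "party_b": 0.7},
    }

    def vote_given_zip(zip_code):
        """P(vote | zip) = sum over income of P(vote | income) * P(income | zip)."""
        result = {}
        for income, p_income in income_given_zip[zip_code].items():
            for vote, p_vote in vote_given_income[income].items():
                result[vote] = result.get(vote, 0.0) + p_income * p_vote
        return result

    print(vote_given_zip("94704"))  # {'party_a': 0.51, 'party_b': 0.49}
    print(vote_given_zip("94025"))  # {'party_a': 0.39, 'party_b': 0.61}

The conditional-independence assumption here (that income screens off zip code) is exactly the kind of statistical relationship that must be validated before such a join can be trusted.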

Topic: Probabilistic Joins of Disparate Data Sets

Big Data is vast in volume and equally vast in variety, drawn from a seemingly infinite set of sources.  But Big Data’s full benefit can be realized only if arbitrary data from arbitrary sources can be stitched together in statistically valid ways.  When data from different sources cannot be combined through common keys, as is traditionally done in database applications, it must instead be stitched together using probability-driven connections.  For example, how do you tie a data set that includes gender to one that includes location if there is no common key?  What statistical truths can be applied to tie these two data sets together?  Probability-driven data joins promise to let data drawn from different organizations and different endpoints be combined and correlated to yield wholly unique insights.
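One established family of techniques for such key-less joins is probabilistic record linkage in the Fellegi–Sunter style. The sketch below is illustrative only: the fields, m/u probabilities, and records are assumptions, not data from this document. Each overlapping field contributes a log-likelihood-ratio “match weight,” and record pairs scoring above a tuned threshold are treated as referring to the same entity:

    import math

    # For each shared field: m = P(agree | true match), u = P(agree | non-match).
    # These values are illustrative assumptions.
    FIELD_PARAMS = {
        "surname":    (0.95, 0.01),
        "birth_year": (0.90, 0.05),
        "city":       (0.85, 0.10),
    }

    def match_weight(rec_a, rec_b):
        """Sum of log-likelihood ratios over shared fields; higher means the
        two records more plausibly describe the same person."""
        weight = 0.0
        for field, (m, u) in FIELD_PARAMS.items():
            if rec_a.get(field) == rec_b.get(field):
                weight += math.log(m / u)              # agreement: evidence for a match
            else:
                weight += math.log((1 - m) / (1 - u))  # disagreement: evidence against
        return weight

    # One record carries gender, the other carries location; no common key.
    a = {"surname": "chen", "birth_year": 1988, "city": "berkeley", "gender": "F"}
    b = {"surname": "chen", "birth_year": 1988, "city": "oakland", "location": (37.8, -122.3)}

    print(match_weight(a, b))  # ~5.65; pairs above a chosen threshold are "joined",
                               # letting gender and location travel together.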

Concept: Data Engineering via Noise Injection

Introduced by Shomit Ghose, Data-X Advisor

This concept revolves around building a code set and experiment to determine what level of “fake and automated” requests to Internet sites would confuse the AI algorithms that track user preferences.  This can be useful to users who want increased privacy, and the results can also be useful to firms that want to understand which information about customers remains reliable as the data they collect becomes increasingly noisy.
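A minimal sketch of such an experiment might interleave a user’s real requests with randomly generated decoys. The decoy topics, URL format, and noise ratio below are illustrative assumptions:

    import random

    DECOY_TOPICS = ["gardening", "motorsports", "knitting", "astronomy",
                    "fishing", "opera", "surfing", "chess"]

    def inject_noise(real_requests, noise_ratio=1.0, seed=None):
        """Return the real requests interleaved with roughly noise_ratio
        decoy requests per real request, shuffled so order leaks nothing."""
        rng = random.Random(seed)
        n_decoys = int(len(real_requests) * noise_ratio)
        decoys = [f"https://example.com/search?q={rng.choice(DECOY_TOPICS)}"
                  for _ in range(n_decoys)]
        stream = list(real_requests) + decoys
        rng.shuffle(stream)
        return stream

    real = ["https://example.com/search?q=mortgage+rates"] * 3
    for url in inject_noise(real, noise_ratio=2.0, seed=42):
        print(url)

Sweeping noise_ratio upward is one way to probe the question the project poses: at what noise level does the tracker’s model of the user stop being reliable?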

Topic: Data Engineering via Noise Injection

Today, online data trails provide large volumes of private, real-time consumer information to companies doing business on the Internet.  Whether it is search histories, GPS locations, browsing behavior, or social media content, companies such as Google, Amazon, and Facebook are able to mine data streams for details that consumers may consider private (and may incorrectly assume are undiscoverable).  The Noise Injection project explores methods by which an individual’s data streams can be obfuscated, or rendered statistically invalid, through the injection of irrelevant data into an existing data stream.  More broadly, the project explores how the data on which machines are trained can be engineered to invalidate the training.  This serves two purposes: building a mechanism for delivering some measure of Internet privacy to the individual, and providing an understanding of how data-engineering attacks are executed so that methods can in turn be developed to defeat malicious attacks by bad actors.
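To make “rendered statistically invalid” concrete, one simple measurement, sketched below with a toy frequency-counting “tracker” and made-up topics, is to watch how the true interest’s share of observed traffic falls as the noise ratio grows:

    import random
    from collections import Counter

    rng = random.Random(0)
    true_interest = ["finance"] * 50            # the private signal
    all_topics = ["finance", "sports", "travel", "cooking", "music"]

    for noise_ratio in [0.0, 1.0, 4.0, 16.0]:
        decoys = [rng.choice(all_topics) for _ in range(int(50 * noise_ratio))]
        observed = true_interest + decoys
        topic, count = Counter(observed).most_common(1)[0]
        print(f"noise x{noise_ratio:>4}: tracker's top topic is '{topic}' "
              f"at {count / len(observed):.0%} of traffic")

At zero noise the tracker attributes 100% of traffic to the real interest; as the ratio grows, that share decays toward the uniform background rate, giving both the privacy-seeking user and the defending firm a quantitative handle on when a profile has been invalidated.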