# Fuzzy Joins – A Modeling Discussion for Probabilistic Joins in Data Tables

By Ikhlaq Sidhu

Discussion version 1.0

A common problem with data algorithms these days is to infer information with a probabilistic join.  The goal is typically to guess an outcome related to an element in a data table.  We may have indirect information, but we do not have direct information about the outcome.

For example:

Who from this list will vote in the next election?

Table:

| Name | Zip Code | Sex | Age | Will Vote |
|---|---|---|---|---|
| Adam | 60601 | M | 23 | |
| Sofia Vargas | 94599 | F | 32 | |
| Chuck Boyd | 235656 | M | 25 | True (confirmed by phone call) |

In this example, we have outcome data on one person (Chuck, who says he will vote) and no information on whether Adam or Sofia will vote.  However, from unrelated sources, we may be able to obtain indirect information about voter turnout by zip code and demographics that can help us make an estimate.

Indirect information can be gathered that is not specific to any of these three people:

• 40% of males vote; q1 = 1/2 is the fraction of our data set that is male
• 20% of 23 year olds across the country vote; q2 is the fraction of our data set that is 23 years old
• 25% of people in the 60601 zip code voted last time; q3 is the probability that a person in our data set lives in this zip code

If we had more information, we could use Bayes’ theorem to estimate the actual probability. For example, one would normally find Prob(voting | male & under 30) = Prob(voting & male & under 30) / Prob(male & under 30).  But we don’t have these quantities (e.g. Prob(male & under 30)) because our statistics do not come from the same population.

So what other options remain?

We could estimate:

• Prob(Adam Votes | male) = 0.4, because he is male
• Prob(Adam Votes | 23 years old) = 0.2, because he is 23
• Prob(Adam Votes | 60601 zip) = 0.25, because he lives there

And note, if the three attributes are independent in our data set, people like Adam represent q1 x q2 x q3 of the population of our sample data.

A Mixture is actually the simplest option:
Let's try a mixture, and then later see whether we can tell which signal is stronger:

With no information about which signal is stronger, we weight them equally: p1 = p2 = p3 = 1/3, so that p1 + p2 + p3 = 1, and the result = 1/3 x (0.4 + 0.2 + 0.25) = 0.283...

This means we predict Adam will vote with probability 0.283.
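The equal-weight mixture can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the names `signals` and `mixture_estimate` are hypothetical and not part of the original discussion.

```python
# Conditional vote probabilities taken from the indirect sources above.
signals = {
    "male": 0.40,
    "age_23": 0.20,
    "zip_60601": 0.25,
}

def mixture_estimate(probs, weights=None):
    """Weighted average of the probabilities; weights default to equal."""
    values = list(probs.values())
    if weights is None:
        weights = [1.0 / len(values)] * len(values)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, values))

estimate = mixture_estimate(signals)
print(round(estimate, 3))  # 0.283
```

Passing an explicit `weights` list gives a non-uniform mixture, which is where the question of which signal is stronger comes in.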

Aside:

If we want to give more value to rare events, we can make the weight p_i larger for those terms whose outcome has lower entropy.   The logic in this case is that there is the least information when the expected outcome is 1/2 and the most when the outcome is strong, like 0 or 1.

We do this by setting p1, p2, and p3 to be proportional to 1 minus the entropy, that is, to 1-H(P(votes|male)), 1-H(P(votes|23)), and 1-H(P(votes|zip code)) respectively, and then scaling them so that p1+p2+p3 = 1.  Recall that H(s) = s log2(1/s) + (1-s) log2(1/(1-s)).

Example: H(.4) = .4 log2(1/.4) + (1-.4) log2(1/(1-.4)) ≈ 0.971, and 1-H(.4) ≈ 0.029 would be the proportionately scaled coefficient for Prob (Adam Votes | male).

This method pulls the estimate toward the signals that are farthest from ½.  Note, the method is still a mixture.
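A minimal sketch of the entropy-based weighting, assuming the weights are simply 1 - H(p) rescaled to sum to 1. The function names are hypothetical, and the fallback to equal weights when every signal is exactly 1/2 is an added assumption, not something the discussion specifies.

```python
import math

def binary_entropy(s):
    """H(s) in bits; taken to be 0 at s = 0 or s = 1."""
    if s in (0.0, 1.0):
        return 0.0
    return s * math.log2(1 / s) + (1 - s) * math.log2(1 / (1 - s))

def entropy_weights(probs):
    """Weights proportional to 1 - H(p), normalized to sum to 1."""
    raw = [1.0 - binary_entropy(p) for p in probs]
    total = sum(raw)
    if total == 0:
        # Assumption: if every signal is exactly 1/2, use equal weights.
        return [1.0 / len(probs)] * len(probs)
    return [r / total for r in raw]

probs = [0.40, 0.20, 0.25]  # male, 23 years old, zip 60601
weights = entropy_weights(probs)
weighted_estimate = sum(w * p for w, p in zip(weights, probs))
print([round(w, 3) for w in weights], round(weighted_estimate, 3))
```

With these three inputs the 20% signal receives the largest weight and the 40% signal the smallest, so the weighted estimate comes out at roughly 0.23, lower than the equal-weight mixture.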

Classification Option:

Things get a little simpler when we only want to know whether it is more likely that they will or will not vote.  For this, we can simply compare the products for the two outcomes:

Define:

• Pos = Prob (Votes | male) x Prob (Votes | 23 years) x Prob (Votes | 60601)
• Neg = Prob (Not Vote | male) x Prob (Not Vote | 23 years) x Prob(Not Vote | 60601)

And then our estimator would be:  Estimator = Pos/Neg.

If Estimator > 1, we categorize as will vote

If Estimator < 1, we categorize as will not vote.
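The classification rule can be sketched as follows. The function name `classify` is hypothetical, each probability is assumed to lie strictly between 0 and 1 so that Neg is never zero, and the "undecided" label for an estimator of exactly 1 is an added assumption since the rule above does not cover that case.

```python
import math

def classify(probs):
    """Return (estimator, label) from the Pos/Neg ratio."""
    if any(p <= 0.0 or p >= 1.0 for p in probs):
        raise ValueError("each probability must be strictly between 0 and 1")
    pos = math.prod(probs)                 # product of Prob(Votes | signal)
    neg = math.prod(1 - p for p in probs)  # product of Prob(Not Vote | signal)
    estimator = pos / neg
    if estimator > 1:
        label = "will vote"
    elif estimator < 1:
        label = "will not vote"
    else:
        label = "undecided"  # assumption: ties are left unclassified
    return estimator, label

estimator, label = classify([0.40, 0.20, 0.25])
print(round(estimator, 4), label)
```

For Adam, Pos = 0.4 x 0.2 x 0.25 = 0.02 and Neg = 0.6 x 0.8 x 0.75 = 0.36, so the estimator is well below 1 and he is categorized as will not vote.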

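Aside: A logistic function can be used to map the estimator to a 0 to 1 probability of voting if desired. Applied to the log of the ratio, so that Estimator = 1 corresponds to a probability of ½, P(Vote) can be approximated with 1/(1+Exp(-ln(Pos/Neg))), which simplifies to Pos/(Pos+Neg).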
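A short sketch of this mapping, under the assumption that the logistic function is applied to the natural log of Pos/Neg rather than to the raw ratio, so that a ratio of 1 maps to a probability of 0.5. The function name `odds_to_probability` is hypothetical.

```python
import math

def odds_to_probability(pos, neg):
    """Map the Pos/Neg ratio to a value in (0, 1) via the logistic function."""
    log_odds = math.log(pos / neg)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Adam's products from the classification example: Pos = 0.02, Neg = 0.36.
p_vote = odds_to_probability(0.02, 0.36)
print(round(p_vote, 4))
```

Under this reading the result equals Pos / (Pos + Neg), which is a normalized score rather than a calibrated probability.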

Motivating popular press references provided by Shomit Ghose:

http://newsfeed.time.com/2012/06/19/red-and-blue-brands-how-democrats-and-republicans-shop/