Introduced by Pawel Gniewek, PhD student researcher at UC Berkeley. Predictive power is of tremendous interest in our daily lives – for example, when we make decisions about our careers.
To that end, I have obtained a publication record of the American Physical Society. The database contains the papers published in the Physical Review journals dating back to the middle of the twentieth century. The published records in the APS database will serve as a proxy for the trends in physics over that period.
We observe that some individuals are better than others at recognizing promising and important scientific topics. Some of those individuals are also responsible for the emergence of new scientific fields. It is therefore tempting to use a comprehensive record of published papers to investigate this observation.
The project has two major goals:
- Using natural language processing tools, we will process abstracts and titles of the papers in order to extract keywords. These keywords will serve as a proxy for scientific fields in physics – for example: granular materials, quantum dots, dark matter, etc. Based on that, we will trace the evolution of these trends over time and pinpoint where they emerge.
- Using the papers’ keywords, authors, and affiliations (obtained in step 1), we will train a neural network to predict the evolution of fields and subfields in physics over time.
If successful, this model may serve as an advisory tool for young scientists deciding on their future careers and for academic boards that distribute public resources.
Introduced by Antonio Vitti, Chief Financial Officer and Senior Technology Executive formerly with Merchant Atlas, Inc., and Dr. Steven Gustafson, Chief Scientist at Maana. The goal of this project is to predict future energy prices, on a price-per-kilowatt-hour basis, for a specific gas-fired electric power plant (preferably located in the USA, but other countries are also an option), assuming fixed competitive supply in its local market over the next 5 years. Government, satellite imagery, weather, economic, demographic, social media, and financial energy-market pricing data, along with other relevant/alternative data sources, could be used as inputs. Building on the Spring class’s work and feedback I have received from 7Puentes, we could focus on longer time scales (weekly or even monthly rather than daily), which better fits the goal of building a model that captures longer-term macro trends rather than short-term daily price changes. I would also suggest focusing on one simpler (but promising) model and finding/constraining the best data for that model, rather than deep learning models, which may require a lot of data that could be difficult to obtain given the time constraints of the class.
GitHub link for the energy pricing project: https://github.com/Jordanwyli/Energy_Prices.
Contact info for Antonio Vitti:
Introduced by Roberto Zicari, visiting Professor at UC Berkeley, Full Professor at J.W. Goethe University Frankfurt. Psychological mood disorders are often described as both social and medical phenomena. Recent studies in suicide prevention make connections between mood disorders and patterns in residential power use, logs of phone calls, and purchasing history, while older studies identify certain demographic groups as being more at risk. The problem of screening for and predicting the risk of mood disorders in the general population could have a major impact on population health. Can a combination of EMRs (containing clinical data) and other data sources improve upon current strategies for predicting the individualized risk for developing a mood disorder?
Introduced by Charalabidis Yannis, Associate Professor at University of the Aegean. A complete statistical database (more than 100 datasheets) covering the islands of the Aegean can be made available: GDP ratios, income per capita, industry-sector revenue, information on tourists, spending habits, etc. The idea is to use a suitable knowledge / data / graph visualization tool to produce an online investment-support system for different types of investors (property, agrofood, hotel, etc.).
Introduced by Charalabidis Yannis, Associate Professor at University of the Aegean. Design and prototyping of a bot for a university community: places, processes, tips, and catalogues at your fingertips. A prototype for Uni Aegean is under construction with node.js and several Greek NLP resources through a student thesis, due 2/2018.
Introduced by Gerry Pesavento, Sr. Director Yahoo! Inc.
Tensorflow deployed on a Raspberry Pi 3 to automatically sort trash, https://www.youtube.com/watch?v=5OPY9obvC7I&t=1s – an early hack, and much can be done to improve it. Trash, a $75B industry, has virtually no data – it’s an industry that can be disrupted with data and machine learning.
Introduced by Gerry Pesavento, Sr. Director Yahoo! Inc.
From a user’s photos, one can compute an accurate contextual advertising profile, including hobbies, events, age, ethnicity, gender, work/home address, and current product ownership. Currently, advertising profiles are built from web clicks and purchase intent; a more accurate profile is possible through photo analysis. This project can be done using photo repositories (Flickr, Facebook, etc.) together with Tensorflow and AI APIs (Google, Microsoft, etc.).
By Ikhlaq Sidhu
Discussion version 1.0
A common problem for data algorithms these days is to infer information with a probabilistic join. The goal is typically to guess an outcome related to an element in a data table. We may have indirect information, but we do not have direct information about the outcome.
Who from this list will vote in the next election?
| Name | Zip Code | Sex | Age | Will Vote |
| --- | --- | --- | --- | --- |
| Adam Smith | 60601 | M | 23 | |
| Sofia Vargas | 94599 | F | 32 | |
| Chuck Boyd | 235656 | M | 25 | True (confirmed by phone call) |
In this example, we have data on one person (Chuck, who says he will vote) and no information on Adam or Sofia. However, from unrelated sources, we may be able to obtain indirect information about voter turnout related to various zip codes and demographics that can help us make an estimate.
Indirect information can be gathered that is not specific to any of these three people:
- 40% of males vote, and q1 = 1/2 is the fraction of our data set that is male
- 20% of 23-year-olds across the country vote, and q2 is the fraction of our data set that is 23 years old
- 25% of people in the 60601 zip code voted last time, and q3 is the probability that a person in our data set lives in that zip code
If we had more information, we could use Bayes’ Theorem to estimate the actual probability. For example, one would normally compute P(voting | male & under 30) = P(voting & male & under 30) / P(male & under 30). But we do not have this information (i.e., P(male & under 30)), because our statistics are not drawn directly from the same population.
So what other options remain?
We could estimate:
- P(Adam votes | male) = 0.4, because he is male
- P(Adam votes | 23 years old) = 0.2, because he is 23
- P(Adam votes | 60601 zip) = 0.25, because he lives there
And note that, if these attributes are independent, Adam represents q1 x q2 x q3 of the population of our sample data.
A mixture is actually the simplest option. Let’s try a mixture, and later see whether we can tell which signal is stronger:
P(Adam votes) = 0.4 p1 + 0.2 p2 + 0.25 p3
With no other information, set p1 = p2 = p3 with p1 + p2 + p3 = 1, so the result is (1/3)(0.4 + 0.2 + 0.25) ≈ 0.283. This means we predict Adam will vote with probability 0.283.
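The equally weighted mixture above can be sketched in a few lines of Python; the rates are the illustrative numbers from the example, not real voter data:

```python
# Mixture estimate for P(Adam votes), using the three indirect rates
# from the example: P(votes | male), P(votes | 23), P(votes | zip 60601).
conditionals = [0.40, 0.20, 0.25]

# With no other information, weight each signal equally.
weights = [1 / 3, 1 / 3, 1 / 3]

p_vote = sum(w * p for w, p in zip(weights, conditionals))
print(round(p_vote, 3))  # -> 0.283
```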
If we want to weight rarer, more decisive signals more heavily, we can make the weight larger for those terms where the conditional probability has lower entropy. The logic is that there is the least information when the expected outcome is 1/2 and the most when the outcome is strong, i.e., near 0 or 1.
We do this by setting p1, p2, and p3 proportional to 1 minus the entropy, that is, 1 - H(P(votes | male)), 1 - H(P(votes | 23)), and 1 - H(P(votes | zip code)) respectively. Recall that H(s) = s log2(1/s) + (1 - s) log2(1/(1 - s)). For example, H(0.4) = 0.4 log2(1/0.4) + 0.6 log2(1/0.6) ≈ 0.971, so 1 - H(0.4) ≈ 0.029 would be the proportionally scaled coefficient for P(Adam votes | male).
This method skews probabilities away from 1/2. Note that it is still a mixture.
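The entropy-weighted mixture can be sketched as follows, again using the example's illustrative rates:

```python
import math

def entropy(s):
    """Binary entropy H(s) in bits; H(0) = H(1) = 0 by convention."""
    if s in (0.0, 1.0):
        return 0.0
    return s * math.log2(1 / s) + (1 - s) * math.log2(1 / (1 - s))

# P(votes | male), P(votes | 23), P(votes | zip 60601)
conditionals = [0.40, 0.20, 0.25]

# Weight each signal by 1 - H(p), then normalize so the weights sum to 1.
raw = [1 - entropy(p) for p in conditionals]
weights = [r / sum(raw) for r in raw]

p_vote = sum(w * p for w, p in zip(weights, conditionals))
print(round(p_vote, 3))  # -> 0.231
```

Because P(votes | 23) = 0.2 is the most decisive signal (lowest entropy), it receives the largest weight, pulling the estimate below the equally weighted 0.283.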
Things get a little simpler when we only want to know whether it is more likely that a person will or will not vote. For this, we can simply compare the products for the two options:
- Pos = P(votes | male) x P(votes | 23 years) x P(votes | 60601)
- Neg = P(not vote | male) x P(not vote | 23 years) x P(not vote | 60601)
And then our estimator would be: Estimator = Pos/Neg.
If Estimator > 1, we categorize as will vote.
If Estimator < 1, we categorize as will not vote.
Aside: if a 0-to-1 probability for voting is desired, the logistic function can be applied to the log of the ratio: P(Vote) = 1/(1 + exp(-log(Pos/Neg))), which simplifies to Pos/(Pos + Neg).
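The ratio classifier and the logistic mapping back to a probability can be sketched together; the rates are again the example's illustrative numbers:

```python
import math

# Likelihood-ratio classifier for Adam.
pos_rates = [0.40, 0.20, 0.25]          # P(votes | each attribute)
neg_rates = [1 - p for p in pos_rates]  # P(does not vote | each attribute)

pos = math.prod(pos_rates)  # 0.02
neg = math.prod(neg_rates)  # 0.36
estimator = pos / neg

prediction = "will vote" if estimator > 1 else "will not vote"

# The logistic function applied to log(Pos/Neg) gives the
# normalized probability Pos / (Pos + Neg).
p_vote = 1 / (1 + math.exp(-math.log(pos / neg)))
print(prediction, round(p_vote, 4))  # -> will not vote 0.0526
```

Note how strongly the product form pushes the estimate toward 0 compared with the mixture: multiplying several below-1/2 conditionals compounds the evidence, which is the naive-Bayes-style independence assumption at work.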
Motivating popular press references provided by Shomit Ghose: