AI Music Software development student cooperation opportunity

I’m looking for someone at UC Berkeley who is familiar with AI technology to cooperate with me, a student studying composition at Berklee College of Music, to develop the AI Music Software. I hope that in the near future music can truly belong to everyone, as we all should have the chance to create unique songs to express ourselves and drive human civilization.
The basic idea is to let AI aid in writing music by generating part of the sound in the style of a particular composer the user selects. From a musician’s perspective, developing this software requires extensive data collection and analysis. We need to analyze the compositional techniques used by different composers, such as structure, orchestration, and melodic character. For the sound itself, the sound frequencies each composer uses must be collected and analyzed.
For example, Dvorak frequently used brass as a melodic instrument when he wrote symphonies. The sound frequency for brass instruments is around 400-800 Hz. So, if a user chooses Dvorak as the musician in the AI Music Software, the system will automatically generate a group of data, among which most of the sound frequencies will be around 400-800 Hz. As a result, the software will choose brass as a melodic instrument for this unique song.
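The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the sample frequencies, and the two placeholder bands are hypothetical, and only the 400-800 Hz brass band comes from the example above.

```python
# Frequency band per instrument family, in Hz.
# Only the brass band (400-800 Hz) comes from the text above; the other two
# entries are hypothetical placeholders, not measured instrument data.
BANDS = {
    "brass": (400.0, 800.0),
    "placeholder_low": (0.0, 400.0),
    "placeholder_high": (800.0, 20000.0),
}


def choose_instrument(frequencies, bands=BANDS):
    """Return the family whose band contains the most of the given frequencies."""
    counts = {name: 0 for name in bands}
    for f in frequencies:
        for name, (low, high) in bands.items():
            if low <= f < high:
                counts[name] += 1
                break
    return max(counts, key=counts.get)


# Hypothetical frequencies generated for a "Dvorak" style request.
dvorak_like = [420.0, 510.0, 640.0, 700.0, 780.0, 150.0, 1200.0]
print(choose_instrument(dvorak_like))
```

Because most of the sample frequencies fall between 400 and 800 Hz, the sketch picks brass. A real system would need measured bands for every instrument family, which is the data collection work described above.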
If you are interested in cooperating with me on this project, please contact Bodee Borjigin. Thank you!

Concept: Predicting future outcomes based on historical records

Introduced by Pawel Gniewek, PhD student researcher at UC Berkeley. Predictive power is of tremendous interest in our daily life, for example when we try to make a decision about our career.

To that end, I have obtained a publication record of the American Physical Society. The database contains the papers published in Physical Review journals dating back to the middle of the twentieth century. The published records in the APS database will serve as a proxy for the trends in physics over the last 100 years.

We observe that some individuals are better than others at recognizing promising and important scientific topics. Some of those individuals are also responsible for the emergence of new scientific fields. It is therefore tempting to use a comprehensive record of published papers to investigate this observation.

The project has two major goals:

  1. Using natural language processing tools, we will process the abstracts and titles of the papers in order to extract keywords. Those keywords are meant to serve as a proxy for the scientific fields in physics, for example granular materials, quantum dots, and dark matter. Based on that, we will trace how these trends evolve over time and pinpoint where they emerge.
  2. Using the papers’ keywords, authors, and their affiliations (obtained in step 1), we will train a neural network in order to predict the evolution of the fields/subfields in physics over time.
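As a rough illustration of the time-evolution part of the first goal, the sketch below counts how often each keyword appears per year and reports the first year it shows up. It assumes the keywords have already been extracted; the records, the keyword list, and the function names are hypothetical and are not taken from the APS data.

```python
from collections import defaultdict

# Hypothetical records: (publication year, title or abstract text).
RECORDS = [
    (1995, "Transport in quantum dots at low temperature"),
    (1996, "Flow of granular materials in a rotating drum"),
    (2001, "Quantum dots as single photon sources"),
    (2001, "Spin relaxation in quantum dots"),
    (2005, "Direct detection limits for dark matter"),
]

# Candidate keywords are given here; extracting them is the NLP step itself.
KEYWORDS = ["quantum dots", "granular materials", "dark matter"]


def keyword_counts_by_year(records, keywords):
    """Count, per keyword and year, how many records mention the keyword."""
    counts = {kw: defaultdict(int) for kw in keywords}
    for year, text in records:
        lowered = text.lower()
        for kw in keywords:
            if kw in lowered:
                counts[kw][year] += 1
    return counts


def first_year(counts, keyword, min_papers=1):
    """Earliest year in which the keyword appears in at least min_papers records."""
    years = [y for y, n in counts[keyword].items() if n >= min_papers]
    return min(years) if years else None


counts = keyword_counts_by_year(RECORDS, KEYWORDS)
for kw in KEYWORDS:
    print(kw, dict(sorted(counts[kw].items())), "first seen:", first_year(counts, kw))
```

Raising min_papers is one simple way to separate an isolated early mention from the point at which a topic starts to attract sustained attention.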

If successful, this model may serve as an advisory tool for young scientists deciding on their future career and for academic boards that are distributing public resources.

Concept: US Power Plant project

Introduced by Antonio Vitti, Chief Financial Officer and Senior Technology Executive formerly with Merchant Atlas, Inc. and Dr. Steven Gustafson, Chief Scientist at Maana. The goal for this project would be to predict future energy prices, in price per kilowatt-hour, for a specific gas-fired electric power plant, preferably located in the USA (other countries are also possible), assuming fixed competitive supply in its local market over the next 5 years. Government, satellite imagery, weather, economic, demographic, social media, and financial energy market pricing data, as well as other relevant or alternative data sources, could be used as inputs. Building on the Spring class’s work and feedback I’ve received from 7Puentes, we could focus on longer time scales (weekly or even monthly periods rather than daily), which would better fit the goal of building a model of longer-term macro trends rather than short-term daily price changes. I would also suggest focusing on one simpler (but promising) model and finding and constraining the best data for that model, rather than on deep learning models, which may require a lot of data that could be difficult to obtain given the time constraints of the class.
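To make the "longer time scale, simpler model" suggestion concrete, here is a minimal sketch that averages daily prices into monthly means and fits a straight-line trend by least squares. The prices are invented for illustration, the function names are hypothetical, and a linear trend is only one possible choice of simple model, not the approach used by the Spring class.

```python
from collections import defaultdict
from statistics import mean


def monthly_means(daily_prices):
    """Average (YYYY-MM-DD, price) pairs into one mean price per YYYY-MM month."""
    buckets = defaultdict(list)
    for date, price in daily_prices:
        buckets[date[:7]].append(price)
    return [(month, mean(values)) for month, values in sorted(buckets.items())]


def fit_linear_trend(values):
    """Ordinary least squares fit of values against their index 0, 1, 2, ..."""
    n = len(values)  # needs at least two points, otherwise sxx is zero
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    sxy = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical daily prices in dollars per kWh, not real market data.
daily = [
    ("2017-01-05", 0.10), ("2017-01-20", 0.12),
    ("2017-02-03", 0.12), ("2017-02-18", 0.14),
    ("2017-03-07", 0.14), ("2017-03-22", 0.16),
]
months = monthly_means(daily)
slope, intercept = fit_linear_trend([price for _, price in months])
next_month_estimate = intercept + slope * len(months)
print(months, round(next_month_estimate, 3))
```

Aggregating first removes most of the day-to-day noise, so even a model this small responds mainly to the macro trend the project is meant to capture.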

GitHub link for the energy pricing project:

Contact info for Antonio Vitti:


Phone: 415-710-9111

Concept: Multi-disciplinary data analysis of common psychological conditions

Introduced by Roberto Zicari, visiting Professor at UC Berkeley, Full Professor at J.W. Goethe University Frankfurt. Psychological mood disorders are often described as both social and medical phenomena. Recent studies in suicide prevention make connections between mood disorders and patterns in residential power use, logs of phone calls, and purchasing history, while older studies identify certain demographic groups as being more at risk. The problem of screening for and predicting the risk of mood disorders in the general population could have a major impact on population health. Can a combination of EMRs (containing clinical data) and other data sources improve upon current strategies for predicting the individualized risk for developing a mood disorder?

Concept: Visualizing investment opportunities in touristic regions (Open Data for Greece 1.0)

Introduced by Charalabidis Yannis, Associate Professor at University of the Aegean. A complete statistical database (more than 100 datasheets) can be made available for the islands of the Aegean, covering GDP ratios, income per capita, industry sector revenue, information on tourists, spending habits, etc. The idea is to use a suitable knowledge, data, or graph visualization tool to produce an online investment support system for different types of investors (property, agrofood, hotel, etc.).

Concept: The University Bot

Introduced by Charalabidis Yannis, Associate Professor at University of the Aegean. The project covers the design and prototyping of a bot for a university community: places, processes, tips, and catalogues at your fingertips. A prototype is under construction with Node.js and several Greek NLP resources for Uni Aegean through a student thesis, due 2/2018.

Concept: Insights from Personal Photos

Introduced by Gerry Pesavento, Sr. Director Yahoo! Inc.

From a user’s photos, one can compute an accurate contextual advertising profile including hobbies, events, age, ethnicity, gender, work/home address, and current product ownership. Currently, advertising profiles are built from web clicks and purchase intent; a more accurate profile is possible through photo analysis. This project can be done using photo repositories (Flickr, Facebook, etc.), TensorFlow, and AI APIs (Google, Microsoft, etc.).

Fuzzy Joins – A Modeling Discussion for Probabilistic Joins in Data Tables

By Ikhlaq Sidhu

Discussion version 1.0

A common problem for data algorithms these days is to infer information with a probabilistic join. The goal is typically to guess an outcome related to an element in a data table. We may have indirect information, but we do not have direct information about the outcome.

For example:

Who from this list will vote in the next election?


Name            Zip Code    Sex    Age    Will Vote
Adam Smith      60601       M      23
Sofia Vargas    94599       F      32
Chuck Boyd      235656      M      25     True (confirmed by phone call)

In this example, we have data on one user (Chuck, who says he will vote) and no information on Adam or Sofia. However, from unrelated sources, we may be able to obtain indirect information about voter turnout related to various zip codes and demographics that can help us make an estimate.

Indirect information can be gathered that is not specific to any of these 3 people:

  • 40% of males vote; q1 = 1/2 is the fraction of our data set that is male
  • 20% of 23-year-olds across the country vote; q2 is the fraction of 23-year-olds in our data set
  • 25% of people in the 60601 zip code voted last time; q3 is the probability that a person in our data set lives in this zip code

If we had more information, we could use Bayes’ theorem to estimate the actual probability. For example, one would normally find P(voting | male & under 30) = P(voting & male & under 30) / P(male & under 30). But we don’t have this information (i.e., P(male & under 30)) because our statistics do not come directly from the same population.

So what other options remain?

We could estimate:

  • Prob (Adam Votes | male) = 0.4, because he is male
  • Prob (Adam Votes | 23 years old) = .2, because he is 23
  • Prob(Adam Votes | 60601 zip) = .25, because he lives there

Note that, if the three attributes are independent, he represents a fraction q1 x q2 x q3 of the population in our sample data.

A mixture is actually the simplest option:
Let’s try a mixture and then later see if we can tell which signal is stronger:

P(Adam votes) = 0.4 p1 + 0.2 p2 + 0.25 p3, where p1, p2, and p3 are the mixture weights.

With no further information, p1 = p2 = p3 and p1 + p2 + p3 = 1, so the result is 1/3 x (0.4 + 0.2 + 0.25) = 0.283...

This means we predict Adam will vote with prob = 0.283.
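The equal-weight mixture is small enough to check in code. The following is a minimal sketch; the function name and the weight validation are illustrative additions rather than part of the discussion.

```python
def mixture_estimate(signals, weights=None):
    """Weighted average of conditional probabilities; equal weights by default."""
    if weights is None:
        weights = [1.0 / len(signals)] * len(signals)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, signals))


# P(votes | male), P(votes | 23 years old), P(votes | 60601 zip) from the text.
adam_signals = [0.4, 0.2, 0.25]
print(round(mixture_estimate(adam_signals), 3))  # 0.283
```

Passing an explicit weights list lets the same function express any other mixture, as long as the weights sum to 1.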


If we want to value rare events more, we can make the weight larger for those terms where the result has lower entropy. The logic in this case is that there is the least information when the expected outcome is 1/2 and the most when the outcome is strong, like 0 or 1.

We do this by setting p1, p2, and p3 proportional to 1 minus the entropy, that is, to 1 - H(P(votes | male)), 1 - H(P(votes | 23)), and 1 - H(P(votes | zip code)) respectively. Recall that H(s) = s log2(1/s) + (1-s) log2(1/(1-s)).

Example: H(0.4) = 0.4 log2(1/0.4) + (1 - 0.4) log2(1/(1 - 0.4)) ≈ 0.971, and 1 - H(0.4) ≈ 0.029 would be the coefficient, before normalization, for Prob (Adam Votes | male).

This method will skew probabilities away from ½.  Note, the method is still a mixture.
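The entropy weighting can be sketched as follows, using Adam's three conditional probabilities from above. The function names are illustrative, and the fallback to equal weights when every signal is exactly 1/2 is an assumption added here, since the discussion does not cover that case.

```python
import math


def binary_entropy(s):
    """H(s) = s log2(1/s) + (1-s) log2(1/(1-s)), with H(0) = H(1) = 0."""
    if s in (0.0, 1.0):
        return 0.0
    return s * math.log2(1.0 / s) + (1.0 - s) * math.log2(1.0 / (1.0 - s))


def entropy_weights(signals):
    """Weights proportional to 1 - H(signal), normalized to sum to 1."""
    raw = [1.0 - binary_entropy(s) for s in signals]
    total = sum(raw)
    if total == 0.0:
        # Every signal is exactly 1/2, so fall back to equal weights.
        return [1.0 / len(signals)] * len(signals)
    return [r / total for r in raw]


adam_signals = [0.4, 0.2, 0.25]
weights = entropy_weights(adam_signals)
estimate = sum(w * s for w, s in zip(weights, adam_signals))
print([round(w, 3) for w in weights], round(estimate, 3))
```

For Adam, the 0.4 signal is closest to 1/2 and so receives the smallest weight, which pulls the estimate down to roughly 0.23, compared with 0.283 for equal weights.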

Classification Option:

Things get a little simpler when we only want to know whether it is more likely that they will or will not vote. For this, we can simply compare the products for the two outcomes:


  • Pos = Prob (Votes | male) x Prob (Votes | 23 years) x Prob (Votes | 60601)
  • Neg = Prob (Not Vote | male) x Prob (Not Vote | 23 years) x Prob(Not Vote | 60601)

And then our estimator would be:  Estimator = Pos/Neg.

If Estimator > 1, we categorize as will vote

If Estimator < 1, we categorize as will not vote.
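Applied to Adam, the classification rule can be sketched as below. The function name is illustrative, and labeling the tie case Estimator = 1 as "will not vote" is an arbitrary choice made here because the rule above leaves it undefined. The sketch also assumes no conditional probability equals 1, since Neg would then be zero.

```python
import math


def classify(vote_probs):
    """Compare the product of 'votes' terms with the product of 'not votes' terms."""
    pos = math.prod(vote_probs)
    neg = math.prod(1.0 - p for p in vote_probs)
    estimator = pos / neg
    label = "will vote" if estimator > 1 else "will not vote"
    return estimator, label


# P(votes | male), P(votes | 23 years), P(votes | 60601) from the text.
estimator, label = classify([0.4, 0.2, 0.25])
print(round(estimator, 4), label)
```

Here Pos = 0.4 x 0.2 x 0.25 = 0.02 and Neg = 0.6 x 0.8 x 0.75 = 0.36, so the estimator is well below 1 and Adam is categorized as will not vote.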

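Aside: A logistic function can be used to map the estimator to a probability between 0 and 1 for voting if desired. Applied to the log of the ratio, it gives P(Vote) ≈ 1/(1 + Exp(-ln(Pos/Neg))) = Pos/(Pos + Neg), which is 1/2 exactly when Estimator = 1.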

Motivating popular press references provided by Shomit Ghose:



Concept: Faculty Research Matching with NLP and ML

Proposed by Luigi Rodrigues, Haas MBA student, start-up founder, and data-x advisor:
This project supports a startup idea, which is to create an effective matching algorithm to find the best academic professors/researchers in a specific domain.
Example: Given a specific knowledge area (e.g., “information asymmetry in financial markets” or “building a social venture in sub-Saharan Africa”), I would like to be able to suggest the 10 best professors to teach or consult on the subject.
The main objective is to help organizations find the specific knowledge they need and potentially hire professors for consulting, teaching, or speaking work. My hypothesis is that this can be built from the information professors already have on their personal pages, publications, Google Scholar profiles, citations, and so on.
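One simple baseline for such matching, offered only as a sketch and not as the proposed startup's algorithm, is to rank professors by TF-IDF cosine similarity between the query and the text of each profile. The profiles, names, and the smoothed idf formula below are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document, plus the idf table."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for counts in tokenized for term in counts)
    # Smoothed idf so that terms present in every document keep a positive weight.
    idf = {term: math.log((1 + n) / (1 + freq)) + 1.0 for term, freq in df.items()}
    vectors = [{t: c * idf[t] for t, c in counts.items()} for counts in tokenized]
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_professors(query, profiles, top_k=10):
    """Rank profile names by cosine similarity between query and profile text."""
    names = list(profiles)
    vectors, idf = tfidf_vectors([profiles[name] for name in names])
    # Query terms that appear in no profile are ignored.
    q = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = sorted(((cosine(q, v), name) for name, v in zip(names, vectors)), reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:top_k]]


# Hypothetical profiles; real input would come from personal pages and publication lists.
PROFILES = {
    "Professor A": "information asymmetry in financial markets and market microstructure",
    "Professor B": "social venture building and entrepreneurship in sub-Saharan Africa",
    "Professor C": "granular materials and soft matter physics",
}
ranking = rank_professors("information asymmetry in financial markets", PROFILES, top_k=2)
print(ranking)
```

Word overlap of this kind only measures topical similarity. Deciding who is "best" would also need signals such as citations or teaching record, which is where the ML part of the project comes in.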