Project Ideas – Data-X at Berkeley

Deep Genomic Analytics for Disease Prediction

Contact: Johannes Bhakdi, [email protected]

About Quantgene:

We’re a stealth genomic analytics startup in Berkeley, California that leads the way in single-molecule-precision cancer and disease detection. Our data team pushes the frontier of data science in biology to provide a new level of protection based on deep genomics data, statistical learning and cutting-edge sequencing technology. Unlike other companies in our field, we don’t fall back on off-the-shelf analytics, but break new ground in computational and statistical biology every day to tackle the most complex data problems in modern bioscience.

About the Problem:

You will work with our bioinformatics team on deep genomic datasets to detect new patterns relevant to cancer and disease detection. Our approach to deep data analytics matches the complexity of our field: we don’t apply off-the-shelf algorithms but create new approaches customized to our objectives: to filter and clean up fuzzy datasets, to isolate signals, and to algorithmically detect and refine patterns. The datasets you will work on are complex, unconventional and fuzzy. You’ll participate in cutting-edge research in statistical and computational modeling and inference in biology. This is not a conventional “data scientist” job – it’s a job for true data hackers and masters at recognizing patterns when conventional machine learning fails. Our methods lead the way in deep genomic analytics and are designed to provide maximum health protection in the real world. You don’t need a background in biology, but you do need to love data and pattern recognition.

Using High-Frequency Electricity Consumption data from houses for Smart Home and Micro-Grid Optimization

Contact: [email protected]

Description: Residential and commercial buildings consume around 40% of the total energy generated in the US. In recent years, researchers have published studies pointing to inefficiencies in HVAC systems, building architecture and other areas. There have been developments to mitigate these inefficiencies, but they are still very basic and there is a lot of room for improvement.

In this project, we have 8 months of per-second data from 5 houses, covering electricity consumption by individual appliances and occupancy. We will also obtain weather data and other features to build machine learning models that address:
1) inefficient consumption patterns,
2) feedback in case of abnormal use of electricity,
3) the economic feasibility of microgrids, through detailed analysis of consumption data and forecasting of various scenarios.
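
As a rough illustration of item 2 (feedback on abnormal use), one simple baseline is a rolling z-score over the per-second readings. The sketch below is only an assumption about how such a check could look, not the project's method: the function name flag_abnormal_use, the window of 60 readings, the threshold of 3 standard deviations and the synthetic data are all hypothetical.

```python
from collections import deque
from math import sqrt


def flag_abnormal_use(readings, window=60, threshold=3.0):
    """Return indices of readings that deviate strongly from the recent past.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` readings.
    (Hypothetical helper; window and threshold are illustrative defaults.)
    """
    recent = deque(maxlen=window)
    flagged = []
    for i, watts in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            std = sqrt(sum((x - mean) ** 2 for x in recent) / window)
            # A perfectly flat window has std == 0; skip it to avoid dividing by zero.
            if std > 0 and abs(watts - mean) / std > threshold:
                flagged.append(i)
        recent.append(watts)
    return flagged


# Synthetic per-second load in watts: a steady baseline with one spike.
readings = [100.0 + (i % 5) for i in range(120)]
readings[90] = 2000.0
print(flag_abnormal_use(readings))  # [90]
```

On real appliance data the baseline would likely need to account for daily cycles and occupancy, which a plain rolling window ignores.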

What’s in it for you?
1) Work with various machine learning algorithms like ARIMA, SVM, Random Forest, BSTS, XGBoost and many others.
2) Gain experience in manipulating and preprocessing Big Data, as we have 20.7 million rows of data.
3) Understand building efficiency and sustainable building design.
4) Understand human-machine interaction by deriving behavioral patterns from just consumption data.
5) Understand how microgrids work and how they are optimized. (I have done microgrid optimization using linear programming, so we can explore that as well.)
6) Learn Agile Development practices (Optional).

Sports tech

Contacts: Panna Felsen, Mike Devlin, Heather Lockwood

Problem: The average athlete who is looking to improve their strength & conditioning often relies on at least one of the following: 1) information scraped online through YouTube videos, sites like bodybuilding.com, social media, etc.; 2) personal training; or 3) an online coach. Each of these approaches has its own drawbacks, including misinformation, high expenses, and delayed coach interactions.

Solution: Our project is focused on building a platform that will democratize and improve the access to the RIGHT information, in order to enhance performance, safety and overall progress of the average athlete. By leveraging the camera available in most people’s pockets (e.g., a smartphone) and recent research in computer vision and machine learning, we aim to bring the knowledge and technique required for safe & successful lifting to your pocket. The technology we build will capture the intuition and knowledge of personal trainers and strength coaches to help the average athlete improve their performance.

Why should you work on this?

• You’re passionate about sports, athletics, or general fitness

• Opportunity to work alongside leading AI researchers at UC Berkeley

• Opportunity to work with individuals with experience in Venture Capital, Private Equity, and Investment Banking

• Opportunity to work with experienced strength coaches

• An exciting work environment and a chance to help build technology from the ground floor!

PilotCity: Students are from Mars, Employers are from Venus

Contact: Derrick Lee, [email protected]

Imagine an engine for innovation for small-to-medium sized cities… Okay, now imagine the local high school students of these cities becoming the protagonists of civic transformation in their cities. PilotCity is creating an engine for innovation for cities by building career pathway systems for students to enter the workforce. We do this by hosting in-classroom, project-based challenges driven by employers that lead to workplace experiences such as internships and fellowships. How would you simulate the presence of an employer in the classroom to be the best project advisor to the student while saving time for the busy working professional? Building on platform technologies such as voice-activated assistants (e.g., Google Home) and/or telepresence technology (e.g., Double Robotics), prototype a “wormhole” solution for students and employers to communicate during these multi-week project-based challenges, accelerating project advisory despite the vast separation in space and time between the classroom and the workplace.

NIST Education SuperCluster: Living Blueprint Engine for Education Innovation in Smart Cities

Contact: Derrick Lee, [email protected]

In the early development of smart cities, what does it take to educate a smart and connected citizen? The NIST Education SuperCluster is a consortium of educators, industry leaders, and government officials formulating blueprints for education innovation in smart cities under the Global City Teams Challenge, a federal smart city initiative by the National Institute of Standards & Technology (NIST) under the U.S. Department of Commerce. We are seeking university-level student fellows to prototype a data analysis and algorithmic engine that we can implement to streamline the processing of case studies and create a “living blueprint” that will inform leaders across the nation and the globe of the best practices in education innovation in smart cities.

Crypto-Currency Exchanges

Contact: Anand Gomes, [email protected] & Elias Humberto, [email protected]

Introduced by the team at Paradigm. About Paradigm: We are powering the $230+ Billion OTC crypto market by building a conversational interface for institutional traders. Our mission is to increase revenue and efficiency, making traders’ lives easier, by providing AI-driven tools such as automated trading and counter-party discovery within a native chat application.

Problem: We have access to daily pricing data for over 90 crypto-currency exchanges. This is a chance for students to get creative and come up with ideas on what meaningful insight we can extract from this data.

Solution: You tell us what you want to do with it! Want to predict crypto-currency prices, or want to use our data for research? Do you want to experiment with cutting-edge modeling formulations on real data?

We have so much data for you! If 200+ million rows of data is not enough, we can get you thousands of preprocessed news articles. Still want more data? Then reach out to us. We are always looking for novel ways of using our in-house tools and information.

Why should you wanna work on this?

Your work will directly impact the daily trading of billions of dollars of crypto assets globally by making traders’ lives easier! It does not get more impactful than this!
Opportunity to be mentored by an ML Project Lead who has already worked for over 6 months on the project (avoid the mistakes and focus on the cool stuff)
Opportunity to be mentored by a Senior ML / NLP Research Scientist at Uber AI Labs
Opportunity to work directly with the CEO and CTO of a really cool (and growing!) Fintech company
You get to combine scraping, API access, time series modeling, NN implementation, NLP, and complex visualization in one month! Seldom do students get exposed to such complex, real-time and awesome implementations!

What are the crypto-traders talking about?

Contact: Anand Gomes, [email protected] & Elias Humberto, [email protected]

Problem: In the absence of a proper valuation framework, scammy, sentiment-based speculation is the primary driver of price changes in the crypto-currency market. Therefore, understanding the “zeitgeist” of the market, encapsulated in trader conversations on Twitter, Reddit and Telegram, could provide a valuable price signal. These conversations revolve around questions such as:

“Did you see how ridiculous that ICO was?”
“Did you see how much money they raised?”
“Did you see how shitty their white-paper was?”

Solution: The goal of the project is to extract a set of conversational and sentiment-based metrics (a few mentioned below) from Twitter, Reddit and Telegram scrapes. These metrics will be the first step towards creating a visualization / dashboard that is easily digestible for traders. The visualization need not be included in the scope of this project.

What is the community talking about and feeling now?
How elated or despondent are they about a particular token? How often is the token being mentioned and where? (i.e. in Pump and Dump Groups?)
Can we identify a developing trend early enough that can be a valuable price signal? What does the noise/signal ratio look like?
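
To make the questions above a little more concrete, here is a minimal sketch of two such metrics: how often a token is mentioned per source, and a naive net sentiment per token. Everything in it is an assumption for illustration only: the function token_metrics, the tiny POSITIVE and NEGATIVE word lists, the placeholder token "xyz" and the sample messages are hypothetical, and a real project would need a proper lexicon or a trained model.

```python
from collections import Counter, defaultdict
import re

# Tiny illustrative word lists (hypothetical, not a real sentiment lexicon).
POSITIVE = {"bullish", "moon", "great", "buy"}
NEGATIVE = {"scam", "dump", "ridiculous", "sell"}


def token_metrics(messages, tokens):
    """Count mentions per token and source, plus a naive net sentiment per token.

    `messages` is an iterable of (source, text) pairs; `tokens` is a set of
    lower-case ticker symbols to look for.
    """
    mentions = defaultdict(Counter)  # token -> Counter of sources
    sentiment = Counter()            # token -> positive words minus negative words
    for source, text in messages:
        words = re.findall(r"[a-z0-9]+", text.lower())
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        # Each message counts once per token it mentions.
        for token in tokens.intersection(words):
            mentions[token][source] += 1
            sentiment[token] += score
    return mentions, sentiment


messages = [
    ("twitter", "BTC looks bullish, time to buy"),
    ("reddit", "That XYZ ICO was ridiculous, total scam"),
    ("telegram", "pump XYZ now then dump"),
]
mentions, sentiment = token_metrics(messages, {"btc", "xyz"})
print(dict(mentions["xyz"]), sentiment["xyz"], sentiment["btc"])
```

Tracking these counts per time bucket (for example per hour) would be one way to look for a developing trend, though whether that yields a usable price signal is exactly what the project would need to test.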

Data Set: Scrapes of Twitter, Steemit, Medium, Telegram, Reddit and WeChat groups. Identify the target groups in Telegram.

Predict whether a news article will have a significant impact on the underlying crypto-asset mentioned in the article

Contact: Anand Gomes, [email protected] & Elias Humberto, [email protected]

Problem: The crypto-news cycle generates an overwhelming amount of news, both fake and true. There is no reliable and inexpensive method of identifying which articles will significantly affect the price of the underlying crypto-asset they refer to. Traders need an automated tool that is able to read news articles in real time and dynamically assign a score based on the probability of impact on price.

Bloomberg and Reuters, two financial data/news powerhouses, charge up to $25K/year for their news feeds but do not solve this problem; instead, they only provide simple metrics such as whether an article is trending or whether it has clocked the most reads over the last day/week/month.

Solution: A beta version of the project already exists. Teams will build on the work done by the Berkeley team from last semester under the guidance of a project lead (Elias Castro, ML Lead) and an industry mentor (Anand Gomes, CEO of Paradigm), with input from a senior researcher at Uber’s AI Labs. Note that this project will focus on the impact of the story on price and NOT on whether the news is real / fake.
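
One way to frame “significant impact” for a supervised model is to label each article by how far the price moved shortly after publication. The sketch below is a hedged illustration of that idea, not a description of the existing beta: the function label_impact, the one-hour horizon, the 2% threshold and the sample prices are all assumptions.

```python
from bisect import bisect_left, bisect_right


def label_impact(article_times, price_times, prices, horizon=3600, threshold=0.02):
    """Label each article 1 if the price moved by more than `threshold`
    (as a fraction) within `horizon` seconds after publication, else 0.

    `price_times` must be sorted ascending and aligned with `prices`.
    Articles with fewer than two price points in the window get None.
    """
    labels = []
    for t in article_times:
        start = bisect_left(price_times, t)                # first price at or after t
        end = bisect_right(price_times, t + horizon) - 1   # last price within the horizon
        if start >= len(price_times) or end <= start:
            labels.append(None)
            continue
        change = prices[end] / prices[start] - 1.0
        labels.append(1 if abs(change) > threshold else 0)
    return labels


# Hypothetical prices sampled every 30 minutes (timestamps in seconds).
price_times = [0, 1800, 3600, 5400, 7200]
prices = [100.0, 101.0, 105.0, 105.0, 104.5]
print(label_impact([0, 3600, 7000], price_times, prices))  # [1, 0, None]
```

Labels built this way could serve as training targets for a classifier over article text; the choice of horizon and threshold strongly shapes what counts as “significant” and would need to be tuned per asset.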

Datasets

1. NewsAPI: A JSON API for live news and blog headlines that aggregates news articles from multiple news sources

2. CoinAPI: Subscription to CoinAPI’s cryptocurrency market-data

Build an AI-powered Renewable Energy Resources Guide

Contact: James Hodson, [email protected]

Short Description: SDG07 aims to ensure access to affordable, reliable, sustainable and modern energy for all.

Detailed Description: China, Europe and the United States accounted for nearly 75% of the global investment in renewable power and fuels. However, when measured per unit of gross domestic product, the Marshall Islands, Rwanda, the Solomon Islands, Guinea-Bissau and many other developing countries are investing as much as or more in renewables than developed and emerging economies. These positive developments need to be scaled up for a global energy transition. The objective of this project is to model suitable places to build sustainable, environmentally friendly power plants, which would replace non-renewable energy production in the world’s biggest polluters. The model should show which renewable energy resource should be utilized in a specific area with consideration to minimal environmental impact in the immediate area and the highest energy efficiency. The project should simulate changes in the environment after using the renewable resources. Projects will include Machine Learning models, data management platforms, and visualization engines to allow communities to interact with the data and assist in decision making. Successful projects will have the opportunity to present their products in front of community leaders, researchers, and policy-makers at the AI for Good Foundation Global Conference in 2019!

Possible Data Sets:

NASA Earth Data: https://search.earthdata.nasa.gov/search

USGS: https://earthexplorer.usgs.gov/

ESA Copernicus Open Data Hub: https://scihub.copernicus.eu/dhus/#/home

NOAA: Here.

ESA Earth Online: https://earth.esa.int/web/guest/eoli

INPE: http://www.dgi.inpe.br/CDSR/

Indian Geo – Platform of ISRO: http://bhuvan.nrsc.gov.in/data/download/index.php

ALOS: http://www.eorc.jaxa.jp/ALOS/en/aw3d30/

The Aerial Photo Ordering System: https://www.ngs.noaa.gov/web/APOS2/APOS.shtml

VITO: http://www.vito-eodata.be/PDF/portal/Application.html#Home

Global Land Cover Facility: http://landcover.org/

Digital Globe: http://www.digitalglobe.com/

UNAVCO: https://www.unavco.org/data/data.html

Energy.Gov: https://www.energy.gov/data/downloads/open-data-catalogue

Data.Gov: https://catalog.data.gov/dataset?tags=renewable-energy

Open EI: Here.

World Energy Council: https://www.worldenergy.org/data/

Our World in Data: https://ourworldindata.org/renewables

IRENA: http://resourceirena.irena.org/gateway/dashboard/

More data…

Build an AI-powered Education

Contact: James Hodson, [email protected]

Short Description: SDG04 aims to ensure inclusive and equitable quality education and promote lifelong learning opportunities for all.

Detailed Description: Human capital accounts for 62 percent of total wealth. That’s four times the value of produced capital and 15 times the value of natural capital. Globally, we (governments, the private sector, families and individuals) spend more than $5.6 trillion a year on education and training. Countries spend 5 percent of GDP, or 20 percent of their national budgets, on education. Education employs about five percent of the labor force. The global average cost per child for a full course of schooling, including pre-primary education, is 5,806.6 USD. The objective of this project is to model the economic loss that developing countries incur due to low levels of secondary education in their populations. Projects will include Machine Learning models, data management platforms, and visualization engines to allow communities to interact with the data and assist in decision making. Successful projects will have the opportunity to present their products in front of community leaders, researchers, and policy-makers at the AI for Good Foundation Global Conference in 2019!

Possible Data Sets:

Stat Planet: Here.

World Bank: https://data.worldbank.org/topic/education?view=chart

data.world: https://data.world/datasets/literacy

UNICEF Data: https://data.unicef.org/topic/education/literacy/

NCES: https://nces.ed.gov/datatools/

Global Partnership for Education: https://www.globalpartnership.org/funding/education-costs-per-child

Our World in Data: https://ourworldindata.org/primary-and-secondary-education

More data…