This document summarizes UC Berkeley's AMPLab and its work on big data analytics. It describes how AMPLab develops machine learning algorithms and tools such as Spark to analyze massive, diverse datasets generated by sources such as the Internet of Things, scientific computing, and social media. The lab aims to balance the cost, time, and quality of answers drawn from big data. AMPLab researchers are building tools such as MLbase, Shark, and Spark for distributed machine learning and data analytics across clusters. The document highlights several AMPLab projects and tools that demonstrate faster analytics on large datasets than frameworks such as Hadoop and Hive.
2. It’s All Happening On-line
User Generated (Web, Social & Mobile)
Every:
Click
Ad impression
Billing event
…
Fast forward, pause, …
Friend Request
Transaction
Network message
Fault
…
Internet of Things / M2M
Scientific Computing
3. Volume: Petabytes+
Variety: Unstructured
Velocity: Real-Time
Our view: More data should mean better answers
• Must balance Cost, Time, and Answer Quality
5. UC BERKELEY
Algorithms: Machine Learning and Analytics
Massive and Diverse Data
People: CrowdSourcing & Human Computation
Machines: Cloud Computing
7. Alex Bayen (Mobile Sensing)
Ken Goldberg (Crowdsourcing)
*Michael Franklin (Databases)
Armando Fox (Systems)
*Mike Jordan (Machine Learning)
Anthony Joseph (Sec./Privacy)
Randy Katz (Systems)
Dave Patterson (Systems)
*Ion Stoica (Systems)
Scott Shenker (Networking)
Organized for Collaboration:
10. • Sequencing costs (150X) Big Data
• UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster
@TCGA: 5 PB = 20 cancers x 1000 genomes
[Chart: sequencing cost in $K per genome, log scale from $100,000.0 down to $0.1, 2001 - 2014]
• See Dave Patterson’s Talk: Thursday 3-4, BDT205
David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011
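The TCGA figure on this slide (5 PB = 20 cancers x 1000 genomes) implies an average storage size per genome. A minimal sketch of that arithmetic follows, assuming decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes). The function name bytes_per_genome is illustrative and does not come from the talk.

```python
# Back-of-the-envelope check of the TCGA figure quoted on the slide.
# Assumption (not stated in the talk): decimal units,
# 1 PB = 10**15 bytes and 1 GB = 10**9 bytes.
PB = 10**15
GB = 10**9


def bytes_per_genome(total_bytes: int, cancers: int, genomes_per_cancer: int) -> float:
    """Average storage per genome implied by a total archive size."""
    return total_bytes / (cancers * genomes_per_cancer)


# Figures taken from the slide: 5 PB, 20 cancers, 1000 genomes per cancer.
total_genomes = 20 * 1000
per_genome_gb = bytes_per_genome(5 * PB, 20, 1000) / GB

print(f"{total_genomes} genomes, about {per_genome_gb:.0f} GB each")
```

Under these assumptions the slide's numbers work out to 20,000 genomes at roughly 250 GB apiece, which gives a sense of why per-genome analysis at this scale is treated as a cluster-computing problem.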