Slides used for the keynote at the event Big Data & Data Science http://eventos.citius.usc.es/bigdata/
Some slides are borrowed from random Hadoop/big data presentations
3. What is Big Data?
• A fashionable term used by some IT vendors to re-market old-fashioned
hardware and software
• “The term itself is vague, but it is getting at something that is real…
Big Data is a tagline for a process that has the potential to transform
everything.” Jon Kleinberg
• What I want to talk about:
– Big Data science, cool use cases
– Access to data, tools to process the data (Hadoop and friends’ ecosystem)
– What’s next (now!)
5. Data?
• Advances in digital sensors, communications, computation, and
storage have created huge collections of data, capturing information
of value to business, science, government, and society.
• Example: search engine companies
– transformed how people find and make use of information on a daily basis.
• Other forms of big data are transforming the activities of companies
and scientific researchers.
• Machine learning on large data-sets for decision making, product
shaping.
6. Motivation
• BIG DATA is an OPEN SOURCE Software Revolution
• BIG DATA Analytics 2.0
• What is happening right now
• Why do we need new tools?
• Improve decision making:
• Measure and react in REAL-TIME
7. Data Explosion
[Figure: data types: text, audio, video, images, relational. Picture from Big Data Integration.]
8. Real Time Decision Making
Companies need to know what is happening right now, in real time, to be able to:
• react
• anticipate and detect new business opportunities.
17. Controversy of Big Data
• All data is BIG now
• Hype to sell Hadoop-based
systems
• Ethical concerns about
accessibility
• Limited access to Big
Data creates new digital
divides
18. Controversy of Big Data
• Statistical Significance:
– When the number of
variables grows, the
number of spurious
correlations also grows
– Leinweber: S&P 500
stock index correlated
with butter production
in Bangladesh
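The statistical-significance point can be illustrated with a small simulation. This is a minimal sketch, not taken from the talk: every variable is independent Gaussian noise, and the sample size (30), the threshold (|r| > 0.5), and the function names are assumptions chosen for illustration. Any pair that crosses the threshold is correlated by chance alone.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def count_spurious(num_vars, num_obs=30, threshold=0.5, seed=0):
    """Count pairs of pure-noise variables whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    # Every variable is independent noise, so no pair is truly related.
    data = [[rng.gauss(0, 1) for _ in range(num_obs)] for _ in range(num_vars)]
    hits = 0
    for i in range(num_vars):
        for j in range(i + 1, num_vars):
            if abs(pearson(data[i], data[j])) > threshold:
                hits += 1
    return hits


if __name__ == "__main__":
    for k in (10, 50, 200):
        pairs = k * (k - 1) // 2
        print(f"{k} variables, {pairs} pairs, {count_spurious(k)} with |r| > 0.5")
```

The number of pairs grows quadratically with the number of variables, so the count of chance correlations grows with it even though the per-pair probability stays the same.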
19. Need for Big Data
McKinsey Global Institute (MGI) Report on Big Data, 2011
• WEF defined data as an asset
just like gold or currency
• Business opportunities to
exploit by companies that can
analyze information in the
right way
• What do your customers
need?
• What will they demand in the
future?
20. Need for Big Data
• How do you know the
investment was worth it?
• In the successful
cases, predictive analysis
has led to an income
improvement of ~70%
McKinsey Global Institute (MGI) Report on Big Data, 2011
22. Data Analysis
• Most businesses are still running on small data!
• Is more data always better?
– Hardly
– past a certain point, return on adding more data diminishes to the point that
you’re only wasting time gathering more
• Do you need data?
– Of course
– … but the right data (+ interpretation)
• Unbiased, context
• Big data is not a magic wand for inferring causality
• Most AI problems have been tackled from a data perspective
– Still unsolved (Google’s cat detector).
24. Why is interest in Machine Learning increasing?
• Data is everywhere
– Increasingly captured
– Increasingly comprehensive
• Storage is now much cheaper, and so is processing
– In-house Hadoop clusters
– Cloud-based processing (Amazon EC2)
• Data is important
– Machine learning provides effective development methodology
– … when you cannot program a solution by hand
– … but you have data available
• Let the data figure out the program
• Any company with large data sets will have an interest
26. Big Data Challenges
Sort 10TB on 1 node = 2 days
100-node cluster = 30 min
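The two figures above imply a near-linear speedup. A minimal sketch of the arithmetic, using only the numbers on the slide (the function name is made up for illustration):

```python
def speedup_and_efficiency(single_node_minutes, cluster_minutes, nodes):
    """Return (speedup, parallel efficiency) for a cluster run."""
    speedup = single_node_minutes / cluster_minutes
    return speedup, speedup / nodes


if __name__ == "__main__":
    two_days_in_minutes = 2 * 24 * 60  # 2880 minutes on 1 node
    speedup, efficiency = speedup_and_efficiency(two_days_in_minutes, 30, 100)
    print(f"speedup: {speedup:.0f}x, efficiency: {efficiency:.0%}")
```

With these figures the 100-node cluster is 96 times faster than the single node, which is 96% of the ideal linear speedup.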
27. Big Data Challenges
“Fat” servers imply high cost
– use cheap commodity nodes instead
A large number of cheap nodes implies frequent failures
– leverage automatic fault-tolerance
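Why many cheap nodes mean frequent failures can be sketched with basic probability. The per-node failure rate below (one failure per 1000 node-days) is a hypothetical figure, not a number from the talk, and failures are assumed to be independent.

```python
def prob_any_failure(nodes, p_node_per_day):
    """Probability that at least one of `nodes` independent nodes fails in a day."""
    return 1.0 - (1.0 - p_node_per_day) ** nodes


if __name__ == "__main__":
    p = 1.0 / 1000.0  # hypothetical: each node fails about once per 1000 days
    for n in (1, 100, 1000):
        print(f"{n} nodes: P(at least one failure today) = {prob_any_failure(n, p):.3f}")
```

Under these assumptions a single node almost never fails on a given day, while a 1000-node cluster sees at least one failure on most days, so recovery has to be automatic rather than manual.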
28. Big Data Challenges
We need a new data-parallel programming model for clusters of commodity
machines
29. MapReduce
Published in 2004 by Google
– MapReduce: Simplified Data Processing on Large Clusters
Popularized by the Apache Hadoop project started by Yahoo!
– Now used by virtually everybody else: Facebook, Twitter,
Amazon, …
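The programming model can be sketched in a few lines. This is a single-process toy, not Hadoop's API: the function names are invented for illustration, and a real framework runs the map and reduce phases on many machines and handles the shuffle over the network.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def map_reduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["big data is big", "data is everywhere"]
    print(map_reduce(lines, word_mapper, sum_reducer))
```

The user writes only the mapper and the reducer; partitioning, grouping, and (in a real cluster) fault tolerance are the framework's job.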
31. MapReduce Philosophy
– hide complexity
– make it scalable
– make it cheap
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
32. Hadoop High-Level Architecture
• Name Node: maintains mapping of file blocks to data node slaves
• Job Tracker: schedules jobs across task tracker slaves
• Data Node: stores and serves blocks of data
• Hadoop Client: contacts Name Node for data or Job Tracker to submit jobs
• Task Tracker: runs tasks (work units) within a job
• Data Node and Task Tracker share a physical node
33. Pig
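-- Load three integer columns from 'data' (PigStorage is tab-delimited by default).
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
-- Group the rows by f1, then count the rows in each group.
B = GROUP A BY f1;
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;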
Pig: Similar to SQL
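Assuming the script is meant to count rows per value of f1 (the SQL equivalent would be a GROUP BY with COUNT), the same computation on an in-memory list can be sketched as follows. The sample rows and the function name are made up for illustration, and Pig's handling of null values is not modelled.

```python
from collections import Counter


def count_by_first_field(rows):
    """Count how many (f1, f2, f3) rows share each f1 value."""
    return Counter(f1 for f1, _f2, _f3 in rows)


if __name__ == "__main__":
    rows = [(1, 2, 3), (4, 2, 1), (1, 5, 6)]  # hypothetical contents of 'data'
    for f1, count in sorted(count_by_first_field(rows).items()):
        print(f1, count)
```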
35. HBase
• Apache HBase™ is the
Hadoop database, a
distributed, scalable, big key-value
store
– Linear and modular
scalability.
– Strictly consistent reads
and writes.
– Automatic and configurable
sharding of tables
– Failover support
– Interoperable with Java,
Hadoop
36. Hive
• Apache project for querying
and analyzing datasets in
HDFS
– Tools to enable easy data
extract/transform/load (ETL)
– A mechanism to impose
structure on a variety of
data formats
– Access to files stored either
directly in Apache HDFS™
or in other data storage
systems such as Apache
HBase™
– Query execution via
MapReduce
42. Future
• Process data fast enough
– BI analytics
• Key drivers: connected devices/services
– Tablets, smartphones, etc.
– Your data is “always connected to the cloud”
– Low latency (again)/enormous amount of data
• User data
– Categorize data to infer knowledge about a user
• Targeting, personalization
• 100B events per day
– ML: from information to knowledge
– Behavioral targeting (user features)
• How likely am I to be interested in fashion? For how long?
• Map to behavioral targeting categories, segment for targeting
43. Future (II)
• Data processed in batches
– There are gaps!
– Things you’ve calculated half an hour ago
– OK for monthly reports, not for online near-real-time (NRT) prediction
– Think of GEO targeting
• You can’t go fast enough with MR
– From big long windows to small incremental iterations
– Micro-batches updating user knowledge
• Use cases
– Ad campaign allocation
• Delay between click and deducting budget from an advertiser (overspending)
– Personalization and targeting
• Y! Homepage
• Use every event on the stream to detect the interest
– How do we train machine learning models when the data is arriving non-stop?
• You want parameters to adapt, to change slowly
• Maybe 99% of the data is the same! Updating incrementally is better
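One common way to train on a non-stop stream is online stochastic gradient descent, where the model is nudged by each event instead of being retrained on a batch. The sketch below is an illustration under stated assumptions, not the system described in the talk: the linear model, the learning rate, and the synthetic stream are all hypothetical.

```python
import random


class OnlineLinearModel:
    """Linear regressor updated one event at a time with SGD."""

    def __init__(self, num_features, learning_rate=0.05):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        # A small learning rate makes the parameters adapt slowly.
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        """One gradient step on half the squared error of a single event."""
        error = self.predict(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
        return error


def synthetic_stream(num_events, seed=0):
    """Hypothetical event stream: y = 2*x0 - 1*x1 + 0.5 plus small noise."""
    rng = random.Random(seed)
    for _ in range(num_events):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 2.0 * x[0] - 1.0 * x[1] + 0.5 + rng.gauss(0, 0.05)
        yield x, y


if __name__ == "__main__":
    model = OnlineLinearModel(num_features=2)
    for x, y in synthetic_stream(5000):
        model.update(x, y)  # each event is seen once and never stored
    print(model.weights, model.bias)
```

Each event costs a constant amount of work and no history is kept, which is what makes this style of update suitable for micro-batches or per-event processing.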
44. Beyond Hadoop
• YARN
– What if you just want to interact with the data in Hadoop?
• Hive (SQL-like), HBase (NoSQL) and Pig (scripted data access)
– Those apps are great but limited to running as a single application system with
MapReduce at the core
– Spark (see below) and Storm have been ported to YARN already
• Streaming
– SAMOA
• RDDs
– Spark
• Shark (Hive on Spark)
• Analytics Architecture
– Visualization http://visualize.yahoo.com/mail/
45. Future Challenges for Big Data
• Evaluation
• Time evolving data
• Distributed mining
• Compression
• Visualization
• Hidden Big Data
46. Hadoop 2.0
• No longer “only” running MR jobs
– MR + low-latency and streaming processing
• Iterative processing
– Hold data in memory to re-process
• Figure out what to do with the data
– BI users who want to explore the data really fast
• Possible thanks to YARN + Storm(S4) + Spark + … ?
– 350PB of data
– >30K nodes with YARN
– 400K jobs per day (6 jobs/sec)
– 10M hours of compute with YARN
48. Big Data Myths
• Big Data is new
• Big Data is objective
• Big Data doesn’t discriminate
• Big Data makes things smart
• Big Data is anonymous
• You can opt-out
49. Big Data vs Big Reality
• Big Data is an oxymoron
• Big Data raises bigger issues. The term suggests assembling many
facts to create greater, previously unseen truths. It suggests the
certainty of math.
• It's not the data itself but what you do with it that counts.