Introduction to Big Data 
Roi Blanco
2
What is Big Data? 
• A fashionable term used by some IT vendors to remarket 
old-fashioned hardware and software 
• “The term itself is vague, but it is getting at something that is real… 
Big Data is a tagline for a process that has the potential to transform 
everything.” Jon Kleinberg 
• What I want to talk about: 
– Big Data science, cool use cases 
– Access to data, tools to process the data (the Hadoop ecosystem and friends) 
– What’s next (now!) 
3
Now, that’s Big Data 
4
Data? 
• Advances in digital sensors, communications, computation, and 
storage have created huge collections of data, capturing information 
of value to business, science, government, and society. 
• Example: search engine companies 
– transformed how people find and make use of information on a daily basis. 
• Other forms of big data are transforming the activities of companies 
and scientific researchers. 
• Machine learning on large data sets supports decision making and 
product shaping. 
5
Motivation 
• BIG DATA is an OPEN SOURCE Software Revolution 
• BIG DATA Analytics 2.0 
• What is happening right now 
• Why do we need new tools? 
• Improve decision making: 
• Measure and react in REAL-TIME 
6
Data Explosion 
Data types: text, audio, video, images, relational 
(picture from Big Data Integration) 
7
Real Time Decision Making 
Companies need to know what is happening right now, in real time, to be able to: 
• react 
• anticipate and detect new business opportunities. 
8
Wal-Mart 
9
LHC 
10
WWW 
11
Mobile 
12
Intelligence agencies 
13
Social media 
14
Big Data 3(+3) Vs 
• Volume 
• Variety 
• Velocity 
• Value 
• Variability 
• Veracity 
15
Volume vs Velocity 
16
Controversy of Big Data 
• All data is BIG now 
• Hype to sell Hadoop-based systems 
• Ethical concerns about accessibility 
• Limited access to Big Data creates new digital divides 
17
Controversy of Big Data 
• Statistical Significance: 
– When the number of variables grows, the number of spurious 
correlations also grows 
– Leinweber: S&P 500 stock index correlated with butter production 
in Bangladesh 
18
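The statistical-significance point above can be illustrated with a small sketch. It is not Leinweber's analysis: it only assumes independent Gaussian noise series (all names and parameters here are hypothetical) and shows that the strongest correlation found against a random target can only increase as more unrelated variables are tried.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def max_spurious_correlation(num_variables, num_samples=20, seed=0):
    """Largest |r| between a random target and unrelated random series.

    Assumption: every series is independent noise, so any correlation
    found is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_samples)]
    best = 0.0
    for _ in range(num_variables):
        candidate = [rng.gauss(0, 1) for _ in range(num_samples)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(k, round(max_spurious_correlation(k), 2))
```

With a fixed seed, each larger run reuses the same target and the same first candidates, so the reported maximum never decreases as variables are added.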
Need for Big Data 
• WEF defined data as an asset just like gold or currency 
• Business opportunities for companies that can analyze information 
in the right way 
• What do your customers need? 
• What will they demand in the future? 
McKinsey Global Institute (MGI) Report on Big Data, 2011 
19
Need for Big Data 
• How do you know the investment was worth it? 
• In the success cases, predictive analysis has led to income 
improvements of ~70% 
McKinsey Global Institute (MGI) Report on Big Data, 2011 
20
Crude Oil 
21
Data Analysis 
• Most businesses still run on small data! 
• Is more data always better? 
– Hardly 
– past a certain point, the return on adding more data diminishes to the point that 
you’re only wasting time gathering more 
• Do you need data? 
– Of course 
– … but the right data (+ interpretation) 
• Unbiased, in context 
• Big data is not a magic wand for inferring causality 
• Most AI problems have been tackled from a data perspective 
– Still unsolved (Google’s cat detector). 
22
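One simple setting where the diminishing return on more data can be made concrete is estimating a mean from independent samples, where the standard error shrinks as sigma / sqrt(n). This is only an illustrative assumption; the slide does not name a specific model, and the function names below are hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean, assuming n i.i.d. samples."""
    return sigma / math.sqrt(n)


def gain_from_doubling(sigma, n):
    """How much the standard error drops when the sample size doubles."""
    return standard_error(sigma, n) - standard_error(sigma, 2 * n)


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, standard_error(1.0, n), gain_from_doubling(1.0, n))
```

Under this assumption, doubling a small data set helps far more than doubling an already large one, which matches the "wasting time gathering more" remark.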
What is data science? 
23
Why is interest in Machine Learning increasing? 
• Data is everywhere 
– Increasingly captured 
– Increasingly comprehensive 
• Storage is now much cheaper, and so is processing 
– In-house Hadoop clusters 
– Cloud-based processing (Amazon EC2) 
• Data is important 
– Machine learning provides an effective development methodology 
– … when you cannot program a solution by hand 
– … but you have data available 
• Let the data figure out the program 
• Any company with large data sets will have an interest 
24
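"Let the data figure out the program" can be sketched with the smallest possible learner: instead of hand-coding a rule, a line is fitted to example pairs by ordinary least squares. The function names and data are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_predictor(xs, ys):
    """Return a function learned from data rather than written by hand."""
    slope, intercept = fit_line(xs, ys)
    return lambda x: slope * x + intercept


if __name__ == "__main__":
    predict = make_predictor([0, 1, 2, 3], [1, 3, 5, 7])
    print(predict(10))
```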
(HADOOP) 
25
Big Data Challenges 
Sort 10 TB on 1 node = 2 days 
Sort 10 TB on a 100-node cluster = 30 min 
26
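The two figures above are roughly what ideal linear scaling would predict. The arithmetic below is a back-of-the-envelope check under that assumption, not a benchmark; reading the gap to 30 minutes as coordination overhead is an interpretation, not something the slide states.

```python
def ideal_runtime_minutes(single_node_minutes, nodes):
    """Runtime under perfect linear speedup (an idealized assumption)."""
    return single_node_minutes / nodes


if __name__ == "__main__":
    two_days = 2 * 24 * 60  # 2880 minutes on a single node
    print(ideal_runtime_minutes(two_days, 100))  # ideal time on 100 nodes
```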
Big Data Challenges 
“Fat” servers imply high cost 
– use cheap commodity nodes instead 
A large number of cheap nodes implies frequent failures 
– leverage automatic fault-tolerance 
27
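Why many cheap nodes imply frequent failures can be shown with a short calculation. It assumes nodes fail independently with the same daily probability; the value 0.001 is a made-up illustration, not a measured rate.

```python
def prob_any_failure(p_node, nodes):
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1 - (1 - p_node) ** nodes


def expected_failures(p_node, nodes):
    """Expected number of failed nodes in the same period."""
    return p_node * nodes


if __name__ == "__main__":
    p = 0.001  # hypothetical per-node daily failure probability
    for n in (1, 100, 1000, 10_000):
        print(n, round(prob_any_failure(p, n), 3), expected_failures(p, n))
```

Under these assumptions a single machine almost never fails on a given day, while a 1000-node cluster sees a failure on most days, so recovery has to be automatic.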
Big Data Challenges 
We need a new data-parallel programming model for clusters of commodity 
machines 
28
MapReduce 
Published in 2004 by Google 
– MapReduce: Simplified Data Processing on Large Clusters 
Popularized by the Apache Hadoop project, developed largely at Yahoo! 
– Now used by virtually everybody else: Facebook, Twitter, 
Amazon, … 
29
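The MapReduce programming model can be sketched in a single process with the classic word-count example. This is a toy illustration of the map, shuffle and reduce steps, not Hadoop's API: the function names are hypothetical and everything runs in memory on one machine.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over an iterable of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the user still writes only the two small functions.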
Who uses Hadoop? 
30
Map Reduce Philosophy 
– hide complexity 
– make it scalable 
– make it cheap 
1. System Shall Manage and Heal Itself 
2. Performance Shall Scale Linearly 
3. Compute Should Move to Data 
4. Simple Core, Modular and Extensible 
31
Hadoop High-Level Architecture 
• Name Node: maintains the mapping of file blocks to data node slaves 
• Job Tracker: schedules jobs across task tracker slaves 
• Data Node: stores and serves blocks of data 
• Hadoop Client: contacts the Name Node for data or the Job Tracker to submit jobs 
• Task Tracker: runs tasks (work units) within a job 
Share Physical Node 
32
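The roles above, together with "Compute Should Move to Data", can be sketched as a toy model. It assumes that a Data Node and a Task Tracker share each physical node, so a task can be placed where its block already lives. The classes and methods are hypothetical stand-ins, not Hadoop's actual interfaces.

```python
class NameNode:
    """Toy name node: remembers which nodes hold each block."""

    def __init__(self):
        self.block_locations = {}  # block id -> list of node names

    def add_block(self, block_id, nodes):
        self.block_locations[block_id] = list(nodes)

    def locate(self, block_id):
        return self.block_locations[block_id]


class JobTracker:
    """Toy job tracker: prefers a task slot on a node that stores the block."""

    def __init__(self, name_node, free_slots):
        self.name_node = name_node
        self.free_slots = dict(free_slots)  # node name -> free task slots

    def schedule(self, block_id):
        """Return (node, is_data_local) for a task reading `block_id`."""
        # First choice: a node that already holds the block (no network copy).
        for node in self.name_node.locate(block_id):
            if self.free_slots.get(node, 0) > 0:
                self.free_slots[node] -= 1
                return node, True
        # Fallback: any node with a free slot; the block must be fetched remotely.
        for node, slots in self.free_slots.items():
            if slots > 0:
                self.free_slots[node] -= 1
                return node, False
        raise RuntimeError("no free task slots")


if __name__ == "__main__":
    name_node = NameNode()
    name_node.add_block("b1", ["node1", "node2"])
    tracker = JobTracker(name_node, {"node1": 1, "node2": 1, "node3": 1})
    for _ in range(3):
        print(tracker.schedule("b1"))
```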
Pig 
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int); 
-- group the rows by f1; each group holds a bag named A 
B = GROUP A BY f1; 
-- COUNT needs the bag (A), not the group key ($0) 
C = FOREACH B GENERATE group, COUNT(A); 
DUMP C; 
Pig: Similar to SQL 
33
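For readers unfamiliar with Pig, the script above appears meant to count rows per distinct value of f1, much like a SQL GROUP BY with COUNT. A plain-Python sketch of that reading follows; the function name and sample rows are hypothetical.

```python
from collections import Counter


def count_by_first_field(rows):
    """Count rows per value of the first field (f1), like GROUP BY f1."""
    return Counter(f1 for f1, _f2, _f3 in rows)


if __name__ == "__main__":
    rows = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
    for key, count in sorted(count_by_first_field(rows).items()):
        print(key, count)
```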
Pig powers 
34
HBase 
• Apache HBase™ is the Hadoop database, a distributed, scalable, 
big key-value store 
– Linear and modular scalability. 
– Strictly consistent reads and writes. 
– Automatic and configurable sharding of tables 
– Failover support 
– Interoperable with Java, Hadoop 
35
Hive 
• Apache project for querying and analyzing datasets in HDFS 
– Tools to enable easy data extract/transform/load (ETL) 
– A mechanism to impose structure on a variety of data formats 
– Access to files stored either directly in Apache HDFS™ or in other 
data storage systems such as Apache HBase™ 
– Query execution via MapReduce 
36
Apache S4 
37
Twitter Storm 
38
Apache Mahout 
39
MOVING TOWARDS (NEAR) REAL-TIME
Runaway Complexity 
41
Future 
• Process data fast enough 
– BI analytics 
• Key drivers: connected devices/services 
– Tablets, smartphones, etc. 
– Your data is “always connected to the cloud” 
– Low latency (again)/enormous amount of data 
• User data 
– Categorize data to infer knowledge about a user 
• Targeting, personalization 
• 100B events per day 
– ML: from information to knowledge 
– Behavioral targeting (user features) 
• How likely am I to be interested in fashion? For how long? 
• Map to behavioral targeting categories, segment for targeting 
42
Future (II) 
• Data processed in batches 
– There are gaps! 
– You are working with things you calculated half an hour ago 
– OK for monthly reports, not for online NRT (near-real-time) prediction 
– Think of GEO targeting 
• You can’t go fast enough with MR 
– From big long windows to small incremental iterations 
– Micro-batches updating user knowledge 
• Use cases 
– Ad campaign allocation 
• Delay between click and deducting budget from an advertiser (overspending) 
– Personalization and targeting 
• Y! Homepage 
• Use every event on the stream to detect the interest 
– How do we train machine learning models when the data is arriving non-stop? 
• You want parameters to adapt, to change slowly 
• Maybe 99% of the data is the same! Updating incrementally is better 
43
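The idea of updating user knowledge from every event, with parameters that change slowly, can be sketched as an exponentially weighted running score. This is one simple choice among many and not the method used in any system named in the deck; the class name and learning rate are hypothetical.

```python
class OnlineInterest:
    """Per-category interest score updated one event at a time."""

    def __init__(self, learning_rate=0.05):
        # A small learning rate makes the score adapt slowly to new events.
        self.learning_rate = learning_rate
        self.scores = {}

    def update(self, category, signal):
        """Move the score a small step toward the new signal (0.0 to 1.0)."""
        old = self.scores.get(category, 0.0)
        new = old + self.learning_rate * (signal - old)
        self.scores[category] = new
        return new


if __name__ == "__main__":
    model = OnlineInterest(learning_rate=0.5)
    for clicked in (1.0, 1.0, 0.0):
        print(model.update("fashion", clicked))
```

Each event costs a constant amount of work and no history is re-read, which is the contrast with recomputing the score in a long batch window.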
Beyond Hadoop 
• YARN 
– What if you just want to interact with the data in Hadoop? 
• Hive (SQL-like), HBase (NoSQL) and Pig (scripted data access) 
– Those apps are great but limited to running as a single application system with 
MapReduce at the core 
– Spark (see below) and Storm have been ported to YARN already 
• Streaming 
– SAMOA 
• RDDs 
– Spark 
• Shark (Hive on Spark) 
• Analytics Architecture 
– Visualization http://visualize.yahoo.com/mail/ 
44
Future Challenges for Big Data 
• Evaluation 
• Time evolving data 
• Distributed mining 
• Compression 
• Visualization 
• Hidden Big Data 
45
Hadoop 2.0 
• No longer “only” running MR jobs 
– MR + processing low latency and streaming 
• Iterative processing 
– Hold data in memory to re-process 
• Figure out what questions to ask of the data 
– BI users who want to explore the data really fast 
• Possible thanks to YARN + Storm (S4) + Spark + … ? 
– 350PB of data 
– >30K nodes with YARN 
– 400K jobs per day (6 jobs/sec) 
– 10M hours of compute with YARN 
46
Future key take-aways 
• Scalability 
• Performance 
• Flexibility 
• Programming paradigms 
– MAP/MAP/MAP .. OR REDUCE/REDUCE/REDUCE 
47
Big Data Myths 
• Big Data is new 
• Big Data is objective 
• Big Data doesn’t discriminate 
• Big Data makes things smart 
• Big Data is anonymous 
• You can opt-out 
48
Big Data vs Big Reality 
• Big Data is an oxymoron 
• Big Data raises bigger issues. The term suggests assembling many 
facts to create greater, previously unseen truths. It suggests the 
certainty of math. 
• It's not the data itself but what you do with it that counts. 
49
