BIG DATA
MODERN TECHNOLOGIES
György Balogh
LogDrill Ltd.
SECWorld – 7 May 2014
AGENDA
• What is Big Data?
• Why do we have to talk about it?
• Paradigm shift in information management
• Technology and efficiency
WHAT IS BIG DATA?
• Data volume that cannot be handled by traditional
solutions (e.g., relational databases)
• More than 100 million data rows, typically
multiple billions
GLOBAL RATE OF DATA
PRODUCTION (PER SECOND)
• 30 TB/sec (22000 films)
• Digital media
• 2 hours of YouTube video
• Communication
• 3000 business emails
• 300000 SMS
• Web
• Half a million page views
• Logs
• Billions
BIG DATA MARKET
HYPE OR REALITY?
WHY NOW?
● Long term trends
○ Size of stored data has doubled every 40 months since
the 1980s
○ Moore’s law: number of transistors on integrated
circuits doubles every 18 months
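To put the two doubling periods above on the same yearly scale: a quantity that doubles every m months grows by a factor of 2^(12/m) per year. A minimal Python sketch of that arithmetic (the function name is illustrative, not from the talk):

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Yearly growth factor of a quantity that doubles every `doubling_months` months."""
    return 2 ** (12 / doubling_months)

stored_data = annual_growth_factor(40)  # stored data: doubles every 40 months
transistors = annual_growth_factor(18)  # Moore's law as quoted: every 18 months

print(f"stored data: +{(stored_data - 1) * 100:.0f}% per year")
print(f"transistors: +{(transistors - 1) * 100:.0f}% per year")
```

With the figures quoted on the slide, this works out to roughly 23% per year for stored data and roughly 59% per year for transistor counts.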
DIFFERENT EXPONENTIAL
TRENDS
HARD DRIVES IN 1991 AND 2012
● 1991
● 40 MB
● 3500 RPM
● 0.7 MB/sec
● full scan: 1 minute
● 2012
● 4 TB ( x 100000)
● 7200 RPM
● 120 MB/sec ( x 170)
● full scan: 8 hours (x 480)
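The full-scan times follow from dividing capacity by sequential throughput. The sketch below redoes that division in Python, assuming decimal units (1 TB = 1,000,000 MB); the slide's figures are rounded, so the computed values only land in the same range as the quoted 1 minute and 8 hours.

```python
def full_scan_seconds(capacity_mb: float, throughput_mb_per_s: float) -> float:
    """Time to read the whole drive sequentially, in seconds."""
    return capacity_mb / throughput_mb_per_s

MB_PER_TB = 1_000_000  # assumption: decimal units, as drive vendors use

scan_1991 = full_scan_seconds(40, 0.7)               # 40 MB at 0.7 MB/sec
scan_2012 = full_scan_seconds(4 * MB_PER_TB, 120)    # 4 TB at 120 MB/sec

print(f"1991 full scan: {scan_1991:.0f} seconds")
print(f"2012 full scan: {scan_2012 / 3600:.1f} hours")
print(f"capacity grew x{4 * MB_PER_TB / 40:,.0f}, throughput only x{120 / 0.7:.0f}")
```

Capacity grew by about five orders of magnitude while throughput grew by about two, which is why reading everything takes hundreds of times longer.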
DATA ACCESS BECOMES THE
SCARCE RESOURCE!
GOOGLE’S HARDWARE IN 1998
GOOGLE’S HARDWARE IN 2013
• 12 data centers worldwide
• More than a million nodes
• A data center costs $600 million to build
• Oregon data center
• 15000 m2
• power of 30 000 homes
GOOGLE’S HARDWARE IN 2013
• Cheap commodity hardware
• each has its own battery!
• Modular data centers
• Standard container
• 1160 servers per container
• Efficiency: 11% overhead
(power transformation, cooling)
THE BIG DATA PARADIGM SHIFT
TECHNOLOGIES
• Hadoop 2.0
• Google BigQuery
• Cloudera Impala
• Apache Spark
HADOOP DISTRIBUTED FILE
SYSTEM (HDFS)
HADOOP MAP REDUCE
HADOOP
• Who uses Hadoop?
• Facebook: 100 PB
• Yahoo: 4000 nodes
• More than half of Fortune 50 companies!
• History
• Open-source reimplementation of Google's architecture
(GFS, MapReduce, BigTable) in Java under the Apache licence
• Hadoop 2.0
• Full High Availability
• Advanced resource management (YARN)
GOOGLE BIGQUERY
• SQL queries on terabytes of data in seconds
• Data is distributed over thousands of nodes
• Each node processes one part of the dataset
• Thousands of nodes work for us for a few
milliseconds
SELECT year,
       SUM(mother_age * record_weight) / SUM(record_weight) AS age
FROM publicdata:samples.natality
WHERE ever_born = 1
GROUP BY year
ORDER BY year;
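The bullet "each node processes one part of the dataset" can be illustrated for this kind of query. A weighted average splits cleanly into per-node partial sums that are merged at the end. The Python sketch below is only an illustration of that idea, not BigQuery's actual execution engine; the rows are made up, the function names are hypothetical, and the `ever_born = 1` filter is left out for brevity.

```python
from collections import defaultdict

# Hypothetical rows: (year, mother_age, record_weight); not real natality data.
rows = [
    (1970, 24, 2), (1970, 26, 1), (1971, 25, 1),
    (1971, 27, 3), (1970, 22, 1), (1971, 23, 1),
]

def partial_sums(shard):
    """Work done by one node: per-year [sum of age*weight, sum of weight]."""
    acc = defaultdict(lambda: [0, 0])
    for year, age, weight in shard:
        acc[year][0] += age * weight
        acc[year][1] += weight
    return acc

def merge(partials):
    """Combine the per-node partial sums, then finish the division."""
    total = defaultdict(lambda: [0, 0])
    for part in partials:
        for year, (num, den) in part.items():
            total[year][0] += num
            total[year][1] += den
    return {year: num / den for year, (num, den) in sorted(total.items())}

# Pretend each slice of the table lives on its own node.
shards = [rows[0:2], rows[2:4], rows[4:6]]
result = merge(partial_sums(shard) for shard in shards)
print(result)
```

Because only two numbers per year leave each node, the merge step stays cheap no matter how many rows each node scans.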
CLOUDERA IMPALA
• The same idea as BigQuery, but on top of Hadoop
• Standard SQL on Big Data
• On a cluster costing 10 million HUF (Hungarian forints), terabytes of
data can be analyzed interactively
• Scales to thousands of nodes
• Technology highlights
• Run-time code generation with LLVM
• Parquet format (column-oriented)
APACHE SPARK
• Developed at UC Berkeley
• Achieves up to a 100x speedup over
Hadoop on certain tasks
• In-memory cluster computation
INEFFICIENCY CAN WASTE A
HUGE AMOUNT OF RESOURCES
• A 300-node cluster (Hadoop, Hive) = one node (Vectorwise)
• Vectorwise holds the world speed record in analytical
database queries on a single node
CLEVER WAYS TO IMPROVE
EFFICIENCY
• Lossless data compression (up to 50x!)
• Clever lossy compression of data (e.g., OLAP
cubes)
• Cache-aware implementations (asymmetric
trends, memory access bottleneck)
LOSSLESS DATA COMPRESSION
• Compression can speed up sequential data
access by as much as 50x (100 MB/sec -> 5 GB/sec)
• Less data -> fewer I/O operations
• One CPU can decompress data at up to 5 GB/sec
• gzip decompression is comparatively slow
• snappy, lzo and lz4 can reach 1 GB/sec
decompression speed
• Decompression schemes used by column-oriented
databases (e.g., PFOR) can reach 5 GB/sec
• Two billion integers per second (almost one
integer per clock cycle)!
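To give a feel for why integer columns compress and decompress so cheaply, here is a simplified frame-of-reference (FOR) sketch in Python. It is not PFOR itself: PFOR additionally patches outliers as exceptions, which is omitted here. Pure Python will also not come anywhere near the speeds quoted above. All names are illustrative.

```python
def for_encode(values):
    """Frame-of-reference encode a non-empty list of ints.

    Stores the block minimum once, then packs each value's offset
    from that minimum into a fixed number of bits.
    """
    base = min(values)
    offsets = [v - base for v in values]
    width = max(1, max(offsets).bit_length())  # bits needed per offset
    packed = 0
    for i, off in enumerate(offsets):
        packed |= off << (i * width)
    return base, width, len(values), packed

def for_decode(base, width, count, packed):
    """Inverse of for_encode: a shift, a mask and an add per value."""
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(count)]

values = [1000, 1003, 1001, 1007]
base, width, count, packed = for_encode(values)
assert for_decode(base, width, count, packed) == values
print(f"{width * count} bits of payload instead of {32 * count}")
```

Decoding needs only a shift, a mask and an addition per value, which is the kind of work a CPU can do at close to one integer per cycle in a tight native loop.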
EXAMPLE: LOGDRILL
2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
2011-01-08 00:00 GET 200 2
2011-01-08 00:01 GET 200 2
2011-01-08 00:02 GET 404 1
2011-01-08 00:02 POST 200 1
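The slide appears to show raw web-server log lines followed by their per-minute summary (date, minute, HTTP method, status code, count), which is an instance of the lossy aggregation mentioned earlier. A minimal Python sketch that reproduces the summary from the raw lines is shown below; the field positions are an assumption based on the sample lines, and this is not LogDrill's actual implementation.

```python
from collections import Counter

raw_log = """\
2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
"""

def aggregate(lines):
    """Count requests per (date, minute, method, status)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Assumed layout: 0=date, 1=time, 5=method, 11=status code.
        date, time, method, status = fields[0], fields[1], fields[5], fields[11]
        minute = time[:5]  # keep HH:MM, drop the seconds
        counts[(date, minute, method, status)] += 1
    return counts

counts = aggregate(raw_log.splitlines())
for (date, minute, method, status), n in sorted(counts.items()):
    print(date, minute, method, status, n)
```

Six raw lines collapse into four summary rows here; on real traffic, where many requests share the same minute, method and status, the reduction is far larger.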
CACHE-AWARE PROGRAMMING
• CPU speed increasing about 60% a year
• Memory speed increasing only 10% a year
• The growing gap is covered by multi-level
cache memories
• Cache is under-exploited
• Exploiting it can bring a 100x speedup!
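The two growth rates above compound: each year the CPU pulls ahead of memory by a factor of 1.6 / 1.1. A short Python sketch of how fast that gap widens, taking the slide's 60% and 10% figures at face value (the function name is illustrative):

```python
def gap_after(years: int, cpu_rate: float = 0.60, mem_rate: float = 0.10) -> float:
    """Relative CPU-to-memory speed gap after `years` of compounding growth."""
    return ((1 + cpu_rate) / (1 + mem_rate)) ** years

for years in (1, 5, 10):
    print(f"after {years:2d} years the CPU/memory gap is x{gap_after(years):.1f}")
```

Under these assumptions the gap exceeds 40x within a decade, which is why code that stays inside the cache can be so much faster than code that keeps waiting on main memory.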
LESSONS LEARNED
• Big Data is not hype, at least from a
technological viewpoint
• Modern technologies (Impala, Spark) can
reach the theoretical limits of the cluster hardware
configuration
• A deep understanding of both the problem and
the technologies is required to create
efficient Big Data solutions
THANK YOU!
Q&A?
