Definition of Big Data
"Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate."
"Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information."
Data is growing far faster than computation speeds
A single machine can no longer process or even store all this data!
The Big Data problem
Where does Big Data come from?
Online recorded content:
.. everything that happens online can potentially be recorded
User generated content (Facebook, Twitter, Instagram, etc)
Smartphone users reach for their phones about 150 times a day (2013)
Health and scientific computing
The Large Hadron Collider produces roughly twice as much data per year as Twitter
Internet of Things (IoT)
smart thermostat systems
automobiles with built-in sensors
all kinds of "smart" devices of various sizes
Example scales of Big Data
EIR communication logs: 1.4 TB / day
Facebook logs: 60 TB / day
Google total web index: ~10+ PB (10,000 TB)
Facebook total data: 300 PB with an incoming rate of 600 TB / day (2014)
..as a reminder..
time to read 1 TB from disk: ~3 hours (at 100 MB/s)
Google's web index, read serially from a single disk, would take ~3.4 years
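The figures above follow from simple arithmetic; a quick sanity check (using an exactly 10 PB index and 100 MB/s sequential reads, both illustrative round numbers):

```python
# Back-of-the-envelope check of the read times above.
TB = 10**12               # bytes
disk_speed = 100 * 10**6  # 100 MB/s sequential read

seconds_per_tb = TB / disk_speed           # 10,000 s
hours_per_tb = seconds_per_tb / 3600       # ~2.8 hours

index_bytes = 10 * 10**15                  # ~10 PB web index
years = index_bytes / disk_speed / (3600 * 24 * 365)

print(round(hours_per_tb, 1))  # 2.8
print(round(years, 1))         # 3.2
```

With a "10+" PB index the serial read time lands in the ~3-3.4 year range quoted above.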
Let’s design a simple web tracker from scratch
Register and count each page view for a number of clients
“Keep simple things simple”
Huge number of page views => massive DB load on concurrent updates => DB
timeouts => FAIL
Why write each count?!
Let’s introduce a queue and buffer updates
# of page views and # of clients keep increasing => DB overload => FAIL
The bottleneck is the write-heavy DB
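The buffering idea above can be sketched as follows (names like `flush` and the batch size of 100 are illustrative, not a real framework): accumulate counts in memory and write them out in batches, so many small updates become one bulk write.

```python
# Buffer page views in memory; flush to the DB in batches.
from collections import Counter

buffer = Counter()
BATCH = 100      # flush after this many buffered views (illustrative)
db = Counter()   # stand-in for the database

def flush():
    db.update(buffer)  # one bulk write instead of many single updates
    buffer.clear()

def track(client_id, url):
    buffer[(client_id, url)] += 1
    if sum(buffer.values()) >= BATCH:
        flush()

for _ in range(250):
    track("acme", "/home")
flush()  # drain the remainder
print(db[("acme", "/home")])  # 250
```

Batching relieves the DB for a while, but as the slide notes, sustained growth still overloads a single write-heavy database.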
Let’s shard the database!
Have to keep adding new servers and re-sharding existing databases
Re-sharding online is tricky (maybe introduce pending queues?)
A single code failure corrupts a huge set of data collected over years
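A minimal hash-based shard router illustrates why re-sharding is painful (the key format and shard counts are hypothetical): growing from 4 to 5 shards remaps most keys, so nearly all rows have to move while the system stays online.

```python
# Naive modulo-hash shard routing: changing the shard count moves most keys.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

keys = [f"client-{i}" for i in range(1000)]
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(moved)  # roughly 800 of 1000 keys change shard
```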
Is there a way out?
We need new tools which handle:
automatic sharding and re-sharding
automatic replication and rebalancing
effortless horizontal scaling
But we need to adapt ourselves as well. We need:
a new definition of “data” (data ≠ information)
new architectures (Lambda Architecture)
immutable data (for scaling and fault tolerance)
functional programming concepts
No, writing 25-year-old-style structured code in this year's favorite language won't cut it anymore
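The immutability point can be sketched concretely (names like `record` and `page_view_counts` are illustrative): store raw events append-only, and derive the counts as a pure function of the log. A bug in the view logic never corrupts the source data; you fix the function and recompute, which is the core idea behind the Lambda Architecture's batch layer.

```python
# Immutable event log + derived view: events are only ever appended.
from collections import Counter

event_log = []  # append-only; events are never updated or deleted

def record(client_id, url, ts):
    event_log.append((client_id, url, ts))  # the only write ever performed

def page_view_counts(log):
    """Batch view: a pure function from the full log to the derived table."""
    return Counter((c, u) for c, u, _ in log)

record("acme", "/home", 1)
record("acme", "/home", 2)
record("beta", "/docs", 3)
views = page_view_counts(event_log)
print(views[("acme", "/home")])  # 2
```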
Big Data tooling
Apache Hadoop distributed filesystem (HDFS)
Distributed, scalable, portable filesystem written in Java
Open source, 10 years old (!) project
Handles files in the gigabytes-terabytes range
Manages automatic replication and rebalancing of data
Facebook had 21 PB of storage on HDFS in 2010
Yahoo had a cluster of 10 000 Hadoop nodes in 2008
Apache Spark
Next-generation data processing engine written in Scala
Open source, 5 years old project
Up to 100 times faster than Hadoop MapReduce
Uses functional programming techniques to process data
Can scale down to run in an IDE!
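The functional style Spark popularized can be imitated with plain Python builtins (this is not the PySpark API, just an illustration of the concepts): transform an immutable collection with map-like steps and collapse it with reduce, with no in-place mutation.

```python
# Word count in the map/reduce style Spark uses, with Python builtins.
from functools import reduce

lines = ["big data big tools", "data tools"]

words = [w for line in lines for w in line.split()]  # flatMap-like step
pairs = map(lambda w: (w, 1), words)                 # map
counts = reduce(                                     # reduceByKey-like step
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
print(counts["data"])  # 2
```

In Spark the same pipeline runs unchanged on one laptop core or a thousand-node cluster, which is what "scale down to run in an IDE" refers to.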