1




 Big Data
 the next frontier

RVC Seminar                                Leonid Zhukov
Moscow, 08/02/2013   Professor Higher School of Economics
2
Big data




+ Graph of term popularity




                              www.visibletechologies.com
3
McKinsey, May 2011




                     www.mckinsey.com
4
Headlines




            Data driven business

            Data democratization

            Data scientists
5
The White House



+ $200M initiative
+ NSF: core techniques
+ NIH: 1000 genomes
+ DOE: advanced computing
+ DOD: data to decisions
+ USGS: Earth system


                            www.whitehouse.gov
6
Gartner Hype Cycle




                     www.gartner.com
7
 Market Forecast




+ Market forecasts:
  + IDC: 2015 - $16.9B
  + Gartner: 2016 - $55B
+ Venture money invested (Reuters):
  + 2009 - $1.1B
  + 2010 - $1.53B
  + 2011 - $2.47B
                                                      www.wikibon.com
8
Big Data Revenue 2012




 + Big Business:
    +   IBM
    +   HP
    +   Oracle
    +   Teradata
    +   EMC             www.wikibon.com
9
Big Data Vendors!




    + Hadoop:
      + Cloudera
      + MapR Technologies
      + HortonWorks          www.wikibon.com
10
Forrester Wave




                 www.forrester.com
11
What is big data




+ Big data:
  + “Data you can’t process by traditional tools”
  + “A phenomenon defined by the rapid acceleration in the
     expanding volume of high velocity, complex and diverse
     types of data.”

  + “Refers to a collection of tools, techniques and technologies
     for working with data productively, at any scale.”
12
What is Big data

 + 3V
    + Volume: petabytes (1000TB) to exabytes (1000PB)
    + Variety: structured, semi-structured, unstructured
    + Velocity: Tb/s data streams
 + Requires distributed processing
 + Big data = storage + processing
 + Big data = Hadoop (not only)
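
A rough back-of-the-envelope sketch of why volumes like these require distributed processing. The 100 MB/s sequential read rate per disk and the 1,000-node cluster are illustrative assumptions, not figures from the slides.

```python
# Illustrative assumptions (not from the slides): one disk streams about
# 100 MB/s, and the cluster has 1,000 nodes reading in parallel.
TB = 10**12                      # decimal units, as on the slide (1 PB = 1000 TB)
PB = 1000 * TB
DISK_BYTES_PER_SEC = 100 * 10**6

def scan_time_seconds(data_bytes, nodes=1):
    """Time to read the data once if it is spread evenly over `nodes` disks."""
    return data_bytes / (DISK_BYTES_PER_SEC * nodes)

single = scan_time_seconds(PB)           # one machine
cluster = scan_time_seconds(PB, 1000)    # 1,000 machines in parallel
print(f"1 PB on one disk:    {single / 86400:.1f} days")
print(f"1 PB on 1,000 disks: {cluster / 3600:.1f} hours")
```

Under these assumptions a single pass over one petabyte takes months on one disk and a few hours on the cluster, which is the practical meaning of "storage + processing" on the same machines.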
13
Big data Glossary


+ Hadoop, MapReduce, Hive, Pig, Cascading,
  HBase, Hypertable, Cassandra, Flume, Sqoop,
  MongoDB, Voldemort, Storm, Kafka, Drill, Dremel,
  Impala, ZooKeeper, Ambari, Oozie, YARN, Redis,
  Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R,
  Mahout, Weka
14
How big is Big?

+ Google
  + 24 PB data processed daily
+ Twitter
  + 340 million daily tweets
  + 1.6 billion search queries
  + 7 TB added daily
+ Facebook
  + 750 million users
  + 12 TB of content added daily
  + 2.7 billion “likes” and comments daily
15
Sources of Big Data




                      www.ibm.com
16
Supercomputing


+ National Labs, Universities, Military
+ Processing power, flops, MPI
+ Parallel computing:
   + Cray, IBM SP, SGI
   + Beowulf cluster (Linux commodity)
17
New realities


+ Yahoo, AltaVista, Inktomi, Google
+ Consumer web companies:
   + web search (crawling, indexing)
   + advertising
   + email services
   + ecommerce


   + Commodity hardware
18
Google




+ 2003: GFS paper
+ 2004: MapReduce paper
19
GFS/HDFS

+ Distributed replicated data blocks (64 MB)
+ Master-slave architecture (Name Node, Data Nodes)
+ Not a general file system
+ Access via command line utils and API
+ Can’t modify files after they are written
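
The block and replica idea above can be sketched in a few lines. This is a toy model, not the HDFS implementation: the replication factor of 3, the node names, and the round-robin placement are assumptions made for illustration.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, as on the slide
REPLICATION = 3                 # assumption: a common default, not from the slides

def split_into_blocks(file_size):
    """Return (offset, length) pairs covering a file of `file_size` bytes."""
    return [(off, min(BLOCK_SIZE, file_size - off))
            for off in range(0, file_size, BLOCK_SIZE)]

def place_replicas(num_blocks, data_nodes):
    """Toy round-robin placement: each block goes to REPLICATION distinct nodes.
    Needs at least REPLICATION nodes; real policies also weigh racks and free space."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)     # a 200 MB file
table = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, nodes in table.items():
    print(block_id, blocks[block_id], nodes)
```

The `table` dictionary plays the role of the Name Node's metadata (which block lives where), while the node names stand in for Data Nodes holding the bytes.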
20
MapReduce

+ MapReduce programming model:
  + functional programming
  + like UNIX pipeline
+ Master-slave architecture
  + Master: divide, schedule, monitor work
  + Slave: actual processing
+ Scalable:
  + no file IO
  + no networking
  + no synchronization
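
A minimal single-process sketch of the programming model, using the usual word-count example. It only illustrates the map, shuffle, and reduce steps; the function names are chosen for this sketch and are not any framework's API. In a real cluster the framework runs the map and reduce functions on many machines and performs the shuffle itself.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key (the framework does this in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data the next frontier", "big data is big"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The user writes only `map_phase` and `reduce_phase`, both pure functions of their inputs, which is why the code itself needs no file IO, networking, or synchronization.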
21
 Data movement




+ store and process data on the same nodes
+ bring code to data, data “locality”
                                             www.cloudera.com
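
A small sketch of what "bring code to data" means for a scheduler: prefer a node that already stores the block, and read remotely only when no such node has capacity. The block names, node names, and slot counts are hypothetical, and real schedulers are considerably more involved.

```python
def schedule(block_locations, free_slots):
    """Assign each block's task to a node that already stores the block when
    one has a free slot; otherwise fall back to any node with capacity."""
    assignment = {}
    slots = dict(free_slots)                 # copy so the caller's dict is untouched
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            raise RuntimeError("no free task slots in the cluster")
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = (node, node in holders)   # (node, is_data_local)
    return assignment

locations = {"b0": ["n1", "n2"], "b1": ["n1", "n3"], "b2": ["n1", "n2"]}
plan = schedule(locations, {"n1": 1, "n2": 1, "n3": 1})
for block, (node, is_local) in plan.items():
    print(block, "->", node, "local" if is_local else "remote")
```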
22
Hadoop
+ Doug Cutting
  + Search indexer - Lucene
  + Web crawler - Nutch
  + Hadoop
     + HDFS
     + MapReduce
23
Yahoo!
+ 40,000 servers
+ 170PB storage
+ 1000+ active users
+ 5M+ monthly jobs
+ email spam filters
+ categorization, personalization
+ computational advertising
24
NoSQL Database Revolution
+ Needed:
   + fast read/write time
   + high concurrency
   + easy horizontal scalability
+ Flat data structure
+ Sacrificed:
   + DB Schema
   + SQL
   + Transactions
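
The trade-off above can be sketched as a toy key-value store that spreads keys over shards by hashing. The class name, the shard count, and the sample records are hypothetical. Simple modulo hashing is used for brevity; it forces most keys to move when a shard is added, which is why production systems use schemes such as consistent hashing.

```python
import hashlib

class ShardedKeyValueStore:
    """Toy key-value store: each key lives on exactly one shard, chosen by
    hashing the key. No schema, no SQL, no multi-key transactions."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)

store = ShardedKeyValueStore(num_shards=4)
store.put("user:42", {"name": "Ann", "likes": 17})
store.put("user:43", {"name": "Bob"})        # records need not share fields
print(store.get("user:42"))
print([len(s) for s in store.shards])        # how the keys spread over shards
```

Because every read and write touches a single shard, shards can sit on separate machines and serve requests concurrently.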
25
NoSQL World

+ Key-value: Dynamo, Voldemort, Redis, Riak
+ Column (tabular): HBase, Hypertable, Cassandra
+ Document store: CouchDB, MongoDB
+ Graph: Neo4J, FlockDB
+ 120+ products (2012)
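
To make the four families concrete, the sketch below stores roughly the same facts in each data model using plain Python structures. The records and field names are invented for illustration and do not reflect any product's storage format.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ann", "city": "Moscow"}'}

# Column (tabular): row key -> column family -> column -> value.
column = {"user:42": {"profile": {"name": "Ann", "city": "Moscow"}}}

# Document: nested, self-describing records that can be queried by field.
documents = [{"_id": 42, "name": "Ann", "city": "Moscow", "follows": [43]}]

# Graph: nodes plus explicit, labeled edges between them.
nodes = {42: {"name": "Ann"}, 43: {"name": "Bob"}}
edges = [(42, "FOLLOWS", 43)]

def followers_of(user_id):
    """Graph-style query: who has a FOLLOWS edge pointing at `user_id`?"""
    return [src for src, label, dst in edges
            if label == "FOLLOWS" and dst == user_id]

print(followers_of(43))
```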
26
Hadoop stack




               www.hortonworks.com
27
Hadoop tools

+ Pig
  + high level scripting language (PigLatin)
  + converts to MapReduce jobs
+ Hive
  + SQL-like queries on data in HDFS
  + converts to MapReduce jobs
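
As an illustration of how a SQL-like query can become a MapReduce job, here is a hand-written sketch of one aggregation expressed as a map step and a reduce step. The table, columns, and query are hypothetical, and the plans Hive or Pig actually generate are more elaborate.

```python
from collections import defaultdict

# Hypothetical query:  SELECT country, SUM(amount) FROM sales GROUP BY country
sales = [
    {"country": "RU", "amount": 10.0},
    {"country": "US", "amount": 5.0},
    {"country": "RU", "amount": 2.5},
]

def map_row(row):
    """Map: the GROUP BY column becomes the key, the aggregated column the value."""
    return row["country"], row["amount"]

def reduce_group(key, values):
    """Reduce: apply the aggregate function (SUM) to one group."""
    return key, sum(values)

groups = defaultdict(list)                 # stands in for the shuffle step
for key, value in map(map_row, sales):
    groups[key].append(value)

result = dict(reduce_group(k, v) for k, v in groups.items())
print(result)
```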
28

Hadoop data movement




                       www.cloudera.com
29
Typical Hadoop usage
 +   Text mining
 +   Pattern recognition
 +   Recommendation systems (collaborative filtering)
 +   Prediction models
 +   Risk assessment
 +   Sentiment analysis
 +   Customer churn prediction
 +   Customer segmentation
 +   Point of Sale Transaction analysis
 +   Data “sandbox”
30

Application fields

+ Science: sensors, genome, weather, satellite,
   imaging

+ Engineering: log analytics, status feeds, network
   messages, spam filters

+ Product: financial, pharmaceutical, insurance,
   energy, retail, ecommerce, healthcare, telecom

+ Business: analytics, BI
31
Business analytics



+ Analytic
+ Operational




        Capture, analyze, learn from data
                                            www.datasciencecentral.com
32
Who uses Hadoop?




                   www.cloudera.com
33
Why Hadoop?




              www.thinkbiganalytics.com
34
Cloudera




+ Enterprise support for Apache Hadoop
+ Founded 2008, funding $141 M
+ Employees: 230
+ Products:
  + CDH 4 (Cloudera distribution of Hadoop)
  + Impala
  + Consulting and training
                                           www.cloudera.com
35
MapR




+ Founded 2009, funding $20M
+ MapR Technologies is engineering game-
  changing Map/Reduce related technologies

+ Products:
  + M3,M5,M7
  + NFS, no single point of failure
  + NOT open source !
                                             www.mapr.com
36
HortonWorks




+ Founded 2011
+ Yahoo spin-off
+ Products:
  + HDP distribution
  + tools

                       www.hortonworks.com
37
Hadoop Ecosystem




                   www.datameer.com
38
Big Data Landscape




                     www.bigdatalandscape.com
39
Splunk




+ Founded 2003, raised $230M, IPO 2012, Market cap $3.35B
+ Machine logs analysis, operational intelligence
+ Collecting, searching, monitoring




                                                            www.splunk.com
40
Datameer




+ Founded 2009, funding $17.8M

+ Big data:
  + Data integration
  + Data Analytics
  + Data Visualization
                         www.datameer.com
41
Datasift




+ Founded 2010, funding $29.7M
+ Data platform for social web
+ Aggregate and filter data



                                 www.datasift.com
42
Infochimps




+ Founded 2009, funding $5.5M
+ Transitioned from data marketplace to big data platform
+ End-to-end big data solution, real time




                                                        www.infochimps.com
43
Tableau software




+ Founded 2003, funding $15M
+ Big data analytics
+ Big data visualization

                               www.tableau.com
44
Big data startups 2012

+ Platfora, in memory BI on Hadoop
+ Sumologic, log file analysis
+ Hadapt, Hadoop + RDBMS
+ Metamarkets, patterns in data flow
+ DataStax, consulting, training
+ Karmasphere, BI, analytics on Hadoop
45
Big data startups 2013!


+ 10gen, MongoDB
+ ClearStory, big data aggregation + analytics
+ Continuuity, Hadoop API
+ Parstream, database analytics
+ Zoomdata, data visualization
+ Climate corporation, predictive analytics
46
Big data by industry




                       www.gartner.com
47
Big data Processing

                    Batch processing    Interactive                Stream

Query time          minutes to hours    milliseconds to seconds    continuous
Data volume         TB to PB            GB to PB                   continuous
Programming model   MapReduce           Queries                    DAG
Users               Developers          Analysts                   Developers
Open source         Hadoop MapReduce    Drill, Impala              Storm, Kafka
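
The difference between the batch and stream columns can be sketched with the same computation done both ways. The event names are invented, and the code runs in a single process purely to contrast the two styles.

```python
from collections import Counter

events = ["login", "click", "click", "logout", "click"]

def batch_count(all_events):
    """Batch: the full data set is available; scan it once and report."""
    return Counter(all_events)

class StreamingCounter:
    """Stream: events arrive one at a time; state is updated incrementally
    and a current answer is available at any moment."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event] += 1
        return self.counts[event]

stream = StreamingCounter()
for e in events:
    stream.on_event(e)

# Both styles agree once the stream has seen the same events.
print(batch_count(events) == stream.counts)
```

The batch version answers after the whole input has been read, while the streaming version keeps a running answer that never "finishes".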
48
New technologies

+ Real-time querying
  + Drill (based on Google Dremel)
  + Impala (Cloudera)


+ Data stream processing
  + Storm (Twitter), real time analytics
  + Kafka (LinkedIn), messaging system
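
A simplified model of log-based messaging, the general idea behind systems of this kind: producers append to a log, and each consumer tracks its own read position. The class and method names are invented for this sketch and are not the Kafka API.

```python
class Log:
    """Toy append-only message log with independent consumer offsets."""

    def __init__(self):
        self.messages = []
        self.offsets = {}                 # consumer name -> next offset to read

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1     # offset of the new message

    def poll(self, consumer, max_messages=10):
        """Return the next batch for `consumer` and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch

log = Log()
for m in ["m0", "m1", "m2"]:
    log.append(m)

print(log.poll("analytics", max_messages=2))   # first two messages
print(log.poll("analytics"))                   # the remaining one
print(log.poll("archiver"))                    # a new consumer starts from offset 0
```

Because messages are never removed on read, a slow or newly added consumer does not affect the others.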
49
Machine learning

 + Predictive analytics
 + Patterns discovery
 + Data mining
 + Tools:
    + Mahout
    + R
50
Big data revolution

+ Google: GFS, MapReduce, BigTable
+ Yahoo: Hadoop
+ Amazon: DynamoDB
+ Facebook: Cassandra, HBase
+ Twitter: FlockDB, Storm
+ LinkedIn: Voldemort, Kafka
51
Observations

+ Game changing technologies come from big companies
+ Open Source (!)
+ Start-up ecosystem
+ Less general, more specialized
+ Next step: big data analytics and visualization
52
Data scientist

+ Machine Learning
+ Data Mining
+ Statistics
+ Software Engineering
+ Hadoop/MapReduce/HBase/Hive/Pig
+ Java, Python, C/C++, SQL

“By 2018, the United States alone could face a shortage of 140,000 to 190,000
people with deep analytical skills as well as 1.5 million managers and analysts with
the know-how to use the analysis of big data to make effective decisions.”
53
Big Data Products MindMap




                    www.garycrawford.co.uk
54
Contacts


+ Leonid Zhukov, Ph.D.
+ School of Applied Mathematics and Information Science
   Higher School of Economics, NRU-HSE

+ lzhukov@hse.ru
+ www.leonidzhukov.ru

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Business of Big Data

  • 1. 1 Big Data: the next frontier. RVC Seminar, Moscow, 08/02/2013. Leonid Zhukov, Professor, Higher School of Economics
  • 2. 2 Big data + Graph of terms popularity www.visibletechologies.com
  • 3. 3 McKinsey, May 2011 www.mckinsey.com
  • 4. 4 Headlines Data driven business Data democratization Data scientists
  • 5. 5 The White House + $200M initiative + NSF: core techniques + NIH: 1000 genomes + DOE: advanced computing + DOD: data to decisions + USGS: Earth system www.whitehouse.gov
  • 6. 6 Gartner Hype Cycle www.gartner.com
  • 7. 7 Market Forecast + Market forecasts: + IDC: 2015 - $16.9B + Gartner: 2016 - $55B + Venture money invested (Reuters): + 2009 - $1.1B + 2010 - $1.53B + 2011 - $2.47B www.wikibon.com
  • 8. 8 Big Data Revenue 2012 + Big Business: + IBM + HP + Oracle + Teradata + EMC www.wikibon.com
  • 9. 9 Big Data Vendors! + Hadoop: + Cloudera + MapR Technologies + HortonWorks www.wikibon.com
  • 10. 10 Forrester Wave www.forrester.com
  • 11. 11 What is big data + Big data: + “Data you can’t process by traditional tools” + “A phenomenon defined by the rapid acceleration in the expanding volume of high velocity, complex and diverse types of data.” + “Refers to a collection of tools, techniques and technologies for working with data productively, at any scale.”
  • 12. 12 What is Big data + 3V + Volume: petabytes (1000TB) to exabytes (1000PB) + Variety: structured, semi-structured, unstructured + Velocity: Tb/s data streams + Requires distributed processing + Big data = storage + processing + Big data = Hadoop (not only)
  • 13. 13 Big data Glossary + Hadoop, MapReduce, Hive, Pig, Cascading, HBase, Hypertable, Cassandra, Flume, Sqoop, Mongo, Voldemort, Storm, Kafka, Drill, Dremel, Impala, Zookeeper, Ambari, Oozie, Yarn, Redis, Riak, Pregel, Gremlin, Giraph, Solr, Lucene, R, Mahout, Weka
  • 14. 14 How big is Big? + Google + 24 PB data processed daily + Twitter + 340 million daily tweets + 1.6 billion search queries + 7 TB added daily + Facebook + 750 million users + 12 TB daily content + 2.7 billion “likes” and comments daily
  • 15. 15 Sources of Big Data www.ibm.com
  • 16. 16 Supercomputing + National Labs, Universities, Military + Processing power, flops, MPI + Parallel computing: + Cray, IBM SP, SGI + Beowulf cluster (Linux commodity)
  • 17. 17 New realities + Yahoo, AltaVista, Inktomi, Google + Consumer web companies: + web search (crawling, indexing) + advertising + email services + ecommerce + Commodity hardware
  • 19. 19 GFS/HDFS + Distributed replicated data blocks (64 MB) + Master-slave architecture (Name Node, Data Nodes) + Not a general file system + Access via command line utils and API + Files can’t be modified once written
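As a rough illustration of the block storage described on the GFS/HDFS slide, the sketch below estimates how many blocks and replicas a file occupies. The 64 MB block size comes from the slide; the replication factor of 3 and the function name `block_layout` are assumptions made for this example, not values given in the talk.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # assumed replication factor; the slide does not give one


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate block count, replica count, and raw storage for one file."""
    # The last block may be partially filled, so round the block count up.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "replicas": blocks * replication,
        # A partial last block only stores the data it holds,
        # so raw usage is the file size times the replication factor.
        "raw_storage_mb": file_size_mb * replication,
    }


if __name__ == "__main__":
    print(block_layout(1000))
```

Under these assumptions, a 1000 MB file is split into 16 blocks and stored as 48 block replicas across the data nodes.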
  • 20. 20 MapReduce + Scalable: + no file IO + no networking + no synchronization + Master-slave architecture + Master: divide, schedule, monitor work + Slave: actual processing + MapReduce programming model: + functional programming + like UNIX pipeline
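The MapReduce programming model named on this slide can be sketched on a single machine as a word count. This is a minimal illustration, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the in-memory shuffle stands in for the grouping that the framework would perform between the map and reduce phases across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence, like stages of a UNIX pipeline."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

The map and reduce functions are pure and share no state, which is what lets a master split the input and schedule them on different slaves without the programmer handling file IO, networking, or synchronization.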
  • 21. 21  Data movement + store and process data on the same nodes + bring code to data, data “locality” www.cloudera.com
  • 22. 22 Hadoop + Doug Cutting + Search indexer - Lucene + Web crawler - Nutch + Hadoop + HDFS + MapReduce
  • 23. 23 Yahoo! + 40,000 servers + 170PB storage + 1000+ active users + 5M+ monthly jobs + email spam filters + categorization, personalization + computational advertising
  • 24. 24 NoSQL Data Base Revolution + Needed: + fast read/write time + high concurrency + easy horizontal scalability + Flat data structure + Sacrificed: + DB Schema + SQL + Transactions
  • 25. 25 NoSQL World + Key-value: Dynamo, Voldemort, Redis, Riak + Column (tabular): HBase, Hypertable, Cassandra + Document store: CouchDB, MongoDB + Graph: Neo4J, FlockDB + 120+ products (2012)
  • 26. 26 Hadoop stack www.hortonworks.com
  • 27. 27 Hadoop tools + Pig + high level scripting language (PigLatin) + converts to MapReduce jobs + Hive + SQL-like queries on data in HDFS + converts to MapReduce jobs
  • 28. 28 Hadoop data movement www.cloudera.com
  • 29. 29 Typical hadoop usage + Text mining + Pattern recognition + Recommendation systems (collaborative filtering) + Prediction models + Risk assessment + Sentiment analysis + Customer churn prediction + Customer segmentation + Point of Sale Transaction analysis + Data “sandbox”
  • 30. 30 Application fields + Science: sensors, genome, weather, satellite, imaging + Engineering: log analytics, status feeds, network messages, spam filters.. + Product: financial, pharmaceutical, insurance, energy, retail, ecommerce, healthcare, telecom + Business: analytics, BI
  • 31. 31 Business analytics + Analytic + Operational Capture, analyze, learn from data www.datasciencecentral.com
  • 32. 32 Who uses Hadoop? www.cloudera.com
  • 33. 33 Why Hadoop? www.thinkbiganalytics.com
  • 34. 34 Cloudera + Enterprise support for Apache Hadoop + Founded 2008, funding $141M + 230 employees + Products: + CDH 4 (Cloudera Distribution of Hadoop) + Impala + Consulting and training www.cloudera.com
  • 35. 35 MapR + Founded 2009, funding $20M + MapR Technologies is engineering game-changing Map/Reduce related technologies + Products: + M3, M5, M7 + NFS, no single point of failure + NOT open source! www.mapr.com
  • 36. 36 HortonWorks + Founded 2011 + Yahoo spin-off + Products: + HDP distribution + tools www.hortonworks.com
  • 37. 37 Hadoop Ecosystem www.datameer.com
  • 38. 38 Big Data Landscape www.bigdatalandscape.com
  • 39. 39 Splunk + Founded 2003, raised $230M, IPO 2011, Market cap $3.35B + Machine logs analysis, operational intelligence + Collecting, searching, monitoring www.splunk.com
  • 40. 40 Datameer + Founded 2009, funding $17.8M + Big data: + Data integration + Data Analytics + Data Visualization www.datameer.com
  • 41. 41 Datasift + Founded 2010, funding $29.7M + Data platform for social web + Aggregate and filter data www.datasift.com
  • 42. 42 Infochimps + Founded 2009, funding $5.5M + Transitioned from data marketplace to big data platform + End-to-end big data solution, real time www.infochimps.com
  • 43. 43 Tableau software + Founded 2003, funding $15M + Big data analytics + Big data visualization www.tableau.com
  • 44. 44 Big data Startups 2012 + Platfora, in-memory BI on Hadoop + Sumologic, log file analysis + Hadapt, Hadoop+RDBMS + Metamarkets, patterns in data flow + DataStax, consulting, training + Karmasphere, BI, analytics on Hadoop
  • 45. 45 Big data startups 2013! + 10gen, MongoDB + ClearStory, big data aggregation + analytics + Continuuity, Hadoop API + Parstream, database analytics + Zoomdata, data visualization + Climate corporation, predictive analytics
  • 46. 46 Big data by industry www.gartner.com
  • 47. 47 Big data Processing
      + Batch: query time minutes to hours; data volume TB to PB; programming model MapReduce; users: developers; open source: Hadoop MapReduce
      + Interactive: query time milliseconds to seconds; data volume GB to PB; programming model queries; users: analysts; open source: Drill, Impala
      + Stream processing: query time continuous; data volume continuous; programming model DAG; users: developers; open source: Storm, Kafka
  • 48. 48 New technologies + Real time querying + Drill (based on Google Dremel) + Impala (Cloudera) + Data stream processing + Storm (Twitter), real time analytics + Kafka (LinkedIn), messaging system
  • 49. 49 Machine learning + Predictive analytics + Patterns discovery + Data mining + Tools: + Mahout + R
  • 50. 50 Big data revolution + Google: GFS, MapReduce, BigTable + Yahoo: Hadoop + Amazon: DynamoDB + Facebook: Cassandra, HBase + Twitter: FlockDB, Storm + LinkedIn: Voldemort, Kafka
  • 51. 51 Observations + Game changing technologies come from big companies + Open Source (!) + Start-up ecosystem + Less general, more specialized + Next step: big data analytics and visualization
  • 52. 52 Data scientist + Machine Learning + Data Mining + Statistics + Software Engineering + Hadoop/MapReduce/HBase/Hive/Pig + Java, Python, C/C++, SQL + “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
  • 53. 53 Big Data Products MindMap www.garycrawford.co.uk
  • 54. 54 Contacts + Leonid Zhukov, Ph.D. + School of Applied Mathematics and Information Science Higher School of Economics, NRU-HSE + lzhukov@hse.ru + www.leonidzhukov.ru