1st Korea Community Day presentation material: Bigdata
Cloud Computing
BigData
2011.12
babokim@gmail.com
(www.gruter.com)
     SDS, NHN
www.jaso.co.kr
www.cloudata.org
www.cloumon.org
www.twitter.com/babokim
www.facebook.com/babokim
BigData Definition(1)
   What is BigData?
     - DB (McKinsey, 2011)
     - DB (IDC, 2011)
     - Big Data
   [Figure: Economist (2010.05), Gartner (2011.03), McKinsey (2011.05); labels: SNS, M2M, Information silo]
   : Big Data, (KT )
BigData Definition(2)

     Very large, distributed aggregations of loosely
     structured data
           Petabytes/exabytes of data,
           Millions/billions of people,
           Billions/trillions of records,
           Loosely-structured and often distributed data,
           Flat schemas with few complex interrelationships,
           Often involving time-stamped events,
           Often made up of incomplete data,
           Often including connections between data elements that must
           be probabilistically inferred,
     Applications that involve Big-data can be
           Transactional (e.g., Facebook, PhotoBox), or
           Analytic (e.g., ClickFox, Merced Applications).
  http://wikibon.org/wiki/v/Enterprise_Big-data
Big-data Analytics Complements Data Warehouse


    Traditional Data Warehouse

      -   Complete record from transactional system
      -   All data centralized
      -   Analytics designed against stable environment
      -   Many reports run on a production basis



                        Big-data Analytic Environment
                          - Data from many sources inside and outside of organization
                            (including traditional DW)
                          - Data often physically distributed
                          - Need to iterate on the solution to test/improve models
                          - Large-memory analytics also part of iteration
                          - Every iteration usually requires complete reload of information


  http://wikibon.org/wiki/v/Enterprise_Big-data
Facebook Social plug-in
  Process over 20 billion events per day (about 200,000 events per second)
  with a lag of less than 30 seconds.
  [Figure: Transactional, Feedback, Analytic]
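A quick check of the throughput figure above, as a minimal Python sketch. It only converts the quoted daily total into an average rate; the function name is illustrative and the 30-second lag is taken from the slide. The average works out slightly above the rounded 200,000 events per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def events_per_second(events_per_day: int) -> float:
    """Average event rate implied by a daily total."""
    return events_per_day / SECONDS_PER_DAY


rate = events_per_second(20_000_000_000)   # 20 billion events per day
in_flight = rate * 30                      # events arriving within a 30-second lag window

print(f"{rate:,.0f} events/sec on average")
print(f"{in_flight:,.0f} events arrive during a 30-second lag")
```

The real peak rate is not given in the slide, so this is an average only.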
BigData
   Collecting -> Store -> Analysis -> Reporting/Searching
   Collecting: SNS; Robot, RSS Reader, OpenAPI, Data Aggregator
   Store: (DBMS, NoSQL)
   Analysis: Clustering, Classification, Sentiment Analysis, Indexing; Repository/Index
   Reporting/Searching: User Defined Query Script, ETL
Twitter: backtype
   1. Distribute tweets randomly on multiple queues.
   2. Workers schemify tweets and append to Hadoop. Workers share the load of schemifying tweets.
   3. Workers choose the queue to enqueue to using hash/mod of the URL, so all updates for the same URL are guaranteed to go to the same worker.
   4. Workers update statistics on URLs by incrementing counters in Cassandra.
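The hash/mod routing step can be sketched in a few lines of Python. This is an illustration of the idea only, not BackType's code: the queue count, function names, and the choice of CRC32 as the hash are assumptions. A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings.

```python
import zlib
from collections import defaultdict

NUM_QUEUES = 4  # hypothetical number of queues/workers


def queue_for(url: str, num_queues: int = NUM_QUEUES) -> int:
    """Pick a queue index by hash/mod of the URL (stable across processes)."""
    return zlib.crc32(url.encode("utf-8")) % num_queues


def route(urls):
    """Group URL updates by queue; every update for one URL lands in one queue."""
    queues = defaultdict(list)
    for url in urls:
        queues[queue_for(url)].append(url)
    return queues


def count_urls(queue):
    """What one worker would do: increment a counter per URL it receives."""
    counts = defaultdict(int)
    for url in queue:
        counts[url] += 1
    return dict(counts)


updates = ["http://a.example/x", "http://b.example/y", "http://a.example/x"]
queues = route(updates)
per_worker = {q: count_urls(items) for q, items in queues.items()}
```

Because a URL always maps to the same queue, a single worker owns its counter and no cross-worker coordination is needed for the increment.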
BigData
 Architectural Requirements
 Scalability
  - Scale-out
  - Elasticity
 Reliability
 Flexibility
  - Easy to add analysis rules
  - Support for various data formats
 Latency
  - Real time, near real time, batch
 High Throughput
  - Global web-scale traffic
  -        ~    /sec
 [Right-hand column, partially recoverable: Hadoop; Component; IBM, HP, Oracle; BI/DW]
BigData
   - Flume, Scribe, Chukwa
   - Hadoop FileSystem, MogileFS
   - NoSQL (Cloudata, HBase, Cassandra)
   - Katta, ElasticSearch
   - count, sum aggregation: S4, Storm
   - Hadoop MapReduce (Hive, Pig)
   - Giraph, GoldenOrb
   - Cluster, Classification: Mahout, R
   - ZooKeeper, HUE, Cloumon
   - Serialization: Thrift, Avro, ProtoBuf
Hadoop Ecosystem




 http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
Software Stack
   Interface: Web, Phone, Pad
   Data Visualization
   (Near) Real-time Analysis: Analysis Job; Real-time Analysis Platform; CEP Engine (Esper)
   Batch Analysis: Analysis Job; Script Language (Hive, Pig); Mining Lib (Mahout); Statistics Lib (R); Job Workflow Engine (oozie, cascade); Data Analysis Platform (hadoop)
   Aggregator: Collector (flume, scribe, chukwa)
   Data Store: File System (HadoopFS); NoSQL (Cloudata, HBase, Cassandra); Search (ElasticSearch)
   Management: Rule Management; Monitoring (cloumon); Cluster Management (ZooKeeper)
Application
   [Figure: Application Server (log, Log4j) -> Agent (local), Temp Log -> Collector #1, Collector #2 -> Centralized Storage (HDFS)]


Chukwa (Yahoo)
     Hadoop FileSystem, HDFS, MapReduce
Scribe (Facebook)
     (thrift)
     Hadoop, JNI
Flume (Cloudera)
     Hadoop, HBase, Search Engine
- Esper: Event
- Gruter ClouStream, Yahoo S4, Twitter Storm, Facebook Puma
[Figures: ClouStream, Puma]
Hadoop File System
   BigData: de facto standard
   x86
   NameNode: SPOF (Single Point Of Failure)
MapReduce
Hadoop MapReduce
   Hadoop FileSystem, DB, FTP Server
   Schedulers: FIFO, Fair, Capacity
   MapReduce (streaming)
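For readers new to the model, the map, shuffle, and reduce phases can be sketched as a single-process word count in Python. This is an illustration of the programming model only: it does not use Hadoop's API, and the function names are made up for the example. Hadoop runs the same three phases across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


counts = word_count(["big data", "big cluster"])
```

With Hadoop streaming, the map and reduce steps would be separate scripts reading from standard input and writing to standard output, with the framework performing the shuffle between them.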
: Script Language

          Hive

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-- invites is partitioned, so the load has to name a target partition
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;




          Pig

-- Canonicalize is a user-defined function (UDF), not a Pig built-in
Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url) as url, time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
-- after a group, the grouping key is exposed as "group"
UserPageranks = foreach UserVisits generate group as user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';
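To make the data flow of the Pig script concrete, here is a rough single-machine equivalent in Python. It is a sketch under assumptions: `canonicalize` is a hypothetical stand-in for the Canonicalize UDF (its real behavior is not given in the slides), each page URL is assumed to appear once, and inputs are plain tuples rather than files.

```python
from collections import defaultdict


def canonicalize(url):
    """Hypothetical stand-in for the Canonicalize UDF: lowercase, drop trailing slash."""
    return url.lower().rstrip("/")


def good_users(visits, pages, threshold=0.5):
    """Join visits to pages by URL, average pagerank per user, keep users above threshold."""
    pagerank = dict(pages)                      # url -> pagerank (assumes unique URLs)
    ranks_by_user = defaultdict(list)
    for user, url, _time in visits:
        url = canonicalize(url)
        if url in pagerank:                     # inner join on url
            ranks_by_user[user].append(pagerank[url])
    averages = {u: sum(r) / len(r) for u, r in ranks_by_user.items()}
    return {u: avg for u, avg in averages.items() if avg > threshold}


visits = [("amy", "HTTP://A.com/", 1), ("amy", "http://b.com", 2), ("bob", "http://b.com", 3)]
pages = [("http://a.com", 0.9), ("http://b.com", 0.3)]
result = good_users(visits, pages)
```

Each Pig statement maps to one step here: load, foreach, join, group, aggregate, filter. Pig compiles the same steps into MapReduce jobs so they can run over data that does not fit on one machine.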
Next Generation Hadoop(0.23)
                               YARN
                               (Next MapReduce Framework)




         HDFS Federation
NoSQL
      Scale-out
      Key/value, Document, Simple Column
      Schema Free
  Big Data
      x86
      CAP (Brewer's Conjecture)
  Eventually consistent / BASE (not ACID)
  Simple API


      Twitter: Cassandra, HBase, Hadoop, Scribe,
      FlockDB, Redis
      Facebook: Cassandra, HBase, Hadoop, Scribe,
      Hive
      Netflix: Amazon SimpleDB, Cassandra
      Digg: Cassandra
      SimpleGeo: Cassandra
      StumbleUpon: HBase, OpenTSDB
      Yahoo!: Hadoop, HBase, PNUTS
      Rackspace: Cassandra
      DAUM: MongoDB
      NCSoft: Cassandra
NoSQL: Cloudata/HBase
   Distributed Data Storage
          semi-structured data store (not a file system)
   Google Bigtable clone
          Data Model, Architecture, Features
   Open source
          http://www.cloudata.org
   Goal
          500 nodes
          300 GB/node, petabytes

   Create, drop, modify table schema
   Single row operation
   Multi row operation: like, between
   Scanner, Direct Uploader, MapReduce Adapter
   Automatic table split & re-assignment
   (Hadoop)
   Failover
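The single-row operations, range scans, and table splits listed above follow from keeping rows sorted by key, as in the Bigtable design. The toy Python class below illustrates that idea only. It is not the Cloudata or HBase API, and the class and method names are hypothetical.

```python
import bisect


class Tablet:
    """Toy sorted partition of a table: row key -> {column: value}."""

    def __init__(self, rows=()):
        self.keys = []      # row keys kept in sorted order
        self.rows = {}      # row key -> column dict
        for key, columns in rows:
            self.put(key, columns)

    def put(self, row_key, columns):
        """Single-row write: insert or update the columns of one row."""
        if row_key not in self.rows:
            bisect.insort(self.keys, row_key)
            self.rows[row_key] = {}
        self.rows[row_key].update(columns)

    def get(self, row_key):
        """Single-row read."""
        return self.rows.get(row_key)

    def scan(self, start, stop):
        """Multi-row read: yield rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        for key in self.keys[lo:hi]:
            yield key, self.rows[key]

    def split(self):
        """Split at the middle key into two tablets that can be reassigned."""
        mid = len(self.keys) // 2
        left = Tablet((k, self.rows[k]) for k in self.keys[:mid])
        right = Tablet((k, self.rows[k]) for k in self.keys[mid:])
        return left, right


t = Tablet()
for key in ("b", "a", "c"):
    t.put(key, {"info:name": key.upper()})
scanned = [k for k, _ in t.scan("a", "c")]
left, right = t.split()
```

In a real system each tablet would live on a different server and a master would track which key range each one covers. That part is omitted here.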
seenal.com
10.29

Facebook:
      babokim@gruter.com
          www.jaso.co.kr

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

1st Korea Community Day presentation: BigData

  • 1. Cloud Computing BigData 2011.12
  • 2.
  • 3. babokim@gmail.com (www.gruter.com); SDS, NHN; www.jaso.co.kr; www.cloudata.org; www.cloumon.org; www.twitter.com/babokim; www.facebook.com/babokim
  • 4. BigData Definition(1): What is BigData? Big Data (BD) definitions cited from McKinsey (2011) and IDC (2011). Also cited: Economist (2010.05), Gartner (2011.03), McKinsey (2011.05). Keywords: SNS, M2M, Information silo. Source: Big Data, (KT )
  • 5. BigData Definition(2): Very large, distributed aggregations of loosely structured data. Characteristics: petabytes/exabytes of data; millions/billions of people; billions/trillions of records; loosely structured and often distributed data; flat schemas with few complex interrelationships; often involving time-stamped events; often made up of incomplete data; often including connections between data elements that must be probabilistically inferred. Applications that involve big data can be transactional (e.g., Facebook, PhotoBox) or analytic (e.g., ClickFox, Merced Applications). Source: http://wikibon.org/wiki/v/Enterprise_Big-data
  • 6. Big-data Analytics Complements Data Warehouse. Traditional Data Warehouse: complete record from the transactional system; all data centralized; analytics designed against a stable environment; many reports run on a production basis. Big-data Analytic Environment: data from many sources inside and outside of the organization (including the traditional DW); data often physically distributed; need to iterate on the solution to test/improve models; large-memory analytics also part of the iteration; every iteration usually requires a complete reload of information. Source: http://wikibon.org/wiki/v/Enterprise_Big-data
  • 7. Facebook Social plug-in. Diagram labels: Transactional, Feedback, Analytic. Processes over 20 billion events per day (about 200,000 events per second) with a lag of less than 30 seconds.
  • 8. BigData (pipeline diagram). Stages: Collecting, Store, Analysis, Reporting/Searching. Components shown: SNS; Robot, RSS Reader, OpenAPI, Data Aggregator, ETL; Repository (DBMS, NoSQL), Index; Sentimental Analysis, Clustering, Classification, Indexing; User Define Query, Script.
  • 9. Twitter: backtype. Distribute tweets randomly on multiple queues. Workers share the load of schemifying tweets. Workers schemify tweets and append to Hadoop. Workers choose the queue to enqueue to using hash/mod of URL. All updates for the same URL are guaranteed to go to the same worker. Workers update statistics on URLs by incrementing counters in Cassandra.
  • 10. BigData Architectural Requirements. Scalability: scale-out; elasticity. Reliability: Hadoop. Flexibility: components; easy to add analysis rules; support for various data formats. Latency: real time, near real time, batch. High throughput: global web-scale traffic. Also mentioned: IBM, HP, Oracle; BI/DW.
  • 11. BigData (technology map): Flume, Scribe, Chukwa; Hadoop FileSystem, MogileFS; NoSQL (Cloudata, HBase, Cassandra); Katta, ElasticSearch; count, sum aggregation: S4, Storm; Hadoop MapReduce (Hive, Pig); Giraph, GoldenOrb; Cluster, Classification: Mahout, R; ZooKeeper, HUE, Cloumon; Serialization: Thrift, Avro, ProtoBuf
  • 12. Hadoop Ecosystem: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
  • 13. Software Stack (diagram). Interface: Web, Phone, Pad; Rule Management; Data Visualization. Analysis: (Near)Real-time Analysis; Batch Analysis. Management: Analysis Job Monitoring (cloumon); Cluster Management (ZooKeeper). Libraries: Mining Lib (Mahout); Statistics Lib (R); Script Language (Hive, Pig). Real-time Analysis Platform: CEP Engine (Esper). Batch: Job Workflow Engine (oozie, cascade); Data Analysis Platform (hadoop). Aggregator; Collector (flume, scribe, chukwa). Data Store: File System (HadoopFS); NoSQL (Cloudata, HBase, Cassandra); Search (ElasticSearch).
  • 14. Log collection (diagram): Application Server (Log4j, Agent, local log); Collector #1, Collector #2; Temp Log; Centralized Storage (HDFS). Chukwa (Yahoo): Hadoop FileSystem (HDFS), MapReduce. Scribe (Facebook): thrift, Hadoop JNI. Flume (Cloudera): Hadoop, HBase, Search Engine.
  • 15. Real-time analysis: Esper (Event); Gruter ClouStream, Yahoo S4, Twitter Storm, Facebook Puma
  • 16. Hadoop File System: BigData de facto standard; x86; NameNode SPOF (Single Point Of Failure)
  • 18. Hadoop MapReduce: MapReduce; Hadoop FileSystem, DB, FTP Server; FIFO, Fair, Capacity (job schedulers); streaming
  • 19. Script Language
    Hive:
      hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
      hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE invites;
      hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
      hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
    Pig:
      Visits = load '/data/visits' as (user, url, time);
      Visits = foreach Visits generate user, Canonicalize(url), time;
      Pages = load '/data/pages' as (url, pagerank);
      VP = join Visits by url, Pages by url;
      UserVisits = group VP by user;
      UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
      GoodUsers = filter UserPageranks by avgpr > 0.5;
      store GoodUsers into '/data/good_users';
  • 20. Next Generation Hadoop (0.23): YARN (Next MapReduce Framework); HDFS Federation
  • 21. NoSQL: scale-out; key/value, document, simple column; schema free; Big Data on x86; CAP (Brewer's Conjecture); eventually consistent / BASE (not ACID); simple API. Users: Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis. Facebook: Cassandra, HBase, Hadoop, Scribe, Hive. Netflix: Amazon SimpleDB, Cassandra. Digg: Cassandra. SimpleGeo: Cassandra. StumbleUpon: HBase, OpenTSDB. Yahoo!: Hadoop, HBase, PNUTS. Rackspace: Cassandra. DAUM: MongoDB. NCSoft: Cassandra.
  • 22. NoSQL: Cloudata/HBase. Distributed data storage; semi-structured data store (not a file system); Google Bigtable clone (data model, architecture, features); open source: http://www.cloudata.org. Goal: 500 nodes, 300 GB/node, petabytes. Features: create, drop, modify table schema; single-row operations; multi-row operations: like, between; Scanner, Direct Uploader, MapReduce Adapter; automatic table split & re-assignment; failover (Hadoop).
  • 23.
  • 25.
  • 26.
  • 27. 10.29
  • 28.
  • 29. BigData
  • 30. Facebook: babokim@gruter.com; www.jaso.co.kr
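
Slide 7 quotes both a daily total (over 20 billion events) and a per-second rate (200,000). As a quick check, the conversion can be sketched as below; the function name is hypothetical, and the result is only the average rate implied by the daily figure, so the slide's 200,000 reads as a rounded value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def events_per_second(events_per_day: float) -> float:
    """Average event rate implied by a daily total (hypothetical helper)."""
    return events_per_day / SECONDS_PER_DAY


# "over 20 billion events per day" is the figure quoted on slide 7
daily_events = 20_000_000_000
avg_rate = events_per_second(daily_events)
print(f"average rate: {avg_rate:,.0f} events/sec")
```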
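
Slide 9 says workers pick a queue using hash/mod of the URL so that all updates for one URL reach the same worker. A minimal sketch of that routing follows, assuming four queues, an MD5-based hash, and an in-memory dictionary standing in for the Cassandra counters; none of these specifics come from the slide.

```python
import hashlib
from collections import defaultdict

NUM_QUEUES = 4  # assumption: the slide does not state the number of queues


def queue_for_url(url: str, num_queues: int = NUM_QUEUES) -> int:
    """Map a URL to a queue index with a stable hash, then mod by queue count."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


# Hypothetical (url, increment) updates extracted from tweets
updates = [
    ("http://example.com/a", 1),
    ("http://example.com/b", 1),
    ("http://example.com/a", 1),
]

# Enqueue: every update for the same URL lands on the same queue
queues = defaultdict(list)
for url, delta in updates:
    queues[queue_for_url(url)].append((url, delta))

# Each worker drains one queue and increments per-URL counters.
# The dict below is only a stand-in for counters stored in Cassandra.
counters = defaultdict(int)
for queue_id, items in queues.items():
    for url, delta in items:
        counters[url] += delta

print(dict(counters))
```

A stable hash is used here instead of Python's built-in `hash()`, which is randomized per process for strings and so would not route consistently across workers.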
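
Slide 18 names Hadoop MapReduce without showing the programming model. The following is a single-process sketch of the map, shuffle, and reduce phases using word count as the job; it does not use the Hadoop API, and all function names are illustrative assumptions.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, group values by key (shuffle), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map phase emits (key, value) pairs
            groups[key].append(value)      # shuffle: collect values per key
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


lines = ["big data big cluster", "data store"]  # hypothetical input
counts = map_reduce(lines, word_mapper, sum_reducer)
print(counts)
```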
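
The Pig script on slide 19 joins visits to pages by URL, groups by user, averages the pagerank, and keeps users whose average exceeds 0.5. The same data flow can be sketched in plain Python as below; the sample rows, the `canonicalize` stand-in for the `Canonicalize` function, and the assumption that each URL appears once in the pages data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def canonicalize(url: str) -> str:
    """Hypothetical stand-in for the Canonicalize function in the Pig script."""
    return url.lower().rstrip("/")


def good_users(visits, pages, threshold=0.5):
    """Join visits to pages by URL, average pagerank per user, filter by threshold."""
    ranks_by_user = defaultdict(list)
    for user, url, _time in visits:
        url = canonicalize(url)
        if url in pages:  # inner join: visits without a matching page are dropped
            ranks_by_user[user].append(pages[url])
    averages = {user: mean(ranks) for user, ranks in ranks_by_user.items()}
    return {user: avg for user, avg in averages.items() if avg > threshold}


# Hypothetical sample data: (user, url, time) and url -> pagerank
visits = [
    ("alice", "http://A.com/", 1),
    ("alice", "http://b.com", 2),
    ("bob", "http://b.com", 3),
]
pages = {"http://a.com": 0.9, "http://b.com": 0.3}

result = good_users(visits, pages)
print(result)
```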