How to collect Big Data into Hadoop
Big Data processing to collect Big Data

fluentd.org
Sadayuki Furuhashi
Self-introduction

>   Sadayuki Furuhashi
>   Treasure Data, Inc.
    Founder & Software Architect

>   Open source projects
    MessagePack - efficient serializer (original author)
    Fluentd - event collector (original author)
We’re Hiring!
sf@treasure-data.com
Today’s topic
Big Data




Report &
Monitor
Big Data



Collect   Store   Process   Visualize




             Report &
             Monitor
easier & shorter time



Collect   Store Process             Visualize




          Cloudera                Excel
          Hortonworks             Tableau
          MapR                    R
How to shorten here?   easier & shorter time



       Collect         Store Process Visualize




                       Cloudera                Excel
                       Hortonworks             Tableau
                       MapR                    R
Problems with collecting data
Poor man’s data collection

1. Copy files from servers using rsync


2. Create a RegExp to parse the files


3. Parse the files and generate a 10GB CSV file


4. Put it into HDFS
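
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the speaker's tool: the pattern assumes an Apache-style access log, and the names LOG_PATTERN and parse_access_log are hypothetical. Writing the CSV file and loading it into HDFS are left out.

```ruby
# Assumed log layout: host ident user [time] "method path protocol" code size
LOG_PATTERN = /\A(?<host>\S+) \S+ (?<user>\S+) \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+)[^"]*" (?<code>\d{3}) (?<size>\d+|-)\z/

# Parses each line with the regexp. Lines that do not match are kept
# aside instead of aborting the whole run.
def parse_access_log(lines)
  records = []
  broken = []
  lines.each do |line|
    m = LOG_PATTERN.match(line.chomp)
    if m
      records << {
        "host" => m[:host],
        "user" => m[:user],
        "time" => m[:time],
        "method" => m[:method],
        "path" => m[:path],
        "code" => m[:code].to_i,
        "size" => m[:size] == "-" ? 0 : m[:size].to_i
      }
    else
      broken << line
    end
  end
  [records, broken]
end
```

Even this small version has to decide what to do with lines that do not match, which is the error handling the following slides talk about.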
Problems with collecting “big data”
>   Includes broken values
    >   needs error handling & retrying
>   Time-series data are changing and unclear
    >   parse logs before storing
>   Takes time to read/write
    >   tools have to be optimized and parallelized
>   Takes time for trial & error
>   Causes network traffic spikes
Problems with poor man’s data collection


>   Wastes time implementing error handling
>   Wastes time maintaining a parser
>   Wastes time debugging the tool
>   Not reliable
>   Not efficient
Basic theories for collecting big data
Divide & Conquer


         error




         error
Divide & Conquer & Retry


          error      retry


                     retry




          error      retry   retry
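
The divide, conquer and retry idea in the diagrams above can be sketched as below. It is a minimal illustration under assumptions: process_in_chunks is a hypothetical helper, the block stands for whatever delivers one chunk, and any StandardError counts as a failed delivery.

```ruby
# Splits records into small chunks and delivers each chunk on its own.
# A failure only causes that one chunk to be retried, not the whole data set.
# Returns the chunks that still failed after max_retries retries.
def process_in_chunks(records, chunk_size:, max_retries: 2)
  failed = []
  records.each_slice(chunk_size) do |chunk|
    attempts = 0
    begin
      yield chunk
    rescue StandardError
      attempts += 1
      retry if attempts <= max_retries
      failed << chunk
    end
  end
  failed
end
```

Because an error is confined to one chunk, the cost of a retry is bounded by the chunk size rather than by the size of the whole input.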
Streaming




Don’t handle big files here   Do it here
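
One way to read the streaming slide above is that data should be handled in small pieces as it arrives instead of as one big file. A minimal sketch, assuming line-oriented input; each_batch and batch_size are hypothetical names, and StringIO stands in for a real log stream.

```ruby
require "stringio"

# Reads the input line by line and yields small batches, so memory use
# is bounded by batch_size rather than by the size of the input.
def each_batch(io, batch_size: 1000)
  batch = []
  io.each_line do |line|
    line = line.chomp
    next if line.empty?
    batch << line
    if batch.size >= batch_size
      yield batch
      batch = []
    end
  end
  # Flush whatever is left when the input ends.
  yield batch unless batch.empty?
end

# Usage: collect the batches produced from a small in-memory stream.
batches = []
each_batch(StringIO.new("a\nb\n\nc\n"), batch_size: 2) { |b| batches << b }
```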
Apache Flume and Fluentd
Apache Flume


access logs    Agent
   app logs    Agent   Collector

system logs    Agent   Collector
               Agent
         ...
Apache Flume - network topology
                     Master
           Agent                       ack

           Agent    Collector
Flume OG                               Collector
           Agent    Collector   send
           Agent



           Agent
                                send/ack
           Agent    Collector
Flume NG                               Collector
           Agent    Collector
           Agent
Apache Flume - pipeline


                 Source           Sink
Flume OG


                                         plugin


                 Source Channel   Sink
Flume NG
Apache Flume - configuration

                    Master

                                  Master manages
                                  all configuration
                                         (optional)
           Agent
           Agent      Collector
Flume NG                                  Collector
           Agent      Collector
           Agent
Apache Flume - configuration
# source
host1.sources = avro-source1
host1.sources.avro-source1.type = avro
host1.sources.avro-source1.bind = 0.0.0.0
host1.sources.avro-source1.port = 41414
host1.sources.avro-source1.channels = ch1

# channel
host1.channels = ch1
host1.channels.ch1.type = memory

# sink
host1.sinks = log-sink1
host1.sinks.log-sink1.type = logger
host1.sinks.log-sink1.channel = ch1
Fluentd
Fluentd - network topology

           Agent
                                 send/ack
           Agent     Collector
Flume NG                              Collector
           Agent     Collector
           Agent



           fluentd
                                 send/ack
           fluentd     fluentd
Fluentd                                fluentd
           fluentd     fluentd
           fluentd
Fluentd - pipeline


                     Source Channel    Sink
Flume NG


                                               plugin


                     Input   Buffer   Output
Fluentd
Fluentd - configuration

           fluentd
           fluentd         fluentd
Fluentd                                  fluentd
           fluentd         fluentd
           fluentd


          Use Chef, Puppet, etc. for configuration
          (they do things better)
          No central node - keep things simple
Fluentd - configuration


<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>
Fluentd - configuration
                       # source
                       host1.sources = avro-source1
                       host1.sources.avro-source1.type = avro
<source>               host1.sources.avro-source1.bind = 0.0.0.0
  type forward         host1.sources.avro-source1.port = 41414
  port 24224           host1.sources.avro-source1.channels = ch1
</source>
                       # channel
<match **>             host1.channels = ch1
  type file            host1.channels.ch1.type = memory
  path /var/log/logs
</match>               # sink
                       host1.sinks = log-sink1
                       host1.sinks.log-sink1.type = logger
                       host1.sinks.log-sink1.channel = ch1
Fluentd - Users
Fluentd - plugin distribution platform



$ fluent-gem search -rd fluent-plugin


$ fluent-gem install fluent-plugin-mongo
                               94 plugins!
Concept of Fluentd


Customization is essential
 >   small core + many plugins


Fluentd core helps to implement plugins
 >   common features are already implemented
Fluentd core       Plugins


Divide & Conquer
Retrying
                   read / receive data
Parallelize
                   write / send data
Error handling
Message routing
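
Message routing in the core is driven by tags and match patterns such as the ** used in the configuration slides. The sketch below is a simplified illustration, not Fluentd's implementation: tag_match? and route are hypothetical helpers, and only the * (exactly one tag part) and ** (zero or more tag parts) wildcards are covered.

```ruby
# Returns true when a dot-separated tag matches a pattern.
def tag_match?(pattern, tag)
  match_parts(pattern.split("."), tag.split("."))
end

def match_parts(pattern_parts, tag_parts)
  return tag_parts.empty? if pattern_parts.empty?
  head, *rest = pattern_parts
  case head
  when "**"
    # Try letting ** consume 0, 1, 2, ... tag parts.
    (0..tag_parts.length).any? { |n| match_parts(rest, tag_parts.drop(n)) }
  when "*"
    !tag_parts.empty? && match_parts(rest, tag_parts.drop(1))
  else
    !tag_parts.empty? && tag_parts.first == head && match_parts(rest, tag_parts.drop(1))
  end
end

# Picks the output of the first rule whose pattern matches the tag.
# rules is an array of [pattern, output] pairs in configuration order.
def route(rules, tag)
  rule = rules.find { |pattern, _output| tag_match?(pattern, tag) }
  rule && rule[1]
end
```

With rules such as [["myapp.api.*", :mongo], ["**", :file]], an event tagged myapp.api.heartbeat goes to the first output and everything else falls through to the second.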
Fluentd plugins
in_tail

          apache
                                fluentd



                   access.log



                                 ✓ read a log file
                                 ✓ custom regexp
                                 ✓ custom parser in Ruby
out_mongo

    apache
                           fluentd
                 in_tail


             access.log    buffer



                                    ✓ retry automatically
                                    ✓ exponential retry wait
                                    ✓ persistent on a file
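
The "exponential retry wait" bullet above can be illustrated as follows. This is a minimal sketch, not the plugin's code: with_retry, max_retries, base_wait and the injectable sleeper are hypothetical names, and the wait simply doubles after every failed attempt.

```ruby
# Runs the block and retries it on failure, waiting base_wait, then
# 2 * base_wait, then 4 * base_wait, and so on between attempts.
# Re-raises the last error once max_retries retries have been used up.
def with_retry(max_retries: 5, base_wait: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    yield attempt
  rescue StandardError
    raise if attempt >= max_retries
    sleeper.call(base_wait * (2**attempt))
    attempt += 1
    retry
  end
end
```

Doubling the wait gives a struggling destination time to recover instead of hammering it at a fixed rate. Keeping the buffer on a file is what lets pending data survive a restart between retries.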
out_s3

      apache
                                fluentd
                     in_tail


                 access.log     buffer       Amazon S3

   ✓ slice files based on time
                                         ✓ retry automatically
         2013-01-01/01/access.log.gz     ✓ exponential retry wait
         2013-01-01/02/access.log.gz     ✓ persistent on a file
         2013-01-01/03/access.log.gz
         ...
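
The time-based slicing shown above maps each event time to an hourly path. A minimal sketch, assuming UTC and the path layout from the slide; slice_path and group_by_slice are hypothetical helpers rather than part of out_s3.

```ruby
# Maps an event time (seconds since the epoch) to an hourly slice path
# such as 2013-01-01/01/access.log.gz.
def slice_path(epoch_seconds, name: "access.log.gz")
  Time.at(epoch_seconds).utc.strftime("%Y-%m-%d/%H/") + name
end

# Groups [time, record] pairs by the slice they belong to.
def group_by_slice(events)
  events.group_by { |time, _record| slice_path(time) }
end
```

Each group can then be compressed and uploaded as one object, so a failed upload only affects one hour of data.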
out_hdfs                                   ✓ custom text formatter



      apache
                                fluentd
                     in_tail


                access.log      buffer            HDFS

   ✓ slice files based on time
                                         ✓ retry automatically
       2013-01-01/01/access.log.gz       ✓ exponential retry wait
       2013-01-01/02/access.log.gz       ✓ persistent on a file
       2013-01-01/03/access.log.gz
       ...
out_hdfs                                 ✓ automatic fail-over
                                         ✓ load balancing

                                                   fluentd
      apache
                                fluentd             fluentd
                     in_tail
                                                   fluentd

                access.log      buffer


   ✓ slice files based on time
                                          ✓ retry automatically
       2013-01-01/01/access.log.gz        ✓ exponential retry wait
       2013-01-01/02/access.log.gz        ✓ persistent on a file
       2013-01-01/03/access.log.gz
       ...
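
The fail-over and load-balancing bullets above can be pictured as round-robin delivery that skips a node when sending to it fails. This is a minimal sketch and not Fluentd's implementation: the class name RoundRobinSender and its block protocol (the block raises when a node is unreachable) are assumptions made for illustration.

```ruby
# Sends each chunk to the next node in turn. If sending raises,
# the following node is tried, until every node has been tried once.
class RoundRobinSender
  def initialize(nodes)
    @nodes = nodes
    @next_index = 0
  end

  # Yields (node, chunk) to the caller-supplied block that does the actual
  # sending. Returns the node that accepted the chunk.
  def send_chunk(chunk)
    @nodes.length.times do
      node = @nodes[@next_index]
      @next_index = (@next_index + 1) % @nodes.length
      begin
        yield node, chunk
        return node
      rescue StandardError
        # This node failed; fall through and try the next one.
      end
    end
    raise "no node available"
  end
end
```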
Fluentd examples
Fluentd at Treasure Data - REST API logs
                       fluent-logger-ruby
API servers            + in_forward


  Rails app
              fluentd



  Rails app
              fluentd


         out_forward

        watch server             fluentd
Fluentd at Treasure Data - backend logs
                       fluent-logger-ruby
API servers            + in_forward        worker servers


  Rails app                                  Ruby app
              fluentd                                    fluentd



  Rails app                                  Ruby app
              fluentd                                    fluentd


         out_forward

        watch server             fluentd
Fluentd at Treasure Data - monitoring
                       fluent-logger-ruby
API servers            + in_forward           worker servers


  Rails app                      Queue          Ruby app
              fluentd                                         fluentd


                               PerfectQueue
  Rails app                                     Ruby app
              fluentd                                         fluentd


         out_forward              script
                                                 in_exec
                                 fluentd       watch server
Fluentd at Treasure Data - Hadoop logs


✓ resource consumption
  statistics for each user    Hadoop
✓ capacity monitoring        JobTracker



                                        thrift API call



                               script
                                                     in_exec
                              fluentd             watch server
Fluentd at Treasure Data - store & analyze


                                  fluentd        watch server

                      out_tdlog            out_metricsense
                                           ✓ streaming aggregation



     Treasure Data                                    Librato Metrics
for historical analysis                               for realtime analysis
Plugin development
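class SomeInput < Fluent::Input
  # Register the plugin so that "type myin" in <source> selects it
  Fluent::Plugin.register_input('myin', self)

  # Tag attached to every emitted event (set in the configuration)
  config_param :tag, :string

  def start
    # Emit events from a background thread
    Thread.new {
      while true
        time = Fluent::Engine.now
        record = {"user"=>1, "size"=>1}
        Fluent::Engine.emit(@tag, time, record)
      end
    }
  end

  def shutdown
    ...
  end
end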



<source>
  type myin
  tag myapp.api.heartbeat
</source>
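require 'json'

class SomeOutput < Fluent::BufferedOutput
  # Register the plugin so that "type myout" in <match> selects it
  Fluent::Plugin.register_output('myout', self)

  config_param :myparam, :string

  # Serialize one event as a JSON line; lines accumulate in a buffer chunk
  def format(tag, time, record)
    [tag, time, record].to_json + "\n"
  end

  # Called with a whole buffered chunk when it is flushed
  def write(chunk)
    puts chunk.read
  end
end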


<match **>
  type myout
  myparam foobar
</match>
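
To make the format and write pair above more concrete, the sketch below shows what a chunk built from JSON lines looks like and how a write step could split it back into events. It runs without Fluentd: format_event mirrors the format method, and a StringIO stands in for the buffer chunk, which is an assumption made only for illustration.

```ruby
require "json"
require "stringio"

# Same shape as the format method above: one JSON array per line.
def format_event(tag, time, record)
  [tag, time, record].to_json + "\n"
end

# Stand-in for a buffer chunk (assumption): formatted events appended in order.
chunk = StringIO.new
chunk << format_event("myapp.api.heartbeat", 1356998400, { "user" => 1, "size" => 1 })
chunk << format_event("myapp.api.heartbeat", 1356998401, { "user" => 2, "size" => 3 })
chunk.rewind

# A write step can split the chunk back into individual events.
events = chunk.read.each_line.map { |line| JSON.parse(line) }
```

Because every event ends with a newline, a chunk can be written to a destination as is or parsed back line by line.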
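class MyTailInput < Fluent::TailInput
  # Register the plugin so that "type mytail" in <source> selects it
  Fluent::Plugin.register_input('mytail', self)

  def configure_parser(conf)
    ...
  end

  # Parse one tab-separated line into a time and a record
  def parse_line(line)
    array = line.split("\t")
    time = Fluent::Engine.now
    record = {"user"=>array[0], "item"=>array[1]}
    return time, record
  end
end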




<source>
  type mytail
</source>
Fluentd v11


Error stream
Streaming processing
Better DSL
Multiprocess

Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
 
Making KVS 10x Scalable
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x Scalable
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with Embulk
 
Embulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダEmbulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダ
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
 
Embuk internals
Embuk internalsEmbuk internals
Embuk internals
 
Embulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderEmbulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loader
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
 

How to collect Big Data into Hadoop

  • 1. How to collect Big Data into Hadoop Big Data processing to collect Big Data fluentd.org Sadayuki Furuhashi
  • 2. Self-introduction > Sadayuki Furuhashi > Treasure Data, Inc. Founder & Software Architect > Open source projects MessagePack - efficient serializer (original author) Fluentd - event collector (original author)
  • 6. Big Data Collect Store Process Visualize Report & Monitor
  • 7. easier & shorter time Collect Store Process Visualize (Store, Process: Cloudera, Hortonworks, MapR; Visualize: Excel, Tableau, R)
  • 8. How to shorten here? easier & shorter time Collect Store Process Visualize (Store, Process: Cloudera, Hortonworks, MapR; Visualize: Excel, Tableau, R)
  • 10. Poor man’s data collection 1. Copy files from servers using rsync 2. Create a RegExp to parse the files 3. Parse the files and generate a 10GB CSV file 4. Put it into HDFS
  • 11. Problems to collect “big data” > Includes broken values > needs error handling & retrying > Time-series data are changing and unclear > parse logs before storing > Takes time to read/write > tools have to be optimized and parallelized > Takes time for trial & error > Causes network traffic spikes
  • 12. Problem of poor man’s data collection > Wastes time to implement error handling > Wastes time to maintain a parser > Wastes time to debug the tool > Not reliable > Not efficient
  • 14. Divide & Conquer error error
  • 15. Divide & Conquer & Retry error retry retry error retry retry
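The divide, conquer and retry idea from slides 14 and 15 can be sketched in plain Ruby. This is only an illustration under assumptions: `deliver_in_chunks`, `chunk_size` and `max_retries` are hypothetical names, and the sender block stands in for whatever actually ships the data. The point is that a failure costs one small chunk, not the whole batch.

```ruby
# Split records into small chunks and retry each chunk on its own,
# so one failure never forces the whole batch to be redone.
# All names here are hypothetical; this is not Fluentd code.
def deliver_in_chunks(records, chunk_size:, max_retries:)
  failed = []
  records.each_slice(chunk_size) do |chunk|
    attempts = 0
    begin
      attempts += 1
      yield chunk                      # sender supplied by the caller
    rescue StandardError
      retry if attempts <= max_retries # re-run only this chunk
      failed << chunk                  # give up on this chunk only
    end
  end
  failed
end

# Usage: a sender that fails the first time it sees each chunk.
seen = {}
delivered = []
failed = deliver_in_chunks((1..10).to_a, chunk_size: 3, max_retries: 2) do |chunk|
  unless seen[chunk]
    seen[chunk] = true
    raise IOError, "transient failure"
  end
  delivered.concat(chunk)
end
```

With this sender every chunk succeeds on its second attempt, so `failed` stays empty and `delivered` holds all ten records.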
  • 16. Streaming Don’t handle big files here Do it here
  • 17. Apache Flume and Fluentd
  • 19. Apache Flume access logs Agent app logs Agent Collector system logs Agent Collector Agent ...
  • 20. Apache Flume - network topology Master Agent ack Agent Collector Flume OG Collector Agent Collector send Agent Agent send/ack Agent Collector Flume NG Collector Agent Collector Agent
  • 21. Apache Flume - pipeline Source Sink Flume OG plugin Source Channel Sink Flume NG
  • 22. Apache Flume - configuration Master Master manages all configuration (optional) Agent Agent Collector Flume NG Collector Agent Collector Agent
  • 23. Apache Flume - configuration
        # source
        host1.sources = avro-source1
        host1.sources.avro-source1.type = avro
        host1.sources.avro-source1.bind = 0.0.0.0
        host1.sources.avro-source1.port = 41414
        host1.sources.avro-source1.channels = ch1
        # channel
        host1.channels = ch1
        host1.channels.ch1.type = memory
        # sink
        host1.sinks = log-sink1
        host1.sinks.log-sink1.type = logger
        host1.sinks.log-sink1.channel = ch1
  • 25. Fluentd - network topology Agent send/ack Agent Collector Flume NG Collector Agent Collector Agent fluentd send/ack fluentd fluentd Fluentd fluentd fluentd fluentd fluentd
  • 26. Fluentd - pipeline Source Channel Sink Flume NG plugin Input Buffer Output Fluentd
  • 27. Fluentd - configuration fluentd fluentd fluentd Fluentd fluentd fluentd fluentd fluentd Use chef, puppet, etc. for configuration (they do things better) No central node - keep things simple
  • 28. Fluentd - configuration
        <source>
          type forward
          port 24224
        </source>
        <match **>
          type file
          path /var/log/logs
        </match>
  • 29. Fluentd - configuration (Flume NG and Fluentd side by side)
        Flume NG:
        # source
        host1.sources = avro-source1
        host1.sources.avro-source1.type = avro
        host1.sources.avro-source1.bind = 0.0.0.0
        host1.sources.avro-source1.port = 41414
        host1.sources.avro-source1.channels = ch1
        # channel
        host1.channels = ch1
        host1.channels.ch1.type = memory
        # sink
        host1.sinks = log-sink1
        host1.sinks.log-sink1.type = logger
        host1.sinks.log-sink1.channel = ch1
        Fluentd:
        <source>
          type forward
          port 24224
        </source>
        <match **>
          type file
          path /var/log/logs
        </match>
  • 31. Fluentd - plugin distribution platform
        $ fluent-gem search -rd fluent-plugin
        $ fluent-gem install fluent-plugin-mongo
  • 32. Fluentd - plugin distribution platform
        $ fluent-gem search -rd fluent-plugin
        $ fluent-gem install fluent-plugin-mongo
        94 plugins!
  • 33. Concept of Fluentd Customization is essential > small core + many plugins Fluentd core helps to implement plugins > common features are already implemented
  • 34. Fluentd core: Divide & Conquer, Retrying, Parallelize, Error handling, Message routing. Plugins: read / receive data, write / send data
  • 36. in_tail apache fluentd access.log ✓ read a log file ✓ custom regexp ✓ custom parser in Ruby
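What a custom regexp for an access log might look like can be sketched in plain Ruby. The pattern, the field names and `parse_access_line` below are illustrative assumptions, not in_tail's built-in Apache format; the sample line follows the Apache common log layout.

```ruby
require "time"

# Hypothetical pattern for one line of an Apache common-format access log.
ACCESS_LOG = /\A(?<host>\S+) \S+ (?<user>\S+) \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) [^"]*" (?<code>\d{3}) (?<size>\d+|-)\z/

# Returns [unix_time, record], or nil when the line does not match,
# so a broken line can be skipped without stopping the reader.
def parse_access_line(line)
  m = ACCESS_LOG.match(line.chomp)
  return nil unless m

  time = Time.strptime(m[:time], "%d/%b/%Y:%H:%M:%S %z")
  record = {
    "host"   => m[:host],
    "user"   => m[:user],
    "method" => m[:method],
    "path"   => m[:path],
    "code"   => m[:code].to_i,
    "size"   => m[:size].to_i,  # "-" (no body) becomes 0
  }
  [time.to_i, record]
end

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
parsed_time, parsed_record = parse_access_line(sample)
broken = parse_access_line("not an access log line")
```

Returning `nil` for lines that do not match keeps the "includes broken values" problem from slide 11 local to one line.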
  • 37. out_mongo apache fluentd in_tail access.log buffer
  • 38. out_mongo apache fluentd in_tail access.log buffer ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file
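The "exponential retry wait" item can be illustrated with a few lines of Ruby. This is a sketch of the general technique, not Fluentd's implementation: `retry_wait`, `base` and `max` are hypothetical names and the values are arbitrary.

```ruby
# Wait before the n-th retry: doubles on every attempt, capped at `max` seconds.
# Hypothetical helper; the names and defaults are assumptions.
def retry_wait(attempt, base: 1.0, max: 60.0)
  [base * (2**(attempt - 1)), max].min
end

waits = (1..8).map { |attempt| retry_wait(attempt) }
# waits is [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Doubling the wait keeps a struggling destination from being hit at a constant rate, and the cap bounds how long data sits in the buffer between attempts.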
  • 39. out_s3 apache fluentd in_tail access.log buffer Amazon S3 ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file 2013-01-01/01/access.log.gz 2013-01-01/02/access.log.gz 2013-01-01/03/access.log.gz ...
  • 40. out_hdfs apache fluentd in_tail access.log buffer HDFS ✓ custom text formatter ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file 2013-01-01/01/access.log.gz 2013-01-01/02/access.log.gz 2013-01-01/03/access.log.gz ...
  • 41. out_hdfs apache fluentd in_tail access.log buffer fluentd fluentd fluentd ✓ automatic fail-over ✓ load balancing ✓ slice files based on time ✓ retry automatically ✓ exponential retry wait ✓ persistent on a file 2013-01-01/01/access.log.gz 2013-01-01/02/access.log.gz 2013-01-01/03/access.log.gz ...
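The "slice files based on time" item amounts to deriving the output path from each event's timestamp. A minimal sketch in plain Ruby, assuming hourly slices in UTC and the path layout shown on the slides; `sliced_path` is a hypothetical helper, not a plugin API.

```ruby
# Build an hourly path such as "2013-01-01/01/access.log.gz" from an event time.
# Slicing in UTC is an assumption made for this sketch.
def sliced_path(event_time, file_name = "access.log.gz")
  event_time.getutc.strftime("%Y-%m-%d/%H/") + file_name
end

events = [
  [Time.utc(2013, 1, 1, 1, 5),  "GET /a"],
  [Time.utc(2013, 1, 1, 1, 59), "GET /b"],
  [Time.utc(2013, 1, 1, 2, 0),  "GET /c"],
]

# Events that fall into the same hour end up under the same path.
slices = events.group_by { |time, _line| sliced_path(time) }
```

Here the first two events share `2013-01-01/01/access.log.gz` and the third goes to `2013-01-01/02/access.log.gz`.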
  • 43. Fluentd at Treasure Data - REST API logs fluent-logger-ruby API servers + in_forward Rails app fluentd Rails app fluentd out_forward watch server fluentd
  • 44. Fluentd at Treasure Data - backend logs fluent-logger-ruby API servers + in_forward worker servers Rails app Ruby app fluentd fluentd Rails app Ruby app fluentd fluentd out_forward watch server fluentd
  • 45. Fluentd at Treasure Data - monitoring fluent-logger-ruby API servers + in_forward worker servers Rails app Queue Ruby app fluentd fluentd PerfectQueue Rails app Ruby app fluentd fluentd out_forward script in_exec fluentd watch server
  • 46. Fluentd at Treasure Data - Hadoop logs ✓ resource consumption statistics for each user Hadoop ✓ capacity monitoring JobTracker thrift API call script in_exec fluentd watch server
  • 47. Fluentd at Treasure Data - store & analyze fluentd watch server out_tdlog out_metricsense ✓ streaming aggregation Treasure Data Librato Metrics for historical analysis for realtime analysis
  • 48.
  • 49.
  • 51.
        class SomeInput < Fluent::Input
          # register this class as input plugin type 'myin'
          Fluent::Plugin.register_input('myin', self)
          config_param :tag, :string

          def start
            # emit one record after another from a background thread
            Thread.new {
              while true
                time = Engine.now
                record = {"user"=>1, "size"=>1}
                Engine.emit(@tag, time, record)
              end
            }
          end

          def shutdown
            ...
          end
        end

        <source>
          type myin
          tag myapp.api.heartbeat
        </source>
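  • 52.
        class SomeOutput < Fluent::BufferedOutput
          # register this class as output plugin type 'myout'
          Fluent::Plugin.register_output('myout', self)
          config_param :myparam, :string

          # serialize one event into the buffer, one JSON line per event
          def format(tag, time, record)
            [tag, time, record].to_json + "\n"
          end

          # called with a whole buffered chunk when it is flushed
          def write(chunk)
            puts chunk.read
          end
        end

        <match **>
          type myout
          myparam foobar
        </match>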
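How `format` and `write` divide the work in a buffered output can be shown with a toy model that runs without Fluentd. `TinyBufferedOutput`, `chunk_limit`, `emit` and `flush` are hypothetical names for this sketch; the real core also handles retries, persistence and routing, none of which is modeled here.

```ruby
require "json"

# Toy stand-in for a buffered output: events are formatted one by one,
# appended to an in-memory chunk, and written out in bulk.
class TinyBufferedOutput
  attr_reader :written

  def initialize(chunk_limit:)
    @chunk_limit = chunk_limit   # flush once the chunk reaches this many bytes
    @chunk = String.new
    @written = []
  end

  def emit(tag, time, record)
    @chunk << format(tag, time, record)
    flush if @chunk.bytesize >= @chunk_limit
  end

  # One JSON line per event, as in the slide's example.
  def format(tag, time, record)
    [tag, time, record].to_json + "\n"
  end

  def flush
    return if @chunk.empty?
    write(@chunk)
    @chunk = String.new
  end

  # A real plugin would send the chunk somewhere; here it is only collected.
  def write(chunk)
    @written << chunk
  end
end

out = TinyBufferedOutput.new(chunk_limit: 64)
3.times { |i| out.emit("myapp.api", 1_357_000_000 + i, { "user" => i }) }
out.flush
```

Because `write` receives whole chunks, the destination sees a few bulk writes instead of one call per event.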
  • 53.
        class MyTailInput < Fluent::TailInput
          # register this class as input plugin type 'mytail'
          Fluent::Plugin.register_input('mytail', self)

          def configure_parser(conf)
            ...
          end

          # turn one tab-separated line into a time and a record
          def parse_line(line)
            array = line.split("\t")
            time = Engine.now
            record = {"user"=>array[0], "item"=>array[1]}
            return time, record
          end
        end

        <source>
          type mytail
        </source>
  • 54.
  • 55. Fluentd v11: Error stream, Streaming processing, Better DSL, Multiprocess