Hadoop Conf 2014 - Hadoop BigQuery Connector

Hadoop
BigQuery Connector
Simon Su & Sunny Hu @ MiCloud

I am Simon Su
var simon = {};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = 'simonsu.mail@gmail.com';
simon.say('Good luck to everybody!');

I am Sunny Hu
var sunny = {};
sunny.aboutme = 'https://plus.google.com/u/0/+sunnyHU/posts';
sunny.email = 'sunnyhu@mitac.com.tw';
sunny.language = ['Java', '.NET', 'NodeJS', 'SQL'];
sunny.skill = ['Project management', 'System Analysis',
  'System design', 'Car ho lan'];
sunny.say('Writing code is tedious, so keep your mood sunny');

● We are the Su-Hu duo ...

We are MiCloud
● 2011/11 MiCloud Launch
● 2013/2 Google Apps Partner
● 2013/9 Google Cloud Partner
● 2014/4 Google Cloud Launch

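Background
● Dremel (BigQuery) can deliver stable service at massive scale
● 2013, average daily traffic served: 5,922,000,000 visits
● 2012, average daily traffic served: 5,134,000,000 visits
● 2011, average daily traffic served: 4,717,000,000 visits
● 2010, average daily traffic served: 3,627,000,000 visits
● 2009, average daily traffic served: 2,610,000,000 visits
● 2008, average daily traffic served: 1,745,000,000 visits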

What are the components of Hadoop...
● Strategy: your idea for filtering information from the given datasets
● MapReduce: mass computing power to load and process the requirements in parallel
● HDFS: persistent storage for parallel access, better with good performance...

You have a better choice in the cloud...
● Strategy: nothing can replace a good idea…, but fast...
● MapReduce: cloud machines with unlimited resources, better with lower and scalable pricing...
● HDFS: object storage services, like Google Cloud Storage, AWS S3...

● The fast way to run Hadoop: Docker

Before Demo… Prepare
1. Install google_cloud_sdk
2. Install bdutil

Google Cloud SDK
curl https://sdk.cloud.google.com | bash

● Set up the default project
● Test the configuration...
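
A minimal sketch of these two steps, assuming the Cloud SDK is already installed. The `run` helper, the `DRY_RUN` variable and the function name are hypothetical and exist only so the commands can be previewed without touching a real account; the project ID is the example one used later in these slides.

```shell
# Hypothetical helper: print the command when DRY_RUN is set, otherwise run it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Set the default project, then list the active configuration to check it.
setup_default_project() {
  run gcloud config set project "$1"
  run gcloud config list
}

# Usage (preview only): DRY_RUN=1 setup_default_project hadoop-conf-2014
```

Unset `DRY_RUN` to execute the commands for real.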

Using bdutil...
https://developers.google.com/hadoop/setting-up-a-hadoop-cluster

bdutil scope
● Designed to create a Hadoop cluster quickly
● Quickly run a Hadoop task
● Quickly integrate Google's resources
● Quickly clean up finished resources

● bdutil deploy -e bigquery_env.sh

● The Administration console

TeraSort
https://www.mapr.com/fr/company/press/mapr-and-google-compute-engine-set-new-world-record-hadoop-terasort

You can win the game, too...
…. (skip)

BigQuery Connector
https://developers.google.com/hadoop/running-with-bigquery-connector

Cluster nodes: hadoop-m (master), hadoop-w-0 and hadoop-w-1 (workers)

Run a BigQuery Connector job...

Workflow...
1. Dump sample data from [publicdata:samples.shakespeare]
2. Run MapReduce to count word occurrences
3. Write the result to a specified BigQuery table
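
To make step 2 concrete, here is a local stand-in for the map and reduce phases using ordinary shell tools. It is only a sketch of the word-count idea, not the connector's code; the function names are made up for illustration.

```shell
# Map: emit one lowercase word per line (stands in for the Mapper).
map_words() {
  tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d'
}

# Shuffle + reduce: group identical words and count them
# (stands in for the sort phase and the Reducer).
reduce_counts() {
  sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Usage: printf 'To be or not to be\n' | map_words | reduce_counts
```

In the real job the input lines come from the BigQuery table and the reducer output is written back to BigQuery instead of standard output.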

Look into source code...
● BigQueryInputFormat class
● Input parameters
● Mapper
● BigQueryOutputFormat class
● Output parameters
● Reducer

BigQueryInputFormat
● Using a user-specified query to select the appropriate BigQuery objects.
● Splitting the results of the query evenly among the Hadoop nodes.
● Parsing the splits into Java objects to pass to the mapper.

Input parameters
● Project ID: GCP project ID, e.g. hadoop-conf-2014
● Input Table ID: [optional projectId]:[datasetId].[tableId]
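
The table ID format above can be illustrated with a small parser. This is a sketch under assumptions: `parse_table_id` is a hypothetical helper, not part of the connector, and it assumes the dataset and table names contain no colon or dot.

```shell
# Split "[projectId:]datasetId.tableId" into "project dataset table".
# The second argument is the fallback project used when the prefix is omitted.
parse_table_id() {
  id="$1"
  default_project="$2"
  case "$id" in
    # Split at the last colon so a project ID may itself contain a colon.
    *:*) project="${id%:*}"; rest="${id##*:}" ;;
    *)   project="$default_project"; rest="$id" ;;
  esac
  echo "$project ${rest%%.*} ${rest#*.}"
}

# Usage: parse_table_id publicdata:samples.shakespeare hadoop-conf-2014
```

When the project prefix is left out, the sketch falls back to the project ID given as the second argument.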

BigQueryOutputFormat class
● Provides Hadoop with the ability to write JsonObject values directly into a BigQuery table
● An extension of the Hadoop OutputFormat class

Output parameters
● Project ID: GCP project ID, e.g. hadoop-conf-2014
● Output Table ID: [optional projectId]:[datasetId].[tableId]
● Output Table Schema: [{'name': 'Name','type': 'STRING'},
{'name': 'Number','type': 'INTEGER'}]
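
A schema string in the shape shown above can be assembled from name and type pairs. The helper below is a hypothetical convenience for illustration, not something shipped with bdutil or the connector.

```shell
# Build a schema string from NAME:TYPE pairs, e.g. Name:STRING Number:INTEGER.
make_schema() {
  schema=""
  for field in "$@"; do
    entry="{'name': '${field%%:*}','type': '${field#*:}'}"
    # Join entries with commas, without a leading comma on the first one.
    schema="${schema:+$schema,}$entry"
  done
  echo "[$schema]"
}

# Usage: make_schema Name:STRING Number:INTEGER
```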

bdutil housekeeping...
https://developers.google.com/hadoop/setting-up-a-hadoop-cluster

● Game over - Delete the Hadoop cluster

Your cost in this lab...
VM (n1-standard-1) price * machines * hours
$0.070 USD/hour * 24 * 1
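
The arithmetic can be checked with a one-line helper. This assumes the figures on the slide read as hourly price, machine count and hours, in that order; `lab_cost` is a hypothetical name.

```shell
# lab_cost PRICE_PER_HOUR MACHINES HOURS -> total in USD, two decimals.
lab_cost() {
  awk -v price="$1" -v machines="$2" -v hours="$3" \
    'BEGIN { printf "%.2f\n", price * machines * hours }'
}

# Usage: lab_cost 0.070 24 1   # prints 1.68
```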

Today’s Demo
Using Docker...

● Using the Google-optimized Docker container
localhost:~$ gcloud compute instances create simon-docker \
> --image https://www.googleapis.com/compute/v1/projects/google-containers/global/images/container-vm-v20140522 \
> --zone asia-east1-a \
> --machine-type f1-micro
localhost:~$ gcloud compute ssh simon-docker
simonsu@simon-docker:~$ sudo docker search bdutil
simonsu@simon-docker:~$ docker run -it peihsinsu/bdutil bash

Other connectors
BigQuery connector for Hadoop
$ ./bdutil deploy -e bigquery_env.sh
Datastore connector for Hadoop
$ ./bdutil deploy -e datastore_env.sh
To use both BQ & Datastore
$ ./bdutil deploy -e datastore_env.sh,bigquery_env.sh

http://jsdc-tw.kktix.cc/events/jsdc2014
