Capital One
Hadoop Intro:
History
ETL/Analytics Practices in LinkedIn/Netflix/Yahoo
Next Gen ETL 2014+
Scaling Layers
Hadoop Distributions
Analytics
1/7/2014
Hadoop/HBase


Original requirements
− GFS: storing internet HTML pages on disk for later analytics
− BigTable: 2002. Book pages had metadata. Requirement: return book pages to the user, no joins (no memory in 2002, different now)





Latency determines requirements (analytics/Netflix later)
Semi-real time. Schema for book pages. Where to store the metadata? In BigTable
My role: not going to give you slides with pictures; everything presented has code behind it, with documentation
Bigdata >>50% failure rate
After POCs very few enter production
Why? Work habits for distributed computing. Have to write distributed computing components; J2EE idioms don't work.
Fail because of performance/administration in production
e.g. performance is not an issue when supporting the top 100 Abinitio queries in Hadoop; 130k will be an issue, or perhaps 10%
Measuring performance in POCs: measuring it wrong means they can't build components
[Diagram labeled "Wrong": two Server/Thread boxes, NN, DN1/RS, DN2/RS]
Performance measurement, leader election, countdown latch, test failure/handoff with Chaos Monkey
[Diagram: Zookeeper+Jetty, Zookeeper, Server boxes, DN1/RS nodes]
Hive at LinkedIn (bottom left). All 3 similar
LinkedIn Simple Abstractions


Teradata with Hadoop
Multiple clusters: Prod/Dev/Research (POC?)
Hive: ad hoc small ETL, lower left-hand corner
Pig/DataFu + enhancements for ETL production
Multiple data stages in green box (POC: Abinitio data staging, REST API for staging).
Workflow POC: Oozie+Pig+Hive. Add Web UI
Data Staging POC: CDK as example
POC Coding Style
High-level directory with Maven subprojects; a simple archetype is OK
Define data repositories with Avro schemas; start with a simple file repository with files copied from the Abinitio file system. No need to spend time reverse engineering; just copy
Add pig and hive directories to cdk-examples
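A minimal sketch of what an Avro schema for one staged file could look like, written as a Python dict and serialized to the JSON form used in an .avsc file. The record name, namespace and every field below are hypothetical placeholders, not taken from the Abinitio files.

```python
import json

# Hypothetical record layout; all names here are illustrative assumptions.
TRANSACTION_SCHEMA = {
    "type": "record",
    "name": "Transaction",
    "namespace": "example.staging",
    "fields": [
        {"name": "account_id", "type": "string"},
        {"name": "amount_cents", "type": "long"},
        {"name": "event_time_ms", "type": "long"},
        # A union with "null" listed first makes the field optional.
        {"name": "memo", "type": ["null", "string"], "default": None},
    ],
}


def schema_json(schema):
    """Serialize the schema dict to the JSON text stored in an .avsc file."""
    return json.dumps(schema, indent=2)


def field_names(schema):
    """Return the field names in declaration order."""
    return [field["name"] for field in schema["fields"]]
```

Starting from a flat record like this keeps the "just copy" approach: the schema describes the file as it is, and can be refined later.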
POC Simple extensions
Define a webserver in the CDK and create a REST API. Jersey/.../DI if you want more advanced coding styles
Webserver graphs performance of Hive/Pig/ETL metrics alongside JVM metrics, measured by sending dummy queries in.
Start Nagios/Ganglia monitoring and Puppet deployment of CDK as learning for larger scale
Integrate CDK into Bigtop for Capital One distribution practice
Netflix, Block Diagram
Simple Netflix Abstractions



http://www.slideshare.net/adrianco/netflix-architectu
Automated software develop-and-deploy process on APIs: Perforce/Ivy/Jenkins. For the Hadoop POC: GitHub, Jenkins, deploy to a demo webpage. No code sitting in an Eclipse project
Netflix Automated App Dev/Deploy


REST specification makes Web UIs easier. C1 ETL REST interface
Netflix Instance config


Do the same for Capital One, as an exercise to help with deployment: with Apache Bigtop, define 1) NN instance, 2) DN/RS instance; customize the scripts per instance
Netflix Security


Default: turn off iptables/SELinux. Define Capital One POC testing? Start with auditing requirements on the test cluster (with Aravind)
Netflix Metrics


Send dummy queries through to measure
latency
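A rough sketch of the dummy-query idea: time a callable repeatedly and summarize the latencies. `dummy_query` is a hypothetical stand-in; a real probe would submit a small Hive or Pig query instead of sleeping.

```python
import math
import statistics
import time


def measure_latency(run_query, n=20):
    """Call run_query() n times; return the per-call latencies in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return count, median, nearest-rank 95th percentile and max."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


def dummy_query():
    # Hypothetical placeholder for a real query submission.
    time.sleep(0.001)


if __name__ == "__main__":
    print(summarize(measure_latency(dummy_query)))
```

Run from a single client this only measures that client's view; it says nothing about cluster throughput.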
Netflix Scaling Layer: do simpler first, JDBC managed connection pool, Pig/Hive
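The "simpler first" option can be sketched as a bounded connection pool: a fixed number of connections is created up front and callers wait for a free one. This is an illustration only; `ConnectionPool` and the `connect` factory are hypothetical names, not part of JDBC, Pig or Hive.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: callers block until a connection is free."""

    def __init__(self, connect, size):
        # connect is any zero-argument factory returning a connection object.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Raises queue.Empty if no connection frees up within the timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._idle.put(conn)
```

Capping the pool size is what manages the clients: the cluster never sees more concurrent sessions than the pool allows.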
Yahoo Block Diagram, Pig, Hive,
Spark, Storm
Yahoo Spark (cont)
Yahoo Next Gen ETL
LinkedIn/Yahoo/Netflix References






Reference: LinkedIn: Muhammed Islam http://www.slideshare.net/mislam77/hive-at-linkedin
Yahoo: Chris Drome, for outside business users. Very similar to slide before.
Netflix: Jeff Magnusson. Hive used for ad hoc queries and lightweight ETL (on web also)
ETL - Pig


Original Pig paper: http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf
− ETL language based on relational algebra (reorder/Set) vs. SQL queries. Each step is an M/R ETL
− No transactional consistency or indexes (other projects have this)
− Nested data model vs. flat SQL E/R model. Why? Faster scan performance, replaces joins, e.g. MongoDB
Requires developing UDFs; LinkedIn: DataFu
Netflix: Lipstick for debugging Pig DAGs. Will need some debugging tool. Better than Spill
ETL - Pig


M/R ETL Points
− Data is distributed on several nodes; merge-sort results at the end. Be careful sending data across the network: it doesn't scale with more users. Network limitation
− Google: custom network switch, 1k+ ports. Custom TCP stack, modified OS
− Careful: streams scale, so do ETL with streams. Real-time performance. Send results to a separate server. Do not embed writes into stream POCs
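The "merge sort results at end" step can be sketched in a few lines: each node sorts its own partition locally, and only the sorted runs are merged at the end. The node lists below are made-up sample data, and this runs in one process; on a cluster every run crosses the network, which is the limitation noted above.

```python
import heapq


def sort_partition(records, key=None):
    """Local sort done independently on each node."""
    return sorted(records, key=key)


def merge_sorted_partitions(partitions, key=None):
    """Final step: k-way merge of partitions that are already sorted."""
    return list(heapq.merge(*partitions, key=key))


# Made-up sample data standing in for three nodes.
node_a = [5, 1, 9]
node_b = [4, 8]
node_c = [7, 2]

sorted_runs = [sort_partition(p) for p in (node_a, node_b, node_c)]
merged = merge_sorted_partitions(sorted_runs)
```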
Pig vs M/R
Pig Usage






Yahoo (http://www.linkedin.com/pub/chrisdrome/2/a2/346): thousands of ETL jobs daily, Hive for small user base external to Yahoo
Netflix (http://www.linkedin.com/in/jmagnuss): thousands of jobs, at analyst level. Open sourced Lipstick, a Pig UI debugging tool
LinkedIn (http://www.slideshare.net/hadoopusergroup/pig-at-linkedin): thousands of jobs, open sourced DataFu UDFs
Pig POCs (~2009)
Possible Pig POCs:
− Top XX queries: manually code up Abinitio queries. Was this already completed in 2012? Which queries?
− Add a JDBC connection type scaling layer to PigServer.java
− Out of scope for 4/30/14: POC Tez on Pig: https://issues.apache.org/jira/browse/PIG-3446 Apache's Pig optimizer (MR->MR->MR goes to MRRR) by writing the optimizer in the YARN AM.
POC quality


Turn the POCs into Bigtop integration tests and get open source approval. Commit changes to verify quality and accountability
Hive 0.11







More difficult to configure; add MySQL metastore
Moving to HCatalog so metadata is accessible by other Hadoop components
Access using WebHCat, in progress
Hive Stinger using Tez, additional in-memory optimization
No time spent on this yet; starting 1/2014 with Hortonworks. Last day 4/30/2014
Hive 0.11
Hive 0.11 POCs:
User guide for Abinitio programmers using Hive/Pig
Test multitenancy features with Pig/HDFS
Test JDK 1.7 features. Hadoop 2.x works with 1.7
HiveMetastore/MySQL/HCatalog/WebHCat
Test cluster performance using benchpress
Next gen: 0.12-0.13; Spark/Shark HiveQL compatibility
Next Gen ETL Frameworks for 2014+
Faster reads/scans without using HBase. 3 developments (wibidata):
− Spark/Shark
− Dremel: Impala/Apache Drill
− Hive/Tez
Dremel paper review, "Interactive Analysis of Web-Scale Datasets":
− Don't use M/R for speed; 100x faster
− Column schema: nested column-oriented storage, not rows; faster for some types of queries
− Partition key (not in paper)
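A toy illustration of why column-oriented storage helps some queries: summing one field touches a single contiguous list instead of every full record. The sample rows are invented, and the sketch is flat; Dremel's handling of nested and repeated fields is deliberately left out.

```python
# Invented sample data in row layout: one dict per record.
rows = [
    {"user": "a", "page": "/home", "bytes": 120},
    {"user": "b", "page": "/cart", "bytes": 300},
    {"user": "a", "page": "/help", "bytes": 80},
]


def to_columns(records):
    """Pivot row records into one list per field (column layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_bytes_row_store(records):
    # Row layout: every whole record is visited to read one field.
    return sum(record["bytes"] for record in records)


def total_bytes_column_store(columns):
    # Column layout: only the "bytes" column is read.
    return sum(columns["bytes"])


columns = to_columns(rows)
```

Queries that need most fields of each record see no such benefit, hence "faster for some types of queries".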
Dremel Schema/Column Perf, similar to Kiji w/o HBase? Sqoop objects
Impala/Drill POC
Next Gen ETL




Shark/Spark: distributed memory RDD, analysis and ETL
Hive/Tez
Next Gen ETL POCs (combine mem)
Goal: develop skill for getting to higher HDFS read performance.
Stage data: schema/representation effects on performance. Dremel nested columns:
− Data with Avro schemas and partition strategies. Partition by timestamp, partition by custom rowkey, partition by schema definitions
− Measure effect of data schema on M/R and non-M/R implementations. Conversion or staging process for data
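Two of the partition strategies can be sketched as functions that map a record to a directory. The `year=/month=/day=` layout follows the Hive partition-directory convention; the base path, bucket count and row key format are assumptions made for illustration.

```python
import zlib
from datetime import datetime, timezone


def timestamp_partition(base_dir, epoch_seconds):
    """Year/month/day partition directory for an event time, in UTC."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base_dir}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"


def rowkey_partition(base_dir, rowkey, buckets=16):
    """Hash a custom row key into one of `buckets` directories."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    bucket = zlib.crc32(rowkey.encode("utf-8")) % buckets
    return f"{base_dir}/bucket={bucket:02d}"
```

For example, `timestamp_partition("/data/events", 1389052800)` gives `/data/events/year=2014/month=01/day=07`. Timestamp partitions suit date-range scans; hashed row keys spread load more evenly, so the POC would measure both against the same queries.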
Next Gen ETL


Addition of new components into Hadoop
− CDH will come with Spark/Shark
− CDH comes with Impala
− HDP status unknown for now (clear EOM)
Hadoop Distributions


Create a Capital One distribution
Why? Production is 3-4x the amount of work compared to Dev
− Make sure it is ready for production before development is completed
Refactoring of scripts, bin and sbin, to allow admins and users access to admin/user scripts
Customize and add components (scaling layer)
Puppet/Chef scripts for cluster deployment
Real-time monitoring (not provided in CDH/HDP), hotspot detection for long-running jobs
Being ready for cluster deployment allows integration of functional requirements like security into functional Groovy iTests.
Possible Hadoop Distro POCs


Beginner POCs:
Goal: smooth handoff from dev to production
− Build Apache Bigtop (will need reference doc)
− Add components you are currently using that are not in the distro (e.g. MongoDB + HBase for schema)
− Add integration tests
− Add Puppet recipes
− Learn how to apply patches, how to customize for simple modifications, production stability
POC framework



Goal: contribute open source code
Start with the documentation and s/w processes first
− DocBook
− Jenkins server: http://apachebigtop.pbworks.com/w/file/49310946/Apache%20Bigtop%20%20Jenkins.docx
POC Framework/Roadmap


Track the Jiras!!!
− Multitenancy needs a test plan.
− Development environment using Vagrant instead of EC2. Cheaper, easier to administer
− https://issues.apache.org/jira/browse/BIGTOP-1171
Create a Capital One Hadoop* user guide
− https://issues.apache.org/jira/browse/BIGTOP-1136
− https://issues.apache.org/jira/browse/BIGTOP-1157
Create a functional spec for missing components
Include test cases for security, multiuser access, minimum performance to meet SLAs
Scaling


Astyanax on Cassandra (Netflix)
− Small companies don't have 300 users accessing HDFS. Manage the clients.
Some examples. Scaling involves multiple components above the cluster h/w and Hadoop daemons. This is NOT running CDH or HDP using Ambari or Cloudera Manager
Gives SLA and ad hoc jobs high priority
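One way a client-side layer could favor SLA and ad hoc jobs over batch work is a priority queue in front of job submission. This is a sketch only: `JobQueue`, the three job kinds and the ranking of SLA ahead of ad hoc are assumptions, not something the slides specify.

```python
import heapq
import itertools


class JobQueue:
    """Hands out jobs by priority class, FIFO within a class."""

    # Assumed ranking: lower number is served first.
    PRIORITY = {"sla": 0, "adhoc": 1, "batch": 2}

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps submission order within a priority class.
        self._counter = itertools.count()

    def submit(self, kind, name):
        entry = (self.PRIORITY[kind], next(self._counter), name)
        heapq.heappush(self._heap, entry)

    def next_job(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Strict priority like this can starve batch jobs under sustained ad hoc load, so a real layer would also need a fairness or aging rule.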
Capital One will need a custom component
Either for security or scaling or … even to separate batch analytics queries from ad hoc queries
Break down into 2 bigger steps:
− Cluster testing tool for scaling/security
− Develop multiuser client layer using the above, and measure performance and modified use cases
Building a scaling layer


Need a tool for testing. Need to know how to use ZooKeeper at a minimum.
− Impossible to figure out via web searches
− Leader election and countdown latch
− Most people do their POCs incorrectly.
Worst mistake is multiple threads on a single server
Second worst mistake is using HBase PerformanceEvaluation.java as a reference. PE.java is not cluster aware
Test cluster throughput for cluster scaling
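The decision logic of ZooKeeper-style leader election can be sketched without a cluster: each participant creates an ephemeral sequential znode, the lowest sequence number leads, and every other participant watches the node just ahead of it. The code below only models that rule on a list of child names; the `n_` prefix is a hypothetical choice, and no ZooKeeper client is involved.

```python
def sequence_number(znode_name):
    """ZooKeeper appends a 10-digit zero-padded counter to sequential znodes."""
    return int(znode_name[-10:])


def elect(children):
    """Return (leader, watches): watches maps each follower to its predecessor."""
    ordered = sorted(children, key=sequence_number)
    leader = ordered[0]
    # Watching only the predecessor avoids waking every node on each change.
    watches = {ordered[i]: ordered[i - 1] for i in range(1, len(ordered))}
    return leader, watches


children = ["n_0000000003", "n_0000000001", "n_0000000002"]
leader, watches = elect(children)

# Simulated handoff: the leader's ephemeral node vanishes when its session dies.
survivors = [c for c in children if c != leader]
new_leader, _ = elect(survivors)
```

A test tool built on this would run one driver per server, which avoids the multiple-threads-on-one-server mistake listed above.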
Analytics





Review and demo (weblog targeting)
Concepts to agree on first: modeling and targeting
http://www.slideshare.net/DougChang1/demographics-andweblogtargeting-10757778
Analytics (wibidata): schema, model, targeting, use DB vs. HBase
Analytics f(latency). Netflix
Analytics


Model iteration performance is key. O(n^2) in # users
− Random Forest: 6-8h on a MacBook
Sponsorship from EMC: free 1k-node cluster + Gemfire for faster model building
Hadoop: HDFS + M/R for certain specific use cases
− Batch analysis, log analysis. Click-log analysis from large disk files
− ETL, M/R ETL only. Much, much slower than any commercial system
Analytics 2014+


Visualizations
− Tableau/Datameer POC? Data+Queries?
Deep Learning case studies:
− Google Now >> Apple Siri. Deep learning models replaced Gaussian mixture models (GMMs)
Background refresher: speech recognition
− Deep learning as a replacement for GMMs in the acoustic model, http://www.stanford.edu/class/cs224s/2006/
− Can do POCs here for innovation. Requires outside consultant assistance
Deliverables avail today


Start the Capital One distribution
− Build instructions
− Functional Specification Capital One Hadoop Distro POC
Planned, need approval before starting
− Data Staging
Functional Specification Capital One Data Staging POC
Functional Specification Data Staging API
ETL Performance POC
− Functional Specification Top 100 queries from Abinitio
Capital One Block Diagram
[Block diagram: REST: Batch ETL (M/R); REST: AdHoc M/R; Real Time ETL (no M/R); Streams/Storm; Real Time Analytics; HCatalog/Schema; Scaling Layer; HDFS]
POCs




Data Ingestion: POC with Apache Kafka; test fixture needed. Current abilities may not be there
Hadoop ETL:
− Schema definition
Write/read query performance of top 10/100 Abinitio queries. How close is current ETL to Abinitio? Assume this answer exists.
POCs




Hadoop Dev->Production: building the Capital One distribution with Apache Bigtop; replicate the CDH configuration with HDFS/Pig/Hive/Oozie/Flume/Spark. Leave out Impala, not currently in Bigtop
Scaling: POC intermediate layer.

Mais conteúdo relacionado

Mais procurados

JavaOne 2015 CON7547 "Beyond the Coffee Cup: Leveraging Java Runtime Technolo...
JavaOne 2015 CON7547 "Beyond the Coffee Cup: Leveraging Java Runtime Technolo...JavaOne 2015 CON7547 "Beyond the Coffee Cup: Leveraging Java Runtime Technolo...
JavaOne 2015 CON7547 "Beyond the Coffee Cup: Leveraging Java Runtime Technolo...0xdaryl
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsDamien Dallimore
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesDatabricks
 
Triple-E’class Continuous Delivery with Hudson, Maven, Kokki and PyDev
Triple-E’class Continuous Delivery with Hudson, Maven, Kokki and PyDevTriple-E’class Continuous Delivery with Hudson, Maven, Kokki and PyDev
Triple-E’class Continuous Delivery with Hudson, Maven, Kokki and PyDevWerner Keil
 
Project Zen: Improving Apache Spark for Python Users
Project Zen: Improving Apache Spark for Python UsersProject Zen: Improving Apache Spark for Python Users
Project Zen: Improving Apache Spark for Python UsersDatabricks
 
Helium makes Zeppelin fly!
Helium makes Zeppelin fly!Helium makes Zeppelin fly!
Helium makes Zeppelin fly!DataWorks Summit
 
Splunking the JVM (Java Virtual Machine)
Splunking the JVM (Java Virtual Machine)Splunking the JVM (Java Virtual Machine)
Splunking the JVM (Java Virtual Machine)Damien Dallimore
 
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNHadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNDataWorks Summit
 
Migration tales from java ee 5 to 7
Migration tales from java ee 5 to 7Migration tales from java ee 5 to 7
Migration tales from java ee 5 to 7Roberto Cortez
 
Kubernetes day 2 Operations
Kubernetes day 2 OperationsKubernetes day 2 Operations
Kubernetes day 2 OperationsPaul Czarkowski
 
A DevOps guide to Kubernetes
A DevOps guide to KubernetesA DevOps guide to Kubernetes
A DevOps guide to KubernetesPaul Czarkowski
 
J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"
J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"
J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"Daniel Bryant
 
Open-source RPA: Leveraging Python and Robot Framework ecosystems for busines...
Open-source RPA: Leveraging Python and Robot Framework ecosystems for busines...Open-source RPA: Leveraging Python and Robot Framework ecosystems for busines...
Open-source RPA: Leveraging Python and Robot Framework ecosystems for busines...All Things Open
 
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Java EE Arquillian Testing with Docker & The Cloud
Java EE Arquillian Testing with Docker & The CloudJava EE Arquillian Testing with Docker & The Cloud
Java EE Arquillian Testing with Docker & The CloudBruno Borges
 
Leveraging Gradle @ Netflix (Guadalajara JUG Feb 25, 2021)
Leveraging Gradle @ Netflix (Guadalajara JUG Feb 25, 2021)Leveraging Gradle @ Netflix (Guadalajara JUG Feb 25, 2021)
Leveraging Gradle @ Netflix (Guadalajara JUG Feb 25, 2021)Roberto Pérez Alcolea
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageDamien Dallimore
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intromarpierc
 

Mais procurados (20)

JavaOne 2015 CON7547 "Beyond the Coffee Cup: Leveraging Java Runtime Technolo...
JavaOne 2015 CON7547 "Beyond the Coffee Cup: Leveraging Java Runtime Technolo...JavaOne 2015 CON7547 "Beyond the Coffee Cup: Leveraging Java Runtime Technolo...
JavaOne 2015 CON7547 "Beyond the Coffee Cup: Leveraging Java Runtime Technolo...
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring Applications
 
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native KubernetesSimplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes
 
Triple-E’class Continuous Delivery with Hudson, Maven, Kokki and PyDev
Triple-E’class Continuous Delivery with Hudson, Maven, Kokki and PyDevTriple-E’class Continuous Delivery with Hudson, Maven, Kokki and PyDev
Triple-E’class Continuous Delivery with Hudson, Maven, Kokki and PyDev
 
Project Zen: Improving Apache Spark for Python Users
Project Zen: Improving Apache Spark for Python UsersProject Zen: Improving Apache Spark for Python Users
Project Zen: Improving Apache Spark for Python Users
 
Helium makes Zeppelin fly!
Helium makes Zeppelin fly!Helium makes Zeppelin fly!
Helium makes Zeppelin fly!
 
Splunking the JVM (Java Virtual Machine)
Splunking the JVM (Java Virtual Machine)Splunking the JVM (Java Virtual Machine)
Splunking the JVM (Java Virtual Machine)
 
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARNHadoop {Submarine} Project: Running Deep Learning Workloads on YARN
Hadoop {Submarine} Project: Running Deep Learning Workloads on YARN
 
Migration tales from java ee 5 to 7
Migration tales from java ee 5 to 7Migration tales from java ee 5 to 7
Migration tales from java ee 5 to 7
 
Splunking the JVM
Splunking the JVMSplunking the JVM
Splunking the JVM
 
Kubernetes day 2 Operations
Kubernetes day 2 OperationsKubernetes day 2 Operations
Kubernetes day 2 Operations
 
A DevOps guide to Kubernetes
A DevOps guide to KubernetesA DevOps guide to Kubernetes
A DevOps guide to Kubernetes
 
J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"
J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"
J1 2015 "Debugging Java Apps in Containers: No Heavy Welding Gear Required"
 
Open-source RPA: Leveraging Python and Robot Framework ecosystems for busines...
Open-source RPA: Leveraging Python and Robot Framework ecosystems for busines...Open-source RPA: Leveraging Python and Robot Framework ecosystems for busines...
Open-source RPA: Leveraging Python and Robot Framework ecosystems for busines...
 
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
 
Java EE Arquillian Testing with Docker & The Cloud
Java EE Arquillian Testing with Docker & The CloudJava EE Arquillian Testing with Docker & The Cloud
Java EE Arquillian Testing with Docker & The Cloud
 
Leveraging Gradle @ Netflix (Guadalajara JUG Feb 25, 2021)
Leveraging Gradle @ Netflix (Guadalajara JUG Feb 25, 2021)Leveraging Gradle @ Netflix (Guadalajara JUG Feb 25, 2021)
Leveraging Gradle @ Netflix (Guadalajara JUG Feb 25, 2021)
 
Madrid Meetup
Madrid MeetupMadrid Meetup
Madrid Meetup
 
Splunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the messageSplunk Conf 2014 - Getting the message
Splunk Conf 2014 - Getting the message
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial IntroOGCE TeraGrid 2010 Science Gateway Tutorial Intro
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
 

Semelhante a Capital onehadoopintro

Hadoop applicationarchitectures
Hadoop applicationarchitecturesHadoop applicationarchitectures
Hadoop applicationarchitecturesDoug Chang
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 
Apache Deep Learning 201 - Philly Open Source
Apache Deep Learning 201 - Philly Open SourceApache Deep Learning 201 - Philly Open Source
Apache Deep Learning 201 - Philly Open SourceTimothy Spann
 
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013Mack Hardy
 
Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)Søren Lund
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
Migraine Drupal - syncing your staging and live sites
Migraine Drupal - syncing your staging and live sitesMigraine Drupal - syncing your staging and live sites
Migraine Drupal - syncing your staging and live sitesdrupalindia
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopJosh Patterson
 
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.OW2
 
Apache Bigtop and ARM64 / AArch64 - Empowering Big Data Everywhere
Apache Bigtop and ARM64 / AArch64 - Empowering Big Data EverywhereApache Bigtop and ARM64 / AArch64 - Empowering Big Data Everywhere
Apache Bigtop and ARM64 / AArch64 - Empowering Big Data EverywhereGanesh Raju
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideDanairat Thanabodithammachari
 
SCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scalingSCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scalingStanislav Osipov
 
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero DowntimeHow to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero DowntimeIan Lumb
 

Semelhante a Capital onehadoopintro (20)

Hadoop applicationarchitectures
Hadoop applicationarchitecturesHadoop applicationarchitectures
Hadoop applicationarchitectures
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Apache Deep Learning 201 - Philly Open Source
Apache Deep Learning 201 - Philly Open SourceApache Deep Learning 201 - Philly Open Source
Apache Deep Learning 201 - Philly Open Source
 
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
Strategies and Tips for Building Enterprise Drupal Applications - PNWDS 2013
 
Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Migraine Drupal - syncing your staging and live sites
Migraine Drupal - syncing your staging and live sitesMigraine Drupal - syncing your staging and live sites
Migraine Drupal - syncing your staging and live sites
 
Hadoop content
Hadoop contentHadoop content
Hadoop content
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Apache Bigtop and ARM64 / AArch64 - Empowering Big Data Everywhere
Apache Bigtop and ARM64 / AArch64 - Empowering Big Data EverywhereApache Bigtop and ARM64 / AArch64 - Empowering Big Data Everywhere
Apache Bigtop and ARM64 / AArch64 - Empowering Big Data Everywhere
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guide
 
SCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scalingSCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scaling
 
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero DowntimeHow to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
How to Upgrade Your Hadoop Stack in 1 Step -- with Zero Downtime
 

Mais de Doug Chang

BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkDoug Chang
 
Odersky week1 notes
Odersky week1 notesOdersky week1 notes
Odersky week1 notesDoug Chang
 
Spark Streaming Info
Spark Streaming InfoSpark Streaming Info
Spark Streaming InfoDoug Chang
 
Capital onehadoopclass
Capital onehadoopclassCapital onehadoopclass
Capital onehadoopclassDoug Chang
 
L'Oreal Tech Talk
L'Oreal Tech TalkL'Oreal Tech Talk
L'Oreal Tech TalkDoug Chang
 
Apache bigtopwg7142013
Apache bigtopwg7142013Apache bigtopwg7142013
Apache bigtopwg7142013Doug Chang
 
Bigtop june302013
Bigtop june302013Bigtop june302013
Bigtop june302013Doug Chang
 
Bigtop elancesmallrev1
Bigtop elancesmallrev1Bigtop elancesmallrev1
Bigtop elancesmallrev1Doug Chang
 
Hadoop/HBase POC framework
Hadoop/HBase POC frameworkHadoop/HBase POC framework
Hadoop/HBase POC frameworkDoug Chang
 
Demographics andweblogtargeting
Demographics andweblogtargetingDemographics andweblogtargeting
Demographics andweblogtargetingDoug Chang
 

Mais de Doug Chang (12)

BRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning TalkBRV CTO Summit Deep Learning Talk
BRV CTO Summit Deep Learning Talk
 
Hapi
HapiHapi
Hapi
 
Odersky week1 notes
Odersky week1 notesOdersky week1 notes
Odersky week1 notes
 
Spark Streaming Info
Spark Streaming InfoSpark Streaming Info
Spark Streaming Info
 
Capital onehadoopclass
Capital onehadoopclassCapital onehadoopclass
Capital onehadoopclass
 
Training
TrainingTraining
Training
 
L'Oreal Tech Talk
L'Oreal Tech TalkL'Oreal Tech Talk
L'Oreal Tech Talk
 
Apache bigtopwg7142013
Apache bigtopwg7142013Apache bigtopwg7142013
Apache bigtopwg7142013
 
Bigtop june302013
Bigtop june302013Bigtop june302013
Bigtop june302013
 
Bigtop elancesmallrev1
Bigtop elancesmallrev1Bigtop elancesmallrev1
Bigtop elancesmallrev1
 
Hadoop/HBase POC framework
Hadoop/HBase POC frameworkHadoop/HBase POC framework
Hadoop/HBase POC framework
 
Demographics andweblogtargeting
Demographics andweblogtargetingDemographics andweblogtargeting
Demographics andweblogtargeting
 

Capital One Hadoop Intro

  • 1. Capital One Hadoop Intro (1/7/2014)
      - History
      - ETL/Analytics practices at LinkedIn/Netflix/Yahoo
      - Next Gen ETL 2014+
      - Scaling Layers
      - Hadoop Distributions
      - Analytics
  • 2. Hadoop/HBase
      - Original requirements:
          - GFS: store internet HTML pages on disk for later analytics.
          - BigTable (2002): book pages had metadata. The requirement was to return book pages to the user with no joins (no memory in 2002; different now).
      - Latency determines requirements (analytics/Netflix later).
      - Semi-real time. Schema for book pages. Where to store the metadata? In BigTable.
      - My role: not going to give you slides w/pics; everything presented has code behind it w/documentation.
  • 3. Big data: >>50% failure rate
      - After POCs, very few projects enter production.
      - Why? Work habits for distributed computing. You have to write distributed computing components; J2EE idioms don't work.
      - Projects fail because of performance/administration in production.
      - e.g. Performance is not an issue when supporting the top 100 Abinitio queries in Hadoop; 130k will be an issue, or perhaps 10%.
  • 4. Measuring performance in POCs: doing it wrong means they can't build components
      - [Diagram, labeled "Wrong": Server/Thread, DN1/RS, NN, DN2/RS, Server/Thread, DN2/RS]
  • 5. Performance measurement: leader election, countdown latch, test failure/handoff w/chaos monkey
      - [Diagram: Zookeeper+Jetty; Server and Zookeeper nodes alongside DN1/RS]
  • 6. Hive at LinkedIn (bottom left). All 3 similar
  • 7. LinkedIn Simple Abstractions
      - Teradata with Hadoop
      - Multiple clusters: Prod/Dev/Research (POC?)
      - Hive: ad hoc small ETL (lower left-hand corner)
      - Pig/DataFu + enhancements for production ETL
      - Multiple data stages in green box (POC: Abinitio data staging, REST API for staging)
      - Workflow POC: Oozie+Pig+Hive. Add Web UI
      - Data Staging POC: CDK as example
  • 8. POC Coding Style
      - High-level directory with Maven subprojects; a simple archetype is OK.
      - Define data repositories with Avro schemas. Start with a simple file repository with files copied from the Abinitio file system. No need to spend time reverse engineering; just copy.
      - Add pig and hive directories to cdk-examples.
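As a rough illustration of what "define data repositories with Avro schemas" could look like for one copied file, here is a minimal sketch. An Avro schema is a JSON document; the record name, namespace, and field names below are hypothetical and are not taken from the deck or from any Abinitio layout.

```python
import json

# Hypothetical record layout for one staged file; every name here is illustrative.
customer_event_schema = {
    "type": "record",
    "name": "CustomerEvent",
    "namespace": "com.example.staging",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "event_time", "type": "long"},
        # A nullable field is a union with "null" first and a null default.
        {"name": "amount", "type": ["null", "double"], "default": None},
    ],
}


def to_avsc(schema):
    """Serialize the schema dict to the JSON text that would be saved as an .avsc file."""
    return json.dumps(schema, indent=2)


def field_names(schema):
    """List the field names declared by a record schema."""
    return [field["name"] for field in schema["fields"]]


print(to_avsc(customer_event_schema))
```

Keeping the schema in a plain file next to the copied data means the same definition can be reused by whatever reads it later.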
  • 9. POC Simple Extensions
      - Define a webserver in the CDK and create a REST API. Jersey/.../DI if you want more advanced coding styles.
      - The webserver graphs Hive/Pig/ETL performance metrics together with JVM metrics, measured by sending dummy queries in.
      - Start Nagios/Ganglia monitoring and Puppet deployment of the CDK as learning for larger scale.
      - Integrate the CDK into Bigtop as practice for the Capital One distribution.
  • 11. Simple Netflix Abstractions
      - http://www.slideshare.net/adrianco/netflix-architectu
      - Automated develop-and-deploy s/w process on APIs: Perforce/Ivy/Jenkins.
      - Hadoop POC: github, Jenkins, deploy to demo webpage. No code sitting in an Eclipse project.
  • 12. Netflix Automated App Dev/Deploy
      - REST specification makes Web UIs easier. C1 ETL REST I/F
  • 13. Netflix Instance Config
      - Do the same for Capital One as an exercise to help w/deployment: with Apache Bigtop, define 1) an NN instance and 2) a DN/RS instance, and customize the scripts per instance.
  • 14. Netflix Security
      - Default: turn off iptables/selinux. Define Capital One POC testing? Start w/auditing requirements on the test cluster (w/Aravind).
  • 15. Netflix Metrics
      - Send dummy queries through to measure latency.
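The dummy-query idea can be sketched in a few lines. This is only an illustration under assumptions: `dummy_query` is a hypothetical stand-in that sleeps instead of contacting a cluster, and `measure_latency` and `summarize` are names invented for this sketch. It also runs in a single process, so it shows the measurement itself, not a cluster-aware load test.

```python
import statistics
import time


def measure_latency(run_query, n=20):
    """Call run_query n times and return the per-call latencies in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Return count, median, approximate 95th percentile, and max of the samples."""
    ordered = sorted(latencies)
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


def dummy_query():
    # Stand-in for a real round trip (for example a trivial Hive or Pig request).
    time.sleep(0.001)


print(summarize(measure_latency(dummy_query, n=5)))
```

Reporting a percentile alongside the median matters because a few slow calls are usually what users notice.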
  • 16. Netflix Scaling Layer: do the simpler version first, a JDBC-style managed connection pool for Pig/Hive
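A managed connection pool, the "simpler first" scaling layer named on this slide, can be sketched as a bounded queue of reusable connections. This is an assumption-laden illustration: `ConnectionPool` and the `connect` factory are hypothetical names, and the "connections" in the usage line are plain objects rather than real JDBC, Pig, or Hive sessions.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hand out at most `size` connections at a time; callers wait for a free one."""

    def __init__(self, connect, size):
        # connect is a hypothetical factory that opens one backend connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty if timeout expires.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection, even if the caller's query failed.
            self._idle.put(conn)

    def idle_count(self):
        return self._idle.qsize()


pool = ConnectionPool(connect=lambda: object(), size=2)
with pool.connection() as conn:
    print("idle while one is borrowed:", pool.idle_count())
```

The pool size is the control knob: it caps how many clients hit the cluster at once, regardless of how many users are waiting.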
  • 17. Yahoo Block Diagram, Pig, Hive, Spark, Storm
  • 20. LinkedIn/Yahoo/Netflix References
      - LinkedIn: Muhammed Islam, http://www.slideshare.net/mislam77/hive-at-linkedin
      - Yahoo: Chris Drome; Hive for outside business users. Very similar to the slide before.
      - Netflix: Jeff Magnusson; Hive used for ad hoc queries and lightweight ETL (on web also).
  • 21. ETL - Pig
      - Original Pig paper: http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf
          - ETL language based on relational algebra (reorder/Set) vs. SQL queries. Each step is an M/R ETL job.
          - No transactional consistency or indexes (other projects have this).
          - Nested data model vs. flat SQL E/R model. Why? Faster scan performance; replaces joins (e.g. MongoDB).
      - Requires developing UDFs. LinkedIn: DataFu.
      - Netflix: Lipstick for debugging Pig DAGs. Will need some debugging tool. Better than Spill.
  • 22. ETL - Pig
      - M/R ETL points:
          - Data is distributed on several nodes; results are merge-sorted at the end. Be careful sending data across the network: it doesn't scale with more users (network limitation).
          - Google: custom network switch with 1k+ ports, custom TCP stack, modified OS.
          - Careful: streams scale, so do ETL with streams for real-time performance. Send results to a separate server. Do not embed writes into stream POCs.
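The "merge sort results at end" step can be illustrated with a k-way merge of already-sorted partial results. This is a single-process sketch with made-up data; in a real job each list would arrive from a different node, which is exactly the network traffic the slide warns about.

```python
import heapq


def merge_sorted_partitions(partitions):
    """Merge several individually sorted lists into one sorted list."""
    # heapq.merge walks the inputs lazily, so it never re-sorts the whole data set.
    return list(heapq.merge(*partitions))


# Hypothetical sorted outputs from three worker nodes.
node_outputs = [[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]
merged = merge_sorted_partitions(node_outputs)
print(merged)
```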
  • 24. Pig Usage
      - Yahoo (http://www.linkedin.com/pub/chrisdrome/2/a2/346): thousands of ETL jobs daily; Hive for a small user base external to Yahoo.
      - Netflix (http://www.linkedin.com/in/jmagnuss): thousands of jobs, at analyst level. Open-sourced Lipstick, a Pig UI debugging tool.
      - LinkedIn (http://www.slideshare.net/hadoopusergroup/pig-at-linkedin): thousands of jobs; open-sourced DataFu UDFs.
  • 25. Pig POCs (~2009)
      - Possible Pig POCs:
          - Top XX queries: manually code up Abinitio queries. Already completed in 2012? Which queries?
          - Add a JDBC-connection-type scaling layer to PigServer.java.
      - Out of scope for 4/30/14:
          - POC Tez on Pig: https://issues.apache.org/jira/browse/PIG-3446
          - Apache's Pig optimizer (MR->MR->MR goes to MRRR) by writing the optimizer in the YARN AM.
  • 26. POC Quality
      - Turn the POCs into Bigtop integration tests and get open source approval. Commit changes to verify quality and accountability.
  • 27. Hive 0.11
      - More difficult to configure; add a MySQL metastore.
      - Moving to HCatalog so that metadata is accessible by other Hadoop components.
      - Access using WebHCat (in progress).
      - Hive Stinger using Tez; additional in-memory optimization.
      - No time spent on this yet; starting 1/2014 w/Hortonworks. Last day 4/30/2014.
  • 28. Hive 0.11 POCs
      - User guide for Abinitio programmers using Hive/Pig.
      - Test multitenancy features w/Pig/HDFS.
      - Test JDK 1.7 features. Hadoop 2.x works with 1.7.
      - HiveMetastore/MySQL/HCatalog/WebHCat.
      - Test cluster performance using benchpress.
      - Next gen: 0.12-0.13; Spark/Shark HiveQL compatibility.
  • 29. Next Gen ETL Frameworks for 2014+
      - Faster reads/scans w/o using HBase. 3 developments (wibidata):
          - Dremel: Impala/Apache Drill
          - Spark/Shark
          - Hive/Tez
      - Dremel paper review, "Interactive Analysis of Web-Scale Datasets":
          - Don't use M/R for speed; 100x faster.
          - Column schema: nested column-oriented storage, not rows; faster for some types of queries.
          - Partition key (not in paper).
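To make the row vs. column point concrete, the sketch below turns nested records into one list per column path, so a query over a single field only touches that field's values. It is a deliberate simplification: the records and field names are invented, repeated fields are not handled, and Dremel's repetition and definition levels are left out entirely.

```python
def flatten(record, prefix=""):
    """Map a nested dict to {dotted.column.path: value}."""
    flat = {}
    for key, value in record.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def to_columns(records):
    """Store values column by column instead of record by record."""
    columns = {}
    for record in records:
        for path, value in flatten(record).items():
            columns.setdefault(path, []).append(value)
    return columns


# Hypothetical nested records; every record is assumed to have the same fields.
records = [
    {"user": {"id": 1, "state": "VA"}, "amount": 10.0},
    {"user": {"id": 2, "state": "TX"}, "amount": 32.5},
    {"user": {"id": 3, "state": "VA"}, "amount": 7.5},
]
columns = to_columns(records)

# An aggregate over one field reads one column, not every full record.
total = sum(columns["amount"])
print(sorted(columns), total)
```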
  • 31. Dremel schema/column performance: similar to Kiji w/o HBase? Sqoop objects
  • 33. Next Gen ETL
      - Shark/Spark: distributed in-memory RDDs, analysis and ETL
      - Hive/Tez
  • 34. Next Gen ETL POCs (combine mem)
      - Goal: develop skill for getting to higher HDFS read performance.
      - Stage data: effects of schema/representation on performance. Dremel nested columns:
          - Data w/Avro schemas and partition strategies: partition by timestamp, partition by custom rowkey, partition by schema definitions.
          - Measure the effect of data schema on M/R and non-M/R implementations.
      - Conversion or staging process for data.
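Of the partition strategies listed, partitioning by timestamp is the easiest to sketch: each record is routed to a directory derived from its event time, so a scan for one day only reads that day's directory. The `year=/month=/day=` layout, the base directory, and the `event_time` field name are assumptions chosen for illustration, not something specified in the deck.

```python
from datetime import datetime, timezone


def partition_path(base_dir, epoch_seconds):
    """Build a date-based partition directory for one record (UTC)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{base_dir}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}"


def group_by_partition(base_dir, records, time_field="event_time"):
    """Group records by the partition directory they would be written to."""
    groups = {}
    for record in records:
        path = partition_path(base_dir, record[time_field])
        groups.setdefault(path, []).append(record)
    return groups


# Hypothetical records; event_time is seconds since the Unix epoch.
events = [
    {"event_id": "a", "event_time": 1389052800},
    {"event_id": "b", "event_time": 1389052800 + 3600},
    {"event_id": "c", "event_time": 1389052800 + 86400},
]
for path, rows in sorted(group_by_partition("/data/staging/events", events).items()):
    print(path, [row["event_id"] for row in rows])
```

The same grouping function could be reused to compare strategies by swapping `partition_path` for a rowkey-based version and measuring read times for each layout.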
  • 35. Next Gen ETL
      - Addition of new components into Hadoop:
          - CDH will come with Spark/Shark
          - CDH comes with Impala
          - HDP status unknown for now (clear EOM)
  • 36. Hadoop Distributions
      - Create a Capital One distribution.
      - Why? Production is 3-4x the amount of work compared to dev.
          - Make sure it is ready for production before development is completed.
      - Refactor scripts (bin and sbin) to give admins and users access to admin/user scripts.
      - Customize and add components (scaling layer).
      - Puppet/Chef scripts for cluster deployment.
      - Real-time monitoring (not provided in CDH/HDP); hotspot detection for long-running jobs.
      - Being ready for cluster deployment allows integration of functional requirements like security into functional Groovy iTests.
  • 37. Possible Hadoop Distro POCs
      - Beginner POCs. Goal: smooth handoff from dev to production.
          - Build Apache Bigtop (will need reference doc).
          - Add components you are currently using that are not in the distro (e.g. mongodb + hbase for schema).
          - Add integration tests.
          - Add puppet recipes.
          - Learn how to apply patches and how to customize for simple modifications; production stability.
  • 38. POC Framework
      - Goal: contribute open source code.
      - Start with the documentation and s/w processes first:
          - DocBook
          - Jenkins server: http://apachebigtop.pbworks.com/w/file/49310946/Apache%20Bigtop%20%20Jenkins.docx
  • 39. POC Framework/Roadmap
      - Track the Jiras!
      - Multitenancy needs a test plan.
      - Development environment using Vagrant instead of EC2: cheaper, easier to administer.
      - https://issues.apache.org/jira/browse/BIGTOP-1171
      - Create a Capital One Hadoop* user guide.
      - https://issues.apache.org/jira/browse/BIGTOP-1136
      - https://issues.apache.org/jira/browse/BIGTOP-1157
      - Create a functional spec for missing components.
      - Include test cases for security, multiuser access, and minimum performance to meet SLAs.
  • 40. Scaling
      - Astyanax on Cassandra (Netflix).
          - Small companies don't have 300 users accessing HDFS. Manage the clients. Some examples.
      - Scaling involves multiple components above the cluster h/w and Hadoop daemons.
      - This is NOT running CDH or HDP using Ambari or Cloudera Manager.
      - Provides for SLA jobs and ad hoc high-priority jobs.
  • 41. Capital One will need a custom component
      - Either for security or scaling or … even to separate batch analytics queries from ad hoc queries.
      - Break down into 2 bigger steps:
          - Cluster testing tool for scaling/security.
          - Develop a multiuser client layer using the above, and measure performance and modified use cases.
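One way a client layer could keep batch and ad hoc queries apart is a priority queue in front of the cluster. The sketch below is purely an assumption about what such a component might do: the two classes `ADHOC` and `BATCH`, the `QueryQueue` name, and the rule that ad hoc work is served first are all invented for illustration.

```python
import heapq
import itertools

# Lower number = served first. These two classes are illustrative only.
ADHOC, BATCH = 0, 1


class QueryQueue:
    """Serve ad hoc queries before batch queries, first-in first-out within a class."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, kind, query):
        heapq.heappush(self._heap, (kind, next(self._arrival), query))

    def next_query(self):
        """Return the next query to run, or None when the queue is empty."""
        if not self._heap:
            return None
        _, _, query = heapq.heappop(self._heap)
        return query


q = QueryQueue()
q.submit(BATCH, "nightly-etl")
q.submit(ADHOC, "analyst-lookup")
q.submit(BATCH, "weekly-report")
print(q.next_query(), q.next_query(), q.next_query())
```

A strict priority rule like this can starve batch work under heavy ad hoc load, so a real layer would likely need quotas or aging on top of it.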
  • 42. Building a scaling layer
      - Need a tool for testing. Need to know how to use ZooKeeper at a minimum.
          - Impossible to figure out via web searches.
          - Leader election and countdown latch.
          - Most people do their POCs incorrectly.
      - Worst mistake: multiple threads on a single server.
      - Second worst mistake: using HBase PerformanceEvaluation.java as a reference. PE.java is not cluster aware.
      - Test cluster throughput for cluster scaling.
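The leader-election rule used in the standard ZooKeeper recipe is simple to state: every participant registers under an increasing sequence number, the live participant with the lowest number leads, and when it disappears the next lowest takes over. The sketch below simulates only that rule in memory; it does not talk to ZooKeeper, `ElectionSim` is a hypothetical name, and a real test tool would run the participants on separate machines with ephemeral sequential znodes.

```python
import itertools


class ElectionSim:
    """In-memory stand-in for ephemeral sequential nodes under one election path."""

    def __init__(self):
        self._next_seq = itertools.count()
        self._members = {}  # sequence number -> participant name

    def join(self, name):
        """Register a participant and return its sequence number."""
        seq = next(self._next_seq)
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Simulate a session ending: the participant's node disappears."""
        self._members.pop(seq, None)

    def leader(self):
        """The live participant with the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]


election = ElectionSim()
first = election.join("test-client-1")
election.join("test-client-2")
print(election.leader())   # the first client to join leads
election.leave(first)      # simulated failure of the leader
print(election.leader())   # leadership hands off to the next client
```

Killing the current leader and checking who takes over is the kind of failure/handoff test the deck pairs with chaos-monkey style testing.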
  • 43. Analytics
      - Review and demo (weblog targeting).
      - Concepts to agree on first: modeling and targeting.
      - http://www.slideshare.net/DougChang1/demographics-andweblogtargeting-10757778
  • 44. Analytics, (wibidata), schema, model, targeting, use db vs hbase
  • 46. Analytics
      - Model iteration performance is key. O(n^2) in # users.
          - Random Forest: 6-8h on a MacBook.
          - Sponsorship from EMC: free 1k-node cluster + GemFire for faster model building.
      - Hadoop: HDFS + M/R for certain specific use cases:
          - Batch analysis, log analysis. Click-log analysis from large disk files.
          - ETL: M/R ETL only. Much, much slower than any commercial system.
  • 47. Analytics 2014+
      - Visualizations:
          - Tableau/Datameer POC? Data+Queries?
      - Deep Learning case studies:
          - Google Now >> Apple Siri. Deep learning models replaced Gaussian mixture models (GMMs).
      - Background refresher, speech recognition:
          - Deep learning as a replacement for GMMs in the acoustic model, http://www.stanford.edu/class/cs224s/2006/
          - Can do POCs here for innovation. Requires outside consultant assistance.
  • 48. Deliverables available today
      - Start the Capital One distribution:
          - Build instructions
          - Functional Specification: Capital One Hadoop Distro POC
      - Planned, need approval before starting:
          - Data Staging:
              - Functional Specification: Capital One Data Staging POC
              - Functional Specification: Data Staging API
          - ETL Performance POC:
              - Functional Specification: Top 100 queries from Abinitio
  • 49. Capital One Block Diagram
      - [Diagram blocks: REST: Batch ETL M/R; REST: AdHoc M/R; Real Time ETL, No M/R, Streams/Storm; Real Time Analytics; HCatalog/Schema; Scaling Layer; HDFS]
  • 50. POCs
      - Data Ingestion: POC w/Apache Kafka; test fixture needed. Current abilities may not be there.
      - Hadoop ETL:
          - Schema definition.
          - Write/read query performance of top 10/100 Abinitio queries. How close is current ETL to Abinitio? Assume this answer exists.
  • 51. POCs
      - Hadoop Dev->Production: build the Capital One distribution with Apache Bigtop; replicate the CDH configuration with HDFS/Pig/Hive/Oozie/Flume/Spark. Leave out Impala, which is not currently in Bigtop.
      - Scaling: POC intermediate layer.