Successes, Challenges and
Pitfalls Migrating a SAAS
Business to Hadoop
Shaun Klopfenstein, CTO
Eric Kienle, Chief Architect
The Vision
Requirements
Marketo Proprietary and Confidential | © Marketo, Inc. 7/11/2016
Business Requirements
• Near real-time activity processing
• 1 billion activities per customer per day
• Improve cost efficiency of operations while scaling up
• Global enterprise grade security and governance
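A quick back-of-envelope check of what the throughput requirement above implies (the 3x peak factor is an illustrative assumption, not a figure from the talk):

```python
# Sustained rate implied by 1 billion activities per customer per day.
ACTIVITIES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60

avg_rate = ACTIVITIES_PER_DAY / SECONDS_PER_DAY   # ~11,574 events/sec
peak_rate = avg_rate * 3                          # assumed 3x diurnal peak

print(f"average: {avg_rate:,.0f} events/sec per customer")
print(f"assumed peak (3x): {peak_rate:,.0f} events/sec per customer")
```

Even the average case rules out per-event synchronous RDBMS writes and points toward a streaming ingest tier.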
Architecture Requirements
• Maximize utilization of hardware
• Multitenancy support with fairness
• Encryption, Authorization & Authentication
• Applications must scale horizontally
Technology Bake Off
Bake Off
• Technology Selection
• Storm/Spark Streaming
• HBase/Cassandra
• Built POC with each permutation + Kafka
• Load tested with one day of web traffic
The Winner Is… Our First Challenge
• We hoped to find a clear winner… we didn’t exactly
• The truth is that all of the POCs worked at the scale we tested
• It’s possible if we had scaled up the test, we would
have found more differences
How We Chose
• Community
• Features
• Team Skillset
• History
• The winners: HBase, Kafka, and Spark Streaming
Architecture & Design
Marketo Lambda Architecture
[Architecture diagram: activity sources (Web Activity, RTP Activity, Mobile Activity, CRM Sync, Partner APIs, Other Marketing Activities) flow through a Kafka Event Stream and an Ingestion Processor (Scala/Tomcat) backed by HBase on HDFS; Spark Streaming consumers drive Campaign Triggers, a Solr Indexing pipeline (Spark Streaming Indexer → Solr), an Email Report Loader, and a Web Activity Processor; downstream clients include the Marketo UI (Campaign Detail, Lead Detail), CRM Sync, Revenue Cycle Analytics, and other API clients.]
• Enhanced Lambda Architecture
• Inbound activities written to Ingestion Processor
• HBase first, then Kafka
• High volume (e.g. web) activities
• First written to Kafka, then enriched
• Spark Streaming applications consume events from Kafka
• Solr Indexing
• Email Reports
• Campaign Processing
• HBase is used for simple historical queries and is the system of record
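The two write paths above can be sketched as follows. All names here are hypothetical stand-ins; the real system writes through HBase and Kafka clients:

```python
# Sketch of the two ingestion paths: standard activities go to HBase
# (system of record) first, then Kafka; high-volume activities land in
# Kafka first and are enriched afterward. Assumed classification below.

HIGH_VOLUME_TYPES = {"web", "rtp", "mobile"}

def ingest(activity, hbase, kafka, enrich):
    """Route an activity along the appropriate write path."""
    if activity["type"] in HIGH_VOLUME_TYPES:
        kafka.append(("raw", activity))            # land raw event first
        kafka.append(("enriched", enrich(activity)))
    else:
        hbase.append(activity)                     # system of record first
        kafka.append(("enriched", activity))

# Minimal stand-ins for the real stores:
hbase, kafka = [], []
ingest({"type": "web", "id": 1}, hbase, kafka, enrich=lambda a: {**a, "geo": "?"})
ingest({"type": "crm", "id": 2}, hbase, kafka, enrich=lambda a: a)
```

The split keeps the hot path (web-scale traffic) off HBase's write path while still making every event available to the Spark Streaming consumers via Kafka.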
High Level Architecture
Build It
Implementation
Building Expertise
• We had a few people with Hadoop and
Spark experience
• We decided to grow knowledge in house
• Focus on training - Hortonworks boot camp
for operations
• In house courses and tech talks for engineering/QE
Building Expertise - Successes
• Critical to kick start the project
• Built excitement
• Created foundation for the design process
Building Expertise – Context Challenge
Challenge
• Training packed a lot of information into a short period
• Teams that didn’t leverage the training right away lost context
Recommendation
• Create environments for hands-on experience early
• Get hands-on experience across all teams right after training
Building Expertise – Experience Challenge
Challenge
• Hadoop technology is like playing a piano… knowing how to read
music doesn’t mean you can play
• Many ways to design, configure, manage - Only a few right ways
and the reasons can be subtle
Recommendation
• Find your experts!
• Partner and hire
Building Our First Cluster
• Initial sizing and capacity planning of first
Hadoop Clusters
• Perform load tests to get initial capacity plan
• Decided that disk I/O and storage would be the leading indicator
• Went with industry best practice on hardware and network
configuration
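A storage-led capacity estimate of the kind described above might look like this. Every figure is an assumption for the sketch, not Marketo's actual sizing:

```python
# Illustrative storage-led capacity plan: events/day x event size x
# replication x retention, with headroom. All inputs are assumptions.
events_per_day = 1_000_000_000
avg_event_bytes = 500
hdfs_replication = 3            # HDFS default replication factor
retention_days = 90
overhead = 1.3                  # compaction headroom, indexes, etc.

raw_tb_per_day = events_per_day * avg_event_bytes / 1e12   # 0.5 TB/day raw
total_tb = raw_tb_per_day * hdfs_replication * retention_days * overhead
nodes = total_tb / 40           # assume ~40 TB usable disk per data node
print(f"{total_tb:,.1f} TB cluster-wide -> ~{nodes:.0f} data nodes")
```

As the next slide notes, the leading indicator turned out to be compute rather than disk, which is exactly why an estimate like this should only be a starting point.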
Building Our First Cluster – Success
• The leading indicator ended up being compute
• But cluster sizing ended up being close enough to start
• Clusters can always be expanded…
So don’t get too hung up
Building Our First Cluster – Zookeeper & VM
Challenge
• We started with ZooKeeper virtualized
• It didn't perform properly (we think because of disk I/O)
• Caused random outages
Recommendation
• We ended up migrating ZooKeeper to physical boxes
• Don't use VMs for ZooKeeper!
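One way to catch this kind of degradation early is ZooKeeper's built-in `mntr` four-letter command (send `mntr` to port 2181, e.g. `printf mntr | nc zk-host 2181`; the reply is tab-separated key/value lines). A sketch of parsing that reply, with only the parsing shown here:

```python
# Parse ZooKeeper "mntr" output (tab-separated key/value lines) so that
# latency can be alerted on. The sample reply below is illustrative.

def parse_mntr(text):
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("\t")
        stats[key] = value
    return stats

sample = "zk_version\t3.4.6\nzk_avg_latency\t2\nzk_outstanding_requests\t0"
stats = parse_mntr(sample)
# e.g. alert if int(stats["zk_avg_latency"]) climbs: on virtualized hosts,
# that is a symptom of the disk-I/O problem described above
```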
Security
• All data at rest must be encrypted
• Applications sharing Hadoop must be isolated
from each other
• Applications must have hard quotas for both
compute and disk resources
Security - Success
• Enabled Kerberos security for Hadoop cluster
• Kerberos allowed us to leverage HDFS
native encryption
• Used encrypted disks for Kafka servers
• Created separate secure Yarn queues to
isolate applications
• Each application uses a separate Kerberos principal
Security – Kerberos Challenge
Challenge
• Kerberos can’t be added to a Hadoop cluster without prolonged
downtime and patches
• Needed weeks of developer time to accommodate security changes
• Added several months to the overall rollout schedule
Recommendation
• Allow extra time for Kerberos
• Educate your team beforehand, find an expert to guide you
• Be prepared for different levels of Kerberos support across the
Hadoop ecosystem
Security – Kafka and Spark Challenge
Challenge
• Kafka doesn’t support data encryption (and won’t)
• HDP version we had didn’t fully support Kerberos Kafka and Spark
clients properly
Recommendation
• Move Kafka and Spark out of Ambari
• Only encrypt Kafka data if you absolutely must, as it adds complexity
Test It
Validation
• Changing the engines on a plane while in flight is hard
• Required all components to implement “passive mode”
• The new code ran in the background and continuously compared results
with the legacy system
• Automated functional tests kicked off from Jenkins
• Performance testing at AWS
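The passive-mode comparison described above can be sketched like this. The `legacy_query`/`hadoop_query` callables are hypothetical stand-ins for the two systems:

```python
# Sketch of "passive mode": serve the legacy result, run the new Hadoop
# path in the shadow, and record any discrepancy for offline analysis.

mismatches = []

def passive_query(key, legacy_query, hadoop_query):
    expected = legacy_query(key)          # legacy result is still served
    try:
        candidate = hadoop_query(key)     # new path runs in the background
        if candidate != expected:
            mismatches.append((key, expected, candidate))
    except Exception as exc:              # new-path failures must not leak
        mismatches.append((key, expected, repr(exc)))
    return expected

result = passive_query("lead-42",
                       legacy_query=lambda k: {"score": 10},
                       hadoop_query=lambda k: {"score": 11})
# result is the legacy answer; the mismatch is recorded, not surfaced
```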
Validation - Success
• Passive mode is one of the best moves we made!
• Allowed for testing of components with real world
data and load
• Found countless performance and logic issues with
minimal operational impact
Validation – Passive Mode “Minimal Impact”
Challenge
• By design, passive mode wrote to both the legacy and Hadoop systems
• We impacted performance during an outage of our cluster
Recommendation
• Use asynchronous writes or tight timeouts in passive mode
• Monitoring for the Hadoop cluster should be in place before
passive testing
Deploying It
Migration and Management
• We are here!
• Migrate over 6,000 subscriptions with no service interruption
or data loss
• Track and monitor migration and provide management tools
for the new platform
• Achieve the end goal of removing the safety net
Migration and Management - Successes
• Created a new management console called Sirius
• Close architectural coordination of all teams during
migration
• If problems arose, we had a quick, automated, fallback
path to the legacy system
• Daily cross-functional standup meetings to track the
rollout
Migration and Management - Challenges
Challenge
• Oozie workflows can be challenging to build and debug
• Capacity planning and resource management in the shared Hadoop
cluster is very complex
Recommendation
• Only use Oozie workflows for automating complex or long-running
processes, or use a different orchestration platform
• Constantly reevaluate your capacity plan based on the current deployment
Running It
Monitoring
• Needed to monitor hundreds of new Hadoop and other
infrastructure servers
• Our custom Spark Streaming applications required all
new metrics and monitors
• Capacity planning requires trend analysis of both the
infrastructure and our applications
• Don’t overwhelm our already busy Cloud Platform Team
Monitoring - Successes
• Built a custom monitoring infrastructure using
OpenTSDB and Grafana
• Added business SLA metrics to our Sirius console to
provide real-time alerts
• Added comprehensive Hadoop monitors into our
pre-existing production monitoring system
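Pushing an application metric into OpenTSDB uses its HTTP `/api/put` endpoint; the `metric`/`timestamp`/`value`/`tags` JSON shape below is OpenTSDB's documented put format, while the metric name and tags are illustrative:

```python
# Build an OpenTSDB datapoint for the HTTP /api/put endpoint.
import json
import time

def opentsdb_datapoint(metric, value, **tags):
    return {
        "metric": metric,
        "timestamp": int(time.time()),
        "value": value,
        "tags": tags,          # OpenTSDB requires at least one tag
    }

point = opentsdb_datapoint("marketo.indexer.lag_seconds", 4.2,
                           host="spark-07", app="solr_indexer")
payload = json.dumps(point)    # POST this body to http://<tsd>:4242/api/put
```

Grafana then queries OpenTSDB directly, so application metrics and Hadoop infrastructure metrics land on the same dashboards.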
Monitoring - Challenges
Challenges
• Adding hundreds of servers and a dozen new applications
makes for a huge monitoring task
• Nagios is a very general-purpose system and isn’t designed
to monitor Hadoop out of the box
Recommendations
• Make sure that you have monitors and trend analysis in
place and tested before migration
• Be prepared to constantly refine and improve your
monitors and alerts
Patching and Upgrading
• We have a zero-downtime requirement for applications
• Patching and upgrading of either the infrastructure or our own
applications is problematic
• Keeping up with the community requires frequent patching
• Eventually hundreds of Spark Streaming jobs will need to be
constantly processing data with no interruption
Patching and Upgrading - Successes
• Use Sirius console to manage Spark Streaming jobs
• Marketo’s Kafka consumer allows streaming jobs to pick up
where they left off after a restart
• Integrated existing Jenkins infrastructure with the Sirius
console to provide painless automated patching/upgrades
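The resume-after-restart behavior works by persisting the last processed offset per partition and starting the next run just past it. A sketch with an in-memory offset store (a real consumer would commit offsets to Kafka or HBase):

```python
# Resume a partition's consumption just past the last committed offset,
# so a restarted streaming job picks up where it left off.

def process_from(log, offsets, partition, handle):
    start = offsets.get(partition, -1) + 1     # resume past last commit
    for offset in range(start, len(log)):
        handle(log[offset])
        offsets[partition] = offset            # commit after each record

log = ["e0", "e1", "e2", "e3"]
offsets, seen = {}, []
process_from(log[:2], offsets, 0, seen.append)   # first run sees e0, e1
process_from(log, offsets, 0, seen.append)       # "restart": resumes at e2
# seen == ["e0", "e1", "e2", "e3"]; nothing is reprocessed
```

Committing after processing (rather than before) gives at-least-once delivery, so downstream handlers should be idempotent.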
Infrastructure Patching and Upgrading - Challenges
Challenges
• Patches/upgrades managed with Ambari – not perfect!
• We almost never get through an upgrade without one or more Hadoop
components having downtime (so far)
Recommendations
• Test all infrastructure patches and upgrades in a loaded non-production
environment
• Check out the start and stop scripts from the component-specific open
source communities, rather than relying on Ambari
We’re Hiring!
Http://Marketo.Jobs
Q & A

Mais conteúdo relacionado

Mais procurados

Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics OptimizationHortonworks
 
How to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDBHow to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDBHortonworks
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in EnterpriseDataWorks Summit
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
 
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...DataWorks Summit
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseDataWorks Summit
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackDataWorks Summit/Hadoop Summit
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryDataWorks Summit/Hadoop Summit
 
Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...DataWorks Summit
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Hortonworks
 
Multitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and OozieMultitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and OozieDataWorks Summit
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesDataWorks Summit
 
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...DataWorks Summit
 
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextDiscover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextHortonworks
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionDataWorks Summit
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...Hortonworks
 
Securing Data in Hybrid on-premise and Cloud Environments Using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments Using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments Using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments Using Apache RangerDataWorks Summit
 

Mais procurados (20)

Securing Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise ContextSecuring Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise Context
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
How to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDBHow to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDB
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in Enterprise
 
Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It! Boost Performance with Scala – Learn From Those Who’ve Done It!
Boost Performance with Scala – Learn From Those Who’ve Done It!
 
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...
Driving in the Desert - Running Your HDP Cluster with Helion, Openstack, and ...
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 
Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014
 
Multitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and OozieMultitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and Oozie
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
 
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
Its Finally Here! Building Complex Streaming Analytics Apps in under 10 min w...
 
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextDiscover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
 
Apache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the unionApache Hadoop YARN: state of the union
Apache Hadoop YARN: state of the union
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
 
Securing Data in Hybrid on-premise and Cloud Environments Using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments Using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments Using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments Using Apache Ranger
 

Destaque

2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demoDatabricks
 
Migrating Clinical Data in Various Formats to a Clinical Data Management System
Migrating Clinical Data in Various Formats to a Clinical Data Management SystemMigrating Clinical Data in Various Formats to a Clinical Data Management System
Migrating Clinical Data in Various Formats to a Clinical Data Management SystemPerficient, Inc.
 
Cloudera Impala 1.0
Cloudera Impala 1.0Cloudera Impala 1.0
Cloudera Impala 1.0Minwoo Kim
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 
Launching your advanced analytics program for success in a mature industry
Launching your advanced analytics program for success in a mature industryLaunching your advanced analytics program for success in a mature industry
Launching your advanced analytics program for success in a mature industryDataWorks Summit/Hadoop Summit
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopApache Apex
 
Scalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkScalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkJen Aman
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
A better business case for big data with Hadoop
A better business case for big data with HadoopA better business case for big data with Hadoop
A better business case for big data with HadoopAptitude Software
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습동현 강
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
[D2 COMMUNITY] Spark User Group - 스파크를 통한 딥러닝 이론과 실제
[D2 COMMUNITY] Spark User Group - 스파크를 통한 딥러닝 이론과 실제[D2 COMMUNITY] Spark User Group - 스파크를 통한 딥러닝 이론과 실제
[D2 COMMUNITY] Spark User Group - 스파크를 통한 딥러닝 이론과 실제NAVER D2
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 

Destaque (20)

2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
2016 Spark Summit East Keynote: Ali Ghodsi and Databricks Community Edition demo
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Big Data Platform Industrialization
Big Data Platform Industrialization Big Data Platform Industrialization
Big Data Platform Industrialization
 
Migrating Clinical Data in Various Formats to a Clinical Data Management System
Migrating Clinical Data in Various Formats to a Clinical Data Management SystemMigrating Clinical Data in Various Formats to a Clinical Data Management System
Migrating Clinical Data in Various Formats to a Clinical Data Management System
 
Cloudera Impala 1.0
Cloudera Impala 1.0Cloudera Impala 1.0
Cloudera Impala 1.0
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Launching your advanced analytics program for success in a mature industry
Launching your advanced analytics program for success in a mature industryLaunching your advanced analytics program for success in a mature industry
Launching your advanced analytics program for success in a mature industry
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Scalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With SparkScalable And Incremental Data Profiling With Spark
Scalable And Incremental Data Profiling With Spark
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
A better business case for big data with Hadoop
A better business case for big data with HadoopA better business case for big data with Hadoop
A better business case for big data with Hadoop
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
Apache spark 소개 및 실습
Apache spark 소개 및 실습Apache spark 소개 및 실습
Apache spark 소개 및 실습
 
Hadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash CourseHadoop Summit Tokyo Apache NiFi Crash Course
Hadoop Summit Tokyo Apache NiFi Crash Course
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
[D2 COMMUNITY] Spark User Group - 스파크를 통한 딥러닝 이론과 실제
[D2 COMMUNITY] Spark User Group - 스파크를 통한 딥러닝 이론과 실제[D2 COMMUNITY] Spark User Group - 스파크를 통한 딥러닝 이론과 실제
[D2 COMMUNITY] Spark User Group - 스파크를 통한 딥러닝 이론과 실제
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 

Semelhante a SAAS Migration to Hadoop Challenges

Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)Open source applied - Real world use cases (Presented at Open Source 101)
Open source applied - Real world use cases (Presented at Open Source 101)Rogue Wave Software
 
Open Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesOpen Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesAll Things Open
 
C1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategyC1 keynote creating_your_enterprise_cloud_strategy
C1 keynote creating_your_enterprise_cloud_strategyDr. Wilfred Lin (Ph.D.)
 
Does Big Data Spell Big Costs- Impetus Webinar
Does Big Data Spell Big Costs- Impetus WebinarDoes Big Data Spell Big Costs- Impetus Webinar
Does Big Data Spell Big Costs- Impetus WebinarImpetus Technologies
 
OOW-5185-Hybrid Cloud
OOW-5185-Hybrid CloudOOW-5185-Hybrid Cloud
OOW-5185-Hybrid CloudBen Duan
 
C4 optimizing your_application_infrastructure
C4 optimizing your_application_infrastructureC4 optimizing your_application_infrastructure
C4 optimizing your_application_infrastructureDr. Wilfred Lin (Ph.D.)
 
Top 5 benefits of docker
Top 5 benefits of dockerTop 5 benefits of docker
Top 5 benefits of dockerJohn Zaccone
 
ThatConference 2016 - Highly Available Node.js
ThatConference 2016 - Highly Available Node.jsThatConference 2016 - Highly Available Node.js
ThatConference 2016 - Highly Available Node.jsBrad Williams
 
Monitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to backMonitor OpenStack Environments from the bottom up and front to back
Monitor OpenStack Environments from the bottom up and front to backIcinga
 
Superfast Business - Moving to the Cloud
Superfast Business - Moving to the CloudSuperfast Business - Moving to the Cloud
Superfast Business - Moving to the CloudSuperfast Business
 
TEC118 – How Do You Manage the Configuration of Your Environments from Metal ...
SAAS Migration to Hadoop Challenges

  • 1. Successes, Challenges and Pitfalls Migrating a SAAS Business to Hadoop Shaun Klopfenstein, CTO Eric Kienle, Chief Architect
  • 4. Page 4 Marketo Proprietary and Confidential | © Marketo, Inc. 7/11/2016 Business Requirements • Near real-time activity processing • 1 billion activities per customer per day • Improve cost efficiency of operations while scaling up • Global enterprise-grade security and governance
  • 5. Architecture Requirements • Maximize utilization of hardware • Multitenancy support with fairness • Encryption, authorization & authentication • Applications must scale horizontally
  • 7. Bake Off • Technology selection • Storm/Spark Streaming • HBase/Cassandra • Built a POC with each permutation + Kafka • Load tested with one day of web traffic
  • 8. The Winner Is… Our First Challenge • We hoped to find a clear winner… we didn’t, exactly • The truth is all the POCs worked at the scale we tested • It’s possible that if we had scaled up the test, we would have found more differences
  • 9. How We Chose • Community • Features • Team skillset • History • The winners: HBase/Kafka/Spark Streaming
  • 11. Marketo Lambda Architecture (diagram) • Sources: Web Activity, RTP Activity, Mobile Activity, CRM Sync, Partner APIs, Other Marketing Activities • Ingestion Processor (Scala/Tomcat) • Kafka Event Stream • Spark Streaming Consumers: Campaign Triggers, Solr Indexing (Spark Streaming Indexer), Email Report Loader, Web Activity Processor • Storage: HBase, HDFS, Solr • Clients: Marketo UI (Campaign Detail, Lead Detail), CRM Sync, Revenue Cycle Analytics, APIs, Other Clients
  • 12. High Level Architecture • Enhanced Lambda Architecture • Inbound activities written to the Ingestion Processor • HBase and then Kafka • High-volume (e.g. web) activities • First written to Kafka, then enriched • Spark Streaming applications consume events from Kafka • Solr indexing • Email reports • Campaign processing • HBase is used for simple historical queries, and is the system of record
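The two ingestion paths above can be sketched as a toy model. Everything below is an illustrative stand-in — `hbase` and `kafka` are in-memory Python objects rather than real clients, and the field names are invented:

```python
# Minimal in-memory sketch of the dual ingestion paths: standard
# activities go to HBase (system of record) and then Kafka; high-volume
# web activities hit Kafka first and are enriched/persisted downstream.

hbase = {}    # stand-in for HBase: row key -> activity
kafka = []    # stand-in for the Kafka event stream

def ingest_standard(activity):
    """Standard path: persist to HBase first, then publish to Kafka."""
    row_key = f"{activity['lead_id']}:{activity['ts']}"
    hbase[row_key] = activity
    kafka.append(activity)

def ingest_high_volume(activity):
    """High-volume path (e.g. web): publish to Kafka only; a
    downstream consumer enriches and persists it later."""
    kafka.append(activity)

def enrich_and_persist(activity):
    """Downstream Spark-Streaming-style consumer for web activities."""
    enriched = dict(activity, enriched=True)
    row_key = f"{enriched['lead_id']}:{enriched['ts']}"
    hbase[row_key] = enriched
    return enriched

# Demo: one standard activity, one web activity enriched after the fact.
ingest_standard({"lead_id": 1, "ts": 100, "type": "email_open"})
ingest_high_volume({"lead_id": 2, "ts": 101, "type": "web_visit"})
enrich_and_persist(kafka[-1])
```

The point of the split is that the write-heavy web path never blocks on HBase at ingest time, while lower-volume activities get durable storage before they are published.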
  • 14. Building Expertise • We had a few people with Hadoop and Spark experience • We decided to grow knowledge in house • Focus on training – Hortonworks boot camp for operations • In-house courses and tech talks for engineering/QE
  • 15. Building Expertise – Successes • Critical to kick-start the project • Built excitement • Created a foundation for the design process
  • 16. Building Expertise – Context Challenge • Challenge • Training packed a lot of information into a short period • Teams that didn’t leverage the training right away lost context • Recommendation • Create environments for hands-on experience early • Hands-on experience across all teams right after training
  • 17. Building Expertise – Experience Challenge • Challenge • Hadoop technology is like playing a piano… knowing how to read music doesn’t mean you can play • Many ways to design, configure, and manage – only a few right ways, and the reasons can be subtle • Recommendation • Find your experts! • Partner and hire
  • 18. Building Our First Cluster • Initial sizing and capacity planning of the first Hadoop clusters • Performed load tests to get an initial capacity plan • Decided that disk I/O and storage would be the leading indicator • Went with industry best practice on hardware and network configuration
  • 19. Building Our First Cluster – Success • The leading indicator ended up being compute • But cluster sizing ended up being close enough to start • Clusters can always be expanded… so don’t get too hung up
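Initial sizing of this kind is mostly back-of-the-envelope arithmetic. The sketch below shows the shape of a storage-led calculation; every number in it is an assumption for illustration (only the 1 billion activities/day figure comes from the requirements), not Marketo's actual plan:

```python
import math

# Back-of-the-envelope sizing for a first cluster, storage-led as on
# slide 18. All constants are illustrative assumptions.

activities_per_day = 1_000_000_000   # per-customer requirement, worst case
avg_activity_bytes = 500             # assumed serialized size per activity
hdfs_replication = 3                 # default HDFS replication factor
retention_days = 90                  # assumed retention window
usable_frac = 0.7                    # keep ~30% headroom for compactions etc.
node_disk_bytes = 12 * 4 * 10**12    # assumed 12 x 4 TB data disks per node

raw_per_day = activities_per_day * avg_activity_bytes * hdfs_replication
total_needed = raw_per_day * retention_days
nodes = math.ceil(total_needed / (node_disk_bytes * usable_frac))
```

As the next slide notes, compute (not disk) turned out to be the real leading indicator, which is exactly why a plan like this only needs to be "close enough to start" — clusters can be expanded.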
  • 20. Building Our First Cluster – ZooKeeper & VM Challenge • We started with ZooKeeper virtualized • It didn’t perform properly (we think because of disk I/O) • Caused random outages • Recommendation • We ended up migrating ZooKeeper to physical boxes • Don’t use VMs for ZooKeeper!
  • 21. Security • All data at rest must be encrypted • Applications sharing Hadoop must be isolated from each other • Applications must have hard quotas for both compute and disk resources
  • 22. Security – Success • Enabled Kerberos security for the Hadoop cluster • Kerberos allowed us to leverage HDFS native encryption • Used encrypted disks for Kafka servers • Created separate secure YARN queues to isolate applications • Each application uses a separate Kerberos principal
  • 23. Security – Kerberos Challenge • Challenge • Kerberos can’t be added to a Hadoop cluster without prolonged downtime and patches • Needed weeks of developer time to accommodate security changes • Added several months to the overall rollout schedule • Recommendation • Allow extra time for Kerberos • Educate your team beforehand and find an expert to guide you • Be prepared for different levels of Kerberos support across the Hadoop ecosystem
  • 24. Security – Kafka and Spark Challenge • Challenge • Kafka doesn’t support data encryption (and won’t) • The HDP version we had didn’t fully support Kerberized Kafka and Spark clients • Recommendation • Move Kafka and Spark out of Ambari • Only encrypt Kafka data if you absolutely must, as it adds complexity
  • 26. Validation • Changing the engines on a plane while in flight is hard • Required all components to implement “passive mode” • The new code ran in the background and continuously compared results with the legacy system • Automated functional tests kicked off from Jenkins • Performance testing at AWS
  • 27. Validation – Success • Passive mode is one of the best moves we made! • Allowed for testing of components with real-world data and load • Found countless performance and logic issues with minimal operational impact
  • 28. Validation – Passive Mode “Minimal Impact” Challenge • Challenge • By design, passive mode wrote to both the legacy and Hadoop systems • We impacted performance during an outage of our cluster • Recommendation • Use asynchronous writes or tight timeouts in passive mode • Monitoring for the Hadoop cluster should be in place before passive testing
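One way to implement the "tight timeouts" recommendation is to bound how long the production path ever waits on the shadow write. A minimal Python sketch, where the write functions are hypothetical stand-ins for the legacy and Hadoop writers:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

executor = ThreadPoolExecutor(max_workers=4)

def dual_write(activity, write_legacy, write_hadoop, shadow_timeout=0.1):
    """Passive-mode write: the legacy (production) write is synchronous
    and must succeed; the shadow Hadoop write gets a bounded wait so a
    struggling cluster can never stall the production path."""
    write_legacy(activity)
    future = executor.submit(write_hadoop, activity)
    try:
        future.result(timeout=shadow_timeout)
        return "shadow_ok"
    except FutureTimeout:
        return "shadow_timed_out"   # alert on this, but don't fail the request
    except Exception:
        return "shadow_failed"

# Demo: a healthy shadow write vs. one simulating a struggling cluster.
legacy_store = []
r_ok = dual_write({"id": 1}, legacy_store.append, lambda a: None)
r_slow = dual_write({"id": 2}, legacy_store.append, lambda a: time.sleep(0.5))
```

With this shape, a full Hadoop outage degrades to a stream of "shadow_timed_out" alerts rather than the production slowdown the slide describes.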
  • 30. Migration and Management • We are here! • Migrate over 6,000 subscriptions with no service interruption or data loss • Track and monitor the migration and provide management tools for the new platform • Achieve the end goal of removing the safety net
  • 31. Migration and Management – Successes • Created a new management console called Sirius • Close architectural coordination of all teams during migration • If problems arose, we had a quick, automated fallback path to the legacy system • Daily cross-functional standup meetings to track the rollout
  • 32.
  • 33. Migration and Management Challenges • Challenge • Oozie workflows can be challenging to build and debug • Capacity planning and resource management in the shared Hadoop cluster is very complex • Recommendation • Only use Oozie workflows for automating complex or long-running processes, or use a different orchestration platform • Constantly reevaluate your capacity plan based on the current deployment
  • 35. Monitoring • Needed to monitor hundreds of new Hadoop and other infrastructure servers • Our custom Spark Streaming applications required all-new metrics and monitors • Capacity planning requires trend analysis of both the infrastructure and our applications • Don’t overwhelm our already busy Cloud Platform Team
  • 36. Monitoring – Successes • Built a custom monitoring infrastructure using OpenTSDB and Grafana • Added business SLA metrics to our Sirius console to provide real-time alerts • Added comprehensive Hadoop monitors to our pre-existing production monitoring system
  • 37. Monitoring – Challenges • Challenges • Adding hundreds of servers and a dozen new applications makes for a huge monitoring task • Nagios is a very general-purpose system and isn’t designed to monitor Hadoop out of the box • Recommendations • Make sure that you have monitors and trend analysis in place and tested before migration • Be prepared to constantly refine and improve your monitors and alerts
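For context on the OpenTSDB/Grafana setup: OpenTSDB ingests data points as JSON posted to its HTTP `/api/put` endpoint, each with a metric name, timestamp, numeric value, and at least one tag. A small sketch of building such a payload — the metric and tag names are made up for illustration, and the actual HTTP POST to the TSD is omitted:

```python
import json
import re
import time

# OpenTSDB accepts metric/tag names built from alphanumerics plus - _ . /
_NAME_RE = re.compile(r"^[A-Za-z0-9._\-/]+$")

def datapoint(metric, value, tags, timestamp=None):
    """Build one OpenTSDB /api/put data point; requires >= 1 tag."""
    if not _NAME_RE.match(metric):
        raise ValueError(f"invalid metric name: {metric}")
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    return {
        "metric": metric,
        "timestamp": int(timestamp if timestamp is not None else time.time()),
        "value": value,
        "tags": dict(tags),
    }

# Example: an application-level Spark Streaming metric (names are invented).
payload = json.dumps([
    datapoint("spark.streaming.batch_delay_ms", 1200,
              {"app": "solr_indexer", "host": "worker01"},
              timestamp=1468224000),
])
```

Tagging by application and host is what makes the per-app trend analysis and SLA dashboards mentioned above possible in Grafana.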
  • 38. Patching and Upgrading • We have a zero-downtime requirement for applications • Patching and upgrading of either the infrastructure or our own applications is problematic • Keeping up with the community requires frequent patching • Eventually hundreds of Spark Streaming jobs will need to be constantly processing data with no interruption
  • 39. Patching and Upgrading – Successes • Use the Sirius console to manage Spark Streaming jobs • Marketo’s Kafka consumer allows streaming jobs to pick up where they left off after a restart • Integrated the existing Jenkins infrastructure with the Sirius console to provide painless automated patching/upgrades
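The "pick up where they left off" behaviour boils down to committing consumer offsets after processing and resuming from the last committed offset on restart, i.e. at-least-once delivery. A toy in-memory model (not the actual Marketo consumer):

```python
# Toy model of offset commit/resume: a consumer commits after each
# record it processes, so a restarted consumer resumes from the last
# committed offset rather than reprocessing the whole log.

class OffsetStore:
    def __init__(self):
        self.committed = {}          # (topic, partition) -> next offset to read

    def commit(self, tp, offset):
        self.committed[tp] = offset

    def position(self, tp):
        return self.committed.get(tp, 0)

def consume(log, store, tp, process, crash_after=None):
    """Read from the last committed offset; commit after each record.
    crash_after simulates a restart mid-stream."""
    processed = []
    for offset in range(store.position(tp), len(log)):
        if crash_after is not None and len(processed) >= crash_after:
            return processed         # simulate a crash / planned restart
        process(log[offset])
        processed.append(offset)
        store.commit(tp, offset + 1)
    return processed

# Demo: job is restarted after two records and resumes seamlessly.
log = ["a", "b", "c", "d", "e"]
store = OffsetStore()
seen = []
consume(log, store, ("activities", 0), seen.append, crash_after=2)
consume(log, store, ("activities", 0), seen.append)
```

Committing after (not before) processing is what makes restarts safe for patching: the worst case is reprocessing a record, never losing one.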
  • 40. Infrastructure Patching and Upgrading – Challenges • Challenges • Patches/upgrades managed with Ambari – not perfect! • We almost never get through an upgrade without one or more Hadoop components having downtime (so far) • Recommendations • Test all infrastructure patches and upgrades in a loaded non-production environment • Check out the start and stop scripts from the component-specific open source communities, rather than relying on Ambari

Editor's Notes

  1. 18 months ago our team kicked off an ambitious project which we have since named orion.   A group of us came to Hadoop summit to learn as much as we could.  That experience is the inspiration for this talk We wanted to share is about what we have learned over the last 18 months.  What worked well, and what we would do differently
  2. Although the talk isn’t about the project…  we have a few slides up front to set the context around what we are working on If you have been near technology at all in the last couple of years you know that the world has become very connected.   The number of connected devices blows my mind.  It’s not just phones anymore…   Amazon dash buttons, coffee makers, propane tanks, garage doors.  These devices are sending 10’s of billions of activities and user interactions every day... Orion is our platfor Our marketing platform ingests the user interactions process them into relevant marketing touchpoints Its enables marketers to create marketing campaigns around these activities to build relationships with their customers Become the fabric for marketers Its been a great experience building this
  3. Here are a few of the requirements Near real time processing At least a 1 billion activities per customer per day. customer demands from increasing devices caused us to evaluate next get queueing and streaming... reduction in infrastructure COGS primarily from expensive enterprise class filers... reduction in people COGS by gained efficiency from reducing tech stack from using too many similar technologies ... Multitenant… of course Secure Customer isolation and improved resource management
  4. Architecture requirements driven by the business requirements. Improve utilization over the existing system. Lots of customers in the same infrastructure, without starving any of them. Encryption from day 1 for safe data storage. Aim for horizontal scalability. Radically reduce processing latency. Eliminate backlogs. Brownout protection.
  5. Bake-off to decide which platform to use. Built POCs to pick the best tech stack. Researched various technologies, Hadoop and non-Hadoop.
  6. Decided to take a day's worth of web traffic and build POCs: Storm/Spark Streaming as the event processing platforms, HBase/Cassandra for storage, and Kafka as the event queue.
  7. All combos worked; no clear winner. The amount of load generated was not enough to differentiate them – had we scaled the test up, we likely would have found more differences.
  8. Community - Spark had much more active community than Storm Features - Spark solved batch processing, something Storm couldn't do Team Experience - HBase to leverage existing Hadoop expertise History – Our team had poor experiences scaling up our existing Cassandra cluster
  9. A few words about the architecture. The main goal is to ingest, process, and store marketing events.
  10. High-level diagram of our event processor – an enhanced Lambda Architecture. Inbound activities are written to the Ingestion Processor, then to HBase and Kafka. High-volume (e.g. web) activities are first written to Kafka, then enriched. Spark Streaming applications consume events from Kafka: Solr indexing, email reports, campaign processing. HBase is used for simple historical queries and is the system of record.
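The two ingestion paths described above can be sketched with in-memory stand-ins for HBase and Kafka. All class and method names here are illustrative, not Marketo's actual code.

```python
# Sketch of the dual ingestion paths: standard activities go to the
# system of record (HBase) and then to Kafka; high-volume activities
# land on Kafka first and are enriched/persisted downstream.

class InMemoryStore:
    """Stand-in for HBase, the system of record."""
    def __init__(self):
        self.rows = []
    def put(self, event):
        self.rows.append(event)

class InMemoryQueue:
    """Stand-in for a Kafka topic."""
    def __init__(self):
        self.messages = []
    def publish(self, event):
        self.messages.append(event)

class IngestionProcessor:
    HIGH_VOLUME = {"web", "mobile", "rtp"}

    def __init__(self, hbase, kafka):
        self.hbase, self.kafka = hbase, kafka

    def ingest(self, event):
        if event["type"] in self.HIGH_VOLUME:
            # High-volume path: Kafka first, enrichment happens later.
            self.kafka.publish(event)
        else:
            # Standard path: persist, then publish for the Spark
            # Streaming consumers (campaigns, Solr, reports).
            self.hbase.put(event)
            self.kafka.publish(event)
```

Either way, every event ends up on the Kafka stream that the Spark Streaming consumers read; only the write ordering differs by volume class.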
  11. Reiterates my points from the last slide; included in case you want to look at the slides later.
  12. Next we are going to talk about some key points from the implementation phase of the project: lots of learnings around training, getting the first cluster running, and security.
  13. One of the first things we did was build expertise and grow knowledge in house: tech talks led by the architecture team on the new infrastructure, online courses (Coursera) for Scala and Hadoop, onsite training for Scala (the preferred language for Spark Streaming), and a Hortonworks bootcamp to train operators.
  14. Training helped us kickstart the project by getting people in the right mindset. It helped people feel included in the project/process, got people thinking about new technologies, and created a nice foundation for the design process.
  15. Early training was great, but groups who didn’t apply the knowledge immediately lost context. Next time we would set up Hadoop environments early to let people get hands-on experience right away, and that hands-on experience should span all teams. For example, developers were developing in Spark standalone mode and made a rough transition to YARN cluster mode.
  16. The Hadoop ecosystem is quite complex. The design space is large, with only a few right ways, and the difference between right and wrong can be very subtle. The best way to navigate is to find experts – hire if possible, or get expertise from a partner like Hortonworks. You need experts!
  17. Our next task was figuring out how to build our first cluster, which is quite daunting. We took a scientific approach: took the POC, ran load tests in AWS, and built a scale model there. The leading indicator appeared to be disk I/O. We also asked HP and Hortonworks for hardware recommendations and best practices around server builds.
  18. It turned out the leading indicator was not disk – it's compute. We can add either disk-only or compute-only nodes to scale. Do the initial sizing exercise, but don’t get too hung up on the cluster composition: you will end up resizing and tuning as you scale up anyway. You can always scale up later; don’t overscale from day one.
  19. Zookeeper, for those of you who are new to Hadoop, is the cluster coordination service. It is not in the path of direct user queries. Running ZK in VMs did not work well – we think it was disk I/O. Moved to physical boxes and life was much better.
  20. Why talk about security right alongside capacity? Because from the beginning the infrastructure needed to meet enterprise security requirements: all applications are isolated, and each application's resource usage (disk I/O, etc.) is restricted.
  21. Hadoop has support for Kerberos (some parts better than others), HDFS native disk encryption, encrypted disks for Kafka (because of its lack of native support), and isolated YARN queues.
  22. Kerberos is really, really hard. Allow extra time for it, do training first, and find someone who has done it before (much easier than going it alone). Kerberos support varies by component – not so great in Kafka. We still have some bugs we are trying to work out.
  23. Kafka doesn’t support data encryption (and won’t, because of performance), but disk encryption ended up not being a critical performance blocker. We ended up rolling back Kerberization for Spark. Consider moving Kafka and Spark out of Ambari and managing them yourself if you don’t need the features: more control over versions, and you can take patches faster. They are only loosely integrated for now.
  24. The next phase came when we were ready to validate our newly built event ingestion system.
  25. We wanted to validate that the new system performed as a functional superset of the old one. Doing this on a running system is extremely difficult, so we decided early on to require all components to implement a silent (passive) mode. This allows us to test for correctness with real data, in the wild. We also had automated CI tests in Jenkins and perf testing in AWS.
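The silent/passive-mode idea can be sketched as a shadow runner: mirror each production request to the new pipeline, record mismatches, and always return the legacy answer. A minimal sketch with illustrative names, not Marketo's actual implementation:

```python
# "Silent mode" sketch: the new system sees real traffic but can never
# change what callers observe; disagreements are recorded for analysis.

class PassiveModeRunner:
    def __init__(self, legacy_fn, new_fn):
        self.legacy_fn, self.new_fn = legacy_fn, new_fn
        self.mismatches = []

    def handle(self, request):
        legacy_result = self.legacy_fn(request)
        try:
            new_result = self.new_fn(request)
            if new_result != legacy_result:
                self.mismatches.append((request, legacy_result, new_result))
        except Exception as exc:
            # A crash in the new system must not affect production.
            self.mismatches.append((request, legacy_result, exc))
        return legacy_result  # callers only ever see the legacy answer
```

Comparing outputs on live traffic is what surfaced the "countless bugs and config issues" mentioned on the next slide.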
  26. Passive mode was one of the best moves we made – it found countless bugs and config issues. Real-world load testing. Super valuable – worth the cost of implementation.
  27. By design we write to both the legacy and new systems, which caused a performance issue due to slow writes. The cluster didn’t really go all the way down – we had overloaded ZK. We still recommend passive mode: use short timeouts or write asynchronously, and make sure you have monitors in place even for passive mode.
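The "short timeouts or write async" recommendation can be sketched like this: the shadow write runs on a worker thread with a bounded wait, so a slow or failing new system cannot stall the legacy path. Function names and the timeout value are illustrative.

```python
# Bounded shadow write: the primary (legacy) write is synchronous, the
# shadow write gets at most `timeout_s` of the caller's time, and shadow
# failures never propagate.

import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def dual_write(legacy_write, shadow_write, event, timeout_s=0.05):
    legacy_write(event)                      # primary path, synchronous
    future = _pool.submit(shadow_write, event)
    try:
        future.result(timeout=timeout_s)     # bounded wait for the shadow
    except concurrent.futures.TimeoutError:
        pass                                 # shadow continues in background
    except Exception:
        pass                                 # shadow errors are swallowed
```

In a real deployment you would also count timeouts and errors as metrics, since those are exactly the signals passive mode exists to surface.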
  28. After proving the service in passive mode for beta customers, we were ready to migrate – a massive undertaking.
  29. Ready to migrate 6,000 subscriptions (customers) without any service interruption or downtime. Non-trivial! Marketo has a 24/7/365 commitment. Migrate customers a few subscriptions at a time, create management and migration tools, then delete data out of the relational database.
  30. In order to manage the migration we created Sirius. The human factor: about 10 teams and 30 subcomponents, with the whole team involved closely in the migration. Automated fallback to the legacy system if a problem arose, and a daily standup to track the rollout.
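The rollout strategy above (a few subscriptions at a time, with fallback to legacy on any problem) can be sketched in a few lines. All names, the batch size, and the verify step are illustrative, not Sirius internals:

```python
# Batched migration with per-subscription fallback: migrate small
# batches, verify each subscription, and flip failures back to legacy.

def migrate_in_batches(subscriptions, migrate, verify, batch_size=10):
    """Return (migrated, rolled_back) lists of subscription ids."""
    migrated, rolled_back = [], []
    for i in range(0, len(subscriptions), batch_size):
        for sub in subscriptions[i:i + batch_size]:
            migrate(sub)
            if verify(sub):
                migrated.append(sub)
            else:
                rolled_back.append(sub)  # falls back to the legacy system
    return migrated, rolled_back
```

Keeping the unit of rollback a single subscription is what makes a 24/7/365 commitment survivable: one bad tenant never forces a fleet-wide revert.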
  31. This is a picture of our management console – all test data in this example.
  32. One big challenge: Sirius is built on top of Oozie, which is powerful but very complex. Capacity planning was also more complex than we thought; we ended up iterating: ramp up customers -> capacity plan -> ramp up again. Only use Oozie if you have to. It's important to capacity plan in the wild – one team ended up needing 10% of their original estimate.
  33. We have already had several learnings running this new infrastructure. It's challenging to keep track of dozens of applications across hundreds of servers.
  34. First, we needed to add monitors for all the new servers (~350). We created a bunch of Spark Streaming applications, all needing metrics reporting and monitors. Metrics are used for capacity planning and for ensuring we are meeting the business metrics for the project. Didn’t want to overwhelm CTP.
  35. Hadoop requires a lot of monitoring. We built a custom monitoring and metrics system using OpenTSDB and Grafana, which allows us to do trend analysis on Hadoop and other infrastructure. We instrumented all of our new applications to report metrics, and the Sirius console monitors the business-level metrics. In addition, we added a comprehensive set of Hadoop monitors to our pre-existing production monitoring system (Nagios) to alert our operators of infrastructure issues.
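Instrumented applications report datapoints to OpenTSDB over HTTP in the JSON shape its `/api/put` endpoint accepts: `metric`, `timestamp`, `value`, and at least one `tags` entry. A small sketch of building that payload; metric and tag names are illustrative, and a real sender would POST this to the TSD:

```python
# Build an OpenTSDB /api/put JSON payload for an application metric.

import json
import time

def datapoint(metric, value, tags, ts=None):
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per point")
    return {
        "metric": metric,
        "timestamp": int(ts if ts is not None else time.time()),
        "value": value,
        "tags": tags,
    }

payload = json.dumps([
    datapoint("ingestion.events.processed", 11574,
              {"host": "ingest01", "tenant": "demo"}, ts=1468195200),
])
```

Tagging every point with host and tenant is what makes per-application and per-customer trend analysis possible in Grafana.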
  36. A big challenge was creating all the monitors needed to know the health of the systems. We are constantly tuning them to make sure we aren’t over- or under-alerting – creating “Goldilocks” alerts for the operators: not too noisy, not too quiet.
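One common tactic for "Goldilocks" alerting (offered here as an illustrative sketch, not a description of Marketo's monitors) is to require a threshold to be breached several checks in a row before paging, so a single noisy sample doesn't wake anyone while a sustained problem still fires quickly:

```python
# Debounced alert: fire only after N consecutive threshold breaches.

class DebouncedAlert:
    def __init__(self, threshold, required_breaches=3):
        self.threshold = threshold
        self.required = required_breaches
        self.streak = 0

    def observe(self, value):
        """Return True when the alert should fire."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any healthy sample resets the streak
        return self.streak >= self.required
```

Tuning `threshold` and `required_breaches` per monitor is exactly the kind of ongoing adjustment the note above describes.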
  37. A big challenge with Spark Streaming and YARN is that there isn’t any built-in facility for patching and upgrading with zero downtime – really true across all Hadoop components. Eventually we will have hundreds of Spark Streaming jobs running, and we need to upgrade them without interruption.
  38. We decided early on to build our own tooling for managing patches and upgrades. It allows us to deploy a new set of Spark Streaming applications without interruption: Kafka consumers are coded to let jobs pick up where they left off, and it's integrated with our CI system. Sirius uses the Oozie workflow engine to manage orchestration during patches/upgrades with minimal downtime.
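The "pick up where they left off" behavior rests on committed offsets: a consumer persists the last processed position per partition, so a replacement job started during an upgrade resumes from that point instead of skipping or reprocessing events. A toy sketch with illustrative names (the real jobs use Kafka's offset management):

```python
# Offset-checkpointing consumer: commit after each handled record so a
# restarted consumer resumes exactly where its predecessor stopped.

class CheckpointedConsumer:
    def __init__(self, checkpoint_store, partition):
        self.store = checkpoint_store  # e.g. a dict here; Kafka/ZK in real life
        self.partition = partition

    def process(self, log, handle):
        start = self.store.get(self.partition, 0)
        for offset in range(start, len(log)):
            handle(log[offset])
            self.store[self.partition] = offset + 1  # commit after handling
```

Committing after handling gives at-least-once delivery across restarts, which is why downstream handlers in such pipelines are typically written to be idempotent.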
  39. One big challenge is that Ambari doesn’t always stop and start infrastructure in a way that avoids service interruption. We have been close, but not successful. Test under load! It makes a huge difference – you will hit timeouts, etc. that upset Ambari. Check out the communities' graceful restart scripts; they seem to be further along. Hortonworks has been very good about learning from our issues and improving the upgrade process.