Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Amazon Elastic Map/Reduce" by Simone Brunozzi
1. Making Hadoop Enterprise Ready with Amazon Elastic Map/Reduce. Simone Brunozzi, Technology Evangelist, Amazon Web Services, APAC. Twitter: @simon. Blog: www.brunozzi.com
2. Agenda: What is Elastic MapReduce; Use Cases; Service Features; New Feature Announcements; Elastic MapReduce Ecosystem
3. What is Amazon Elastic MapReduce: Enables customers to easily, securely and cost-effectively process vast amounts of data. Spin up 10s, 100s or even 1000s of instances; process 10s or 100s of terabytes of data. A hosted Hadoop framework running on the web-scale infrastructure of Amazon.
8. Problems customers solve with Elastic MapReduce: data mining and BI; log processing, click stream analysis, similarities, advertising; data warehousing applications; bioinformatics (genome analysis); financial simulation (Monte Carlo simulation); file processing (resize JPEGs); web indexing
11. Elastic MapReduce, Hive features: Apache Hive; batch and interactive mode; support for Hive steps; integration with Elastic MapReduce Client and Management Console; load table partitions automatically to/from Amazon S3; optimized data writes to Amazon S3; reference resources such as streaming scripts located on Amazon S3; specify an off-instance metadata store; support for variables defined directly in the Hive script; support for JDBC and ODBC connections
12. Elastic MapReduce, Pig features: Apache Pig; batch and interactive mode; support for Pig steps; integration with Elastic MapReduce Client and Management Console; concurrent access to multiple file systems (HDFS, Amazon S3); reference resources in Amazon S3 directly from the Pig script; several User Defined Functions in Piggy Bank
17. Amazon Elastic MapReduce for Enterprise: Enterprise customers need more flexibility (configuring clusters, running clusters, paying for clusters); enterprise customers need more tools (application development, data analytics); enterprise customers need support options (forum support is not enough)
18. Amazon Elastic MapReduce features: Bootstrap actions. Run arbitrary scripts before the job flow begins; run on all nodes before data processing begins. Used for Hadoop configuration (site-conf, Hadoop-conf, etc.), cluster configuration (memory, swap, etc.) and application/package installation (apt-get install r-base). Several predefined bootstrap actions are available.
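As a rough illustration of how bootstrap actions are attached to a job flow, the sketch below builds a `BootstrapActions` list in the shape accepted by the `run_job_flow` call of boto3's EMR client. boto3 postdates this talk and is an assumption here; the bucket name, script paths and arguments are hypothetical, and nothing is sent to AWS.

```python
def bootstrap_action(name, script_path, args=None):
    """Build one entry for the BootstrapActions list of an EMR job flow request."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,        # S3 location of a script run on every node
            "Args": list(args or []),   # arguments passed to that script
        },
    }


# Hypothetical scripts: one installs a package, one adjusts cluster configuration.
BOOTSTRAP_ACTIONS = [
    bootstrap_action("Install R", "s3://example-bucket/bootstrap/install-r.sh"),
    bootstrap_action(
        "Add swap",
        "s3://example-bucket/bootstrap/add-swap.sh",
        ["--size-mb", "1024"],
    ),
]

for action in BOOTSTRAP_ACTIONS:
    print(action["Name"], "->", action["ScriptBootstrapAction"]["Path"])
```

In a real request this list would be passed as the `BootstrapActions` argument alongside the instance and step definitions.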
21. Amazon Elastic MapReduce, new features. Preannounce: expand running clusters. Increase the number of nodes in a running cluster; increase processing speed; increase HDFS size
22. Preannounce: expand/shrink clusters. Use case: increase the speed of running job flows. Speed up job flow execution in response to changing requirements; dynamically balance cost versus performance without restarting a job. [Figure: a job flow allocated 4 instances, time remaining 14 hours; expanded to 9 instances, time remaining 7 hours; expanded to 25 instances, time remaining 3 hours]
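The figures on this slide (4, 9 and 25 instances; 14, 7 and 3 hours) are consistent with a simple model in which the remaining work divides evenly across instances and time is rounded up to whole hours. That model is an assumption made here for illustration, not something the slides state; real job flows rarely scale perfectly linearly.

```python
import math


def hours_remaining(instance_hours_left, instances):
    """Estimate whole hours left, assuming work splits evenly across instances."""
    return math.ceil(instance_hours_left / instances)


# 4 instances with 14 hours remaining is 56 instance-hours of work.
work_left = 4 * 14

for n in (4, 9, 25):
    print(f"{n} instances -> {hours_remaining(work_left, n)} hours")
# 4 instances -> 14 hours
# 9 instances -> 7 hours
# 25 instances -> 3 hours
```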
23. Amazon Elastic MapReduce, new features: Shrink running clusters. Decrease the number of nodes in a running job flow; different capacity requirements from step to step; automatically regulate capacity between steps
24. Expand/shrink clusters. Use case: agile data warehouse cluster. Customize cluster size to support varying resource needs (e.g., query support during the day versus batch processing overnight); leverage flexibility to reduce costs and increase cluster utilization. [Figure: data warehouse (steady state), allocate 9 instances; data warehouse (batch processing), expand to 25 instances; data warehouse (steady state), shrink to 9 instances]
27. What is a Spot Instance? A way to purchase and consume EC2 instances based on compute value: reduce your computing costs; bid for unused EC2 capacity; control your costs. Differences from On-Demand Instances: request (you submit a maximum price bid); Spot Price (what you pay); termination (instances can be terminated when the Spot Price rises above your bid)
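The relationship between the maximum bid, the Spot Price and termination can be sketched with a small simulation. This is a deliberately simplified model assumed for illustration: each hour is charged at that hour's Spot Price, and the instance is terminated as soon as the Spot Price rises above the bid. The price series is made up.

```python
def simulate_spot(max_bid, hourly_spot_prices):
    """Return (hours_run, total_cost, terminated) for one Spot instance.

    Simplified model: you pay the Spot Price (not your bid) for each hour run,
    and the instance is terminated once the Spot Price exceeds the bid.
    """
    total = 0.0
    hours = 0
    for price in hourly_spot_prices:
        if price > max_bid:
            return hours, round(total, 2), True
        total += price
        hours += 1
    return hours, round(total, 2), False


prices = [0.20, 0.22, 0.25, 0.31, 0.24]  # made-up hourly Spot Prices in dollars

print(simulate_spot(0.30, prices))  # (3, 0.67, True): terminated in hour 4
print(simulate_spot(0.35, prices))  # (5, 1.22, False): runs to completion
```

A higher bid does not raise the hourly charge in this model; it only lowers the chance of termination.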
29. Amazon Elastic MapReduce, new feature: Spot pricing support for Elastic MapReduce job flows. Specify the price you want to pay for instances; Elastic MapReduce takes care of provisioning and of node addition and removal to/from the cluster. On-Demand and Spot instances can be mixed in the same cluster.
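One way to picture a cluster that mixes On-Demand and Spot capacity is as a list of instance groups, each with its own market. The sketch below builds such a list in the shape used by the `Instances.InstanceGroups` field of boto3's `run_job_flow`; boto3 is an assumption (it postdates this talk), and the group names, the master/core/task split and the $0.25 bid are illustrative choices, not taken from the slides.

```python
def instance_group(name, role, market, instance_type, count, bid_price=None):
    """Build one instance group description for an EMR job flow request."""
    group = {
        "Name": name,
        "InstanceRole": role,          # "MASTER", "CORE" or "TASK"
        "Market": market,              # "ON_DEMAND" or "SPOT"
        "InstanceType": instance_type,
        "InstanceCount": count,
    }
    if market == "SPOT" and bid_price is not None:
        group["BidPrice"] = str(bid_price)  # the API takes the bid as a string
    return group


# Illustrative layout: 4 On-Demand nodes plus 5 Spot task nodes.
INSTANCE_GROUPS = [
    instance_group("master", "MASTER", "ON_DEMAND", "m2.xlarge", 1),
    instance_group("core", "CORE", "ON_DEMAND", "m2.xlarge", 3),
    instance_group("spot-tasks", "TASK", "SPOT", "m2.xlarge", 5, bid_price="0.25"),
]

on_demand = sum(g["InstanceCount"] for g in INSTANCE_GROUPS if g["Market"] == "ON_DEMAND")
spot = sum(g["InstanceCount"] for g in INSTANCE_GROUPS if g["Market"] == "SPOT")
print(on_demand, "On-Demand,", spot, "Spot")
```

Keeping the Spot capacity in a task group is a design choice in this sketch: if those nodes are terminated, the nodes holding HDFS data are unaffected.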
30. Integration with EC2 Spot. Use case: manage the cost of running job flows. Start with 4 On-Demand instances of type m2.xlarge; expand the cluster with 5 Spot nodes. Cost without Spot: 4 instances * 14 hrs * $0.50 = $28. Cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75; total = $22.75. Savings: ~19%. [Figure: a job flow allocated 4 instances, time remaining 14 hours; expanded to 9 instances, time remaining 7 hours]
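The cost comparison can be checked directly from the slide's stated inputs: instance counts, hours and hourly prices. The helper name below is hypothetical, and the prices are the slide's illustrative figures rather than current AWS rates.

```python
def cluster_cost(groups):
    """Sum instances * hours * hourly price over (count, hours, price) tuples."""
    return sum(count * hours * price for count, hours, price in groups)


# Without Spot: 4 On-Demand instances for 14 hours at $0.50 per hour.
on_demand_only = cluster_cost([(4, 14, 0.50)])

# With Spot: the same 4 On-Demand instances for 7 hours,
# plus 5 Spot instances for 7 hours at $0.25 per hour.
with_spot = cluster_cost([(4, 7, 0.50), (5, 7, 0.25)])

savings = (on_demand_only - with_spot) / on_demand_only

print(f"Without Spot: ${on_demand_only:.2f}")
print(f"With Spot:    ${with_spot:.2f}")
print(f"Savings:      {savings:.2%}")
```

With these inputs the On-Demand-only run costs $28.00 and the mixed run $22.75, a saving of a little under 19%.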
32. Elastic MapReduce Ecosystem: the ecosystem is growing; integrated development environments for Hadoop; tools designed for data analytics; broad support for Amazon Elastic MapReduce
33. Karmasphere: Big Data Intelligence software. For developers and analysts to work faster and more easily; purpose-built for all popular Hadoop distros and versions; tightly integrated with Elastic MapReduce (since 2009); built on the Karmasphere Application Framework™; native Hadoop client-side platform
40. [Diagram labels: Web Logs; Social Media; CRM; Sales; Excel Files; Customer Data; Datameer Analytics Solution; Amazon Elastic MapReduce]
41. MicroStrategy is a Global Leader in Business Intelligence. Corporate overview: founded in 1989; largest independent public BI vendor (NASDAQ: MSTR); positioned in the Gartner “Leader Quadrant” for BI Platforms; over 1 million business users at over 3,000 organizations. The MicroStrategy 9 business intelligence platform enables mobile apps, dashboards, reporting and analytics with your business data. Build once, deliver instantly and securely any time, to any device.
42. What can you do with MicroStrategy and Amazon Elastic MapReduce? Deliver insights to a broader range of users. End users interact with a point-and-click interface to query data without writing HiveQL or MapReduce jobs. Use cases: Mobile apps: a floor manager accesses order details stored in Amazon Elastic MapReduce through a custom iPhone app. Dashboards: an end user starts with a Dynamic Dashboard populated from a data mart or data warehouse, then drills to a detail report that executes in Amazon Elastic MapReduce. Reporting: an application developer builds a parameterized HiveQL report, then schedules it to execute; jobs execute against Amazon Elastic MapReduce and MicroStrategy sends out exception-based alerts via email to end users. Analysis: an application developer populates a multidimensional cache in MicroStrategy with the results of a HiveQL query; the end user uses MicroStrategy’s web interface to slice and dice the results without going back to Hadoop.
43. How can I learn more? Try it! Free MicroStrategy software is available at: http://www.microstrategy.com/freereportingsoftware. Get more information about MicroStrategy solutions for Amazon Elastic MapReduce: http://aws.amazon.com/solutions/solution-providers/microstrategy