
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Cascading

Big Data is moving to the next level of maturity and it’s all about the applications. Dhruv Kumar, one of the minds behind Cascading, the most widely used and deployed development framework for building Big Data applications, will discuss how Cascading can enable developers to accelerate the time to market for their data applications, from development to production. In this session, Dhruv will introduce how to easily and reliably develop, test, and scale your data applications and then deploy them on Hadoop and Hortonworks Data Platform. He will show a demo using the Hortonworks Sandbox and Cascading. Recording is at
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=e5582bcbc0516d35fc2dcf0bce86146e


C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Cascading

1. Big Data Meetup, October 29, 2014, C-BAG Chennai
2. C-BAG: Chennai Big Data Analytic Group. C-BAG is an open group formed in the interest of creating a good BIG DATA environment. C-BAG conducts free weekly and monthly online/offline sessions, creating awareness of BIG DATA technologies and supporting BIG DATA initiatives. C-BAG's aim is to be a one-stop place for all BIG DATA queries, discussions and support! Contact us: chennaibigdataanalyticgroup@gmail.com
3. Speakers. About Dhruv Kumar, Solutions Architect, Concurrent Inc.: Dhruv Kumar has over six years of diverse software development experience in Big Data, Web and High Performance Computing applications. Prior to joining Concurrent, he worked at Terracotta as a Software Engineer. He has an MS degree in Computer Engineering from the University of Massachusetts Amherst. About Vinay Shukla, Director of Product Management, Hortonworks: Vinay Shukla is a seasoned enterprise software professional with extensive experience in product management, product development and project management. Prior to Hortonworks, Vinay worked as a security architect, product manager, developer and project manager. Vinay admits to being a caffeine addict and spends his free time on a yoga mat and on hikes.
4. Hortonworks enables adoption of Apache Hadoop with Hortonworks Data Platform. • Founded in 2011 • Original 24 architects, developers and operators of Hadoop from Yahoo! • Leaders in the Hadoop community • 500+ employees. Customer momentum: • 300+ customers in seven quarters, growing at 75+ per quarter • Two thirds of customers come from the F1000. Partner momentum: • Over 1,000 partners, hundreds of certified solutions. Hortonworks and Hadoop at scale: • HDP in production on the largest clusters on the planet • Most 1000+ node clusters
5. The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014. A Leader in Hadoop: "Hortonworks loves and lives open source innovation." World-class support and services: Hortonworks' customer support received a maximum score and was rated significantly higher than both Cloudera and MapR.
6. HDP IS Apache Hadoop. There is ONE Enterprise Hadoop: everything else is a vendor derivation. Release timeline: HDP 2.0 (October 2013), HDP 2.1 (April 2014), HDP 2.2 (October 2014). [Slide shows a component/version matrix for Hortonworks Data Platform 2.2 covering Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Tez, Slider, Falcon, Ranger, Spark, Kafka and Solr, grouped into Data Management, Data Access, Governance & Integration, Operations and Security. Version numbers are targets and subject to change at time of general availability in accordance with the ASF release process.]
7. The Modern Data Architecture with HDP. [Architecture diagram.]
8. Enterprise Goals for the Modern Data Architecture. • Consolidate siloed data sets, structured and unstructured • Central data set on a single cluster • Multiple workloads across batch, interactive and real time • Central services for security, governance and operations • Preserve existing investment in current tools and platforms • Single view of the customer, product and supply chain. [Diagram: business analytics, custom and packaged applications over data systems (RDBMS, EDW, MPP) and a cluster running batch, interactive and real-time workloads on YARN, the data operating system, over HDFS, fed by sources such as existing systems (CRM, ERP, other), clickstream, web and social, geolocation, sensor and machine, server logs, and unstructured data.]
9. 1. Unlock New Applications from New Types of Data. [Table mapping industry use cases to the data types they draw on (sentiment and web, clickstream and behavior, machine and sensor, geographic, server logs, structured and unstructured): Financial Services: new account risk screens, trading risk, insurance underwriting. Telecom: call detail records (CDR), infrastructure investment, real-time bandwidth allocation. Retail: 360° view of the customer; localized, personalized promotions; website optimization. Manufacturing: supply chain and logistics, assembly line quality assurance, crowd-sourced quality assurance. Healthcare: use genomic data in medical trials, monitor patient vitals in real time. Pharmaceuticals: recruit and retain patients for drug trials, improve prescription adherence. Oil & Gas: unify exploration and production data, monitor rig safety in real time. Government: ETL offload under federal budgetary pressures, sentiment analysis for government programs.]
10. …to shift from reactive to proactive interactions. HDP and Hadoop allow organizations to shift interactions from reactive (post-transaction) to proactive (pre-decision). A shift in Advertising: from mass branding …to 1x1 targeting. A shift in Financial Services: from educated investing …to automated algorithms. A shift in Healthcare: from mass treatment …to designer medicine. A shift in Retail: from static branding …to real-time personalization. A shift in Telco: from break-then-fix …to repair before break.
11. 2. Or to realize dramatic cost savings: EDW optimization. Current reality: EDW at capacity (roughly 50% operations, 20% analytics, 30% ETL processing), with some usage from low-value workloads; older data archived, unavailable for ongoing exploration; source data often discarded. Augment with Hadoop: offload parse, cleanse, apply-structure and transform work to Hadoop; free up EDW resources from low-value tasks (shifting the split toward 50% operations, 50% analytics); keep 100% of source data and historical data for ongoing exploration; mine data for value after loading it because of schema-on-read.
12. 2. Or to realize dramatic cost savings: EDW optimization (continued). Hadoop enables scalable compute and storage on commodity hardware at a compelling cost structure. [Chart: fully loaded cost per raw TB of data (min–max cost) for engineered system, MPP, SAN, NAS, cloud storage and Hadoop, on a scale from $0 to $180,000.] Storage and compute costs drop from $19/GB to $0.23/GB.
13. 3. Data Lake: an architectural shift. Unlocking the data lake: • Single data repository, shared infrastructure • Multiple business apps accessing all the data • Enable a shift from reactive to proactive interactions • Gain new insight across the entire enterprise. [Diagram: scope vs. scale, positioning RDBMS, MPP and EDW against the data lake enabled by YARN and HDP 2.1 (data management, data access, governance & integration, security, operations), for new analytic apps or IT optimization.]
14. Case Study: 12-month Hadoop evolution at TrueCar. Data platform capability timeline: June 2013, begin Hadoop execution; July 2013, Hortonworks partnership and 12-month execution plan; Aug 2013, training and development begins; Nov 2013, production cluster with 60 nodes and 2 PB; Dec 2013, three production apps (3 total); Jan 2014, 40% of dev staff proficient; Feb 2014, three more production apps (6 total); May 2014, IPO. 12-month results at TrueCar: • Six production Hadoop applications • Sixty nodes / 2 PB of data • Storage and compute costs from $19/GB to $0.23/GB. "We addressed our data platform capabilities strategically as a precursor to IPO."
15. DRIVING INNOVATION THROUGH DATA: REDUCING DEVELOPMENT TIME FOR PRODUCTION-GRADE HADOOP APPLICATIONS. Dhruv Kumar, Solutions Architect, Concurrent Inc.
16. GET TO KNOW CONCURRENT. Leader in application infrastructure for Big Data: building enterprise software to simplify Big Data application development and management. Products and technology: • CASCADING (open source), the most widely used application infrastructure for building Big Data apps, with over 200,000 downloads each month and 8,000 deployments worldwide • DRIVEN, enterprise data application management for Big Data apps. Proven, simple, reliable, robust: thousands of enterprises rely on Concurrent to provide their data application infrastructure. Founded: 2008. HQ: San Francisco, CA. CEO: Gary Nakamura. CTO & Founder: Chris Wensel. www.concurrentinc.com
17. BIG DATA APPLICATION INFRASTRUCTURE. "It's all about the apps": there needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications, connecting business strategy with data and technology. Challenges: skill sets, systems integration, standard operating procedures and operational visibility.
18. DATA APPLICATIONS: ENTERPRISE NEEDS. Enterprise data application infrastructure: • Need reliable, reusable tooling to quickly build and consistently deliver data products • Need the degrees of freedom to solve problems ranging from simple to complex with existing skill sets • Need the flexibility to easily adapt an application to meet business needs (latency, scale, SLA) without having to rewrite the application • Need operational visibility for the entire data application lifecycle
19. WORD COUNT EXAMPLE WITH CASCADING

// configuration
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// integration: create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// processing: specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// scheduling: connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- unit of work
wcFlow.complete();                              // <<-- runs jobs on the cluster
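The same pipe assembly can be exercised without a Hadoop cluster by swapping the Hfs taps and HadoopFlowConnector for their local-mode counterparts, which is what makes the test-driven workflow on the later slides practical. Below is a minimal, self-contained sketch (not from the deck) using Cascading's local mode; the class name and file paths are placeholders.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;   // plans the flow for in-process execution
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextDelimited;       // local-mode scheme
import cascading.tap.Tap;
import cascading.tap.local.FileTap;                // local-filesystem tap
import cascading.tuple.Fields;

public class LocalWordCount
  {
  public static void main( String[] args )
    {
    String docPath = args[ 0 ]; // e.g. a small sample file kept alongside the tests
    String wcPath = args[ 1 ];  // output path on the local filesystem

    // integration: local taps instead of Hfs
    Tap docTap = new FileTap( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new FileTap( new TextDelimited( true, "\t" ), wcPath );

    // processing: the pipe assembly is identical to the Hadoop version above
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    FlowDef flowDef = FlowDef.flowDef().setName( "wc-local" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // scheduling: the local connector runs the same FlowDef in memory, no cluster needed
    Flow wcFlow = new LocalFlowConnector( new Properties() ).connect( flowDef );
    wcFlow.complete();
    }
  }

Because only the taps and the connector change, the business logic stays untouched, which is the property the "test locally, then deploy on HDP" workflow relies on.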
20. SOME COMMON PROCESSING PATTERNS: • Functions • Filters • Joins (inner / outer / mixed; asymmetrical / symmetrical) • Merge (union) • Grouping (secondary sorting; unique / distinct) • Aggregations (count, average, etc.). [Diagram: data flowing through a pipeline of functions and filters, with split, join and merge topologies.]
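As a rough illustration of how a few of these patterns read in the Cascading Java API, here is a sketch (not from the deck) that filters, joins, groups and counts; the field names and the assumed source layouts are made up for the example.

import cascading.operation.aggregator.Count;
import cascading.operation.expression.ExpressionFilter;
import cascading.pipe.CoGroup;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.InnerJoin;
import cascading.tuple.Fields;

public class PatternsSketch
  {
  // builds a filter -> join -> group -> count assembly; assumes a "clicks" source
  // with fields (user_id, event, ts) and a "users" source with fields (user_id, country)
  public static Pipe purchasesPerCountry()
    {
    Fields userId = new Fields( "user_id" );

    // filter: remove every click that is not a purchase
    Pipe clicks = new Pipe( "clicks" );
    clicks = new Each( clicks, new Fields( "event" ),
      new ExpressionFilter( "!event.equals(\"purchase\")", String.class ) );

    Pipe users = new Pipe( "users" );

    // inner join on user_id; the declared fields rename the duplicate key column
    Fields declared = new Fields( "user_id", "event", "ts", "user_id2", "country" );
    Pipe joined = new CoGroup( clicks, userId, users, userId, declared, new InnerJoin() );

    // grouping + aggregation: count purchases per country
    Pipe counts = new GroupBy( joined, new Fields( "country" ) );
    counts = new Every( counts, Fields.ALL, new Count(), Fields.ALL );

    return counts;
    }
  }

The tail pipe would then be bound to source and sink taps in a FlowDef exactly as in the word count example; a HashJoin could replace the CoGroup when one side is small enough to fit in memory.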
21. CASCADING API: • Java API • Separates business logic from integration • Testable at every lifecycle stage • Works with any JVM language (Scala, Clojure, JRuby, Jython, Groovy, enterprise Java) • Many integration adapters. [Diagram: processing, integration and scheduler APIs, with a process planner and scheduler sitting between Cascading, Apache Hadoop and the data stores.]
22. FRAMEWORK AND PROGRAMMING LANGUAGE INDEPENDENCE. Cascading domain-specific languages (DSLs): SQL, Clojure, Ruby. New fabrics: Tez, Storm. Supported fabrics and data stores: Hadoop, mainframe DB/DW, in-memory data stores. • Any JVM language can use the Cascading API • Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and more.
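In practice the portability claim comes down to which planner turns the FlowDef into jobs: the pipe assembly stays the same and only the FlowConnector changes. The sketch below (not from the deck) illustrates the idea; the commented-out Tez and Hadoop 2 connector class names are assumptions about the later Cascading 3.x planner modules and may differ between versions.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;  // MapReduce planner, as on the word count slide
import cascading.flow.local.LocalFlowConnector;    // in-process planner, handy for tests
// Assumed names of later planner modules, shown for illustration only:
// import cascading.flow.tez.Hadoop2TezFlowConnector;      // Apache Tez
// import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;  // MapReduce on Hadoop 2

public class RunOnFabric
  {
  // the FlowDef (the business logic) is fabric-agnostic; only the connector is swapped
  public static Flow run( FlowDef flowDef, boolean onCluster, Properties properties )
    {
    FlowConnector connector = onCluster
      ? new HadoopFlowConnector( properties )   // swap in a Tez/Spark/Storm planner as they ship
      : new LocalFlowConnector( properties );

    Flow flow = connector.connect( flowDef );
    flow.complete();
    return flow;
    }
  }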
23. THE STANDARD FOR DATA APPLICATION DEVELOPMENT (www.cascading.org). A proven application development framework for building data apps; an application platform that addresses: • Building data apps that are scale-free: design principles ensure best practices at any scale • Test-driven development: efficiently test code and process local files before deploying on a cluster • The staffing bottleneck: use existing Java, SQL and modeling skill sets • Application portability: write once, then run on different computation fabrics • Operational complexity: keep it simple, package the app into one jar and hand it to operations • Systems integration: Hadoop never lives alone, so integrate easily with existing systems
24. CASCADING DATA APPLICATIONS. Enterprise IT: extract-transform-load, log file analysis, systems integration, operations analysis. Corporate apps: HR analytics, employee behavioral analysis, customer support / eCRM, business reporting. Telecom: data processing of open data, geospatial indexing, consumer mobile apps, location-based services. Marketing / retail: mobile, social and search analytics; funnel analysis; revenue attribution; customer experiments; ad optimization; retail recommenders. Consumer / entertainment: music recommendation, comparison shopping, restaurant rankings, real estate and rental listings, travel search and forecast. Finance: fraud and anomaly detection, fraud experiments, customer analytics, insurance risk metrics. Health / biotech: aggregate metrics for government, person biometrics, veterinary diagnostics, next-gen genomics, agronomics, environmental maps.
25. STRONG ORGANIC GROWTH: 200,000+ downloads per month, 8,000+ deployments.
26. BUSINESSES DEPEND ON US: TWITTER. • 30,000 jobs per day • Makes complex analysis of very large data sets simple • Machine learning and linear algebra to improve user experience and ad quality (matching users and ad effectiveness) • All revenue applications run on Cascading/Scalding.
27. BUSINESSES DEPEND ON US. • Cascading Java API • Data normalization and cleansing of search and click-through logs for use by analytics tools and Hive analysts • Easy to operationalize heavy lifting of data in one framework.
28. BUSINESSES DEPEND ON US. • Cascalog (Clojure) • Weather pattern modeling to protect growers against loss • ETL against 20+ datasets daily • Machine learning to create models • Purchased by Monsanto for US $930M.
29. BROAD SUPPORT: the Hadoop ecosystem supports Cascading.
30. … AND INCLUDES A RICH SET OF EXTENSIONS: http://www.cascading.org/extensions/
31. WORD COUNT DEMO ON HDP
32. SUMMARY: BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING. • The Cascading framework enables developers to intuitively create data applications that are scalable, robust and future-proof, supporting new execution fabrics without requiring a code rewrite • Driven, an application visualization product, provides rich insight into how your applications execute, improving developer productivity by 10x • Cascading 3.0 opens up the query planner: write apps once, run on any fabric. Concurrent offers training classes for Cascading and Scalding.
33. CONTACT INFORMATION. Dhruv Kumar, Solutions Architect, Concurrent Inc., dkumar@concurrentinc.com
34. DRIVING INNOVATION THROUGH DATA: THANK YOU. Dhruv Kumar
35. APPENDIX
36. USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO SPARK. • Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps • Supports integration with any data source that can be accessed through JDBC; a Cascading Tap can be created for any source supporting JDBC • Great for migrating data and integrating with non-Big-Data assets, extending the life of existing IT assets in an organization. [Diagram: CLI/shell, enterprise Java, the provider and JDBC APIs, and the Lingual API and catalog layered on Cascading's query planner over Apache Hadoop and other data stores.]
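Because Lingual presents itself as a JDBC driver, a query can be issued like any other JDBC query. The sketch below (not from the deck) assumes the cascading.lingual.jdbc.Driver class is on the classpath, a jdbc:lingual:local connection URL (jdbc:lingual:hadoop for a cluster), and an "example"."titles" table already registered with the Lingual catalog tool; all of those names are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LingualQuery
  {
  public static void main( String[] args ) throws Exception
    {
    // assumed driver class and URL scheme for Lingual's JDBC provider
    Class.forName( "cascading.lingual.jdbc.Driver" );
    Connection connection = DriverManager.getConnection( "jdbc:lingual:local" );

    try
      {
      // the SQL is planned and executed as a Cascading flow under the covers
      Statement statement = connection.createStatement();
      ResultSet results = statement.executeQuery(
        "SELECT title, cnt FROM \"example\".\"titles\" WHERE cnt > 100" );

      while( results.next() )
        System.out.println( results.getString( 1 ) + "\t" + results.getInt( 2 ) );
      }
    finally
      {
      connection.close();
      }
    }
  }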
37. SCALDING. • Scalding is a language binding to Cascading for Scala • The name comes from combining SCALa and cascaDING • Scalding is great for Scala developers and can crisply express constructs such as matrix math • Scalding has very large commercial deployments at: Twitter (use cases such as the revenue quality team, ad targeting and traffic quality) and eBay (use cases include search analytics and other production data pipelines).
38. PATTERN ENABLES MIGRATING YOUR MODELS TO SPARK. • Pattern is an open source project that lets you take Predictive Model Markup Language (PMML) models and translate them into Cascading apps • PMML is a popular XML-based standard that allows applications to describe data mining and machine learning models • PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows: vendor frameworks (SAS, IBM SPSS, MicroStrategy, Oracle) and open source frameworks (R, Weka, KNIME, RapidMiner) • Pattern is great for migrating model scoring from your decision systems to Hadoop.
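To make the scoring workflow concrete, here is a rough sketch of wiring a PMML model into a Cascading flow. It follows the shape of the Pattern project's published examples, but the PMMLPlanner class, its setPMMLInput / retainOnlyActiveIncomingFields / setDefaultPredictedField methods, the FlowDef.addAssemblyPlanner hook and the source/sink names are assumptions about the cascading-pattern API of that era; the paths are placeholders.

import java.io.File;
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pattern.pmml.PMMLPlanner;   // assumed package from the cascading-pattern project
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ScoreWithPattern
  {
  public static void main( String[] args )
    {
    String inputPath = args[ 0 ];    // unscored records (TSV with header)
    String classifyPath = args[ 1 ]; // where scored records land
    String pmmlPath = args[ 2 ];     // model exported from R, SPSS, KNIME, RapidMiner, etc.

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, ScoreWithPattern.class );

    Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
    Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

    FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
      .addSource( "input", inputTap )
      .addSink( "classify", classifyTap );

    // the planner reads the PMML document and expands it into a Cascading assembly
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlPath ) )
      .retainOnlyActiveIncomingFields()
      .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // used if the model names no predicted field

    flowDef.addAssemblyPlanner( pmmlPlanner );

    Flow flow = new HadoopFlowConnector( properties ).connect( flowDef );
    flow.complete();
    }
  }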
39. PATTERN: ALGOS IMPLEMENTED. • Hierarchical clustering • K-means clustering • Linear regression • Logistic regression • Random forest. Algorithms are extended based on customer use cases.
40. BUILDING AND RUNNING PMML MODELS. [Diagram: on the training side, data is prepared with Lingual (ETL), the model producer explores it and builds a model using regression, clustering, etc., and exports a PMML model; on the scoring side, new data is prepared with Lingual, the model consumer scores it with Pattern, post-processing is applied to the resulting scores, and the results are measured to improve the model.]
