How to Wrangle Data for Machine Learning on AWS

Join our webinar to hear how Consensus, a Target-owned subsidiary, utilizes AWS and Trifacta to prepare data for use in fraud detection algorithms. You’ll learn how self-service automated data wrangling can save your organization time and money, and tips for getting started with Trifacta’s solution, built for AWS.
Webinar attendees will learn:

- Why automating your data wrangling tasks can lead to greater data accuracy and more meaningful insights.
- How you can reduce your data preparation time by 60% or more with self-service data wrangling tools built for AWS.
- How easy it is to get started with machine learning solutions for data wrangling on the cloud.

How to Wrangle Data for Machine Learning on AWS

  1. How to wrangle data for machine learning on AWS
     May 31, 2018 | 10:00 AM PT
     © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Today’s speakers
     • Pratap Ramamurthy, Partner Solutions Architect, Amazon Web Services, Inc.
     • David McNamara, Customer Success Manager, Trifacta
     • Harrison Lynch, Senior Director of Product Development, Consensus Corporation
  3. Today’s agenda
     • An overview of machine learning (ML) solutions offered through AWS and the AWS Partner Network
     • Featured AWS Machine Learning Partner: Trifacta
     • Case study: Consensus Corporation
     • Q&A / Discussion
  4. Learning objectives
     • How easy it is to get started with machine learning solutions for data wrangling on the cloud
     • Why automating your data wrangling tasks can lead to greater data accuracy and more meaningful insights
     • How you can reduce your data preparation time by 60% or more with self-service data wrangling tools built for AWS
     • How Consensus Corporation is using Trifacta on AWS to detect fraud
  5. Machine learning on AWS
  6. A long heritage of machine learning at Amazon
     • Personalized recommendations
     • Inventing entirely new customer experiences
     • Fulfillment automation and inventory management
     • Drones
     • Voice driven interactions
  7. [No text on this slide]
  8. Our mission: Put machine learning in the hands of every developer and data scientist
  9. Market adoption: $46B market by 2020
     • Strong overall appetite for adopting AI
     • Top heavy in High Tech due to expertise
     • Opportunities exist in Health Care, Education, Retail, and other segments
     • 3000+ startups today (up from 100 in 2011)
     Source: McKinsey Global Institute, Artificial Intelligence The Next Digital Frontier.
  10. Amazon Machine Learning Stack
      • Application Services
        • Vision: Amazon Rekognition Image, Amazon Rekognition Video
        • Speech: Amazon Polly, Amazon Transcribe
        • Language: Amazon Translate, Amazon Comprehend, Amazon Lex
      • Platform Services: Amazon SageMaker, AWS DeepLens, Amazon Machine Learning, Amazon Spark on Amazon EMR, Amazon Mechanical Turk
      • Frameworks & Infrastructure
        • AWS Deep Learning AMI
        • GPU (P3 Instances), Mobile, CPU, IoT (Amazon Greengrass)
        • TensorFlow, Gluon, Apache MXNet, Cognitive Toolkit, Caffe2 & Caffe, PyTorch, Keras
  11. New: Amazon Rekognition Video
      Video analysis:
      • Object and activity detection
      • Person tracking
      • Face recognition
      • Real-time live stream
      • Content moderation
      • Celebrity recognition
  12. [No text on this slide]
  13. Customers running machine learning on AWS today
  14. The AWS Machine Learning Competency Program
      What is the AWS Competency Program?
      The AWS Competency Program is designed to highlight APN Partners who have demonstrated technical proficiency and proven customer success in specialized solution areas. Attaining an AWS Competency allows partners to differentiate themselves to customers by showcasing expertise in a specific solution area.
  15. Data wrangling for machine learning on AWS
      David McNamara, Customer Success Manager, Trifacta
  16. We believe that our ability to solve big problems in business and society depends on seeing patterns in the data we collect. But data comes in all shapes and sizes, and too often the messy process of pulling it together gets in the way of progress. At Trifacta, we empower change-makers to work with diverse and fragmented data, as it’s being cleaned and refined, so they can ask more interesting questions and create a better future.
  17. A global leader in data preparation
      • #1 Rankings from Media & Analysts
      • 85+ Global Technology and SI Partners
      • #1 in Users with 10,000+ Companies
      • Enterprise Standard for Data Preparation at 100+ Accounts
  18. Self-service data wrangling: the critical enabler
      Diagram labels: Data platforms; Analysis & consumption; 80%
      “There's the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data” (Kaggle founder and CEO Anthony Goldbloom)
      *Wrangler: Interactive Visual Specification of Data Transformation Scripts. Heer, Hellerstein, Kandel, Paepcke; Stanford University & University of California, Berkeley (2011)
  19. The Solution: Trifacta Data Wrangling Platform
      Data wrangling activities: Discover, Structure, Clean, Enrich, Validate, Publish
      Diagram labels: Data platforms; Analysis & consumption
  20. Typical AI/ML modeling data pipeline to empower business self-service
  21. Demo
  22. Consensus Corporation: Improved data wrangling to speed up machine learning model building for anti-fraud software
      Harrison Lynch, Senior Director of Product Development, Consensus Corporation
  23. Your speaker
      Harrison Lynch, Sr. Director of Product Development
  24. Consensus history
      Timeline years: 1999, 2007, 2012, 2014, 2018
      Timeline events (company names were shown as logos and are missing from the text):
      • iPhone launches
      • [logo] acquires [logo]
      • [logo] launches as an online retailer of wireless phones & services
      • [logo] acquires LetsTalk.com & rebrands as [logo]
      • [logo] launches Connected Commerce in [logo]
      • First client launch
      Consensus / Proprietary & Confidential
  25. Use our wireless capabilities to perform new tricks
      Steps: 1. Catch a transaction; 2. Connect; 3. Collect
      Diagram labels: Multiplex Bundles; Product & Connected Service Bundles; Loyalty; Subscriptions; $625 cash or $29 per month; Underwriters; Services; Supply Chain
  26. Multiplex is about two things
      Insights:
      • Retailers face increasing price competition leading to downward margin and revenue pressure.
      • Consumers face fragmented ‘buying to using’ experiences, and sticker shock on large ticket items.
      Key benefits of our solution:
      • Get More with Assisted Sales
      • Pay Less with Subscriptions
      Cart Margin Enhancement (CME): Add products and activate services to improve margin for retailer and deliver a more complete customer experience.
      Future Proof Subscriptions (FPS): Give guests subscriptions to bundles of latest products and services with options to pay over time.
  27. Wireless retailing economics
      • COGS: $850
      • Customer finances 100%
      • Carrier pays device cost subsidy
      • Carrier pays commission
      • Retailer makes money on warranty, accessories
      • Carrier takes back commission and device subsidy if fraud
      • 1 bad sale wipes out the profit from up to 10 good sales
  28. Current industry practice relies on insufficient credit scoring methods
      Will They Pay On Time?
      • FICO
      • Income
      • Payment History
      • Length of Employment
      • Credit Utilization
      • # of Accounts
      v.
      Will They Ever Pay?
      • Distance from Store
      • Basket Composition
      • Rep Tenure
      • Time of Day
      • Number Port from
      Identity thieves are trying to steal the best credit scores
  29. Model building
      1. Extract:
         • Start with a robust data set, preferably at least 3 months of data from orders that are at least 120 days old
         • This dataset (of orders) should indicate whether the carrier has deactivated a corresponding line (and if possible, whether the carrier classified the order as fraudulent)
      2. Analyze:
         • Apply hunches, theories and business knowledge
         • This is what’s known as “Ground Truth”
      3. Identify:
         • Extract and analyze a set of characteristics from the set of orders
         • These are referred to as “Features”
      4. Augment:
         • Identify characteristics and tangential data elements about Features that enhance the model’s usefulness
         • If reliable cell phone customers tend to come from particular areas, it may be prudent to model the regions from which customers drive to reach the cell phone store
         • If an annual festival increases a store region’s population by 100,000 people and a higher percentage of fraud cases come from this annual period, it may be prudent to cluster purchase orders that are proximate to the festival period
      5. Transform:
         • Put the data into a format that makes it easier for data modeling systems to read
         • This is usually one or a series of two-dimensional tables with one of the following types of data:
           • Continuous: numeric values, like dollar values or distance
           • Binary: 1/0, or a yes/no
           • Categorical: colors (white, black, gold) or a carrier
      6. Split data into two sets:
         • Larger set is the training set that feeds into a set of models; it allows the models to identify good vs bad orders and to generate correlations
         • Smaller test set is one that the data scientist puts away for future use against the model
      7. Choose:
         • Models that would best fit the data that comes through the system based on the models that have worked in the past
         • Put the training set through the set of candidate models, which results in:
           • Risk assessments for each order in the training set
           • Refinement of the candidate models such that they become “pickled models”
      8. Train & Test:
         • After the candidate models generate risk assessments for the training data set, a data scientist pulls the test set out of a drawer (so to speak)
         • The models are then judged for accuracy against data they’ve never seen; this is how you make sure you’re not building a model to predict yesterday’s weather
         • The data scientist evaluates the candidate models based upon the comparison
         • The candidate models undergo tuning and refinement such that they produce the best results possible against the test data set
         • The data scientist selects the candidate model that produces the best results against the test data set
      9. Pickle, Promote & Deploy:
         • Compare the results for the best candidate model against the model currently in use; if it’s a winner, move forward
         • Test the performance of the new model in use to ensure that it meets performance standards and promote it to production
  30. System for machine learning
      1. Orders arrive at the machine learning system via several channels
      2. The system parses the data. It is able to do so regardless of the channel of origin
      3. The system extracts that data which is most relevant to the order scoring process, transforms the extracted data into a format the model can read & loads reformatted data into the model
      4. The system scores the extracted and reformatted order using the risk scoring model
      5. The system determines at random whether an order should be in a control group. It approves those orders immediately and without regard to their score
      6. The system scores the extracted and reformatted order using the risk scoring model
      7. Based on the results of the rules application, the system routes orders to third-party services and manual review processes as appropriate
      8. The system sends a yes/no determination and/or a risk score for the order back to the originating commercial channel
  31. The whole universe of statistical models
      Conventional:
      • Many systems claim to use “models”
      • What they’re wedded to are linear models
      • LMs are great for some things, not for others
      • Overfitting is an issue
      v.
      Consensus:
      • The universe of models
      • Select the one that best fits the job
      Panel labels: Gaussian Kernel; Random Forest; Support Vector Machine; Support Vector Machine
  32. Searching for a data wrangling solution
      • Join disparate data sources together
        • Carrier reconciliation data
        • Internal order data
        • External data
      • Data is messy
        • Shifting date formats
        • Inconsistent data types
        • Shifts in logic across partners
      • Reduce reliance on developers/data analysts
        • I’m lousy at SQL
        • Day jobs need to be attended to
      • Scaling Discovery
        • Needed to be able to act on hunches
        • Explore ground truth
      • Data Prep is a labor of love
  33. How Trifacta helps me
      • Data is in different places (and sometimes toxic)
        • Black box data
        • Pulling standalone sets of hashed values
        • Script Dev > SRE > Output > back to me
      • Making data repairs repeatable
        • Geocode failures
        • Exploration of Census data
      • Tracking your changes
        • Showing your work & Data Provenance
  34. Our results with Trifacta
      • Discovery of data: from 2-3 days to less than one day
      • Data preparation: from 8 hours to less than one hour
  35. Q & A
  36. Next steps and further information
      Start wrangling today with Trifacta Wrangler Pro in AWS Marketplace:
      • aws.amazon.com/marketplace/
      • Search for “Trifacta Wrangler Pro”
      Learn more about machine learning on AWS:
      • aws.amazon.com/machine-learning/featured-partner-solutions/
      Try AWS for free:
      • aws.amazon.com/free/
  37. Thank you!
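
The Transform and Split steps described on slide 29 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline Consensus actually runs: the order fields (`distance_miles`, `number_ported`, `carrier`, `fraud`), the sample values, the 25% test fraction, and the helper names `encode` and `train_test_split` are all hypothetical. It shows one continuous, one binary, and one categorical (one-hot) feature placed into a two-dimensional table, followed by a split in which the smaller test set is put away.

```python
import random

# Hypothetical orders; field names and values are illustrative, not from the deck.
ORDERS = [
    {"distance_miles": 2.5, "number_ported": True, "carrier": "A", "fraud": 0},
    {"distance_miles": 41.0, "number_ported": False, "carrier": "B", "fraud": 1},
    {"distance_miles": 7.1, "number_ported": False, "carrier": "A", "fraud": 0},
    {"distance_miles": 63.4, "number_ported": True, "carrier": "C", "fraud": 1},
    {"distance_miles": 1.2, "number_ported": False, "carrier": "B", "fraud": 0},
    {"distance_miles": 12.8, "number_ported": True, "carrier": "C", "fraud": 0},
    {"distance_miles": 55.0, "number_ported": False, "carrier": "A", "fraud": 1},
    {"distance_miles": 4.4, "number_ported": True, "carrier": "B", "fraud": 0},
]
CARRIERS = ["A", "B", "C"]  # assumed closed set of categories


def encode(order, carriers=CARRIERS):
    """Turn one order into a flat numeric feature row."""
    row = [float(order["distance_miles"])]            # continuous
    row.append(1 if order["number_ported"] else 0)    # binary
    # categorical: one 0/1 column per known carrier (one-hot encoding)
    row.extend(1 if order["carrier"] == c else 0 for c in carriers)
    return row


def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out the smaller test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_ids, train_ids = indices[:n_test], indices[n_test:]
    train = ([rows[i] for i in train_ids], [labels[i] for i in train_ids])
    test = ([rows[i] for i in test_ids], [labels[i] for i in test_ids])
    return train, test


rows = [encode(o) for o in ORDERS]
labels = [o["fraud"] for o in ORDERS]
(train_x, train_y), (test_x, test_y) = train_test_split(rows, labels)
```

Fixing the seed keeps the held-out set stable between runs, so candidate models are always judged against the same unseen orders.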

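Slide 32 lists "shifting date formats" among the reasons the data is messy. As a rough illustration of what repairing that by hand involves, the Python sketch below tries a fixed list of formats in order and returns an ISO 8601 date. The format list, the month-first reading of slash-separated dates, and the function name `normalize_date` are assumptions for the example; they are not taken from the deck or from Trifacta.

```python
from datetime import datetime

# Assumed set of formats seen across partner feeds; extend as needed.
# Slash-separated dates are read as month/day/year (an assumption).
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%Y%m%d")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = str(raw).strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None


cleaned = [normalize_date(v) for v in ["2018-05-31", "05/31/2018", "05/31/18", 20180531, "n/a"]]
```

Returning `None` instead of raising lets unparseable values be counted and reviewed rather than silently dropped.
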
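
The scoring flow on slide 30 (score the order, approve a random control group regardless of score, route the rest) can be sketched as a single Python function. This is a simplified sketch under assumptions: the control rate, the two thresholds, the `toy_score` function, and the name `decide` are hypothetical, and the slide does not say how scores map to review or decline decisions.

```python
import random


def toy_score(order):
    """Stand-in for the risk scoring model; reads a precomputed risk value."""
    return order["risk"]


def decide(order, score_fn, rng, control_rate=0.05, review_at=0.5, decline_at=0.9):
    """Score an order, then approve, route to manual review, or decline."""
    score = score_fn(order)
    # Control group: chosen at random and approved without regard to score,
    # so that outcomes for unfiltered orders can be observed later.
    if rng.random() < control_rate:
        return {"decision": "approve", "score": score, "control": True}
    if score >= decline_at:
        decision = "decline"
    elif score >= review_at:
        decision = "review"  # e.g. third-party services or manual review
    else:
        decision = "approve"
    return {"decision": decision, "score": score, "control": False}


result = decide({"risk": 0.7}, toy_score, random.Random(0), control_rate=0.0)
```

Passing the random generator in as a parameter makes the control-group draw reproducible in tests.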