
Pay no attention to the man behind the curtain - the unseen work behind data science


Goal: explain the nature of the work of an analytics team to a manager, and enable people on those teams to explain what a data science team needs to a manager.

It seems as if every organization wants to enable analytical decision-making and embed analytics into operational processes. What can you do with analytics? It looks like anything is possible. What can you really do? Probably a lot less than you expect. Why is this? Vendors promise easy-to-use analytics tools and services but they rarely deliver. The products may be easy but the work is still hard.
Using analytics to solve problems depends on many factors beyond the math: people, processes, the skills of the analyst, the technology used, the data. Technology is the easy part. Figuring out what to do and how to do it is a lot harder. Despite this, fancy new tools get all the attention and budget.
People and data are the truly hard parts. People, because many believe that data is absolute rather than relative, and that analytic models produce an answer rather than a range of answers with varying degrees of truth, accuracy and applicability. Data, because managing data for analytics is a nuanced, detail-oriented and seemingly dull task left to back-office IT.
If your goal is to build a repeatable analytics capability rather than a one-off analytics project, then you will need to address the parts that are rarely mentioned. This talk will explain some of the unseen and little-discussed aspects involved when building and deploying analytics.


Pay no attention to the man behind the curtain - the unseen work behind data science

  1. Pay no attention to the man behind the curtain… The unseen work behind data science and analytics. Accelerate Data Science conference, October 18, 2017. Mark Madsen, www.ThirdNature.net, @markmadsen
  2. Copyright Third Nature, Inc. INTRO: The problem we’re (really) trying to solve, current state
  3. The focus is largely on machine learning today. You are here.
  4. The craft model of information delivery does not scale
  5. So we shifted to data publishing: industrialized data delivery for self-service access.
  6. Increased data capture and BI maturity lead to more data-intensive practices and rising complexity. Figure: Pareto analysis of the share of buyers who make up 80% of sales volume for products, in this case Coke. Data source: CMO Council
  7. What makes these customers different? How does this affect a new product launch, or line extensions? These are not the type of questions you can answer with only queries and reporting. Data source: CMO Council
  8. Compounding the problem: observations, not transactions. Event data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.
  9. The old problem was access, the new one is analysis
  10. The applied view of data science. Five basic things you can do: ▪ Prediction: what is most likely to happen? ▪ Estimation: what’s the future value of a variable? ▪ Description: what relationships exist in the data? ▪ Simulation: what could happen? ▪ Prescription: what should you do?
  11. Applying analytics isn’t just putting them on a screen. There are different models of use at machine and human speed. Decision-Action: human decision support, humans moderating machine decisions, machine decisions. Monitor-Alert: human monitoring, machine monitoring.
  12. THE NATURE OF THE PROBLEM FOR ORGANIZATIONS: Implementing data science is a problem of multiple perspectives
  13. We don’t have an analytics problem, just like we didn’t have a BI problem. The origin of analytics as “business intelligence” was stated well in 1958: “…the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” ~ H. P. Luhn, “A Business Intelligence System” (http://altaplana.com/ibmrd0204H.pdf). Our goal is analytics as a capability, not a technology.
  14. Three constituencies: Stakeholder (aka the recipient), Analyst (aka the data scientist), Builder (aka the engineer)
  15. Starting points. Many organizations choose to start with the analysts. Create a data science team. Turn them loose to find a problem. Many more start with builders: technology solutions looking for problems, e.g. 55% of the IT-driven Hadoop and Spark projects over the last five years. The right place to start? Stakeholders. The goal to achieve, the problem to solve.
  16. NATURE OF THE PROBLEM FROM THE STAKEHOLDER’S PERSPECTIVE: Each constituency has its own set of problems to deal with
  17. The myth that still drives analytics: analytic gold. All we need is a fat pipe and pans working in parallel…
  18. Analytic insights that result in no action are expensive trivia. It’s not the insight, but what you do with it, that matters. As a manager: what would you do in this situation?
  19. Perennially difficult: What question do you address? What’s possible? How do you know what’s feasible and what isn’t (both technically and financially)? You don’t, unless you know the data science and the business (and even then maybe not; ML makes no guarantees). It takes domain expertise, analytic expertise and intuition. That’s why you need analysts.
  20. Important questions for managers: 1. What is the goal? 2. Is the goal worth achieving? 3. Do you have a clearly stated, measurable goal? 4. Do you have the data required? If managers don’t realize this is important, they complain about analysts asking them a bunch of (obvious*) questions. There are processes you can put in place to find problems to address, prioritize them and determine how to deploy the solutions for them. *Not really
  21. Applying analytics is not an analytics problem. Applying analytics is not in the analyst’s control. It’s not in the engineer’s control. It’s in the control of the people involved in the process. Failures are often in execution, not in analytics development. For example, we saw unexpectedly poor performance in a number of geographies. Was it the new analytics we tried? Was it a data problem? No, it was a simple compliance problem.
  22. NATURE OF THE PROBLEM FROM THE ANALYST’S PERSPECTIVE
  23. The analytics process at a high level. Diagram: Kate Matsudaira
  24. The nature of analytics problems is researching the unknown rather than accessing the known. Repeat for each new problem. Diagram: Kate Matsudaira
  25. Important: no two analytics projects are entirely alike. Different goals = different data, preparation, algorithm. Different algorithms have different resource consumption profiles and scaling ability. Each requires its own custom-engineered data features.
  26. Starting at the start: Do you have a clearly stated, measurable goal?
  27. The main hurdle: just getting the data. Do you know where to find it? Because it’s unlikely to be in the data warehouse. Do you have access to it? Is access fast enough? Because DWs are for QRD, not for moving huge piles of data. And ERP systems and SaaS apps are right out.
  28. Do you have the right data? Many machine learning techniques require labeled (known good) training data. Supervised learning: a person has to define the correct output for some portion of the data. Data is divided into training sets used for model building and test sets for validating the results. • What is spam and what isn’t? • What does a fraudulent transaction look like?
  29. Do you have enough of the right data? ML needs a lot; you may be disappointed in your own efforts.
  30. What does an expert analyst really do? Define the business problem; translate the problem into an analytic context; select appropriate data; learn the data; create a model set; fix problems with data; transform data; build models; assess models; deploy models; assess results. Source: Michael Berry, Data Miners Inc.
  31. What does an expert analyst do? You can’t model data for this in advance.
  32. Where do analysts spend their time? Mostly data work. Define the business problem; translate the problem into an analytic context; select appropriate data; learn the data; create a model set; fix problems with data; transform data; build models; assess models; deploy models; assess results. % of time spent: 70%, 30%. Source: Michael Berry, Data Miners Inc.
  33. Feature engineering is the core of the process. Lots of data (as attributes) makes things harder. Lots of data (instances) makes things slow. Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. Cleaning up data, choosing attributes and deriving features is not a technical problem as much as a creative one. The best way to enable data scientists is to remove data management obstacles.
  34. Where do most of the analytics tools focus? Define the business problem; translate the problem into an analytic context; select appropriate data; learn the data; create a model set; fix problems with data; transform data; build models; assess models; deploy models; assess results. Source: Michael Berry, Data Miners Inc.
  35. Where do most of the analytics aaS focus? Define the business problem; translate the problem into an analytic context; select appropriate data; learn the data; create a model set; fix problems with data; transform data; build models; assess models; deploy models; assess results. Source: Michael Berry, Data Miners Inc.
  36. The analyst’s workspace in BI is relatively spare
  37. The analyst’s workspace needs to be more like a kitchen than like BI vending machines
  38. NATURE OF THE PROBLEM FROM THE BUILDER’S PERSPECTIVE
  39. IT and Ops people want to know “what to build?” Giant data platform? Self-service tools?
  40. Analytics requires different processes and workloads. None of this analytics work is the same as what IT considered “analysis” to be, which is usually equated with BI or ad-hoc query. Ad-hoc analysis = Exploratory data analysis = Batch analytics = Real-time analytics. Figure: A real analytics production workflow (Hatch, CIKM ’11)
  41. Embedding analytics: less voodoo, more engineering
  42. Things engineering and operations worry about. Engineering time and effort: ▪ Introduction of new technology, complexity ▪ Integration: deployment of models requires linking different types of environments, creating supportable workflows for the analysts ▪ Ability to develop and deploy at the required speed. Supportability: ▪ Automation ▪ The environment requires additional monitoring, other technology and processes, particularly for customer-facing work ▪ Support costs (time and money). SLAs: ▪ Availability: if analytics are tied to production operations, particularly customer-facing, this becomes important and difficult because it’s not standard application work ▪ Performance and scalability: have to manage unpredictable workloads, resource conflicts between model development and model execution
  43. The world changes, do the models? In BI you maintain ETL and schemas, in ML you maintain models. “Model decay” happens as the assumptions around which a model is built change, e.g. spam techniques change. When you adjust the model you need to know it is better again: ▪ Better save the data used to build the model ▪ Better save the model ▪ Baseline and measurements
  44. You need a system of record for analytics
  45. THREE PERSPECTIVES, ONE SOLUTION? There are requirements from all constituents. You need to put them together to have a complete picture of what’s needed.
  46. The missing stakeholder. There is another stakeholder: analytics management - the CAO, CDO, VP of analytics, aka “your boss” if you’re a data scientist. The perspective and problems of the person responsible for oversight of the team and its efforts span the organization and multiple projects.
  47. Repeatability
  48. Operational predictability
  49. Reproducibility
  50. Analytics solutions are interdisciplinary. Team composition is best when the skills and backgrounds are mixed. Domain knowledge is still valuable: ignore the AI and ML hype saying that it’s all math and engineering. Data management and engineering are a necessary part of much of this work.
  51. Data scientists and engineers work from opposing directions. Exploration: help people ask the right questions, frame them, define measurable goals. Modeling: define models that run to determine answers or carry out actions. Integration: deliver the results / product in production, at scale. Applications: build data science models into applications and delivery systems. Infrastructure: provide the systems and practices to build and run the desired models. Diagram concept: Paco Nathan
  52. Using a matrix to plan the project team. Image: Paco Nathan
  53. This is a team sport, not a solo act. Image: Paco Nathan
  54. We already know the craft model doesn’t scale. How do we industrialize like we did for BI?
  55. There is an extensive list of requirements to support. Primary requirements needed by constituents (S = stakeholder, user; D = data scientist, analyst; E = engineer, developer):
     ▪ Data catalog and ability to search it for datasets: X X
     ▪ Self-service access to curated data: X
     ▪ Self-service access to uncurated (unknown, new) data: X X
     ▪ Temporary storage for working with data: X
     ▪ Data integration, cleaning, transformation, preparation tools and environment: X X
     ▪ Persistent storage for source data used by production models: X X
     ▪ Persistent storage for training, testing, production data used by models: X X
     ▪ Storage and management of models: X X
     ▪ Deployment, monitoring, decommissioning models: X
     ▪ Lineage, traceability of changes made for data used by models: X X
     ▪ Lineage, traceability for model changes: X X X
     ▪ Managing baseline data / metrics for comparing model performance: X X X
     ▪ Managing ongoing data / metrics for tracking ongoing model performance: X X X
  56. Non-answer #1: “Innovation as Procurement”. Software vendors want to sell you one thing: high-margin software. Most assume the data is there and ready to use by their application: just load it. Most of the work lies in data integration, cleaning and data management. Embedding analytics in a process adds infrastructure that most organizations don’t have and can’t support. It takes new infrastructure.
  57. Non-answer #2: Best Practices. “78% of high performing companies have a centralized data science team in place in their organization” - follow their lead! This is called survival bias. Flipping a coin is often as effective as “Do what they did.” The problem: you have directions to cross a minefield but no map of where to start.
  58. The enterprise focus needs to be on repeatability - where it can be supported
  59. Key focus for the organization: Infrastructure vs Application. Infrastructure enables value, applications deliver value. Enable applications by pushing the reusable elements down into the platform. The infrastructure is a hidden combination of technology, process and methods.
  60. Data management is a key element of infrastructure. Multiple contexts of use, differing quality levels. You need to keep the original because, just like baking, you can’t unmake dough once it’s mixed.
  61. Manage your data (or it will manage you). Data management is where both analysts and developers are weakest. Modern engineering practices are where data management is weakest. You need to bridge the groups and practices in the organization if you want to make this work repeatable.
  62. Conclusion: new stuff eventually becomes old stuff
  63. About the Presenter. Mark Madsen is president of Third Nature, an advisory firm focused on analytics, data and technology strategy. Mark is an award-winning author, architect and CTO who has received awards for his work from the American Productivity & Quality Center, Smithsonian Institution and industry associations. He is an international speaker, a contributor to Forbes, and member of the O’Reilly Artificial Intelligence and Strata program committees. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  64. About Third Nature. Third Nature is an advisory firm focused on practices and technology in analytics, information strategy, business intelligence and data management. Our goal is to help organizations solve problems using data. We offer education, advisory and research services to support business and IT organizations. We also provide product-related consulting to software vendors in the data industry. We specialize in strategy and architecture, so we look at emerging technologies and markets, evaluating how technologies are applied to solve problems rather than simply comparing product features. We fill the gap between what industry analyst firms cover and what organizations need.
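
Slide 6 refers to a Pareto analysis: the share of buyers who account for 80% of sales volume. The arithmetic behind such a figure can be sketched roughly as follows. This is only an illustration, assuming purchase volume per buyer is already available as a dictionary; the function name `buyers_for_share` and the sample volumes are hypothetical and do not come from the talk or from the CMO Council data.

```python
def buyers_for_share(purchases, target_share=0.8):
    """Return (count, fraction) of the largest buyers needed to reach
    target_share of total volume.

    purchases: dict mapping a buyer id to that buyer's purchase volume.
    """
    volumes = sorted(purchases.values(), reverse=True)
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total purchase volume must be positive")

    running = 0
    for count, volume in enumerate(volumes, start=1):
        running += volume
        # Stop at the first buyer that pushes the cumulative share to the target.
        if running / total >= target_share:
            return count, count / len(volumes)
    return len(volumes), 1.0


# Hypothetical volumes for six buyers (illustrative only).
sample = {"a": 60, "b": 25, "c": 8, "d": 4, "e": 2, "f": 1}
count, fraction = buyers_for_share(sample)
print(f"{count} of {len(sample)} buyers ({fraction:.0%}) make up 80% of volume")
```

The calculation itself is a query-and-reporting exercise. The follow-up questions on slide 7 (what makes those buyers different) are the part that needs modeling.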

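Slide 43 says that when you adjust a model you need to know it is better again, which means keeping the data, the model and a baseline with measurements. A minimal sketch of such a check is below. It assumes accuracy as the metric and a fixed tolerance; the `Baseline` record, the `has_decayed` function, the identifiers and all numbers are illustrative assumptions, not something the talk prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    """Hypothetical record of what was measured when the model was built."""
    model_version: str   # which saved model the measurement belongs to
    dataset_id: str      # which saved dataset it was measured on
    accuracy: float      # metric value recorded at that time


def accuracy(predictions, labels):
    """Fraction of predictions that match the known labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def has_decayed(baseline, current_accuracy, tolerance=0.05):
    """True if the current measurement falls below the baseline by more than tolerance."""
    return current_accuracy < baseline.accuracy - tolerance


# Illustrative values only: a spam model measured at 0.90 when it was built.
baseline = Baseline("spam-v1", "mail-2017-09", accuracy=0.90)
recent_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
recent_predictions = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

current = accuracy(recent_predictions, recent_labels)
print(f"baseline={baseline.accuracy:.2f} current={current:.2f} "
      f"decayed={has_decayed(baseline, current)}")
```

Tying the baseline to a model version and a dataset identifier is what makes the comparison repeatable; without the saved data and model, the measurement cannot be reproduced.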