O slideshow foi denunciado.
Seu SlideShare está sendo baixado. ×

The Black Box: Interpretability, Reproducibility, and Data Management

Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Anúncio
Carregando em…3
×

Confira estes a seguir

1 de 43 Anúncio

The Black Box: Interpretability, Reproducibility, and Data Management

Baixar para ler offline

The growing complexity of data science leads to black box solutions that few people in an organization understand. You often hear about the difficulty of interpretability—explaining how an analytic model works—and that you need it to deploy models. But people use many black boxes without understanding them…if they’re reliable. It’s when the black box becomes unreliable that people lose trust.
Mistrust is more likely to be created by the lack of reliability, and the lack of reliability is often the result of misunderstanding essential elements of analytics infrastructure and practice. The concept of reproducibility—the ability to get the same results given the same information—extends your view to include the environment and the data used to build and execute models.
Mark Madsen examines reproducibility and the areas that underlie production analytics and explores the most frequently ignored and yet most essential capability, data management. The industry needs to consider its practices so that systems are more transparent and reliable, improving trust and increasing the likelihood that your analytic solutions will succeed.
This talk will treat the black boxed of ML the way management perceives them, as black boxes.
There is much work on explainable models, interpretability, etc. that are important to the task of reproducibility. Much of that is relevant to the practitioner, but the practitioner can become too focused on the part they are most familiar with and focused on. Reproducing the results needs more.

The growing complexity of data science leads to black box solutions that few people in an organization understand. You often hear about the difficulty of interpretability—explaining how an analytic model works—and that you need it to deploy models. But people use many black boxes without understanding them…if they’re reliable. It’s when the black box becomes unreliable that people lose trust.
Mistrust is more likely to be created by the lack of reliability, and the lack of reliability is often the result of misunderstanding essential elements of analytics infrastructure and practice. The concept of reproducibility—the ability to get the same results given the same information—extends your view to include the environment and the data used to build and execute models.
Mark Madsen examines reproducibility and the areas that underlie production analytics and explores the most frequently ignored and yet most essential capability, data management. The industry needs to consider its practices so that systems are more transparent and reliable, improving trust and increasing the likelihood that your analytic solutions will succeed.
This talk will treat the black boxed of ML the way management perceives them, as black boxes.
There is much work on explainable models, interpretability, etc. that are important to the task of reproducibility. Much of that is relevant to the practitioner, but the practitioner can become too focused on the part they are most familiar with and focused on. Reproducing the results needs more.

Anúncio
Anúncio

Mais Conteúdo rRelacionado

Diapositivos para si (20)

Semelhante a The Black Box: Interpretability, Reproducibility, and Data Management (20)

Anúncio

Mais de mark madsen (16)

Mais recentes (20)

Anúncio

The Black Box: Interpretability, Reproducibility, and Data Management

  1. 1. Copyright Third Nature, Inc. Executive Briefing: The black box – interpretability, reproducibility, and data management O’Reilly Artificial Intelligence conference London October, 2019 Mark Madsen
  2. 2. Copyright Third Nature, Inc.Copyright Third Nature, Inc. What does management care about? The perspective and problems of the person responsible for oversight of the results is spread across the organization and across multiple projects ▪ The owner of a decision or the outcome of the decision. Aka the process owner ▪ The CAO, CDO, VP of analytics, aka “your boss” if you’re a data scientist.
  3. 3. Copyright Third Nature, Inc. Repeatability
  4. 4. Copyright Third Nature, Inc. Operational predictability
  5. 5. Reproducibility Can you support the answer at a later date? Can you replicate the process used to get the answer? Can you get similar results on the same or similar input?
  6. 6. Copyright Third Nature, Inc. Scientific reproducibility Can you get the same results given the same starting conditions? ▪ Duplicate as much as possible the experiment. ▪ Results of experimental replication may not be the same for many reasons ▪ Detailed statistical analysis is needed ▪ And critical assessment of experiments
  7. 7. Copyright Third Nature, Inc. Reproducibility in data science Can you get the same results on the same data? ▪ Direct replication is expected ▪ Assumption is that same input = same output ▪ You can have this with an unexplainable box ▪ But there are also confounding factors
  8. 8. Copyright Third Nature, Inc. Moving from predictable rule-based systems to complex mathematical systems, and from there to systems that exhibit stochasticity, makes the task harder. One thing worse than a black box is a random black box.
  9. 9. Copyright Third Nature, Inc. If you trust the box then it’s ok not to understand it. Without reproducibility there cannot be trust. Trust comes from it working It works = reliability, working over time Reliability of your solution is a systems problem, not a just a model problem (in the GST sense of system) Therefore, if there is trust, or there is low risk, you may not need to worry about reproducibility
  10. 10. Copyright Third Nature, Inc. Interpretability and reproducibility are driven by trust, which only matters when there is enough risk Regulation and compliance Material decisions Big penalties Complication: the cost of false positives and false negatives are usually different, so model characteristics matter here. Cost of error Frequency of decision Low High High Don’t care Oh crap
  11. 11. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Beyond toy examples, this problem matters Edge case error problems will limit AI applicability until they are solved (don’t hold your breath). The uncertainty challenge means: ▪ Calculate error costs and apply them to your model before (and after) ▪ The more risk averse, the more predictability is desire, the less variance one can tolerate
  12. 12. Copyright Third Nature, Inc. The real questions Can I support this result at a later date? ▪ Do I care? • Is the risk (cost) worth worrying about? • Is the cost of reproducibility less than the risk and cost? ▪ Do I only need to justify the decision? • interpretability and trust may be enough • Unless there are audits or legal discovery involved ▪ Do I need to reproduce the results? • May not need interpretability • Need a lot of other things
  13. 13. Copyright Third Nature, Inc. The real need is trust. Our trust is based on all the elements that are involved, not just the model. The higher the stakes the more you must think about all the ways it could go wrong. Reliability and robustness of the technology environment is as important as the model. Reproducibility is not just the data scientist’s problem – it is an operational concern.
  14. 14. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Three categories of use, with differing trust levels Embedded Human in the loop Ad hoc
  15. 15. Copyright Third Nature, Inc. Explainability only addresses the functioning of the model
  16. 16. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Three categories of use, with differing trust levels Embedded Human in the loop Ad hoc In general, explainability is more useful when there is human agency and control in decisions. Reproducibility is increasingly important where the machine has agency because the impact of automation expands the scope and increases the complexity beyond the model.
  17. 17. Machine learning is the smallest part of the environment ML Code Analysis Tools Data Collection Machine Resource Management Serving Infrastructure Feature Extraction Configuration Data Verification Process Management Tools Monitoring https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems ©2018 Teradata
  18. 18. Copyright Third Nature, Inc.Copyright Third Nature, Inc. The end result in the data science landscape is complexity You can explain the model, but can you explain this?
  19. 19. Copyright Third Nature, Inc.Copyright Third Nature, Inc. ML Principle: CACE, Change Anything Change Everything In embedded or autonomous ML everything is connected. Events usually happen in real time. ML is very sensitive to context and input. “So what if I changed NumPy in dev?”
  20. 20. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Working backwards from the decision or answer, what do you need to have reproducibiity? ProcessSource Run Env
  21. 21. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Fine for a single-run model. What if it’s embedded in an application somewhere? ProcessSource Run Env App EnvProcess Deploy
  22. 22. Copyright Third Nature, Inc.Copyright Third Nature, Inc. If a model is used over time, you will likely change it. Or sources, or processes will change. ProcessSource Run Env App EnvProcess Deploy Run Env Run Env Process Process Run Env Process Run Env Process
  23. 23. Copyright Third Nature, Inc. We’re so focused on the light switch that we’re not talking about the light.
  24. 24. Copyright Third Nature, Inc. Most emphasis in the industry is on code and code artifacts: ▪ Model repositories ▪ Model management ▪ Pipeline frameworks ▪ Packaging ▪ Versioning ▪ Tools Why? Because vendors want to sell you products for the problem they helped create. Managing the code without managing the data?
  25. 25. Copyright Third Nature, Inc.Copyright Third Nature, Inc. TIDY DATA: Hadley Wickham makes the case for Tidy data sets, that have specific structure, are easy to work with, that free analysts from mundane data manipulation chores – there’s no need to start from scratch and reinvent new methods for data cleaning Source: Tidy Data by Hadley Wickham, Journal of Statistical Software, Vol 59, issue 10 (2014) https://vita.had.co.nz/papers/tidy-data.html What do the experts say?
  26. 26. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Andrew Ng Leading AI Researcher “If your boss asks you, tell them that I said build a unified data warehouse” What about AI? GIGO! Everyone wants shortcuts. There’s aren’t any shortcuts
  27. 27. Copyright Third Nature, Inc. Reproducibility? No problem. Just save versions of data
  28. 28. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Today’s market solution: the Data Lake* *Depiction for illustrative purposes only, no warranty express or implied, actual system may vary. By a lot.
  29. 29. Copyright Third Nature, Inc.Copyright Third Nature, Inc. Today’s market solution: the Data Lake + DIY Data hoarding is not a data management strategy
  30. 30. Copyright Third Nature, Inc. A standard data science approach (to avoid) One-Pipeline- Per-Process Redundant Effort / Cost / Complexity / etc. WELL HEAD WELL HEAD Example use cases
  31. 31. Copyright Third Nature, Inc. A dystopian model: Lake + data engineers + pipelines The Lake with pipelines to files or tables for each model is exactly the same pattern as mainframe COBOL batch Data pipeline code We already know that people don’t scale. Don’t do this
  32. 32. Copyright Third Nature, Inc. Self-service tradeoffs Pay now or pay later, but you will always pay. The question is which payment will be less. Self-service gives flexibility and agility, but can reduce repeatability and add duplicated effort and conflicts, and an increase in risk.
  33. 33. Copyright Third Nature, Inc.Copyright Third Nature, Inc. But that self-serve bit only applies to building models. Embedding them in the enterprise is more involved ProcessSource Run Env App EnvProcess Deploy Run Env Run Env Process Process Run Env Process Run Env Process Managing the data on an ongoing basis is a complex problem to solve
  34. 34. Copyright Third Nature, Inc. The organization-wide focus needs to be on repeatability – where it can be supported
  35. 35. Copyright Third Nature, Inc. Buildings above: flexibility, repurposing, faster change above, funded independently Applications Utilities below: stability, reuse, slow predictable change below, funded centrally Infrastructure The infrastructure is a hidden combination of technology, process, and methods. There is a need to draw boundaries for responsibilities in the organization. It can’t be all business or all IT. Separate the uses from the infrastructure – focus on the city, not the buildings
  36. 36. Copyright Third Nature, Inc.Copyright Third Nature, Inc. 2855 Out of Deviation Sensor Readings Fraud Events CUSTOMER EXPERIENCE SALES MARKETING OPERATIONS Abandoned Carts Online Price Quotes Social Media Influencers FINANCE 11234 Sensor data formatting Unit Normalization Vehicle/version normalization Make/model/year selection Data pipelines, data architecture, and curation
  37. 37. Copyright Third Nature, Inc. Data curation is not data modeling The problem with so many sources, types, formats and latencies of data is that it is now impossible to create in advance one model for all of it. Data modeling is about the inside of a dataset. Curation is about the entire dataset. It’s about: creating, labeling, organizing, finding, navigating, retiring. Data curation, rather than data modeling, is becoming the most important data management practice.
  38. 38. Copyright Third Nature, Inc. Data can be maintained at multiple levels: not raw or DW Ingredients Goal: available User needs a recipe in order to make use of the data. Pre-mixed Goal: discoverable and integrateable User needs a menu to choose from the data available Meals Goal: usable User needs utensils but is given a finished meal
  39. 39. Copyright Third Nature, Inc. Data architecture dictates technology and process Collection Capture metadata (including request) Record the structure Apply keys PII masking, restrict Start lineage Distribution Common structures RDM and MDM Subject models Make data findable and linkable Data provisioning Consumption Model for target use Quality rules Track provenance Apply SLAs As few engines as possible – not fewer Decide on policies for when to place data in which area
  40. 40. Copyright Third Nature, Inc.Copyright Third Nature, Inc. ML has a lot of data requirements: no shortcuts https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
  41. 41. Copyright Third Nature, Inc. Manage your data (or it will manage you) Data is the foundation of the models, and the results. Focus on managing data as well as the code and tool artifacts. Reproducibility, and data science in general, requires that you treat the operating and the build environments as a whole. Data science in the long term is not about individual projects, but organizational capability. This means infrastructure, process, governance.
  42. 42. Copyright Third Nature, Inc. Things you can do 1. Determine what level of risk is acceptable based on the costs of false positives and false negatives, or poor accuracy. 2. Manage the data history such that you can always reconstruct it. 3. Think through the dependencies outside the model that affect reproducibility. 4. Track the configuration of everything in the chain of dependencies. 5. Do not let a thousand flowers bloom with BYOT. You will have to cut many of them due to complexity. 6. Do not let IT overcomplicate your environment with technology because of belief things must be big. Brittle infrastructure can’t be tolerated with embedded machine learning. 7. Make a rational decision about how much effort to expend.
  43. 43. Mark Madsen is a Fellow at Teradata in the Technology and Innovation Office. he focused on the data and analytics ecosystem, problems of large-scale design, and R&D. Prior to that he was president of Third Nature, a research and consulting firm focused on analytics, integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. He received awards for his work from the American Productivity & Quality Center and the Smithsonian Institute. He is an international speaker, chairs several conferences and program committees. Mark Madsen

×