Do Your Homework! Writing tests for Data Science and Stochastic Code - David Waterman

To productionize data science work (and have it taken seriously by software engineers, CTOs, clients, or the open source community), you need to write tests! Except… how can you test code that performs nondeterministic tasks like natural language parsing and modeling? This talk presents an approach to testing probabilistic functions in code, illustrated with concrete examples written for Pytest.

Published in: Technology

  1. Do Your Homework! Writing Tests for Data Science and Natural Language Processing. David Waterman, github.com/drwaterman/pydatadctesting
  2. Agenda
       1. Why test your code?
       2. Our problem: text analysis, Natural Language Processing
       3. Practical Implementation
  3. How Does Homework Work? Why give students homework problems and the matching solutions?
  4. If you know where you are going, you can tell if you are moving in the right direction.
  5. Many of our daily tasks are like homework problems.
  6. Test Driven Development Cycle
  7. Why Test Your Code
     For others:
       • Reduces bugs
       • Improves feedback loops
       • Speeds up iteration
       • Makes code more reusable
       • Increases confidence in the system
     For you:
       • Earns trust and confidence in you and your work
       • Earns respect from devs and engineers (for whom it’s not optional)
       • Allows you to submit to open source projects
       • Probably required for acceptance into an external code base
  8. Agenda
       1. Why test your code?
       2. Our problem: text analysis, Natural Language Processing
       3. Practical Implementation
  9. Supply Chain Insight: Who makes what?
  10. Identifying Relationships in Text
  11. Challenges
     The structure of text is domain specific. Regular people don’t talk like this:
       • Per the long-term agreement, WABCO will supply its single-piston air disc brake (ADB) technology, MAXX, for the manufacturing of Hyundai’s new medium-duty trucks, which are expected to start from August 2019.
       • Timken has become the sole supplier of needle roller bearings to Volkswagen Transmission.
       • Dana Corp. has begun supplying Ford Motor Co. with its thermoplastic cylinder-head-cover modules for the automaker's 3.0-liter Duratec V-6 engine.
  12. Our Approach: “Gold” Tests
       • A human reads the text
       • Identifies the relationship in the text
       • Puts the relationship into a machine-friendly format (JSON, YAML)
       • Writes a test for the relationship
       • Writes and rewrites code until the test passes
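
The "gold" test workflow on slide 12 might look roughly like the sketch below. The sentence comes from slide 11; everything else is an assumption made for illustration: the JSON layout, the field names `supplier`, `product`, and `customer`, and the function `extract_relationship`, which here is a single regular expression standing in for a real NLP pipeline. It is not the code from the talk's repository.

```python
import json
import re

# Gold data: the relationship a human identified by reading the sentence.
# The JSON layout and field names are assumptions, not the talk's format.
GOLD_JSON = """
[
  {
    "text": "Timken has become the sole supplier of needle roller bearings to Volkswagen Transmission.",
    "expected": {
      "supplier": "Timken",
      "product": "needle roller bearings",
      "customer": "Volkswagen Transmission"
    }
  }
]
"""

# Hypothetical extractor: one hand-written pattern in place of a real parser.
PATTERN = re.compile(
    r"^(?P<supplier>.+?) has become the sole supplier of "
    r"(?P<product>.+?) to (?P<customer>.+?)\.$"
)


def extract_relationship(text):
    """Return the extracted relationship as a dict, or None if nothing matches."""
    match = PATTERN.match(text)
    return match.groupdict() if match else None


def test_gold_relationships():
    # Pytest collects any function named test_*; plain asserts are enough.
    for case in json.loads(GOLD_JSON):
        assert extract_relationship(case["text"]) == case["expected"]
```

New gold cases go into the data, not the test function, so the suite grows as more text is read and annotated.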
  13. Agenda
       1. Why test your code?
       2. Our problem: text analysis, Natural Language Processing
       3. Practical Implementation
  14. Recommendation: Pytest as your framework
       ➕ More Pythonic
       ➕ Easy to write fast, with less boilerplate
       ➕ Can still run unittests, doctests, and nose
       ➕ Readable, pretty output (including HTML reports)
       ➕ Great documentation & guides
       ➖ It’s not a builtin
  15. What to Test
     Code:
       • Expected output
       • Invalid input
       • Edge cases
     Data:
       • Data is valid
       • Types are correct
       • Missing values are handled correctly
       • Format is correct
     Models:
       • Produces expected results
       • Can be used to benchmark
       • Monitor for model drift
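
As a rough illustration of the Data and Models items on slide 15, the sketch below checks types and missing values on a few records, then benchmarks a stochastic stand-in model against an accuracy threshold. The record fields, `fill_missing_confidence`, `noisy_classifier`, the 10% error rate, and the 0.85 threshold are all hypothetical. The idea for nondeterministic code is to fix the random seed and assert on a tolerance, not on an exact value.

```python
import random

# Hypothetical records; the field names are assumptions for illustration.
RECORDS = [
    {"supplier": "Timken", "customer": "Volkswagen Transmission", "confidence": 0.92},
    {"supplier": "Dana Corp.", "customer": "Ford Motor Co.", "confidence": None},
]


def fill_missing_confidence(records, default=0.0):
    """Return copies of the records with a missing confidence replaced by a default."""
    return [
        {**r, "confidence": default if r["confidence"] is None else r["confidence"]}
        for r in records
    ]


def test_types_and_missing_values():
    for record in fill_missing_confidence(RECORDS):
        assert isinstance(record["supplier"], str) and record["supplier"]
        assert isinstance(record["confidence"], float)
        assert 0.0 <= record["confidence"] <= 1.0


def noisy_classifier(label, rng, error_rate=0.1):
    """Stand-in for a stochastic model: flips the label with probability error_rate."""
    return label if rng.random() >= error_rate else not label


def test_model_accuracy_above_threshold():
    rng = random.Random(0)  # fixed seed keeps the run reproducible
    labels = [i % 2 == 0 for i in range(1000)]
    correct = sum(noisy_classifier(y, rng) == y for y in labels)
    # Assert on a threshold well below the expected accuracy, not an exact score.
    assert correct / len(labels) >= 0.85
```

Rerunning the same threshold test against each new model version is one simple way to notice drift.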
  16. EXAMPLES. Repo: https://github.com/drwaterman/pydatadctesting
  17. Pytest features useful for Data Science
       ▪ Fixtures: for when you need something repeatedly over multiple tests (loading test data, making a connection, preprocessing data)
       ▪ Skip and Xfail: for when you know what to test for but the code doesn’t pass yet
       ▪ Comparing images/plots: available in matplotlib
       ▪ Benchmarking a model
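
A minimal sketch of the fixture, skip, and xfail features listed on slide 17. The decorators `pytest.fixture`, `pytest.mark.xfail`, and `pytest.mark.skip` are real pytest APIs; the helpers `count_tokens` and `find_supplier`, and the reasons given for skipping, are hypothetical and exist only to give the tests something to exercise.

```python
import pytest


@pytest.fixture
def sample_sentences():
    """Shared test data; pytest passes it to any test that names it as an argument."""
    return [
        "Timken has become the sole supplier of needle roller bearings to Volkswagen Transmission.",
    ]


def count_tokens(sentence):
    """Hypothetical preprocessing step: whitespace tokenization."""
    return len(sentence.split())


def find_supplier(sentence):
    """Hypothetical extractor that naively assumes the supplier is the first word."""
    return sentence.split()[0]


def test_token_count(sample_sentences):
    assert count_tokens(sample_sentences[0]) == 13


def test_supplier_simple(sample_sentences):
    assert find_supplier(sample_sentences[0]) == "Timken"


# Known gap: the test states the desired behavior, and xfail records that
# the code does not meet it yet without turning the whole run red.
@pytest.mark.xfail(reason="multi-word company names are not handled yet")
def test_supplier_multiword():
    sentence = "Dana Corp. has begun supplying Ford Motor Co. with cylinder-head-cover modules."
    assert find_supplier(sentence) == "Dana Corp."


# Skipped unconditionally; the reason appears in the test summary.
@pytest.mark.skip(reason="needs a connection to a document store (hypothetical)")
def test_live_documents():
    pass
```

Once `find_supplier` handles multi-word names, the xfail test is reported as an unexpected pass, which is the cue to remove the marker.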
  18. Some Nice Pytest Options
     Save your pytest command line arguments in a shell script:
       pytest --html=test-logs/testreport.html --self-contained-html --cov=my_module --cov-report term-missing -r aPp test
       ▪ --html: where to save the HTML test report
       ▪ --self-contained-html: save everything in one HTML file (no external CSS, etc.)
       ▪ --cov=: which modules to include in the coverage report
       ▪ --cov-report term-missing: terminal report with missing line numbers
       ▪ -r aPp: display a test results summary at the end
       ▪ test: the location in which to run the tests
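
The slide recommends a shell script. As an alternative, not the speaker's method, the same arguments can be kept in a small Python runner that calls `pytest.main`. This sketch assumes the `pytest-html` and `pytest-cov` plugins are installed, since they provide the `--html` and `--cov` options, and it reuses the `my_module` and `test` names from the slide as-is.

```python
import sys

import pytest

# The same options as on the slide, one list entry per command line token.
PYTEST_ARGS = [
    "--html=test-logs/testreport.html",  # requires the pytest-html plugin
    "--self-contained-html",
    "--cov=my_module",                   # requires the pytest-cov plugin
    "--cov-report", "term-missing",
    "-r", "aPp",
    "test",                              # directory containing the tests
]

if __name__ == "__main__":
    # pytest.main returns an exit code; pass it on to the shell.
    sys.exit(pytest.main(PYTEST_ARGS))
```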
  19. CONCLUSION
       • Pytest is easy!
       • Start now
       • It will earn you trust and respect
       • It is possible to use it even if your code is stochastic
     Time for questions!
