Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using examples and tools from the CSIRO bioinformatics team: VariantSpark and GT-Scan2

Published in: Technology

  1. The next terminal – Jupyter: With examples from Bioinformatics (@lynnlangit)
  2. How often do you use the terminal?
  3. Terminal Customizations: Prompt Output Aesthetics Code Comments Graphics
  4. Terminal improved
  5. Terminal improved
  6. What does this Code do?
  7. But it’s not good enough. Why not?
  8. Machine Learning: Too much data to process? Or too much code? Can you ‘see’ what is happening?
  9. What does this Code do? Which algorithm?
  10. Visualizing Data Processing ML Code: Which algorithm?
  11. Now – more data, much more… IoT increases data volume and complexity exponentially
  12. Inspired by Mathematica (thanks, Stephen Wolfram): If you can SEE it (your data and code), you can work with it better
  13. Next terminal -> a better Python REPL
      • Fernando Perez in 2001
      • IPython (interactive)
      • Modeled on Mathematica Notebooks
      • IP(y): Notebook -> in a browser
      • 2012 IPython -> Jupyter Notebook
  14. Enter Jupyter Notebooks
  15. Jupyter Notebooks support the ML Lifecycle
      1. Collect Data: Retrieve Files; Query SQL Databases; Call Web Services; “Scrape” Web Pages
      2. Prepare Data: Explore Data; Validate Data; Clean Data; Features / Data
      3. Train Model: Prepare Training Set; Experiment; Test Model; Visualize
      4. Evaluate Model: Test Performance; Compare Models; Validate Model; Visualize
      5. Deploy Model: Export Model File; Prepare Job; Deploy Container; Re-package Model
      • Execute code blocks: Python, R… code; SQL queries; Shell commands
      • Write Documentation: Markdown language
      • Visualize Data: Viz tools…
  16. Jupyter Visualizations – so many possibilities
  17. Notebook Customizations
      • Multiple Runtimes, Languages; Share output
      • Code or Equations: LaTeX, Math
      • Comments: Markdown, Wiki-like
      • Graphics: Visualizations, Charting Results
      • LIVE DOCUMENTATION: Reproducible Research
  18. Example: Jupyter locally
  19. Mathematica evolved…
      • Jupyter Notebook: Market leader; Started for single use; Academic community; GitHub integration; Added JupyterHub for collaboration
      • Zeppelin Notebook: Started for collaboration; Enterprise Security
      • Vendor Notebook: Databricks for Apache Spark; Jupyter-like, but proprietary format
  20. Running Notebooks
      • Desktop: Install and run
      • Local Server: Can use JupyterHub for groups
      • Cloud: Large number of options
  21. Extending, Refactoring Open Notebooks
      • Write functions in one notebook
      • Link to another notebook
      • Write extensions (nbextensions.com)
  22. Up the bar: Personalized medicine via genomic analysis
  23. Reproducible Research – Experiments as Code
  24. Genomic Research Tools: Two Examples (Bioinformatics | Denis C. Bauer | @allPowerde)
      • GT-Scan2: How can genome engineering be made more effective?
      • VariantSpark: How to find disease genes in population-size cohorts?
  25. Machine learning… on 1.7 Trillion data points (https://www.projectmine.com/about/) (Transformational Bioinformatics | Denis C. Bauer | @allPowerde)
  26. VariantSpark - Parallelize Random Forest for scalability
      • Spark ML’s RF was designed for ‘Big’ low dimensional data.
      • The full genome-wide profile does NOT fit into the executor’s memory
      • “Cursed” BigData, e.g. Genomics: Moderate number of samples with many features
      • Feature set too large to be handled by a single executor
  27. VariantSpark - Parallelize RF to scale with features
      • Firas Abuzaid (Spark Summit 2016), YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
      • Flip the matrix: partition by column
  28. Wide RF scalable with features and samples
  29. New -- Python API for VariantSpark

      # Assumed imports (not shown on the slide); `sc` is the SparkContext
      # that the notebook kernel is expected to provide.
      from pyspark.sql import SparkSession
      from varspark import VariantsContext

      # set up context and input parameters
      spark = SparkSession(sc)
      vc = VariantsContext(spark)
      label = vc.load_label('dius/data/chr22-labels.csv', 'col_name')
      features = vc.import_vcf('dius/data/chr22_1000.vcf')

      # instantiate analysis (parameters are type-checked)
      imp_analysis = features.importance_analysis(label)

      # get significant factors as both a tuple list and a dataframe
      imp_vars = imp_analysis.important_variables(20)
      most_imp_var = imp_vars[0][0]
      imp_df = imp_analysis.variable_importance()
      oob_error = imp_analysis.oob_error()

      # convert to work with common Python tools
      pandas_imp_df = imp_df.toPandas()
  30. Demo VariantSpark: Jupyter for Genomics Research
  31. Cloud-based Jupyter PaaS
      • AWS SageMaker
      • Azure Notebooks
      • Others…
  32. Example - GT-Scan2: Jupyter for Genomics Research
  33. Tools for Jupyter
      • Binder for GitHub
      • Point to your GitHub Repo
      • Jupyter Notebooks
      • requirements.txt
      • It builds a Docker image
      • You can run your Notebooks
  34. Example: Binder
  35. Future of Jupyter for Research: Academic Institutions and Research Labs
      • UC Berkeley, Davis, San Diego
      • Cal Poly San Luis Obispo
      • Clemson University
      • UC Boulder
      • U of Illinois, Minnesota, Missouri, Rochester, Texas
      • MIT
      • Michigan State U
      • Texas A & M
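
Slide 15 lists the lifecycle steps a notebook can hold side by side. As a rough illustration of the first two (Collect via a SQL query, then Prepare by cleaning), here is a minimal sketch using only the Python standard library. The table name, column names, and values are made up for the example and do not come from the talk.

```python
import sqlite3
import statistics

# Hypothetical sample data: (sample_id, expression level); one value is missing.
ROWS = [("s1", 4.2), ("s2", None), ("s3", 5.1), ("s4", 3.9)]


def collect(conn):
    """Step 1 (Collect): pull rows out of a SQL database."""
    return conn.execute("SELECT sample_id, expression FROM samples").fetchall()


def prepare(rows):
    """Step 2 (Prepare): drop rows whose measurement is missing."""
    return [(sample_id, value) for sample_id, value in rows if value is not None]


def summarize(rows):
    """A small 'explore' step: count and mean of the cleaned values."""
    values = [value for _, value in rows]
    return {"n": len(values), "mean": statistics.mean(values)}


# An in-memory database stands in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (sample_id TEXT, expression REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)", ROWS)

raw = collect(conn)
clean = prepare(raw)
summary = summarize(clean)
print(summary)
```

In a notebook each function would typically live in its own cell, with a Markdown cell above it explaining the step, so the data, the code, and the documentation stay in one place.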
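
Slide 21 mentions writing functions in one notebook and linking to them from another. One way to sketch that idea, assuming only that an .ipynb file is a JSON document with a "cells" list, is to read the file and execute its code cells into a namespace. The helper name `load_notebook_code`, the file name `helpers.ipynb`, and the `gc_content` function are hypothetical and exist only for this example.

```python
import json
import os
import tempfile


def load_notebook_code(path):
    """Execute every code cell of a notebook file and return the resulting names."""
    with open(path, encoding="utf-8") as fh:
        nb = json.load(fh)
    namespace = {}
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip Markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        exec(source, namespace)
    return namespace


# A tiny stand-in for a "helpers" notebook: one Markdown cell, one code cell.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Sequence helpers\n"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": [
                "def gc_content(seq):\n",
                "    return sum(base in 'GC' for base in seq.upper()) / len(seq)\n",
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "helpers.ipynb")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(notebook, fh)
    helpers = load_notebook_code(path)

gc = helpers["gc_content"]
print(gc("ATGC"))
```

This sketch only handles cells containing plain Python; cells that use IPython magics or shell escapes would fail under `exec`. Inside a live notebook, IPython's own `%run` magic is probably the more convenient route.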
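
Slides 26 and 27 describe "flipping the matrix" so that a wide dataset is partitioned by column (feature) rather than by row (sample). The toy sketch below illustrates that idea in plain Python: each worker scores only its own columns for the best binary split, and the driver keeps the overall winner. It is not VariantSpark's or Yggdrasil's implementation; the data, the Gini-based scoring, and all function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def gini(labels):
    """Gini impurity for binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_for_column(values, labels):
    """Return (weighted_gini, threshold) for the best 'value <= threshold' split."""
    best = (float("inf"), None)
    n = len(labels)
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, threshold)
    return best


def best_in_partition(partition, labels):
    """Score every (feature_index, column) pair held by one worker."""
    results = []
    for feature_index, column in partition:
        score, threshold = best_split_for_column(column, labels)
        results.append((score, feature_index, threshold))
    return min(results)


# Hypothetical genotype matrix: rows are samples, columns are variants (0/1/2).
rows = [
    [0, 2, 1, 0],
    [1, 2, 0, 1],
    [0, 0, 1, 1],
    [2, 0, 0, 0],
    [1, 2, 2, 0],
    [0, 0, 2, 1],
]
labels = [1, 1, 0, 0, 1, 0]

# Flip the matrix: zip(*rows) yields one tuple per feature instead of per sample.
indexed_columns = list(enumerate(zip(*rows)))

# Partition by column: each worker receives a subset of the features.
n_workers = 2
partitions = [indexed_columns[i::n_workers] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    local_bests = list(pool.map(lambda part: best_in_partition(part, labels), partitions))

# The driver only has to compare one candidate per partition.
score, feature, threshold = min(local_bests)
print(feature, threshold, score)
```

The point of the layout is that every worker needs the (short) label vector plus its own columns, never the full set of features, which is what makes the approach attractive when there are far more features than samples.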
