SlideShare uma empresa Scribd logo
1 de 19
Baixar para ler offline
Escaping Flatland: interactive
high-dimensional data analysis in
drug discovery using Spark
Josh Snyder, Victor Hong, Laurent Galafassi
Novartis Institutes for BioMedical Research (NIBR)
Overview
• Use case
– High-dimensional screening data
• Goals
– Production data pipelines for scientists
– Reusable analysis platform for informaticians
• High level architecture
– Spark and other components
• Outcome
– Achievements & impact
– Future work
Screening is scale-out for bench science
1
2
3
4
Data size depends on readout technology,
structure is standard
• Microscopy
• Cell morphometrics
• Image texture
• ...
• Sequencing
• Multiple gene expression
• Cytometry
• Multiple protein expression
5
6
Datasets can be large
1000 plates 1536 wells/plate 1k to 5k cells/well
50 to 2000 features/cell
1 to 10 billion observations
10 to 2000 features
10b to 20 trillion data points
10 GB to 20 TB
+ time points (x10 = 200TB)
+ ??
1 screen
Many features can be used to quantify activity
Active
Control
Neutral
Control
Nucleus/Cytoplasm Intensity
Cell Texture Variance (3 pixel)
…
n = 1000’s
We can only see what we look at
Cell Texture
Variance (3 pixel)
Nucleus/Cytoplasm
Intensity
Average Z’: 0.65Average Z’: 0.78
7
So we need to look at everything
Input
• All observations, all
features
QC
• Mask problem
observations
• Mask problem
features
• Calculate aggregate
measures for review
• Per feature
• Per observation
group
Normalization
• Pattern correction
and scoring for each
feature
• Eliminate
uninformative
features
Classification
• Use full feature
vectors to find cases
showing desired
activity/phenotype
Smells like Spark…
Data Pipeline
• Rows =
observations
• Columns =
features
Data Pipeline
• Column-wise
filtering and
aggregation
Data Pipeline
• Column-wise
correction and
scoring
• Column to column
correlation over
rows
Data Pipeline
• Row-wise
aggregation over
features to
compute distance
metrics
Spark is not a tool for bench scientists
Data Pipeline Data Pipeline Data Pipeline Data Pipeline
Visualization &
Control
Visualization &
Control
Visualization &
Control
Visualization &
Control
Algorithms
Workflow
High-dimensional data-driven architecture
• Pipelines for large data à
Spark
– Distribute computation
– Minimize IO for intermediate
results
– Declarative API
– Support for popular data analysis
languages
– Ecosystem: MLlib, Spark Job
Server, etc.
• Visualization & control à
WebGL
– Web UI flexibility
– Render millions of data points
• Query à Cassandra
– Spark Connector
– Distributed, fast, mature, key-value
/ column family store
Simple workflow
Rich, interactive visualizations
Methods implementations
• Classification
– Mahalanobis Distance
– Gaussian Naïve Bayes
• Coarse-grained utilities
– findNearLinearCombos
– findCorrelation
• Fine-grained utilities
– Streaming models for incrementally integrating data (pairwise
correlation, Greenwald-Khanna quantile estimations, et al.)
– Robust statistical measures (MAD, IQR, et al.)
– Data masking, missing values handlers (casewise, pairwise, imputation)
The big picture
• Achievements
– Multi-day batch jobs à multi-hour jobs
– Unified data format & workflow across readout technologies
– End user application for bench scientists
• Future work
– Elastic infrastructure
– Supervised learning of cell phenotypes
– Methods APIs for informaticians
– Contributions back to open source
The really big picture
Discovery of therapeutics
for patients in need
Informatics applications
Distributed complex
analytics
Spark
Acknowledgments
Nabil Hachem
Fred Harbinski
Ioannis Moutsatsos
Hanspeter Gubler
Sergey Kokorin
Leonid Volobuev
Marat Gazimullin
Evgeniya Condrashina
Alexey Girin
David Wilson
and the entire NIBR project team, stakeholders, & sponsors
Attributions
1. "1905 Otto Folin in biochemistry lab at McLean HospitalbyAHFolsom Harvard" by A H Folsom -
http://preserve.harvard.edu/photographs/McLean.html. Licensed under Public Domain via Commons -
https://commons.wikimedia.org/wiki/File:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png#/media/Fil
e:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png
2. "Petri dish at the Pacific Northwest NationalLaboratory" by Pacific Northwest NationalLaboratory, US Department of Energy -
http://picturethis.pnl.gov/picturet.nsf/by+id/DRAE-8DBTWP. Licensed under Public Domain via Commons -
https://commons.wikimedia.org/wiki/File:Petri_dish_at_the_Pacific_Northwest_National_Laboratory.jpg#/media/File:Petri_dish_at_the_Pacifi
c_Northwest_National_Laboratory.jpg
3. "ChemicalGenomics Robot" by Maggie Bartlett, National Human Genome Research Institute -
http://www.genome.gov/dmd/img.cfm?node=Photos/Technology/Research%20laboratory&id=79299. Licensed under Public Domain via
Commons - https://commons.wikimedia.org/wiki/File:Chemical_Genomics_Robot.jpg#/media/File:Chemical_Genomics_Robot.jpg
4. "385 multiwell plate 1" by real name: Nadina Wiórkiewiczpl.wiki: Nadine90commons: Nadine90 - Own work(dziękiwspółpracy ze szkołą
fotograficzną - Fotoedukacja /in cooperation with the schoolof photography - Fotoedukacja). Licensed under CC BY-SA 3.0 via Wikimedia
Commons - https://commons.wikimedia.org/wiki/File:385_multiwell_plate_1.jpg#/media/File:385_multiwell_plate_1.jpg
5. "Automated confocalimage reader" by Neil Emans IPK - self-made. Original image cropped in this usage. Licensed under CC BY-SA 3.0
via Wikipedia - https://en.wikipedia.org/wiki/File:Automated_confocal_image_reader.jpg#/media/File:Automated_confocal_image_reader.jpg
6. By Kierano - Own work. Original image cropped and resized in this usage. CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=25180061
7. "Flatland sphere". Licensed under Public Domain via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:Flatland_sphere.JPEG#/media/File:Flatland_sphere.JPEG
THANK YOU.
josh.snyder@novartis.com Presentation and project
victor.hong@novartis.com
laurent.galafassi@novartis.com
nabil.hachem@novartis.com NIBR Data Engineering

Mais conteúdo relacionado

Mais procurados

Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaSpark Summit
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowDatabricks
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Spark Summit
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016Sri Ambati
 
Making Homes Efficient and Comfortable Using AI and IoT Data
Making Homes Efficient and Comfortable Using AI and IoT DataMaking Homes Efficient and Comfortable Using AI and IoT Data
Making Homes Efficient and Comfortable Using AI and IoT DataDatabricks
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Optier presentation for open analytics event
Optier presentation for open analytics eventOptier presentation for open analytics event
Optier presentation for open analytics eventOpen Analytics
 
Growing Data Scientists by Amparo Alonso Betanzos
Growing Data Scientists by Amparo Alonso BetanzosGrowing Data Scientists by Amparo Alonso Betanzos
Growing Data Scientists by Amparo Alonso BetanzosBig Data Spain
 
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...confluent
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Databricks
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"Pavel Hardak
 
Apache Spark and future of advanced analytics
Apache Spark and future of advanced analyticsApache Spark and future of advanced analytics
Apache Spark and future of advanced analyticsMuralidhar Somisetty
 
Analytics graph databases
Analytics graph databasesAnalytics graph databases
Analytics graph databasesSteve Sarsfield
 
Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summitOpen Analytics
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanycOpen Analytics
 

Mais procurados (20)

Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016Introduction to Sparkling Water - Spark Summit East 2016
Introduction to Sparkling Water - Spark Summit East 2016
 
Making Homes Efficient and Comfortable Using AI and IoT Data
Making Homes Efficient and Comfortable Using AI and IoT DataMaking Homes Efficient and Comfortable Using AI and IoT Data
Making Homes Efficient and Comfortable Using AI and IoT Data
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Optier presentation for open analytics event
Optier presentation for open analytics eventOptier presentation for open analytics event
Optier presentation for open analytics event
 
Growing Data Scientists by Amparo Alonso Betanzos
Growing Data Scientists by Amparo Alonso BetanzosGrowing Data Scientists by Amparo Alonso Betanzos
Growing Data Scientists by Amparo Alonso Betanzos
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
 
Qo comparision
Qo comparisionQo comparision
Qo comparision
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
 
Apache Spark and future of advanced analytics
Apache Spark and future of advanced analyticsApache Spark and future of advanced analytics
Apache Spark and future of advanced analytics
 
Analytics graph databases
Analytics graph databasesAnalytics graph databases
Analytics graph databases
 
Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 

Destaque

ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...Spark Summit
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark OverviewairisData
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...Spark Summit
 
Designing Progressive and Interactive Analytics Processes for High-Dimensiona...
Designing Progressive and Interactive Analytics Processes for High-Dimensiona...Designing Progressive and Interactive Analytics Processes for High-Dimensiona...
Designing Progressive and Interactive Analytics Processes for High-Dimensiona...Cagatay Turkay
 
High Dimensional Data Visualization
High Dimensional Data VisualizationHigh Dimensional Data Visualization
High Dimensional Data VisualizationFabian Keller
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsairisData
 
Predict Repeat Shoppers with H20 and Spark
Predict Repeat Shoppers with H20 and SparkPredict Repeat Shoppers with H20 and Spark
Predict Repeat Shoppers with H20 and SparkairisData
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
Scala Programming Introduction
Scala Programming IntroductionScala Programming Introduction
Scala Programming IntroductionairisData
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Mentoring I
Mentoring IMentoring I
Mentoring Ishendin
 
碳酸鈉81027
碳酸鈉81027碳酸鈉81027
碳酸鈉81027family
 
Xaneiro 2015
Xaneiro 2015Xaneiro 2015
Xaneiro 2015iesasorey
 
Xander santiago 7 b
Xander santiago 7 bXander santiago 7 b
Xander santiago 7 bXandermanuel
 
Hi, today i present you five famous
Hi, today i present you five famousHi, today i present you five famous
Hi, today i present you five famousAnamaria Pinto
 
Factors that influence path to purchase
Factors that influence path to purchaseFactors that influence path to purchase
Factors that influence path to purchaseSasmita Pati
 
HP LaserJet Pro P1606dn – CE278A Toner Replacement
HP LaserJet Pro P1606dn – CE278A Toner ReplacementHP LaserJet Pro P1606dn – CE278A Toner Replacement
HP LaserJet Pro P1606dn – CE278A Toner ReplacementAbdur Rahman Quadri
 

Destaque (20)

ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
ggplot2.SparkR: Rebooting ggplot2 for Scalable Big Data Visualization by Jong...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
 
Designing Progressive and Interactive Analytics Processes for High-Dimensiona...
Designing Progressive and Interactive Analytics Processes for High-Dimensiona...Designing Progressive and Interactive Analytics Processes for High-Dimensiona...
Designing Progressive and Interactive Analytics Processes for High-Dimensiona...
 
High Dimensional Data Visualization
High Dimensional Data VisualizationHigh Dimensional Data Visualization
High Dimensional Data Visualization
 
Getting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analyticsGetting Spark ready for real-time, operational analytics
Getting Spark ready for real-time, operational analytics
 
Predict Repeat Shoppers with H20 and Spark
Predict Repeat Shoppers with H20 and SparkPredict Repeat Shoppers with H20 and Spark
Predict Repeat Shoppers with H20 and Spark
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
Scala Programming Introduction
Scala Programming IntroductionScala Programming Introduction
Scala Programming Introduction
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Mentoring I
Mentoring IMentoring I
Mentoring I
 
碳酸鈉81027
碳酸鈉81027碳酸鈉81027
碳酸鈉81027
 
Xaneiro 2015
Xaneiro 2015Xaneiro 2015
Xaneiro 2015
 
Xander santiago 7 b
Xander santiago 7 bXander santiago 7 b
Xander santiago 7 b
 
Hi, today i present you five famous
Hi, today i present you five famousHi, today i present you five famous
Hi, today i present you five famous
 
Presentación sobre Diabetes
Presentación sobre DiabetesPresentación sobre Diabetes
Presentación sobre Diabetes
 
Factors that influence path to purchase
Factors that influence path to purchaseFactors that influence path to purchase
Factors that influence path to purchase
 
Factura
FacturaFactura
Factura
 
HP LaserJet Pro P1606dn – CE278A Toner Replacement
HP LaserJet Pro P1606dn – CE278A Toner ReplacementHP LaserJet Pro P1606dn – CE278A Toner Replacement
HP LaserJet Pro P1606dn – CE278A Toner Replacement
 

Semelhante a Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discovery Using Spark by Josh Snyder, Victor Hong and Laurent Galafassi

Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of ScienceGlobus
 
Data-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemData-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemMaryann Martone
 
Ramil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientistsRamil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientistsGigaScience, BGI Hong Kong
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchEuropean Bioinformatics Institute
 
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...Maryann Martone
 
Neurosciences Information Framework (NIF): An example of community Cyberinfra...
Neurosciences Information Framework (NIF): An example of community Cyberinfra...Neurosciences Information Framework (NIF): An example of community Cyberinfra...
Neurosciences Information Framework (NIF): An example of community Cyberinfra...Neuroscience Information Framework
 
Databases and Ontologies: Where do we go from here?
Databases and Ontologies:  Where do we go from here?Databases and Ontologies:  Where do we go from here?
Databases and Ontologies: Where do we go from here?Maryann Martone
 
FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_PresentationYatpang Cheung
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)Michael Atkins
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsKen Karapetyan
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeEdward Baker
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeVince Smith
 
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...Larry Smarr
 
Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsOla Spjuth
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralPaolo Missier
 

Semelhante a Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discovery Using Spark by Josh Snyder, Victor Hong and Laurent Galafassi (20)

Overview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data AnalysisOverview of Next Gen Sequencing Data Analysis
Overview of Next Gen Sequencing Data Analysis
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
 
Data-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemData-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystem
 
Ramil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientistsRamil Mauleon: Galaxy: bioinformatics for rice scientists
Ramil Mauleon: Galaxy: bioinformatics for rice scientists
 
iMicrobe_ASLO_2015
iMicrobe_ASLO_2015iMicrobe_ASLO_2015
iMicrobe_ASLO_2015
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Advanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven ResearchAdvanced Bioinformatics for Genomics and BioData Driven Research
Advanced Bioinformatics for Genomics and BioData Driven Research
 
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...
EcsiNeurosciences Information Framework (NIF): An example of community Cyberi...
 
Neurosciences Information Framework (NIF): An example of community Cyberinfra...
Neurosciences Information Framework (NIF): An example of community Cyberinfra...Neurosciences Information Framework (NIF): An example of community Cyberinfra...
Neurosciences Information Framework (NIF): An example of community Cyberinfra...
 
Databases and Ontologies: Where do we go from here?
Databases and Ontologies:  Where do we go from here?Databases and Ontologies:  Where do we go from here?
Databases and Ontologies: Where do we go from here?
 
FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_Presentation
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
 
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platformsChemSpider – disseminating data and enabling an abundance of chemistry platforms
ChemSpider – disseminating data and enabling an abundance of chemistry platforms
 
HPC at NIBR
HPC at NIBRHPC at NIBR
HPC at NIBR
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
 
NHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-LifeNHM Data Portal: first steps toward the Graph-of-Life
NHM Data Portal: first steps toward the Graph-of-Life
 
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
A National Big Data Cyberinfrastructure Supporting Computational Biomedical R...
 
Towards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery LabsTowards Automated AI-guided Drug Discovery Labs
Towards Automated AI-guided Drug Discovery Labs
 
Big Data
Big Data Big Data
Big Data
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
 

Mais de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mais de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 

Último (20)

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 

Escaping Flatland: Interactive High-Dimensional Data Analysis in Drug Discovery Using Spark by Josh Snyder, Victor Hong and Laurent Galafassi

  • 1. Escaping Flatland: interactive high-dimensional data analysis in drug discovery using Spark Josh Snyder, Victor Hong, Laurent Galafassi Novartis Institutes for BioMedical Research (NIBR)
  • 2. Overview • Use case – High-dimensional screening data • Goals – Production data pipelines for scientists – Reusable analysis platform for informaticians • High level architecture – Spark and other components • Outcome – Achievements & impact – Future work
  • 3. Screening is scale-out for bench science 1 2 3 4
  • 4. Data size depends on readout technology, structure is standard • Microscopy • Cell morphometrics • Image texture • ... • Sequencing • Multiple gene expression • Cytometry • Multiple protein expression 5 6
  • 5. Datasets can be large 1000 plates 1536 wells/plate 1k to 5k cells/well 50 to 2000 features/cell 1 to 10 billion observations 10 to 2000 features 10b to 20 trillion data points 10 GB to 20 TB + time points (x10 = 200TB) + ?? 1 screen
  • 6. Many features can be used to quantify activity Active Control Neutral Control Nucleus/Cytoplasm Intensity Cell Texture Variance (3 pixel) … n = 1000’s
  • 7. We can only see what we look at Cell Texture Variance (3 pixel) Nucleus/Cytoplasm Intensity Average Z’: 0.65Average Z’: 0.78 7
  • 8. So we need to look at everything Input • All observations, all features QC • Mask problem observations • Mask problem features • Calculate aggregate measures for review • Per feature • Per observation group Normalization • Pattern correction and scoring for each feature • Eliminate uninformative features Classification • Use full feature vectors to find cases showing desired activity/phenotype
  • 9. Smells like Spark… Data Pipeline • Rows = observations • Columns = features Data Pipeline • Column-wise filtering and aggregation Data Pipeline • Column-wise correction and scoring • Column to column correlation over rows Data Pipeline • Row-wise aggregation over features to compute distance metrics
  • 10. Spark is not a tool for bench scientists Data Pipeline Data Pipeline Data Pipeline Data Pipeline Visualization & Control Visualization & Control Visualization & Control Visualization & Control Algorithms Workflow
  • 11. High-dimensional data-driven architecture • Pipelines for large data à Spark – Distribute computation – Minimize IO for intermediate results – Declarative API – Support for popular data analysis languages – Ecosystem: MLlib, Spark Job Server, etc. • Visualization & control à WebGL – Web UI flexibility – Render millions of data points • Query à Cassandra – Spark Connector – Distributed, fast, mature, key-value / column family store
  • 14. Methods implementations • Classification – Mahalanobis Distance – Gaussian Naïve Bayes • Coarse-grained utilities – findNearLinearCombos – findCorrelation • Fine-grained utilities – Streaming models for incrementally integrating data (pairwise correlation, Greenwald-Khanna quantile estimations, et al.) – Robust statistical measures (MAD, IQR, et al.) – Data masking, missing values handlers (casewise, pairwise, imputation)
  • 15. The big picture • Achievements – Multi-day batch jobs à multi-hour jobs – Unified data format & workflow across readout technologies – End user application for bench scientists • Future work – Elastic infrastructure – Supervised learning of cell phenotypes – Methods APIs for informaticians – Contributions back to open source
  • 16. The really big picture Discovery of therapeutics for patients in need Informatics applications Distributed complex analytics Spark
  • 17. Acknowledgments Nabil Hachem Fred Harbinski Ioannis Moutsatsos Hanspeter Gubler Sergey Kokorin Leonid Volobuev Marat Gazimullin Evgeniya Condrashina Alexey Girin David Wilson and the entire NIBR project team, stakeholders, & sponsors
  • 18. Attributions 1. "1905 Otto Folin in biochemistry lab at McLean HospitalbyAHFolsom Harvard" by A H Folsom - http://preserve.harvard.edu/photographs/McLean.html. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png#/media/Fil e:1905_Otto_Folin_in_biochemistry_lab_at_McLean_Hospital_byAHFolsom_Harvard.png 2. "Petri dish at the Pacific Northwest NationalLaboratory" by Pacific Northwest NationalLaboratory, US Department of Energy - http://picturethis.pnl.gov/picturet.nsf/by+id/DRAE-8DBTWP. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Petri_dish_at_the_Pacific_Northwest_National_Laboratory.jpg#/media/File:Petri_dish_at_the_Pacifi c_Northwest_National_Laboratory.jpg 3. "ChemicalGenomics Robot" by Maggie Bartlett, National Human Genome Research Institute - http://www.genome.gov/dmd/img.cfm?node=Photos/Technology/Research%20laboratory&id=79299. Licensed under Public Domain via Commons - https://commons.wikimedia.org/wiki/File:Chemical_Genomics_Robot.jpg#/media/File:Chemical_Genomics_Robot.jpg 4. "385 multiwell plate 1" by real name: Nadina Wiórkiewiczpl.wiki: Nadine90commons: Nadine90 - Own work(dziękiwspółpracy ze szkołą fotograficzną - Fotoedukacja /in cooperation with the schoolof photography - Fotoedukacja). Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:385_multiwell_plate_1.jpg#/media/File:385_multiwell_plate_1.jpg 5. "Automated confocalimage reader" by Neil Emans IPK - self-made. Original image cropped in this usage. Licensed under CC BY-SA 3.0 via Wikipedia - https://en.wikipedia.org/wiki/File:Automated_confocal_image_reader.jpg#/media/File:Automated_confocal_image_reader.jpg 6. By Kierano - Own work. Original image cropped and resized in this usage. CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=25180061 7. "Flatland sphere". Licensed under Public Domain via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Flatland_sphere.JPEG#/media/File:Flatland_sphere.JPEG
  • 19. THANK YOU. josh.snyder@novartis.com Presentation and project victor.hong@novartis.com laurent.galafassi@novartis.com nabil.hachem@novartis.com NIBR Data Engineering