Scaling Data Science
with dgit
Dr. Venkata Pingali
Founder, Scribble Data
pingali@scribbledata.io
https://github.com/pingali
Summary
1. Scaling impact of data science requires increasing trust and efficiency
a. Trust requires auditability and reproducibility of results
b. Efficiency requires standardization and automation
2. Dataset is a fundamental abstraction of data science
3. dgit enables git-like management of datasets
a. Python package, open source, MIT licence
b. Familiar git interface with modifications
4. Call to collaborate
dgit - 1 min summary
dgit - git wrapper for datasets
1. Python package, MIT license
2. Application of git
3. Beyond git - “Understands” data
a. Metadata generation and management
b. Automatic scanning of working directory for changes
c. Automatic validation and materialization
d. Dependency tracking across repos
e. Automatic audit trails with execution
f. Pipeline support
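One way to picture "automatic scanning of working directory for changes" is a content-hash snapshot of the dataset directory, compared before and after an edit. The sketch below is illustrative only and does not show dgit's own code; the function names (snapshot, diff_snapshots, demo) and the sample files are assumptions made for this example.

```python
import hashlib
import os
import tempfile


def snapshot(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def diff_snapshots(old, new):
    """Classify paths as added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }


def demo():
    """Build a throwaway dataset directory, change it, and report the diff."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical dataset file, created only for this illustration.
        with open(os.path.join(root, "sales.csv"), "w") as handle:
            handle.write("region,amount\nnorth,10\n")
        before = snapshot(root)
        with open(os.path.join(root, "sales.csv"), "a") as handle:
            handle.write("south,12\n")
        with open(os.path.join(root, "notes.txt"), "w") as handle:
            handle.write("added south region\n")
        after = snapshot(root)
        return diff_snapshots(before, after)


if __name__ == "__main__":
    print(demo())
```

Hashing file contents rather than comparing timestamps means a file that is rewritten with identical data is not reported as changed.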
Growing Pains in Data Science
Anonymized Random Slide from an Actual Presentation
Implication: Large wasted spend, poor production design, baseline worsening
Decision-maker Questions
1. Where did the numbers come from? (Correctness, Lineage)
a. Assumption, models, datasets
2. Is this an accident? Does it hold now? (Reproducibility, Retargetability)
a. Model, dataset, and question revisions
3. Can you get the results faster? (Efficiency)
a. Time, effort, cost
4. Can you also analyze X? (Extensibility)
a. Different dataset, question
5. Could we try X? (Dataset generation - synthetic and real)
a. What if scenarios, field experiments
Conceptual Process
[Diagram: three roles (Biz, Analytics Team, Data Engg) linked by Qtns/Context, Data Req, Datasets, Model Results, and Story Telling]
All three roles could be in a single team!
Business Complexity is Discovered Over Time
Incomplete context (history, semantics)
Qtns not thought through
Continuous revisions
Imperfect Data Queries due to Limited Understanding
Dependencies not specified
Wrong filters
Known outliers
Narrow specification (cubes)
Weak process
Lack of protocol (email/files)
Missing validation checks
No lineage
No revisions
Eagerness to Present Great Narratives
Wrong input dataset
Mistakes in pipeline
Excel/adhoc transformations
Model evolution
Continuous revision of narratives
Missing interpretation integrity checks (e.g. other time windows)
Better methodology
Process in Reality
Iterative
Expensive
Laborious
Actual Process
Iterative
Expensive
Laborious
"80% of .. companies strategic decision go haywire.. “flawed” data"
Source: http://fortune.com/2016/02/05/why-big-data-isnt-paying-off-for-companies-yet/
Desired State
1. Trusted
a. Every model should be auditable to the last record and step ⬅
b. Every model should be reproducible with zero human intervention ⬅
c. Enables use and development of mathematical judgment
2. Scalable
a. Highly automated through most of the lifecycle ⬅
b. Continuous reduction in costs ⬅
c. Grow sublinearly with questions, datasets, models
3. Robust
a. Younger, inexperienced staff ⬅
b. Weak processes
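The "auditable to the last record and step" and "reproducible with zero human intervention" goals can be illustrated with a small audit trail: each pipeline step logs a fingerprint of its input and output, so a rerun can be compared against the recorded trail. This is a minimal sketch under stated assumptions, not dgit's mechanism; the helper names, the two sample steps, and the sample rows are all hypothetical.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_audit(steps, data):
    """Apply each (name, function) step in order, logging input/output fingerprints."""
    trail = []
    for name, func in steps:
        result = func(data)
        trail.append(
            {"step": name, "input": fingerprint(data), "output": fingerprint(result)}
        )
        data = result
    return data, trail


def drop_missing(rows):
    """Hypothetical cleaning step: discard rows without an amount."""
    return [row for row in rows if row["amount"] is not None]


def total_by_region(rows):
    """Hypothetical modeling step: sum amounts per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals


# Illustrative pipeline and data; neither comes from the talk.
STEPS = [("drop_missing", drop_missing), ("total_by_region", total_by_region)]
ROWS = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": None},
    {"region": "north", "amount": 5},
]

if __name__ == "__main__":
    result, trail = run_with_audit(STEPS, ROWS)
    _, rerun_trail = run_with_audit(STEPS, ROWS)
    print(result)
    print("rerun matches recorded trail:", trail == rerun_trail)
```

Because each step's output fingerprint equals the next step's input fingerprint, a reviewer can follow the chain from the final number back to the original records.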
Process with Dataset Repository
[Diagram: Biz, Analytics Team, and Data Engg around a dataset repository with Server Side CI (Dataset Rules, Evaluation Rules, Dependencies). Dataset versions v1 to v6 appear alongside the labels Materialized dataset, Materialize, Model Pipeline, Pipeline Execution, Slide Content, URN, Context/Questions, Evaluation, and Interpretation.]
Dataset as mutable object with memory
No emails/google docs
Continuous validation by third party (server)
Separate model development and evaluation
dgit
Dgit Structure
[Diagram labels: dgit CLI, dgitcore API, Repo Mgr, Git, Backend, S3, Validator, Generator, Instrumentation, MySQL, S3, Regression, Content, Platform, Metadata, Basic]
Demo Goals
1. Show end-to-end example (command line)
a. Simple regression
2. Explain structure
3. Advanced features
a. Validation (regression quality plugin)
b. Generator (SQL)
c. Pipeline (Dora)
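As a rough picture of what a regression quality check could look like, the sketch below fits a one-variable least-squares line and flags the fit when R-squared falls below a threshold. It is an assumption-laden illustration, not the dgit regression quality plugin: the function names and the min_r2 default of 0.9 are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def validate_regression(xs, ys, min_r2=0.9):
    """Fit the line and report whether it clears the (assumed) quality bar."""
    slope, intercept = fit_line(xs, ys)
    r2 = r_squared(xs, ys, slope, intercept)
    return {"slope": slope, "intercept": intercept, "r2": r2, "passed": r2 >= min_r2}


if __name__ == "__main__":
    # A clean linear relationship passes; a noisy one is flagged.
    print(validate_regression([1, 2, 3, 4], [3, 5, 7, 9]))
    print(validate_regression([1, 2, 3, 4], [1, 5, 2, 6]))
```

Running a check like this on every dataset revision, rather than by hand, is the kind of automation the validation feature points toward.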
Open Tasks
1. Dgit specific
a. Cleanup and stabilization
i. Python v2/3 compatibility
ii. Plugins to do various tasks (anonymization, hive etc)
b. Testing infrastructure
c. Integration
i. Windows and MacOS support
ii. Support for instabase/dat/other services
2. Ideas for new tools to reduce cost and complexity of data science
Speaker
Dr. Venkata Pingali
Founder, Scribble Data
Former-VP Analytics, FourthLion
IIT(B) PhD (USC)
http://linkedin.com/in/pingali

R meetup talk scaling data science with dgit

  • 1. Scaling Data Science with dgit Dr. Venkata Pingali Founder, Scribble Data pingali@scribbledata.io https://github.com/pingali
  • 2. Summary 1. Scaling impact of data science requires increasing trust and efficiency a. Trust requires auditability and reproducibility of results b. Efficiency requires standardization and automation 2. Dataset is a fundamental abstraction of data science 3. dgit enables git-like management of datasets a. Python package, open source, MIT licence b. Familiar git interface with modifications 4. Call to collaborate
  • 3. dgit - 1 min summary
  • 4. dgit - git wrapper for datasets 1. Python package, MIT license 2. Application of git 3. Beyond git - “Understands” data a. Metadata generation and management b. Automatic scanning of working directory for changes c. Automatic validation and materialization d. Dependency tracking across repos e. Automatic audit trails with execution f. Pipeline support
  • 5. Growing Pains in Data Science
  • 6. Anonymized Random Slide from an Actual Presentation Implication: Large wasted spend, poor production design, baseline worsening
  • 7. Decision-maker Questions 1. Where did the numbers come from? (Correctness, Lineage) a. Assumption, models, datasets 2. Is this an accident? Does it hold now? (Reproducibility, Retargetability) a. Model, dataset, and question revisions 3. Can you get the results faster? (Efficiency) a. Time, effort, cost 4. Can you also analyze X? (Extensibility) a. Different dataset, question 5. Could we try X? (Dataset generation - synthetic and real) a. What if scenarios, field experiments
  • 8. Conceptual Process Biz Analytics Team Data Engg Qtns, Context Data Req Datasets Model Results Story Telling All three roles could be in a single team!
  • 9. Business Complexity is Discovered Over Time Incomplete context (history, semantics) Qtns not thought through Continuous revisions Biz Analytics Team Data Engg Qtns, Context Data Req Datasets Model Results Story Telling
  • 10. Imperfect Data Queries due to Limited Understanding Dependencies not specified Wrong filters Known outliers Narrow specification (cubes) Biz Analytics Team Data Engg Qtns, Context Data Req Datasets Model Results Story Telling
  • 11. Weak process Lack of protocol (email/files) Missing validation checks No lineage No revisions Biz Analytics Team Data Engg Qtns, Context Data Req Datasets Model Results Story Telling
  • 12. Eagerness to Present Great Narratives Wrong input dataset Mistakes in pipeline Excel/adhoc transformations Model evolution Continuous revision of narratives Missing interpretation integrity checks (e.g. other time windows) Better methodology Biz Analytics Team Data Engg Qtns, Context Data Req Datasets Model Results Story Telling
  • 13. Process in Reality Biz Analytics Team Data Engg Qtns, Context Data Req Datasets Model Results Story Telling Iterative Expensive Laborious
  • 14. Actual Process Biz Analytics Team Data Engg Qtns, Context Data Req Datasets Model Results Story Telling Iterative Expensive Laborious http://fortune.com/2016/02/05/why-big-data-isnt-paying-off-for-companies-yet/ "80% of .. companies strategic decision go haywire.. “flawed” data
  • 15. Desired State 1. Trusted a. Every model should be auditable to the last record and step ⬅ b. Every model should be reproducible with zero human intervention ⬅ c. Enables use and development of mathematical judgment 2. Scalable a. Highly automated through most of the lifecycle ⬅ b. Continuous reduction in costs ⬅ c. Grow sublinearly with questions, datasets, models 3. Robust a. Younger, inexperienced staff ⬅ b. Weak processes
  • 16. Process with Dataset Repository Biz Analytics Team Data Engg Server Side CI Dataset Rules Evaluation Rules Dependencies Materialized dataset v1 v2 v3 Materialize Model Pipeline Pipeline Execution v4 Slide Content URN Context, Questions v5 Evaluation Interpretation v6 Dataset as mutable object with memory No emails/Google Docs Continuous validation by third party (server) Separate model development and evaluation
  • 17. dgit
  • 18. Dgit Structure dgitcore API Repo Mgr Git Backend S3 Validator Generator Instrumentation MySQL S3 Regression Content Platform dgit CLI Metadata Basic
  • 19. Demo Goals 1. Show end-to-end example (command line) a. Simple regression 2. Explain structure 3. Advanced features a. Validation (regression quality plugin) b. Generator (SQL) c. Pipeline (Dora)
  • 20. Open Tasks 1. Dgit specific a. Cleanup and stabilization i. Python v2/3 compatibility ii. Plugins to do various tasks (anonymization, hive etc) b. Testing infrastructure c. Integration i. Windows and MacOS support ii. Support for instabase/dat/other services 2. Ideas for new tools to reduce cost and complexity of data science
  • 21. Speaker Dr. Venkata Pingali Founder, Scribble Data Former-VP Analytics, FourthLion IIT(B) PhD (USC) http://linkedin.com/in/pingali
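
Slide 4 lists "automatic scanning of working directory for changes" as one of the ways dgit goes beyond git. The sketch below illustrates that general idea with content hashes from the Python standard library. It is not dgit's implementation or API: the function names `snapshot`, `diff`, and `demo`, and the sample files `sales.csv` and `notes.txt`, are hypothetical and exist only for illustration.

```python
import hashlib
import json
import os
import tempfile


def snapshot(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def diff(old, new):
    """Compare two snapshots and report added, removed, and modified paths."""
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(p for p in common if old[p] != new[p]),
    }


def demo():
    """Create a throwaway dataset directory, change it, and report the changes."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical dataset file, used only for this illustration.
        with open(os.path.join(root, "sales.csv"), "w") as handle:
            handle.write("region,amount\nnorth,10\n")
        before = snapshot(root)

        # Modify the existing file and add a new one.
        with open(os.path.join(root, "sales.csv"), "a") as handle:
            handle.write("south,12\n")
        with open(os.path.join(root, "notes.txt"), "w") as handle:
            handle.write("added south region\n")
        after = snapshot(root)

        return diff(before, after)


if __name__ == "__main__":
    print(json.dumps(demo(), indent=2))
```

Storing each snapshot next to the dataset is one simple way to support the audit-trail goal from slide 15: any later result can be checked against the exact file contents it was computed from.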