R meetup talk: Scaling data science with dgit
1. Scaling Data Science
with dgit
Dr. Venkata Pingali
Founder, Scribble Data
pingali@scribbledata.io
https://github.com/pingali
2. Summary
1. Scaling impact of data science requires increasing trust and efficiency
a. Trust requires auditability and reproducibility of results
b. Efficiency requires standardization and automation
2. Dataset is a fundamental abstraction of data science
3. dgit enables git-like management of datasets
a. Python package, open source, MIT license
b. Familiar git interface with modifications
4. Call to collaborate
4. dgit - git wrapper for datasets
1. Python package, MIT license
2. Application of git
3. Beyond git - "Understands" data
a. Metadata generation and management
b. Automatic scanning of working directory for changes
c. Automatic validation and materialization
d. Dependency tracking across repos
e. Automatic audit trails with execution
f. Pipeline support
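Illustration of item (b), scanning a working directory for changes. This is a minimal sketch and not dgit's actual implementation or API: the function names `snapshot` and `diff_snapshots` are hypothetical, and it assumes files are small enough to hash in memory.

```python
import hashlib
import os


def snapshot(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            state[os.path.relpath(path, root)] = digest
    return state


def diff_snapshots(old, new):
    """Classify paths as added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }
```

Taking one snapshot at commit time and another later, then diffing the two, is enough to list what changed without the user naming files by hand.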
6. Anonymized Random Slide from an Actual Presentation
Implication: Large wasted spend, poor production design, baseline worsening
7. Decision-maker Questions
1. Where did the numbers come from? (Correctness, Lineage)
a. Assumption, models, datasets
2. Is this an accident? Does it hold now? (Reproducibility, Retargetability)
a. Model, dataset, and question revisions
3. Can you get the results faster? (Efficiency)
a. Time, effort, cost
4. Can you also analyze X? (Extensibility)
a. Different dataset, question
5. Could we try X? (Dataset generation - synthetic and real)
a. What if scenarios, field experiments
9. Business Complexity is Discovered Over Time
Incomplete context (history, semantics)
Questions not thought through
Continuous revisions
[Diagram labels: Biz, Analytics Team, Data Engg; flows: Questions/Context, Data Req, Datasets, Model Results, Story Telling]
10. Imperfect Data Queries due to Limited Understanding
Dependencies not specified
Wrong filters
Known outliers
Narrow specification (cubes)
[Diagram: same Biz / Analytics Team / Data Engg flow as slide 9]
11. Weak process
Lack of protocol (email/files)
Missing validation checks
No lineage
No revisions
[Diagram: same Biz / Analytics Team / Data Engg flow as slide 9]
12. Eagerness to Present Great Narratives
Wrong input dataset
Mistakes in pipeline
Excel/adhoc transformations
Model evolution
Continuous revision of narratives
Missing interpretation integrity checks (e.g. other time windows)
Better methodology
[Diagram: same Biz / Analytics Team / Data Engg flow as slide 9]
14. Actual Process
[Diagram: same Biz / Analytics Team / Data Engg flow as slide 9]
Iterative
Expensive
Laborious
http://fortune.com/2016/02/05/why-big-data-isnt-paying-off-for-companies-yet/
"80% of .. companies strategic decision go haywire .. 'flawed' data"
15. Desired State
1. Trusted
a. Every model should be auditable to the last record and step
b. Every model should be reproducible with zero human intervention
c. Enables use and development of mathematical judgment
2. Scalable
a. Highly automated through most of the lifecycle
b. Continuous reduction in costs
c. Grow sublinearly with questions, datasets, models
3. Robust
a. Younger, inexperienced staff
b. Weak processes
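One way to picture "auditable to the last step" is a trail that records a fingerprint of each step's input and output. The sketch below is an assumption made for illustration, not how dgit records audit trails; `fingerprint` and `run_step` are hypothetical names, and the data is assumed to be JSON-serializable.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(name, func, data, trail):
    """Run func on data and append an audit entry linking input to output."""
    result = func(data)
    trail.append(
        {"step": name, "input": fingerprint(data), "output": fingerprint(result)}
    )
    return result


# Usage: two chained steps; each entry's output hash matches the next input hash.
trail = []
cleaned = run_step("drop_negatives", lambda rows: [r for r in rows if r >= 0], [3, -1, 4], trail)
total = run_step("sum", sum, cleaned, trail)
```

Because each entry's output fingerprint equals the next entry's input fingerprint, a reviewer can check that no undocumented transformation happened between steps.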
16. Process with Dataset Repository
[Diagram labels: Biz, Analytics Team, Data Engg; Server Side CI (Dataset Rules, Evaluation Rules, Dependencies); Materialized dataset; dataset versions v1 to v6; steps: Materialize, Model Pipeline, Pipeline Execution, Evaluation, Interpretation; other labels: Slide Content, URN, Context, Questions]
Dataset as mutable object with memory
No emails/Google Docs
Continuous validation by third party (server)
Separate model development and evaluation
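The "Dataset Rules" checked by the server could look roughly like the sketch below. This is an assumption for illustration only: the rule format, the `validate` function, and the `price` column are hypothetical and do not come from dgit.

```python
def validate(rows, rules):
    """Return the descriptions of every rule that the dataset violates."""
    return [desc for desc, check in rules if not check(rows)]


# Each rule is a (description, predicate over the whole dataset) pair.
rules = [
    ("dataset is not empty", lambda rows: len(rows) > 0),
    ("every row has a price", lambda rows: all("price" in r for r in rows)),
    ("prices are non-negative", lambda rows: all(r.get("price", 0) >= 0 for r in rows)),
]

failures = validate([{"price": 10}, {"price": -2}], rules)
```

Running the same rules on the server for every new dataset version means validation does not depend on each analyst remembering to do it.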
19. Demo Goals
1. Show end-to-end example (command line)
a. Simple regression
2. Explain structure
3. Advanced features
a. Validation (regression quality plugin)
b. Generator (SQL)
c. Pipeline (Dora)
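A regression quality check of the kind named in 3a might be sketched as follows. This is not the dgit plugin itself: the function names and the 0.8 threshold are assumptions, the fit is ordinary least squares on one variable, and the inputs are assumed to have non-zero variance.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def passes_quality(xs, ys, min_r2=0.8):
    """Accept the regression only if R^2 reaches the (assumed) threshold."""
    slope, intercept = fit_line(xs, ys)
    return r_squared(xs, ys, slope, intercept) >= min_r2
```

A check like this can run automatically whenever the dataset or model changes, so a weak fit is flagged before results reach a slide.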
20. Open Tasks
1. dgit-specific
a. Cleanup and stabilization
i. Python v2/3 compatibility
ii. Plugins to do various tasks (anonymization, Hive, etc.)
b. Testing infrastructure
c. Integration
i. Windows and macOS support
ii. Support for instabase/dat/other services
2. Ideas for new tools to reduce cost and complexity of data science