2019-05-25
Oops!
I wrote my data
science in COBOL
Which language?
Technical Economic Context
Program
Data Science
for us
The use of Advanced
Analytics to help with
complex business
decisions or problems
Advanced
Analytics
for us
Handle Large, Complex
Datasets
Statistical Algorithms
Visualizations, Heuristics
Artificial Intelligence
Machine Learning
Prediction Machine
What languages should we
choose from?
Our Question
so far
? ?
? ?
? ?
? ?
? ?
? ?
? ?
GitHub – Top Languages
JavaScript
Java
Python
PHP
C++
C#
TypeScript
Shell
C
Ruby
Source: Octoverse 2018
Stack Overflow –
Top Languages
Source: Stack Overflow Insights 2019
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
so far
What languages should we choose
from?
Technical Economic Context
O: Obtain
S: Scrub
E: Explore
M: Model
N: iNterpret
Source: A Taxonomy of Data Science
Source: 2018 Kaggle ML & DS Survey
O S E M N
Flat Tables
Related Tables
CSV, TSV, fixed-width
JSON, XML
HTML, CSS
O
S
E
M
N
Our Question
so far
What language is good for:
- Querying related tables?
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
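To make the "querying related tables" criterion concrete, here is a minimal sketch that runs a SQL join through Python's built-in sqlite3 module. The table names, column names and rows are hypothetical, loosely borrowed from the toy customers used later in the deck.

```python
import sqlite3

# In-memory database; the schema and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT PRIMARY KEY, province TEXT);
    CREATE TABLE purchases (customer TEXT, item TEXT, price REAL);
    INSERT INTO customers VALUES
        ('Wonka Industries', 'AB'), ('Wayne Enterprises', 'BC');
    INSERT INTO purchases VALUES
        ('Wonka Industries', 'Toffee', 5.00),
        ('Wonka Industries', 'Toffee', 5.00),
        ('Wayne Enterprises', 'Vitamin D', 25.00);
""")

# One declarative statement relates the two tables and aggregates the result.
query = """
    SELECT c.province, SUM(p.price) AS total
    FROM purchases AS p
    JOIN customers AS c ON c.name = p.customer
    GROUP BY c.province
    ORDER BY c.province
"""
totals = conn.execute(query).fetchall()
conn.close()
print(totals)
```

The host language only ships the statement to the database engine, which is one reason SQL tends to survive whichever general-purpose language a team picks.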
O
S
E
M
N
Flat Tables
Related Tables
CSV, TSV, fixed-width
JSON, XML
HTML, CSS
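As a rough illustration of how little code the flat-file formats above need in a general-purpose language, the sketch below parses CSV and JSON using only Python's standard library. The inline sample strings are hypothetical stand-ins for real files.

```python
import csv
import io
import json

# Hypothetical inline samples standing in for files on disk.
csv_text = "customer,province\nWonka Industries,Alberta\nWayne Enterprises,BC\n"
json_text = '[{"customer": "Stark Industries", "province": "AB"}]'

# csv.DictReader yields one dict per row, keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
# json.loads returns plain Python lists and dicts.
json_rows = json.loads(json_text)

# Both sources now share one shape: a list of dicts.
records = csv_rows + json_rows
print(len(records), records[0]["province"])
```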
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML
and scraping websites?
Our Question
so far
COBOL
SQL
What we have What we need
O
S
E
M
N
O
S
E
M
N
Inconsistencies
Alberta = AB
Customer | Province
Wonka Industries | Alberta
Stark Industries | AB
Wayne Enterprises | BC
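One possible way to scrub the inconsistency above is a lookup that maps every spelling to a single code. This is a sketch only: the PROVINCE_CODES mapping and the normalize_province helper are hypothetical names, and a real dataset would need a fuller mapping.

```python
# Hypothetical lookup from full province names to two-letter codes.
PROVINCE_CODES = {"alberta": "AB", "british columbia": "BC"}


def normalize_province(value):
    """Return a two-letter code for a province name or an existing code."""
    cleaned = value.strip()
    # Unknown values fall through unchanged apart from upper-casing.
    return PROVINCE_CODES.get(cleaned.lower(), cleaned.upper())


rows = [
    ("Wonka Industries", "Alberta"),
    ("Stark Industries", "AB"),
    ("Wayne Enterprises", "BC"),
]
scrubbed = [(customer, normalize_province(p)) for customer, p in rows]
print(scrubbed)
```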
O
S
E
M
N
Categorical Data
Customer | Death By
Wonka Industries | Chocolate
Stark Industries | Plasma Burns
Wayne Enterprises | Multiple Contusions

Customer | Death by Chocolate | Death by Plasma Burns | Death by Multiple Contusions
Wonka Industries | 1 | 0 | 0
Stark Industries | 0 | 1 | 0
Wayne Enterprises | 0 | 0 | 1
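The transformation between the two tables above is commonly called one-hot encoding. The sketch below, with a hypothetical one_hot helper, also shows why plain integer labels are risky: averaging two labels lands on an unrelated third category.

```python
causes = ["Chocolate", "Plasma Burns", "Multiple Contusions"]

# Naive approach: integer labels imply an order and a distance.
labels = {cause: i for i, cause in enumerate(causes)}
naive_mean = (labels["Chocolate"] + labels["Multiple Contusions"]) / 2
# naive_mean coincides with the label for "Plasma Burns", which is meaningless.


def one_hot(value, categories):
    """Return a 0/1 indicator list with a single 1 at the value's position."""
    return [1 if value == c else 0 for c in categories]


encoded = {
    "Wonka Industries": one_hot("Chocolate", causes),
    "Stark Industries": one_hot("Plasma Burns", causes),
    "Wayne Enterprises": one_hot("Multiple Contusions", causes),
}
print(encoded)
```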
O
S
E
M
N
Flatten (Denormalize)
Customer | Province
Wonka Industries | Alberta
Stark Industries | AB
Wayne Enterprises | BC

Customer | Item | Price | Date
Wonka Industries | Toffee | 5.00 | 2018-12-31
Stark Industries | Iron | 15.00 | 2018-03-30
Wayne Enterprises | Vitamin D | 25.00 | 2018-07-31
Wonka Industries | Toffee | 5.00 | 2019-01-04
Stark Industries | Iron | 15.00 | 2018-04-15
Wayne Enterprises | Vitamin D | 25.00 | 2018-08-01

Customer | Death By
Wonka Industries | Chocolate
Stark Industries | Plasma Burns
Wayne Enterprises | Multiple Contusions
O
S
E
M
N
Flatten (Denormalize)
Customer | Item | Price | Date | Province | Death By
Wonka Industries | Toffee | 5.00 | 2018-12-31 | Alberta | Chocolate
Stark Industries | Iron | 15.00 | 2018-03-30 | AB | Plasma Burns
Wayne Enterprises | Vitamin D | 25.00 | 2018-07-31 | BC | Multiple Contusions
Wonka Industries | Toffee | 5.00 | 2019-01-04 | Alberta | Chocolate
Stark Industries | Iron | 15.00 | 2018-04-15 | AB | Plasma Burns
Wayne Enterprises | Vitamin D | 25.00 | 2018-08-01 | BC | Multiple Contusions
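Flattening amounts to looking up each purchase's customer in the two reference tables and copying the attributes onto the row. Below is a minimal standard-library sketch using the first three purchases; the variable and key names are illustrative, and a DataFrame library would usually express this as a merge.

```python
# Reference tables keyed by customer (values as shown on the slide).
provinces = {
    "Wonka Industries": "Alberta",
    "Stark Industries": "AB",
    "Wayne Enterprises": "BC",
}
deaths = {
    "Wonka Industries": "Chocolate",
    "Stark Industries": "Plasma Burns",
    "Wayne Enterprises": "Multiple Contusions",
}
# Transaction table: one tuple per purchase.
purchases = [
    ("Wonka Industries", "Toffee", 5.00, "2018-12-31"),
    ("Stark Industries", "Iron", 15.00, "2018-03-30"),
    ("Wayne Enterprises", "Vitamin D", 25.00, "2018-07-31"),
]

# Denormalize: repeat the customer attributes on every purchase row.
flat = [
    {
        "customer": customer,
        "item": item,
        "price": price,
        "date": date,
        "province": provinces[customer],
        "death_by": deaths[customer],
    }
    for customer, item, price, date in purchases
]
print(flat[0])
```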
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables /
DataFrames?
Our Question
so far
C
COBOL
SQL
Define or
observe
process
Codify process
Repeatable
Outcome
Verify
Result
O
S
E
M
N
Programming an Application
Define or
observe
problem
Experiment
Observe
Result
Exp.
Exp.
Exp.
Strong library of math algorithms and visualizations
O
S
E
M
N
Programming in Data Science
Interactive (REPL) languages
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for math analysis and
in-flight visualizations?
Our Question
so far
Source: Wikipedia – Interactive Languages
C
C# COBOL
GO
Kotlin
Rust
C++
Java
SQL
O
S
E
M
N
O
S
E
M
N
y = ax^2 + bx + c
y = 10x^2 + 5x + 12
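The two equations above show an algorithm before and after its coefficients are filled in. As a hedged sketch of that "training" step, the code below recovers a, b and c from sample points by ordinary least squares, using only the standard library. The fit_quadratic helper and the sample x values are assumptions for illustration; a real project would normally call a numerical library instead.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    # Each design-matrix row is [x^2, x, 1].
    rows = [[x * x, x, 1.0] for x in xs]
    # Normal equations: (A^T A) w = A^T y, stored as an augmented 3x4 matrix.
    m = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(3)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    # Back substitution.
    w = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        tail = sum(m[i][k] * w[k] for k in range(i + 1, 3))
        w[i] = (m[i][3] - tail) / m[i][i]
    return tuple(w)


# Synthetic observations generated from the slide's populated equation.
xs = [-2, -1, 0, 1, 2, 3]
ys = [10 * x * x + 5 * x + 12 for x in xs]
a, b, c = fit_quadratic(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```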
O
S
E
M
N
O
S
E
M
N
Source: Peekaboo Vision
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
so far
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Libraries for machine learning?
- Distributed modeling on Spark?
Python R
Scala
C
C# COBOL
GO Java
Kotlin
Rust
C++
SQL
Ruby
Technical Economic Context
Metcalfe’s
Law
The value of a network grows as
the square of the number of its
users
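A tiny numeric sketch of the law as stated: if value grows with the square of the user count, doubling the users quadruples the value. The network_value function and its scale constant k are hypothetical, since the law only fixes the proportionality.

```python
def network_value(users, k=1.0):
    """Metcalfe's law: value is proportional to users squared (k is an arbitrary scale)."""
    return k * users ** 2


# Doubling the user base from 1,000 to 2,000 ...
ratio = network_value(2000) / network_value(1000)
# ... multiplies the modelled value by four, whatever k is.
print(ratio)
```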
Network Effects
Metcalfe’s
Law
Network Effects
Users /
Nodes
Value
Network
Network Effects
Source: Stack Overflow Insights 2019
Network Effects
Source: Stack Overflow Insights 2019
Network Effects
Source: Stack Overflow Tags and GitHub Stars
Network Effects
Source: Stack Overflow Tags and GitHub Stars
7 years old
15 years old
25 years old
29 years old
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
so far
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Reducing the amount of time spent
debugging and writing code that
already exists?
C
C# COBOL
GO Java
Julia Kotlin
Ruby Rust
SQL
C++
Python R
Scala
Quantity
Price
Supply
Demand
Discount
Premium
Supply and Demand
Supply and Demand
Source: Supply & Demand by Vilmos Müller
Language | Supply | Demand | Fulfillment
Python | 86% | 34% | 2.5x
R | 38% | 8% | 4.8x
Scala | <10% | 12% | 0.8x
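The fulfillment column appears to be supply divided by demand. The sketch below reproduces that ratio under one loud assumption: Scala's "<10%" supply is taken as exactly 10, which is only an upper bound, so its true ratio may be lower.

```python
# (supply %, demand %) as read off the slide. Scala's "<10%" supply is
# taken as 10 here, which is an assumption (an upper bound).
market = {"Python": (86, 34), "R": (38, 8), "Scala": (10, 12)}

# A ratio below 1 means demand outstrips supply.
fulfillment = {
    lang: supply / demand for lang, (supply, demand) in market.items()
}

for lang, ratio in fulfillment.items():
    status = "surplus" if ratio >= 1 else "shortage"
    print(f"{lang}: {ratio:.1f}x ({status})")
```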
Source: Stack Overflow Insights 2019
Supply and Demand
Language | Cost | Premium
Python | $63k | baseline
R | $64k | ~1.6%
Scala | $78k | 24%
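A premium over the baseline can be computed as (cost - baseline) / baseline. The sketch below applies that formula to the rounded salary figures on the slide, so the results are approximate; the variable names are illustrative.

```python
# Rounded salaries from the slide, in dollars; Python is the baseline.
salaries = {"Python": 63_000, "R": 64_000, "Scala": 78_000}
baseline = salaries["Python"]

# Premium as a fraction of the baseline salary.
premium = {
    lang: (pay - baseline) / baseline for lang, pay in salaries.items()
}

for lang, value in premium.items():
    print(f"{lang}: {value:.1%}")
```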
C C++
C# COBOL
GO Java
Julia Kotlin
Python R
Ruby Rust
Scala SQL
Our Question
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Boosting productivity and efficiency?
- Reducing the supply premium?
Python R
C
C# COBOL
GO Java
Julia Kotlin
Ruby Rust
SQL
C++
Scala
Time
Knowledge
Learning Curve
Fast Learning Curve
Typical Learning Curve
Practitioner
Novice
Expert
Time Savings = Cost Savings
Learning Curve
Source: Coding Dojo
Learning Curve
Source: WP Engine
Our Question
What language is good for:
- Querying related tables?
- Handling Excel, CSV, JSON, XML and scraping websites?
- Manipulating DataTables / DataFrames?
- REPL Interactivity?
- Libraries for statistical analysis and in-flight visualizations?
- Distributed modeling on Spark?
- Boosting productivity and efficiency?
- Reducing the supply premium?
- Reducing training costs?
Python R
C
C# COBOL
GO Java
Julia Kotlin
Ruby Rust
SQL
C++
Scala
R
Python
Solution
Source: 2018 Kaggle ML & DS Survey
O S E M N
Python | SQL | Algorithms
O: Obtain
S: Scrub
E: Explore
M: Model
N: iNterpret
O
S
E
M
N
O
S
E
M
N
Python | SQL | Algorithms
Storytelling


Editor's Notes

  1. An individual can
  2. Led to the question, what is the right language for data science?
  3. We’ll start by exploring some technical aspects, then some economic ones … but first, some context
  4. Building a program, and building a team around the program. This is not about you as an individual. Will need to consider not just the current business problem, but the many business problems you will face.
  5. Needs refinement
  6. Underlying these are statistical analyses, algorithms, models, visualizations … anything that results in a prediction machine
  7. Based on Github and SO, here’s our list. Now I’ll ask your indulgence here, I’ve added Julia and Scala because they are highly relevant in the context of data science. And we can’t forget COBOL, which we are already starting to see was a mistake in my fictitious data science project. Let’s continue to build our question
  8. Remove anything dedicated to web-programming. Remove anything dedicated to shell scripting.
  9. Let’s get rid of what we already cleared out and anything exclusive to web or mobile apps. Take everything above VBA, because I hate VBA and I never want to talk about VBA again and I’ve already said VBA too many times in this sentence.
  10. Based on Github and SO, here’s our list. Now I’ll ask your indulgence here, I’ve added Julia and Scala because they are highly relevant in the context of data science. And we can’t forget COBOL, which we are already starting to see was a mistake in my fictitious data science project. Let’s continue to build our question
  11. …with a focus on some technical elements. We’ll use the OSEMN model from earlier to walk through the technical gauntlet.
  12. The OSEMN (“Awesome”) model, 2010, by Hilary Mason and Chris Wiggins. Simplified, but it does a good job of capturing the essence of data science. http://www.dataists.com/2010/09/a-taxonomy-of-data-science/ https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492 https://medium.com/@randylaosat/life-of-data-data-science-is-osemn-f453e1febc10
  13. OBTAIN: Although many datasets are obtained from APIs and from scraping websites, the vast majority still come from databases that house the data in a structured form. These might be application databases, ODSs, data warehouses, semantic layers ... Regardless, they’re treated as structured databases. Databases contain almost all of our contextual (reference) data, and almost all of our industry-secret data. Websites contain a wealth of data when trying to extrapolate information that the world, in general, has to offer.
  14. SQL was designed to query tables! In fact, most languages have abstraction libraries that allow you to write SQL or almost-SQL … and most of those are translated into SQL when executed against databases. It is the de facto standard for extracting data from databases, and this point must not be understated.
  15. All of them can handle Excel / CSV / JSON / XML, even COBOL! Not a helpful question to ask. Let’s ignore it and carry on.
  16. Reduce; clean; transform; categorize / label; observe / take notes … What might be a good feature? What is unnecessary noise? What might be an outcome?
  17. One-hot encoding / binarizing. If we simply assign a numerical category to each value (numbered in the order shown), then the average of “Chocolate” and “Multiple Contusions” = “Plasma Burns”.
  18. http://elitedatascience.com/data-cleaning Reduce to what you need; remove outliers; handle missing data.
  19. Highlight SQL for O and S … it’s so valuable that in the very early days of big data, SQL interpreters were essential to adoption. Table support is a show-stopper: if there aren’t native objects or generally accepted libraries that help a language manage data natively as a table, then there’s nowhere to go. C is really close to bare metal (i.e. a low-level language), making it non-ideal. It’s possible, but not pragmatic.
  20. Very All about workflows
  21. We could write our own libraries, but this is an immensely costly effort … and our objective is to make this a cost-effective team / program. REPL: Read – Evaluate – Print – Loop. https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Interactive_mode_languages
  22. What is modeling? A trained model = a populated algorithm
  23. Knowing the algorithm and its purpose; applying the right one to the problem at hand; training a model; testing the model’s validity; tuning the parameters. Training and testing need large datasets. Some algorithms are complex and need mucho data. These require a distributed environment: distributed data frames. The person filling this role needs to know which algorithms are available, and when to apply the appropriate ones.
  24. Julia could, but it’s still really, really new. We are now down to the most relevant languages for data science. For anyone who’s familiar with the field, this is where the question gets difficult. Good time to call out productionalization of a trained model … we can rewrite it in a low-level language for efficiency, or scale the solution in the cloud. I’ve intentionally ignored that.
  25. Let’s see if we can use some economic principles to help us expand our question. Network Effects Supply / Demand Learning Curve
  26. Network effect (the value of X is amplified by Y connected nodes). Number of developers that know the language (SO survey + Google searches); number of Google pages / SO answers / GitHub libraries in the language; number of libraries; number of developers.
  27. Network effect (the value of X is amplified by Y connected nodes). Number of developers that know the language; number of SO answers; number of GitHub libraries in the language.
  28. Observations: Python’s network with frameworks; Python’s response size (the network of people using it); its relationship to data-science frameworks (Pandas, PyTorch, TensorFlow); all interlinked, with Jupyter acting as a node.
  29. Observations: Python’s network with frameworks; Python’s response size (the network of people using it); its relationship to data-science frameworks (Pandas, PyTorch, TensorFlow); all interlinked, with Jupyter acting as a node.
  30. Y: GitHub repos (including data-sci / machine-learning libraries, categorized by target language). X: Stack Overflow questions (including libraries, categorized by target language). Bubble size: language popularity. Why no SQL?
  31. I don’t have to re-invent what I can re-use. Someone else is bound to have hit the problem I’m facing.
  32. Data based on Kaggle datasets including the 2018 Kaggle survey and a job-demand dataset. We can’t fulfill Scala … in economic theory, if supply is below demand, we have to pay a premium to get it. This is really important, let’s validate this.
  33. Here’s SO’s pay-by-technology breakdown. Let’s zoom in on the relevant entries. Note: this doesn’t account for cross-training.
  34. Also, timely staffing when turnover occurs, and reduced poaching
  35. On the fast learning curve, we get to being a practitioner much faster. Even if reaching expert takes around the same time, the developer can be useful much sooner. The faster something can be learned: the lower the up-front cost; the lower the barrier to entry; the greater the adoption, leading to an amplified network effect. Virtuous cycle!
  36. The faster something can be learned: the lower the up-front cost; the lower the barrier to entry; the greater the adoption, leading to an amplified network effect. Virtuous cycle! https://www.codingdojo.com/blog/python-perfect-beginners
  37. The faster something can be learned: the lower the up-front cost; the lower the barrier to entry; the greater the adoption, leading to an amplified network effect. Virtuous cycle! https://www.codingdojo.com/blog/python-perfect-beginners
  38. Don’t need a homogeneous team! Remember: not mutually exclusive! Depending on the size of the team and the problem at hand … team makeup can vary significantly. Technical conclusion: Roles <-> Languages and Knowledge. Which language has the best combination of features and is the most pliable across the data science process?
  39. Our team, as a team, needs to know the syntax, patterns, principles and utilization of Python, and to understand which algorithms are appropriate to the problem.
  40. Remember the OSEMN model? We didn’t talk about the last step – Interpreting! This is what makes it real for stakeholders. If you can’t explain what you did, why you did it, and what the results imply … then it was all for nought.
  41. TRUST: Stakeholders and users of our model want to trust it. If they don’t understand it, they don’t trust it. What data did we obtain? What did we do to scrub it? Why did we choose these algorithms, and this training data? What biases could remain? Under what conditions does this start to break down?
  42. The answer is actually this. Storytelling is clear communication in a natural language (English). I hope you have enjoyed my storytelling today. Thank you.
  43. Interesting chart: Stack Overflow (y-axis) vs. GitHub (x-axis)
  44. https://www.tiobe.com/tiobe-index/
  45. https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/
  46. https://insights.stackoverflow.com/survey/2019#technology