In this presentation, Microsoft data scientists Ben Keen and Shahzia Holtom give an introduction to data science, covering:
- What is a data scientist?
- What data does a data scientist need?
- AI ethics and responsibility
- What is MLOps and how does it drive value?
6. Scientist
- Strong understanding of scientific method & hypothesis testing
- Asks clarifying questions and remains sceptical and objective
- Strong critical thinking, root cause analysis, and research skills
- Bases decisions on data and statistical analysis
9. Tools for the job
Statistics: Bayesian Statistics, Probability Theory, Continuous Distributions, Hypothesis Testing, χ² tests, PMCC/Spearman's Rank, Covariance, Skewness/Kurtosis, Monte Carlo Methods
Machine Learning: Scikit-learn, statsmodels, scipy, PyTorch, TensorFlow, Keras, XGBoost, spark.ml, ONNX, Word2Vec, ISOMAP, NLTK, spaCy, Gensim, OpenCV, LibROSA, lifetimes, Weka, SPSS, MLFlow, Azure Machine Learning, Azure Cognitive Services, Bonsai, Jupyter
Data Storage: Data Lakes, Azure Storage, SQL Server, MySQL, PostgreSQL, Oracle DB, SQLite, Azure Data Warehouse, HDFS, MongoDB, Azure Cosmos DB, Neo4j, Cassandra
Data Processing: Spark/Databricks, Azure Data Factory, Airflow, NiFi, Kafka, Azure Event Hub, Azure Service Bus, Hadoop, Flink, Logstash/Elasticsearch, Docker, Swarm, Kubernetes
Data Manipulation: pandas, NumPy, dplyr, PIL, ScraPy/BS4, pytz
Programming: Python, R, Scala, SAS, MATLAB, Julia, JavaScript, Node.js, Perl, VBA, M, DAX, Powershell, Bash
Visualisation: matplotlib, seaborn, ggplot2, plotly, Bokeh, Dash, Power BI, Tableau, Excel, D3.js, Highcharts, Chart.js, CanvasJS, React-vis, sankeymatic
Optimisation: Gradient Descent, Least Squares, Differential Calculus, Mixed-integer linear programming, Simulated Annealing, Evolutionary Algorithms, Ant Colony Optimisation, Particle Swarm Optimisation, Dijkstra
(Nested diagram: Deep Learning ⊂ Machine Learning ⊂ Artificial Intelligence)
Artificial Intelligence: the ability for machines to mimic human behaviours. See "Computing Machinery and Intelligence", Turing, 1950.
Machine Learning: the application of mathematical and statistical techniques that learn parameters from data rather than being explicitly programmed.
Deep Learning: a subset of machine learning in which neural networks with many layers are used to learn highly non-linear relationships from large amounts of training data.
10. What makes a Data Scientist?
Scientist · Engineer · Business Analyst
17. What makes a Data Scientist?
Scientist · Engineer · Business Analyst
18. Types of Data Scientist
- ML Engineer: operationalisation of models; focus on MLOps, automated tests, CI/CD, ETL
- Applied DS: focus on A/B testing, modelling and experimentation; a view to contributing to a product; uses tried & tested techniques
- Research DS: experimentation with a view to expanding community knowledge and understanding of algorithms; uses novel techniques
- Full Stack DS: generalist; works across modelling, ETL, operationalisation and app development; may be less focused on depth of modelling understanding
- Data Vis. Expert: focus on storytelling with data; a wizard with graphing libraries, including D3.js
30. Value realization is only possible through Continuous Delivery
MLOps: Data Science solutions need to be integrated with People, Process and Products.
Pilot · PoC · Experiment · PoV · MVP
31. Traditional DS Delivery: the "Wall of Confusion"
Data science: "I have a model for you…" / Ops: "How do I deploy, manage, monitor…?"
Data Science Ops challenges: data drift, model decay, stale models, concept drift
32. What is DevOps?
"DevOps is the union of people, process, and products to enable continuous delivery of value."
Continuous delivery cycle: Plan & Track → Develop → Build & Test → Deploy → Operate → Monitor & Learn
People:
- Collaborate early and often
- Cross-disciplinary teams
- Share common goals and metrics
- Shared responsibility
Process:
- Agile principles
- Streamline feedback
- Delivering value faster
Products
33. MLOps
The ability to continuously integrate, automatically test, build, deploy and monitor Machine Learning artifacts such as Data & Training pipelines and models.
34. Data Science is a Team Effort
Architects
Change Management
Data Engineers
Data Scientists
Project Management
App Developers
UX Designers
35. Conclusions
Data Scientists are Scientists, Engineers and Business Analysts
We work best with data of high volume, veracity and variety
We need to keep in mind ethical considerations and act responsibly
when designing systems
MLOps is paramount for delivering customer value
Data Science is a team effort
Data Science is about turning data into impact
aka.ms/benkeen is a short URL to my LinkedIn profile
So today we’re going to cover a range of topics in our Data Science 101 Masterclass.
We’ll start with what a data scientist is, what the job entails and what should be expected of a data scientist.
Then we'll talk about data. The Economist described data as "the new oil" in 2017 – that's open to debate, and I'm not sure I agree – but here we'll look at a couple of questions about data that data scientists face most commonly.
We’ll touch a bit on AI Ethics and Responsibility which, of course, could be an entire masterclass in itself.
Finally we'll cover an important emerging topic in data science – MLOps for operationalising data science.
Data science is not about making awesome visualisations, complicated models, or writing lots of code.
Data scientists turn data into impact
The job is to solve real problems using data
First and foremost, data scientists are scientists – It’s right there in the name. What does this mean? [See next slide]
I don’t come from a mathematics or computer science background myself - my PhD is in molecular genetics – but the skills required of a scientist are the same, whether I’m creating a predictive model as a data scientist now or I’m doing X-ray crystallography on proteins as a biochemist in my old life.
[Read Slide Text]
Data Scientists are also engineers – they design systems to fulfil functional objectives.
In the next couple of slides, we’ll explore the tools data scientists use to do this.
These are the tools for the job – just as with any other type of engineer, good data scientists will know when to use which tools for which tasks and not shoehorn things in just because they like them. These are the tools data scientists use to create the systems that fulfil those functional objectives. As a side note, this is not an exhaustive list.
No single data scientist knows all of this in-depth and we’ll come back to that in a little bit (See types of data scientist).
In the next slide I’m going to focus a little more down into this machine learning section as it’s probably the one that’s most associated with data science.
Artificial intelligence is the ability for machines to mimic human behaviours – including things like recognition, behaviour, reasoning etc.
Machine learning is a subset of artificial intelligence. Machine learning is the application of mathematical and statistical techniques that learn parameters from data.
Artificial intelligence is not just machine learning. AI also includes rules-based systems, in which knowledge from domain experts is explicitly coded as rules. These can be powerful – classical computer vision and NLP syntax and semantics are examples – and they can often be combined with machine learning techniques.
Deep learning is a subset of machine learning, in which we use neural nets with many hidden layers to learn highly non-linear relationships. Deep learning is often used for things like computer vision, natural language processing, speech/sound recognition, bioinformatics (genetic data), time series.
Just as machine learning is not all of AI, deep learning is not all of machine learning. Where data is tabular and there are fewer samples, deep learning is often not the best modelling technique – you'll find many Kaggle competition winners use other techniques, such as gradient-boosted decision trees. And sometimes, where relationships are less complex, a simple linear or logistic regression will do just as well.
Again – to re-iterate – this is just a tool in the data science job
Finally a data scientist is a business analyst. Over the next few slides we’ll see why this is so important to the role of a data scientist.
Let’s take a simple example – We’ve been brought in to a predictive maintenance engagement and have been given data to go and train a model with no real context.
We go away and train a classifier model that’s 99% accurate – happy days, it all looks good, we deploy it but the customer is not happy.
Given some business context – we find out that machinery failure costs £500,000 but maintenance costs £1,000.
We predicted 4 cases in which maintenance of a machine was required when it actually wasn't – at £1,000 each, that's a cost of £4,000.
However, we also predicted 2 cases in which maintenance of a machine wasn't required when it actually was – at £500,000 each, that's a cost of £1,000,000.
So now let’s take a look at how a data scientist could have dealt with this given this business context.
Our prediction is "yes, the machine needs maintenance" in the upper light blue rectangle, and "no, it doesn't" in the lower light red rectangle.
The shape of the markers indicates whether the machine actually needs maintenance or not, blue circles indicate that actually the machine does need maintenance and red crosses mean it doesn’t.
(Transition 1) Predicting “No” but actually needing maintenance is *very* expensive.
We have 2 options
Option 1 – We can change our model or parameters and re-train. In this example, the sigmoid shown is shifted to the left, which moves a number of middling values up into the "yes" rectangle.
Option 2 – We can change our threshold, or "decision boundary". Before, our decision boundary was at ~50%; shifting it down also moves those middling values into the "yes" rectangle.
Doing either of these has the effect of removing our false negative but introducing more false positives
Making sure we don’t miss required maintenance here has reduced the cost by nearly £1,000,000.
This is a simple and extreme example, but the point stands: data scientists need to tie technical decision-making to an understanding of business value.
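That trade-off can be sketched in a few lines: compare the total cost of misclassification at two thresholds, using the talk's £1,000/£500,000 costs. The prediction scores and labels below are made up for illustration.

```python
# Costs from the talk's example.
COST_FP = 1_000      # unnecessary maintenance (false positive)
COST_FN = 500_000    # missed machinery failure (false negative)

# (predicted probability of "needs maintenance", actual outcome)
# These scores and labels are invented for illustration.
samples = [
    (0.95, 1), (0.90, 1), (0.40, 1), (0.35, 1),   # truly need maintenance
    (0.45, 0), (0.30, 0), (0.10, 0), (0.05, 0),   # truly fine
]

def total_cost(threshold):
    """Total £ cost of mistakes when predicting 'yes' above `threshold`."""
    cost = 0
    for p, actual in samples:
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and actual == 0:
            cost += COST_FP
        elif predicted == 0 and actual == 1:
            cost += COST_FN
    return cost

print(total_cost(0.5))   # default threshold: misses two real failures
print(total_cost(0.25))  # lower threshold: extra false alarms, far cheaper
```

At a 50% threshold the two middling true failures are missed; dropping the threshold converts them into (cheap) false alarms, exactly as described on the slide.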
The previous example was a classification – but how about a regression, where we're predicting continuous variables and want to look at predictive maintenance from the perspective of remaining useful life?
Let’s look at a simple linear regression. Say we want to look at the performance of machines over time and we are just given time and performance metrics.
We come to the conclusion that the machines’ performance improves over time. Again our customer is not so happy.
(Transition 1) Now we do some business analysis, and find out that actually there are different groups of machines represented from within this data and our original conclusion was wrong.
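This is Simpson's paradox, and it's easy to reproduce. A minimal sketch with invented per-group readings: each machine group degrades over time, but the pooled regression slope comes out positive because the newer group starts from a higher baseline.

```python
def slope(points):
    """Ordinary-least-squares slope of y on x for (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

# Hypothetical performance readings over time for two machine groups.
# Group B (newer machines, measured later) has a higher baseline.
group_a = [(0, 10.0), (1, 9.0), (2, 8.0)]
group_b = [(3, 20.0), (4, 19.0), (5, 18.0)]

print(slope(group_a + group_b))        # pooled: positive -> "improving"?
print(slope(group_a), slope(group_b))  # per group: negative -> degrading
```

Without the group variable, the pooled fit gives exactly the wrong conclusion – which is why the business analysis that reveals the grouping matters.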
Taken together, this shows that understanding the business context in which the data resides, is incredibly important to doing data science and without it we can cause more harm than good. It’s so important in data science projects to have data scientists engaged early on to ensure things like this don’t happen and we lose trust.
So a data scientist is not just some subset of these roles, a data scientist encompasses all 3 roles.
All of these people are data scientists and all of them will have significant overlap in skills with the others.
Now that we know who a data scientist is, what their tools are and what their objective is, let's talk about data.
No “Data Science 101” talk would be complete without a discussion about data.
These are possibly the 2 questions we get asked most:
- What data do you need?
- How much data do you need?
As consultants, our diplomatic answer is always – it depends.
This question – what data do you need? - is highly use case dependent and requires BA Workshops.
Cast your mind back a few slides to the regression example: the slope looked positive without information on the machine groups, then became negative once we had it. Without business analysis we wouldn't know that we needed that group data. This is why it's so important to have data scientists on board so early in the engagement, so we can cover off some of these requirements.
So, keeping in mind that the aim of machine learning is to mimic human behaviour: if something is tough for an SME to identify or predict, it is most likely hard for an ML model to learn. There are, of course, as with everything, exceptions to this rule.
Exceptions to the rule – is this consulting or is this a research task that we should be taking on?
How much data do you need?
Commonly people will just think about volume – they want a number: "I can get you 100 data points and that will all be fine."
However, we need to get people out of that thinking – we need to combine volume, with veracity (reliability) and variety.
The data sample we’re given needs to be representative of the population and only with all 3 will we get this.
Let’s take a computer vision problem - The computer vision examples almost always have cats.
How do we know this is a cat?
We have 140 million interconnected neurons in our primary visual cortex
With 10s of billions of connections between them
That’s just V1, we have v2, v3, v4 and v5 for fine tuning
We are very good at learning pattern recognition as a result
This is an easy task for us but incredibly difficult for a computer
Computers just see pixel values
We can’t just write a program to recognise cats – Too many options to encode and too many edge cases
So we use machine learning to recognize patterns.
I won’t go into the details of convolutional neural networks but as you can imagine, the transformation to go from these numbers here to the label of a cat is highly non-linear and complex. There’s no y = mx + c linear regression here.
This kind of pattern recognition needs many, many examples to determine how to get from a picture of a cat to the label "cat".
The InceptionV3 image recognition model from Google has just shy of 24 million parameters it needs to get right
Image data might involve more complex transformations, but even with tabular data we still need enough samples to reason about the distributions of the data.
Let’s say we’re trying to classify our blue circles and red crosses. And we have a new sample – green question mark.
(Transition 1) This data alone could be a sample from any number of distributions. Any of these decision boundaries might be valid – we need more volume to get a better idea of the distribution.
(Transition 2) When we have more data, we see that all of these distributions include those first 6 points, yet they are wildly different: the top example would classify this as a red cross, but the next 2 would make it a blue circle.
If you feed a model with incorrect or unreliable data, the results you get will be unreliable or incorrect.
Depending on the accuracy you need, for each incorrect data point you feed in, you might need 5, 10, 50, 100… similar correct examples to drown out its effects.
Let’s train a model on these pictures of goats and these pictures of horses (I didn’t use cats!)
(Transition 1) Now, given this picture of a horse, what do we think our model would predict? It's never seen a side profile of a horse – it's got 4 legs and a relatively rectangular body – it's a goat.
We could have hundreds of thousands of images like this, but we will still predict this is a goat.
You may also have heard about the model that distinguished dogs from wolves not by the animal itself but by the background: if there was snow, it was a wolf.
Sample training data needs to be representative of the potential scoring population. If you’re doing crack detection for example, it’s better to have a thousand different images of cracks than a thousand images of the same crack.
There is another conversation on how we go about getting this labelled data but it’s perhaps a topic for another day.
This isn’t just true of image data – let’s take a look at some more tabular data. We have two classes of data – represented by blue circles and red crosses. We have a new sample represented by a green question mark and we want to know how to classify this.
(Transition 1) If we have a lot of data but all of it comes from the same two clusters, we may get a separation that looks something like this – the kind of boundary an SVC might learn to classify these.
But notice that the green question mark is not near either of these clusters – we’ve trained a model the best we can based on the knowledge we have of this training data but this training data is clearly not representative of the population.
Now we have sparser data – far fewer points. They're actually just a different sample from the same population, but this sample is a better representation of that population.
(Transition 1) Now we’re a little more certain about where this green question mark should be placed.
Although previously it would have been in the blue area, now it’s on the red side of our decision boundary.
So variety of data is just as, if not more, important as volume of data.
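The same idea can be sketched with a 1-nearest-neighbour classifier standing in for the slide's SVC (a simpler model, chosen so the example fits in a few lines; all points are invented):

```python
import math

def nearest_label(train, query):
    """1-nearest-neighbour label: a simple stand-in for the slide's SVC."""
    return min(train, key=lambda p: math.dist(p[0], query))[1]

query = (4.0, 5.0)  # the green question mark

# Lots of data, but all from two tight clusters far from the query.
clustered = [((0 + dx, 0 + dy), "blue") for dx in (0, .1, .2) for dy in (0, .1)] \
          + [((10 + dx, 0 + dy), "red") for dx in (0, .1, .2) for dy in (0, .1)]

# Far fewer points, but spread across the space the model will actually see.
representative = [((1, 1), "blue"), ((2, 8), "blue"),
                  ((8, 2), "red"), ((6, 6), "red"), ((9, 9), "red")]

print(nearest_label(clustered, query))       # decided by distant clusters
print(nearest_label(representative, query))  # decided by nearby evidence
```

The clustered training set, despite having more points, classifies the query from far-away evidence; the smaller but more varied sample puts relevant points near the query and flips the answer.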
As part of this discussion about the types of data required and why our samples need to be representative of the population, we also need to talk about AI ethics and responsibility – which ultimately falls on the data scientists who design these systems.
Bias has different meanings in ML and stats – but here I’m talking about bias in which a model is skewed based on the data it is fed relative to accepted legal or moral principles.
If you train a model to determine best candidates based on your historical successful hires, but all you’ve hired in the past is men, that model is going to be skewed to hiring men. This is an example where your training sample isn’t necessarily representative of the population.
Similarly if you train a model mostly on data you’ve collected from white men, it’s going to perform better on white men. Again, you need a representative sample of the population.
The Facebook example here is similar but also highlights another key issue: even if you remove the explicit labels that indicate protected classes like race, sex or religion, you need to consider proxies that might indicate those classes – hormone levels in medical records, certain words in CVs, sports activities or postcodes.
There are a number of techniques for reducing this kind of bias in models – including pre-processing, in-processing and post-processing algorithms – and class-balancing algorithms like SMOTE can help augment under-represented classes. Ultimately, though, I think the best way for data scientists to tackle this kind of bias is to have a good understanding of the domain and of the data they are using, to ensure they are not discriminating against any class.
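As an illustration of class balancing, here is naive random oversampling – a deliberately simplified stand-in for SMOTE, which synthesises new minority points by interpolating between neighbours rather than duplicating rows. The hiring dataset is invented.

```python
import random

def oversample_minority(rows, label_of, seed=0):
    """Duplicate minority-class rows at random until all classes are the
    same size. A much simpler stand-in for SMOTE."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical historical-hires dataset: 9 past hires are men, 1 is a woman.
data = [("cv%d" % i, "M") for i in range(9)] + [("cv9", "F")]
balanced = oversample_minority(data, label_of=lambda row: row[1])
print(sum(1 for _, g in balanced if g == "F"), "F vs",
      sum(1 for _, g in balanced if g == "M"), "M")
```

Balancing alone doesn't remove proxy signals, which is why the domain understanding discussed above still matters.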
The final thing I want to discuss is MLOps as it’s one of the most important aspects of a data scientist’s role.
Whilst experimentation, proofs of concept and proofs of value are important, value realisation for our customers is only possible through continuous delivery.
For successful value realisation, data science solutions need to be integrated with people, process and products.
Historically there has been a disconnect between data scientists and other developers, in which a model is made, perhaps through data science experimentation and then thrown over a wall to developers to deploy.
Model requirements in this manner are often poorly understood by developers and changes can be difficult, resulting in model drift.
Modern data science delivery integrates machine learning and DevOps, using tools designed for continuous integration and continuous delivery.
Although the goal is to enable each of the technical delivery tasks shown in the top right here, there is a focus on people, process and products in order to make this a success.
(Transition 1) People: Project managers, architects, data engineers, data scientists, developers, and testers should all be involved in a use case from an early stage. There is not a specific KPI for data scientists such as an RMSE value, and a different goal for developers such as API response times – the whole team shares common goals.
(Transition 2) Process: We follow agile principles – short sprints of 2 or 3 weeks, tracking story points and sprint burndown to inform planning for the next sprints, plus retrospectives to feed back to each other what's going well, what's not going well, and what actions we can take to maintain velocity.
(Transition 3) Products: We should use products that enhance our productivity, not products shoehorned in because they are an individual’s favourite tool. Knowledge sharing through wikis and Teams is encouraged across the team so that we can make sure everyone can contribute to a range of tasks.
MLOps is the integration of machine learning into DevOps processes and, as we see on screen, it is the ability to continuously integrate, automatically test, build, deploy and monitor Machine Learning artifacts such as Data & Training pipelines and models.
The aim of data science modelling is not necessarily to be right today but to be less wrong each day through an iterative feedback cycle. Monitoring models and, for those that have done scrum principles or operations management training, the principles of Kaizen (a Japanese term that means “continuous improvement”) are therefore of paramount importance.
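Monitoring for the data drift mentioned earlier can be as simple as comparing the distribution of live model inputs or scores against the training-time distribution. One common check – the Population Stability Index, offered here as an illustration rather than something the talk prescribes, with invented scores and a conventional rule-of-thumb threshold:

```python
import math

def psi(expected, actual, bins=4, lo=0.0, hi=1.0):
    """Population Stability Index between two samples of a bounded score.
    Rule of thumb (an assumption, not from the talk): PSI > 0.2 suggests
    the live distribution has drifted from the training distribution."""
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        return [max(c / len(xs), 1e-6) for c in counts]  # avoid log(0)
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]       # training time
live_same    = [0.12, 0.22, 0.32, 0.42, 0.52, 0.62, 0.72, 0.82]  # stable
live_shifted = [0.7, 0.75, 0.8, 0.85, 0.9, 0.9, 0.95, 0.95]      # drifted up

print(psi(train_scores, live_same))     # near zero: distribution stable
print(psi(train_scores, live_shifted))  # large: investigate drift
```

Running a check like this on a schedule, and retraining when it trips, is one concrete way models become "less wrong each day".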
Data science isn’t something that happens in a silo. This is a team effort among a team that share common goals.
Not all projects are going to need all of these resources but all projects require the concerted effort of a team to make them a success.
We want to work with others in order to make sure our engagements are successful.