In this presentation, Microsoft data scientists Ben Keen and Shahzia Holtom give an introduction to data science, covering:
- What is a data scientist?
- What data does a data scientist need?
- AI ethics and responsibility
- What is MLOps and how does it drive value?
6. Scientist
- Strong understanding of scientific method & hypothesis testing
- Asks clarifying questions and remains sceptical and objective
- Strong critical thinking, root cause analysis, and research skills
- Bases decisions on data and statistical analysis
9. Tools for the job
Statistics: Bayesian Statistics, Probability Theory, Continuous Distributions, Hypothesis Testing, χ² tests, PMCC/Spearman's Rank, Covariance, Skewness/Kurtosis, Monte Carlo Methods
Machine Learning: Scikit-learn, statsmodels, scipy, PyTorch, TensorFlow, Keras, XGBoost, spark.ml, ONNX, Word2Vec, ISOMAP, NLTK, spaCy, Gensim, OpenCV, LibROSA, lifetimes, Weka, SPSS, MLFlow, Azure Machine Learning, Azure Cognitive Services, Bonsai, Jupyter
Data Storage: Data Lakes, Azure Storage, SQL Server, MySQL, PostgreSQL, Oracle DB, SQLite, Azure Data Warehouse, HDFS, MongoDB, Azure Cosmos DB, Neo4j, Cassandra
Data Processing: Spark/Databricks, Azure Data Factory, Airflow, NiFi, Kafka, Azure Event Hub, Azure Service Bus, Hadoop, Flink, Logstash/Elasticsearch, Docker, Swarm, Kubernetes
Data Manipulation: pandas, NumPy, dplyr, PIL, ScraPy/BS4, pytz
Programming: Python, R, Scala, SAS, MATLAB, Julia, JavaScript, Node.js, Perl, VBA, M, DAX, Powershell, Bash
Visualisation: matplotlib, seaborn, ggplot2, plotly, Bokeh, Dash, Power BI, Tableau, Excel, D3.js, Highcharts, Chart.js, CanvasJS, React-vis, sankeymatic
Optimisation: Gradient Descent, Least Squares, Differential Calculus, Mixed-integer linear programming, Simulated Annealing, Evolutionary Algorithms, Ant Colony Optimisation, Particle Swarm Optimisation, Dijkstra
(Nested diagram: Deep Learning ⊂ Machine Learning ⊂ Artificial Intelligence)
Artificial Intelligence: the ability for machines to mimic human behaviours. See "Computing Machinery and Intelligence", Turing, 1950.
Machine Learning: the application of mathematical and statistical techniques that learn parameters from data rather than being explicitly programmed.
Deep Learning: a subset of machine learning in which neural networks with many layers are used to learn highly non-linear relationships from large amounts of training data.
10. What makes a Data Scientist?
Scientist · Engineer · Business Analyst
17. What makes a Data Scientist?
Scientist · Engineer · Business Analyst
18. Types of Data Scientist
- ML Engineer: operationalisation of models; focus on MLOps, automated tests, CI/CD, ETL
- Applied DS: focus on A/B testing, modelling and experimentation; a view to contributing to a product; uses tried & tested techniques
- Research DS: experimentation with a view to expanding community knowledge and understanding of algorithms; uses novel techniques
- Full Stack DS: generalist; works across modelling, ETL, operationalisation and app development; may be less focused on depth of modelling understanding
- Data Vis. Expert: focus on storytelling with data; a wizard with graphing libraries, including D3.js
30. Value realization is only possible through Continuous Delivery
MLOps: Data Science solutions need to be integrated with People, Process and Products.
Pilot · PoC · Experiment · PoV · MVP
31. Traditional DS Delivery: the "Wall of Confusion"
Data science: "I have a model for you…" / Ops: "How do I deploy, manage, monitor…?"
Data Science Ops challenges: data drift, model decay, stale models, concept drift
32. What is DevOps?
"DevOps is the union of people, process, and products to enable continuous delivery of value."
Continuous delivery cycle: Plan & Track → Develop → Build & Test → Deploy → Operate → Monitor & Learn
People:
- Collaborate early and often
- Cross-disciplinary teams
- Share common goals and metrics
- Shared responsibility
Process:
- Agile principles
- Streamline feedback
- Delivering value faster
Products
33. MLOps
The ability to continuously integrate, automatically test, build, deploy and monitor Machine Learning artifacts such as Data & Training pipelines and models.
34. Data Science is a Team Effort
Architects
Change Management
Data Engineers
Data Scientists
Project Management
App Developers
UX Designers
35. Conclusions
Data Scientists are Scientists, Engineers and Business Analysts
We work best with data of high volume, veracity and variety
We need to keep in mind ethical considerations and act responsibly
when designing systems
MLOps is paramount for delivering customer value
Data Science is a team effort
Data Science is about turning data into impact
aka.ms/benkeen is a short URL to my LinkedIn profile
So today we’re going to cover a range of topics in our Data Science 101 Masterclass.
We’ll start with what a data scientist is, what the job entails and what should be expected of a data scientist.
Then we'll talk about data. The Economist described data as "the new oil" in 2017 – that's open to debate, and I'm not sure I agree – but here we'll look at a couple of questions about data that data scientists face most commonly.
We’ll touch a bit on AI Ethics and Responsibility which, of course, could be an entire masterclass in itself.
Finally we'll cover an important emerging topic in data science – MLOps for operationalising data science.
Data science is not about making awesome visualisations, complicated models, or writing lots of code.
Data scientists turn data into impact
The job is to solve real problems using data
First and foremost, data scientists are scientists – It’s right there in the name. What does this mean? [See next slide]
I don’t come from a mathematics or computer science background myself - my PhD is in molecular genetics – but the skills required of a scientist are the same, whether I’m creating a predictive model as a data scientist now or I’m doing X-ray crystallography on proteins as a biochemist in my old life.
[Read Slide Text]
Data Scientists are also engineers – they design systems to fulfil functional objectives.
In the next couple of slides, we’ll explore the tools data scientists use to do this.
These are the tools for the job – just as with any other type of engineer, good data scientists will know when to use which tools for which tasks and not shoehorn things in just because they like them. These are the tools data scientists use to create the systems that fulfil those functional objectives. As a side note, this is not an exhaustive list.
No single data scientist knows all of this in-depth and we’ll come back to that in a little bit (See types of data scientist).
In the next slide I’m going to focus a little more down into this machine learning section as it’s probably the one that’s most associated with data science.
Artificial intelligence is the ability for machines to mimic human behaviours – including things like recognition, behaviour, reasoning etc.
Machine learning is a subset of artificial intelligence. Machine learning is the application of mathematical and statistical techniques that learn parameters from data.
Artificial intelligence is not just machine learning. AI also includes rules-based systems, in which knowledge from domain experts is explicitly coded as rules. These can be powerful – classical computer vision and NLP syntax and semantics are examples – and they can often be combined with machine learning techniques.
Deep learning is a subset of machine learning, in which we use neural nets with many hidden layers to learn highly non-linear relationships. Deep learning is often used for things like computer vision, natural language processing, speech/sound recognition, bioinformatics (genetic data), time series.
Just as machine learning is not all of AI, deep learning is not all of machine learning. Where data is tabular and there are fewer samples, deep learning is often not the best modelling technique – you'll find many Kaggle competition winners use other techniques, such as gradient-boosted decision trees. And sometimes, where relationships are less complex, a simple linear or logistic regression will do just as well.
Again – to re-iterate – this is just a tool in the data science job
Finally a data scientist is a business analyst. Over the next few slides we’ll see why this is so important to the role of a data scientist.
Let’s take a simple example – We’ve been brought in to a predictive maintenance engagement and have been given data to go and train a model with no real context.
We go away and train a classifier model that’s 99% accurate – happy days, it all looks good, we deploy it but the customer is not happy.
Given some business context – we find out that machinery failure costs £500,000 but maintenance costs £1,000.
We predicted 4 cases in which maintenance of a machine was required when it actually wasn't – at £1,000 each, that's a cost of £4,000.
However, we also predicted 2 cases in which maintenance of a machine wasn't required when it actually was – at £500,000 each, that's a cost of £1,000,000.
So now let’s take a look at how a data scientist could have dealt with this given this business context.
Our prediction is "yes, the machine needs maintenance" in the upper light blue rectangle, and "no, it doesn't" in the lower light red rectangle.
The shape of the markers indicates whether the machine actually needs maintenance or not, blue circles indicate that actually the machine does need maintenance and red crosses mean it doesn’t.
(Transition 1) Predicting “No” but actually needing maintenance is *very* expensive.
We have 2 options
Option 1 – We can change our model or parameters and re-train. In this example, the sigmoid shown is shifted to the left, which moves a number of middling values up into the "yes" rectangle.
Option 2 – We can change our threshold, or "decision boundary". Before, our decision boundary was at ~50%; shifting it down also moves those middling values into the "yes" rectangle.
Doing either of these has the effect of removing our false negative but introducing more false positives
Making sure we don’t miss required maintenance here has reduced the cost by nearly £1,000,000.
This is a simple and extreme example, but the point stands: data scientists need to tie technical decision-making to an understanding of business value.
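That trade-off can be sketched in a few lines: compare the total cost of misclassification at two thresholds, using the talk's £1,000/£500,000 costs. The prediction scores and labels below are made up for illustration.

```python
# Costs from the talk's example.
COST_FP = 1_000      # unnecessary maintenance (false positive)
COST_FN = 500_000    # missed machinery failure (false negative)

# (predicted probability of "needs maintenance", actual outcome)
# These scores and labels are invented for illustration.
samples = [
    (0.95, 1), (0.90, 1), (0.40, 1), (0.35, 1),   # truly need maintenance
    (0.45, 0), (0.30, 0), (0.10, 0), (0.05, 0),   # truly fine
]

def total_cost(threshold):
    """Total £ cost of mistakes when predicting 'yes' above `threshold`."""
    cost = 0
    for p, actual in samples:
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and actual == 0:
            cost += COST_FP
        elif predicted == 0 and actual == 1:
            cost += COST_FN
    return cost

print(total_cost(0.5))   # default threshold: misses two real failures
print(total_cost(0.25))  # lower threshold: extra false alarms, far cheaper
```

At a 50% threshold the two middling true failures are missed; dropping the threshold converts them into (cheap) false alarms, exactly as described on the slide.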
The previous example was a classification – but how about a regression, where we're predicting continuous variables and want to look at predictive maintenance from the perspective of remaining useful life?
Let’s look at a simple linear regression. Say we want to look at the performance of machines over time and we are just given time and performance metrics.
We come to the conclusion that the machines’ performance improves over time. Again our customer is not so happy.
(Transition 1) Now we do some business analysis, and find out that actually there are different groups of machines represented from within this data and our original conclusion was wrong.
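This is Simpson's paradox, and it's easy to reproduce. A minimal sketch with invented per-group readings: each machine group degrades over time, but the pooled regression slope comes out positive because the newer group starts from a higher baseline.

```python
def slope(points):
    """Ordinary-least-squares slope of y on x for (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

# Hypothetical performance readings over time for two machine groups.
# Group B (newer machines, measured later) has a higher baseline.
group_a = [(0, 10.0), (1, 9.0), (2, 8.0)]
group_b = [(3, 20.0), (4, 19.0), (5, 18.0)]

print(slope(group_a + group_b))        # pooled: positive -> "improving"?
print(slope(group_a), slope(group_b))  # per group: negative -> degrading
```

Without the group variable, the pooled fit gives exactly the wrong conclusion – which is why the business analysis that reveals the grouping matters.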
Taken together, this shows that understanding the business context in which the data resides, is incredibly important to doing data science and without it we can cause more harm than good. It’s so important in data science projects to have data scientists engaged early on to ensure things like this don’t happen and we lose trust.
So a data scientist is not just some subset of these roles, a data scientist encompasses all 3 roles.
All of these people are data scientists and all of them will have significant overlap in skills with the others.
Now that we know who a data scientist is, what their tools are and what their objective is, let's talk about data.
No “Data Science 101” talk would be complete without a discussion about data.
These are possibly the 2 questions we get asked most:
- What data do you need?
- How much data do you need?
As consultants, our diplomatic answer is always – it depends.
This question – what data do you need? - is highly use case dependent and requires BA Workshops.
Cast your mind back a few slides to the regression example: the slope looked positive without information on the machine groups, then became negative once we had it. Without business analysis we wouldn't know that we needed that group data. This is why it's so important to have data scientists on board so early in the engagement, so we can cover off some of these requirements.
So, keeping in mind that the aim of machine learning is to mimic human behaviour: if something is tough for an SME to identify or predict, it is most likely hard for an ML model to learn. There are, of course, as with everything, exceptions to this rule.
Exceptions to the rule – is this consulting or is this a research task that we should be taking on?
How much data do you need?
Commonly people will just think about volume – they want a number: "I can get you 100 data points and that will all be fine."
However, we need to get people out of that thinking – we need to combine volume, with veracity (reliability) and variety.
The data sample we’re given needs to be representative of the population and only with all 3 will we get this.
Let’s take a computer vision problem - The computer vision examples almost always have cats.
How do we know this is a cat?
We have 140 million interconnected neurons in our primary visual cortex
With 10s of billions of connections between them
That’s just V1, we have v2, v3, v4 and v5 for fine tuning
We are very good at learning pattern recognition as a result
This is an easy task for us but incredibly difficult for a computer
Computers just see pixel values
We can’t just write a program to recognise cats – Too many options to encode and too many edge cases
So we use machine learning to recognize patterns.
I won’t go into the details of convolutional neural networks but as you can imagine, the transformation to go from these numbers here to the label of a cat is highly non-linear and complex. There’s no y = mx + c linear regression here.
This kind of pattern recognition needs many, many examples to determine how to get from a picture of a cat to the label "cat".
The InceptionV3 image recognition model from Google has just shy of 24 million parameters it needs to get right
Image data might involve more complex transformations, but even with tabular data we still need enough samples to reason about the distributions of the data.
Let’s say we’re trying to classify our blue circles and red crosses. And we have a new sample – green question mark.
(Transition 1) This data alone could be a sample from any number of distributions. Any of these decision boundaries might be valid – we need more volume to get a better idea of the distribution.
(Transition 2) When we have more data, we see that all of these distributions include those first 6 points, yet they are wildly different: the top example would classify this as a red cross, but the next 2 would make it a blue circle.
If you feed a model with incorrect or unreliable data, the results you get will be unreliable or incorrect.
Depending on the accuracy you need, for each incorrect data point you feed in, you might need 5, 10, 50, 100… similar correct examples to drown out its effects.
Let’s train a model on these pictures of goats and these pictures of horses (I didn’t use cats!)
(Transition 1) Now, given this picture of a horse, what do we think our model would predict? It's never seen a side profile of a horse – it's got 4 legs and a relatively rectangular body – it's a goat.
We could have hundreds of thousands of images like this, but we will still predict this is a goat.
You may also have heard about the model that distinguished dogs from wolves not by the animal itself but by the background: if there was snow, it was a wolf.
Sample training data needs to be representative of the potential scoring population. If you’re doing crack detection for example, it’s better to have a thousand different images of cracks than a thousand images of the same crack.
There is another conversation on how we go about getting this labelled data but it’s perhaps a topic for another day.
This isn’t just true of image data – let’s take a look at some more tabular data. We have two classes of data – represented by blue circles and red crosses. We have a new sample represented by a green question mark and we want to know how to classify this.
(Transition 1) If we have a lot of data but all of it comes from the same two clusters, we may get a separation that looks something like this – the kind of boundary an SVC might learn to classify these.
But notice that the green question mark is not near either of these clusters – we’ve trained a model the best we can based on the knowledge we have of this training data but this training data is clearly not representative of the population.
Now we have sparser data – far fewer points. They're actually just a different sample from the same population, but this sample is a better representation of that population.
(Transition 1) Now we’re a little more certain about where this green question mark should be placed.
Although previously it would have been in the blue area, now it’s on the red side of our decision boundary.
So variety of data is just as, if not more, important as volume of data.
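The same idea can be sketched with a 1-nearest-neighbour classifier standing in for the slide's SVC (a simpler model, chosen so the example fits in a few lines; all points are invented):

```python
import math

def nearest_label(train, query):
    """1-nearest-neighbour label: a simple stand-in for the slide's SVC."""
    return min(train, key=lambda p: math.dist(p[0], query))[1]

query = (4.0, 5.0)  # the green question mark

# Lots of data, but all from two tight clusters far from the query.
clustered = [((0 + dx, 0 + dy), "blue") for dx in (0, .1, .2) for dy in (0, .1)] \
          + [((10 + dx, 0 + dy), "red") for dx in (0, .1, .2) for dy in (0, .1)]

# Far fewer points, but spread across the space the model will actually see.
representative = [((1, 1), "blue"), ((2, 8), "blue"),
                  ((8, 2), "red"), ((6, 6), "red"), ((9, 9), "red")]

print(nearest_label(clustered, query))       # decided by distant clusters
print(nearest_label(representative, query))  # decided by nearby evidence
```

The clustered training set, despite having more points, classifies the query from far-away evidence; the smaller but more varied sample puts relevant points near the query and flips the answer.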
As part of this discussion about the types of data required and why our samples need to be representative of the population, we also need to talk about AI ethics and responsibility – which ultimately falls on the data scientists who design these systems.
Bias has different meanings in ML and stats – but here I’m talking about bias in which a model is skewed based on the data it is fed relative to accepted legal or moral principles.
If you train a model to determine best candidates based on your historical successful hires, but all you’ve hired in the past is men, that model is going to be skewed to hiring men. This is an example where your training sample isn’t necessarily representative of the population.
Similarly if you train a model mostly on data you’ve collected from white men, it’s going to perform better on white men. Again, you need a representative sample of the population.
The Facebook example here is similar but also highlights another key issue: even if you remove the explicit labels that indicate protected classes like race, sex or religion, you need to consider proxies that might indicate those classes – hormone levels in medical records, certain words in CVs, sports activities or postcodes.
There are a number of techniques for reducing this kind of bias in models – including pre-processing, in-processing and post-processing algorithms – and class-balancing algorithms like SMOTE can help augment under-represented classes. Ultimately, though, I think the best way for data scientists to tackle this kind of bias is to have a good understanding of the domain and of the data they are using, to ensure they are not discriminating against any class.
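As an illustration of class balancing, here is naive random oversampling – a deliberately simplified stand-in for SMOTE, which synthesises new minority points by interpolating between neighbours rather than duplicating rows. The hiring dataset is invented.

```python
import random

def oversample_minority(rows, label_of, seed=0):
    """Duplicate minority-class rows at random until all classes are the
    same size. A much simpler stand-in for SMOTE."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical historical-hires dataset: 9 past hires are men, 1 is a woman.
data = [("cv%d" % i, "M") for i in range(9)] + [("cv9", "F")]
balanced = oversample_minority(data, label_of=lambda row: row[1])
print(sum(1 for _, g in balanced if g == "F"), "F vs",
      sum(1 for _, g in balanced if g == "M"), "M")
```

Balancing alone doesn't remove proxy signals, which is why the domain understanding discussed above still matters.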
The final thing I want to discuss is MLOps as it’s one of the most important aspects of a data scientist’s role.
Whilst experimentation, proofs of concept and proofs of value are important, value realisation for our customers is only possible through continuous delivery.
For successful value realisation, data science solutions need to be integrated with people, process and products.
Historically there has been a disconnect between data scientists and other developers, in which a model is made, perhaps through data science experimentation and then thrown over a wall to developers to deploy.
Model requirements in this manner are often poorly understood by developers and changes can be difficult, resulting in model drift.
Modern data science delivery integrates machine learning and DevOps, using tools designed for continuous integration and continuous delivery.
Although the goal is to enable each of the technical delivery tasks shown in the top right here, there is a focus on people, process and products in order to make this a success.
(Transition 1) People: Project managers, architects, data engineers, data scientists, developers, and testers should all be involved in a use case from an early stage. There is not a specific KPI for data scientists such as an RMSE value, and a different goal for developers such as API response times – the whole team shares common goals.
(Transition 2) Process: We follow agile principles – short sprints of 2 or 3 weeks, tracking story points and sprint burndown to inform planning for the next sprints, plus retrospectives to feed back to each other what's going well, what's not going well, and what actions we can take to maintain velocity.
(Transition 3) Products: We should use products that enhance our productivity, not products shoehorned in because they are an individual’s favourite tool. Knowledge sharing through wikis and Teams is encouraged across the team so that we can make sure everyone can contribute to a range of tasks.
MLOps is the integration of machine learning into DevOps processes and, as we see on screen, it is the ability to continuously integrate, automatically test, build, deploy and monitor Machine Learning artifacts such as Data & Training pipelines and models.
The aim of data science modelling is not necessarily to be right today but to be less wrong each day through an iterative feedback cycle. Monitoring models and, for those that have done scrum principles or operations management training, the principles of Kaizen (a Japanese term that means “continuous improvement”) are therefore of paramount importance.
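Monitoring for the data drift mentioned earlier can be as simple as comparing the distribution of live model inputs or scores against the training-time distribution. One common check – the Population Stability Index, offered here as an illustration rather than something the talk prescribes, with invented scores and a conventional rule-of-thumb threshold:

```python
import math

def psi(expected, actual, bins=4, lo=0.0, hi=1.0):
    """Population Stability Index between two samples of a bounded score.
    Rule of thumb (an assumption, not from the talk): PSI > 0.2 suggests
    the live distribution has drifted from the training distribution."""
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        return [max(c / len(xs), 1e-6) for c in counts]  # avoid log(0)
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]       # training time
live_same    = [0.12, 0.22, 0.32, 0.42, 0.52, 0.62, 0.72, 0.82]  # stable
live_shifted = [0.7, 0.75, 0.8, 0.85, 0.9, 0.9, 0.95, 0.95]      # drifted up

print(psi(train_scores, live_same))     # near zero: distribution stable
print(psi(train_scores, live_shifted))  # large: investigate drift
```

Running a check like this on a schedule, and retraining when it trips, is one concrete way models become "less wrong each day".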
Data science isn’t something that happens in a silo. This is a team effort among a team that share common goals.
Not all projects are going to need all of these resources but all projects require the concerted effort of a team to make them a success.
We want to work with others in order to make sure our engagements are successful.