Goal: explain the nature of the work of an analytics team to a manager, and enable people on those teams to explain what a data science team needs to a manager.
It seems as if every organization wants to enable analytical decision-making and embed analytics into operational processes. What can you do with analytics? It looks like anything is possible. What can you really do? Probably a lot less than you expect. Why is this? Vendors promise easy-to-use analytics tools and services, but they rarely deliver. The products may be easy, but the work is still hard.
Using analytics to solve problems depends on many factors beyond the math: people, processes, the skills of the analyst, the technology used, the data. Technology is the easy part. Figuring out what to do and how to do it is a lot harder. Despite this, fancy new tools get all the attention and budget.
People and data are the truly hard parts. People, because many believe that data is absolute rather than relative, and that analytic models produce an answer rather than a range of answers with varying degrees of truth, accuracy and applicability. Data, because managing data for analytics is a nuanced, detail-oriented and seemingly dull task left to back-office IT.
If your goal is to build a repeatable analytics capability rather than a one-off analytics project then you will need to address the parts that are rarely mentioned. This talk will explain some of the unseen and little-discussed aspects involved when building and deploying analytics.
Pay no attention to the man behind the curtain - the unseen work behind data science
1. Pay no attention to the man behind the curtain…
The unseen work behind data science and analytics
Accelerate Data Science conference
October 18, 2017
Mark Madsen
www.ThirdNature.net
@markmadsen
5. Copyright Third Nature, Inc.
So we shifted to data publishing
Industrialized data delivery for self-service access.
6. Increased data capture and BI maturity lead to more data-intensive practices, rising complexity
Pareto analysis of the share of buyers who make up 80%
of sales volume for products, in this case Coke.
Data source: CMO council
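A minimal sketch of the Pareto calculation described above, assuming all you have is a list of purchase volumes per buyer. The function name `share_of_buyers_for_volume` and the numbers are made up for illustration; they are not the CMO Council data.

```python
def share_of_buyers_for_volume(purchases, target=0.80):
    """Return the fraction of buyers who account for `target` of total volume."""
    volumes = sorted(purchases, reverse=True)  # heaviest buyers first
    total = sum(volumes)
    running = 0
    for count, volume in enumerate(volumes, start=1):
        running += volume
        if running >= target * total:
            return count / len(volumes)
    return 1.0  # only reached when the list is empty


# Hypothetical purchase volume per buyer (illustrative values only).
purchases = [500, 320, 80, 40, 20, 15, 10, 8, 5, 2]
share = share_of_buyers_for_volume(purchases)
print(f"{share:.0%} of buyers account for 80% of volume")
```

With these made-up numbers the two heaviest buyers already cover 80% of volume. A query can report that; explaining why those buyers differ is the analytics question.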
7. What makes these customers different? How does this affect a new product launch, or line extensions?
These are not the type of questions you can answer with only queries and reporting.
Data source: CMO council
8. Compounding the problem: observations, not transactions
Event data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.
10. The applied view of data science
Five basic things you can do:
▪Prediction – what is most likely to happen?
▪Estimation – what’s the future value of a variable?
▪Description – what relationships exist in the data?
▪Simulation – what could happen?
▪Prescription – what should you do?
11. Applying analytics isn’t just putting them on a screen
There are different models of use at machine and human speed
Decision-Action:
▪ Human decision support
▪ Humans moderating machine decisions
▪ Machine decisions
Monitor-Alert:
▪ Human monitoring
▪ Machine monitoring
12. THE NATURE OF THE PROBLEM FOR ORGANIZATIONS
Implementing data science is a problem of multiple perspectives
13. We don’t have an analytics problem, just like we didn’t have a BI problem
The origin of analytics as “business intelligence” was stated well in 1958:
“…the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.” ~ H. P. Luhn
“A Business Intelligence System”, http://altaplana.com/ibmrd0204H.pdf
Our goal is analytics as a capability, not a technology
14. Three constituencies
▪ Stakeholder, aka the recipient
▪ Analyst, aka the data scientist
▪ Builder, aka the engineer
15. Starting points
Many organizations choose to start with the analysts. Create a data science team. Turn them loose to find a problem.
Many more start with builders: technology solutions looking for problems, e.g. 55% of the IT-driven Hadoop and Spark projects over the last five years.
The right place to start? Stakeholders. The goal to achieve, the problem to solve.
16. NATURE OF THE PROBLEM FROM THE STAKEHOLDER’S PERSPECTIVE
Each constituency has its own set of problems to deal with
17. The myth that still drives analytics – analytic gold
All we need is a fat pipe and pans working in parallel…
18. Analytic insights that result in no action are expensive trivia.
It’s not the insight, but what you do with it, that matters
As a manager: what would you do in this situation?
19. Perennially difficult: What question do you address?
What’s possible?
How do you know what’s feasible and what isn’t? (both technically and financially)
You don’t, unless you know the data science and the business (and even then maybe not, ML makes no guarantees)
It takes domain expertise and analytic expertise and intuition - that’s why you need analysts.
20. Important questions for managers
1. What is the goal?
2. Is the goal worth achieving?
3. Do you have a clearly stated, measurable goal?
4. Do you have the data required?
If managers don’t realize this is important, they complain about analysts asking them a bunch of (obvious*) questions.
There are processes you can put in place to find problems to address, prioritize them and determine how to deploy the solutions for them.
*Not really
21. Applying analytics is not an analytics problem
Applying analytics is not in the analyst’s control. It’s not in the engineer’s control. It’s in the control of the people involved in the process.
Failures are often in execution, not in analytics development.
For example, we saw unexpectedly poor performance in a number of geographies. Was it the new analytics we tried? Was it a data problem? No, it was a simple compliance problem.
24. The nature of analytics problems is researching the unknown rather than accessing the known.
Repeat for each new problem
Diagram: Kate Matsudaira
25. Important: no two analytics projects are entirely alike
Different goals = different data, preparation, algorithm
Different algorithms have different resource consumption profiles and scaling ability.
Each requires its own custom-engineered data features
26. Starting at the start: Do you have a clearly stated, measurable goal?
27. The main hurdle: just getting the data
Do you know where to find it? Because it’s unlikely to be in the data warehouse.
Do you have access to it?
Is access fast enough? Because DWs are for QRD, not for moving huge piles of data. And ERP systems and SaaS apps are right out.
28. Do you have the right data?
Many machine learning techniques require labeled (known good) training data:
Supervised learning: a person has to define the correct output for some portion of the data. Data is divided into training sets used for model building and test sets for validating the results.
• What is spam and what isn’t?
• What does a fraudulent transaction look like?
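The training/test division for labeled data can be sketched as follows. This is only an illustration using the standard library: the helper name `split_labeled`, the 25% test fraction and the example messages are assumptions, not something from the talk.

```python
import random


def split_labeled(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into (training set, test set)."""
    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled data: a person decided which messages are spam.
labeled = [
    ("win a free prize now", "spam"),
    ("meeting moved to 3pm", "ham"),
    ("cheap pills online", "spam"),
    ("quarterly report attached", "ham"),
    ("you have been selected", "spam"),
    ("lunch tomorrow?", "ham"),
    ("claim your reward today", "spam"),
    ("notes from the planning call", "ham"),
]
train, test = split_labeled(labeled)
print(len(train), "training examples,", len(test), "held back for testing")
```

The model is built only from `train`; `test` is held back so the results can be validated on examples the model has not seen. The expensive part is the labels themselves, which someone had to supply.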
29. Do you have enough of the right data?
ML needs a lot; you may be disappointed in your own efforts
30. What does an expert analyst really do?
▪ Define the business problem
▪ Translate the problem into an analytic context
▪ Select appropriate data
▪ Learn the data
▪ Create a model set
▪ Fix problems with data
▪ Transform data
▪ Build models
▪ Assess models
▪ Deploy models
▪ Assess results
Source: Michael Berry, Data Miners Inc.
31. What does an expert analyst do?
You can’t model data for this in advance.
32. Where do analysts spend their time? Mostly data work
▪ Define the business problem
▪ Translate the problem into an analytic context
▪ Select appropriate data
▪ Learn the data
▪ Create a model set
▪ Fix problems with data
▪ Transform data
▪ Build models
▪ Assess models
▪ Deploy models
▪ Assess results
% of time spent: 70% / 30%
Source: Michael Berry, Data Miners Inc.
33. Feature engineering is the core of the process
Lots of data (as attributes) makes things harder
Lots of data (instances) makes things slow
Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are.
Cleaning up data, choosing attributes and deriving features are not so much a technical problem as a creative one.
The best way to enable data scientists is to remove data management obstacles.
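As a rough illustration of constructing features from raw data, the sketch below turns raw transaction rows into one row of derived features per customer. The record layout, the function name `customer_features` and the chosen features are all assumptions made for this example; choosing which features matter for a given goal is the creative part.

```python
from datetime import date


def customer_features(transactions, as_of):
    """Aggregate raw (customer, date, amount) rows into one feature row per customer."""
    by_customer = {}
    for customer, day, amount in transactions:
        by_customer.setdefault(customer, []).append((day, amount))

    features = {}
    for customer, rows in by_customer.items():
        amounts = [amount for _, amount in rows]
        last_day = max(day for day, _ in rows)
        features[customer] = {
            "purchase_count": len(rows),
            "total_spend": sum(amounts),
            "average_spend": sum(amounts) / len(amounts),
            "days_since_last_purchase": (as_of - last_day).days,
        }
    return features


# Hypothetical raw transactions (illustrative values only).
raw = [
    ("c1", date(2017, 9, 1), 20.0),
    ("c1", date(2017, 10, 1), 40.0),
    ("c2", date(2017, 8, 15), 10.0),
]
features = customer_features(raw, as_of=date(2017, 10, 18))
print(features["c1"])
```

The raw rows say little to a learning algorithm on their own; the derived columns (frequency, spend, recency) are the kind of features a model can use.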
34. Where do most of the analytics tools focus?
▪ Define the business problem
▪ Translate the problem into an analytic context
▪ Select appropriate data
▪ Learn the data
▪ Create a model set
▪ Fix problems with data
▪ Transform data
▪ Build models
▪ Assess models
▪ Deploy models
▪ Assess results
Source: Michael Berry, Data Miners Inc.
35. Where do most analytics-as-a-service offerings focus?
▪ Define the business problem
▪ Translate the problem into an analytic context
▪ Select appropriate data
▪ Learn the data
▪ Create a model set
▪ Fix problems with data
▪ Transform data
▪ Build models
▪ Assess models
▪ Deploy models
▪ Assess results
Source: Michael Berry, Data Miners Inc.
39. IT and Ops people want to know “what to build?”
Giant data platform? Self-service tools?
40. Analytics requires different processes and workloads
None of this analytics work is the same as what IT considered “analysis” to be, which is usually equated with BI or ad-hoc query.
Ad-hoc analysis ≠ Exploratory data analysis ≠ Batch analytics ≠ Real-time analytics
A real analytics production workflow
Hatch, CIKM ‘11
42. Things engineering and operations worry about
Engineering time and effort
▪ Introduction of new technology, complexity
▪ Integration: deployment of models requires linking different types of environments and creating supportable workflows for the analysts
▪ Ability to develop and deploy at the required speed
Supportability
▪ Automation
▪ The environment requires additional monitoring, other technology and processes, particularly for customer-facing work
▪ Support costs (time and money)
SLAs:
▪ Availability – if analytics are tied to production operations, particularly customer-facing ones, this becomes important and difficult because it’s not standard application work
▪ Performance and scalability – have to manage unpredictable workloads and resource conflicts between model development and model execution
43. The world changes, do the models?
In BI you maintain ETL and schemas, in ML you maintain models.
“Model decay” happens as the assumptions around which a model is built change, e.g. spam techniques change.
When you adjust the model you need to know it is better again
▪ Better save the data used to build the model
▪ Better save the model
▪ Baseline and measurements
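One way to read "baseline and measurements" is to score the old and the adjusted model on the same saved holdout data and compare. The sketch below assumes a simple accuracy metric and made-up predictions; in practice the metric and the saved data depend on the problem.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the known labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("need one prediction per label, and at least one label")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical saved holdout labels, kept alongside the model they were used to assess.
holdout_labels = ["spam", "ham", "spam", "spam", "ham",
                  "ham", "spam", "ham", "spam", "ham"]

# Made-up outputs of the deployed model and the adjusted model on that same holdout.
old_predictions = ["spam", "ham", "spam", "ham", "ham",
                   "ham", "ham", "ham", "ham", "ham"]
new_predictions = ["spam", "ham", "spam", "spam", "ham",
                   "ham", "spam", "ham", "spam", "spam"]

baseline = accuracy(old_predictions, holdout_labels)
candidate = accuracy(new_predictions, holdout_labels)
is_better = candidate > baseline
print(f"baseline={baseline:.2f} candidate={candidate:.2f} better={is_better}")
```

Without the saved data and the saved model there is nothing to compare against, so "better" becomes an opinion rather than a measurement.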
45. THREE PERSPECTIVES, ONE SOLUTION?
There are requirements from all constituents. You need to put them
together to have a complete picture of what’s needed.
46. The missing stakeholder
There is another stakeholder: analytics management - the CAO, CDO, VP of analytics, aka “your boss” if you’re a data scientist.
The perspective and problems of the person responsible for oversight of the team and its efforts span the organization and multiple projects.
50. Analytics solutions are interdisciplinary
Team composition is best when the skills and backgrounds are mixed.
Domain knowledge is still valuable – ignore the AI and ML hype saying that it’s all math and engineering.
Data management and engineering are a necessary part of much of this work.
51. Data scientists and engineers work from opposing directions
▪ exploration: help people ask the right questions, frame them, define measurable goals
▪ modeling: define models that run to determine answers or carry out actions
▪ integration: deliver the results / product in production, at scale
▪ applications: build data science models into applications and delivery systems
▪ infrastructure: provide the systems and practices to build and run the desired models
Diagram concept: Paco Nathan
52. Using a matrix to plan the project team
Image: Paco Nathan
53. This is a team sport, not a solo act
Image: Paco Nathan
54. We already know the craft model doesn’t scale. How do we industrialize like we did for BI?
55. There is an extensive list of requirements to support
Primary requirements needed by constituents S D E
Data catalog and ability to search it for datasets X X
Self-service access to curated data X
Self-service access to uncurated (unknown, new) data X X
Temporary storage for working with data X
Data integration, cleaning, transformation, preparation tools and environment X X
Persistent storage for source data used by production models X X
Persistent storage for training, testing, production data used by models X X
Storage and management of models X X
Deployment, monitoring, decommissioning models X
Lineage, traceability of changes made for data used by models X X
Lineage, traceability for model changes X X X
Managing baseline data / metrics for comparing model performance X X X
Managing ongoing data / metrics for tracking ongoing model performance X X X
S = stakeholder, user, D = data scientist, analyst, E = engineer, developer
56. Non-answer #1: “Innovation as Procurement”
Software vendors want to sell you one thing: high margin software.
Most assume the data is there and ready to use by their application – just load it.
Most of the work lies in data integration, cleaning and data management.
Embedding analytics in a process adds infrastructure that most organizations don’t have and can’t support. It takes new infrastructure.
57. Non-answer #2: Best Practices
“78% of high performing companies have a centralized data science team in place in their organization” – follow their lead!
This is called survivorship bias. Flipping a coin is often as effective as “Do what they did.”
The problem: you have directions to cross a minefield but no map of where to start.
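The survivorship-bias point can be made concrete with a small simulation. Everything here is an assumption for illustration: a made-up population in which 78% of all companies have a centralized team, and performance is literally a coin flip that ignores the practice.

```python
import random


def practice_rates(n_companies=100_000, practice_rate=0.78, seed=1):
    """Return (share of high performers with the practice, share of low performers with it)."""
    rng = random.Random(seed)
    counts = {True: [0, 0], False: [0, 0]}  # high performer? -> [with practice, total]
    for _ in range(n_companies):
        has_practice = rng.random() < practice_rate
        high_performer = rng.random() < 0.5  # coin flip, independent of the practice
        counts[high_performer][0] += has_practice
        counts[high_performer][1] += 1
    high_with, high_total = counts[True]
    low_with, low_total = counts[False]
    return high_with / high_total, low_with / low_total


high_rate, low_rate = practice_rates()
print(f"high performers with the practice: {high_rate:.1%}")
print(f"low performers with the practice:  {low_rate:.1%}")
```

In this simulated world roughly 78% of high performers have the practice, and so do roughly 78% of low performers. A statistic about the winners alone says nothing until you also look at everyone who did the same thing and did not win.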
58. The enterprise focus needs to be on repeatability - where it can be supported
59. Key focus for the organization: Infrastructure vs Application
Infrastructure enables value, applications deliver value.
Enable applications by pushing the reusable elements down into the platform.
The infrastructure is a hidden combination of technology, process and methods.
60. Data management is a key element of infrastructure
Multiple contexts of use, differing quality levels
You need to keep the original because, just like baking, you can’t unmake dough once it’s mixed.
61. Manage your data (or it will manage you)
Data management is where both analysts and developers are weakest.
Modern engineering practices are where data management is weakest.
You need to bridge the groups and practices in the organization if you want to make this work repeatable.
63. About the Presenter
Mark Madsen is president of Third Nature, an advisory firm focused on analytics, data and technology strategy. Mark is an award-winning author, architect and CTO who has received awards for his work from the American Productivity & Quality Center, Smithsonian Institution and industry associations.
He is an international speaker, a contributor to Forbes, and member of the O’Reilly Artificial Intelligence and Strata program committees. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
64. About Third Nature
Third Nature is an advisory firm focused on practices and technology in
analytics, information strategy, business intelligence and data management.
Our goal is to help organizations solve problems using data. We offer
education, advisory and research services to support business and IT
organizations. We also provide product-related consulting to software
vendors in the data industry.
We specialize in strategy and architecture, so we look at emerging
technologies and markets, evaluating how technologies are applied to solve
problems rather than simply comparing product features. We fill the gap
between what industry analyst firms cover and what organizations need.