The document discusses best practices for managing data science teams based on lessons learned. It outlines common pitfalls such as solving the wrong problem, having the wrong tools, or results being used incorrectly. Issues include data science being different from software development and forgetting other stakeholders. Recommendations include establishing processes for the full lifecycle from ideation to monitoring, using modular systems thinking, and defining roles like data scientists, managers, and product owners to address organizational challenges. The goal is to deliver measurable, reliable, and scalable insights.
2. What You’ll Learn Today
GOALS
What is the bar for data science teams
PITFALLS
What are common data science struggles
DIAGNOSES
Why so many of our efforts fail to deliver value
RECOMMENDATIONS
How to address these struggles with best practices
3. Lots of Legitimate
Promises
Saved $40M
In claims with predictive analytics
[Chart: Companies Mentioning ‘Artificial Intelligence’ on Earnings Calls; x-axis: quarters from Q1-08 to Q4-16; y-axis: 0 to 200]
Lots of Hype
35% of Sales
Come from product recommendations
Saved $450M
By detecting fraudulent tax returns
4. Lots of Risk of
Disappointment
This Sounds
Eerily Familiar
[Figure: hype cycle curve labeled MACHINE LEARNING; x-axis: Time; y-axis: Expectations; phases: Innovation Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]
[Figure: Relative Importance Within Enterprise over Time (1997, 2010, 2030) for Software Developers and Data Scientists]
5. What is the Goal?
Measurable
Your “quality” indicator.
Reliable
Your “hit rate.”
Scalable
Your “throughput.”
7. I SOLVED THE PROBLEM BUT…
Oops, already
solved by
someone else
It was the
wrong problem
Solved the
wrong way
Have the wrong
tools for this problem
Too slow for it
to matter
World changes
while solving
problem
Problems multiply,
can’t tackle all
at once
Results used
Wrong way
9. Data Science is Different from Software Development
• Research versus development focus
• No answer is a valid answer
• Traditional testing is insufficient given
non-deterministic nature
• No generally accepted process metrics (e.g.
story points)
• Data must be tracked
10. Forget About Other Stakeholders in the Process
For Data Scientists
• Access powerful infrastructure & preferred tools
For IT Leaders
• Ensure stability & security
• Leverage existing infrastructure
• Minimize operational burden
For Business Leaders
• Understand real-world impact
• Reliable, predictable insights
• Minimize change to existing workflows
For Data Science Managers
• Accelerate project delivery through reuse, knowledge management
• Mitigate key-man risk / accelerate onboarding
• Hire & retain top talent
14. Missing Some Key
Personnel Muscles
• The full stack data scientist is
a myth
• Gap in “soft” skills training
15. Artisan Thinking vs.
Modular System Thinking
• Limited culture of re-use and
compounding
• Not planning for future iterations
(e.g., no reproducibility /
documentation)
16. Bad Incentive Structures
• Key responsibilities fall between
gaps
• Significant information loss in
project transitions
18. Best Practices Take Many Forms
Process
Both a single project and portfolio of projects
People
Types of capabilities and org design
Technology
Flexible infrastructure and tooling without the
wild west
19. Data science system at many levels
Single
Step
Data
Exploration
Single Project
Ideation
Validation
& Review
Deployment
&
Publishing
Monitoring
& Feedback
Data
Exploration R & D
21. Managing the lifecycle
• Expect and embrace iteration
• Enable compounding collaboration
• Ensure auditability and
reproducibility, even if you’re not
regulated (yet)
22. Ideation
• Problem first, not data first
• Practice and master order of
magnitude ROI math
• Maintain repo of past work
• Create and enforce templates for
MRDs
• Maintain a stakeholder-driven
backlog
23. Artifact Selection
• Leverage rapid prototyping and
design sprint methodology
• Create multiple mock-ups of
different deliverable types
• Consider creating synthetic data
with baseline models
24. Research & Development
• Establish standard software
configurations, but give flexibility
to experiment
• Abstract away compute
provisioning
• Build simple models first
• Set a cadence for delivering
insights
• Ensure business KPI tracked
consistently over time
25. Validation
• More than just code review, get
stakeholder and IT sign-off
• Ensure reproducibility and clear
lineage
• Use automated validation checks
to support human inspection
• Preserve results (even nulls) to
central repo
WHAT INFLUENCES A RESULT?
Results
The statistical analyses selected
The R scripts that implemented the analyses
The R libraries that implement the statistical functions
The C libraries that perform the mathematical computations
The operating system running the computational framework
Reduced data
Scripts that reduce the data
Raw data
Depend on
Depend on
Depend on
Depend on
Depend on
Depend on
Depend on
Depend on
26. Delivery
• Support for many deliverable
artifacts (reports, dashboards,
apps, batch APIs, real-time APIs)
• Define a promote-to-production
workflow
• Flag upstream and downstream
dependencies
27. Monitoring
• Build ROI testing into all major
deliverables
• Require monitoring plans before
considering “done”
• Integrate with tools where people
spend most of their time (e.g.,
email / Slack)
• Anticipate risk and change
management burdens
28. Keeping all the balls
in the air
• Measure everything, including
yourself
• Focus on reducing time to iterate
• Socialize aggregate portfolio
impact
29. The many hats of data science
ROLE: Data Storyteller
PRIORITIES: Creating engaging visual and narrative journeys for analytical solutions
PITFALLS WITHOUT THEM: Low engagement and adoption from end users
ROLE: Data Product Manager
PRIORITIES: Articulating the business problem, translating to day-to-day work, ensuring ongoing engagement
PITFALLS WITHOUT THEM: Projects miss the mark, don’t translate into tangible business value
ROLE: Business Stakeholder
PRIORITIES: Vetting the prioritization and ROI, providing ongoing feedback
PITFALLS WITHOUT THEM: ROI decisions aren’t made sensibly, not knowing when to pull the plug
ROLE: Data Scientist
PRIORITIES: Generating and communicating insights, understanding the strengths and risks
PITFALLS WITHOUT THEM: Naïve or low power insights
ROLE: Data Infrastructure Engineer
PRIORITIES: Building scalable pipelines and infrastructure that make it possible to do the higher levels of needs
PITFALLS WITHOUT THEM: Insight generation is slow, because DS is spending their time doing infrastructure work
30. Organizational Design Dilemmas
• False centralization /
decentralization dichotomy
• Most evolve as they scale
and as business demands
shift
• Technology can help
bridge the gap
DECENTRALIZATION
Pros
• Deeper understanding of business processes and priorities
• Easier change management
Cons
• Less technical knowledge compounding
• Harder to codify best practices
• Risk of shadow IT
CENTRALIZATION
Pros
• Community and mentorship
• Easier transparency for managers and IT
• More passive technical knowledge sharing
Cons
• Isolation on data science island
• Loss of credibility with business
• Frustrated data scientists
31. What We Covered Today
GOALS
What is the bar for data science teams
PITFALLS
What are common data science struggles
DIAGNOSES
Why so many of our efforts fail to deliver value
RECOMMENDATIONS
How to address these struggles with best practices
Who am I?
I work at Domino Data Lab. We build a data science platform that helps organizations build a more mature data science practice. In my role, I get to work with large enterprises and small start-ups to understand how data science is changing their business. What I’m going to talk about today is largely just a synthesis of what we’ve heard over the past few years from companies that have failed hard and those that have had great success.
What you’ll learn today
What are common data science struggles
Why so many of our efforts fail to deliver value
How to address these struggles with best practices
Who is doing this well today and what are their principles
Where to focus your efforts tomorrow
Let’s start by saying something really obvious. Everyone is really excited about data science. There is lots of legitimate promise, with companies like Google, Facebook, and Amazon building defensible businesses around the breadth and quality of their models. At the same time, the pervasive hype has created risk of disappointment and disillusionment if not proactively addressed.
We believe data science is in the throes of a transition from a niche capability leveraged by a few pioneers to a core capability across every enterprise. What was once a “nice to have” has become a survival imperative. As with the evolution of software development, the tooling has advanced dramatically in recent years. But also like software development, tooling alone is not enough. The hardening of new roles (people), processes, and technology will be key to cementing data science’s position as a core function.
The goal of any data science organization should be measurable, reliable, and scalable impact on the business decisions and metrics that they are charged with improving.
Were business decisions positively changed in an observable and, ideally, quantifiable way?
If I take on five projects, I want 3-4 to deliver business value.
If my reliability is 80% with five projects and seven people, can I expand that to 50 projects and 40 people?
Opening talk track: Everyone talks about data being the problem. “Data Scientists spend 80% of their time cleaning data.” Don’t get me wrong, it’s a big pain point. But I think it’s a convenient scapegoat that distracts from some of the industry’s real problems that are more easily solvable.
Wrong problem: Over-zealous data science teams often dive straight into the data looking for “something interesting.” We’ve seen large organizations hire 30+ PhDs with no clear mandate. They then emerge from a six-week hole only to realize they had misunderstood the target variable, rendering the analysis irrelevant.
Solved by someone else: We consistently hear data scientists complain about re-inventing the wheel. Anecdotal estimates put it at 30-40% of their time in large organizations with significant amounts of prior art. In the fortunate situation where a past project can be discovered, reproducing it is often impossible given inconsistent preservation of relevant artifacts like data, packages, documentation, and intermediate results.
Wrong tools: Despite the explosion of data and tooling functionality, data scientists are still often dramatically ill-equipped to explore the full range of possible domains and solutions. Analysis is still often confined to individual laptops that are easily overwhelmed. We’ve heard of organizations where it can take 6+ months to approve a widely-utilized open source Python package for research purposes, prompting employees to bring their personal laptops and work under their desks.
Right problem/Too slow: The data scientist who spends an extra two weeks to eke out a bit more AUC on a targeting model, only to realize the marketing team’s deadline has passed.
Wrong way: For example, the team that builds a powerful predictive model for underwriters, wraps it in a standalone scoring front end and realizes the underwriters never actually click to a new tab from their existing workflow. One large insurer described it as, “We don’t fail because of the math… we fail because we don’t understand how people will use the math.”
Used inappropriately: Google describes this as the undeclared consumer problem. Results can be thrown “over the fence” and data science teams have little control or even visibility into how those results are being used. For example, someone builds a model for predicting the value of California residential mortgages but then an over-zealous banker uses it to predict the value of Florida commercial mortgages even though the original model creator knew that would be a bad idea.
World changes: Models are by definition an approximation of the real world. If you don’t keep track of how the world is changing and monitor your models’ ongoing performance, you imperil the business and likely leave value on the table. My favorite story in this space was a large financial institution that issued credit cards. They had a probability-of-default model that expected a credit score. The credit bureau changed how they report “not present in the DB” from a null to a 999. Their model didn’t account for this, so they just thought a bunch of risky people had perfect credit scores. It took weeks and millions of dollars in bad loans before they caught it.
Can’t solve 100 at once: Many teams have had early wins from their low hanging fruit. Working in a tight-knit team on a single business initiative is great. However, they start to experience negative returns to scale as their existing processes can’t cope with a swollen backlog, an influx of new hires, and heightened expectations from the business.
First, let me say that there could be a whole series of talks on this topic alone.
Data science is bigger than just data scientists. Obviously data scientists are a critical component, but there are a whole host of other stakeholders who must come along on the journey for there to be reliable wins at scale. And those stakeholders have very different backgrounds and priorities from data scientists.
Data science managers often act as the bridge to the business and are focused on the quality and pace of output. They worry about things like key man risk and the pace of onboarding.
Business leaders don’t care as much about how the sausage is made, but they need to know they can count on data science output to make better decisions without having to drastically change how they and their teams work.
IT leaders care about stability and serving their internal customers. They have KPIs like uptime and 20 minute SLAs, plus initiatives like cloud migration and enterprise standardization. They also want to ensure that new tools fit within existing infrastructure.
The typical data science process neglects most of these stakeholders, letting the legitimately great promise of data science go unrealized.
Reddit blogs on the optimal data science organizational structure don’t get the same traction as throwdowns about Python and R.
Data scientists wear their tool wrangling as a badge of honor, and it is wrapped up in their identity.
Many organizations have not built a culture of delivery and iteration. This could be a result of many data scientists’ extensive academic backgrounds, though it likely also stems from confusion about what type of work is really happening: “pure research” or “applying templates to novel business situations.”
Teams are often hired into disconnected Innovation Labs without real business accountability to hone their process. Data science becomes “those people over there in the corner.” This also means they don’t have a deep understanding of the target KPIs and the nuances of how a team works today, which is critical to ensuring adoption of their results.
The full stack data scientist is dead, if she ever existed at all. The move towards specialization isn’t just data engineers, it’s a whole host of other roles that cover the concepts of change management, feasibility assessments, rapid prototyping, ROI estimation, training, and stakeholder education. Data science training often focuses on the technical skills, which are necessary but insufficient for driving impact. Increasingly, the role is being partitioned into many roles, as happened with software development over the last two decades.
Data scientists often think of their work as bespoke and highly specialized. While their skillset may be, many of the artifacts they create can and should be re-used, whether those are software packages, data viz suites, feature stores, or anything else. Moreover, many data scientists barely document their development process, much less modularize it.
Many data scientists have told me “I get paid for what I build this year, not maintaining what I built last year.” That leads to huge gaps in monitoring live production models, as the responsibility falls to IT, which focuses on system performance but doesn’t know whether the model is “still right” or being used appropriately.
Who am I?
Work with large enterprises and small start-ups to understand how data science is changing their business
Worked at Bridgewater Associates, hyper-focused on research
Managed (often poorly) data scientists and data engineers
I’m here to tell you that there’s a better way. You can overcome this by getting the right people and processes working through a centralized system of record.
Domino offers a detailed inventory of existing projects and models, supporting knowledge management efforts. We track all research and development work to ensure auditability and facilitate best practices like subject matter expert involvement and code review. We support publication to human consumers via interactive web applications and automated reports, along with publication to other software systems via enterprise-grade APIs. Finally, we track how different projects relate and who is working on them, providing meta-analytics to better direct your future investment.
As more and more of your business decisions are augmented or replaced entirely by data science products, you can’t afford an ad hoc system.
Expect and embrace iteration
Data science is never linear. All the flow charts in the world won’t stop a project from frequently needing to go back to find new data or re-validate a solution. That’s OK.
Enable high-impact collaboration
Collaboration means being able to find, discuss, understand, and build on past work. It shouldn’t matter if the original author has left the company, or if the project was 4 versions of Pandas ago.
Ensure auditability and reproducibility
For regulated industries, understanding all the steps in a model’s lifecycle is mandated by law. Even for those that aren’t, as more and more of your models affect critical parts of citizens’ lives (what they read, what disease they’re diagnosed with, etc.), it’s critical to lay the groundwork for seamless auditability.
Some of the most important work in the overall lifecycle happens before a line of code is written. If done well, the ideation stage dramatically de-risks a project by driving alignment across stakeholders. This is where the business objective is identified, success criteria laid out, prior art is reviewed, and initial ROI calculations are performed. Ideation is when feasibility is assessed, both in terms of “Does the data even exist?” and “Can we realistically change how the business process works?” It is also where prioritization happens relative to other potential projects. Below are some best practices we’ve observed that get to the root of many of the problems discussed earlier.
Problem first, not data first
Many organizations start with the data and look for something “interesting” rather than building a deep understanding of the existing business process and then pinpointing the decision point that can be augmented or automated. Leading organizations go so far as to literally map existing business processes in tools like Visio, PPT, or Lucidchart and then circle on that map the exact points where data science could potentially focus.
Practice and master order of magnitude ROI math
The ability to estimate the potential business impact of a change in a statistical measure is one of the best predictors of success for a data science team. For example, if we reduce fraudulent insurance claims by 1%, how much would we save? What is a conservative estimate of how much improvement we can expect from the data scientists’ efforts? Settle on a number based on past experience, erring on the conservative side.
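The fraud question above can be worked through as back-of-the-envelope arithmetic. The sketch below is only an illustration: the function names and every dollar figure are made-up placeholders rather than numbers from the talk, and the 1% is read as a relative reduction in fraud losses.

```python
def annual_savings(annual_claims_paid, fraud_share, relative_reduction):
    """Order-of-magnitude estimate of yearly savings from reducing fraud."""
    fraud_cost = annual_claims_paid * fraud_share
    return fraud_cost * relative_reduction


def simple_roi(savings, project_cost):
    """Net return per dollar spent on the project."""
    return (savings - project_cost) / project_cost


# Every figure here is a hypothetical placeholder, chosen only for illustration.
claims_paid = 1_000_000_000   # assumed: $1B paid out in claims per year
fraud_share = 0.05            # assumed: 5% of paid claims are fraudulent
reduction = 0.01              # the 1% reduction in fraudulent claims from the text
project_cost = 250_000        # assumed: fully loaded cost of the project

savings = annual_savings(claims_paid, fraud_share, reduction)
roi = simple_roi(savings, project_cost)
print(f"Estimated annual savings: ${savings:,.0f}")
print(f"Simple ROI: {roi:.1f}x")
```

The point is the order of magnitude, not the decimals: if a conservative estimate does not clear the project cost by a wide margin, that is worth knowing before any modeling starts.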
Maintain repo of past work with business domain and technical experts
As teams grow, no person can be an expert in everything. It’s critical to have a way to search to see who is most familiar with the latest version of TensorFlow or who has done the most work in the marketing attribution space. Code search is helpful, but ideally includes relevant discussion, environments, and data.
Create and enforce templates for model requirements documents
Documentation up front saves time 10:1 down the road. Create a template for 80% of cases, knowing there will always be exceptions.
Maintain a stakeholder-driven backlog
Your stakeholders should always be able to see what’s in flight and what’s been put in the backlog. Like any product org, they don’t necessarily get to change it, but you should have recurring check-ins with them to ensure priorities haven’t shifted.
This is where the shape of the final deliverable is agreed upon. It’s always possible to amend the agreed upon deliverable or to have multiple, but visualizing the ultimate consumption medium and working backwards is key. Are you building a one-off answer to support a strategic decision, a standalone lightweight app for stakeholders to use, or a real-time data product that integrates into other systems? The best organizations start simple, get the result into the business, learn and measure before updating the model with a more sophisticated approach (more features, more complex algorithm, deeper integration).
Create multiple mock-ups of different deliverable types - A leading e-commerce company creates 3-5 mocks for every data science project they take on, even bringing in a designer to make it feel real. For example, they discovered exposing their model as a HipChat bot was the most user-friendly way to leverage the model. By iterating on design possibilities before they get data, they ensure they’ve surfaced any previously undiscovered requirements and maximize their odds of adoption.
Bring IT and engineering stakeholders in early - A model may work spectacularly in the lab, but not have any hope of ever working in production the way envisioned by the business. IT and engineering stakeholders need a seat at the table this early in order to identify constraints like “We only backfill that data monthly from the vendor, so we can’t do a real-time scoring engine.”
Consider creating synthetic data with baseline models - Some organizations even create synthetic data and naive baseline models to show how the model would impact existing business processes. A leading agriculture company devotes an entire team to creating synthetic “perfect” data (e.g., no nulls, full history, realistic distribution) to establish potential value with the business before they go contract with expensive satellite data providers to get “real” data.
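A minimal sketch of the synthetic-data idea, using only the Python standard library. The rainfall and yield variables, the coefficients, and the noise level are all invented for illustration; they are not the agriculture company's data or method. The sketch generates "perfect" rows, then compares a naive mean baseline with a one-feature line fit so stakeholders can see roughly how much a model could add.

```python
import random
import statistics


def make_synthetic_rows(n_rows, seed=0):
    """Generate 'perfect' rows: no nulls, one feature, one noisy target."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        rainfall_mm = rng.uniform(200.0, 800.0)  # hypothetical feature
        crop_yield = 2.0 + 0.005 * rainfall_mm + rng.gauss(0.0, 0.3)  # hypothetical target
        rows.append((rainfall_mm, crop_yield))
    return rows


def fit_mean_baseline(rows):
    """Naive baseline: always predict the mean target seen in training."""
    mean_target = statistics.fmean([y for _, y in rows])
    return lambda _x: mean_target


def fit_line(rows):
    """Ordinary least squares for a single feature."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_absolute_error(model, rows):
    return statistics.fmean([abs(model(x) - y) for x, y in rows])


rows = make_synthetic_rows(500)
train, test = rows[:400], rows[400:]
baseline_mae = mean_absolute_error(fit_mean_baseline(train), test)
line_mae = mean_absolute_error(fit_line(train), test)
print(f"baseline MAE={baseline_mae:.3f}, simple model MAE={line_mae:.3f}")
```

The gap between the two errors is a rough upper bound on the value at stake, which can feed the ROI conversation before any real data is purchased.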
Establish standard software configurations, but give flexibility to experiment
Data scientists can often spend the first 8 weeks on the job configuring their workstation rather than exploring existing work and understanding their stakeholders’ priorities. Having a few standard environments gets people onboarded faster. Yet, it’s important they retain flexibility to try new tools and techniques. The tool acquisition process can be so arduous that some data scientists covertly bring their personal machines to work so they don’t have to wait 8 months for a Python package to be approved. Technologies like Docker can eliminate much of this headache.
Abstract away compute provisioning
Data scientists can wait weeks or even months to get the hardware necessary to accelerate their workflows. At Monsanto, they were able to take a research task that took 24 hours to run and complete it in 30 minutes by running it in parallel across dozens of EC2 machines.
Build simple models first
Resist the temptation to use 500 features. One company we know did this, spent weeks engineering the features and tuning the hyperparameters, only to learn that many of them were either a) not collected in real-time so couldn’t be used in the target use case or b) not allowed for compliance reasons. They ended up using a simple five-feature model and then working with their IT team to capture other data in real-time.
Set a cadence for delivering insights
The most common failure mode is that data science delivers results that are either too late or don’t fit into how the business works today, so the results gather dust. Share insights early and often. One leading organization has their data scientists share an insight every 3-4 days. If they can’t publish a short post on incremental findings in business-friendly language, then chances are they are down a rabbit hole. This lets the manager coach more junior or academically-oriented team members, plus gives an easily consumable timeline of the progress for stakeholders.
Ensure business KPI tracked consistently over time
Too often, data scientists lose sight of the business KPI they are trying to affect and instead focus on a narrow statistical measure. Leading teams ensure that the relevant KPI is never far from their experiments, whether it’s the Sharpe ratio of a hedge fund’s backtest or the Customer Acquisition Cost for an e-commerce company.
More than just code review, get stakeholder and IT sign-off
This helps prevent delays in the process of delivery and avoids user adoption hurdles down the road.
Ensure reproducibility and clear lineage of project
Quality validation entails dissecting a model and checking assumptions and sensitivities. This is nearly impossible if a validator spends 90% of their time just gathering documentation.
Use automated validation checks to support human inspection
While data science’s non-deterministic nature means that unit testing does not directly apply, there are often repeated steps in a validation process that can be automated. That may be a set of summary statistics and charts, a portfolio backtest, or any other step that could be turned into an automated diagnostic.
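As one hedged illustration of an automated diagnostic, the sketch below summarizes a batch of model scores and raises flags for a human validator to inspect. The function name, the flag names, and the assumption that scores should fall between 0 and 1 are all hypothetical choices for this example.

```python
import math
import statistics


def validation_report(predictions, lower=0.0, upper=1.0):
    """Summarize model scores and flag basic problems for a human reviewer."""
    finite = [
        p for p in predictions
        if isinstance(p, (int, float)) and math.isfinite(p)
    ]
    report = {
        "count": len(predictions),
        "non_finite": len(predictions) - len(finite),  # None, NaN, inf
        "out_of_range": sum(1 for p in finite if not lower <= p <= upper),
        "mean": statistics.fmean(finite) if finite else None,
        "stdev": statistics.pstdev(finite) if finite else None,
    }
    # Flags support, rather than replace, human inspection.
    report["flags"] = [
        name for name in ("non_finite", "out_of_range") if report[name] > 0
    ]
    if report["stdev"] == 0:
        report["flags"].append("constant_predictions")
    return report


example = validation_report([0.1, 0.9, 1.4, float("nan")])
print(example["flags"])
```

A report like this can be attached to every validation request so the reviewer starts from the same summary each time.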
Preserve null results
Even if a project yields no material uplift and doesn’t get deployed into production, it’s critical to document it and preserve it in the same knowledge repo. Too often, we hear of data scientists redoing work that someone else already explored, with no knowledge of the previous inquiry.
Support for many deliverable artifacts (reports, dashboards, apps, batch APIs, real-time APIs)
While real-time scoring gets all the glory, the vast majority of models will at one time or another be prototype apps, dashboards, or batch scoring engines. It’s important to keep a link between all those deliverables because it saves time and avoids risk that key feedback is lost.
Have a promote-to-production workflow
Too often data science teams throw a result over the wall. If you establish the workflow ahead of time, you lower the burden for iterating on a new version of the model. Know what environments and packages are acceptable in production. Know who can make those decisions and what the escalation path is.
Flag upstream and downstream dependencies
A model is at its most risky when it finally makes it to production. Ensure that you know the upstream dependencies: what training data was used, what transformations were done with what tools, what modeling packages were used, etc. Also make sure you know the downstream dependencies (e.g., this nightly batch model is stacked on another model).
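One lightweight way to record these dependencies is a manifest stored next to the model. The sketch below is an assumption about what such a record might contain; the class, field names, file names, and version strings are hypothetical and do not describe any particular platform's format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelManifest:
    """Upstream and downstream dependencies for one deployed model."""
    model_name: str
    training_data_sha256: str                              # upstream: exact data used
    transformations: list = field(default_factory=list)    # upstream: scripts or steps
    packages: dict = field(default_factory=dict)           # upstream: package -> version
    downstream_consumers: list = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def fingerprint(data: bytes) -> str:
    """Content hash so a later audit can confirm the same data was used."""
    return hashlib.sha256(data).hexdigest()


manifest = ModelManifest(
    model_name="nightly_batch_scores",                     # hypothetical model
    training_data_sha256=fingerprint(b"id,feature,target\n1,0.3,1\n"),
    transformations=["clean_nulls.py", "build_features.py"],
    packages={"example-modeling-lib": "1.2.3"},            # placeholder package
    downstream_consumers=["stacked_pricing_model"],
)
print(manifest.to_json())
```

Writing the manifest at promotion time, and refusing to promote without one, keeps the upstream and downstream picture current without relying on memory.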
Monitoring is often forgotten because getting to delivery is so hard. It’s important not to monitor just the system performance (uptime, latency), but also the usage (more or less than expected) and the statistical performance (is the model degrading?). IT often owns this process but is ill-equipped, with traditional tools like New Relic and little context from the data scientists who handed off the model.
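Statistical performance can be watched in many ways. One common choice, used here purely as an illustration and not something the talk prescribes, is the population stability index (PSI), which compares the distribution of scores at training time with the distribution seen in production. The bin edges, the sample data, and the floor value below are assumptions.

```python
import math


def population_stability_index(expected, actual, bin_edges, floor=1e-6):
    """PSI between two samples, using shared bin edges (sorted ascending)."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(1 for edge in bin_edges if v >= edge)] += 1
        # Floor empty bins so the logarithm below stays defined.
        return [max(c / len(values), floor) for c in counts]

    e_shares, a_shares = shares(expected), shares(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares)
    )


training_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
live_scores = [0.6, 0.7, 0.8, 0.9, 0.9, 0.95, 0.8, 0.7, 0.6, 0.9]
edges = [0.25, 0.5, 0.75]

stable = population_stability_index(training_scores, training_scores, edges)
shifted = population_stability_index(training_scores, live_scores, edges)
print(f"stable={stable:.3f}, shifted={shifted:.3f}")
```

A value near zero means the two distributions match; what counts as "too much" drift is a judgment call the model's author should write into the monitoring plan.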
Build testing into all major deliverables
One leading organization established a global holdout group from all of their customer segmentation and price elasticity models. After a year, they compared the average revenue from the holdout group to that of the customers whose experience was guided by the predictive models. The overall lift was more than $1 billion, which gave them the credibility to dramatically expand the team and push models into more steps of the customer journey.
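The holdout comparison described above reduces to simple arithmetic. The sketch below uses made-up revenue figures and a hypothetical function name; it is a naive comparison of averages and says nothing about statistical significance, which a real analysis would need to address.

```python
import statistics


def holdout_lift(treated_revenue, holdout_revenue, treated_population):
    """Compare average revenue per customer between treated and holdout groups."""
    treated_avg = statistics.fmean(treated_revenue)
    holdout_avg = statistics.fmean(holdout_revenue)
    per_customer = treated_avg - holdout_avg
    return {
        "per_customer_lift": per_customer,
        "relative_lift": per_customer / holdout_avg,
        # Extrapolate the per-customer difference to everyone who got the models.
        "total_lift": per_customer * treated_population,
    }


# Placeholder figures for illustration only.
result = holdout_lift(
    treated_revenue=[120.0, 110.0, 130.0],   # customers guided by the models
    holdout_revenue=[100.0, 95.0, 105.0],    # global holdout group
    treated_population=1_000,
)
print(result)
```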
Require monitoring plans for proactive alerting, acceptable uses, and notification thresholds
The data scientist who created the model is the one best positioned to know what risks are inherent from their approach. Rather than wait for the business to notice something is wrong or a metric to drift, codify that knowledge into your monitoring system. Do you expect certain input types and ranges? If it’s outside of those, what should you do? Rollback? Stop serving predictions? What if someone in a totally different department starts consuming the model in a way that may be risky or outright wrong? Working collaboratively with IT or engineering, data scientists can put the appropriate guardrails on their creations.
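The credit card story from earlier in the talk, where "not present in the DB" changed from a null to a 999, is exactly the kind of knowledge that can be codified. The sketch below is one possible shape for such a guardrail; the rule class, the 300 to 850 range, and the choice to refuse to serve a prediction are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputRule:
    """Expected range for one model input, plus known placeholder values."""
    name: str
    minimum: float
    maximum: float
    sentinels: tuple = ()


def check_inputs(record, rules):
    """Return a list of human-readable problems; empty means the record is fine."""
    problems = []
    for rule in rules:
        value = record.get(rule.name)
        if value is None:
            problems.append(f"{rule.name}: missing")
        elif value in rule.sentinels:
            problems.append(f"{rule.name}: sentinel value {value}")
        elif not rule.minimum <= value <= rule.maximum:
            problems.append(
                f"{rule.name}: {value} outside [{rule.minimum}, {rule.maximum}]"
            )
    return problems


def score_or_refuse(record, rules, model):
    """Stop serving predictions when inputs break the documented expectations."""
    problems = check_inputs(record, rules)
    if problems:
        return {"served": False, "problems": problems}
    return {"served": True, "score": model(record)}


# Assumed range and sentinel, chosen to mirror the 999 story.
rules = [InputRule("credit_score", minimum=300, maximum=850, sentinels=(999,))]
print(score_or_refuse({"credit_score": 999}, rules, model=lambda r: 0.02))
```

Whether the right response is to refuse, fall back to a default, or only alert is a business decision; the code just makes the expectation explicit.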
Integrate with tools where people spend most of their time (e.g., email / Slack)
High performing teams realize that monitoring is only good if someone acknowledges, inspects, and changes behavior if necessary. We’ve seen organizations build alerts into chatbots or email systems to ensure they can keep up with the alerts as their number of production models scales.
Anticipate risk and change management burdens
At one large insurer, they have a team called Business Analytics Engineers who proactively assess and address change management problems when they deploy a data science product like a new claims pricing app. They cover things like training, provide pre-determined feedback channels, and measure usage and engagement to ensure success.
Measure everything, including yourself
Ironically, data scientists live in the world of measurement yet rarely turn that lens on themselves. Tracking patterns in aggregate workflows helps create modular templates, disseminate best practices from high-performing teams, and guide investment in internal tooling and people to alleviate bottlenecks. Monsanto, a large AWS customer, tracks more than 350 simultaneous projects across 10 business units with more than 200 data scientists. By looking at the aggregate portfolio, they can see that they may need to devote more resources to hiring data engineers. Another large tech company similarly estimated that across their entire body of work there were fundamentally only 15 “unique” types of problems and set about formulating templates to streamline their process.
Focus on reducing time to iterate
This “feature” is ultimately the best predictor of data science organizational success we’ve seen. Minimal obstacles (without sacrificing rigorous review and checks) to test real results is another great predictor of data science success. Big tech companies deploy new models in minutes, whereas large financial services companies can take 18 months.
Socialize aggregate portfolio impact
Even if it’s not precise, it’s critical to socialize the impact of the whole portfolio of data science projects. Doing so addresses data scientists’ concerns about impact and helps address executive level concerns about investing further in data science. Importantly, don’t claim the credit for yourselves; present it as a collective achievement of all the stakeholders.
As mentioned earlier, the full stack data scientist no longer exists and the roles are increasingly specialized. This is a natural evolution that we expect will continue as data science becomes ingrained into the fabric of how organizations function.
The most consistent feedback we’ve heard is the increasing demand for a “product manager” type role as most organizations move from delivering mathematical results to stakeholder-facing apps. In large tech organizations, data science sits peer with product management to drive strategic priorities and ongoing optimization of engagement and impact.
Most evolve as they scale and as business demands shift
We see many organizations start with a centralized “Center of Excellence” for data science to build their core technical infrastructure before evolving to a hybrid structure. In this structure, the central team focuses on building templates (documentation, software environments, project stage flows) and codifying best practices while embedded groups sit next to each major business line to address the “bookend” problems of identifying the right data science problem and maximizing adoption of solutions. Sometimes a full data science guild (to borrow Spotify’s term) exists and meets regardless of their day-to-day functional department.
Technology is much better positioned to help address the pains of decentralization.
A data science platform can facilitate technical knowledge sharing, encourage or enforce best practices, and provide transparency while still allowing data scientists to be closer to the business.