by William Whipple Neely
Director of Data Science at Electronic Arts
Data scientists and analysts write code, sometimes a lot of code, so we are software developers as much as model builders and algorithm creators. This talk is about the challenges a team of data scientists and analysts faces when trying to scale its work and make that work repeatable and testable. I’ll talk about how our data science team is leveling up its skills as software developers, the challenges we’ve faced, and the strategies that are helping.
1. DATA SCIENTISTS AND ANALYSTS
ARE ALSO SOFTWARE ENGINEERS
W.Whipple Neely
Director of Data Science, EA
2. THIS TALK IS ABOUT …..
Moving data science and analytics teams to a software development
model.
• The motivation is to create repeatable, verifiable
processes.
• It also means that we can bring powerful but “personal” analysis
environments (such as R) into producing enterprise-level systems, to
create work that typical dashboarding systems cannot achieve.
• In many ways this is a story about one set of teams. It may not apply
to all groups, but it has helped ours.
3. THE TYPICAL VENN DIAGRAM: WHO IS A DATA SCIENTIST
[Venn diagram: Data Science at the intersection of Statistics, Some Version of Domain Expertise, and Computer Science (“hacker skills”)]
“What kind of person does all this? What abilities make a data scientist
successful? Think of him or her as a hybrid of data hacker, analyst,
communicator, and trusted adviser.”
Davenport and Patil, “Data Scientist: The Sexiest Job of the 21st Century”,
Harvard Business Review, 2012
“Hacker skills” is the wrong term
4. GOOGLE IMAGE SEARCH: “WHO DATA SCIENTIST VENN DIAGRAM”
5. WHAT WE DO INSTEAD OF WHO WE ARE
[Venn diagram: Data Science at the intersection of Engineering, Science, and Collaboration]
• Engineering: data engineering, coding discipline, software engineering, style guides, reproducibility, source code control, regression tests
• Science: math, stats, computer science, machine learning, probability models, economics, “substantive domain expertise”, vast quantities of common sense
• Collaboration: rules of engagement, empathy, communication and listening skills, flexibility, reliability, extreme social skills
6. THE PROBLEMS
We have a team of data scientists who are experts at probability modeling
and machine learning, and a few of them are pretty good at programming in R,
Matlab or Python on a laptop. However …
1. Most have no experience of team programming
2. Many come without experience of creating software that others can use, or
that is robust enough to run unattended
3. Creating an enterprise-level repeatable process can’t be left to the kind of
programming that most of us do on our laptops
4. There is no easy intermediate step between working on a laptop and
something that works on the enterprise platform.
7. WHERE WE STARTED
Write R or Python Script → Run Script Manually → Update Report
OR
Write R or Python Script → Run Script Manually → Update a Static Model Implementation
8. THE PROBLEMS WITH WHERE WE
STARTED
• Code/methods/models got lost.
• Lots of manual work.
• No automated checks for correctness or robustness of
models or predictions.
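As an illustration of the kind of automated check that was missing, here is a minimal Python sketch. It is not the team's code: the function name `check_predictions` and the assumption that predictions are probabilities in [0, 1] are hypothetical, chosen only to show the idea of validating model output before it reaches a report.

```python
import math

def check_predictions(predictions, lower=0.0, upper=1.0):
    """Return a list of human-readable problems found in a batch of predictions.

    Assumes (hypothetically) that predictions are numbers expected to fall
    in the closed range [lower, upper], e.g. probabilities.
    """
    problems = []
    if not predictions:
        problems.append("no predictions produced")
        return problems
    for i, p in enumerate(predictions):
        # Missing values: None or NaN.
        if p is None or (isinstance(p, float) and math.isnan(p)):
            problems.append(f"prediction {i} is missing")
        # Values outside the expected range.
        elif not (lower <= p <= upper):
            problems.append(f"prediction {i} = {p} is outside [{lower}, {upper}]")
    return problems
```

A scheduled script could call this after scoring and refuse to update the report when the returned list is not empty.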
9. WE TALKED TO THE TEAMS ABOUT WHAT
WAS WRONG
“Our analysts are pretty good at writing scripts and generating
reports, but our team needs help with the bookends: scheduling
tasks and serving the reports automatically” – Colleen Chrisco,
Director of Analytics, PopCap Games
10. IN TERMS OF OUR DIAGRAM
[Diagram from slide 5, repeated: Data Science at the intersection of Engineering, Science, and Collaboration, with the same skill lists for each area]
11. THIS WAS A LITTLE SCARY FOR SOME OF OUR TEAMS …
“We’re not programmers.”
“I don’t even know where to start.”
“I’ve never scheduled a job before.”
12. SO, TO ANSWER THESE CONCERNS WE DID THE FOLLOWING…
[Workflow diagram: Perforce, R Server, R Script]
1. Check in Code: P4V, R-Checkin
2. Submit Job: Schedule file, API, Web
3. Run Script: Reporting, Models, ETLs, Forecasting
Script Inputs: csv, DBs, URL, logs, RDS
Script Outputs: csv, DBs, email, doc, pdf, html, shiny, RDS
By “we did the following” I really mean that we hired a brilliant computer
scientist named Ben Weber who became part of the team. Ben learned
the workflows of the team members and created this system for us.
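The “submit job, run script” steps can be sketched in Python as a small schedule-driven runner. This is only an illustration under stated assumptions, not the system Ben built: the `Job` class, the `due_jobs` and `run_due` functions, and the script paths are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Job:
    script: str                          # path of a checked-in script (hypothetical)
    every: timedelta                     # how often the script should run
    last_run: Optional[datetime] = None  # None means the job has never run

def due_jobs(jobs, now):
    """Return the jobs that have never run or whose interval has elapsed."""
    return [j for j in jobs if j.last_run is None or now - j.last_run >= j.every]

def run_due(jobs, now, runner):
    """Run each due job with `runner`, record the run time, and return what ran."""
    ran = []
    for job in due_jobs(jobs, now):
        runner(job.script)   # e.g. a function that launches the script
        job.last_run = now
        ran.append(job.script)
    return ran
```

In a real setup the `runner` argument might wrap something like `subprocess.run(["Rscript", path])`; here it is left as a parameter so the scheduling logic can be tested without running any script.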
13. WHERE IT LANDED US
• We’d automated.
• We’d gotten the “bookends” covered.
• Many analytics teams, including the data science team, are using the
system.
As a result …
• Teams started using the technology to improve their work
• Teams became more efficient: “I no longer have to be a walking
dashboard.”
• Astonishingly, these teams now have their routine code in source
control.
14. BUT IT DIDN’T SOLVE EVERYTHING
• We had produced more tools and simplified tasks, but hadn’t really
created the culture of a software-producing organization.
• We had extended the laptop model only a little, by introducing VMs that
could run the code.
And giving teams more tools had introduced some issues …
• A proliferation of models/predictions being run without curating the
processes.
• People leave, and their work continues to be run automatically. This
is not always a bad thing, but it is often not a good thing either.
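One way to start curating automated work is to record an owner and a last-review date for every scheduled job and flag the ones that need attention. The Python sketch below is an assumption about how that could look, not something the talk describes: `ScheduledJob`, `needs_curation`, and the 180-day review window are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ScheduledJob:
    name: str
    owner: str            # person responsible for the job
    last_reviewed: date   # when someone last confirmed the job is still needed

def needs_curation(jobs, active_staff, today, max_age_days=180):
    """Map job name -> list of reasons the job should be looked at."""
    flagged = {}
    for job in jobs:
        reasons = []
        if job.owner not in active_staff:
            reasons.append("owner has left")
        if (today - job.last_reviewed).days > max_age_days:
            reasons.append("review overdue")
        if reasons:
            flagged[job.name] = reasons
    return flagged
```

Running a check like this on a schedule would surface jobs that keep running after their author has moved on.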
15. WHAT WE KNEW WE HAD TO DO NEXT
We needed to make a cultural change from what is essentially
“hacking” to engineering.
• Started hiring people with more software engineering
skills.
• Introduced a style guide for our R code.
• Started code and project reviews.
• Hired a very non-technical writer to start helping the team
produce documentation on our internal Confluence site.
• Started providing training in team programming, engineering,
and new languages and tools (Spark, Python).
• Assigned some of the positions on the team to be the
software/coding gurus.
16. WHAT’S NEXT
• Dev/Test/Prod environments.
• Upgrading our toolset to work with RStudio Server and Git.
• Pair programming: a team member whose primary background is
software engineering works side by side with a data
scientist who has focused on statistical modeling and
machine learning.
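To make the Dev/Test/Prod idea concrete, here is a minimal Python sketch of a script choosing its settings from an environment variable. Everything named here is an assumption for illustration: the `DS_ENV` variable, the `SETTINGS` table, and the database names do not come from the talk.

```python
import os

# Hypothetical per-environment settings; only prod sends real email.
SETTINGS = {
    "dev":  {"database": "analytics_dev",  "send_email": False},
    "test": {"database": "analytics_test", "send_email": False},
    "prod": {"database": "analytics",      "send_email": True},
}

def load_settings(environ=None):
    """Pick settings based on the DS_ENV variable, defaulting to dev."""
    environ = os.environ if environ is None else environ
    env = environ.get("DS_ENV", "dev")
    if env not in SETTINGS:
        raise ValueError(f"unknown environment: {env!r}")
    return {"env": env, **SETTINGS[env]}
```

Defaulting to dev means a script run casually on a laptop cannot write to production data or email stakeholders by accident.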