An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
Big Data Talent in Academic and Industry R&D
1. A Confluence of Big Data Skills in
Academic and Industry R&D
Bill Howe, PhD
Associate Director
University of Washington eScience Institute
2. The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
3. “All across our campus, the process of discovery will
increasingly rely on researchers’ ability to extract
knowledge from vast amounts of data… In order to
remain at the forefront, UW must be a leader in
advancing these techniques and technologies, and in
making [them] accessible to researchers in the
broadest imaginable range of fields.”
2005-2008
In other words:
• Data-intensive research will be ubiquitous
• It’s about intellectual infrastructure and software infrastructure,
not only computational infrastructure
http://escience.washington.edu
4. A 5-year, US $37.8 million cross-institutional
collaboration to create a data science environment
2014
5. “It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
7. 5/7/2015 Bill Howe, UW 7
…the new breed of scientist must be a broadly-
trained expert in statistics, in computing, in
algorithm-building, in software design
The skills required to be a successful scientific
researcher are increasingly indistinguishable from
the skills required to be successful in industry.
Jake Vanderplas
9. “Data Science” is not the only example…
• Strong Math + PhD → Quant, on Wall Street
• Strong “Data” + PhD → Data Scientist, anywhere
16.
“I worry that the Data Scientist role is like
the mythical “webmaster” of the 90s:
master of all trades.”
-- Aaron Kimball, CTO of Zymergen,
formerly CTO of Wibidata, formerly
co-founder of Cloudera
17.
What to look for in data science skills:
tools vs. principles
desktop vs. cloud
data structures vs. statistics
hackers vs. analysts
18. Cambrian Explosion of Big Data Systems
tools principles
19.
What are the abstractions of
data science?
“Data Jujitsu”
“Data Wrangling”
“Data Munging”
Translation: “We have no idea what
this is all about”
tools principles
20.
1850s: matrices and linear algebra (today: engineers and scientists)
1950s: arrays and custom algorithms (today: C/Fortran performance junkies)
1950s: s-expressions and pure functions (today: language purists)
1960s: objects and methods (today: software engineers)
1970s: files and scripts (today: system administrators)
1970s: relations and relational algebra (today: industry data pros)
1980s: data frames and functions (today: statisticians)
2000s: key-value pairs + one of the above (today: NoSQL hipsters)
But what are the abstractions of
data science?
tools principles
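To make the contrast concrete, here is a toy sketch (invented data and names, not from the talk) that computes the same per-group average under three of the abstractions listed above: relations and relational algebra (via SQL), key-value pairs, and pure functions over immutable data. The point is that the computation is identical; only the programming model changes.

```python
# Toy illustration: one aggregation, three abstractions. All data invented.
import sqlite3
from collections import defaultdict

rows = [("siteA", 2.0), ("siteA", 4.0), ("siteB", 10.0)]

# 1970s: relations and relational algebra (expressed in SQL)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (site TEXT, val REAL)")
con.executemany("INSERT INTO obs VALUES (?, ?)", rows)
sql_result = dict(con.execute("SELECT site, AVG(val) FROM obs GROUP BY site"))

# 2000s: key-value pairs -- group by key, then reduce each group
groups = defaultdict(list)
for site, val in rows:
    groups[site].append(val)
kv_result = {site: sum(vals) / len(vals) for site, vals in groups.items()}

# 1950s: s-expressions / pure functions -- no mutation, just functions of data
def avg_for(site):
    vals = [v for s, v in rows if s == site]
    return sum(vals) / len(vals)

fn_result = {s: avg_for(s) for s in {s for s, _ in rows}}

assert sql_result == kv_result == fn_result  # {"siteA": 3.0, "siteB": 10.0}
```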
21.
“80% of analytics is sums and averages”
-- Aaron Kimball, Wibidata
data structures statistics
22. “The intuition behind this ought to be very simple: Mr. Obama
is maintaining leads in the polls in Ohio and other states that
are sufficient for him to win 270 electoral votes.”
Nate Silver, Oct. 26, 2012
“…the argument we’re making is exceedingly simple. Here it
is: Obama’s ahead in Ohio.”
Nate Silver, Nov. 2, 2012
“The bar set by the competition was invitingly low. Someone could
look like a genius simply by doing some fairly basic research into
what really has predictive power in a political campaign.”
Nate Silver, Nov. 10, 2012
DailyBeast
fivethirtyeight.com
fivethirtyeight.com
source: randy stewart
Nate Silver
data structures statistics
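Silver's "exceedingly simple" argument can be sketched in a few lines: average each state's recent polls and call the state for whoever leads the average. The poll figures below are invented for illustration; they are not Silver's data.

```python
# Minimal sketch of poll-averaging, with hypothetical poll numbers.
state_polls = {
    # state: list of (candidate_a_share, candidate_b_share) from recent polls
    "OH": [(50, 46), (48, 47), (49, 48)],
    "FL": [(47, 49), (48, 48), (46, 50)],
}

def call_state(polls):
    """Call the state for whichever candidate leads the average of polls."""
    avg_a = sum(p[0] for p in polls) / len(polls)
    avg_b = sum(p[1] for p in polls) / len(polls)
    if avg_a > avg_b:
        return "A"
    if avg_b > avg_a:
        return "B"
    return "tie"

calls = {state: call_state(polls) for state, polls in state_polls.items()}
# With the invented numbers above: candidate A leads OH, candidate B leads FL
```

The sophistication in Silver's actual work went into modeling uncertainty and communicating results, not into the core prediction rule.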
23. Data Science Workflow
1) Preparing to run a model: gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging (“80% of the work” -- Aaron Kimball)
2) Running the model (academia puts far too much emphasis on this step)
3) Interpreting the results (“the other 80% of the work”)
data structures statistics
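A toy sketch of what the "80% of the work" preparation step looks like in practice: parsing raw records, dropping missing and sentinel values, and restructuring into typed rows before any model runs. The data, field names, and sentinel value are invented for illustration.

```python
# Toy data-cleaning pass over a raw CSV extract (all values invented).
import csv
import io

raw = io.StringIO(
    "id,temp_c,site\n"
    "1,12.5,A\n"
    "2,,A\n"        # missing measurement
    "3,-999,B\n"    # -999 used here as a hypothetical bad-sensor sentinel
    "4,13.1,B\n"
)

clean = []
for row in csv.DictReader(raw):
    if not row["temp_c"]:          # filter: drop rows with missing values
        continue
    temp = float(row["temp_c"])    # transform: parse strings into numbers
    if temp == -999:               # verify: drop sentinel "bad data" values
        continue
    clean.append({"id": int(row["id"]), "temp_c": temp, "site": row["site"]})

# Only rows 1 and 4 survive cleaning
```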
24. Problem
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
data structures statistics
25. “[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of
scale, file handling, and feature
engineering.
Martin Kircher,
Genome Sciences

Why?
3k NSF postdocs in 2010
$50k / postdoc
at least 50% overhead
maybe $75M annually
at NSF alone?
desk cloud
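A quick check of the back-of-envelope arithmetic above, assuming "overhead" means the share of postdoc time lost to data handling:

```python
# Back-of-envelope cost of data-handling overhead across NSF postdocs,
# using the figures from the slide.
postdocs = 3_000           # NSF postdocs in 2010
cost_per_postdoc = 50_000  # USD per postdoc per year
overhead = 0.50            # at least 50% of time spent handling data

annual_overhead_cost = postdocs * cost_per_postdoc * overhead
# 3,000 * $50,000 * 0.5 = $75,000,000 per year, at NSF alone
```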
26. …up to 1 GB (volume)
…up to 10 data sources (variety)
…up to 1% churn/day (velocity)
…up to 1% bad data (veracity)
…up to 10 collaborators
With “manual” approaches,
you can comfortably handle…
But we’re seeing a 10x-100x increase in every
dimension, even under modest assumptions
desk cloud data structures statistics
27. US faces shortage of 140,000 to 190,000 people “with
deep analytical skills, as well as 1.5 million managers
and analysts with the know-how to use the analysis of
big data to make effective decisions.”
--McKinsey Global Institute
hackers analysts
28. Where do you store your data?
src: Conversations with Research Leaders (2008)
src: Faculty Technology Survey (2011)
My computer: 87%
External device (hard drive, thumb drive): 66%
Department-managed server: 41%
Server managed by research group: 27%
External (non-UW) data center: 12%
Department-managed data center: 6%
Other: 5%
Lewis et al 2011
29. Conversations with DS Hiring Managers
• “How to ask the right questions and communicate
results”
– DS: "I tried three methods, two didn't work, achieved 80%
accuracy”
– Manager: “Ok, so….what do we do?”
• “Can you properly tell a story with the data, and
properly persuade people?”
• "For my team, engineering/stats skills need to be
good, not great."
hackers analysts
30. If I had to pick 2…
• Experimental Design
– How to design a statistical test?
– How to interpret significance of a test?
– A/B tests
– More complicated sampling methods
– Sources of bias
– Skewed data
• SQL and Databases
– Mentioned in nearly every DS job description
– Why? Easy scalability, production data sources, IT integration
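A minimal sketch of the kind of experimental-design question listed above: is the difference between two conversion rates in an A/B test statistically significant? This uses a standard two-proportion z-test; the counts are invented for illustration.

```python
# Two-proportion z-test for an A/B test (invented counts).
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: the two proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 120/1000 conversions for variant A vs. 90/1000 for variant B
z, p = two_proportion_z(120, 1000, 90, 1000)
# z is positive and p is below 0.05, so the difference is significant at 5%
```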
32.
http://escience.washington.edu
Data Scientist and Research Scientist positions available
Who We Are Join Us
Editor’s Notes
I want to talk about not just partnerships, but more broadly about the fact that the needs of industry and academia are becoming aligned, and what this alignment means for science.
Institutional change rather than specific research projects
It used to be a lot harder to have this conversation about data-intensive science.
As data-intensive science and technology has moved to the forefront of attention
Jake Vanderplas, our Director of Research in the Physical Sciences, wrote a piece about the brain drain, making a couple of key points.
The argument goes like this:
…
Data-intensive implies software-intensive. Research has become data-intensive and therefore software intensive.
Jake is exemplary of Pi-shaped-ness: A PhD in Astronomy, a postdoc in Computer Science, and is now a Data Scientist at large working deeply in Astronomy, Machine Learning, and Open Source software.
The title and message of the article emphasize the potential negative effects of these trends: as the skills required by industry and academia align, there is a greater draw away from science.
We use this device to talk about this idea: the pi-shaped researcher.
Academia is adapting to incentivize and reward software development activities
Industry is adapting to incentivize and reward statistical rigor and data-driven decision-making
There are even organizations explicitly advancing the brain drain: the Insight Data Science Fellows Program positions those with advanced degrees from other disciplines for data scientist jobs.
Other examples exist, including the Biotechnology and Life Science Advising group (BALSA): not just data science, but designed to help prepare students for academic and non-academic career paths.
Maybe this isn’t so bad:
1) We produce way too many PhDs
2) PhDs in many fields have many of the raw materials needed to become data scientists.
The problem is that we only have nicknames for it:
“Data Jujitsu”
“Data Wrangling”
“Data Munging”
Matrices and linear algebra make a terrible programming model, but there’s just so god damn much math that has been developed around them that they’re here to stay.
the functional programming crowd has been poised to solve all the world’s ills for 60 years, but they tend to have trouble pulling their heads out of their own navels long enough to solve someone’s actual problem in practice
objects and methods are great for building software systems, but get in the way for data analysis
files and scripts aren’t really data analysis – they are low-level operating system concepts
data frames are just relations
key-value pairs -- I’ll talk more about this in a bit
Scale
“While the community was skeptical that this new method could possibly outperform hand-coding, it reduced the number of programming statements necessary to operate a machine by a factor of 20, and quickly gained acceptance. “
“Relational model was buggy and slow, but you only had to write 5% of the code you used to have to write”
R and files vs. databases
Hadoop and friends vs. databases
God created …. Codd created….
In November 2012, Nate Silver predicted the electoral college map precisely.
He’ll be the first one to tell you that the methods used were straightforward: look at what worked in the past, and use it to predict the future. In this case, averages of state polls have historically done a great job, and that is what Nate Silver used.
Perhaps two important takeaways:
1) simple methods and good data are powerful – the right answer does not depend on sophisticated techniques.
2) Most of Silver’s effort went into communicating his results: creating data products such as maps, carefully modeling the uncertainty (which can and did require some mathematical sophistication), and blogging about his reasoning.
Simple methods, and the importance of communication: these themes will come up over and over.
(granted we had a minute for Bill (clearly Bill) to describe this new eScience movement)
We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve.
Essentially, we want to remove the speed-bump of data handling from the scientists.
Our collaborators tell us that loading data into memory with R is the major bottleneck.
It actually changes the science they can do:
I would say that we can start answering questions about macro-ecology (study of relationships between organisms and their environment at large spatial scales).
emailing files, using spreadsheets, cleaning by direct inspection
We looked at 20+ job descriptions for data scientists. As you can imagine, there was lots of variation; the only common requirement was SQL.