SlideShare uma empresa Scribd logo
1 de 32
A Confluence of Big Data Skills in
Academic and Industry R&D
Bill Howe, PhD
Associate Director
University of Washington eScience Institute
The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
“All across our campus, the process of discovery will
increasingly rely on researchers’ ability to extract
knowledge from vast amounts of data… In order to
remain at the forefront, UW must be a leader in
advancing these techniques and technologies, and in
making [them] accessible to researchers in the
broadest imaginable range of fields.”
2005-2008
In other words:
• Data-intensive research will be ubiquitous
• It’s about intellectual infrastructure and software infrastructure,
not only computational infrastructure
http://escience.washington.edu
A 5-year, US $37.8 million cross-institutional
collaboration to create a data science environment
4
2014
5
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
5/7/2015 Bill Howe, UW 6
Jake Vanderplas
5/7/2015 Bill Howe, UW 7
…the new breed of scientist must be a broadly-
trained expert in statistics, in computing, in
algorithm-building, in software design
The skills required to be a successful scientific
researcher are increasingly indistinguishable from
the skills required to be successful in industry.
Jake Vanderplas
5/7/2015 Bill Howe, UW 8
“Data Science” is not the only example…
• Strong Math + PhD  Quant, on Wall Street
• Strong “Data” + PhD  Data Scientist, anywhere
5/7/2015 Bill Howe, UW 9
increased
statistical rigor and
data-driven
decision-making
increased
sophistication in
the use and
development of
software
Industry
Academia
5/7/2015 Bill Howe, UW 11
Maximiliaan Schillebeeckx, Brett Maricque & Cory Lewis
Nature Biotechnology 31, 938–941 (2013) doi:10.1038/nbt.2706
WHAT SKILLS ARE NEEDED?
5/7/2015 Bill Howe, UW 13
5/7/2015 Bill Howe, UW 14
Drew Conway’s Data Science Venn Diagram
5/7/2015 Bill Howe, UW 15
5/7/2015 Bill Howe, UW 18
“I worry that the Data Scientist role is like
the mythical “webmaster” of the 90s:
master of all trades.”
-- Aaron Kimball, CTO of Zymergen,
formerly CTO of Wibidata, formerly
co-founder of Cloudera
5/7/2015 Bill Howe, UW eScience 19
tools principles
desktop cloud
data structures statistics
hackers analysts
What to look for in data science skills
5/7/2015 Bill Howe, UW 20
Cambrian Explosion of Big Data Systems tools principles
5/7/2015 Bill Howe, UW 22
What are the abstractions of
data science?
“Data Jujitsu”
“Data Wrangling”
“Data Munging”
Translation: “We have no idea what
this is all about”
tools principles
5/7/2015 Bill Howe, UW 23
1850s: matrices and linear algebra (today: engineers and scientists)
1950s: arrays and custom algorithms (today: C/Fortran performance junkies)
1950s: s-expressions and pure functions (today: language purists)
1960s: objects and methods (today: software engineers)
1970s: files and scripts (today: system administrators)
1970s: relations and relational algebra (today: industry data pros)
1980s: data frames and functions (today: statisticians)
2000s: key-value pairs + one of the above (today: NoSQL hipsters)
But what are the abstractions of
data science?
tools principles
5/7/2015 Bill Howe, UW 24
“80% of analytics is sums and averages”
-- Aaron Kimball, wibidata
data structures statistics
“The intuition behind this ought to be very simple: Mr. Obama
is maintaining leads in the polls in Ohio and other states that
are sufficient for him to win 270 electoral votes.”
Nate Silver, Oct. 26, 2012
“…the argument we’re making is exceedingly simple. Here it
is: Obama’s ahead in Ohio.”
Nate Silver, Nov. 2, 2012
“The bar set by the competition was invitingly low. Someone could
look like a genius simply by doing some fairly basic research into
what really has predictive power in a political campaign.”
Nate Silver, Nov. 10, 2012
DailyBeast
fivethirtyeight.com
fivethirtyeight.com
source: randy stewart
Nate Silver
data structures statistics
Data Science Workflow
5/7/2015 Bill Howe, UW 26
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
“80% of the work”
-- Aaron Kimball
“The other 80% of the work”
Academia puts far too much
emphasis on this step
data structures statistics
Problem
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
data structures statistics
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of
scale, file handling, and feature
engineering.
Martin Kircher,
Genome SciencesWhy?
3k NSF postdocs in 2010
$50k / postdoc
at least 50% overhead
maybe $75M annually
at NSF alone?
desk cloud
…up to 1 GB (volume)
…up to 10 data sources (variety)
…up to 1% churn/day (velocity)
…up to 1% bad data (veracity)
…up to 10 collaborators
5/7/2015 Bill Howe, UW 30/57
With “manual” approaches,
you can comfortably handle…
But we’re seeing a 10x-100x increase in every
dimension, even under modest assumptions
desk cloud data
structures
statistics
US faces shortage of 140,000 to 190,000 people “with
deep analytical skills, as well as 1.5 million managers
and analysts with the know-how to use the analysis of
big data to make effective decisions.”
5/7/2015 Bill Howe, UW 31
--Mckinsey Global Institute
hackers analysts
Where do you store your data?
src: Conversations with Research Leaders (2008)
src: Faculty Technology Survey (2011)
5%
6%
12%
27%
41%
66%
87%
0% 20% 40% 60% 80% 100%
Other
Department-managed data center
External (non-UW) data center
Server managed by research group
Department-managed server
External device (hard drive, thumb drive)
My computer
Lewis et al 2011
Conversations with DS Hiring Managers
• “How to ask the right questions and communicate
results”
– DS: "I tried three methods, two didn't work, achieved 80%
accuracy”
– Manager: “Ok, so….what do we do?”
• “Can you properly tell a story with the data, and
properly persuade people?”
• "For my team, engineering/stats skills need to be
good, not great."
5/7/2015 Bill Howe, UW 35
hackers analysts
If I had to pick 2…
• Experimental Design
– How to design a statistical test?
– How to interpret significance of a test?
– A/B tests
– More complicated sampling methods
– Sources of bias
– Skewed data
• SQL and Databases
– Mentioned on nearly evey DS job description
– Why? Easy scalability, production data sources, IT integration
5/7/2015 Bill Howe, UW 36
http://cds.nyu.edu/ http://bids.berkeley.edu/ http://escience.washington.edu/
5/7/2015 Bill Howe, UW 38
http://escience.washington.edu
Data Scientist and Research Scientist positions available
Who We Are  Join Us

Mais conteúdo relacionado

Mais procurados

Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
 
Tragedy of the Data Commons (ODSC-East, 2021)
Tragedy of the Data Commons (ODSC-East, 2021)Tragedy of the Data Commons (ODSC-East, 2021)
Tragedy of the Data Commons (ODSC-East, 2021)James Hendler
 
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Amit Sheth
 
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Artificial Intelligence Institute at UofSC
 
Visual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceVisual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceUniversity of Washington
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Broad Data (India 2015)
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)James Hendler
 
The Future(s) of the World Wide Web
The Future(s) of the World Wide WebThe Future(s) of the World Wide Web
The Future(s) of the World Wide WebJames Hendler
 
Introduction to Big Data and Data Science
Introduction to Big Data and Data ScienceIntroduction to Big Data and Data Science
Introduction to Big Data and Data ScienceFeyzi R. Bagirov
 
Facilitating Web Science Collaboration through Semantic Markup
Facilitating Web Science Collaboration through Semantic MarkupFacilitating Web Science Collaboration through Semantic Markup
Facilitating Web Science Collaboration through Semantic MarkupJames Hendler
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchUniversity of Washington
 
The Semantic Web: It's for Real
The Semantic Web: It's for RealThe Semantic Web: It's for Real
The Semantic Web: It's for RealJames Hendler
 
Citizen Sensor Data Mining, Social Media Analytics and Applications
Citizen Sensor Data Mining, Social Media Analytics and ApplicationsCitizen Sensor Data Mining, Social Media Analytics and Applications
Citizen Sensor Data Mining, Social Media Analytics and ApplicationsAmit Sheth
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?Matthew Lease
 

Mais procurados (20)

Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
Tragedy of the Data Commons (ODSC-East, 2021)
Tragedy of the Data Commons (ODSC-East, 2021)Tragedy of the Data Commons (ODSC-East, 2021)
Tragedy of the Data Commons (ODSC-East, 2021)
 
Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...Smart Data - How you and I will exploit Big Data for personalized digital hea...
Smart Data - How you and I will exploit Big Data for personalized digital hea...
 
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
Knowledge-empowered Probabilistic Graphical Models for Physical-Cyber-Social ...
 
Broad Data
Broad DataBroad Data
Broad Data
 
Visual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory ScienceVisual Data Analytics in the Cloud for Exploratory Science
Visual Data Analytics in the Cloud for Exploratory Science
 
Web and Complex Systems Lab @ Kno.e.sis
Web and Complex Systems Lab @ Kno.e.sisWeb and Complex Systems Lab @ Kno.e.sis
Web and Complex Systems Lab @ Kno.e.sis
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
Broad Data (India 2015)
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)
 
The Future(s) of the World Wide Web
The Future(s) of the World Wide WebThe Future(s) of the World Wide Web
The Future(s) of the World Wide Web
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 
Introduction to Big Data and Data Science
Introduction to Big Data and Data ScienceIntroduction to Big Data and Data Science
Introduction to Big Data and Data Science
 
Facilitating Web Science Collaboration through Semantic Markup
Facilitating Web Science Collaboration through Semantic MarkupFacilitating Web Science Collaboration through Semantic Markup
Facilitating Web Science Collaboration through Semantic Markup
 
2015 Kno.e.sis Center Annual Review
2015 Kno.e.sis Center Annual Review2015 Kno.e.sis Center Annual Review
2015 Kno.e.sis Center Annual Review
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible Research
 
The Semantic Web: It's for Real
The Semantic Web: It's for RealThe Semantic Web: It's for Real
The Semantic Web: It's for Real
 
Citizen Sensor Data Mining, Social Media Analytics and Applications
Citizen Sensor Data Mining, Social Media Analytics and ApplicationsCitizen Sensor Data Mining, Social Media Analytics and Applications
Citizen Sensor Data Mining, Social Media Analytics and Applications
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
 

Semelhante a Big Data Talent in Academic and Industry R&D

Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactDr. Sunil Kr. Pandey
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedPhilip Bourne
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Introduction to Data Science 1113.pptx
Introduction to Data Science 1113.pptxIntroduction to Data Science 1113.pptx
Introduction to Data Science 1113.pptxmark828
 
Real-time applications of Data Science.pptx
Real-time applications  of Data Science.pptxReal-time applications  of Data Science.pptx
Real-time applications of Data Science.pptxshalini s
 
Research Metadata Mechanics - Simon Porter
Research Metadata Mechanics - Simon PorterResearch Metadata Mechanics - Simon Porter
Research Metadata Mechanics - Simon PorterCASRAI
 
Biomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not AloneBiomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not AlonePhilip Bourne
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your RoleJay Gendron
 
Semantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextSemantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextMurad Daryousse
 
A Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: ChallengesA Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: ChallengesDr. Amarjeet Singh
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data ScienceAndrew Gardner
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data ScienceTJ Stalcup
 
Rising tide of data update 20171024
Rising tide of data update 20171024Rising tide of data update 20171024
Rising tide of data update 20171024Keith Russell
 
Rising tide of data update
Rising tide of data update Rising tide of data update
Rising tide of data update ARDC
 
Mapping (big) data science (15 dec2014)대학(원)생
Mapping (big) data science (15 dec2014)대학(원)생Mapping (big) data science (15 dec2014)대학(원)생
Mapping (big) data science (15 dec2014)대학(원)생Han Woo PARK
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabadKelly Technologies
 
Introduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxIntroduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxdatapro2
 
Introduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxIntroduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxSanmati Jain
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science TJ Stalcup
 

Semelhante a Big Data Talent in Academic and Industry R&D (20)

Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Introduction to Data Science 1113.pptx
Introduction to Data Science 1113.pptxIntroduction to Data Science 1113.pptx
Introduction to Data Science 1113.pptx
 
Real-time applications of Data Science.pptx
Real-time applications  of Data Science.pptxReal-time applications  of Data Science.pptx
Real-time applications of Data Science.pptx
 
Research Metadata Mechanics - Simon Porter
Research Metadata Mechanics - Simon PorterResearch Metadata Mechanics - Simon Porter
Research Metadata Mechanics - Simon Porter
 
Biomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not AloneBiomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not Alone
 
00-01 DSnDA.pdf
00-01 DSnDA.pdf00-01 DSnDA.pdf
00-01 DSnDA.pdf
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your Role
 
Semantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data ContextSemantic Web Investigation within Big Data Context
Semantic Web Investigation within Big Data Context
 
A Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: ChallengesA Survey on Big Data Analytics: Challenges
A Survey on Big Data Analytics: Challenges
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data Science
 
Rising tide of data update 20171024
Rising tide of data update 20171024Rising tide of data update 20171024
Rising tide of data update 20171024
 
Rising tide of data update
Rising tide of data update Rising tide of data update
Rising tide of data update
 
Mapping (big) data science (15 dec2014)대학(원)생
Mapping (big) data science (15 dec2014)대학(원)생Mapping (big) data science (15 dec2014)대학(원)생
Mapping (big) data science (15 dec2014)대학(원)생
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Introduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxIntroduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptx
 
Introduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptxIntroduction to Data Science 5-13.pptx
Introduction to Data Science 5-13.pptx
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
 

Mais de University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)University of Washington
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceUniversity of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureUniversity of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe University of Washington
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013University of Washington
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareUniversity of Washington
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersUniversity of Washington
 
Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce University of Washington
 
A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceUniversity of Washington
 
Research Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisResearch Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisUniversity of Washington
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceUniversity of Washington
 

Mais de University of Washington (16)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013
 
Data science curricula at UW
Data science curricula at UWData science curricula at UW
Data science curricula at UW
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
 
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale ClustersHaLoop: Efficient Iterative Processing on Large-Scale Clusters
HaLoop: Efficient Iterative Processing on Large-Scale Clusters
 
Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce
 
A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScience
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
Research Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and AnalysisResearch Dataspaces: Pay-as-you-go Integration and Analysis
Research Dataspaces: Pay-as-you-go Integration and Analysis
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
 

Último

Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 

Último (20)

Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 

Big Data Talent in Academic and Industry R&D

  • 1. A Confluence of Big Data Skills in Academic and Industry R&D Bill Howe, PhD Associate Director University of Washington eScience Institute
  • 2. The Fourth Paradigm 1. Empirical + experimental 2. Theoretical 3. Computational 4. Data-Intensive Jim Gray
  • 3. “All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.” 2005-2008 In other words: • Data-intensive research will be ubiquitous • It’s about intellectual infrastructure and software infrastructure, not only computational infrastructure http://escience.washington.edu
  • 4. A 5-year, US $37.8 million cross-institutional collaboration to create a data science environment 4 2014
  • 5. 5 “It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research “The greatest minds of my generation are trying to figure out how to make people click on ads” -- Jeff Hammerbacher, co-founder, Cloudera
  • 6. 5/7/2015 Bill Howe, UW 6 Jake Vanderplas
  • 7. 5/7/2015 Bill Howe, UW 7 …the new breed of scientist must be a broadly- trained expert in statistics, in computing, in algorithm-building, in software design The skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry. Jake Vanderplas
  • 9. “Data Science” is not the only example… • Strong Math + PhD  Quant, on Wall Street • Strong “Data” + PhD  Data Scientist, anywhere 5/7/2015 Bill Howe, UW 9
  • 10. increased statistical rigor and data-driven decision-making increased sophistication in the use and development of software Industry Academia
  • 12. Maximiliaan Schillebeeckx, Brett Maricque & Cory Lewis Nature Biotechnology 31, 938–941 (2013) doi:10.1038/nbt.2706
  • 13. WHAT SKILLS ARE NEEDED? 5/7/2015 Bill Howe, UW 13
  • 15. Drew Conway’s Data Science Venn Diagram 5/7/2015 Bill Howe, UW 15
  • 16. 5/7/2015 Bill Howe, UW 18 “I worry that the Data Scientist role is like the mythical “webmaster” of the 90s: master of all trades.” -- Aaron Kimball, CTO of Zymergen, formerly CTO of Wibidata, formerly co-founder of Cloudera
  • 17. 5/7/2015 Bill Howe, UW eScience 19 tools principles desktop cloud data structures statistics hackers analysts What to look for in data science skills
  • 18. 5/7/2015 Bill Howe, UW 20 Cambrian Explosion of Big Data Systems tools principles
  • 19. 5/7/2015 Bill Howe, UW 22 What are the abstractions of data science? “Data Jujitsu” “Data Wrangling” “Data Munging” Translation: “We have no idea what this is all about” tools principles
  • 20. 5/7/2015 Bill Howe, UW 23 1850s: matrices and linear algebra (today: engineers and scientists) 1950s: arrays and custom algorithms (today: C/Fortran performance junkies) 1950s: s-expressions and pure functions (today: language purists) 1960s: objects and methods (today: software engineers) 1970s: files and scripts (today: system administrators) 1970s: relations and relational algebra (today: industry data pros) 1980s: data frames and functions (today: statisticians) 2000s: key-value pairs + one of the above (today: NoSQL hipsters) But what are the abstractions of data science? tools principles
  • 21. 5/7/2015 Bill Howe, UW 24 “80% of analytics is sums and averages” -- Aaron Kimball, wibidata data structures statistics
  • 22. “The intuition behind this ought to be very simple: Mr. Obama is maintaining leads in the polls in Ohio and other states that are sufficient for him to win 270 electoral votes.” Nate Silver, Oct. 26, 2012 “…the argument we’re making is exceedingly simple. Here it is: Obama’s ahead in Ohio.” Nate Silver, Nov. 2, 2012 “The bar set by the competition was invitingly low. Someone could look like a genius simply by doing some fairly basic research into what really has predictive power in a political campaign.” Nate Silver, Nov. 10, 2012 DailyBeast fivethirtyeight.com fivethirtyeight.com source: randy stewart Nate Silver data structures statistics
  • 23. Data Science Workflow 5/7/2015 Bill Howe, UW 26 1) Preparing to run a model 2) Running the model 3) Interpreting the results Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging “80% of the work” -- Aaron Kimball “The other 80% of the work” Academia puts far too much emphasis on this step data structures statistics
  • 24. Problem How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%” data structures statistics
  • 25. “[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used). In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants) So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format. I guess in total [I spent] 6 months [on this project].” At least 3 months on issues of scale, file handling, and feature engineering. Martin Kircher, Genome SciencesWhy? 3k NSF postdocs in 2010 $50k / postdoc at least 50% overhead maybe $75M annually at NSF alone? desk cloud
  • 26. …up to 1 GB (volume) …up to 10 data sources (variety) …up to 1% churn/day (velocity) …up to 1% bad data (veracity) …up to 10 collaborators 5/7/2015 Bill Howe, UW 30/57 With “manual” approaches, you can comfortably handle… But we’re seeing a 10x-100x increase in every dimension, even under modest assumptions desk cloud data structures statistics
  • 27. US faces shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” 5/7/2015 Bill Howe, UW 31 --Mckinsey Global Institute hackers analysts
  • 28. Where do you store your data? src: Conversations with Research Leaders (2008) src: Faculty Technology Survey (2011) 5% 6% 12% 27% 41% 66% 87% 0% 20% 40% 60% 80% 100% Other Department-managed data center External (non-UW) data center Server managed by research group Department-managed server External device (hard drive, thumb drive) My computer Lewis et al 2011
  • 29. Conversations with DS Hiring Managers • “How to ask the right questions and communicate results” – DS: "I tried three methods, two didn't work, achieved 80% accuracy” – Manager: “Ok, so….what do we do?” • “Can you properly tell a story with the data, and properly persuade people?” • "For my team, engineering/stats skills need to be good, not great." 5/7/2015 Bill Howe, UW 35 hackers analysts
  • 30. If I had to pick 2… • Experimental Design – How to design a statistical test? – How to interpret significance of a test? – A/B tests – More complicated sampling methods – Sources of bias – Skewed data • SQL and Databases – Mentioned on nearly evey DS job description – Why? Easy scalability, production data sources, IT integration 5/7/2015 Bill Howe, UW 36
  • 32. 5/7/2015 Bill Howe, UW 38 http://escience.washington.edu Data Scientist and Research Scientist positions available Who We Are  Join Us

Notas do Editor

  1. I want to talk about not just partnerships, but more broadly about the fact that the needs of industry and academia are becoming aligned, and what this alignment means for science.
  2. 2
  3. Institutional change rather than specific research projects
  4. It used to be a lot harder to have this conversation about data-intensive science. As data-intensive science and technology has moved to the forefront of attention
  5. Jake Vanderplas, our Director of Research in the Physical S wrote a piece about the brain drain, making a couple of key points
  6. The argument goes like this: … Data-intensive implies software-intensive. Research has become data-intensive and therefore software intensive. Jake is exemplary of Pi-shaped-ness: A PhD in Astronomy, a postdoc in Computer Science, and is now a Data Scientist at large working deeply in Astronomy, Machine Learning, and Open Source software. The title and messge of the article emphasize the potential negative effects of these trends: as the skills required by industry and academia align, there is a greater draw away from science.
  7. We use this device to talk about this idea: the pi-shaped researcher.
  8. Academia is adapting to incentivize and reward software development activities Industry is adapting to incentivize and reward statistical rigor and data-driven decision-making
  9. There are even organizations explicitly advancing the brain drain: Insight Data Science Fellows positions those with advanced degrees from other disciplines for data scientist jobs. Other examples exist, including Biotechnology and Life Science Advising group (BALSA). Not just data science, but designed to help prepare students for academic and non-acdemic career paths. Maybe this isn’t so bad: 1) We produce way too many PhDs 2) PhDs in many fields have many of the raw materials needed to become data scientists. The problem is
  10. “Data Jujitsu” “Data Wrangling” “Data Munging”
  11. matrices and linear algebra is a terrible programming model, but there’s just so god damn much math that has been developed around them, that it’s here to stay. the functional programming crowd has been poised to solve all the world’s ills for 60 years, but they tend to have trouble pulling their heads out of their own navels long enough to solve someone’s actual problem in practice objects and methods are great for building software systems, but get in the way for data analysis files and scripts aren’t really data analysis – they are low-level operating system concepts data frames are just relations key-value pairs -- I’ll talk more about this in a bit Scale “While the community was skeptical that this new method could possibly outperform hand-coding, it reduced the number of programming statements necessary to operate a machine by a factor of 20, and quickly gained acceptance. “ “Relational model was buggy and slow, but you only had to write 5% of the code you used to have to write”
  12. R and files vs. databases Hadoop and friends vs. databases God created …. Codd created….
  13. In november 2012, Nate Silver predicted the electoral college map precisely. He’l be the first one to tell you that the methods used were straightforward: Look at what worked in the past, and use it to predict the future. In this case, the average of state polls have historically done a great job – this is what Nate Silver used. Perhaps two important takeaways: 1) simple methods and good data are powerful – the right answer does not depend on sophisticated techniques. 2) Most of Silver’s effort went into communicating his results: creating data products such as maps, carefully modeling the uncertainty (which can and did require some mathematical sophistication), and blogging about his reasoning. Simple methods, and the importance of communication: these themes will come up over and over.
  14. (granted we had a minute for Bill (clearly Bill) to describe this new eScience movement) We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed-bump of data handling from the scientists.
  15. Our collaborators tell us that loading data into memory with R is the major bottleneck. It actually changes the science they can do: I would say that we can start answering questions about macro-ecology (study of relationships between organisms and their environment at large spatial scales).
  16. emailing files, using spreadsheets, cleaning by direct inspection
  17. D
  18. We looked at 20+ job descriptions for Data scientists. As you can imagine, lots oThe only common requirement was SQL.