Bill Howe, PhD
Director of
Research, Scalable Data
Analytics
University of Washington
eScience Institute
Big Data Curricula at the
University of Washington
eScience Institute
8/7/2013 Bill Howe, UW 1
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
1. Theory (last 2000 yrs)
2. Experiment (last 200 yrs)
3. Simulation (last 50 yrs)
4. Data-Driven Discovery (last 5 yrs)
The University of Washington
eScience Institute
• Rationale
– The exponential increase in sensors is transitioning all fields of science
and engineering from data-poor to data-rich
– As a result, the techniques and technologies of data science must be
widely practiced and widely adopted
• Mission
– Advance the forefront of research both in modern data science
techniques and technologies, and in the fields that depend upon them
• Strategy
– Provide an umbrella organization for Big Data activities at UW and
beyond (new curricula, collaborations, funding sources, hiring practices)
– Bootstrap a national network of partners and peer institutes
– Attract, develop, and retain “Pi-shaped people”
π-shaped researchers
Broad in many areas; deep in at least two
UW Data Science Education Efforts
Students Non-Students
CS/Informatics Non-Major
professionals researchers
undergrads grads undergrads grads
UWEO Data Science Certificate
Graduate Certificate in Big Data
CS Data Management Courses
eScience workshops
Intro to data programming
eScience Masters (planned)
MOOC: Intro to Data Science
Incubator: On-the-job-training
Previous courses:
Scientific Data Management, Graduate CS, Summer 2006, Portland State University
Scientific Data Management, Graduate CS, Spring 2010, University of Washington
Three Activities
• Massive Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science Projects
• Other activities I won’t discuss
– Undergraduate “Data Wizardry” Courses
– 2-day Bootcamps in Python, SQL, GitHub, …
– Certificate Programs in Data Science
– Hackathons
• 8600 completed all programming assignments
• 7000 earned a certificate
Syllabus
• Data Science Landscape (~1 week)
• Data Manipulation at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Analytics
– Statistics Pearls (~1 week)
– Machine Learning Pearls (~1 week)
• Visualization (~1 week)
This Course, positioned along four axes: tools vs. abstractions, desktop vs. cloud, structures vs. statistics, hackers vs. analysts
What are the abstractions of
data science?
tools abstr.
“Data Jujitsu”
“Data Wrangling”
“Data Munging”
Translation: “We have no idea what
this is all about”
matrices and linear algebra?
relations and relational algebra?
objects and methods?
files and scripts?
data frames and functions?
What are the abstractions of
data science?
tools abstr.
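As one hedged illustration of the question on this slide, the same per-group total can be written against two of the listed abstractions: relations (SQL through Python's built-in sqlite3 module) and files-and-scripts (a plain loop). The table name `obs`, its columns, and the sample rows are invented for illustration and do not come from the talk.

```python
import sqlite3
from collections import defaultdict

# Hypothetical observations as (species, count) pairs, invented for illustration.
ROWS = [("salmon", 3), ("orca", 1), ("salmon", 5), ("heron", 2)]


def totals_relational(rows):
    """Relational view: load the rows into a relation, then GROUP BY in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (species TEXT, n INTEGER)")
    conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    # Each result row is a (species, total) pair, so dict() can consume it directly.
    result = dict(conn.execute("SELECT species, SUM(n) FROM obs GROUP BY species"))
    conn.close()
    return result


def totals_script(rows):
    """Script view: an explicit loop that accumulates totals record by record."""
    totals = defaultdict(int)
    for species, n in rows:
        totals[species] += n
    return dict(totals)


if __name__ == "__main__":
    # Both abstractions should describe the same answer.
    print(totals_relational(ROWS) == totals_script(ROWS))
```

The relational version states what is wanted and leaves the execution strategy to the engine; the loop spells out how to compute it. Which of these scales better for a given dataset is exactly the kind of question the slide raises.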
Data Access Hitting a Wall
Current practice based on data download (FTP/GREP)
Will not scale to the datasets of tomorrow
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years.
• Oh, and 1 PB is ~5,000 disks
• At some point you need
– indices to limit search
– parallel data search and analysis
• This is where databases can help
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~$1)
• … 2 days and $1K
• … 3 years and $1M
desk cloud
[slide src: Jim Gray]
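The scaling argument above can be sketched in a few lines. This is a minimal illustration, not Jim Gray's calculation: it assumes a single fixed scan rate taken from the slide's "1 GB in a minute" figure, so it reproduces only the order of magnitude of the slide's numbers. The function names and the sample key list are hypothetical. The second half contrasts a full scan with a lookup through a sorted index, using binary search from Python's standard bisect module.

```python
import bisect

# Assumed sequential scan rate, taken from the slide's "1 GB in a minute".
SCAN_BYTES_PER_SEC = 10**9 / 60


def scan_seconds(size_bytes, rate=SCAN_BYTES_PER_SEC):
    """Time for one linear, grep-style pass over size_bytes at a fixed rate."""
    return size_bytes / rate


def humanize(seconds):
    """Render a duration in the largest unit that fits."""
    for unit, width in (("years", 365 * 86400), ("days", 86400), ("minutes", 60)):
        if seconds >= width:
            return f"{seconds / width:.1f} {unit}"
    return f"{seconds:.1f} seconds"


def linear_find(keys, target):
    """Check every key until a match is found: O(n) comparisons."""
    for i, key in enumerate(keys):
        if key == target:
            return i
    return -1


def indexed_find(sorted_keys, target):
    """Binary search over a sorted key list: O(log n) comparisons."""
    i = bisect.bisect_left(sorted_keys, target)
    if i < len(sorted_keys) and sorted_keys[i] == target:
        return i
    return -1


if __name__ == "__main__":
    sizes = (("1 MB", 10**6), ("1 GB", 10**9), ("1 TB", 10**12), ("1 PB", 10**15))
    for label, size in sizes:
        print(label, humanize(scan_seconds(size)))

    keys = list(range(0, 2_000_000, 2))  # hypothetical sorted keys
    print(linear_find(keys, 1_999_998) == indexed_find(keys, 1_999_998))
```

Scan time grows linearly with data size, while the indexed lookup grows only logarithmically with the number of keys, which is the slide's case for indices and for databases more generally.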
US faces shortage of 140,000 to 190,000
people “with deep analytical skills, as well
as 1.5 million managers and analysts with
the know-how to use the analysis of big
data to make effective decisions.”
-- McKinsey Global Institute
hackers analysts
Three types of tasks:
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
“80% of the work”
-- Aaron Kimball
“The other 80% of the work”
-- Aaron Kimball
structs stats
New PhD Track: “Big Data U”
• Open to all departments
• New courses to “level the playing field”
– “Molecular Biology for Computer Scientists” offered this Fall
• Dual advising in two disciplines
• Joint projects leading to multiple theses
– Each methods thesis will include domain impact component
– Each domain thesis will include methods impact component
• Contribution to a shared cyberinfrastructure
– Software engineering experience as a side effect
• “Application Assistantships”
– Like RAs and TAs; focused on solving a concrete problem
Magda Balazinska
Carlos Guestrin
Data Science Incubator: Motivation
• We need the right people
– We produce “builders,” but 99% of them go to industry to
“make people click on ads”
– They aren’t motivated by writing papers
– No viable career path in the academy
• We need the right processes
– Hands-on, extended, intensive experience is required to
produce π-shaped people
– Data-driven discovery requires intensive collaboration
Science Domains
Stats, Computer Science, Applied Math
• “Where’s the funding?”
• “How does this help me write a paper in my field?”
• Thin collaborations; nobody to work on the short-term, high-risk, high-impact “triage” projects
• “Does method X work on dataset Y?”
Domain Labs
Research Programmers
• Expensive; doesn’t scale
• “Code Monkey” – No viable career path
• Can’t attract top people
• No sharing, no community, no cross-pollination
Data Science Incubator: Structure
• Recruit top-flight data science talent
• Give them autonomy to select collaborations and projects
• Promote them according to “altmetrics” and project impact
– “Data Scientist” → “Senior Data Scientist” → “Technical Fellow”
– “Data Science Fellows”
• Perhaps non-tenure, but 3-5 year commitments
• Funded with contributions from Academic units, IT,
Libraries, and soft money
Data Science Incubator: Seed Grants
• Domain researchers submit Seed Grant applications
for short, intensive 1-6 month projects
– Reviewed by the Data Scientists themselves
• Awardees send 1+ students, postdocs, staff, or faculty
to come and physically sit in the incubator space X
days per week for the project duration
– Application may or may not include funding for the student
Domain Labs
Incubator
• Data Scientists have their own identity and prestige
• Cross-pollination between disciplines
• Awardees leave with skills and knowledge; become “disciples”
MOOC “Introduction to Data Science”:
https://www.coursera.org/course/datasci
Certificate program:
http://www.pce.uw.edu/courses/data-science-intro
http://escience.washington.edu
billhowe@cs.washington.edu

Big Data Curricula at the UW eScience Institute, JSM 2013

Editor's Notes

  1. Observe the world vs. observe the data. Instruments vs. algorithms.
  2. So in part as an attempt to relate “eScience” and “data science,” and in part to make sure the idea of data science wasn’t completely taken over by the machine learning people, we ran a massive open online course last spring called Introduction to Data Science. We taught Scalable Databases, MapReduce, Statistics, Machine Learning, Visualization.
  3. “Data Jujitsu,” “Data Wrangling,” “Data Munging”
  4. Our collaborators tell us that loading data into memory with R is the major bottleneck. It actually changes the science they can do: I would say that we can start answering questions about macro-ecology (study of relationships between organisms and their environment at large spatial scales).