Bill Howe, PhD
Introduction to Data Science
This morning
• Context for “Data Science”
• Databases and Relational Algebra
• NoSQL
6/17/2015 Bill Howe, UW 2
http://commons.wikimedia.org/wiki/File:ElectoralCollege2012.svg
(public domain)
“The intuition behind this ought to be very simple: Mr. Obama
is maintaining leads in the polls in Ohio and other states that
are sufficient for him to win 270 electoral votes.”
Nate Silver, Oct. 26, 2012
“…the argument we’re making is exceedingly simple. Here it
is: Obama’s ahead in Ohio.”
Nate Silver, Nov. 2, 2012
“The bar set by the competition was invitingly low. Someone could
look like a genius simply by doing some fairly basic research into
what really has predictive power in a political campaign.”
Nate Silver, Nov. 10, 2012
DailyBeast
fivethirtyeight.com
source: randy stewart
Nate Silver
“…the biggest win came from good old SQL on a Vertica data
warehouse and from providing access to data to dozens of
analytics staffers who could follow their own curiosity and
distill and analyze data as they needed.”
Dan Woods
Jan 13 2013, CITO Research
“The decision was made to have Hadoop do the aggregate generations
and anything not real-time, but then have Vertica to answer sort of
‘speed-of-thought’ queries about all the data.”
Josh Hendler, CTO of H & K Strategies
Related: Obama campaign’s data-driven ground game
“In the 21st century, the candidate with [the] best data,
merged with the best messages dictated by that data, wins.”
Andrew Rasiej, Personal Democracy Forum
Hurricane Sandy
http://rpubs.com/JoFrhwld/sandy
Josef Fruehwald
Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of
Emotions in 20th Century Books. PLoS ONE 8(3): e59030.
doi:10.1371/journal.pone.0059030
1) Convert all the digitized books in the 20th century into n-grams
(Thanks, Google!)
(http://books.google.com/ngrams/)
2) Label each 1-gram (word) with a mood score.
(Thanks, WordNet!)
3) Count the occurrences of each mood word
A 1-gram: “yesterday”
A 5-gram: “analysis is often described as”
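The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny MOOD_SCORES lexicon and the helper names (ngrams, mood_counts, mood_score) are hypothetical stand-ins, not the WordNet-derived word lists or the Google n-gram data used in the paper.

```python
from collections import Counter

# Hypothetical mood lexicon; the paper uses WordNet-based word lists instead.
MOOD_SCORES = {"happy": 1, "joy": 1, "sad": -1, "fear": -1}


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mood_counts(text):
    """Count how often each mood word occurs as a 1-gram in the text."""
    counts = Counter(ngrams(text.lower().split(), 1))
    return {word: counts[word] for word in MOOD_SCORES if counts[word]}


def mood_score(text):
    """Sum the mood scores of all mood words, weighted by their counts."""
    return sum(MOOD_SCORES[w] * c for w, c in mood_counts(text).items())


if __name__ == "__main__":
    print(ngrams("analysis is often described as".split(), 5))
    print(mood_counts("a happy day and a happy night but a sad morning"))
```

Real n-gram corpora already ship as (n-gram, year, count) rows, so in practice step 3 is a lookup and a sum rather than a pass over raw text.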
…
2. Michel J-P, Shen YK, Aiden AP, Veres A, Gray MK, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331: 176–182. doi: 10.1126/science.1199644.
3. Lieberman E, Michel J-P, Jackson J, Tang T, Nowak MA (2007) Quantifying the evolutionary dynamics of language. Nature 449: 713–716. doi: 10.1038/nature06137.
4. Pagel M, Atkinson QD, Meade A (2007) Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449: 717–720. doi: 10.1038/nature06176.
…
6. DeWall CN, Pond RS Jr, Campbell WK, Twenge JM (2011) Tuning in to Psychological Change: Linguistic Markers of Psychological Traits and Emotions Over Time in Popular U.S. Song Lyrics. Psychology of Aesthetics, Creativity and the Arts 5: 200–207. doi: 10.1037/a0023195.
…
What is Data Science?
• Fortune
– “Hot New Gig in Tech”
• Hal Varian, Google’s Chief Economist, NYT, 2009:
– “The next sexy job”
– “The ability to take data—to be able to understand it, to
process it, to extract value from it, to visualize it, to
communicate it—that’s going to be a hugely important skill.”
• Mike Driscoll, CEO of metamarkets:
– “Data science, as it's practiced, is a blend of Red-Bull-fueled
hacking and espresso-inspired statistics.”
– “Data science is the civil engineering of data. Its acolytes
possess a practical knowledge of tools & materials, coupled
with a theoretical understanding of what's possible.”
Drew Conway’s Data Science Venn Diagram
What do data scientists do?
“They need to find nuggets of truth in data and then explain it to the
business leaders”
-- Richard Snee, EMC
Data scientists “tend to be ‘hard scientists’, particularly physicists, rather
than computer science majors. Physicists have a strong mathematical
background, computing skills, and come from a discipline in which survival
depends on getting the most from the data. They have to think about the
big picture, the big problem.”
-- DJ Patil, Chief Scientist at LinkedIn
Mike Driscoll’s three sexy skills of data geeks
• Statistics
– traditional analysis
• Data Munging
– parsing, scraping, and formatting data
• Visualization
– graphs, tools, etc.
“Data Science refers to an emerging area of work
concerned with the collection, preparation, analysis,
visualization, management and preservation of large
collections of information.”
Jeffrey Stanton
Syracuse University School of Information Studies
An Introduction to Data Science
Data Science is about Data Products
• “Data-driven apps”
– Spellchecker
– Machine Translator
• Interactive visualizations
– Google flu application
– Global Burden of Disease
• Online Databases
– Enterprise data warehouse
– Sloan Digital Sky Survey
(Mike Loukides)
Data science is about building data
products, not just answering questions
Data products empower others to use
the data.
May help communicate your results
(e.g., Nate Silver’s maps)
May empower others to do their own
analysis
(e.g., Global Burden of Disease)
A Typical Data Science Workflow
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
“80% of the work”
-- Aaron Kimball
“The other 80% of the work”
What are the abstractions of
data science?
“Data Jujitsu”
“Data Wrangling”
“Data Munging”
Translation: “We have no idea what
this is all about”
1850s: matrices and linear algebra (today: engineers and scientists)
1950s: arrays and custom algorithms (today: C/Fortran performance junkies)
1950s: s-expressions and pure functions (today: language purists)
1960s: objects and methods (today: software engineers)
1970s: files and scripts (today: system administrators)
1970s: relations and relational algebra (today: large-scale data engineers)
1980s: data frames and functions (today: statisticians)
2000s: key-value pairs + one of the above (today: NoSQL hipsters)
But what are the abstractions of
data science?
DATABASES AND
RELATIONAL ALGEBRA
Relational Database History
Pre-Relational: if your data changed, your application broke.
Early RDBMSs were buggy and slow (and often reviled), but required only 5% of
the application code.
“Activities of users at terminals and most application programs
should remain unaffected when the internal representation of data
is changed and even when some aspects of the external
representation are changed.”
-- Codd 1970
Key Ideas: Programs that manipulate tabular data exhibit an
algebraic structure allowing reasoning and manipulation
independently of physical data representation
Key Idea: “Physical Data Independence”
relations:
SELECT seq
FROM ncbi_sequences
WHERE seq = 'GATTACGATATTA';
files and pointers:
f = fopen("table_file", "rb");
fseek(f, 10030440, SEEK_SET);
while (fread(buf, 1, 8192, f) > 0) {
    if (memcmp(buf, "GATTACGATATTA", 13) == 0) {
        . . .
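A minimal sketch of physical data independence, using Python's built-in sqlite3 module. The table name ncbi_sequences comes from the slide; the id column, the sample rows, and the index name idx_seq are assumptions made for illustration. The point is that the same declarative query returns the same answer before and after the physical layout changes (here, by adding an index).

```python
import sqlite3

# In-memory database with a small, made-up sequence table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ncbi_sequences (id INTEGER PRIMARY KEY, seq TEXT)")
conn.executemany(
    "INSERT INTO ncbi_sequences (seq) VALUES (?)",
    [("GATTACGATATTA",), ("CCGTA",), ("TTAGGC",)],
)

QUERY = "SELECT seq FROM ncbi_sequences WHERE seq = ?"

# Run the query against the table as first stored (a full scan).
before = conn.execute(QUERY, ("GATTACGATATTA",)).fetchall()

# Change the physical representation: add an index on seq.
conn.execute("CREATE INDEX idx_seq ON ncbi_sequences (seq)")

# The query text is untouched; only the access path may differ.
after = conn.execute(QUERY, ("GATTACGATATTA",)).fetchall()

if __name__ == "__main__":
    print(before == after)
```

With the file-and-pointer version, the hard-coded offset and block size would have to be rewritten whenever the storage layout changed.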
Key Idea: An Algebra of Tables
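Operators shown: select, project, join
Other operators: aggregate, union, difference, cross product
Equivalent logical expressions
$\sigma_{p=\text{knows}}(R) \bowtie_{o=s} (\sigma_{p=\text{holdsAccount}}(R) \bowtie_{o=s} \sigma_{p=\text{accountHomepage}}(R))$ (right associative)
$(\sigma_{p=\text{knows}}(R) \bowtie_{o=s} \sigma_{p=\text{holdsAccount}}(R)) \bowtie_{o=s} \sigma_{p=\text{accountHomepage}}(R)$ (left associative)
$\sigma_{p_1=\text{knows} \wedge p_2=\text{holdsAccount} \wedge p_3=\text{accountHomepage}}(R \times R \times R)$ (distributive)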
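The equivalence can be checked on a toy example. The sketch below assumes R is a table of (s, p, o) triples with made-up rows, and that the cross-product form also carries the o = s join conditions, which the slide's third expression leaves implicit. The function names select and join are illustrative, not a library API.

```python
from itertools import product

# Hypothetical triple table R(s, p, o): subject, predicate, object.
R = [
    ("alice", "knows", "bob"),
    ("bob", "holdsAccount", "acct1"),
    ("acct1", "accountHomepage", "http://example.org/bob"),
    ("alice", "knows", "carol"),
    ("carol", "holdsAccount", "acct2"),
]


def select(rows, predicate):
    """Selection on p: keep matching triples, each wrapped as a 1-triple chain."""
    return [(t,) for t in rows if t[1] == predicate]


def join(left, right):
    """Join on o = s: the last object on the left equals the first subject on the right."""
    return [l + r for l in left for r in right if l[-1][2] == r[0][0]]


knows = select(R, "knows")
holds = select(R, "holdsAccount")
home = select(R, "accountHomepage")

right_assoc = join(knows, join(holds, home))
left_assoc = join(join(knows, holds), home)

# Cross product first, then one big selection (join conditions made explicit).
cross_then_select = [
    (a, b, c)
    for a, b, c in product(R, R, R)
    if a[1] == "knows" and b[1] == "holdsAccount" and c[1] == "accountHomepage"
    and a[2] == b[0] and b[2] == c[0]
]

if __name__ == "__main__":
    print(set(right_assoc) == set(left_assoc) == set(cross_then_select))
```

All three plans return the same rows, but the cross-product plan enumerates every triple of rows from R before filtering, which is why an optimizer wants the freedom to pick among equivalent expressions.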
Why do we care? Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
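The rewrite above can be sketched as a small expression-tree simplifier. The tuple encoding and the function names (simplify, count_ops, evaluate) are assumptions chosen for this illustration; the simplifier handles only the four laws listed on the slide and is not a general optimizer.

```python
# Expression trees are nested tuples (op, left, right); leaves are numbers or "z".
ORIGINAL = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)


def simplify(expr):
    """Rewrite bottom-up using the four algebraic laws."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "+" and right == 0:  # law 1: x + 0 = x
        return left
    if op == "/" and right == 1:  # law 2: x / 1 = x
        return left
    if (op == "+" and isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == "*" and right[0] == "*" and left[1] == right[1]):
        # law 3 factors out the shared term; law 4 puts the sum first.
        return ("*", ("+", left[2], right[2]), left[1])
    return (op, left, right)


def count_ops(expr):
    """Number of operator nodes in the tree."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + count_ops(expr[1]) + count_ops(expr[2])


def evaluate(expr, z):
    """Evaluate the tree for a given value of z."""
    if expr == "z":
        return z
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    a, b = evaluate(left, z), evaluate(right, z)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    return a / b


OPTIMIZED = simplify(ORIGINAL)

if __name__ == "__main__":
    print(OPTIMIZED, count_ops(ORIGINAL), count_ops(OPTIMIZED))
```

A relational optimizer works the same way in spirit: it rewrites a query plan into an equivalent but cheaper one using the laws of the algebra.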
So what? RA is now ubiquitous
• Galaxy – “bioinformatics workflows”
“…Operate on Genomics Intervals -> Join”
• Pandas and Blaze: High Performance Arrays in Python
merge(left, right, on='key')
• dplyr in R
filter(x), select(x), arrange(x), group_by(x),
inner_join(x, y), left_join(x, y)
• Hadoop and contemporaries all evolved to support RA-like interfaces:
Pig, HIVE, Cascalog, Flume, Spark/Shark, Dremel
“NOSQL” SYSTEMS
NoSQL and related systems, by feature
Year | System/Paper | Scale to 1000s | Primary Index | Secondary Indexes | Transactions | Joins/Analytics | Integrity Constraints | Views | Language/Algebra | Data model | My label
1971 | RDBMS | O | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | tables | sql-like
2003 | memcached | ✔ | ✔ | O | O | O | O | O | O | key-val | nosql
2004 | MapReduce | ✔ | O | O | O | ✔ | O | O | O | key-val | batch
2005 | CouchDB | ✔ | ✔ | ✔ | record | MR | O | ✔ | O | document | nosql
2006 | BigTable/Hbase | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | nosql
2007 | MongoDB | ✔ | ✔ | ✔ | EC, record | O | O | O | O | document | nosql
2007 | Dynamo | ✔ | ✔ | O | O | O | O | O | O | ext. record | nosql
2008 | Pig | ✔ | O | O | O | ✔ | / | O | ✔ | tables | sql-like
2008 | HIVE | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like
2008 | Cassandra | ✔ | ✔ | ✔ | EC, record | O | ✔ | ✔ | O | key-val | nosql
2009 | Voldemort | ✔ | ✔ | O | EC, record | O | O | O | O | key-val | nosql
2009 | Riak | ✔ | ✔ | ✔ | EC, record | MR | O |  |  | key-val | nosql
2010 | Dremel | ✔ | O | O | O | / | ✔ | O | ✔ | tables | sql-like
2011 | Megastore | ✔ | ✔ | ✔ | entity groups | O | / | O | / | tables | nosql
2011 | Tenzing | ✔ | O | O | O | O | ✔ | ✔ | ✔ | tables | sql-like
2011 | Spark/Shark | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like
2012 | Spanner | ✔ | ✔ | ✔ | ✔ | ? | ✔ | ✔ | ✔ | tables | sql-like
2013 | Impala | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like
Scale was the primary motivation!
Rick Cattell’s clustering from
“Scalable SQL and NoSQL Data Stores”
SIGMOD Record, 2010
extensible record stores
document stores
key-value stores
MapReduce-based Systems
[Timeline figure, 2004–2012: MapReduce, Hadoop, Pig, HIVE, Tenzing, Impala. Edge legend: “non-Google open source implementation”; “direct influence / shared features”; “compatible implementation of”.]
NoSQL Systems
[Timeline figure, 2003–2012: memcached, BigTable, CouchDB, Dynamo, MongoDB, Cassandra, Voldemort, Riak, Megastore, Spanner, Accumulo. Edge legend: “direct influence / shared features”; “compatible implementation of”.]
A lot of these systems give up joins!
Year | Source | System/Paper | Scale to 1000s | Primary Index | Secondary Indexes | Transactions | Joins/Analytics | Integrity Constraints | Views | Language/Algebra | Data model | My label
1971 | many | RDBMS | O | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | tables | SQL-like
2003 | other | memcached | ✔ | ✔ | O | O | O | O | O | O | key-val | lookup
2004 | Google | MapReduce | ✔ | O | O | O | ✔ | O | O | O | key-val | MR
2005 | Couchbase | CouchDB | ✔ | ✔ | ✔ | record | MR | O | ✔ | O | document | filter/MR
2006 | Google | BigTable (Hbase) | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | filter/MR
2007 | 10gen | MongoDB | ✔ | ✔ | ✔ | EC, record | O | O | O | O | document | filter
2007 | Amazon | Dynamo | ✔ | ✔ | O | O | O | O | O | O | key-val | lookup
2007 | Amazon | SimpleDB | ✔ | ✔ | ✔ | O | O | O | O | O | ext. record | filter
2008 | Yahoo | Pig | ✔ | O | O | O | ✔ | / | O | ✔ | tables | RA-like
2008 | Facebook | HIVE | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | SQL-like
2008 | Facebook | Cassandra | ✔ | ✔ | ✔ | EC, record | O | ✔ | ✔ | O | key-val | filter
2009 | other | Voldemort | ✔ | ✔ | O | EC, record | O | O | O | O | key-val | lookup
2009 | Basho | Riak | ✔ | ✔ | ✔ | EC, record | MR | O |  |  | key-val | filter
2010 | Google | Dremel | ✔ | O | O | O | / | ✔ | O | ✔ | tables | SQL-like
2011 | Google | Megastore | ✔ | ✔ | ✔ | entity groups | O | / | O | / | tables | filter
2011 | Google | Tenzing | ✔ | O | O | O | ✔ | ✔ | ✔ | ✔ | tables | SQL-like
2011 | Berkeley | Spark/Shark | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | SQL-like
2012 | Google | Spanner | ✔ | ✔ | ✔ | ✔ | ? | ✔ | ✔ | ✔ | tables | SQL-like
2012 | Accumulo | Accumulo | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | filter
2013 | Cloudera | Impala | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | SQL-like
Joins
• Ex: Show all comments by “Sue” on any blog post by “Jim”
• Method 1:
– Lookup all blog posts by Jim
– For each post, lookup all comments and filter for “Sue”
• Method 2:
– Lookup all comments by Sue
– For each comment, lookup all posts and filter for “Jim”
• Method 3:
– Filter comments by Sue, filter posts by Jim,
– Sort all comments by blog id, sort all blogs by blog id
– Pull one from each list to find matches
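The three methods above can be sketched in plain Python. The posts and comments records, their field names, and the function names are hypothetical; Methods 1 and 2 are lookup-based joins driven from one side, and Method 3 is a sort-merge join that assumes post ids are unique.

```python
from collections import defaultdict

# Made-up data: blog posts and the comments attached to them.
posts = [
    {"id": 1, "author": "Jim"},
    {"id": 2, "author": "Ann"},
    {"id": 3, "author": "Jim"},
]
comments = [
    {"post_id": 1, "author": "Sue", "text": "Nice post"},
    {"post_id": 1, "author": "Bob", "text": "Agreed"},
    {"post_id": 2, "author": "Sue", "text": "Hmm"},
    {"post_id": 3, "author": "Sue", "text": "Thanks"},
    {"post_id": 3, "author": "Sue", "text": "One more thing"},
]


def method1(posts, comments):
    """Start from Jim's posts, look up each post's comments, keep Sue's."""
    by_post = defaultdict(list)
    for c in comments:
        by_post[c["post_id"]].append(c)
    return [c["text"] for p in posts if p["author"] == "Jim"
            for c in by_post[p["id"]] if c["author"] == "Sue"]


def method2(posts, comments):
    """Start from Sue's comments, look up each comment's post, keep Jim's."""
    by_id = {p["id"]: p for p in posts}
    return [c["text"] for c in comments
            if c["author"] == "Sue" and by_id[c["post_id"]]["author"] == "Jim"]


def method3(posts, comments):
    """Filter both sides, sort both by blog id, then merge the sorted lists."""
    jim = sorted((p for p in posts if p["author"] == "Jim"), key=lambda p: p["id"])
    sue = sorted((c for c in comments if c["author"] == "Sue"),
                 key=lambda c: c["post_id"])
    out, i, j = [], 0, 0
    while i < len(jim) and j < len(sue):
        if jim[i]["id"] < sue[j]["post_id"]:
            i += 1
        elif jim[i]["id"] > sue[j]["post_id"]:
            j += 1
        else:
            out.append(sue[j]["text"])
            j += 1  # post ids are unique, so only the comment side advances
    return out


if __name__ == "__main__":
    print(method1(posts, comments))
```

Which method is cheapest depends on how many posts Jim wrote and how many comments Sue wrote. In a system without joins, that choice falls to the application programmer instead of a query optimizer.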
NoSQL Criticism
• Two value propositions
– Performance: “I started with MySQL, but
had a hard time scaling it out in a
distributed environment”
– Flexibility: “My data doesn’t conform to a
rigid schema”
Stonebraker CACM (blog 2)
NoSQL Criticism: flexibility argument
• Who are the customers of NoSQL?
– Lots of startups
• Very few enterprises. Why? Most
applications are traditional OLTP on
structured data; a few other applications
around the “edges”, but considered less
important
Stonebraker CACM (blog 2)
Some Takeaways
• Data wrangling is the hard part of data
science, not statistics
• Relational algebra is the right
abstraction for reasoning about data
wrangling
• Even “NoSQL” systems that explicitly
rejected relational concepts eventually
brought them back

Mais conteúdo relacionado

Mais procurados

Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data ScienceJason Geng
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data ScienceEdureka!
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data ScienceAndrew Gardner
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Gregg Barrett
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data ScienceKenny Daniel
 
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-ShapiroKeynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-ShapiroData ScienceTech Institute
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Gabriel Moreira
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 

Mais procurados (20)

Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?Data Science Introduction - Data Science: What Art Thou?
Data Science Introduction - Data Science: What Art Thou?
 
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-ShapiroKeynote -  An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Python for Data Science - TDC 2015
Python for Data Science - TDC 2015
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Data science and_analytics_for_ordinary_people_ebook
Data science and_analytics_for_ordinary_people_ebookData science and_analytics_for_ordinary_people_ebook
Data science and_analytics_for_ordinary_people_ebook
 

Destaque

Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science IntroductionGang Tao
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and AnalyticsSrinath Perera
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceCaserta
 
Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningJulian Bright
 
Introduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningIntroduction to Data Science and Large-scale Machine Learning
Introduction to Data Science and Large-scale Machine LearningNik Spirin
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big DataIndu Khemchandani
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software DevelopmentAlexis Seigneurin
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellSri Ambati
 
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014Austin Ogilvie
 
Introduction to Data Science with Hadoop
Introduction to Data Science with HadoopIntroduction to Data Science with Hadoop
Introduction to Data Science with HadoopDr. Volkan OBAN
 
Introduction to data science and candidate data science projects
Introduction to data science and candidate data science projectsIntroduction to data science and candidate data science projects
Introduction to data science and candidate data science projectsJay (Jianqiang) Wang
 
Introduction to Data Science: A Practical Approach to Big Data Analytics
Introduction to Data Science: A Practical Approach to Big Data AnalyticsIntroduction to Data Science: A Practical Approach to Big Data Analytics
Introduction to Data Science: A Practical Approach to Big Data AnalyticsIvan Khvostishkov
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project LifecycleJason Geng
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to datasetdatamantra
 
Intro to data science module 1 r
Intro to data science module 1 rIntro to data science module 1 r
Intro to data science module 1 ramuletc
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceKoo Ping Shung
 

Destaque (20)

Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Demystifying Data Science with an introduction to Machine Learning
Intro to Data Science Concepts

  • 1. Bill Howe, PhD Introduction to Data Science
  • 2. This morning • Context for “Data Science” • Databases and Relational Algebra • NoSQL 6/17/2015 Bill Howe, UW 2
  • 4. “The intuition behind this ought to be very simple: Mr. Obama is maintaining leads in the polls in Ohio and other states that are sufficient for him to win 270 electoral votes.” -- Nate Silver, Oct. 26, 2012. “…the argument we’re making is exceedingly simple. Here it is: Obama’s ahead in Ohio.” -- Nate Silver, Nov. 2, 2012. “The bar set by the competition was invitingly low. Someone could look like a genius simply by doing some fairly basic research into what really has predictive power in a political campaign.” -- Nate Silver, Nov. 10, 2012. Sources: DailyBeast, fivethirtyeight.com. Nate Silver (source: Randy Stewart)
  • 5. “…the biggest win came from good old SQL on a Vertica data warehouse and from providing access to data to dozens of analytics staffers who could follow their own curiosity and distill and analyze data as they needed.” -- Dan Woods, Jan 13 2013, CITO Research. “The decision was made to have Hadoop do the aggregate generations and anything not real-time, but then have Vertica to answer sort of ‘speed-of-thought’ queries about all the data.” -- Josh Hendler, CTO of H & K Strategies. Related: Obama campaign’s data-driven ground game. “In the 21st century, the candidate with [the] best data, merged with the best messages dictated by that data, wins.” -- Andrew Rasiej, Personal Democracy Forum
  • 8. Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030 1) Convert all the digitized books in the 20th century into n-grams (Thanks, Google!) (http://books.google.com/ngrams/) 2) Label each 1-gram (word) with a mood score. (Thanks, WordNet!) 3) Count the occurrences of each mood word. Examples: a 1-gram: “yesterday”; a 5-gram: “analysis is often described as”
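The three steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the method used in the paper: the `MOOD_LEXICON` entries, the `ngrams` and `mood_counts` helpers, and the sample sentence are all hypothetical, and the real study works from the Google Books n-gram data rather than raw text.

```python
from collections import Counter

# Hypothetical mood lexicon (word -> mood label). The slide credits WordNet
# for the real mood scores; these few entries are made up for illustration.
MOOD_LEXICON = {
    "happy": "joy",
    "delighted": "joy",
    "afraid": "fear",
    "gloomy": "sadness",
}


def ngrams(tokens, n):
    """Step 1: return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mood_counts(text):
    """Steps 2 and 3: label each 1-gram with a mood, then count per mood."""
    tokens = text.lower().split()
    counts = Counter()
    for (word,) in ngrams(tokens, 1):
        # Strip surrounding punctuation before the lexicon lookup.
        mood = MOOD_LEXICON.get(word.strip(".,;:!?\"'"))
        if mood is not None:
            counts[mood] += 1
    return counts


sample = "Yesterday she was happy, even delighted, but today she is afraid."
print(ngrams("analysis is often described as".split(), 5))
print(mood_counts(sample))
```

A whitespace tokenizer is the simplest possible choice here; a real pipeline would need proper tokenization and normalization by the total word count per year.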
  • 9. Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030
  • 10. Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030
  • 11. … 2. Michel J-P, Shen YK, Aiden AP, Veres A, Gray MK, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331: 176–182. doi: 10.1126/science.1199644. 3. Lieberman E, Michel J-P, Jackson J, Tang T, Nowak MA (2007) Quantifying the evolutionary dynamics of language. Nature 449: 713–716. doi: 10.1038/nature06137. 4. Pagel M, Atkinson QD, Meade A (2007) Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449: 717–720. doi: 10.1038/nature06176. … 6. DeWall CN, Pond RS Jr, Campbell WK, Twenge JM (2011) Tuning in to Psychological Change: Linguistic Markers of Psychological Traits and Emotions Over Time in Popular U.S. Song Lyrics. Psychology of Aesthetics, Creativity and the Arts 5: 200–207. doi: 10.1037/a0023195. …
  • 12. What is Data Science? • Fortune – “Hot New Gig in Tech” • Hal Varian, Google’s Chief Economist, NYT, 2009: – “The next sexy job” – “The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.” • Mike Driscoll, CEO of metamarkets: – “Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.” – “Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what's possible.”
  • 13. Drew Conway’s Data Science Venn Diagram
  • 14. What do data scientists do? “They need to find nuggets of truth in data and then explain it to the business leaders.” Data scientists “tend to be ‘hard scientists’, particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem.” -- DJ Patil, Chief Scientist at LinkedIn -- Richard Snee, EMC
  • 15. Mike Driscoll’s three sexy skills of data geeks • Statistics – traditional analysis • Data Munging – parsing, scraping, and formatting data • Visualization – graphs, tools, etc.
  • 16. “Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management and preservation of large collections of information.” -- Jeffrey Stanton, Syracuse University School of Information Studies, An Introduction to Data Science
  • 17. Data Science is about Data Products (Mike Loukides) • “Data-driven apps” – Spellchecker – Machine Translator • Interactive visualizations – Google flu application – Global Burden of Disease • Online Databases – Enterprise data warehouse – Sloan Digital Sky Survey. Data science is about building data products, not just answering questions. Data products empower others to use the data. May help communicate your results (e.g., Nate Silver’s maps). May empower others to do their own analysis (e.g., Global Burden of Disease).
  • 18. A Typical Data Science Workflow 1) Preparing to run a model: gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging (“80% of the work” -- Aaron Kimball) 2) Running the model 3) Interpreting the results (“The other 80% of the work”)
  • 19. What are the abstractions of data science? “Data Jujitsu” “Data Wrangling” “Data Munging” Translation: “We have no idea what this is all about”
  • 20. But what are the abstractions of data science? 1850s: matrices and linear algebra (today: engineers and scientists); 1950s: arrays and custom algorithms (today: C/Fortran performance junkies); 1950s: s-expressions and pure functions (today: language purists); 1960s: objects and methods (today: software engineers); 1970s: files and scripts (today: system administrators); 1970s: relations and relational algebra (today: large-scale data engineers); 1980s: data frames and functions (today: statisticians); 2000s: key-value pairs + one of the above (today: NoSQL hipsters)
  • 22. Relational Database History. Pre-Relational: if your data changed, your application broke. Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code. “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- Codd 1970. Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation.
  • 23. Key Idea: “Physical Data Independence”. Relations: SELECT seq FROM ncbi_sequences WHERE seq = 'GATTACGATATTA'; Files and pointers: f = fopen('table_file'); fseek(10030440); while (True) { fread(&buf, 1, 8192, f); if (buf == GATTACGATATTA) { . . .
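A minimal sketch of physical data independence, using Python's built-in sqlite3 module rather than any system named on the slide. The table name ncbi_sequences and the query come from the slide; the sample rows and the index name idx_seq are made up for illustration. The same query text returns the same answer before and after the physical layout changes (here, by adding an index); only the execution plan differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ncbi_sequences (id INTEGER PRIMARY KEY, seq TEXT)")
conn.executemany(
    "INSERT INTO ncbi_sequences (seq) VALUES (?)",
    [("GATTACGATATTA",), ("CCGGTTAA",), ("GATTACA",)],  # illustrative rows
)

QUERY = "SELECT seq FROM ncbi_sequences WHERE seq = ?"
TARGET = ("GATTACGATATTA",)


def run_query(connection):
    """Run the logical query; the caller never mentions files or offsets."""
    return connection.execute(QUERY, TARGET).fetchall()


def plan(connection):
    """Return SQLite's description of how it will physically execute QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, TARGET).fetchall()
    return " ".join(str(row[-1]) for row in rows)


before = run_query(conn)
plan_before = plan(conn)

# Change the physical representation: add an index on the searched column.
conn.execute("CREATE INDEX idx_seq ON ncbi_sequences (seq)")

after = run_query(conn)
plan_after = plan(conn)

print(before == after)  # same logical answer
print(plan_before)      # plan without the index
print(plan_after)       # plan that can use idx_seq
```

The application code (run_query) is untouched by the index; that is the property the fopen/fseek version lacks, since a change in file layout would invalidate its hard-coded offset.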
  • 24. Key Idea: An Algebra of Tables. Operators shown: select, project, join. Other operators: aggregate, union, difference, cross product
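The three operators named on this slide can be sketched in a few lines of plain Python. This is an illustration only, with tables modeled as lists of dicts; the table contents (sequences, annotations) and column names are hypothetical. The point is that each operator takes tables in and gives a table out, so they compose.

```python
def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in rows]


def join(left, right, key):
    """Equi-join on a shared column, implemented as a simple hash join."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical example tables.
sequences = [
    {"seq_id": 1, "organism": "human", "length": 120},
    {"seq_id": 2, "organism": "mouse", "length": 80},
    {"seq_id": 3, "organism": "human", "length": 45},
]
annotations = [
    {"seq_id": 1, "gene": "BRCA1"},
    {"seq_id": 2, "gene": "Trp53"},
    {"seq_id": 3, "gene": "TP53"},
    {"seq_id": 3, "gene": "MDM2"},
]

# Compose: genes annotated on human sequences.
human = select(sequences, lambda row: row["organism"] == "human")
result = project(join(human, annotations, "seq_id"), ["seq_id", "gene"])
print(result)
```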
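  • 25. Equivalent logical expressions. Right associative: $\sigma_{p=\text{knows}}(R) \bowtie_{o=s} \left(\sigma_{p=\text{holdsAccount}}(R) \bowtie_{o=s} \sigma_{p=\text{accountHomepage}}(R)\right)$. Left associative: $\left(\sigma_{p=\text{knows}}(R) \bowtie_{o=s} \sigma_{p=\text{holdsAccount}}(R)\right) \bowtie_{o=s} \sigma_{p=\text{accountHomepage}}(R)$. Distributive: $\sigma_{p_1=\text{knows} \wedge p_2=\text{holdsAccount} \wedge p_3=\text{accountHomepage}}(R \times R \times R)$.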
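A small check of the equivalence on this slide, as a sketch under stated assumptions: R is a hypothetical table of (subject, predicate, object) triples with made-up values, and intermediate results are kept as paths of nodes so that the join condition o = s can be applied at either end. For the cross-product form, the sketch adds the o = s conditions to the selection predicate, which the slide text does not spell out.

```python
from itertools import product

# Hypothetical triple table R: (subject, predicate, object).
R = [
    ("alice", "knows", "bob"),
    ("bob", "holdsAccount", "acct1"),
    ("acct1", "accountHomepage", "http://example.org/bob"),
    ("alice", "knows", "carol"),
    ("carol", "holdsAccount", "acct2"),
]


def select(pred):
    """Selection on the predicate column, kept as (subject, object) paths."""
    return {(s, o) for (s, p, o) in R if p == pred}


def join(left, right):
    """Join on o = s: the last node of a left path equals the first of a right path."""
    return {l + r[1:] for l in left for r in right if l[-1] == r[0]}


knows = select("knows")
holds = select("holdsAccount")
home = select("accountHomepage")

right_assoc = join(knows, join(holds, home))
left_assoc = join(join(knows, holds), home)

# Selection over R x R x R, with the join conditions folded into the predicate.
cross = {
    (t1[0], t1[2], t2[2], t3[2])
    for t1, t2, t3 in product(R, R, R)
    if t1[1] == "knows"
    and t2[1] == "holdsAccount"
    and t3[1] == "accountHomepage"
    and t1[2] == t2[0]
    and t2[2] == t3[0]
}

print(right_assoc == left_assoc == cross)
```

All three plans give the same rows; they differ only in how much intermediate data they build, which is what an optimizer gets to choose.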
  • 26. Why do we care? Algebraic Optimization. N = ((z*2)+((z*3)+0))/1. Algebraic Laws: 1. (+) identity: x+0 = x 2. (/) identity: x/1 = x 3. (*) distributes: (n*x+n*y) = n*(x+y) 4. (*) commutes: x*y = y*x. Apply rules 1, 3, 4, 2: N = (2+3)*z, two operations instead of five, no division operator. Same idea works with the Relational Algebra!
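The rewrite on this slide can be mechanized. The following is a toy sketch, not a real optimizer: expressions are nested tuples of the form (operator, left, right), and the function names simplify, count_ops, and evaluate are illustrative. It applies the four laws from the slide and confirms the operation count drops from five to two without changing the value.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def simplify(expr):
    """Rewrite an expression tree bottom-up using the four laws on the slide."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr
    a, b = simplify(a), simplify(b)
    if op == "+" and b == 0:  # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:  # law 2: x / 1 = x
        return a
    if (
        op == "+"
        and isinstance(a, tuple)
        and isinstance(b, tuple)
        and a[0] == "*"
        and b[0] == "*"
        and a[1] == b[1]
    ):
        # law 3: n*x + n*y = n*(x + y), then law 4 to write it as (x + y)*n
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)


def count_ops(expr):
    """Count the operator nodes in an expression tree."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + count_ops(expr[1]) + count_ops(expr[2])


def evaluate(expr, env):
    """Evaluate an expression tree, looking variable names up in env."""
    if isinstance(expr, tuple):
        op, a, b = expr
        return OPS[op](evaluate(a, env), evaluate(b, env))
    return env[expr] if isinstance(expr, str) else expr


# N = ((z*2) + ((z*3) + 0)) / 1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
simplified = simplify(original)

print(simplified)
print(count_ops(original), count_ops(simplified))
```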
  • 27. So what? RA is now ubiquitous • Galaxy – “bioinformatics workflows”: “…Operate on Genomics Intervals -> Join” • Pandas and Blaze: High Performance Arrays in Python: merge(left, right, on='key') • dplyr in R: filter(x), select(x), arrange(x), group_by(x), inner_join(x, y), left_join(x, y) • Hadoop and contemporaries all evolved to support RA-like interfaces: Pig, HIVE, Cascalog, Flume, Spark/Shark, Dremel
  • 29. Year System/ Paper Scale to 1000s Primary Index Secondary Indexes Transactions Joins/ Analytics Integrity Constraints Views Language/ Algebra Data model my label 1971 RDBMS O ✔ ✔ ✔ ✔ ✔ ✔ ✔ tables sql-like 2003 memcached ✔ ✔ O O O O O O key-val nosql 2004 MapReduce ✔ O O O ✔ O O O key-val batch 2005 CouchDB ✔ ✔ ✔ record MR O ✔ O document nosql 2006 BigTable/Hbase ✔ ✔ ✔ record compat. w/MR / O O ext. record nosql 2007 MongoDB ✔ ✔ ✔ EC, record O O O O document nosql 2007 Dynamo ✔ ✔ O O O O O O ext. record nosql 2008 Pig ✔ O O O ✔ / O ✔ tables sql-like 2008 HIVE ✔ O O O ✔ ✔ O ✔ tables sql-like 2008 Cassandra ✔ ✔ ✔ EC, record O ✔ ✔ O key-val nosql 2009 Voldemort ✔ ✔ O EC, record O O O O key-val nosql 2009 Riak ✔ ✔ ✔ EC, record MR O key-val nosql 2010 Dremel ✔ O O O / ✔ O ✔ tables sql-like 2011 Megastore ✔ ✔ ✔ entity groups O / O / tables nosql 2011 Tenzing ✔ O O O O ✔ ✔ ✔ tables sql-like 2011 Spark/Shark ✔ O O O ✔ ✔ O ✔ tables sql-like 2012 Spanner ✔ ✔ ✔ ✔ ? ✔ ✔ ✔ tables sql-like 2013 Impala ✔ O O O ✔ ✔ O ✔ tables sql-like NoSQL and related systems, by feature
  • 30. 6/17/2015 Bill Howe, UW 30 Year System/ Paper Scale to 1000s Primary Index Secondary Indexes Transactions Joins/ Analytics Integrity Constraints Views Language/ Algebra Data model my label 1971 RDBMS O ✔ ✔ ✔ ✔ ✔ ✔ ✔ tables sql-like 2003 memcached ✔ ✔ O O O O O O key-val nosql 2004 MapReduce ✔ O O O ✔ O O O key-val batch 2005 CouchDB ✔ ✔ ✔ record MR O ✔ O document nosql 2006 BigTable (Hbase) ✔ ✔ ✔ record compat. w/MR / O O ext. record nosql 2007 MongoDB ✔ ✔ ✔ EC, record O O O O document nosql 2007 Dynamo ✔ ✔ O O O O O O ext. record nosql 2008 Pig ✔ O O O ✔ / O ✔ tables sql-like 2008 HIVE ✔ O O O ✔ ✔ O ✔ tables sql-like 2008 Cassandra ✔ ✔ ✔ EC, record O ✔ ✔ O key-val nosql 2009 Voldemort ✔ ✔ O EC, record O O O O key-val nosql 2009 Riak ✔ ✔ ✔ EC, record MR O key-val nosql 2010 Dremel ✔ O O O / ✔ O ✔ tables sql-like 2011 Megastore ✔ ✔ ✔ entity groups O / O / tables nosql 2011 Tenzing ✔ O O O O ✔ ✔ ✔ tables sql-like 2011 Spark/Shark ✔ O O O ✔ ✔ O ✔ tables sql-like 2012 Spanner ✔ ✔ ✔ ✔ ? ✔ ✔ ✔ tables sql-like 2012 Accumulo ✔ ✔ ✔ record compat. w/MR / O O ext. record nosql 2013 Impala ✔ O O O ✔ ✔ O ✔ tables sql-like Scale was the primary motivation!
  • 31. Rick Cattell’s clustering from “Scalable SQL and NoSQL Data Stores”, SIGMOD Record, 2010: extensible record stores, document stores, key-value stores. Year System/ Paper Scale to 1000s Primary Index Secondary Indexes Transactions Joins/ Analytics Integrity Constraints Views Language/ Algebra Data model my label 1971 RDBMS O ✔ ✔ ✔ ✔ ✔ ✔ ✔ tables sql-like 2003 memcached ✔ ✔ O O O O O O key-val nosql 2004 MapReduce ✔ O O O ✔ O O O key-val batch 2005 CouchDB ✔ ✔ ✔ record MR O ✔ O document nosql 2006 BigTable (Hbase) ✔ ✔ ✔ record compat. w/MR / O O ext. record nosql 2007 MongoDB ✔ ✔ ✔ EC, record O O O O document nosql 2007 Dynamo ✔ ✔ O O O O O O key-val nosql 2008 Pig ✔ O O O ✔ / O ✔ tables sql-like 2008 HIVE ✔ O O O ✔ ✔ O ✔ tables sql-like 2008 Cassandra ✔ ✔ ✔ EC, record O ✔ ✔ O key-val nosql 2009 Voldemort ✔ ✔ O EC, record O O O O key-val nosql 2009 Riak ✔ ✔ ✔ EC, record MR O key-val nosql 2010 Dremel ✔ O O O / ✔ O ✔ tables sql-like 2011 Megastore ✔ ✔ ✔ entity groups O / O / tables nosql 2011 Tenzing ✔ O O O O ✔ ✔ ✔ tables sql-like 2011 Spark/Shark ✔ O O O ✔ ✔ O ✔ tables sql-like 2012 Spanner ✔ ✔ ✔ ✔ ? ✔ ✔ ✔ tables sql-like 2012 Accumulo ✔ ✔ ✔ record compat. w/MR / O O ext. record nosql 2013 Impala ✔ O O O ✔ ✔ O ✔ tables sql-like
  • 32. 6/17/2015 Bill Howe, UW 32 Year System/ Paper Scale to 1000s Primary Index Secondary Indexes Transactions Joins/ Analytics Integrity Constraints Views Language/ Algebra Data model my label 1971 RDBMS O ✔ ✔ ✔ ✔ ✔ ✔ ✔ tables sql-like 2003 memcached ✔ ✔ O O O O O O key-val nosql 2004 MapReduce ✔ O O O ✔ O O O key-val batch 2005 CouchDB ✔ ✔ ✔ record MR O ✔ O document nosql 2006 BigTable (Hbase) ✔ ✔ ✔ record compat. w/MR / O O ext. record nosql 2007 MongoDB ✔ ✔ ✔ EC, record O O O O document nosql 2007 Dynamo ✔ ✔ O O O O O O ext. record nosql 2008 Pig ✔ O O O ✔ / O ✔ tables sql-like 2008 HIVE ✔ O O O ✔ ✔ O ✔ tables sql-like 2008 Cassandra ✔ ✔ ✔ EC, record O ✔ ✔ O key-val nosql 2009 Voldemort ✔ ✔ O EC, record O O O O key-val nosql 2009 Riak ✔ ✔ ✔ EC, record MR O key-val nosql 2010 Dremel ✔ O O O / ✔ O ✔ tables sql-like 2011 Megastore ✔ ✔ ✔ entity groups O / O / tables nosql 2011 Tenzing ✔ O O O ✔ ✔ ✔ ✔ tables sql-like 2011 Spark/Shark ✔ O O O ✔ ✔ O ✔ tables sql-like 2012 Spanner ✔ ✔ ✔ ✔ ? ✔ ✔ ✔ tables sql-like 2012 Accumulo ✔ ✔ ✔ record compat. w/MR / O O ext. record nosql 2013 Impala ✔ O O O ✔ ✔ O ✔ tables sql-like MapReduce-based Systems
  • 33. MapReduce-based Systems [timeline figure, 2004 to 2012. Systems shown: MapReduce, Hadoop, Pig, HIVE, Tenzing, Impala. Legend: non-Google open source implementation; direct influence / shared features; compatible implementation of]
  • 34. 6/17/2015 Bill Howe, UW 34 Year System/ Paper Scale to 1000s Primary Index Secondary Indexes Transactions Joins/ Analytics Integrity Constraints Views Language/ Algebra Data model my label 1971 RDBMS O ✔ ✔ ✔ ✔ ✔ ✔ ✔ tables sql-like 2003 memcached ✔ ✔ O O O O O O key-val nosql 2004 MapReduce ✔ O O O ✔ O O O key-val batch 2005 CouchDB ✔ ✔ ✔ record MR O ✔ O document nosql 2006 BigTable (Hbase) ✔ ✔ ✔ record compat. w/MR / O O ext. record nosql 2007 MongoDB ✔ ✔ ✔ EC, record O O O O document nosql 2007 Dynamo ✔ ✔ O O O O O O ext. record nosql 2008 Pig ✔ O O O ✔ / O ✔ tables sql-like 2008 HIVE ✔ O O O ✔ ✔ O ✔ tables sql-like 2008 Cassandra ✔ ✔ ✔ EC, record O ✔ ✔ O key-val nosql 2009 Voldemort ✔ ✔ O EC, record O O O O key-val nosql 2009 Riak ✔ ✔ ✔ EC, record MR O key-val nosql 2010 Dremel ✔ O O O / ✔ O ✔ tables sql-like 2011 Megastore ✔ ✔ ✔ entity groups O / O / tables nosql 2011 Tenzing ✔ O O O ✔ ✔ ✔ ✔ tables sql-like 2011 Spark/Shark ✔ O O O ✔ ✔ O ✔ tables sql-like 2012 Spanner ✔ ✔ ✔ ✔ ? ✔ ✔ ✔ tables sql-like 2012 Accumulo ✔ ✔ ✔ record compat. w/MR / O O ext. record nosql 2013 Impala ✔ O O O ✔ ✔ O ✔ tables sql-like
  • 35. NoSQL Systems [timeline figure, 2003 to 2012. Systems shown: memcached, BigTable, CouchDB, MongoDB, Dynamo, Cassandra, Voldemort, Riak, Megastore, Spanner, Accumulo. Legend: direct influence / shared features; compatible implementation of]
  • 36. 6/17/2015 Bill Howe, UW 36 A lot of these systems give up joins! Year source System/ Paper Scale to 1000s Primary Index Secondary Indexes Transactions Joins/ Analytics Integrity Constraints Views Language/ Algebra Data model my label 1971 many RDBMS O ✔ ✔ ✔ ✔ ✔ ✔ ✔ tables SQL-like 2003 other memcached ✔ ✔ O O O O O O key-val lookup 2004 Google MapReduce ✔ O O O ✔ O O O key-val MR 2005 couchbase CouchDB ✔ ✔ ✔ record MR O ✔ O document filter/MR 2006 Google BigTable (Hbase) ✔ ✔ ✔ record compat. w/MR / O O ext. record filter/MR 2007 10gen MongoDB ✔ ✔ ✔ EC, record O O O O document filter 2007 Amazon Dynamo ✔ ✔ O O O O O O key-val lookup 2007 Amazon SimpleDB ✔ ✔ ✔ O O O O O ext. record filter 2008 Yahoo Pig ✔ O O O ✔ / O ✔ tables RA-like 2008 Facebook HIVE ✔ O O O ✔ ✔ O ✔ tables SQL-like 2008 Facebook Cassandra ✔ ✔ ✔ EC, record O ✔ ✔ O key-val filter 2009 other Voldemort ✔ ✔ O EC, record O O O O key-val lookup 2009 basho Riak ✔ ✔ ✔ EC, record MR O key-val filter 2010 Google Dremel ✔ O O O / ✔ O ✔ tables SQL-like 2011 Google Megastore ✔ ✔ ✔ entity groups O / O / tables filter 2011 Google Tenzing ✔ O O O ✔ ✔ ✔ ✔ tables SQL-like 2011 Berkeley Spark/Shark ✔ O O O ✔ ✔ O ✔ tables SQL-like 2012 Google Spanner ✔ ✔ ✔ ✔ ? ✔ ✔ ✔ tables SQL-like 2012 Accumulo Accumulo ✔ ✔ ✔ record compat. w/MR / O O ext. record filter 2013 Cloudera Impala ✔ O O O ✔ ✔ O ✔ tables SQL-like
  • 37. Joins • Ex: Show all comments by “Sue” on any blog post by “Jim” • Method 1: – Look up all blog posts by Jim – For each post, look up all comments and filter for “Sue” • Method 2: – Look up all comments by Sue – For each comment, look up all posts and filter for “Jim” • Method 3: – Filter comments by Sue, filter posts by Jim – Sort all comments by blog id, sort all blogs by blog id – Pull one from each list to find matches
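The three methods above can be sketched in plain Python. Everything here is illustrative: the posts and comments data, the field names (post_id, author, commenter, text), and the two lookup structures standing in for a store's indexes are assumptions, and Method 3 relies on post ids being unique. Methods 1 and 2 are index-lookup joins driven from opposite sides; Method 3 is a sort-merge join.

```python
from collections import defaultdict

# Hypothetical data.
posts = [
    {"post_id": 1, "author": "Jim"},
    {"post_id": 2, "author": "Ann"},
    {"post_id": 3, "author": "Jim"},
]
comments = [
    {"post_id": 1, "commenter": "Sue", "text": "a"},
    {"post_id": 1, "commenter": "Bob", "text": "b"},
    {"post_id": 2, "commenter": "Sue", "text": "c"},
    {"post_id": 3, "commenter": "Sue", "text": "d"},
    {"post_id": 3, "commenter": "Sue", "text": "e"},
]

# Stand-ins for the lookups a key-value or document store would provide.
comments_by_post = defaultdict(list)
for c in comments:
    comments_by_post[c["post_id"]].append(c)
post_by_id = {p["post_id"]: p for p in posts}


def method1():
    """Start from Jim's posts; look up each post's comments; filter for Sue."""
    out = []
    for p in posts:
        if p["author"] == "Jim":
            for c in comments_by_post[p["post_id"]]:
                if c["commenter"] == "Sue":
                    out.append(c["text"])
    return out


def method2():
    """Start from Sue's comments; look up each comment's post; filter for Jim."""
    return [
        c["text"]
        for c in comments
        if c["commenter"] == "Sue" and post_by_id[c["post_id"]]["author"] == "Jim"
    ]


def method3():
    """Filter both sides, sort both by post id, then merge the two sorted lists."""
    jim = sorted((p for p in posts if p["author"] == "Jim"), key=lambda p: p["post_id"])
    sue = sorted((c for c in comments if c["commenter"] == "Sue"), key=lambda c: c["post_id"])
    out, i, j = [], 0, 0
    while i < len(jim) and j < len(sue):
        if jim[i]["post_id"] < sue[j]["post_id"]:
            i += 1
        elif jim[i]["post_id"] > sue[j]["post_id"]:
            j += 1
        else:
            out.append(sue[j]["text"])
            j += 1  # post ids are unique, so only the comment side advances
    return out


print(method1(), method2(), method3())
```

In a system without joins, the application has to pick one of these strategies by hand; a relational optimizer picks among them using the sizes of the two inputs.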
  • 38. 6/17/2015 Bill Howe, UW 38 Year System/ Paper Scale to 1000s Primary Index Secondary Indexes Transactions Joins/ Analytics Integrity Constraints Views Language/ Algebra Data model my label 1971 RDBMS O ✔ ✔ ✔ ✔ ✔ ✔ ✔ tables SQL-like 2003 memcached ✔ ✔ O O O O O O key-val lookup 2004 MapReduce ✔ O O O ✔ O O O key-val MR 2005 CouchDB ✔ ✔ ✔ record MR O ✔ O document filter/MR 2006 BigTable (Hbase) ✔ ✔ ✔ record compat. w/MR / O O ext. record filter/MR 2007 MongoDB ✔ ✔ ✔ EC, record O O O O document filter 2007 Dynamo ✔ ✔ O O O O O O key-val lookup 2008 Pig ✔ O O O ✔ / O ✔ tables RA-like 2008 HIVE ✔ O O O ✔ ✔ O ✔ tables SQL-like 2008 Cassandra ✔ ✔ ✔ EC, record O ✔ ✔ O key-val filter 2009 Voldemort ✔ ✔ O EC, record O O O O key-val lookup 2009 Riak ✔ ✔ ✔ EC, record MR O key-val filter 2010 Dremel ✔ O O O / ✔ O ✔ tables SQL-like 2011 Megastore ✔ ✔ ✔ entity groups O / O / tables filter 2011 Tenzing ✔ O O O O ✔ ✔ ✔ tables SQL-like 2011 Spark/Shark ✔ O O O ✔ ✔ O ✔ tables SQL-like 2012 Spanner ✔ ✔ ✔ ✔ ? ✔ ✔ ✔ tables SQL-like 2012 Accumulo ✔ ✔ ✔ record compat. w/MR / O O ext. record filter 2013 Impala ✔ O O O ✔ ✔ O ✔ tables SQL-like
  • 39. NoSQL Criticism • Two value propositions – Performance: “I started with MySQL, but had a hard time scaling it out in a distributed environment” – Flexibility: “My data doesn’t conform to a rigid schema” -- Stonebraker CACM (blog 2)
  • 40. NoSQL Criticism: flexibility argument • Who are the customers of NoSQL? – Lots of startups • Very few enterprises. Why? Most applications are traditional OLTP on structured data; a few other applications around the “edges”, but considered less important -- Stonebraker CACM (blog 2)
  • 41. Some Takeaways • Data wrangling is the hard part of data science, not statistics • Relational algebra is the right abstraction for reasoning about data wrangling • Even “NoSQL” systems that explicitly rejected relational concepts eventually brought them back