Myria: Scalable
Analytics as a Service
Bill Howe, PhD
University of Washington
XLDB South America 2014
This morning
• UW eScience Institute
– A “Data Science Environment”
• SQLShare and High Variety Data
• Myria and “Relational Algorithmics”
7/10/2014 Bill Howe, UW 2
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
“All across our campus, the process of discovery will increasingly rely
on researchers’ ability to extract knowledge from vast amounts of
data… In order to remain at the forefront, UW must be a leader in
advancing these techniques and technologies, and in making [them]
accessible to researchers in the broadest imaginable range of fields.”
2005-2008
In other words:
• Data-driven discovery will be ubiquitous
• UW must be a leader in inventing the
capabilities
• UW must be a leader in translational
activities – in putting these capabilities
to work
• It’s about intellectual infrastructure (human capital) and software
infrastructure (shared tools and services – digital capital)
A 5-year, US$37.8 million cross-institutional
collaboration to create a data science environment
2014
Data Science Kickoff Session:
137 posters from 30+ departments and units
Establish a virtuous cycle
• 6 working groups, each with
• 3-6 faculty from each institution
UW Data Science Education Efforts
Students Non-Students
CS/Informatics Non-Major
professionals researchers
undergrads grads undergrads grads
UWEO Data Science Certificate
MOOC Intro to Data Science
IGERT: Big Data PhD Track
New CS Courses
Bootcamps and workshops
Intro to Data Programming
Data Science Masters (planned)
Incubator: hands-on training
Next Session begins June 30, 2014
https://www.coursera.org/course/datasci
MOOC Participation numbers
• “Registered”: 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9000 (typical attrition for a MOOC)
• “Passed”: 7022
• Forum threads: 4661
• Forum posts: 22,900
Fairly consistent with Coursera data across “hard” courses
Educational transformation:
A new generation of “Pi-shaped” scientists
PhD → πhD
Educational
transformation
Magda Balazinska
Educational
transformation
Big Data access
and management
Big Data
modeling
Big Data analytics
Collaborative
Big Data science
Data
Education and Research in Data Science
• Ultimate goal: A new PhD program
– Initial goal: A new certificate based on Big Data tracks in all departments
– Education highlights: data science courses, co-advising, and internships
• End-to-End Research Agenda
– Big Data mgmt, analytics, modeling, & collaboration
• Cyberinfrastructure Development
– Big Data analysis service
The Data Science Studio
• An open collaborative research space
• A resident data science team
– Permanent staff of ~5 data scientists – applied research and
development
– ~15-20 data science fellows (research scientists, visitors, postdocs,
students)
• How to Engage:
– Drop-in open workspace
– Studio “Office Hours”
– Incubation Program
6th floor Physics Astronomy
Building
A partnership among …
• Provost
• UW Libraries
• Physics, Astronomy,
Arts & Sciences
• eScience Institute
Estimated Timeline:
• Design Phase Jan-June
• Construction June – Sep
• Target: October 1, 2014
The rest of this talk…
How can we deliver 1000 little SDSSs
to anyone who wants one?
# of bytes
# of data sources
telescopes
spectra
LSST (~100PB; images, spectra)
PanSTARRS (~40PB; images, trajectories)
OOI (~50TB/year; sims, RSN)
IOOS (~50TB/year; sims, satellite, gliders,
AUVs, vessels, more)
CMOP (~10TB/year; sims, stations, gliders,
AUVs, vessels, more)
SDSS (~400TB; images, spectra, catalogs)
n-body
sims
models
AUVs
stations
cruises, CTDs
flow cytometry
gliders
ADCP
satellites
Astronomy
Ocean Sciences
3 V’s of Big Data
Volume
Variety
Velocity
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
Key question: How can we reduce this “data overhead”?
Simple Example
###query length COG hit #1 e-value #1 identity #1 score #1 hit length #1 description #1
chr_4[480001-580000].287 4500
chr_4[560001-660000].1 3556
chr_9[400001-500000].503 4211 COG4547 2.00E-04 19 44.6 620 Cobalamin biosynthesis protein C
chr_9[320001-420000].548 2833 COG5406 2.00E-04 38 43.9 1001 Nucleosome binding factor SPN,
chr_27[320001-404298].20 3991 COG4547 5.00E-05 18 46.2 620 Cobalamin biosynthesis protein C
chr_26[320001-420000].378 3963 COG5099 5.00E-05 17 46.2 777 RNA-binding protein of the Puf f
chr_26[400001-441226].196 2949 COG5099 2.00E-04 17 43.9 777 RNA-binding protein of the Puf f
chr_24[160001-260000].65 3542
chr_5[720001-820000].339 3141 COG5099 4.00E-09 20 59.3 777 RNA-binding protein of the Puf f
chr_9[160001-260000].243 3002 COG5077 1.00E-25 26 114 1089 Ubiquitin carboxyl-terminal hydr
chr_12[720001-820000].86 2895 COG5032 2.00E-09 30 60.5 2105 Phosphatidylinositol kinase and p
chr_12[800001-900000].109 1463 COG5032 1.00E-09 30 60.1 2105 Phosphatidylinositol kinase and p
chr_11[1-100000].70 2886
chr_11[80001-180000].100 1523
ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
id query hit e_value identity_ score query_start query_end hit_start hit_end hit_length
1 FHJ7DRN01A0TND.1 COG0414 1.00E-08 28 51 1 74 180 257 285
2 FHJ7DRN01A1AD2.2 COG0092 3.00E-20 47 89.9 6 85 41 120 233
3 FHJ7DRN01A2HWZ.4 COG3889 0.0006 26 35.8 9 94 758 845 872
…
2853 FHJ7DRN02HXTBY.5 COG5077 7.00E-09 37 52.3 3 77 313 388 1089
2854 FHJ7DRN02HZO4J.2 COG0444 2.00E-31 67 127 1 73 135 207 316
…
3566 FHJ7DRN02FUJW3.1 COG5032 1.00E-09 32 54.7 1 75 1965 2038 2105
…
COGAnnotation_coastal_sample.txt
SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
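The join above can be tried on a few rows taken from the two files. This is a minimal sketch using Python's built-in sqlite3 module; the two-column schemas are a simplification assumed for illustration (the real files have many more columns), and rows without a COG hit are loaded as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schemas: only the query id and the COG hit are kept.
conn.executescript("""
    CREATE TABLE Phaeo_genome (query TEXT, COG_hit TEXT);
    CREATE TABLE coastal_sample (query TEXT, hit TEXT);
""")
conn.executemany("INSERT INTO Phaeo_genome VALUES (?, ?)", [
    ("chr_4[480001-580000].287", None),        # no COG hit in the file
    ("chr_9[400001-500000].503", "COG4547"),
    ("chr_9[160001-260000].243", "COG5077"),
    ("chr_12[720001-820000].86", "COG5032"),
])
conn.executemany("INSERT INTO coastal_sample VALUES (?, ?)", [
    ("FHJ7DRN01A0TND.1", "COG0414"),
    ("FHJ7DRN02HXTBY.5", "COG5077"),
    ("FHJ7DRN02FUJW3.1", "COG5032"),
])

# Same join as on the slide, with an ORDER BY added for a stable result.
rows = conn.execute("""
    SELECT p.query, c.query, c.hit
    FROM Phaeo_genome p, coastal_sample c
    WHERE p.COG_hit = c.hit
    ORDER BY c.hit
""").fetchall()

for row in rows:
    print(row)
```

Only the COGs present in both files survive the join; the row with no hit drops out because NULL never compares equal.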
Data Science Workflow:
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
“80% of the work”
-- Aaron Kimball
“The other 80% of the work”
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of
scale, file handling, and feature
engineering.
Martin Kircher,
Genome Sciences
Why?
3k NSF postdocs in 2010
$50k / postdoc
at least 50% overhead
maybe $75M annually
at NSF alone?
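The back-of-envelope estimate above works out as follows. The variable names are illustrative, and the three inputs are the slide's own rough figures rather than measured values.

```python
# Inputs as stated on the slide (rough figures).
postdocs = 3_000             # NSF postdocs in 2010
cost_per_postdoc = 50_000    # US dollars per postdoc per year
overhead_fraction = 0.5      # "at least 50%" of time spent handling data

# Annual cost of the time spent on data handling rather than science.
annual_overhead_cost = postdocs * cost_per_postdoc * overhead_fraction
print(f"${annual_overhead_cost / 1e6:.0f}M per year")
```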
[Bar chart: Benchmark 1 and Benchmark 2 for Old system, Your system, and Our system; y-axis 0 to 120]
A typical Computer Science paper….
slide src: Dan Halperin
[Bar chart: Benchmark 1 and Benchmark 2 for Old system, Your system, Our system, and What people use; y-axis 0 to 12,500]
The reality of the situation….
slide src: Dan Halperin
A modest goal:
Expose all the world’s science data
through declarative query interfaces
QUERY-AS-A-SERVICE
2010 - present
Version 1
1) Upload data “as is”
Cloud-hosted, secure; no
need to install or design a
database; no pre-defined
schema; schema inference;
some integration
2) Write Queries
Right in your browser,
writing views on top of
views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them,
share with specific colleagues –
anyone with access can query
http://sqlshare.escience.washington.edu
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
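The overlap length that the CASE expression computes can also be written as a single formula. This is a sketch, not the query author's method: the function name is hypothetical, and coordinates are assumed to be inclusive base-pair positions, as the `+ 1` in the query suggests.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Length of the overlap of two inclusive intervals, or 0 if they are disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# SNP region inside the non-coding region: x.end_bp - x.start_bp + 1
print(overlap_length(10, 20, 5, 30))
# Non-coding region starts inside the SNP region: x.end_bp - w.start_bp + 1
print(overlap_length(10, 20, 15, 30))
# Non-coding region ends inside the SNP region: w.end_bp - x.start_bp + 1
print(overlap_length(10, 20, 5, 12))
```

The min/max form also covers a non-coding region that lies strictly inside the SNP region, a case the transcribed CASE expression does not handle separately.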
We see thousands of
queries written by
non-programmers
Howe, et al., CISE 2013
Steven
Roberts
SQL as a lab notebook:
http://bit.ly/16Xj2JP
[Workflow diagram. Inputs: GFF of methylated CG locations; GFF of all genes; GFF of all CG locations; gene descriptions. Steps: calculate # methylated CGs; calculate # all CGs; calculate methylation ratio; link methylation with gene description.]
[Workflow diagram with lower-level operators: Join, Reorder columns, Count, Compute, Trim, Excel; one annotated misstep: join w/ wrong fill.]
[Workflow diagram, condensed. Same inputs. Steps: calculate # methylated CGs; calculate # all CGs; calculate methylation ratio and link with gene description.]
Popular service for
Bioinformatics Workflows
Halperin, Howe, et al. SSDBM 2013
Two Problems with SQLShare
• No help for truly big datasets
• No help for “algorithmics”
Limitations of SQLShare
Relational Algorithmics-as-a-Service
Version 2
http://myria.cs.washington.edu
Myria is…
• MyriaQ: A compiler framework for multiple
iterative RA-based languages and multiple
big data back ends
• MyriaX: A parallel, shared-nothing,
iterative execution engine
• MyriaWeb: A RESTful Analytics-as-a-
Service platform and web-based interface
Myria is …
Magda Balazinska, Bill Howe, and Dan Suciu
Dan Halperin (technical lead)
Victor Almeida
Andrew Whitaker
PhD Students
Shumo Chu
Eric Gribkoff
Jeremy Hyrkas
Paris Koutris
Ryan Maas
Dominik Moritz
Laurel Orr
Jennifer Ortiz
Emad Soroush
Jingjing Wang
ShengLiang Xu
Undergraduate Students
Lee Lee Choo
Vaspol Ruamviboonsuk
Myria Team
Myria Architecture
[Architecture diagram.
MyriaQ (Python): Web UI; REST; language parsers for Datalog, SQL, and MyriaL; the Myria compiler with a logical optimizer for RA+While; C compiler; Grappa.
MyriaX (Java): a coordinator with a REST server and a catalog; workers, each with a worker catalog and an RDBMS reached over jdbc; HDFS.
Connections: json query plan; netty protocols.]
[Diagram: MyriaQ as a compiler framework.
Languages: SQL, Datalog, MyriaL, ??
Common layer: Relational Algebra + Iteration
Compilers targeting back ends: SciDB, Spark, Serial C++, Grappa, MyriaX, SQL]
Oceanography, Astronomy, Biology, Medical Informatics
[Instrument diagram: laser, microscope objective, pinhole lens, nozzle, d1, d2, FSC (forward scatter), orange fluo, red fluo]
EX: SeaFlow
Francois
Ribalet
Jarred
Swalwell
Ginger
Armbrust
Ex: SeaFlow
[Flow cytometry scatter plots; axes are logarithmic, 10^0 to 10^4.
Panel “ps3.fcs…Focus”: axes D1/FSC and D2/FSC.
Panel “ps3.fcs…subset”: axes FSC and 692-40 RED fluorescence; labels Picoplankton, Nanoplankton.
Panel “P35-surf”: axes FSC and 580-30; labels Small Stuff, IS, Ultraplankton, Prochlorococcus.]
Continuous observations of various phytoplankton groups from 1-20
µm in size
Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton
Based on ORANGE fluo: Synechococcus, Cryptophytes
Based on FSC: Coccolithophores
SeaFlow in Myria
• “That 5-line MyriaL program was 100x faster than my R cluster,
and much simpler”
Dan Halperin Sophie Clayton
1) BD experiments are ridiculously labor-intensive
– N systems x M real-world applications
– Big clusters and big datasets
2) No “one size fits all solution”
– Realistic environments will use more than one system
3) A return to distributed, federated databases
– Erase the distinction between ETL and Analytics
Why a big data middleware?
[Timeline, 2008 to 2014, of systems compared with Hadoop, labeled by lead author: Pregel (Malewicz), HaLoop (Bu), Spark (Zaharia), Vertica (Pavlo), SystemML (Ghoting), Hyracks (Borkar), GraphLab (Low), Cumulon (Huang), Giraph (Tian), Dremel (Melnik), SimSQL (Cai), epiC (Jiang), Impala (Cloudera), Shark (Xin), HIVE (Thusoo).
Legend: ~100x faster; faster; comparable or inconclusive.
Eras: “The good old days”, then “The age of uncertainty”.]
What can we conclude?
Hadoop was probably just pretty bad
The rest of the story not so clear
Relational Algebra is the Calculus of Big Data
• Hadoopspawn: Pig, HIVE, blah
• Hadoop contemporaries: Cascalog, Flume, blah
• Post-Hadoop: Spark/Shark, Dremel, blah
• etc.
[Timeline, 2004 to 2012: Google Big Data Systems. Systems shown: MapReduce, BigTable, Dremel, Tenzing, Pregel, Spanner, Megastore, BigQuery, Hadoop, HBase.
Legend: non-Google open source implementation; direct influence / shared features; compatible implementation of SQL-like interface.]
Relational Algebra is the Calculus of Small Data
• Galaxy – “bioinformatics workflows”
• Pandas (Python)
merge(left, right, on='key')
• dplyr (R)
filter(x), select(x), arrange(x), group_by(x),
inner_join(x, y), left_join(x, y), ….
• Manimal, Pyxis/StatusQuo, others
– Extract RA operators implemented manually in Java code
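The operators these tools expose are the classic relational algebra ones. As a rough illustration, here they are over plain Python lists of dicts; the function names and sample rows are hypothetical and do not reproduce the API of Pandas, dplyr, or Galaxy.

```python
def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Projection: keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]


def join(left, right, key):
    """Equi-join on one column (nested loops, for clarity rather than speed)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


left = [{"key": 1, "name": "a"}, {"key": 2, "name": "b"}]
right = [{"key": 1, "score": 10}, {"key": 3, "score": 30}]

joined = join(left, right, "key")
names = project(select(joined, lambda r: r["score"] > 5), ["name"])
print(joined, names)
```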
“…Operate on Genomics Intervals -> Join”
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
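The rewrite can be checked step by step. In this sketch each intermediate form is written as a Python function of z, in the order the slide applies the rules (1, 3, 4, 2), and all forms are evaluated on the same input to confirm they agree. The list and helper names are illustrative.

```python
# One entry per rewrite step: (rule applied, resulting expression).
steps = [
    ("original",                   lambda z: ((z * 2) + ((z * 3) + 0)) / 1),
    ("rule 1: x+0 = x",            lambda z: ((z * 2) + (z * 3)) / 1),
    ("rule 3: n*x+n*y = n*(x+y)",  lambda z: (z * (2 + 3)) / 1),
    ("rule 4: x*y = y*x",          lambda z: ((2 + 3) * z) / 1),
    ("rule 2: x/1 = x",            lambda z: (2 + 3) * z),
]


def all_steps_agree(z):
    """True if every intermediate form evaluates to the same value for this z."""
    values = [expr(z) for _, expr in steps]
    return all(v == values[0] for v in values)


print(all_steps_agree(7))
```

A query optimizer does the same thing with relational operators: it applies equivalences that preserve the result while reducing the work.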
A closer look at an example
ROI(id, start, stop) is a set of “regions of interest”
Read(id, start, stop) is a set of “reads” from sequencer
Task: For each region of interest, count the number
of reads it contains
start stop
start stop
SELECT roi.id, count(rd.id)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start AND rd.[end] <= roi.[end]
GROUP BY roi.id
As a query
“region of interest”
sequence “read”
SELECT roi.id, count(rd.start)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start AND rd.[end] <= roi.[end]
GROUP BY roi.id
Why databases get
a bad reputation
many minutes
SELECT roi.id, count(rd.start) as cnt
FROM regions_of_interest roi, indexed_reads rd
WHERE roi.start <= rd.start AND rd.start <= roi.[end]
AND roi.start <= rd.[end] AND rd.[end] <= roi.[end]
GROUP BY roi.id
3 seconds!
roi
read
two-sided index scan
one-sided index scan,
plus filter
The broken promise of declarative query…
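The rewrite adds predicates that are logically redundant but bound each column on both sides. The sketch below checks on toy data that the answer does not change. It uses Python's sqlite3 module with columns named start and stop as in the schema above; the index name and the sample rows are made up, and every read is assumed to have start <= stop. It does not reproduce the timings on the slide, which depend on the database engine and its optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions_of_interest (id INTEGER, start INTEGER, stop INTEGER);
    CREATE TABLE reads (id INTEGER, start INTEGER, stop INTEGER);
    CREATE INDEX reads_start ON reads (start);  -- hypothetical index
    INSERT INTO regions_of_interest VALUES (1, 0, 100), (2, 50, 150);
    INSERT INTO reads VALUES (10, 10, 20), (11, 60, 90), (12, 95, 120), (13, 140, 150);
""")

CONTAINMENT = """
    SELECT roi.id, COUNT(rd.id)
    FROM regions_of_interest roi, reads rd
    WHERE roi.start <= rd.start AND rd.stop <= roi.stop
    {extra}
    GROUP BY roi.id
    ORDER BY roi.id
"""

# Original form: one bound on each read column.
plain = conn.execute(CONTAINMENT.format(extra="")).fetchall()

# Rewritten form: redundant bounds so each read column is bounded on both sides.
bounded = conn.execute(CONTAINMENT.format(
    extra="AND rd.start <= roi.stop AND roi.start <= rd.stop")).fetchall()

print(plain, bounded)
```

That a user has to add such predicates by hand is the "broken promise": the two queries mean the same thing, yet only one gives the optimizer a range it can scan.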
Lowering barrier to entry
Giving users insight
Shumo Chu Dominik Moritz
Diagnosing problems
Source node
Destination node
Shumo Chu Dominik Moritz
A = LOAD('points.txt', id:int, x:float, y:float);
E = LIMIT(A, 4);
F = SEQUENCE();
Centroids = [FROM E EMIT (cluster_id=F.next, x=E.x, y=E.y)];
Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)];
DO
I = CROSS(Kmeans, Centroids);
J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id,
distance=$distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
K = [FROM J EMIT id, distance=$min(distance)];
L = JOIN(J, id, K, id);
M = [FROM L WHERE J.distance <= K.distance EMIT
(id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
Delta = DIFF(Kmeans', Kmeans);
Kmeans = Kmeans';
Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
WHILE Delta != {}
K-Means in the language MyriaL
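For readers who do not know MyriaL, the same loop can be sketched in plain Python. This is an illustration of the algorithm's structure, not a translation of Myria's execution: the function name and signature are hypothetical, the first k points seed the centroids (as LIMIT does above), and ties go to the lowest cluster id (as $min(cluster_id) does).

```python
import math


def kmeans(points, k, max_iters=100):
    """points: list of (id, x, y) tuples. Returns {point_id: cluster_id}."""
    # Seed the centroids with the first k points.
    centroids = {cid: (x, y) for cid, (_, x, y) in enumerate(points[:k])}
    assignment = {pid: 0 for pid, _, _ in points}
    for _ in range(max_iters):
        new_assignment = {}
        for pid, x, y in points:
            # Cross product with the centroids, then keep the nearest one.
            new_assignment[pid] = min(
                centroids,
                key=lambda cid: (
                    math.hypot(x - centroids[cid][0], y - centroids[cid][1]),
                    cid,
                ),
            )
        if new_assignment == assignment:  # empty difference: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        for cid in centroids:
            members = [(x, y) for pid, x, y in points if assignment[pid] == cid]
            if members:
                centroids[cid] = (
                    sum(mx for mx, _ in members) / len(members),
                    sum(my for _, my in members) / len(members),
                )
    return assignment


clusters = kmeans([(1, 0.0, 0.0), (2, 10.0, 10.0), (3, 0.0, 1.0), (4, 10.0, 11.0)], k=2)
print(clusters)
```

Each step maps onto a relational operator (cross product, aggregate, join, difference), which is what lets the MyriaL version be optimized and parallelized like a query.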
CurGood = SCAN(public:adhoc:sc_points);
DO
mean = [FROM CurGood EMIT val=AVG(v)];
std = [FROM CurGood EMIT val=STDEV(v)];
NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
CurGood = CurGood - NewBad;
continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V0
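The same loop in plain Python, as a rough sketch: the function name and the parameter k (the number of standard deviations, 2 on the slide) are illustrative, and statistics.stdev computes the sample standard deviation, which may or may not match MyriaL's STDEV.

```python
import statistics


def sigma_clip(values, k=2.0):
    """Repeatedly drop values more than k standard deviations from the mean."""
    good = list(values)
    while len(good) >= 2:
        mean = statistics.mean(good)
        std = statistics.stdev(good)  # recomputed from scratch on every pass
        new_bad = [v for v in good if abs(v - mean) > k * std]
        if not new_bad:
            break
        good = [v for v in good if abs(v - mean) <= k * std]
    return good


clipped = sigma_clip([10, 11, 9, 10, 12, 8, 10, 100])
print(clipped)
```

On this input the first pass removes the outlier 100 and the second pass removes nothing, so the loop stops.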
CurGood = P
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)]
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
sum = sum - [FROM NewBad EMIT SUM(val)];
sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
cnt = cnt - [FROM NewBad EMIT CNT(*)];
mean = sum / cnt
std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
NewBad = FILTER([ABS(val-mean)>std], CurGood)
CurGood = CurGood - NewBad
WHILE NewBad != {}
Sigma-clipping, V1: Incremental
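The incremental idea is to keep three running aggregates (sum, sum of squares, count) and subtract the contribution of the removed points instead of rescanning all points for the statistics. A minimal Python sketch, with hypothetical names and k as the clipping threshold in standard deviations:

```python
import math


def sigma_clip_incremental(values, k=2.0):
    """Sigma-clipping that maintains sum, sum of squares, and count incrementally."""
    good = list(values)
    total = sum(good)
    total_sq = sum(v * v for v in good)
    count = len(good)
    while count >= 2:
        mean = total / count
        # Sample variance from the running aggregates, as in the formula above.
        var = (count * total_sq - total * total) / (count * (count - 1))
        std = math.sqrt(max(var, 0.0))  # guard against tiny negative rounding error
        new_bad = [v for v in good if abs(v - mean) > k * std]
        if not new_bad:
            break
        # Update the aggregates using only the removed points.
        total -= sum(new_bad)
        total_sq -= sum(v * v for v in new_bad)
        count -= len(new_bad)
        good = [v for v in good if abs(v - mean) <= k * std]
    return good


clipped_incremental = sigma_clip_incremental([10, 11, 9, 10, 12, 8, 10, 100])
print(clipped_incremental)
```

The filter still scans the surviving points on each pass; only the mean and standard deviation are maintained incrementally. The sum-of-squares formula can lose precision on large values, so this form trades some numerical robustness for less work per iteration.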
Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
stats = [FROM aggs EMIT mean=_sum/cnt,
std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
AND v >= bounds.lower EMIT v=Points.v];
tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
AND v <= bounds.upper EMIT v=Points.v];
newBad = UNIONALL(tooLow, tooHigh);
bounds = newBounds;
continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
Sigma-clipping, V2
• Hypothesis: Loops + RA covers everything anyone wants to do
– and it scales, it’s optimizable, and it’s accessible
• We can smooth the ROI curve for novices
– Start with simple queries…
– …end up working on advanced parallel algorithms
• “White Box Analytics”
– Compose queries, inspect plans, monitoring, debugging, “UDRs” –
user-defined optimization rules
• Multiple languages, multiple backends, one data/query model
– Ask me about graph data
– Ask me about array data (or, rather, mesh data)
“Relational Algorithmics”
Takeaways
• We hope to see “Data Science Environments” at
universities worldwide
– We try to make our programs and activities reusable
• Software-as-a-service to reach the “long tail” of science
• “Relational Algorithmics”
– The relational algebra is the calculus of big data
– “It’s not just for databases anymore”
– Learn it, use it, teach it
– Myria is a platform for “relational algorithmics”
http://escience.washington.edu
@billghowe
billhowe@cs.washington.edu
Maslow’s Needs Hierarchy
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow, 1943
A “Needs Hierarchy” of Science Data Management
storage
sharing
query
integration
analytics
A “Needs Hierarchy” of Science Data Management
storage
sharing
integration
query
analytics

Mais conteúdo relacionado

Mais procurados

A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceUniversity of Washington
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DUniversity of Washington
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Robert Grossman
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
Open Science and Executable Papers
Open Science and Executable PapersOpen Science and Executable Papers
Open Science and Executable PapersJose Enrique Ruiz
 
Science20brussels osimo april2013
Science20brussels osimo april2013Science20brussels osimo april2013
Science20brussels osimo april2013osimod
 
MESUR: Making sense and use of usage data
MESUR: Making sense and use of usage dataMESUR: Making sense and use of usage data
MESUR: Making sense and use of usage dataHerbert Van de Sompel
 
towards interoperable archives: the Universal Preprint Service initiative
towards interoperable archives:  the Universal Preprint Service initiativetowards interoperable archives:  the Universal Preprint Service initiative
towards interoperable archives: the Universal Preprint Service initiativeHerbert Van de Sompel
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
 
Love for science or 'Academic Prostitution' - DFD2014 version
Love for science or 'Academic Prostitution' - DFD2014 versionLove for science or 'Academic Prostitution' - DFD2014 version
Love for science or 'Academic Prostitution' - DFD2014 versionLourdes Verdes-Montenegro
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper ProvenancePaul Groth
 
Digital Science: Towards the executable paper
Digital Science: Towards the executable paperDigital Science: Towards the executable paper
Digital Science: Towards the executable paperJose Enrique Ruiz
 
A Biological Internet?: Eywa
A Biological Internet?: EywaA Biological Internet?: Eywa
A Biological Internet?: EywaEugene Siow
 
A Cabinet Of Web2.0 Scientific Curiosities
A Cabinet Of Web2.0 Scientific CuriositiesA Cabinet Of Web2.0 Scientific Curiosities
A Cabinet Of Web2.0 Scientific CuriositiesIan Mulvany
 
Camp 4-data workshop presentation
Camp 4-data workshop presentationCamp 4-data workshop presentation
Camp 4-data workshop presentationPaolo Missier
 

Mais procurados (20)

A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScience
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
Cifar
CifarCifar
Cifar
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Open Science and Executable Papers
Open Science and Executable PapersOpen Science and Executable Papers
Open Science and Executable Papers
 
Science20brussels osimo april2013
Science20brussels osimo april2013Science20brussels osimo april2013
Science20brussels osimo april2013
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
MESUR: Making sense and use of usage data
MESUR: Making sense and use of usage dataMESUR: Making sense and use of usage data
MESUR: Making sense and use of usage data
 
towards interoperable archives: the Universal Preprint Service initiative
towards interoperable archives:  the Universal Preprint Service initiativetowards interoperable archives:  the Universal Preprint Service initiative
towards interoperable archives: the Universal Preprint Service initiative
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
Love for science or 'Academic Prostitution' - DFD2014 version
Love for science or 'Academic Prostitution' - DFD2014 versionLove for science or 'Academic Prostitution' - DFD2014 version
Love for science or 'Academic Prostitution' - DFD2014 version
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
Digital Science: Towards the executable paper
Digital Science: Towards the executable paperDigital Science: Towards the executable paper
Digital Science: Towards the executable paper
 
A Biological Internet?: Eywa
A Biological Internet?: EywaA Biological Internet?: Eywa
A Biological Internet?: Eywa
 
A Cabinet Of Web2.0 Scientific Curiosities
A Cabinet Of Web2.0 Scientific CuriositiesA Cabinet Of Web2.0 Scientific Curiosities
A Cabinet Of Web2.0 Scientific Curiosities
 
Camp 4-data workshop presentation
Camp 4-data workshop presentationCamp 4-data workshop presentation
Camp 4-data workshop presentation
 

Semelhante a XLDB South America Keynote: eScience Institute and Myria

So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011Ian Foster
 
UK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalfaceUK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalfaceLizLyon
 
Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013Big Data Curricula at the UW eScience Institute, JSM 2013
Big Data Curricula at the UW eScience Institute, JSM 2013University of Washington
 
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...datacite
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGGeoffrey Fox
 
Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Jeroen Rombouts
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube
 
IEEE_BigData2014-Lee.pdf
IEEE_BigData2014-Lee.pdfIEEE_BigData2014-Lee.pdf
IEEE_BigData2014-Lee.pdfssuserff37aa
 
Software Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceSoftware Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceCarole Goble
 
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...
MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Me...Yongyao Jiang
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneIan Foster
 

XLDB South America Keynote: eScience Institute and Myria

  • 1. Myria: Scalable Analytics as a Service Bill Howe, PhD University of Washington XLDB South America 2014
  • 2. This morning • UW eScience Institute – A “Data Science Environment” • SQLShare and High Variety Data • Myria and “Relational Algorithmics” 7/10/2014 Bill Howe, UW 2
  • 3. 3 “It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research “The greatest minds of my generation are trying to figure out how to make people click on ads” -- Jeff Hammerbacher, co-founder, Cloudera
  • 4. The Fourth Paradigm 1. Empirical + experimental 2. Theoretical 3. Computational 4. Data-Intensive Jim Gray
  • 5. “All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.” 2005-2008 In other words: • Data-driven discovery will be ubiquitous • UW must be a leader in inventing the capabilities • UW must be a leader in translational activities – in putting these capabilities to work • It’s about intellectual infrastructure (human capital) and software infrastructure (shared tools and services – digital capital)
  • 6. A 5-year, US$37.8 million cross-institutional collaboration to create a data science environment 6 2014
  • 7. Data Science Kickoff Session: 137 posters from 30+ departments and units
  • 8. Establish a virtuous cycle • 6 working groups, each with • 3-6 faculty from each institution
  • 9. UW Data Science Education Efforts 7/10/2014 Bill Howe, UW 9 Students Non-Students CS/Informatics Non-Major professionals researchers undergrads grads undergrads grads UWEO Data Science Certificate MOOC Intro to Data Science IGERT: Big Data PhD Track New CS Courses Bootcamps and workshops Intro to Data Programming Data Science Masters (planned) Incubator: hands-on training
  • 10. Next Session begins June 30, 2014 https://www.coursera.org/course/datasci
  • 11. MOOC Participation numbers • “Registered”: 119,517 (totally irrelevant) • Clicked play in first 2 weeks: 78,589 • Turned in 1st homework: 10,663 • Completed all assignments: ~9000 (typical attrition for a MOOC) • “Passed”: 7022 • Forum threads: 4661 • Forum posts: 22,900 • Fairly consistent with Coursera data across “hard” courses
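The participation numbers on this slide form a funnel. As a rough sketch, each stage can be expressed as a share of registrations and of the previous stage. The stage counts are taken from the slide; the "~9000" completion figure is approximate there, so it is treated here as exactly 9000, and the variable names are illustrative only.

```python
# Funnel stages as listed on the slide (the completion count is approximate).
funnel = [
    ("Registered", 119_517),
    ("Clicked play in first 2 weeks", 78_589),
    ("Turned in 1st homework", 10_663),
    ("Completed all assignments", 9_000),  # "~9000" on the slide
    ("Passed", 7_022),
]

registered = funnel[0][1]
pct_of_registered = {}
pct_of_previous = {}
previous = registered
for stage, count in funnel:
    pct_of_registered[stage] = 100.0 * count / registered
    pct_of_previous[stage] = 100.0 * count / previous
    previous = count

for stage, _ in funnel:
    print(f"{stage:32s} {pct_of_registered[stage]:6.1f}% of registered, "
          f"{pct_of_previous[stage]:6.1f}% of previous stage")
```

Read this way, the largest drop is between clicking play and turning in the first homework, which fits the slide's remark that the registration count itself says little.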
  • 12. Educational transformation: A new generation of “Pi-shaped” scientists 12 PhD  πhD Educational transformation Magda Balazinska
  • 13. Educational transformation: Big Data access and management; Big Data modeling; Big Data analytics; Collaborative Big Data science. Data Education and Research in Data Science • Ultimate goal: A new PhD program – Initial goal: A new certificate based on Big Data tracks in all departments – Education highlights: data science courses, co-advising, and internships • End-to-End Research Agenda – Big Data mgmt, analytics, modeling, & collaboration • Cyberinfrastructure Development – Big Data analysis service
  • 14. The Data Science Studio • An open collaborative research space • A resident data science team – Permanent staff of ~5 data scientists – applied research and development – ~15-20 data science fellows (research scientists, visitors, postdocs, students) • How to Engage: – Drop-in open workspace – Studio “Office Hours” – Incubation Program 14
  • 15. 15 6th floor Physics Astronomy Building A partnership among … • Provost • UW Libraries • Physics, Astronomy, Arts & Sciences • eScience Institute
  • 16. 16 Estimated Timeline: • Design Phase Jan-June • Construction June – Sep • Target: October 1, 2014
  • 17. The rest of this talk…
  • 18. How can we deliver 1000 little SDSSs to anyone who wants one?
  • 19. 3 V’s of Big Data: Volume, Variety, Velocity. [Figure: # of bytes vs. # of data sources for Astronomy and Ocean Sciences projects] Astronomy: LSST (~100PB; images, spectra); PanSTARRS (~40PB; images, trajectories); SDSS (~400TB; images, spectra, catalogs). Ocean Sciences: OOI (~50TB/year; sims, RSN); IOOS (~50TB/year; sims, satellite, gliders, AUVs, vessels, more); CMOP (~10TB/year; sims, stations, gliders, AUVs, vessels, more). Data source labels in the figure: telescopes, spectra, n-body sims, models, AUVs, stations, cruises, CTDs, flow cytometry, gliders, ADCP, satellites.
  • 20. How much time do you spend “handling data” as opposed to “doing science”? Mode answer: “90%” Key question: How can we reduce this “data overhead”?
  • 21. Simple Example
      File: ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
      ###query | length | COG hit #1 | e-value #1 | identity #1 | score #1 | hit length #1 | description #1
      chr_4[480001-580000].287 | 4500 | | | | | |
      chr_4[560001-660000].1 | 3556 | | | | | |
      chr_9[400001-500000].503 | 4211 | COG4547 | 2.00E-04 | 19 | 44.6 | 620 | Cobalamin biosynthesis protein C
      chr_9[320001-420000].548 | 2833 | COG5406 | 2.00E-04 | 38 | 43.9 | 1001 | Nucleosome binding factor SPN,
      chr_27[320001-404298].20 | 3991 | COG4547 | 5.00E-05 | 18 | 46.2 | 620 | Cobalamin biosynthesis protein C
      chr_26[320001-420000].378 | 3963 | COG5099 | 5.00E-05 | 17 | 46.2 | 777 | RNA-binding protein of the Puf f
      chr_26[400001-441226].196 | 2949 | COG5099 | 2.00E-04 | 17 | 43.9 | 777 | RNA-binding protein of the Puf f
      chr_24[160001-260000].65 | 3542 | | | | | |
      chr_5[720001-820000].339 | 3141 | COG5099 | 4.00E-09 | 20 | 59.3 | 777 | RNA-binding protein of the Puf f
      chr_9[160001-260000].243 | 3002 | COG5077 | 1.00E-25 | 26 | 114 | 1089 | Ubiquitin carboxyl-terminal hydr
      chr_12[720001-820000].86 | 2895 | COG5032 | 2.00E-09 | 30 | 60.5 | 2105 | Phosphatidylinositol kinase and p
      chr_12[800001-900000].109 | 1463 | COG5032 | 1.00E-09 | 30 | 60.1 | 2105 | Phosphatidylinositol kinase and p
      chr_11[1-100000].70 | 2886 | | | | | |
      chr_11[80001-180000].100 | 1523 | | | | | |
      File: COGAnnotation_coastal_sample.txt
      id | query | hit | e_value | identity_ | score | query_start | query_end | hit_start | hit_end | hit_length
      1 | FHJ7DRN01A0TND.1 | COG0414 | 1.00E-08 | 28 | 51 | 1 | 74 | 180 | 257 | 285
      2 | FHJ7DRN01A1AD2.2 | COG0092 | 3.00E-20 | 47 | 89.9 | 6 | 85 | 41 | 120 | 233
      3 | FHJ7DRN01A2HWZ.4 | COG3889 | 0.0006 | 26 | 35.8 | 9 | 94 | 758 | 845 | 872
      …
      2853 | FHJ7DRN02HXTBY.5 | COG5077 | 7.00E-09 | 37 | 52.3 | 3 | 77 | 313 | 388 | 1089
      2854 | FHJ7DRN02HZO4J.2 | COG0444 | 2.00E-31 | 67 | 127 | 1 | 73 | 135 | 207 | 316
      …
      3566 | FHJ7DRN02FUJW3.1 | COG5032 | 1.00E-09 | 32 | 54.7 | 1 | 75 | 1965 | 2038 | 2105
      …
      Query: SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
  • 22. Data Science Workflow: 1) Preparing to run a model (gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging): “80% of the work” -- Aaron Kimball 2) Running the model 3) Interpreting the results: “The other 80% of the work”
  • 23. “[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used). In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants). So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format. I guess in total [I spent] 6 months [on this project].” -- Martin Kircher, Genome Sciences. At least 3 months on issues of scale, file handling, and feature engineering. Why? 3k NSF postdocs in 2010; $50k / postdoc; at least 50% overhead; maybe $75M annually at NSF alone?
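The back-of-the-envelope estimate at the end of this slide can be reproduced directly. This sketch assumes that "3k" means 3,000 postdocs, that $50k is an annual cost per postdoc, and that "50% overhead" means half of that cost goes to data handling; the variable names are illustrative.

```python
# Figures quoted on the slide.
nsf_postdocs = 3_000          # "3k NSF postdocs in 2010"
cost_per_postdoc = 50_000     # "$50k / postdoc" (assumed to be per year)
overhead_fraction = 0.5       # "at least 50% overhead"

annual_overhead = nsf_postdocs * cost_per_postdoc * overhead_fraction
print(f"Estimated annual data-handling overhead: ${annual_overhead / 1e6:.0f}M")
```

Because the slide says "at least 50%", the result is best read as a lower bound under these assumptions.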
  • 24. A typical Computer Science paper… [Bar chart: Benchmark 1 and Benchmark 2 for Old system, Your system, Our system; y-axis 0 to 120] slide src: Dan Halperin
  • 25. The reality of the situation… [Bar chart: Benchmark 1 and Benchmark 2 for Old system, Your system, Our system, and What people use; y-axis 0 to 12500] slide src: Dan Halperin
  • 26. A modest goal: Expose all the world’s science data through declarative query interfaces
  • 28. 1) Upload data “as is”: cloud-hosted, secure; no need to install or design a database; no pre-defined schema; schema inference; some integration 2) Write queries: right in your browser, writing views on top of views on top of views ... SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit ORDER BY cnt DESC 3) Share the results: make them public, tag them, share with specific colleagues – anyone with access can query http://sqlshare.escience.washington.edu
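The "views on top of views" workflow can be tried locally. This is a minimal sketch using Python's built-in `sqlite3` module rather than SQLShare itself. The table name `tigrfam_surface` and the `hit` column come from the slide; the `query` column, the sample rows, and the view names `hit_counts` and `frequent_hits` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "Upload" a small table as-is (hypothetical rows).
conn.execute("CREATE TABLE tigrfam_surface (query TEXT, hit TEXT)")
conn.executemany(
    "INSERT INTO tigrfam_surface VALUES (?, ?)",
    [("q1", "TIGR00001"), ("q2", "TIGR00001"), ("q3", "TIGR00001"),
     ("q4", "TIGR00002"), ("q5", "TIGR00002"), ("q6", "TIGR00003")],
)

# First view: the counting query from the slide, with the count aliased as cnt.
conn.execute(
    "CREATE VIEW hit_counts AS "
    "SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit"
)

# Second view, defined on top of the first one.
conn.execute(
    "CREATE VIEW frequent_hits AS SELECT hit, cnt FROM hit_counts WHERE cnt >= 2"
)

rows = conn.execute(
    "SELECT hit, cnt FROM frequent_hits ORDER BY cnt DESC"
).fetchall()
print(rows)
```

Each view is just a saved query, so a derived result can be shared and queried again without copying any data.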
  • 29. Non-programmers can write very complex queries (rather than relying on staff programmers). Example: Computing the overlaps of two sets of blast results. We see thousands of queries written by non-programmers.
      SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
           , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
           , w.category as nc_category
           , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
                    THEN x.end_bp - x.start_bp + 1
                  WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
                    THEN x.end_bp - w.start_bp + 1
                  WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
                    THEN w.end_bp - x.start_bp + 1
             END AS len_overlap
      FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
      INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
        ON x.chr = w.chr
      WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
         OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
         OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
      ORDER BY x.strain, x.chr ASC, x.start_bp ASC
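The CASE expression in this query computes the overlap length between two closed base-pair intervals, case by case. As a hedged illustration, and not the query author's own method, the same quantity can be written with a single min/max formula. The function name `overlap_length` and its parameters are hypothetical.

```python
def overlap_length(x_start, x_end, w_start, w_end):
    """Number of positions shared by the closed intervals
    [x_start, x_end] and [w_start, w_end]; 0 if they do not overlap."""
    return max(0, min(x_end, w_end) - max(x_start, w_start) + 1)


# x fully inside w: matches the first WHEN branch (x.end_bp - x.start_bp + 1).
inside = overlap_length(10, 20, 5, 30)
# w starts inside x and extends past it: matches x.end_bp - w.start_bp + 1.
right_overhang = overlap_length(10, 20, 15, 30)
# w ends inside x and starts before it: matches w.end_bp - x.start_bp + 1.
left_overhang = overlap_length(10, 20, 5, 12)
# Disjoint intervals share nothing.
disjoint = overlap_length(10, 20, 30, 40)
# w strictly inside x: the formula gives the length of w.
nested = overlap_length(10, 20, 12, 15)
```

The min/max form also handles the case where `w` lies strictly inside `x`. The query as transcribed appears to route that case through its second WHEN branch, which would return `x.end_bp - w.start_bp + 1` instead of the length of `w`.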
  • 30. Howe, et al., CISE 2013
  • 31. SQL as a lab notebook (Steven Roberts): http://bit.ly/16Xj2JP. Popular service for Bioinformatics Workflows. [Figure: two versions of a methylation workflow. Inputs: GFF of methylated CG locations, GFF of all CG locations, GFF of all genes, gene descriptions. Tasks: calculate # methylated CGs, calculate # all CGs, calculate methylation ratio, link methylation with gene description. Step labels: Join, Reorder columns, Count, Compute, Trim, Excel; one step is annotated “misstep: join w/ wrong fill”.]
  • 32. Halperin, Howe, et al. SSDBM 2013
  • 33. Two Problems with SQLShare • No help for truly big datasets • No help for “algorithmics” 33 Limitations of SQLShare
  • 34. Relational Algorithmics-as-a-Service Version 2 http://myria.cs.washington.edu
  • 35. Myria is… • MyriaQ: A compiler framework for multiple iterative RA-based languages and multiple big data back ends • MyriaX: A parallel, shared-nothing, iterative execution engine • MyriaWeb: A RESTful Analytics-as-a-Service platform and web-based interface
  • 36. Magda Balazinska, Bill Howe, and Dan Suciu Dan Halperin (technical lead) Victor Almeida Andrew Whitaker PhD Students Shumo Chu Eric Gribkoff Jeremy Hyrkas Paris Koutris Ryan Maas Dominik Moritz Laurel Orr Jennifer Ortiz Emad Soroush Jingjing Wang ShengLiang Xu Undergraduate Students Lee Lee Choo Vaspol Ruamviboonsuk Myria Team
  • 37. Myria Architecture Coordinator Language Parser Myria Compiler Logical Optimizer for RA+While REST Server Worker Catalog Catalog … json query plan netty protocols RDBMS jdbc Worker Catalog RDBMS jdbc Worker Catalog RDBMS jdbc MyriaX (Java) C Compiler Grappa Web UI MyriaQ (Python) HDFS HDFS HDFS Datalog SQL MyriaL REST SciDB
  • 38. MyriaQ: front-end languages (SQL, Datalog, MyriaL, ??) are compiled to Relational Algebra + Iteration, which is compiled in turn to back ends (Spark, Serial C++, Grappa, MyriaX, SQL). Application domains: Oceanography, Astronomy, Biology, Medical Informatics
  • 39. Laser Microscope Objective Pine Hole Lens Nozzle d1 d2 FSC (Forward scatter) Orange fluo Red fluo EX: SeaFlow Francois Ribalet Jarred Swalwell Ginger Armbrust
  • 40. Ex: SeaFlow. Continuous observations of various phytoplankton groups from 1-20 µm in size. Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton. Based on ORANGE fluo: Synechococcus, Cryptophytes. Based on FSC: Coccolithophores. [Figures: log-scale scatter plots with axes from 10^0 to 10^4; plot labels include ps3.fcs (focus; D1/FSC vs. D2/FSC), ps3.fcs (subset; FSC vs. 692-40 RED fluorescence; Picoplankton, Nanoplankton), and P35-surf (FSC vs. 580-30; Small Stuff, IS, Ultraplankton, Prochlorococcus)] Francois Ribalet, Jarred Swalwell, Ginger Armbrust
  • 42. SeaFlow in Myria • “That 5-line MyriaL program was 100x faster than my R cluster, and much simpler” Dan Halperin Sophie Clayton
  • 43. Why a big data middleware? 1) BD experiments are ridiculously labor-intensive – N systems x M real-world applications – Big clusters and big datasets 2) No “one size fits all” solution – Realistic environments will use more than one system 3) A return to distributed, federated databases – Erase the distinction between ETL and Analytics
  • 44. Pregel (Malewicz) Hadoop 2008 2009 2010 2011 2012 2013 2014 HaLoop (Bu) Spark (Zakaria) Vertica (Pavlo) ~100x faster SystemML (Ghoting) Hyracks (Borkar) GraphLab (Low) faster Cumulon (Huang) comparable or inconclusive Giraph (Tian) Dremel (Melnik) SimSQL (Cai) epiC (Jiang) Impala (Cloudera) Shark (Xin) HIVE (Thusoo) “The good old days” “The age of uncertainty”
  • 45. What can we conclude? Hadoop was probably just pretty bad. The rest of the story not so clear.
  • 46. Relational Algebra is the Calculus of Big Data • Hadoopspawn: Pig, HIVE, blah • Hadoop contemporaries: Cascalog, Flume, blah • Post-Hadoop: Spark/Shark, Dremel, blah • etc.
  • 47. HBase 7/10/2014 Bill Howe, UW 47 BigTable Dremel Tenzing 2004 Pregel Hadoop 2005 MapReduce 2006 2007 2008 2009 Spanner Megastore 2010 2011 2012 Google Big Data Systems non-Google open source implementation direct influence / shared features compatible implementation of SQL-like interface BigQuery
  • 48. Relational Algebra is the Calculus of Small Data • Galaxy – “bioinformatics workflows” • Pandas (Python): merge(left, right, on='key') • dplyr (R): filter(x), select(x), arrange(x), group_by(x), inner_join(x, y), left_join(x, y), … • Manimal, Pyxis/StatusQuo, others – Extract RA operators implemented manually in Java code “…Operate on Genomics Intervals -> Join”
  • 49. Key Idea: Algebraic Optimization. N = ((z*2)+((z*3)+0))/1. Algebraic laws: 1. (+) identity: x+0 = x 2. (/) identity: x/1 = x 3. (*) distributes: (n*x+n*y) = n*(x+y) 4. (*) commutes: x*y = y*x. Apply rules 1, 3, 4, 2: N = (2+3)*z, which is two operations instead of five, with no division operator. The same idea works with the relational algebra!
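The rewrite on this slide can be sketched as a tiny rule-based optimizer. This is only an illustration under assumptions: the tuple encoding of expressions and the names `simplify`, `count_ops`, and `evaluate` are hypothetical and are not part of Myria.

```python
import operator

# An expression is either a leaf (a number or a variable name) or a
# tuple (op, left, right). This encoding is an assumption for the sketch.
OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def simplify(e):
    """Apply the four algebraic laws from the slide, bottom-up."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    a, b = simplify(a), simplify(b)
    if op == "+" and b == 0:          # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:          # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == "*" and b[0] == "*" and a[1] == b[1]):
        # law 3: n*x + n*y = n*(x+y), then law 4 to write it as (x+y)*n
        return ("*", ("+", a[2], b[2]), a[1])
    return (op, a, b)


def count_ops(e):
    """Number of operators in the expression tree."""
    if not isinstance(e, tuple):
        return 0
    return 1 + count_ops(e[1]) + count_ops(e[2])


def evaluate(e, env):
    """Evaluate an expression given variable bindings in env."""
    if isinstance(e, tuple):
        return OPS[e[0]](evaluate(e[1], env), evaluate(e[2], env))
    return env[e] if isinstance(e, str) else e


# N = ((z*2) + ((z*3) + 0)) / 1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(original)   # ("*", ("+", 2, 3), "z"), i.e. (2+3)*z
```

The optimized tree computes the same value as the original for any `z` while using fewer operators, which is the property a relational optimizer relies on when it rewrites query plans.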
  • 50. A closer look at an example. ROI(id, start, stop) is a set of “regions of interest”. Read(id, start, stop) is a set of “reads” from a sequencer. Task: for each region of interest, count the number of reads it contains. [Figure: two intervals, each labeled with start and stop]
  • 51. As a query: SELECT roi.id, count(rd.id) FROM regions_of_interest roi, reads rd WHERE roi.start <= rd.start AND rd.[end] <= roi.[end] GROUP BY roi.id [Figure labels: “region of interest”, sequence “read”]
  • 52. Why databases get a bad reputation. Original query (many minutes): SELECT roi.id, count(rd.start) FROM regions_of_interest roi, reads rd WHERE roi.start <= rd.start AND rd.[end] <= roi.[end] GROUP BY roi.id. Rewritten query (3 seconds!): SELECT roi.id, count(rd.start) as cnt FROM regions_of_interest roi, indexed_reads rd WHERE roi.start <= rd.start AND rd.start <= roi.[end] AND roi.start <= rd.[end] AND rd.[end] <= roi.[end] GROUP BY roi.id. [Figure labels: roi, read; “two-sided index scan”; “one-sided index scan, plus filter”] The broken promise of declarative query…
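One way to see why redundant bounds can help is to mimic an index on read start with a sorted list. The sketch below is an assumption-laden illustration, not the database's actual plan: the function names are hypothetical, regions are `(id, start, end)` tuples, reads are `(start, end)` tuples, and every read is assumed to satisfy `start <= end`, which is what makes the extra bound `rd.start <= roi.end` redundant.

```python
from bisect import bisect_left, bisect_right


def count_naive(rois, reads):
    """Nested-loop evaluation of the containment predicate."""
    return {
        rid: sum(1 for (s, e) in reads if r_start <= s and e <= r_end)
        for (rid, r_start, r_end) in rois
    }


def count_indexed(rois, reads):
    """Bound read.start on both sides so a sorted 'index' can be range-scanned."""
    by_start = sorted(reads)                  # stands in for an index on start
    starts = [s for (s, _) in by_start]
    counts = {}
    for rid, r_start, r_end in rois:
        lo = bisect_left(starts, r_start)     # first read with start >= roi.start
        hi = bisect_right(starts, r_end)      # one past last read with start <= roi.end
        # Only the reads inside the range still need the end-point check.
        counts[rid] = sum(1 for (_, e) in by_start[lo:hi] if e <= r_end)
    return counts


rois = [(1, 10, 20), (2, 0, 5)]
reads = [(10, 15), (12, 25), (0, 5), (3, 4), (18, 20), (21, 22)]
```

Both functions return the same counts; the indexed version simply touches fewer reads per region, which is the effect the extra predicates are meant to unlock.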
  • 54. Giving users insight Shumo Chu Dominik Moritz
  • 56. K-Means in the language MyriaL
      A = LOAD('points.txt', id:int, x:float, y:float);
      E = LIMIT(A, 4);
      F = SEQUENCE();
      Centroids = [FROM E EMIT (cluster_id=F.next, x=E.x, y=E.y)];
      Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)];
      DO
        I = CROSS(Kmeans, Centroids);
        J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id, distance=$distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
        K = [FROM J EMIT id, distance=$min(distance)];
        L = JOIN(J, id, K, id);
        M = [FROM L WHERE J.distance <= K.distance EMIT (id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
        Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
        Delta = DIFF(Kmeans', Kmeans);
        Kmeans = Kmeans';
        Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
      WHILE Delta != {}
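For readers who do not know MyriaL, the same loop can be sketched in plain Python. This is a rough analogue under assumptions, not a translation the Myria team provides: the function name `kmeans`, the `max_iter` guard, and the use of squared Euclidean distance are choices made for this sketch.

```python
def kmeans(points, k, max_iter=100):
    """points: list of (id, x, y) tuples. Returns {id: cluster_id}."""
    # Initial centroids: the first k points, numbered 0..k-1
    # (the role played by LIMIT and SEQUENCE on the slide).
    centroids = {c: (x, y) for c, (_, x, y) in enumerate(points[:k])}
    assign = {pid: 0 for (pid, _, _) in points}
    for _ in range(max_iter):
        # Cross every point with every centroid and keep the nearest one;
        # ties go to the smallest cluster_id, like $min(cluster_id).
        new_assign = {
            pid: min(
                centroids,
                key=lambda c: ((x - centroids[c][0]) ** 2
                               + (y - centroids[c][1]) ** 2, c),
            )
            for (pid, x, y) in points
        }
        if new_assign == assign:      # the Delta relation is empty: converged
            break
        assign = new_assign
        # Group by cluster_id and average x and y to get the new centroids.
        sums = {}
        for pid, x, y in points:
            sx, sy, n = sums.get(assign[pid], (0.0, 0.0, 0))
            sums[assign[pid]] = (sx + x, sy + y, n + 1)
        centroids = {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}
    return assign


points = [(1, 0.0, 0.0), (2, 10.0, 10.0), (3, 0.0, 1.0), (4, 10.0, 11.0)]
clusters = kmeans(points, k=2)
```

Each step maps onto a relational operator (cross product, aggregate, join, difference), which is why the algorithm fits the "loops + relational algebra" model.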
  • 57. Sigma-clipping, V0
      CurGood = SCAN(public:adhoc:sc_points);
      DO
        mean = [FROM CurGood EMIT val=AVG(v)];
        std = [FROM CurGood EMIT val=STDEV(v)];
        NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
        CurGood = CurGood - NewBad;
        continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
      WHILE continue;
      DUMP(CurGood);
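The V0 loop recomputes the mean and standard deviation over all surviving points on every pass. A minimal Python sketch of that behavior follows; the function name `sigma_clip` and the parameter `k` (set to 2 to match the slide's `2 * std`) are assumptions for illustration.

```python
from statistics import mean, stdev


def sigma_clip(values, k=2.0):
    """Repeatedly drop points more than k sample standard deviations from the mean."""
    good = list(values)
    while len(good) > 1:
        m, s = mean(good), stdev(good)     # full recomputation each pass
        kept = [v for v in good if abs(v - m) <= k * s]
        if len(kept) == len(good):         # nothing new was rejected
            break
        good = kept
    return good


data = [10.0, 11.0, 9.0, 10.0, 10.5, 9.5, 10.0, 11.0, 9.0, 10.0, 100.0]
clipped = sigma_clip(data)
```

On this sample the outlier 100.0 is rejected in the first pass and the remaining ten points survive the second.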
  • 58. Sigma-clipping, V1: Incremental
      CurGood = P;
      sum = [FROM CurGood EMIT SUM(val)];
      sumsq = [FROM CurGood EMIT SUM(val*val)];
      cnt = [FROM CurGood EMIT CNT(*)];
      NewBad = [];
      DO
        sum = sum - [FROM NewBad EMIT SUM(val)];
        sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
        cnt = cnt - [FROM NewBad EMIT CNT(*)];
        mean = sum / cnt;
        std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum));
        NewBad = FILTER([ABS(val-mean)>std], CurGood);
        CurGood = CurGood - NewBad;
      WHILE NewBad != {}
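The incremental idea is to keep running totals (sum, sum of squares, count) and subtract only the contribution of the newly rejected points, rather than re-aggregating everything. The sketch below assumes a threshold parameter `k` (2 here, as in V0 and V2) and a hypothetical function name; it also clamps the variance at zero to guard against floating-point round-off, which the slide does not address.

```python
import math


def sigma_clip_incremental(values, k=2.0):
    """Sigma-clipping that updates running aggregates instead of recomputing them."""
    good = list(values)
    total = sum(good)
    total_sq = sum(v * v for v in good)
    n = len(good)
    new_bad = []
    while True:
        # Remove only the contribution of the points rejected last round.
        total -= sum(new_bad)
        total_sq -= sum(v * v for v in new_bad)
        n -= len(new_bad)
        if n < 2:
            break
        m = total / n
        # Sample variance from the aggregates: (n*sumsq - sum^2) / (n*(n-1)).
        var = max((n * total_sq - total * total) / (n * (n - 1)), 0.0)
        s = math.sqrt(var)
        new_bad = [v for v in good if abs(v - m) > k * s]
        if not new_bad:
            break
        good = [v for v in good if abs(v - m) <= k * s]
    return good


data = [10.0, 11.0, 9.0, 10.0, 10.5, 9.5, 10.0, 11.0, 9.0, 10.0, 100.0]
clipped = sigma_clip_incremental(data)
```

The work per iteration for the aggregates is proportional to the number of rejected points, not to the size of the surviving set; the filter itself still scans the survivors, which is what the V2 bounds-based version narrows further.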
  • 59. Sigma-clipping, V2
      Points = SCAN(public:adhoc:sc_points);
      aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
      newBad = []
      bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
      DO
        new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
        aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum, sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
        stats = [FROM aggs EMIT mean=_sum/cnt, std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
        newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
        tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v AND v >= bounds.lower EMIT v=Points.v];
        tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v AND v <= bounds.upper EMIT v=Points.v];
        newBad = UNIONALL(tooLow, tooHigh);
        bounds = newBounds;
        continue = [FROM newBad EMIT COUNT(v) > 0];
      WHILE continue;
      output = [FROM Points, bounds WHERE Points.v > bounds.lower AND Points.v < bounds.upper EMIT v=Points.v];
      DUMP(output);
  • 60. “Relational Algorithmics” • Hypothesis: Loops + RA covers everything anyone wants to do – and it scales, it’s optimizable, and it’s accessible • We can smooth the ROI curve for novices – Start with simple queries… – …end up working on advanced parallel algorithms • “White Box Analytics” – Compose queries, inspect plans, monitoring, debugging, “UDRs” (user-defined optimization rules) • Multiple languages, multiple backends, one data/query model – Ask me about graph data – Ask me about array data (or, rather, mesh data)
  • 61. Takeaways • We hope to see “Data Science Environments” at universities worldwide – We try to make our programs and activities reusable • Software-as-a-service to reach the “long tail” of science • “Relational Algorithmics” – The relational algebra is the calculus of big data – “It’s not just for databases anymore” – Learn it, use it, teach it – Myria is a platform for “relational algorithmics” http://escience.washington.edu @billghowe billhowe@cs.washington.edu
  • 63. Maslow’s Needs Hierarchy “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow, 1943
  • 64. A “Needs Hierarchy” of Science Data Management [Figure: hierarchy levels, in slide order: storage, sharing, query, integration, analytics] “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow, 1943
  • 65. A “Needs Hierarchy” of Science Data Management [Figure: hierarchy levels, in slide order: storage, sharing, integration, query, analytics] “As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning.” -- Maslow, 1943