1. THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE
Structure:Data 2013
SPEAKER: Matt Wood
Principal Data Scientist
Amazon Web Services
Monday, April 1, 13
2. The Missing Manual:
Reproduce, Reuse, Remix
Dr. Matt Wood
matthew@amazon.com
@mza
7. Generation
Collection & storage
Analytics & computation
Collaboration & sharing
10. Amazing data generators: cell phones tracking cholera in Haiti
Linus Bengtsson et al. PLoS Medicine, 2011
11. Amazing data generators: social networks tracking influenza
You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011
24. Generation
Collection & storage
Analytics & computation
Collaboration & sharing
35. Generation
Collection & storage
Analytics & computation
Collaboration & sharing
52. How do we get from
here to there?
5 PRINCIPLES OF REPRODUCIBILITY
53. 5 PRINCIPLES OF REPRODUCIBILITY
54. 5 PRINCIPLES OF REPRODUCIBILITY
1. Data has gravity
77. More data,
more users,
more uses,
more locations
84. 5 PRINCIPLES OF REPRODUCIBILITY
85. 5 PRINCIPLES OF REPRODUCIBILITY
2. Ease of use is a prerequisite
104. 1000 Genomes
Project
Cloud BioLinux
106. 1000 Genomes
Project + your
genomic data
Illumina Basespace
107. Cassandra Aegisthus Hadoop, Hive, Pig
Amazon S3
Legacy data warehousing
http://www.youtube.com/watch?v=oGcZ7WVx6EI
108. Sting
Microstrategy
R
Cassandra Aegisthus Hadoop, Hive, Pig
Amazon S3
Legacy data warehousing
http://www.youtube.com/watch?v=oGcZ7WVx6EI
110. 5 PRINCIPLES OF REPRODUCIBILITY
111. 5 PRINCIPLES OF REPRODUCIBILITY
3. Reuse is as important as reproduction
112. Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics
121. Fire-and-forget reproduction
is a good first step, but it limits
longer-term value.
138. 5 PRINCIPLES OF REPRODUCIBILITY
139. 5 PRINCIPLES OF REPRODUCIBILITY
4. Build for collaboration
150. Code + AMI +
custom datasets + public datasets +
databases + compute + result data
154. 5 PRINCIPLES OF REPRODUCIBILITY
155. 5 PRINCIPLES OF REPRODUCIBILITY
5. Provenance is a first class object
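One possible reading of this principle, sketched under assumptions: the record of how a result was produced (code version, machine image, input datasets, parameters) is stored as data next to the result itself. The slides do not show an implementation; the names ProvenanceRecord and attach_provenance, and every field below, are hypothetical and chosen only for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record of the inputs needed to regenerate one result."""
    code_version: str        # e.g. a version-control commit identifier
    machine_image: str       # e.g. a machine image identifier (placeholder values only)
    input_datasets: tuple    # names or URIs of custom and public datasets
    parameters: tuple = ()   # (name, value) pairs used for the run

    def fingerprint(self) -> str:
        """Return a SHA-256 digest of the record, stable for equal contents."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attach_provenance(result: dict, record: ProvenanceRecord) -> dict:
    """Return the result bundled with its provenance and a provenance id."""
    return {
        "result": result,
        "provenance": asdict(record),
        "provenance_id": record.fingerprint(),
    }
```

Two runs that share code, image, datasets, and parameters get the same provenance_id, which gives a simple way to tell whether a stored result can be reused or has to be regenerated.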
166. 5 PRINCIPLES OF REPRODUCIBILITY
167. 5 PRINCIPLES OF REPRODUCIBILITY
1. Data has gravity
2. Ease of use is a prerequisite
3. Reuse is as important as reproduction
4. Build for collaboration
5. Provenance is a first class object
169. Thank you
matthew@amazon.com
aws.amazon.com
@mza