Human Genetics & Big Data [sans Ethics]
2. © 2014 MapR Technologies 2
Biomedical Research Goal: Improve Fitness
Therapeutics => Diagnostics => Prognostics
• Therapeutics => traditional medicine
• Diagnostics => personalized medicine
– NextGen public health
– Requires hi-res mechanistic knowledge
– Reverse engineer how genetic variation leads to (un)desired traits
• Prognostics => GATTACA (dys/eu)topia
– Managed populations / NextGen eugenics
3. © 2014 MapR Technologies 3
Biomedical & Advertising Tech Overarching Themes*
*Obligatory movie references… shout-out to my hometown LA
Eugenics & Determinism Free will vs. Determinism Media Tech & Privacy
4. © 2014 MapR Technologies 4
Star Wars III: Revenge of the Sith
5. © 2014 MapR Technologies 5
Star Wars V: The Empire Strikes Back
6. © 2014 MapR Technologies 6
Health ~ Fitness
Genes => Traits => Behaviors => Fitness
8. © 2014 MapR Technologies 8
Human Genetics & Big Data
Human Genetics & Ethics
Today we talk about technology
9. © 2014 MapR Technologies 9
Me, Us
• Allen Day, Principal Data Scientist, MapR
5yr Hadoop Dev, R project contributor
PhD, Human Genetics, UCLA Medicine
• MapR
Distributes open source components for Hadoop
Adds major technology for performance, HA, industry standard APIs
• See Also
– “allenday” most places (twitter, github, etc.)
– @mapR
10. © 2014 MapR Technologies 10
Genetic Basis of Facial Features
self-reported values of {sex, ancestry}
+ observer scores {race, sex}
+ 3D facial scan
+ genome scan
______________________________
Allelic model of 20 genes that
determine facial characteristics
Claes, et al. 2014. Modeling 3D Facial Shape from DNA
11. © 2014 MapR Technologies 11
Genetic Basis of Facial Features
Claes, et al. 2014. Modeling 3D Facial Shape from DNA
12. © 2014 MapR Technologies 12
So Get Ready…
www.theness.com
13. © 2014 MapR Technologies 13
DTRA102-007 – Forensic DNA
Analysis Kit for Genetic Intelligence
• Sex
• Blood type
• Ancestry
• Hair morphology
• Dimples
• Freckles
• Shoe size
• Flat-footedness
• Vision correction
• Ear lobe attachment
• Ear lobe crease
• 5th digit clinodactyly
• Eye color, hair color, skin
color
• Height, handedness
• Etc
https://sbirsource.com/grantiq#/topics/85383
14. © 2014 MapR Technologies 14
DTRA102-007: Sex and Ancestry
15. © 2014 MapR Technologies 15
Trends & Events
16. © 2014 MapR Technologies 16
Trends and Events: Even Moore’s Law
Stein. 2010. The case for cloud computing in genome informatics
“Even Moore’s” begins in 2004
with Solexa (acquired by ILMN 2007)
Storage:MB/$
DNA:bp/$
ILMN HiSeq XTen
(Jan 2014)
$1000 Genome
17. © 2014 MapR Technologies 17
NIH Research Funding Trends.
http://www.faseb.org/Policy-and-Government-Affairs/Data-Compilations/NIH-Research-Funding-Trends.aspx
Trends and Events: US Federal Funding
(chart annotation: You are here)
18. © 2014 MapR Technologies 18
More Data
Less Federal $
19. © 2014 MapR Technologies 19
Trends and Events: The $1000 Genome
• Physicians want to use patient genomes to improve care
• Scientists say personalized medicine breakthroughs require
hundreds of thousands to millions of genomes
• Healthcare mandates efficacy and efficiency (early majority)
These forces converge at $1000 for a clinically usable genome
20. © 2014 MapR Technologies 20
Trends and Events: ILMN HiSeq XTen Specs
• Sold in sets of 10 units ONLY (XTen =10 sequencers)
~ $10 million/XTen, shipments began in Jan 2014
• XTen produces 600 GBases/day @ 30x oversampling
= 1.8 TBases per 3-day cycle
= 54 TBytes per 3-day cycle
= $1000 per genome
= 18,000 genomes/year/XTen
~ 4,000,000 births/year (US, 2012)
Neonatal sequencing is a reality (with ~220 of today’s systems)
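The arithmetic behind the neonatal-sequencing claim can be checked with a short calculation. This is only a sketch that takes the slide's own figures at face value (18,000 genomes/year per XTen, about 4,000,000 US births per year, $1000 per genome); the variable names are illustrative.

```python
import math

# Figures quoted on the slide (taken at face value, not independently verified).
GENOMES_PER_YEAR_PER_XTEN = 18_000
US_BIRTHS_PER_YEAR = 4_000_000      # US, 2012
COST_PER_GENOME_USD = 1_000

# Number of XTen installations needed to sequence every newborn once.
systems_needed = math.ceil(US_BIRTHS_PER_YEAR / GENOMES_PER_YEAR_PER_XTEN)

# Sequencing cost per year at the slide's per-genome price.
annual_cost_usd = US_BIRTHS_PER_YEAR * COST_PER_GENOME_USD

print(systems_needed)    # 223
print(annual_cost_usd)   # 4000000000
```

Under these assumptions the answer is about 223 XTen installations, the same order of magnitude as the slide's estimate.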
21. © 2014 MapR Technologies 21
Summary: Major Impact on Social Fabric
Soon to be gone:
• Muscular dystrophy
• Cystic fibrosis
• Albinism
• PKU (phenylketonuria)
• Paternity Tests =>
http://pandawhale.com/post/13851/my-report-card-came-in-my-paternity-test-came-in
http://www.nature.com/scitable/topicpage/rare-genetic-disorders-learning-about-genetic-disease-979
• Hemophilia
• Huntington’s Disease (keep?)
Fact: US paternity fraud
rate is 1 in 25
22. © 2014 MapR Technologies 22
Summary: A Perfect Storm
• LESS public funding (NIH)
• MORE DNA sequencing efficiency (HiSeq XTen)
• Predicted DNA sequencing demand VALIDATED (medicine)
• MORE VC investment ($1000/genome force confluence)
• DNA sequencing capacity consolidating into genome “factories” (e.g.
Broad, ILMN) => REQUIRES new infrastructure
23. © 2014 MapR Technologies 23
The Evolving Genomics Workload
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
<= 1º analytics
“current high ROI use cases”
<= 2º analytics
“next-gen high ROI use cases”
25. © 2014 MapR Technologies 25
Clinical Application of Human Genetics
26. © 2014 MapR Technologies 26
Clinical Sequencing Business Process Workflow
Physician, Patient
Clinic
blood/saliva
Clinical Lab
Analytics
extract
27. © 2014 MapR Technologies 27
One Bad MTHFR
MTHFR C677T
Methylfolate helps make neurotransmitters in
your brain. When methylfolate levels are low,
so are your neurotransmitters. Low production
of neurotransmitters may cause conditions of
addictive behavior, depression, anxiety,
ADHD, mania, irritability, insomnia, learning
disorders and others.
Everyone should get tested. Why? Because 1
in 2 people are affected and if one knows they
have a MTHFR polymorphism, they know they
have to be very proactive in taking care of
themselves.
http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
29. © 2014 MapR Technologies 29
Clinical Sequencing Business Process Workflow
Physician, Patient
Clinic
blood/saliva
Clinical Lab
Analytics
extract
30. © 2014 MapR Technologies 30
Clinical Genomics, Information Systems Perspective
Compressed Structured
Base4 Data
Uncompressed Unstructured
Base2 Data
extract
Base4=>Base2
Converter
[[ DE-STRUCTURES ]]
“BI” Reporting and
Visualization tools
Physician, Patient
Analyst, Stakeholder
31. © 2014 MapR Technologies 31
Clinical Genomics, Information Systems Perspective
Physician, Patient
Analyst, Stakeholder
ETL
Reporting and Viz
Data Store
Analytics
32. © 2014 MapR Technologies 32
Clinical Genomics, Information Systems Perspective
Physician, Patient
Analyst, Stakeholder
ETL
Reporting and Viz
Data Store
Analytics
1º analytics
2º analytics
Not much in this presentation,
see also:
http://slidesha.re/1sC2BOX
33. © 2014 MapR Technologies 33
1º Analytics: Why MapReduce?
34. © 2014 MapR Technologies 34
Clinical Genomics, Information Systems Perspective
Physician, Patient
Analyst, Stakeholder
ETL
Reporting and Viz
Data Store
Analytics
1º analytics
2º analytics
see also:
http://slidesha.re/1sC2BOX
35. © 2014 MapR Technologies 35
The Essence of the Problem:
What is the (Probable) Color of Each Column?
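One way to read this slide, assuming each row of the picture is a sequencing read aligned to a reference and each column is a genomic position, is as a per-column vote that fits the map/reduce pattern. The sketch below rests on that assumption: the reads and offsets are invented for illustration, and a simple majority stands in for the quality-weighted models real variant callers use.

```python
from collections import Counter, defaultdict

# Hypothetical aligned reads as (start position, bases). Not data from the talk.
reads = [
    (0, "ACGT"),
    (1, "CGTA"),
    (2, "GTTA"),   # disagrees with the other reads at position 4
    (2, "GTAA"),
]

def map_read(start, bases):
    """Map step: emit one (position, base) pair per base in the read."""
    return [(start + i, base) for i, base in enumerate(bases)]

def reduce_column(bases):
    """Reduce step: most common base in a column and its observed fraction."""
    base, count = Counter(bases).most_common(1)[0]
    return base, count / len(bases)

# Shuffle step: group emitted bases by position (the "column").
columns = defaultdict(list)
for start, bases in reads:
    for pos, base in map_read(start, bases):
        columns[pos].append(base)

consensus = {pos: reduce_column(col) for pos, col in sorted(columns.items())}
print("".join(base for base, _ in consensus.values()))  # ACGTAA
```

Because each column is reduced independently, the work partitions naturally by position, which is one reason this workload suits MapReduce.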
36. © 2014 MapR Technologies 36
Next-Gen Human Genetics – Population Scale
37. © 2014 MapR Technologies 37
The Evolving Genomics Workload
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
<= 2º analytics
“next-gen high ROI use cases”
38. © 2014 MapR Technologies 38
MapR Data Platform Advantage, Clinical Genomics
Epidemiological, Actuarial Analyses
Denormalization for Search, Viz, Research
ETL
Clinical Reporting
WEB TIER
Clinical Reporting Systems
CLINICAL TREATMENT OF PATIENTS
RESEARCHERS
National Pop. Database
INDEX SHARDS
Prognostic Capability
39. © 2014 MapR Technologies 39
Co-expression (10K samples) and Linkage
Gene Annotation / Set Completion
[Figure: gene labels shown on the plot axes: BMP6, BMP2, MMP3, LIF, NOS2A, MMP13, CSPG4, ACAN, COL11A2, COL9A1, MATN1, LECT1, MATN4, HAPLN1, ITGA10, EDIL3, NGF, MAST4, MATN3, EPYC, COL11A1, COL10A1, THBS3, C1QTNF3, WISP1, PDPN, PDLIM4, CHST3, MIA, SOX5, CYTL1, TNMD, AKR1C1, MMP12, ETNK1, RELA, FOSL1, EIF2C2, NUPL1, RLF, RELB, SOD2, RNF24, XYLT1, HAS2, BDKRB1, HSPC159, SLC28A3, FZD10]
Disease gene characterization through large-scale co-expression analysis.
http://www.ncbi.nlm.nih.gov/pubmed/20046828
40. © 2014 MapR Technologies 40
If they were unlabeled, would you know which is which?
Friend. 2010. The Need for Precompetitive
Integrative Bionetwork Disease Model Building
NPR. 2011. The Search For Analysts To Make Sense Of ‘Big Data’
http://www.npr.org/2011/11/30/142893065
41. © 2014 MapR Technologies 41
If they were unlabeled, would you know which is which?
Friend. 2010. The Need for Precompetitive
Integrative Bionetwork Disease Model Building
• Identify network structures
• Label them
• Observe
stimulus=>response
space mapping
• Purposefully target
• $$$$ Twitter’s Business
Model
42. © 2014 MapR Technologies 42
These are Linear Algebra / Machine Learning Problems
43. © 2014 MapR Technologies 43
A Quick Digression: Recommender Systems
44. © 2014 MapR Technologies 44
HOW RECOMMENDATIONS WORK
Behavior of a crowd
helps us understand
what individuals will do
45. © 2014 MapR Technologies 45
History Matrix (A)
Alice
Bob
Charles
✔ ✔ ✔
✔ ✔
✔ ✔
46. © 2014 MapR Technologies 46
Co-occurrence Matrix (AᵀA)
1 2
1 1
1
1
2 1
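The step from a history matrix to a co-occurrence matrix is a single matrix product: the transpose of A times A. The sketch below shows this in plain Python. The check-mark positions from the slide did not survive extraction, so the matrix entries and item names here are invented for illustration.

```python
# Hypothetical history matrix A: rows are users, columns are items.
# These values are assumptions, not the matrix shown on the slide.
users = ["Alice", "Bob", "Charles"]
items = ["item0", "item1", "item2", "item3"]
A = [
    [1, 1, 0, 1],  # Alice
    [0, 1, 1, 0],  # Bob
    [1, 0, 0, 1],  # Charles
]

def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain-Python matrix product of x (n x k) and y (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

# Item-by-item co-occurrence: entry [i][j] counts users who touched both items.
cooccurrence = matmul(transpose(A), A)
for row in cooccurrence:
    print(row)
```

The diagonal counts how many users touched each item; the off-diagonal entries count shared users, which is the raw signal a recommender filters in the next step.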
47. © 2014 MapR Technologies 47
<Normalize>
(filter to identify only unusual co-occurrences)
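The slide does not name the statistic used for this filter. One common choice for flagging unusual co-occurrences is the log-likelihood ratio (G²) on a 2x2 contingency table, and the sketch below assumes that choice. The function names and example counts are illustrative, not taken from the talk.

```python
import math

def x_log_x(x):
    """x * ln(x), with the convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table.

    k11: users with both items, k12: first item only,
    k21: second item only,     k22: neither item.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

print(llr(10, 10, 10, 10))  # independent items score (near) zero
print(llr(10, 0, 0, 10))    # perfectly associated items score high
```

Pairs whose score exceeds a chosen threshold would be kept as indicators; the threshold is a tuning decision not specified in the slides.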
48. © 2014 MapR Technologies 48
HOW CROSS-RECOMMENDATIONS
WORK
Behavior of a crowd
helps us understand
what individuals will do
49. © 2014 MapR Technologies 49
Example Multi-modal Inputs
• Overlap in restaurant visits is useful
• Big spender cues
• Cuisine as an indicator
• Review text as an indicator
50. © 2014 MapR Technologies 50
People do more than one kind of thing
• Different kinds of behaviors give different quality, quantity and
kind of information
– Restaurant visits
– Movie reviews
• We don’t have to do co-occurrence
• We can do cross-occurrence
• Result is cross-recommendation
51. © 2014 MapR Technologies 51
For example
• Users enter queries (A)
– (actor = user, item=query)
• Users view videos (B)
– (actor = user, item=video)
• AᵀA gives query recommendation
– “did you mean to ask for”
• BᵀB gives video recommendation
– “you might like these videos”
52. © 2014 MapR Technologies 52
The punch-line
• BᵀA recommends videos in response to a query
– (isn’t that a search engine?)
– (not quite, it doesn’t look at content or meta-data)
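The cross-occurrence product can be sketched the same way as co-occurrence. Here A is users x queries and B is users x videos, so the transpose of B times A is videos x queries. All data, names, and the `recommend` helper are hypothetical, and raw counts are used where a production system would first filter for unusual cross-occurrences.

```python
# Hypothetical behavior matrices sharing the same user rows.
queries = ["flamenco", "guitar lesson"]
videos = ["video0", "video1", "video2"]
A = [[1, 0], [1, 1], [0, 1]]            # users x queries
B = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]   # users x videos

def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain-Python matrix product of x (n x k) and y (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

# cross[v][q] counts users who both issued query q and viewed video v.
cross = matmul(transpose(B), A)

def recommend(query):
    """Rank videos by cross-occurrence with the query, dropping zero scores."""
    q = queries.index(query)
    scored = [(cross[v][q], videos[v]) for v in range(len(videos))]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [name for score, name in ranked if score > 0]

print(recommend("flamenco"))       # ['video0', 'video1']
print(recommend("guitar lesson"))  # ['video1', 'video0', 'video2']
```

Nothing in this ranking looks at video titles or metadata; it relies only on shared user behavior, which is the distinction the slide draws from a conventional search engine.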
53. © 2014 MapR Technologies 53
Real-life example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres del paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
55. © 2014 MapR Technologies 55
Previous Click Histories
user1
user2
user3
user4
user5
1 2 3 4 5 6 7 8
56. © 2014 MapR Technologies 56
Detect similar content: 2 & 8
user1
user2
user3
user4
user5
1 2 3 4 5 6 7 8
57. © 2014 MapR Technologies 57
Call to Action – Request Clicks
user1
user2
user3
user4
user5
Show me more:
sports
comedy
technology
1 2 3 4 5 6 7 8
“Under
Construction”
58. © 2014 MapR Technologies 58
Build Navigational Ontology (estimate content labels):
4=sports ; 2 & 7=comedy
user1
user2
user3
user4
user5
Show me more:
sports
comedy
technology
1 2 3 4 5 6 7 8
4
2 & 7
Under
construction
59. © 2014 MapR Technologies 59
Matrices A (U*Q) and B (U*V)
Query Term = Clicked Term
Users
Query Terms
Users
Clicked Videos
60. © 2014 MapR Technologies 60
Relate Q to V
Users
Query Terms
63. © 2014 MapR Technologies 63
Relate Q to V: it’s a Cross-Recommender
QueryTerms
Videos
64. © 2014 MapR Technologies 64
Population-level Inference
65. © 2014 MapR Technologies 65
Typical Dimensions in Genetics/Medicine
• Genotype
• Gene Expression
• Samples
• Phenotypes (traits/behavior)
66. © 2014 MapR Technologies 66
Typical Dimensions in Behavioral Data
• Genotype
• Gene Expression
• Individuals (in place of Samples)
• Phenotype
– Traits
– Behaviors
67. © 2014 MapR Technologies 67
Incidence/Co-occurrence in Behavioral Data
• Individual * Individual
– Genealogy
• Trait * Behavior => [Netflix]
– User/Content Topic Modeling
• Genotype * Behavior => [Psychometrics]
– Genetics of personality, intelligence, aptitude
• Behavior * Outcome => [Korn-Ferry]
– Job effectiveness
• Phenotype (trait/behavior) * Outcome => [eHarmony]
– Reproductive fitness
68. © 2014 MapR Technologies 68
Traits and Behaviors:
Content Topic Modeling / UX Personalization
69. © 2014 MapR Technologies 69
Behaviors and Outcomes:
Economic Fitness (Korn/Ferry)
Korn/Ferry ProSpective
http://linkedin.kornferry.com
Allen
=>
71. © 2014 MapR Technologies 71
(Traits/Behaviors) and Outcomes
Reproductive Fitness (eHarmony)
eHarmony @ Hadoop World: Data Science of Love
http://eharmony.com
72. © 2014 MapR Technologies 72
Genes
Reproductive
Outcomes
73. © 2014 MapR Technologies 73
Genes => Traits => Behaviors => Fitness
Job Performance
Psychometrics
Movie Preferences
Medicine
Forensics
74. © 2014 MapR Technologies 74
Genes => Traits => Behaviors => Fitness
Job Performance
Psychometrics
Movie Preferences
Medicine
Forensics
Fitness
Reproductive Outcomes
76. © 2014 MapR Technologies 76
ENCODE
http://www.nature.com/news/encode-the-human-encyclopaedia-1.11312
77. © 2014 MapR Technologies 77
Robot Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
78. © 2014 MapR Technologies 78
Robot (Data?) Scientist
Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
79. © 2014 MapR Technologies 79
Thanks
Editor's Notes
The genomic position (x-axis) of probesets within a 6-megabase region centered on TTN, a gene known to be associated with LGMD2, is plotted against the Pearson correlation coefficient (y-axis) to a list of probesets targeting other genes known to be associated with LGMD2 (excluding TTN) across 11,636 HG-U133_Plus_2 microarrays. Solid circles: probesets targeting TTN; [marker symbol missing]: probesets for genes of unknown function; open circles: probesets for known genes in the interval.
Allen: this is the transitional slide from talking about more than one input to one step further: cross-recommendation. I doubt you want to use it as is, but I’ve included it FYI.
Allen: additional transitional slide.
Allen: What do you plan to say about this? General example without anything proprietary?