Technologies, methods and challenges to data sharing and aggregation

Presentation to the "Best Practices in Personalized Medicine" (B2PM) Conference, Vancouver, BC, Canada, March 2011.

Published in: Technology
  1. 1. Technologies, methods and challenges to effective public data sharing and aggregation
      Mark D Wilkinson
      Medical Genetics, UBC
      PI Bioinformatics
      Heart + Lung Institute at St. Paul’s Hospital
      markw@illuminae.com
      http://wilkinsonlab.ca
  2. 2. Thanks in advance (very few of these ideas are my own!)
      Paul Gordon – Sun Center of Excellence, U Calgary
      Carole Goble – University of Manchester
      Charles Petrie – Stanford University
  3. 3. We’ve come a long way!!
  4. 4. In the beginning...
  5. 5. Link Integration
  6. 6.
  7. 7. Integration of What?
  8. 8. Specially-formatted Files
      EMBL Format
  9. 9. Specially-formatted Files
      GenBank Format
  10. 10. Specially-formatted Files
      FASTA Format
  11. 11. Specially-formatted Files
      GCG Format
  12. 12. At least 20 different formats for representing DNA sequences...
      Lord et al. 2004
  13. 13. Many formats contained a wide variety of related, but different, information
      DNA sequence
      Sequence features
      Translation
      Date/time/method
      Publication cross-references
      ...
  14. 14. Each file format required its own parser...
  15. 15. ...and that problem wasn’t limited to DNA...
  16. 16. XML
  17. 17. What did XML do for us?
      “...advent of XML meant that we didn’t have to write our own parsers anymore...”
      Individual data elements in a file can be automatically located and extracted
  18. 18. Predictable way to represent data
      Makes it easier for machines to encode/extract
  19. 19. EMBL Record for BRCA1
      In XML
  20. 20. GenBank Record for BRCA1
      In XML
  21. 21. So now we can share and aggregate data!
  22. 22. So now we can share and aggregate data!
  23. 23. EMBL Record for BRCA1
      In XML
      GenBank Record for BRCA1
      In XML
  24. 24. So now we can share and aggregate data!
      ...because it isn’t (just) a parsing problem...
      Various resources have various data models
  25. 25. So... Let’s find a way to describe the data models!
  26. 26. XML Schema
  27. 27. XML Schema
      There will be an element called “qualifier”
      It will have an attribute called “name”
      The content of that attribute will be text
      There will be a child element called “value”
      The content of that child element will be free-text
      XML Schema
      There will be an element called “GBQualifier”
      There will be a child element called “GBQualifier_name”
      The content of that child element will be free-text
      There will be a child element called “GBQualifier_value”
      The content of that child element will be free-text
  28. 28. So now we can share and aggregate data!
  29. 29. So now we can share and aggregate data!
  30. 30. What did XML Schema do for us?
      “...XML Schema (among other things) allowed us to ~automate the creation of (in-memory) structures which could hold the given XML-formatted data...”
  31. 31. Does not solve the integration or aggregation problem
  32. 32. XML Schema (repeats the two schema descriptions from slide 27)
  33. 33. XML Schema (repeats the two schema descriptions from slide 27)
      Because the “meaning” of each element is implicit, we resort to “Schema Mapping” to integrate the data
  34. 34. XML Schema (repeats the two schema descriptions from slide 27)
  35. 35. XML Schema (repeats the two schema descriptions from slide 27)
  36. 36. XML Schema (repeats the two schema descriptions from slide 27)
      Automatically Map Schema, then we can integrate data!
  37. 37. XML Schema (repeats the two schema descriptions from slide 27)
  38. 38. Nevertheless...
  39. 39. Web Services
      “Service Oriented Architectures”
      WSDL (and many other 4-letter words)
  40. 40. Web Services & SOA’s
      Allow you to expose software (e.g. a database, analytical tool, or service) on the Web so that others can use it (in their own analytical pipelines)
  41. 41. Excellent!!
  42. 42. But...
  43. 43. XML Schema
  44. 44. XML Schema (repeats the two schema descriptions from slide 27)
  45. 45. “The phrase ‘practical Web Services’ is not intrinsically an oxymoron, but [I] argue that there are few in existence.”
  46. 46. Why?
  47. 47. Because this problem is so disruptive that there is little point in building “public” Web Services...
      They are simply too difficult to integrate with other “public” Web Services.
      -- adapted from Petrie, SWSIP 2009
      XML Schema (repeats the two schema descriptions from slide 27)
  48. 48. ...and that’s pretty much where the world is right now...
  49. 49. But there is hope!
  50. 50. Two new technologies & communities
      “Linked Data” movement: Resource Description Framework, “RDF”
      The “Semantic Web” movement: Web Ontology Language, “OWL” (+ RDF)
  51. 51. What does RDF do for us?
      “...RDF replaces XML Schema, because RDF says that there is only one data model...”
  52. 52. What does OWL do for us?
      “...the semantics are no longer implicit in the data model...”
      XML Schema (repeats the two schema descriptions from slide 27)
  53. 53. So what?
  54. 54. The Semantic Web
      Gives us the opportunity to re-think how we build our health data infrastructures
  55. 55. The Semantic Web
      isn’t “yet another layer of technology”
  56. 56. The Semantic Web
      changes the way we write software
  57. 57. The Semantic Web
      What to do & How to do it is no longer encoded in your software
  58. 58. The Semantic Web
      What to do & How to do it is part of the data
  59. 59. The Semantic Web
      What to do & How to do it is part of a shared, expert understanding
  60. 60. The Semantic Web
      What to do & How to do it IS* PERSONAL!
      * can be...
  61. 61. One piece of software
      Any question... Any answer
  62. 62. Let me demonstrate what I mean
  63. 63. Semantic Automated Discovery and Integration
      http://sadiframework.org
      Microsoft Research
      Founding partner
  64. 64. Semantic Health And Research Environment (a Semantic Web question answerer...)
  65. 65. Example #1
      Show me the latest Blood Urea Nitrogen and Creatinine levels of patients who appear to be rejecting their transplants
      SELECT ?patient ?bun ?creat
      FROM <http://sadiframework.org/ontologies/patients.rdf>
      WHERE {
        ?patient rdf:type patient:LikelyRejecter .
        ?patient l:latestBUN ?bun .
        ?patient l:latestCreatinine ?creat .
      }
  66. 66. Likely Rejecter:
      A patient who has creatinine levels that are increasing over time
      -- Wilkinson “MD”
  67. 67. Likely Rejecter:
      …but there is no “likely rejecter” column or table in our database…
  68. 68. Likely Rejecter:
      Our database contains various blood chemistry measurements at various time-points
  69. 69. SHARE determines by itself the need to do a Linear Regression analysis over Creatinine blood chemistry measurements
      Semantics
  70. 70. SHARE determines by itself how and where that analysis can be done, and does it
      Semantics
  71. 71. The SHARE system utilizes Semantics (via SADI) to discover and access analytical services on the Web that do linear regression analysis
  72. 72. VOILA!
  73. 73. Neither SADI nor SHARE knows anything about blood chemistry or mathematics
  74. 74. Example #2
      From a (contrived) integrated dataset, retrieve the blood pressure measurements
      SELECT ?output ?unit ?value
      FROM <http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl>
      WHERE {
        ?output rdf:type sbp:BloodPressure .
        ?output local:hasCanonicalAttribute ?pr .
        ?pr sio:SIO_000221 ?unit .
        ?pr sio:SIO_000300 ?value .
      }
  75. 75. This should be extremely straightforward...
  76. 76. ...except for one problem...
  77. 77. <owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance1">
        <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
        <resource:SIO_000300>0.137</resource:SIO_000300>
        <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/meter-of-mercury-column"/>
      </owl:NamedIndividual>
      <owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance2">
        <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
        <resource:SIO_000300>12.45</resource:SIO_000300>
        <resource:SIO_000221 rdf:resource="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#centi-meter-of-mercury-column"/>
      </owl:NamedIndividual>
      <owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance3">
        <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
        <resource:SIO_000300>5.3</resource:SIO_000300>
        <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/inch-of-mercury-column"/>
      </owl:NamedIndividual>
  78. 78. Spot the problem?
  79. 79. Let me help...
  80. 80. <owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance1">
        <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
        <resource:SIO_000300>0.137</resource:SIO_000300>
        <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/meter-of-mercury-column"/>
      </owl:NamedIndividual>
      <owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance2">
        <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
        <resource:SIO_000300>12.45</resource:SIO_000300>
        <resource:SIO_000221 rdf:resource="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#centi-meter-of-mercury-column"/>
      </owl:NamedIndividual>
      <owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance3">
        <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
        <resource:SIO_000300>5.3</resource:SIO_000300>
        <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/inch-of-mercury-column"/>
      </owl:NamedIndividual>
  81. 81. Example #2
      From a (contrived) integrated dataset, retrieve the blood pressure measurements
      Semantics
      SELECT ?output ?unit ?value
      FROM <http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl>
      WHERE {
        ?output rdf:type sbp:BloodPressure .
        ?output local:hasCanonicalAttribute ?pr .
        ?pr sio:SIO_000221 ?unit .
        ?pr sio:SIO_000300 ?value .
      }
      My semantic definition of “Blood Pressure” includes the units that I want...
  82. 82. Semantics
      This is enough to trigger SHARE to automatically discover an online unit-conversion service...
  83. 83.
  84. 84. Neither SADI nor SHARE knows anything about units or unit conversions
  85. 85. Many of the challenges to data aggregation and sharing now have solutions that work!
  86. 86. What, in my opinion, is the greatest remaining challenge?
  87. 87. To a biologist...
      ...“data mining” means “this data is mine!”
  88. 88. The challenge to us all
      Move from Data Mine-ing
      To Data Ours-ing
      -- Len Silverston, 2007
  89. 89. XML
      We’ve come a long way!!
      XML Schema (repeats the two schema descriptions from slide 27)
  90. 90. TEAM:
      Luke McCarthy
      Benjamin Vandervalk
      Soroush Samadian
      Microsoft Research
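
Slides 27 and 33 describe two XML layouts that carry the same information, and say that integrating them requires "Schema Mapping". What follows is only a minimal Python sketch of that idea, not the presenter's code. The two XML fragments are hypothetical examples written to match the slide-27 descriptions, and the function names are invented for illustration. Each function hard-codes where one schema keeps the qualifier's name and value, which is exactly the implicit, per-schema knowledge the talk argues against.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments shaped like the two schema descriptions on slide 27.
ATTRIBUTE_STYLE = '<qualifier name="gene"><value>BRCA1</value></qualifier>'
CHILD_ELEMENT_STYLE = (
    "<GBQualifier>"
    "<GBQualifier_name>gene</GBQualifier_name>"
    "<GBQualifier_value>BRCA1</GBQualifier_value>"
    "</GBQualifier>"
)


def from_attribute_style(xml_text):
    """Map the 'qualifier' layout (name attribute, value child) to (name, value)."""
    elem = ET.fromstring(xml_text)
    return elem.get("name"), elem.findtext("value")


def from_child_element_style(xml_text):
    """Map the 'GBQualifier' layout (two child elements) to the same (name, value)."""
    elem = ET.fromstring(xml_text)
    return elem.findtext("GBQualifier_name"), elem.findtext("GBQualifier_value")


if __name__ == "__main__":
    # Both layouts end up in one shared shape only because a person wrote the mapping.
    print(from_attribute_style(ATTRIBUTE_STYLE))
    print(from_child_element_style(CHILD_ELEMENT_STYLE))
```

Every new schema needs another hand-written function of this kind. That is the scaling problem that slides 50 to 52 present RDF and OWL as answering.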

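The three systolic blood pressure readings on slide 77 use three different units, which is the problem slides 78 to 82 point at. The Python sketch below is a local stand-in that normalizes them to millimetres of mercury. It is not how SHARE works: according to slide 82, SHARE discovers an online unit-conversion service instead. The values and unit URIs are copied from slide 77. The lookup table, the choice of mmHg as the target, and the function names are assumptions made for illustration.

```python
# Millimetres of mercury per one unit, keyed by the last part of each unit URI.
# (1 m = 1000 mm, 1 cm = 10 mm, 1 inch = 25.4 mm.)
MM_HG_PER_UNIT = {
    "meter-of-mercury-column": 1000.0,
    "centi-meter-of-mercury-column": 10.0,
    "inch-of-mercury-column": 25.4,
}

# (instance, value, unit URI) as they appear on slide 77.
READINGS = [
    ("pressureinstance1", 0.137, "&ucum;unit/pressure/meter-of-mercury-column"),
    ("pressureinstance2", 12.45,
     "http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#centi-meter-of-mercury-column"),
    ("pressureinstance3", 5.3, "&ucum;unit/pressure/inch-of-mercury-column"),
]


def unit_key(unit_uri):
    """Return the trailing unit name after the last '/' or '#'."""
    return unit_uri.replace("#", "/").rsplit("/", 1)[-1]


def to_mm_hg(value, unit_uri):
    """Convert a reading to mmHg using the hypothetical lookup table above."""
    return value * MM_HG_PER_UNIT[unit_key(unit_uri)]


if __name__ == "__main__":
    for name, value, unit_uri in READINGS:
        print(f"{name}: {to_mm_hg(value, unit_uri):.1f} mmHg")
```

After conversion the three readings are about 137, 124.5 and 134.6 mmHg, so they can be compared directly. The table still encodes unit knowledge in software, whereas the talk's point is that this knowledge can live in the semantic description of the data.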