Text mining
Lars Juhl Jensen
>10 km
exponential growth
~45 seconds per paper
corpus
most use abstracts
few use full-text articles
no access
information retrieval
find the relevant papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
PubMed
indexing
fast lookup
stemming
word endings
dynamic query expansion
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
no tool will find that
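The indexing, stemming, and Boolean query steps above can be sketched roughly as follows. This is a minimal illustration, not how PubMed works: the suffix list, the three toy documents, and the function names `stem`, `build_index`, and `search_and` are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed, deliberately crude list of word endings to strip.
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Lower-case a word and strip one known ending, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Inverted index: stemmed term -> set of document ids (fast lookup)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[A-Za-z0-9]+", text):
            index[stem(token)].add(doc_id)
    return index


def search_and(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    # The upper-case word AND is treated as an operator, not as a search term.
    terms = [stem(t) for t in re.findall(r"[A-Za-z0-9]+", query) if t != "AND"]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical mini-corpus; document 3 is a shortened form of the slide's sentence.
docs = {
    1: "Cell cycle regulation in budding yeast",
    2: "Yeasts and the cell cycles of fungi",
    3: "Mitotic cyclin (Clb2)-bound Cdc28 directly phosphorylated Swe1",
}
index = build_index(docs)
hits = search_and(index, "yeast AND cell cycle")
```

In this toy corpus, stemming lets the query reach document 2 ("Yeasts", "cycles"), while document 3 is not returned because it contains none of the query words.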
still too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
identify the concepts
comprehensive lexicon
small molecules
proteins
cellular components
tissues
diseases
environments
organisms
orthographic expansion
prefixes and postfixes
Cdc28 vs. Cdc28p
singular and plural forms
flexible matching
upper- and lower-case
spaces and hyphens
“black list”
SDS
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
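A dictionary-based tagger along the lines of these slides might look like the sketch below. It is only an illustration under assumptions: the mini-lexicon, the identifiers, the `HYPOTHETICAL_GENE` entry carrying the ambiguous name "SDS", and the function names are invented for the example, and names are assumed to consist of letters and digits only.

```python
import re

# Hypothetical mini-lexicon: common identifier -> names seen in the literature.
LEXICON = {
    "CDC28": ["Cdc28", "Cdk1"],
    "CLB2": ["Clb2"],
    "SWE1": ["Swe1"],
    "CDC5": ["Cdc5"],
    "HYPOTHETICAL_GENE": ["SDS"],  # ambiguous: also a common lab reagent abbreviation
}

# Names that are too ambiguous to tag, stored lower-cased.
BLACK_LIST = {"sds"}


def expand(name):
    """Orthographic expansion: add the 'p' postfix and a plural form."""
    return {name, name + "p", name + "s"}


def to_pattern(variant):
    """Case-insensitive pattern allowing a space or hyphen between letters and digits."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+", variant)
    body = re.escape(parts[0])
    for prev, cur in zip(parts, parts[1:]):
        separator = r"[\s-]?" if prev.isalpha() and cur.isdigit() else ""
        body += separator + re.escape(cur)
    # Require non-alphanumeric context so that "Cdc5" does not match inside "Cdc55".
    return re.compile(r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])", re.IGNORECASE)


def tag(text):
    """Return sorted (start offset, matched text, identifier) tuples."""
    matches = set()
    for identifier, names in LEXICON.items():
        for name in names:
            if name.lower() in BLACK_LIST:
                continue
            for variant in expand(name):
                for m in to_pattern(variant).finditer(text):
                    matches.add((m.start(), m.group(0), identifier))
    return sorted(matches)


sentence = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
    "Swe1 and this modification served as a priming step to promote subsequent "
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation"
)
tagged = tag(sentence)
variants = tag("cdc-28 and CDC 28p in SDS buffer")
```

Here `tagged` maps both "Cdc28" and "Cdk1" to the same identifier, and `variants` picks up "cdc-28" and "CDC 28p" while leaving the black-listed "SDS" untagged.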
information extraction
formalize the facts
the starting point
named entity recognition
two approaches
co-mentioning
within documents
within paragraphs
within sentences
weighted counts
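Weighted co-mention counting can be sketched as below. The slides do not give the actual weights, so the values in `WEIGHTS` are placeholders, and the nested-list corpus layout (documents, paragraphs, sentences of already-tagged identifiers) is an assumption made for the example.

```python
from collections import Counter
from itertools import combinations

# Assumed weights: closer co-mentions count for more. Placeholder values only.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}


def pairs(entities):
    """All unordered pairs of distinct entities, in a canonical sorted order."""
    return combinations(sorted(set(entities)), 2)


def comention_scores(corpus):
    """Sum weights over every sentence, paragraph and document sharing a pair."""
    scores = Counter()
    for document in corpus:
        doc_entities = set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                for pair in pairs(sentence):
                    scores[pair] += WEIGHTS["sentence"]
                par_entities.update(sentence)
            for pair in pairs(par_entities):
                scores[pair] += WEIGHTS["paragraph"]
            doc_entities.update(par_entities)
        for pair in pairs(doc_entities):
            scores[pair] += WEIGHTS["document"]
    return scores


# Hypothetical corpus: each sentence is the list of identifiers tagged in it.
corpus = [
    [  # document 1
        [["CDC28", "SWE1"], ["CLB2"]],  # paragraph 1: two sentences
        [["CDC5"]],                     # paragraph 2: one sentence
    ],
    [  # document 2
        [["CDC28"], ["SWE1"]],          # one paragraph, two sentences
    ],
]
scores = comention_scores(corpus)
```

With these placeholder weights, a pair mentioned in the same sentence collects the sentence, paragraph, and document weights at once, so it outranks a pair that only shares a document.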
NLP
Natural Language Processing
part-of-speech tagging
semantic tagging
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
handle negations
high precision
poor recall
highly domain specific
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
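The trade-off named on these slides (high precision, poor recall, domain specific) can be illustrated with a toy rule. The sketch below uses a single regular expression as a stand-in for real part-of-speech tagging and sentence parsing; the assumed name shape (such as Cdc28 or Swe1), the verb list, and the negation cues are all choices made for the example.

```python
import re

# One hand-written rule: AGENT [(...)] [negation] [directly] VERB TARGET.
PATTERN = re.compile(
    r"(?P<agent>[A-Z][a-z]{2}\d+)"            # assumed name shape, e.g. Cdc28
    r"(?:\s+\([^)]*\))?"                      # optional parenthetical after the agent
    r"\s+(?P<neg>(?:did not|does not)\s+)?"   # optional negation cue
    r"(?:directly\s+)?"                       # optional adverb
    r"(?P<verb>phosphorylate[sd]?|activate[sd]?|inhibit(?:s|ed)?)"
    r"\s+(?P<target>[A-Z][a-z]{2}\d+)"
)


def extract(text):
    """Return (agent, verb, target, negated) for the first match, else None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return (m.group("agent"), m.group("verb"), m.group("target"), m.group("neg") is not None)


sentence = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated "
    "Swe1 and this modification served as a priming step to promote subsequent "
    "Cdc5-dependent Swe1 hyperphosphorylation and degradation"
)
fact = extract(sentence)
negated = extract("Cdc28 did not phosphorylate Swe1")
missed = extract("Swe1 was phosphorylated by Cdc28")  # passive voice: the rule does not fire
```

The rule turns the example sentence into a formal fact and flags the negated variant, but it misses the passive phrasing entirely, which is the kind of recall loss the slides point to.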
text/data integration
augmented browsing
Reflect
show relevant information
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
guilt by association
heterogeneous evidence
knowledge
experiments
text mining
predictions
common identifiers
quality scores
web interface
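Mapping heterogeneous evidence to common identifiers and merging quality scores could be sketched as below. The combination rule, 1 minus the product of (1 - score), is one common way to merge scores treated as independent probabilities. It is an assumption here, not necessarily the scheme the resources on the following slides use, and the alias table and score values are invented for the example.

```python
from collections import defaultdict

# Hypothetical alias table mapping names from different sources to one identifier.
ALIASES = {"Cdc28": "CDC28", "Cdk1": "CDC28", "Swe1": "SWE1"}


def combine(scores):
    """Combine scores in [0, 1] as 1 - prod(1 - s), assuming independent evidence."""
    remaining = 1.0
    for s in scores:
        remaining *= 1.0 - s
    return 1.0 - remaining


def integrate(evidence):
    """Group evidence by unordered identifier pair and combine the scores."""
    grouped = defaultdict(list)
    for _channel, name_a, name_b, score in evidence:
        id_a = ALIASES.get(name_a, name_a)
        id_b = ALIASES.get(name_b, name_b)
        grouped[tuple(sorted((id_a, id_b)))].append(score)
    return {pair: combine(scores) for pair, scores in grouped.items()}


# Hypothetical evidence records: (channel, name, name, quality score).
evidence = [
    ("experiments", "Cdc28", "Swe1", 0.5),
    ("text mining", "Swe1", "Cdk1", 0.5),
    ("knowledge", "SWE1", "CDC28", 0.8),
]
combined = integrate(evidence)
```

All three records collapse onto the same pair once the names are mapped to common identifiers, and the combined score is higher than any single channel's score.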
STRING
proteins
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
Frishman et al., Modern Genome Annotation, 2009
STITCH
small molecules
Kuhn et al., Nucleic Acids Research, 2012
COMPARTMENTS
subcellular localization
compartments.jensenlab.org
TISSUES
human tissue expression
tissues.jensenlab.org
DISEASES
human diseases
evidence viewers
web services
bulk download
summary
text mining
simpler
more useful
less boring
thank you!
questions?