Biodiversity Informatics: Mining Untapped Resources. February 8, 2010. Marine Biological Laboratory and Woods Hole Oceanographic Institution Library. P. Bryan Heidorn, Director, University of Arizona School of Information Resources and Library Science
The problem
Hubble Space Telescope composite image "ring" of dark matter in the galaxy cluster Cl 0024+17
Naive View of Science Data: GenBank, PDB
Power Law of Science Data: $f(x) = a x^{k} + o(x^{k})$, with the highlighted region $f(x) = a x^{k} + o(x^{k}) \,\big|_{X < .20}$
[Plot axes: Data Volume; Science Projects and Initiatives]
Does NSF’s Data Follow the Power Law? I do not know but if  $1 = X bytes…..
20-80 Rule: The small are big!
Total Grants: 9347, $2,137,636,716
                 20%                     80%
Number Grants    1869                    7478
Total Dollars    $1,199,088,125          $938,548,595
Range            $6,892,810-$350,000     $350,000-$831
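The arithmetic behind the 20-80 slide can be checked directly. The sketch below only recomputes shares from the figures printed on the slide; the variable names are illustrative and nothing here comes from NSF data beyond those numbers.

```python
# Figures copied from the "20-80 Rule" slide; variable names are illustrative.
top_count, rest_count = 1869, 7478
top_dollars, rest_dollars = 1_199_088_125, 938_548_595
stated_total_grants = 9347
stated_total_dollars = 2_137_636_716

total_count = top_count + rest_count
total_dollars = top_dollars + rest_dollars

# Share of awards and share of dollars held by each group.
top_count_share = top_count / total_count
top_dollar_share = top_dollars / total_dollars
rest_dollar_share = rest_dollars / total_dollars

print(f"largest grants:   {top_count_share:.1%} of awards, {top_dollar_share:.1%} of dollars")
print(f"remaining grants: {1 - top_count_share:.1%} of awards, {rest_dollar_share:.1%} of dollars")

# The two dollar columns differ from the stated total by a few dollars,
# presumably rounding in the source table.
print("difference from stated total:", total_dollars - stated_total_dollars)
```

On these numbers the 80% of smaller awards account for roughly 44% of the dollars, which is the sense in which "the small are big".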
Related Ideas
Where to find dark data
What is dark data good for?
Animalia
Historical and current data need to be in a form that allows for use and reuse.
Favorable Climate Change Response Explains Non-Native Species’ Success in Thoreau’s Woods
The problem with Museum Specimens
Natural History Specimens
Automatic Metadata Extraction (Darwin Core) From Museum Specimen Labels
… <co> Curtis,  </co><hdlc>  North American Pl </hdlc><cnl> No.</cnl><cn> 503*</cn> <gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val> <hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col><co> A. H. Curtiss.</co><dt>February</dt>…
With Qin Wei, Univ of Illinois
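To make the tagged label concrete, here is a minimal sketch that pulls (tag, text) pairs out of the markup shown on the slide. It assumes the tags are flat and properly paired, as in this sample; the function name is hypothetical, and the meaning of each tag is deliberately not interpreted because the slide text does not define it.

```python
import re

# The tagged label text exactly as it appears on the slide.
TAGGED_LABEL = (
    "<co> Curtis,  </co><hdlc>  North American Pl </hdlc><cnl> No.</cnl><cn> 503*</cn> "
    "<gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val> "
    "<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col>"
    "<co> A. H. Curtiss.</co><dt>February</dt>"
)

# <tag>text</tag>, where the closing tag must repeat the opening tag name.
PAIR = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def extract_fields(text):
    """Return (tag, value) pairs in document order, with whitespace trimmed."""
    return [(tag, value.strip()) for tag, value in PAIR.findall(text)]

fields = extract_fields(TAGGED_LABEL)
for tag, value in fields:
    print(f"{tag:5s} {value}")
```

Keeping the result as an ordered list rather than a dictionary preserves repeated tags such as `co`, which occurs twice in this label.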
Sample records
Sample OCR Output
Label Labels
Label Labels
Example Training Record
Supervised Learning Framework
[Diagram. Training Phase: Gold Classified Labels; Machine Learner; Trained Model. Application Phase: Unclassified Labels; Segmentation; Segmented Text; Machine Classifier; Silver Classified Labels; Human Editing.]
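The training and application phases of such a framework can be sketched with a small Naive Bayes token classifier (NB is one of the two learners named in the deck). Everything specific below is an assumption made for illustration: the token features, the add-one smoothing, and the five-token gold set, which reuses tokens and tags from the specimen label example. It is not the HERBIS implementation.

```python
import math
from collections import Counter, defaultdict

def features(token):
    """Hypothetical token features; the slides do not list the real ones."""
    return [
        "lower=" + token.lower().strip(".,"),
        "has_digit=" + str(any(ch.isdigit() for ch in token)),
        "is_cap=" + str(token[:1].isupper()),
    ]

def train(gold):
    """Training phase: count features per class from (token, class) pairs."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for token, label in gold:
        class_counts[label] += 1
        for f in features(token):
            feat_counts[label][f] += 1
            vocab.add(f)
    return class_counts, feat_counts, vocab

def classify(token, model):
    """Application phase: pick the class with the highest log posterior."""
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        # Add-one (Laplace) smoothing over the feature vocabulary.
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in features(token):
            score += math.log((feat_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny illustrative "gold" set built from the tagged label shown earlier.
GOLD = [("Polygala", "gn"), ("ambigua,", "sp"), ("Nutt.,", "sa"),
        ("503*", "cn"), ("February", "dt")]

model = train(GOLD)
print(classify("504", model))
print(classify("Polygala", model))
```

In the framework on the slide, the output of the classifier would be the "silver" labels that a human editor then corrects; corrected records could be fed back into `train` as additional gold data.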
Herbis Experimental Data
Performances of NB and HMM
Element Identifiers
Improved Performance With Field Element Identifiers
 
Learning w/ pre-categorization
[Diagram. Training: Gold Labels; Categorization; Class 1 Labels, Class 2 Labels, … Class n Labels; one Machine Learner per class; Model 1, Model 2, … Model n. Application: Unclassified Labels; Categorization; Class 1 Labels, Class 2 Labels, … Class n Labels; one Machine Classification step per class; Classified Labels.]
FIG. 5. Improved Performance of Specialist Model: Specialist100 Curtiss vs. 100 General
[Plot. Axis label: Iterations. Series: Specialist, Random.]
Machine Learning in BioGeomancer’s Locality Specification. P. Bryan Heidorn (1), Hong Zhang (1), Eugene Chung (2) and BGWG. (1) Graduate School of Library and Information Science, (2) Linguistics, University of Illinois. SPNHC & NSCA 2006
BioGeomancer Working Group (BGWG) http://203.202.1.217/bgwebsite/index.html
Participants
Example Locality Types
Record #   Specification of Location                               Locality Type
43         dario 7 mi wnw of; RIO VIEJO                            FOH; F
86         near Aleutian Islands; S of Amukta Pass                 NF; FH
100        INDIAN CREEK, 11 MI. W HWY 160                          P; POH
109        TIESMA RD, 1.5 MI NW EDGEWATER; OFF LAKE MICHIGAN R     P; FOH; NP
160        WALTMAN, 9 MI N, 2.5 MI W OF                            FOO
181        0.4 mi N Collinston on LA 138                           FPOH
204        Seward Peninsula; vic. Bluff, S coast                   F; NF; FS
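Several of the example localities contain a distance in miles followed by a compass heading. The sketch below finds just that one pattern with a regular expression, to show what a machine has to pick out of such strings. It is an illustration only: the talk describes a machine learning approach, the regex and function name are assumptions, and it makes no attempt to assign the locality type codes.

```python
import re

# Matches "<number> mi <compass heading>", e.g. "11 MI. W" or "7 mi wnw".
OFFSET_HEADING = re.compile(
    r"(\d+(?:\.\d+)?)\s*mi\.?\s+([nsew]{1,3})\b", re.IGNORECASE
)

def offsets(locality):
    """Return (distance_in_miles, heading) pairs found in a locality string."""
    return [(float(d), h.upper()) for d, h in OFFSET_HEADING.findall(locality)]

examples = [
    "dario 7 mi wnw of; RIO VIEJO",
    "INDIAN CREEK, 11 MI. W HWY 160",
    "WALTMAN, 9 MI N, 2.5 MI W OF",
    "0.4 mi N Collinston on LA 138",
    "Seward Peninsula; vic. Bluff, S coast",
]

for text in examples:
    print(text, "->", offsets(text))
```

The last example has a heading but no distance, so this pattern finds nothing in it; cases like that are one reason a single hand-written rule does not cover the range of locality types in the table.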
 
FRAME
Xiaoya Tang and P. Bryan Heidorn
User query: Long leaves
Description of leaf length in texts: … Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, … Inflorescences: … spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, …
Information Extraction From FNA
Original documents: … Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often irregularly doubly serrate to nearly dentate, …
User log analysis
Templates for useful information: Leaf_Shape, Leaf_Margin, Leaf_Apex, Leaf_Base, Blade_Dimension, …
Knowledge bases: … PartBlade: Leaf blade, Blades, blade …
Extraction Rules:
  Pattern:: * <PartBlade> ' ' <leafShape> * ( <leafShape> ) ',' *
  Output:: leaf {leafShape $1}
  Pattern:: * <PartBlade> * ', ' ( <Range> ' ' * <LengUnit> ) * <PartBase>
  Output:: leaf {bladeDimension $1}
Structured information:
  Leaf_Shape obovate
  Leaf_Shape orbiculate
  Blade_Dimension 3--9 x 3--8 cm
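The extraction rules on this slide are written in the authors' own pattern language. As a rough approximation of what they do, the sketch below uses ordinary Python regular expressions on the sample sentence. The shape lexicon, the regexes, and the function name are assumptions for illustration; only the PartBlade terms and the sample text come from the slide.

```python
import re

# Knowledge-base terms for the blade part, as listed on the slide.
PART_BLADE = r"(?:Leaf blade|Blades|blade)"
# Hypothetical shape lexicon; a real one would be much larger.
LEAF_SHAPES = ["obovate", "orbiculate"]
SHAPE = "(?:" + "|".join(LEAF_SHAPES) + ")"
# A numeric range such as "3--9".
RANGE = r"\d+(?:\.\d+)?--\d+(?:\.\d+)?"

SHAPE_CLAUSE = re.compile(rf"{PART_BLADE} ([^,]*),")
DIMENSION = re.compile(rf"{PART_BLADE}.*?, ({RANGE} × {RANGE} cm)")

def extract(text):
    """Return (template_slot, value) pairs found in one description."""
    out = []
    clause = SHAPE_CLAUSE.search(text)
    if clause:
        # Every lexicon shape in the first clause after the part name.
        for shape in re.findall(SHAPE, clause.group(1)):
            out.append(("Leaf_Shape", shape))
    dim = DIMENSION.search(text)
    if dim:
        out.append(("Blade_Dimension", dim.group(1)))
    return out

SAMPLE = ("Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, "
          "base obtuse to broadly cuneate, margins flat")

for slot, value in extract(SAMPLE):
    print(slot, value)
```

Separating the lexicon (`LEAF_SHAPES`, `PART_BLADE`) from the patterns mirrors the slide's split between knowledge bases and extraction rules, so new terms can be added without touching the rules.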
Results – System Performance
NT: number of tasks accomplished in total
NTH: number of tasks accomplished per hour
TSR: task success rate
SSR: search success rate
NSST: number of searches to accomplish a task
TST: time spent to accomplish a task
NDVST: number of documents viewed to accomplish a task
Group          NT      NTH     TSR     SSR     NSST    TST     NDVST
SEARFA         6.75    8.078   0.860   0.210   4.779   338.8   11.16
SEARF          4.50    3.598   0.568   0.053   9.584   435.2   14.75
Sig. (ANOVA)   0.005   0.005   0.000   0.011   0.000   0.72    0.162
Education Programs
Biological Information Specialists
Master of Science in Biological Informatics
What does a BIS need to know?
UIUC bioinformatics core coursework
Sample of existing LIS courses
MSLIS Data Curation Concentration
New research directions
Example Service
JRS Biodiversity Foundation
JRS Biodiversity Foundation
JRS Biodiversity Foundation
National Science Foundation
 
 
ALISE and AMISE



Editor's Notes

  1. Figure 1. Bar graphs depicting phylogenetically corrected mean differences between species groups for two climate change response traits: the correlation coefficient between first flowering day and annual spring temperature for the time period of 1888–1902 (A; i.e., flowering time tracking ), and the shift in mean first flowering day during the period exhibiting the most dramatic increase in mean annual temperature, from 1900–2006 (B; i.e., flowering time shift ).
  2. Not handwriting
  3. Insert lake victoria overlay
  4. Insert lake victoria overlay
  5. Insert lake victoria overlay