Bots & spiders

 Bio-informatica II
    19/04/2012

        Maté Ongenaert

   Center for Medical Genetics
Ghent University Hospital, Belgium
 Part 1: Bots & spiders
  Background

 Part 2: Real-life case studies
  The use of bots and spiders in bio-informatics
 About the presenter
   Bio-engineer cell and gene biotechnology (2005)
    •   Master thesis: identification of cancer-specific methylated genes

   PhD applied biological sciences: cell and gene
    biotechnology (2009)
    •   PhD thesis: cellular reprogramming

   Industrial experience
    •   Research scientist (methylation biomarkers)

   Currently: postdoc at CMGG
    •   Prognostic methylation biomarkers in neuroblastoma
Part 1
Bots & spiders: background
Overview

 Bots and spiders
     Introduction
     Bots
     Spiders
     The Google case
 Bots/spiders and bio-informatics
     Automated querying
     APIs
     NCBI E-Utils (PubMed/GenBank)
     Ensembl
Bots and spiders

 Bots and spiders
    The web history
       •   In 1989, while working at CERN, Tim Berners-Lee
           invented a network-based implementation
           of the hypertext concept
       •   Since then, information can be retrieved by
           ‘following links’ instead of having to know the
           exact location at first
       •   Information is not at a single location, it is
           dynamic and spread across machines
Bots and spiders

 Bots
   Webbots
      •   Also called web robots, WWW robots or simply bots: software
          applications that run automated tasks over the
          Internet

   Bots perform tasks that are:
      •   Simple
      •   Structurally repetitive
      •   Carried out at a much higher rate than would be
          possible for a human
      •   Example: an automated script that fetches, analyses
          and files information from web servers at many times
          the speed of a human

   Other uses:
      •   Chatbots / IM / Skype / Wiki bots
      •   Malicious bots and bot networks (Zombies)
Bots and spiders

 Bots
   A spam bot, called the ‘Zunker Bot’
      •   Is installed on unpatched Windows machines
      •   Controls the clients through a neat application
      •   Can install additional software and execute commands
Bots and spiders

 Spiders
   Webspiders
      •   Web spiders (crawlers) are programs or
          automated scripts that browse the World
          Wide Web in a methodical, automated
          manner. A spider is one type of bot

   The spider starts with a list of
    URLs to visit, called the seeds
      • As the crawler visits these URLs, it identifies
        all the hyperlinks in the page
      • It adds them to the list of URLs to visit, called
        the crawl frontier
      • URLs from the frontier are recursively visited
        according to a set of policies
      • This process is called web crawling: in most
        cases a means of collecting up-to-date data
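
The crawl loop described above (seeds, frontier, visit policy) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production crawler: the `LINKS` dictionary is a made-up in-memory stand-in for fetching a page and extracting its hyperlinks, and the only policy applied is a page limit (`max_pages`, a hypothetical parameter).

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> hyperlinks found on it.
LINKS = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}

def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl: visit the seeds, then every newly discovered URL."""
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # URLs already queued, so nothing is visited twice
    visited = []              # URLs in the order they were visited
    while frontier and len(visited) < max_pages:  # policy: stop after max_pages
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["http://a.example/"], lambda url: LINKS.get(url, []))
print(order)
```

A real spider would replace the lambda with a function that downloads the page and parses its links, and would add policies such as politeness delays and robots.txt checks.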
Bots and spiders

 Spiders
   Use of webcrawlers:
      •   Mainly used to create a copy of all the visited pages for later processing by a
          search engine that will index the downloaded pages to provide fast searches
      •   Automating maintenance tasks on a website, such as checking links or
          validating HTML code
      •   Can be used to gather specific types of information from Web pages, such as
          harvesting e-mail addresses

   Most commonly used crawler is probably the
    GoogleBot crawler
      •   Crawls
      •   Indexes (content + key content tags and attributes, such as Title tags and ALT
          attributes)
      •   Serves results: PageRank Technology
Bots and spiders

 PageRank
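
As publicly described, PageRank scores a page by the rank of the pages that link to it. The sketch below is a textbook power iteration on a toy graph, shown only to make the idea concrete; the graph, the damping factor of 0.85 and the iteration count are illustrative assumptions, and nothing here reflects how Google actually implements it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page divides its rank equally over its outgoing links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy web (made up): C is linked to by A, B and D; nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
print(sorted(ranks, key=ranks.get, reverse=True))
```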
Bots and spiders

 Google
   Hardware
      •   Standard server hardware (2009): 16 GB RAM / 2 TB storage per server
      •   2009 estimate: 450 000 servers – 2 million $/month electricity cost

   Software
      •   Webserver (Not apache-based)
      •   Storage (Google File System / BigTable): distributed storage – mostly in
          memory
      •   Borg job scheduling and monitoring
      •   Indexing services: Caffeine / Percolator
      •   MapReduce: cluster system: splits complex problems and sends ‘jobs’ to worker
          nodes (Map), answers are gathered and combined to solve the original
          question (Reduce)
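
The Map and Reduce steps can be illustrated with the classic word-count example. This is a single-machine sketch in Python, meant only to show the split between the two phases; the function names and the sample documents are made up, and a real cluster would run the map jobs on separate worker nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: combine all pairs that share a key by summing their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

documents = [
    "methylation in cancer",
    "cancer biomarkers",
    "methylation biomarkers in cancer",
]
# Each document is one 'job'; here the jobs simply run one after another.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(mapped)
print(word_counts)
```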
Bots and spiders

 Bots/spiders and bio-informatics
   Automated querying
      •   Collecting information nowadays means the power to automatically query
          datasources (databases, websites, Google, Ensembl or NCBI databases)
      •   Query in web-terms: GET / POST
      •   Web-queries using Perl: LWP library

   LWP: set of Perl modules which provides a simple and
    consistent application programming interface (API) to
    the World-Wide Web
      •   Free LWP E-book: http://lwp.interglacial.com/

   LWP for newbies
      •   LWP::Simple (demo1)
      •   Go to a URL, fetch data, ready to parse
      •   Watch out: HTML tags get in the way when parsing with regular expressions
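
The demos in this lecture use Perl and LWP. As a rough Python analogue of the "fetch, then parse" step, the sketch below extracts hyperlinks with the standard-library HTML parser rather than with regular expressions, which tend to break on mixed case, different quoting styles or attributes split over lines. The sample page is made up and supplied inline, so no network access is needed.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Inline sample page (made up). A real bot would first fetch this text from a URL,
# for example with urllib.request.urlopen(url).read().decode().
page = """<html><body>
<a href="http://www.ensembl.org/">Ensembl</a>
<A HREF='http://www.ncbi.nlm.nih.gov/'
   class="ext">NCBI</A>
<p>No link here</p>
</body></html>"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)
```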
Bots and spiders

 Bots/spiders and bio-informatics
   Some more advanced features
      •   LWP::UserAgent (demo2 – show server access logs)
      •   Fill in forms and parse results
      •   Depending on content: follow hyperlinks to other pages and parse these
          again,…
      •   Mechanize package: follow links; fill in forms,…

   Bioinformatics examples
      • Use genome browser data (demo3) and sequences
      • Get gene aliases and symbols from GeneCards (demo4)
Bots and spiders

 Bots/spiders and bio-informatics
   Why not make use of the crawling, indexing and serving
    technologies of others (e.g. Google)?
      • Google allows automated queries: per account 1000 queries a day
      • Google uses Snippets: the short pieces of text you get in the main search
        results
      • This is the result of its indexing and parsing algorithms
      • Demo5: LWP and Google APIs combined and parsing the results

   API: Application Programming Interface
      •   Hides complexity by sharing ‘libraries’ with functions that can be applied within
          another programming language
      •   Bridges programming languages – crosses abstraction layers
      •   Example: displaying on a screen; printing; querying Google or NCBI from within
          a programming language
Bots and spiders

 Bots/spiders and bio-informatics APIs
   Google example used Google API
   NCBI API
      • The NCBI Web service is a web program that enables developers to access
        Entrez Utilities via the Simple Object Access Protocol (SOAP)
      • Programmers may write software applications that access the E-Utilities using
        any SOAP development tool
      • Main tools (demo6):
         – E-Search: Searches and retrieves primary IDs and term translations and
             optionally retains results for future use in the user's environment
         – E-Fetch: Retrieves records in the requested format from a list of one or
             more primary IDs
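
Besides SOAP, the E-Utilities can be called as plain HTTP GET requests. The sketch below only builds the request URLs for E-Search and E-Fetch and does not contact NCBI. The base URL and the parameter names (`db`, `term`, `retmax`, `id`, `retmode`) are given as I understand NCBI's public URL interface and should be checked against the current E-Utilities documentation; the helper names and the IDs are placeholders.

```python
from urllib.parse import urlencode

# Assumed base URL of the E-Utilities URL interface.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(term, db="pubmed", retmax=20):
    """URL for an E-Search request: returns primary IDs matching a query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)

def efetch_url(ids, db="pubmed", retmode="xml"):
    """URL for an E-Fetch request: returns full records for a list of primary IDs."""
    params = {"db": db, "id": ",".join(ids), "retmode": retmode}
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)

search = esearch_url("neuroblastoma AND methylation")
fetch = efetch_url(["12345678", "23456789"])  # placeholder IDs, not real search results
print(search)
print(fetch)
```

In a typical workflow the IDs returned by E-Search are passed on to E-Fetch, and the fetched records are then parsed.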

   Ensembl API (demo7)
      •   Uses ‘Slices’ and adaptors
      •   You have to know the ‘application’ or database (Compare/Core/…)
Bots and spiders

 Bots/spiders and bio-informatics APIs
   NCBI API
   A frequently used NCBI database is PubMed
      •   PubMed can be queried using E-Utils
      •   Uses the same query syntax as the regular PubMed website
      •   Returns the data in the same formats as the website (XML, plain text)
      •   Parse XML results and apply more advanced Text-mining techniques
      •   Demo8
      •   Parse results and present them in an interface
           – Methylated genes in cancer:
           – http://matrix.ugent.be/mate/methylome/result1.html
           – miRNAs in cancer:
           – http://matrix.ugent.be/mate/textmining/preprocess/
Part 2
Real-life case studies: the use of bots and
         spiders in bio-informatics
Bots and spiders

 TextMining
   Create and translate query
      •   User query -> query suited for PubMed

   Query is executed, results are returned
      •   Results format: XML, TXT, MedLine, ASN,…
      •   Human readable <> parsable (XML parsers)

   Parse results
      • Extract information: authors, title, abstract
      •   Store results

   Analyse results
      •   Identify gene names, keywords, GO-terms,… -> score
      •   Semantic analysis / NLP processing / …

   Visualise results
      •   Highlighting, hierarchy, filters, searches, graphics
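
The "parse results" step can be sketched with Python's standard XML parser. The record below is made up, but it follows the element names PubMed uses in its XML output (`PubmedArticle`, `ArticleTitle`, `AbstractText`, `Author`) as far as I know them; the function name and the dictionary layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up record in the shape of PubMed XML output.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Example title on promoter methylation</ArticleTitle>
        <Abstract>
          <AbstractText>Background sentence.</AbstractText>
          <AbstractText>Results sentence.</AbstractText>
        </Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
          <Author><LastName>Roe</LastName><Initials>A</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def parse_articles(xml_text):
    """Extract PMID, title, abstract and authors from each PubmedArticle element."""
    records = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        authors = [
            f"{a.findtext('LastName')} {a.findtext('Initials')}"
            for a in article.iter("Author")
        ]
        records.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "title": article.findtext(".//ArticleTitle"),
            # Structured abstracts have several AbstractText parts; join them.
            "abstract": " ".join(t.text or "" for t in article.iter("AbstractText")),
            "authors": authors,
        })
    return records

records = parse_articles(SAMPLE_XML)
print(records[0]["title"], records[0]["authors"])
```

The resulting list of dictionaries can be stored as is and handed to the analysis and visualisation steps.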
Bots and spiders

 TextMining
   Demonstration: GoldMine
   Web-application
   Translate query – find aliases for genes or miRNAs
    and incorporate them in the search
   Query NCBI PubMed using E-fetch
   Get the results and process them
         Count
         Highlight
         Rank
         Visualization
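
The steps listed above (translate the query with aliases, then count, highlight and rank) can be sketched as follows. This is not the GoldMine code: the alias table, the abstracts and all function names are made up for illustration, and `[Title/Abstract]` is used as the PubMed field tag.

```python
import re

# Illustrative alias table; a real tool would look aliases up in a gene database.
ALIASES = {"CDKN2A": ["p16", "INK4A"]}

def names_for(gene):
    """The gene symbol followed by its known aliases."""
    return [gene] + ALIASES.get(gene, [])

def build_query(gene, topic):
    """Translate: OR together the symbol and its aliases, then AND with the topic."""
    terms = " OR ".join(f"{name}[Title/Abstract]" for name in names_for(gene))
    return f"({terms}) AND {topic}"

def mention_pattern(gene):
    """Regex matching the symbol or any alias as a whole word, ignoring case."""
    alternatives = "|".join(re.escape(name) for name in names_for(gene))
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

def count_mentions(text, gene):
    """Count: number of times the gene or an alias occurs in the text."""
    return len(mention_pattern(gene).findall(text))

def highlight(text, gene):
    """Highlight: wrap every mention in ** markers."""
    return mention_pattern(gene).sub(r"**\1**", text)

abstracts = {  # made-up abstracts keyed by a made-up identifier
    "A": "CDKN2A (p16) promoter methylation was frequent; p16 loss was common.",
    "B": "No association with MGMT was found.",
    "C": "INK4A methylation was detected.",
}

query = build_query("CDKN2A", "methylation")
# Rank: abstracts with the most mentions first.
ranked = sorted(abstracts, key=lambda k: count_mentions(abstracts[k], "CDKN2A"), reverse=True)
print(query)
print(ranked)
print(highlight(abstracts["C"], "CDKN2A"))
```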
Bots and spiders

 Data analysis
     NCBI GEO – Gene Expression Omnibus
     Raw expression data on FTP-server
     Annotation: can be queried using NCBI E-Utils
     Annotation: in Excel-files at FTP-server
     For specific experimental conditions, get all raw data
      and annotations and perform an automated analysis
 Exercise: draw up a scheme of how you would proceed for the
  biological question: superficial vs.
  infiltrating bladder cancer
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
     Find experiments on GEO
     Annotation of samples: up to the submitters
     ‘Uniform’ sample sheet available (Matrix-file)
     Current update of GEO: view ‘factors’ in graphical
      overview
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
   Use this to couple sample annotation features (stage,
    age, risk, sex) to unique sampleID (GSMxxxxxxx)
   Get raw data for each sample in dataset
   Either txt files (uniform) or raw data files (such as Affy
    CEL files)
   Depends on the platform used: GPLxxxx
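
Coupling sample annotations to sample IDs can be sketched by reading the header lines of a series matrix file. The fragment below is made up (IDs and annotations are placeholders) and assumes the usual layout of such files: tab-separated lines starting with `!Sample_geo_accession` and `!Sample_characteristics_ch1`, with quoted values of the form `key: value`. Real files are less uniform, because the annotation is up to the submitters.

```python
import csv
import io

# Made-up fragment in the shape of a GEO series matrix header.
SERIES_MATRIX = (
    '!Series_title\t"Example bladder cancer series"\n'
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\t"GSM0000003"\n'
    '!Sample_characteristics_ch1\t"stage: superficial"\t"stage: infiltrating"\t"stage: superficial"\n'
    '!Sample_characteristics_ch1\t"sex: M"\t"sex: F"\t"sex: M"\n'
)

def sample_annotations(text):
    """Return {sample ID: {feature: value}} from series matrix header lines."""
    ids, characteristic_rows = [], []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue
        if fields[0] == "!Sample_geo_accession":
            ids = fields[1:]
        elif fields[0] == "!Sample_characteristics_ch1":
            characteristic_rows.append(fields[1:])
    annotations = {gsm: {} for gsm in ids}
    for row in characteristic_rows:
        for gsm, value in zip(ids, row):
            key, _, val = value.partition(": ")   # "stage: superficial" -> key, value
            annotations[gsm][key] = val
    return annotations

annotations = sample_annotations(SERIES_MATRIX)
# Class labels for a two-group comparison: 0 = superficial, 1 = infiltrating.
class_labels = [0 if a["stage"] == "superficial" else 1 for a in annotations.values()]
print(annotations)
print(class_labels)
```

A vector of class labels like this is what a two-class analysis needs as input, one entry per sample in the same order as the expression columns.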
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
   Platform / data files / samples / sample annotation
    relationship
   Set up standardised analysis strategy
   Make use of sample annotations
   Combine studies or keep them separate?
   Normalisation
   RankProd analysis
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
     library(affy); library(RankProd)            # packages providing just.rma() and RP()
     data.justrma <- just.rma("GSM90305.CEL", …)             # SAMPLES
     expression <- exprs(data.justrma)                       # NORMALISATION
     results[,2:103] <- expression
     library(hgu95av2.db)                                    # PLATFORM
     cl <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, …)           # ANNOTATION
     RP.out.stage <- RP(results[,3:104], cl, num.perm = 100,
                        logged = TRUE, na.rm = FALSE,
                        plot = TRUE, rand = 123)             # ANALYSIS STRATEGY
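
The idea behind a rank product can be shown in a few lines of Python: rank the genes within each replicate (or study) by fold change and take the geometric mean of each gene's ranks. This is a simplified sketch of the statistic only, not the RankProd package; it leaves out the two-class pairing and the permutation-based significance estimate, and the gene names and values are made up.

```python
import math

def rank_product(log_fold_changes):
    """Geometric mean of each gene's rank across replicates (rank 1 = most up-regulated).

    log_fold_changes: dict mapping gene -> list of log fold changes, one per replicate.
    """
    genes = list(log_fold_changes)
    n_replicates = len(next(iter(log_fold_changes.values())))
    ranks = {gene: [] for gene in genes}
    for i in range(n_replicates):
        # Order genes by fold change in replicate i, largest first.
        ordered = sorted(genes, key=lambda g: log_fold_changes[g][i], reverse=True)
        for position, gene in enumerate(ordered, start=1):
            ranks[gene].append(position)
    return {gene: math.prod(r) ** (1.0 / n_replicates) for gene, r in ranks.items()}

# Made-up log fold changes for three genes in three replicates.
example = {
    "geneA": [2.0, 1.5, 1.8],
    "geneB": [0.1, 0.3, -0.2],
    "geneC": [1.0, 1.6, 0.5],
}
rp = rank_product(example)
print(sorted(rp, key=rp.get))  # smallest rank product first
```

Genes that rank near the top in every replicate get a small rank product, which is why the statistic is robust to differences in scale between studies.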
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
     Combine results across studies
     Biological question <> data analysis
     Scoring scheme, prioritisation
     Superficial vs. Infiltrating
     Metastasis vs. Primary cancer
     High stage vs. Low stage
     Normal vs. Cancer
Bots and spiders

 OncoMine
Bots and spiders

 Integrated analysis
 Rank  Meth Pca  Lit  Meth other  Expression Pca  Progression  Rank 1  2      3      4      5      6      7      8
                                                                                 EXPRESSION        RE-EXP CpG    Pc
 1     1         -    -           x               -            -       -      0,95   1      0,993  0,997  0,84   1
 2     -         -    -           -               -            0,998   -      0,995  1      0,958  0,091  -      0,994
 3     -         1    x           x               x            1       -      0,993  1      0,996  -      -      0,312
 4     1         -    x           x               x            0,995   0,767  0,96   -      1      0,931  0,998  0,635
 5     -         1    -           x               -            0,997   -      0,968  1      1      0,364  0,746  0,199
 6     -         -    -           -               x            -       0,711  0,948  -      0,994  0,559  0,991  0,993
 7     -         -    -           -               -            -       -      0,998  -      0,993  0,83   0,936  0,996
 8     -         -    -           -               -            0,997   -      0,99   -      0,998  0,759  0,726  0,575
 9     -         1    -           x               x            0,886   -      0,995  -      0,997  1      -      0,7
 10    -         1    -           -               -            0,998   -      0,409  -      0,99   0,88   0,998  0,779
 11    -         1    -           x               x            -       -      0,995  -      0,999  0,995  -      0,687
 12    -         1    -           x               x            -       -      0,997  -      0,999  0,999  -      0,257
 13    1         -    x           x               x            0,799   0,996  0,969  -      0,994  0,848  0,981  0,887
 14    1         -    x           x               -            0,916   0,568  0,99   -      0,993  0,994  0,988  0,558
 15    -         -    -           -               -            -       -      0,986  -      0,995  0,956  0,983  0,998
 16    1         -    x           -               -            -       -      0,157  1      0,925  0,989  0,984  0,993
Acknowledgments


   CMGG
       Anneleen Decock
       Frank Speleman
       Jo Vandesompele


   BioBix
       Leander Van Neste
       Tim De Meyer
       Gerben Menschaert
       Geert Trooskens
       Wim Van Criekinge

More Related Content

Similar to Bots & spiders

2006 bio it web services
2006 bio it web services2006 bio it web services
2006 bio it web servicesChris Dwan
 
Web Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGen
Web Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGenWeb Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGen
Web Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGenMonica Munoz-Torres
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning processDenis Dus
 
Robotics, Search and AI with Solr, MyRobotLab, and Deeplearning4j
Robotics, Search and AI with Solr, MyRobotLab, and Deeplearning4jRobotics, Search and AI with Solr, MyRobotLab, and Deeplearning4j
Robotics, Search and AI with Solr, MyRobotLab, and Deeplearning4jKevin Watters
 
The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep L...
The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep L...The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep L...
The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep L...Lucidworks
 
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Yury Leonychev
 
1. web technology basics
1. web technology basics1. web technology basics
1. web technology basicsJyoti Yadav
 
Module development
Module development Module development
Module development Araport
 
Biocatalogue, FileQuirks, MyExperiment
Biocatalogue, FileQuirks, MyExperimentBiocatalogue, FileQuirks, MyExperiment
Biocatalogue, FileQuirks, MyExperimentJerzy
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Softwaredgarijo
 
IWMW 1999: Indexing your web server
IWMW 1999: Indexing your web serverIWMW 1999: Indexing your web server
IWMW 1999: Indexing your web serverIWMW
 
The 3 Top Techniques for Web Security Testing Using a Proxy
The 3 Top Techniques for Web Security Testing Using a ProxyThe 3 Top Techniques for Web Security Testing Using a Proxy
The 3 Top Techniques for Web Security Testing Using a ProxyTEST Huddle
 
Internet browser and search engines
Internet browser and search enginesInternet browser and search engines
Internet browser and search enginesJoshua Pasion
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & AnalysisScott Sanders
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringTao Xie
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Henry S
 

Similar to Bots & spiders (20)

2006 bio it web services
2006 bio it web services2006 bio it web services
2006 bio it web services
 
Web Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGen
Web Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGenWeb Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGen
Web Apollo: A Web-based Genomics Annotation Editing Platform. 13ArthGen
 
Reproducibility and automation of machine learning process
Reproducibility and automation of machine learning processReproducibility and automation of machine learning process
Reproducibility and automation of machine learning process
 
Unit 1
Unit 1Unit 1
Unit 1
 
Robotics, Search and AI with Solr, MyRobotLab, and Deeplearning4j
Robotics, Search and AI with Solr, MyRobotLab, and Deeplearning4jRobotics, Search and AI with Solr, MyRobotLab, and Deeplearning4j
Robotics, Search and AI with Solr, MyRobotLab, and Deeplearning4j
 
The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep L...
The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep L...The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep L...
The Intersection of Robotics, Search and AI with Solr, MyRobotLab, and Deep L...
 
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
 
1. web technology basics
1. web technology basics1. web technology basics
1. web technology basics
 
Module development
Module development Module development
Module development
 
Walter api
Walter apiWalter api
Walter api
 
Biocatalogue, FileQuirks, MyExperiment
Biocatalogue, FileQuirks, MyExperimentBiocatalogue, FileQuirks, MyExperiment
Biocatalogue, FileQuirks, MyExperiment
 
Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Software
 
IWMW 1999: Indexing your web server
IWMW 1999: Indexing your web serverIWMW 1999: Indexing your web server
IWMW 1999: Indexing your web server
 
The 3 Top Techniques for Web Security Testing Using a Proxy
The 3 Top Techniques for Web Security Testing Using a ProxyThe 3 Top Techniques for Web Security Testing Using a Proxy
The 3 Top Techniques for Web Security Testing Using a Proxy
 
Internet browser and search engines
Internet browser and search enginesInternet browser and search engines
Internet browser and search engines
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & Analysis
 
Node.js
Node.jsNode.js
Node.js
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
Burpsuite yara
Burpsuite yaraBurpsuite yara
Burpsuite yara
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1
 

More from Maté Ongenaert

Unleash transcriptomics to gain insights in disease mechanisms: integration i...
Unleash transcriptomics to gain insights in disease mechanisms: integration i...Unleash transcriptomics to gain insights in disease mechanisms: integration i...
Unleash transcriptomics to gain insights in disease mechanisms: integration i...Maté Ongenaert
 
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor...
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor...Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor...
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor...Maté Ongenaert
 
Ecobouwers opendeur passiefhuis Lokeren
Ecobouwers opendeur passiefhuis LokerenEcobouwers opendeur passiefhuis Lokeren
Ecobouwers opendeur passiefhuis LokerenMaté Ongenaert
 
Workshop NGS data analysis - 3
Workshop NGS data analysis - 3Workshop NGS data analysis - 3
Workshop NGS data analysis - 3Maté Ongenaert
 
ENCODE project: brief summary of main findings
ENCODE project: brief summary of main findingsENCODE project: brief summary of main findings
ENCODE project: brief summary of main findingsMaté Ongenaert
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Maté Ongenaert
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Maté Ongenaert
 
Exploring the neuroblastoma epigenome: perspectives for improved prognosis
Exploring the neuroblastoma epigenome: perspectives for improved prognosisExploring the neuroblastoma epigenome: perspectives for improved prognosis
Exploring the neuroblastoma epigenome: perspectives for improved prognosisMaté Ongenaert
 
High-throughput proteomics: from understanding data to predicting them
High-throughput proteomics: from understanding data to predicting themHigh-throughput proteomics: from understanding data to predicting them
High-throughput proteomics: from understanding data to predicting themMaté Ongenaert
 
Microarray data and pathway analysis: example from the bench
Microarray data and pathway analysis: example from the benchMicroarray data and pathway analysis: example from the bench
Microarray data and pathway analysis: example from the benchMaté Ongenaert
 
Large scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyLarge scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyMaté Ongenaert
 
Integrative transcriptomics to study non-coding RNA functions
Integrative transcriptomics to study non-coding RNA functionsIntegrative transcriptomics to study non-coding RNA functions
Integrative transcriptomics to study non-coding RNA functionsMaté Ongenaert
 
Race against the sequencing machine: processing of raw DNA sequence data at t...
Race against the sequencing machine: processing of raw DNA sequence data at t...Race against the sequencing machine: processing of raw DNA sequence data at t...
Race against the sequencing machine: processing of raw DNA sequence data at t...Maté Ongenaert
 
Bringing the data back to the researchers
Bringing the data back to the researchersBringing the data back to the researchers
Bringing the data back to the researchersMaté Ongenaert
 
The post-genomic era: epigenetic sequencing applications and data integration
The post-genomic era: epigenetic sequencing applications and data integrationThe post-genomic era: epigenetic sequencing applications and data integration
The post-genomic era: epigenetic sequencing applications and data integrationMaté Ongenaert
 
Literature managment training
Literature managment trainingLiterature managment training
Literature managment trainingMaté Ongenaert
 
Scientific literature managment - exercises
Scientific literature managment - exercisesScientific literature managment - exercises
Scientific literature managment - exercisesMaté Ongenaert
 

More from Maté Ongenaert (18)

Unleash transcriptomics to gain insights in disease mechanisms: integration i...
Unleash transcriptomics to gain insights in disease mechanisms: integration i...Unleash transcriptomics to gain insights in disease mechanisms: integration i...
Unleash transcriptomics to gain insights in disease mechanisms: integration i...
 
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor...
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor...Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor...
Strong reversal of the lung fibrosis disease signature by autotaxin inhibitor...
 
Ecobouwers opendeur passiefhuis Lokeren
Ecobouwers opendeur passiefhuis LokerenEcobouwers opendeur passiefhuis Lokeren
Ecobouwers opendeur passiefhuis Lokeren
 
Workshop NGS data analysis - 3
Workshop NGS data analysis - 3Workshop NGS data analysis - 3
Workshop NGS data analysis - 3
 
ENCODE project: brief summary of main findings
ENCODE project: brief summary of main findingsENCODE project: brief summary of main findings
ENCODE project: brief summary of main findings
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1
 
Exploring the neuroblastoma epigenome: perspectives for improved prognosis
Exploring the neuroblastoma epigenome: perspectives for improved prognosisExploring the neuroblastoma epigenome: perspectives for improved prognosis
Exploring the neuroblastoma epigenome: perspectives for improved prognosis
 
High-throughput proteomics: from understanding data to predicting them
High-throughput proteomics: from understanding data to predicting themHigh-throughput proteomics: from understanding data to predicting them
High-throughput proteomics: from understanding data to predicting them
 
Microarray data and pathway analysis: example from the bench
Microarray data and pathway analysis: example from the benchMicroarray data and pathway analysis: example from the bench
Microarray data and pathway analysis: example from the bench
 
Large scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyLarge scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biology
 
Integrative transcriptomics to study non-coding RNA functions
Integrative transcriptomics to study non-coding RNA functionsIntegrative transcriptomics to study non-coding RNA functions
Integrative transcriptomics to study non-coding RNA functions
 
Race against the sequencing machine: processing of raw DNA sequence data at t...
Race against the sequencing machine: processing of raw DNA sequence data at t...Race against the sequencing machine: processing of raw DNA sequence data at t...
Race against the sequencing machine: processing of raw DNA sequence data at t...
 
Bringing the data back to the researchers
Bringing the data back to the researchersBringing the data back to the researchers
Bringing the data back to the researchers
 
The post-genomic era: epigenetic sequencing applications and data integration
The post-genomic era: epigenetic sequencing applications and data integrationThe post-genomic era: epigenetic sequencing applications and data integration
The post-genomic era: epigenetic sequencing applications and data integration
 
Introduction
IntroductionIntroduction
Introduction
 
Literature managment training
Literature managment trainingLiterature managment training
Literature managment training
 
Scientific literature managment - exercises
Scientific literature managment - exercisesScientific literature managment - exercises
Scientific literature managment - exercises
 

Recently uploaded

4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptxmary850239
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...Nguyen Thanh Tu Collection
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Bots & spiders

  • 1. Bots & spiders Bio-informatica II 19/04/2012 Maté Ongenaert Center for Medical Genetics Ghent University Hospital, Belgium
  • 2. ▪ Part 1: Bots & spiders. Background
       ▪ Part 2: Real-life case studies. The use of bots and spiders in bio-informatics
  • 3. About the presenter
      ▪ Bio-engineer cell and gene biotechnology (2005)
          • Master thesis: identification of cancer-specifically methylated genes
      ▪ PhD applied biological sciences: cell and gene biotechnology (2009)
          • PhD thesis: cellular reprogramming
      ▪ Industrial experience
          • Research scientist (methylation biomarkers)
      ▪ Currently: postdoc at CMGG
          • Prognostic methylation biomarkers in neuroblastoma
  • 4. Part 1 Bots & spiders: background
  • 5. Overview
      ▪ Bots and spiders
          ▪ Introduction
          ▪ Bots
          ▪ Spiders
          ▪ The Google case
      ▪ Bots/spiders and bio-informatics
          ▪ Automated querying
          ▪ APIs
          ▪ NCBI E-Utils (PubMed/GenBank)
          ▪ Ensembl
  • 6. Bots and spiders
      ▪ Bots and spiders
          ▪ The web history
              • In 1989, while working at CERN, Tim Berners-Lee invented a network-based implementation of the hypertext concept
              • Since then, information can be retrieved by ‘following links’ instead of first having to know its exact location
              • Information is not at a single location; it is dynamic and spread across machines
  • 7. Bots and spiders
      ▪ Bots
          ▪ Webbots
              • Webbots (web robots, WWW robots, bots): software applications that run automated tasks over the Internet
          ▪ Bots perform tasks that are:
              • Simple
              • Structurally repetitive
              • Carried out at a much higher rate than would be possible for a human
              • Example: an automated script fetches, analyses and files information from web servers at many times the speed of a human
          ▪ Other uses:
              • Chatbots / IM / Skype / Wiki bots
              • Malicious bots and bot networks (Zombies)
  • 8. Bots and spiders
      ▪ Bots
          ▪ A spam bot, called the ‘Zunker Bot’
              • Is installed on unpatched Windows machines
              • Controls the clients through a neat application
              • Can install additional software and execute commands
  • 9. Bots and spiders
      ▪ Spiders
          ▪ Webspiders
              • Webspiders / crawlers are programs or automated scripts that browse the World Wide Web in a methodical, automated manner. A spider is one type of bot
          ▪ The spider starts with a list of URLs to visit, called the seeds
              • As the crawler visits these URLs, it identifies all the hyperlinks in the page
              • It adds them to the list of URLs to visit, called the crawl frontier
              • URLs from the frontier are recursively visited according to a set of policies
              • This process is called web crawling: in most cases a means of collecting up-to-date data
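The seed / frontier loop described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a made-up in-memory dictionary (`TOY_WEB`), the function names are hypothetical, and the only policy applied is "never queue a URL twice". A real crawler would fetch pages over HTTP and apply politeness and revisit policies as well.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> URLs that page links to.
TOY_WEB = {
    "a.example/": ["a.example/genes", "b.example/"],
    "a.example/genes": ["a.example/"],
    "b.example/": ["b.example/papers", "a.example/genes"],
    "b.example/papers": [],
}


def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)   # crawl frontier: URLs still to visit
    seen = set(seeds)         # every URL that was ever queued
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:      # policy: never queue a URL twice
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Passing `get_links` as a function keeps the crawl logic separate from the way links are obtained, so the same loop could be pointed at a real fetch-and-parse step.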
  • 11. Bots and spiders
      ▪ Spiders
          ▪ Use of webcrawlers:
              • Mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches
              • Automating maintenance tasks on a website, such as checking links or validating HTML code
              • Can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses
          ▪ The most commonly used crawler is probably the GoogleBot crawler
              • Crawls
              • Indexes (content + key content tags and attributes, such as Title tags and ALT attributes)
              • Serves results: PageRank Technology
  • 14. Bots and spiders
      ▪ Google
          ▪ Hardware
              • Standard server hardware (2009): 16 GB RAM / 2 TB storage per server
              • 2009 estimate: 450 000 servers, $2 million/month electricity cost
          ▪ Software
              • Webserver (not Apache-based)
              • Storage (Google File System / BigTable): distributed storage, mostly in memory
              • Borg: job scheduling and monitoring
              • Indexing services: Caffeine / Percolator
              • MapReduce: cluster system that splits complex problems and sends ‘jobs’ to worker nodes (Map); the answers are gathered and combined to solve the original question (Reduce)
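The Map and Reduce steps can be illustrated with a word count, the usual toy example. The sketch below runs in a single process and only shows the shape of the idea; the function names and the three input strings are made up, and nothing here reflects how Google's own system is implemented.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn one document into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share a key (done between Map and Reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into one answer."""
    return {key: sum(values) for key, values in grouped.items()}


# Each document could be mapped on a different worker node.
documents = ["bots crawl pages", "spiders crawl links", "bots follow links"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```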
  • 15. Overview
      ▪ Bots and spiders
          ▪ Introduction
          ▪ Bots
          ▪ Spiders
          ▪ The Google case
      ▪ Bots/spiders and bio-informatics
          ▪ Automated querying
          ▪ APIs
          ▪ NCBI E-Utils (PubMed/GenBank)
          ▪ Ensembl
  • 16. Bots and spiders
      ▪ Bots/spiders and bio-informatics
          ▪ Automated querying
              • Collecting information nowadays means having the power to automatically query data sources (databases, websites, Google, Ensembl or NCBI databases)
              • Query in web terms: GET / POST
              • Web queries using Perl: LWP library
          ▪ LWP: set of Perl modules which provides a simple and consistent application programming interface (API) to the World-Wide Web
              • Free LWP E-book: http://lwp.interglacial.com/
          ▪ LWP for newbies
              • LWP::Simple (demo1)
              • Go to a URL, fetch data, ready to parse
              • Attention: HTML tags and regular expressions
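The demos in the deck use Perl and LWP and are not part of this text. As a rough Python counterpart, the sketch below builds a GET query string and then pulls the hyperlinks out of a page with the standard-library HTML parser, which tends to be more robust than matching tags with regular expressions. The base URL, the HTML snippet and the names `build_get_url` and `LinkExtractor` are assumptions made for illustration; no network request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_get_url(base_url, params):
    """A GET query is the base URL plus URL-encoded key=value pairs."""
    return base_url + "?" + urlencode(params)


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:   # only keep text inside a link
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up page content; a real bot would obtain it with
# urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = (
    '<p>See <a href="/gene/TP53">TP53 <b>gene</b></a> '
    'and <a href="/gene/MYCN">MYCN</a>.</p>'
)

url = build_get_url("http://www.example.org/search",
                    {"q": "methylation cancer", "page": 2})
extractor = LinkExtractor()
extractor.feed(SAMPLE_HTML)
print(url)
print(extractor.links)
```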
  • 17. Bots and spiders
      ▪ Bots/spiders and bio-informatics
          ▪ Some more advanced features
              • LWP::UserAgent (demo2: show server access logs)
              • Fill in forms and parse results
              • Depending on content: follow hyperlinks to other pages and parse these again,…
              • Mechanize package: follow links; fill in forms,…
          ▪ Bioinformatics examples
              • Use genome browser data (demo3) and sequences
              • Get gene aliases and symbols from GeneCards (demo4)
  • 18. Bots and spiders
      ▪ Bots/spiders and bio-informatics
          ▪ Why not make use of the crawling, indexing and serving technologies of others (e.g. Google)?
              • Google allows automated queries: 1000 queries a day per account
              • Google uses snippets: the short pieces of text you get in the main search results
              • These are the result of its indexing and parsing algorithms
              • Demo5: LWP and Google APIs combined, and parsing the results
          ▪ API: Application Programming Interface
              • Hides complexity by sharing ‘libraries’ with functions that can be applied within another programming language
              • Bridges programming languages and crosses abstraction layers
              • Examples: displaying on a screen; printing; querying Google or NCBI from within a programming language
  • 19. Bots and spiders
      ▪ Bots/spiders and bio-informatics: APIs
          ▪ The Google example used the Google API
          ▪ NCBI API
              • The NCBI Web service is a web program that enables developers to access Entrez Utilities via the Simple Object Access Protocol (SOAP)
              • Programmers may write software applications that access the E-Utilities using any SOAP development tool
              • Main tools (demo6):
                  – E-Search: searches and retrieves primary IDs and term translations and optionally retains results for future use in the user's environment
                  – E-Fetch: retrieves records in the requested format from a list of one or more primary IDs
          ▪ Ensembl API (demo7)
              • Uses ‘Slices’ and adaptors
              • You have to know the ‘application’ or database (Compara/Core/…)
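The E-Search then E-Fetch sequence can be sketched as follows. Note the swap: the slide describes the SOAP interface, while this sketch uses the URL-based E-utilities endpoints (`esearch.fcgi` and `efetch.fcgi`). The helper names are hypothetical, the search term is only an example, and the XML reply is a shortened, made-up sample in the E-Search result layout, so nothing is sent to NCBI here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL for E-Search: returns the primary IDs that match a query."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_url(db, ids, retmode="xml"):
    """URL for E-Fetch: returns the records for a list of primary IDs."""
    query = urlencode({"db": db, "id": ",".join(ids), "retmode": retmode})
    return f"{EUTILS}/efetch.fcgi?{query}"


def parse_esearch(xml_text):
    """Extract the ID list from an E-Search XML reply."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


# Made-up reply (placeholder IDs) in the E-Search result layout.
SAMPLE_REPLY = (
    "<eSearchResult><Count>2</Count>"
    "<IdList><Id>11111111</Id><Id>22222222</Id></IdList>"
    "</eSearchResult>"
)

search = esearch_url("pubmed", "neuroblastoma methylation")
ids = parse_esearch(SAMPLE_REPLY)
fetch = efetch_url("pubmed", ids)
print(search)
print(fetch)
```

In a real script the reply would come from requesting `search`, and the IDs it contains would be passed on to `efetch_url`.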
  • 20. Bots and spiders
      ▪ Bots/spiders and bio-informatics: APIs
          ▪ NCBI API
          ▪ A frequently used NCBI database is PubMed
              • PubMed can be queried using E-Utils
              • Uses the same syntax as the regular PubMed website
              • Get the data back in the same data formats as on the website (XML, plain text)
              • Parse XML results and apply more advanced text-mining techniques
              • Demo8
              • Parse results and present them in an interface
                  – Methylated genes in cancer:
                  – http://matrix.ugent.be/mate/methylome/result1.html
                  – miRNAs in cancer:
                  – http://matrix.ugent.be/mate/textmining/preprocess/
  • 21. Part 2 Real-life case studies: the use of bots and spiders in bio-informatics
  • 22. Bots and spiders
      ▪ TextMining
          ▪ Create and translate query
              • User query -> query suited for PubMed
          ▪ Query is executed, results are returned
              • Results format: XML, TXT, MedLine, ASN,…
              • Human readable <> parsable (XML parsers)
          ▪ Parse results
              • Extract information: authors, title, abstract
              • Store results
          ▪ Analyse results
              • Identify gene names, keywords, GO-terms,… -> score
              • Semantic analysis / NLP / …
          ▪ Visualise results
              • Highlighting, hierarchy, filters, searches, graphics
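The "parse results" step might look roughly like this when the results come back as PubMed XML. The sketch relies on the element names of that format (`PubmedArticle`, `PMID`, `ArticleTitle`, `AbstractText`, `Author`); the record itself is invented, and `parse_pubmed` is a hypothetical helper rather than the code behind the demos.

```python
import xml.etree.ElementTree as ET

# Invented record, trimmed to the fields that are extracted below.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>11111111</PMID>
    <Article>
      <ArticleTitle>Made-up title about gene methylation</ArticleTitle>
      <Abstract><AbstractText>Made-up abstract text.</AbstractText></Abstract>
      <AuthorList>
        <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def parse_pubmed(xml_text):
    """Return one dict (pmid, title, abstract, authors) per article."""
    records = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        records.append({
            "pmid": article.findtext(".//PMID"),
            "title": article.findtext(".//ArticleTitle"),
            # An abstract can be split over several AbstractText parts.
            "abstract": " ".join(
                part.text or "" for part in article.iter("AbstractText")
            ),
            "authors": [
                f"{a.findtext('LastName')} {a.findtext('Initials')}"
                for a in article.iter("Author")
            ],
        })
    return records


records = parse_pubmed(SAMPLE_XML)
print(records)
```

The resulting list of dictionaries is easy to store and is a convenient input for the analysis and visualisation steps.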
  • 23. Bots and spiders ▪ TextMining
  • 24. Bots and spiders ▪ TextMining
  • 25. Bots and spiders ▪ TextMining
  • 26. Bots and spiders
      ▪ TextMining
          ▪ Demonstration: GoldMine
              ▪ Web-application
              ▪ Translate query: find aliases for genes or miRNAs and incorporate them in the search
              ▪ Query NCBI PubMed using E-fetch
              ▪ Get the results and process them
                  ▪ Count
                  ▪ Highlight
                  ▪ Rank
              ▪ Visualization
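The translate / count / highlight / rank steps listed for GoldMine can be approximated with a small sketch. It is not GoldMine's actual code: the scoring is a plain mention count, the alias list and the three abstracts are invented for illustration, and all function names are hypothetical.

```python
import re


def expand_query(aliases, context):
    """Translate query: OR the aliases together, AND the context term."""
    return "(" + " OR ".join(aliases) + ") AND " + context


def alias_pattern(aliases):
    """Whole-word, case-insensitive pattern that matches any alias."""
    joined = "|".join(re.escape(alias) for alias in aliases)
    return re.compile(r"\b(" + joined + r")\b", re.IGNORECASE)


def count_mentions(text, aliases):
    return len(alias_pattern(aliases).findall(text))


def highlight(text, aliases):
    """Wrap every alias mention in ** markers."""
    return alias_pattern(aliases).sub(lambda m: "**" + m.group(0) + "**", text)


def rank(abstracts, aliases):
    """Order the abstract IDs by number of alias mentions, highest first."""
    return sorted(abstracts,
                  key=lambda pmid: -count_mentions(abstracts[pmid], aliases))


ALIASES = ["MYCN", "NMYC"]   # example alias list, assumed for illustration
ABSTRACTS = {                # invented abstracts keyed by placeholder IDs
    "1": "MYCN amplification is studied; NMYC and MYCN are mentioned.",
    "2": "No relevant gene here.",
    "3": "Methylation near mycn.",
}

query = expand_query(ALIASES, "methylation")
ranking = rank(ABSTRACTS, ALIASES)
marked = highlight(ABSTRACTS["3"], ALIASES)
print(query, ranking, marked)
```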
  • 27. Bots and spiders
      ▪ Data analysis
          ▪ NCBI GEO: Gene Expression Omnibus
              ▪ Raw expression data on FTP server
              ▪ Annotation: can be queried using NCBI E-Utils
              ▪ Annotation: in Excel files on the FTP server
          ▪ For specific experimental conditions, get all raw data and annotations and perform an automated analysis
          ▪ Create a scheme of how you would proceed. Biological question: superficial vs. infiltrating bladder cancer
  • 28. Bots and spiders
      ▪ Case study: superficial vs. infiltrating bladder cancer
          ▪ Find experiments on GEO
          ▪ Annotation of samples: up to the submitters
          ▪ ‘Uniform’ sample sheet available (Matrix file)
          ▪ Current update of GEO: view ‘factors’ in graphical overview
  • 29. Bots and spiders ▪ Case study: superficial vs. infiltrating bladder cancer
  • 30. Bots and spiders
      ▪ Case study: superficial vs. infiltrating bladder cancer
          ▪ Use this to couple sample annotation features (stage, age, risk, sex) to the unique sample ID (GSMxxxxxxx)
          ▪ Get raw data for each sample in the dataset
              ▪ Either txt files (uniform) or raw data files (such as Affy CEL files)
              ▪ Depends on the platform used: GPLxxxx
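Coupling annotation features to sample IDs could be done by reading the header lines of a GEO series matrix file, where each `!Sample_...` row holds one tab-separated value per sample. The sketch below assumes characteristics written as `key: value`; the matrix text, the GSM placeholders and the function name are invented, and real submissions are far less uniform than this.

```python
import csv
import io

# Invented excerpt in the series-matrix header layout (tab-separated).
SAMPLE_MATRIX = (
    '!Series_title\t"made-up bladder cancer series"\n'
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n'
    '!Sample_characteristics_ch1\t"stage: superficial"\t"stage: infiltrating"\n'
    '!Sample_characteristics_ch1\t"sex: F"\t"sex: M"\n'
)


def sample_annotations(matrix_text):
    """Map every sample ID to a dict of its 'key: value' characteristics."""
    ids, rows = [], []
    for fields in csv.reader(io.StringIO(matrix_text), delimiter="\t"):
        if not fields:
            continue
        if fields[0] == "!Sample_geo_accession":
            ids = fields[1:]
        elif fields[0] == "!Sample_characteristics_ch1":
            rows.append(fields[1:])
    annotations = {gsm: {} for gsm in ids}
    for row in rows:
        for gsm, cell in zip(ids, row):
            key, sep, value = cell.partition(": ")
            if sep:                   # skip cells without a 'key: value' form
                annotations[gsm][key] = value
    return annotations


annotations = sample_annotations(SAMPLE_MATRIX)
# Class labels per sample (0 = superficial, 1 = infiltrating), in file order.
cl = [0 if a["stage"] == "superficial" else 1 for a in annotations.values()]
print(annotations, cl)
```

A 0/1 vector of this kind is what a two-class analysis expects as its class labels, one entry per sample.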
  • 31. Bots and spiders
      ▪ Case study: superficial vs. infiltrating bladder cancer
          ▪ Platform / data files / samples / sample annotation relationship
          ▪ Set up a standardised analysis strategy
              ▪ Make use of sample annotations
              ▪ Combine studies or keep them separate?
              ▪ Normalisation
              ▪ RankProd analysis
  • 32. Bots and spiders
      ▪ Case study: superficial vs. infiltrating bladder cancer
          data.justrma <- just.rma("GSM90305.CEL", …)    # SAMPLES
          expression <- exprs(data.justrma)              # NORMALISATION
          results[,2:103] <- expression
          library(hgu95av2.db)                           # PLATFORM
          cl <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, …   # ANNOTATION
          RP.out.stage <- RP(results[,3:104], cl, num.perm = 100, logged = TRUE, na.rm = FALSE, plot = TRUE, rand = 123)    # ANALYSIS STRATEGY
  • 33. Bots and spiders
      ▪ Case study: superficial vs. infiltrating bladder cancer
          ▪ Combine results across studies
          ▪ Biological question <> data analysis
          ▪ Scoring scheme, prioritization
              ▪ Superficial vs. infiltrating
              ▪ Metastasis vs. primary cancer
              ▪ High stage vs. low stage
              ▪ Normal vs. cancer
  • 35. Bots and spiders
      ▪ Integrated analysis
          Rank Meth Pca Lit Meth other Expression Pca Progression Rank1 2 3 4 5 6 7 8 EXPRESSION RE-EXP CpG Pc
          1    1 x 0,95 1 0,993 0,997 0,84 1
          2    0,998 0,995 1 0,958 0,091 0,994
          3    1 x x x 1 0,993 1 0,996 0,312
          4    1 x x x 0,995 0,767 0,96 1 0,931 0,998 0,635
          5    1 x 0,997 0,968 1 1 0,364 0,746 0,199
          6    x 0,711 0,948 0,994 0,559 0,991 0,993
          7    0,998 0,993 0,83 0,936 0,996
          8    0,997 0,99 0,998 0,759 0,726 0,575
          9    1 x x 0,886 0,995 0,997 1 0,7
          10   1 0,998 0,409 0,99 0,88 0,998 0,779
          11   1 x x 0,995 0,999 0,995 0,687
          12   1 x x 0,997 0,999 0,999 0,257
          13   1 x x x 0,799 0,996 0,969 0,994 0,848 0,981 0,887
          14   1 x x 0,916 0,568 0,99 0,993 0,994 0,988 0,558
          15   0,986 0,995 0,956 0,983 0,998
          16   1 x 0,157 1 0,925 0,989 0,984 0,993
  • 36. Acknowledgments
      ▪ CMGG
          ▪ Anneleen Decock
          ▪ Frank Speleman
          ▪ Jo Vandesompele
      ▪ BioBix
          ▪ Leander Van Neste
          ▪ Tim De Meyer
          ▪ Gerben Mensschaert
          ▪ Geert Trooskens
          ▪ Wim Van Criekinge