Bots & spiders

Bio-informatica II
19/04/2012

Maté Ongenaert

Center for Medical Genetics
Ghent University Hospital, Belgium

 Part 1: Bots & spiders
Background

 Part 2: Real-life case studies
The use of bots and spiders in bio-informatics

 About the presenter
 Bio-engineer cell and gene biotechnology (2005)
• Master's thesis: identification of cancer-specific methylated genes

 PhD applied biological sciences: cell and gene
biotechnology (2009)
• PhD thesis: cellular reprogramming

 Industrial experience
• Research scientist (methylation biomarkers)

 Currently: postdoc at CMGG
• Prognostic methylation biomarkers in neuroblastoma

Part 1
Bots & spiders: background

Overview

 Bots and spiders
 Introduction
 Bots
 Spiders
 The Google case
 Bots/spiders and bio-informatics
 Automated querying
 APIs
 NCBI E-Utils (PubMed/GenBank)
 Ensembl

Bots and spiders

 Bots and spiders
 The web history
• In 1989, while working at CERN, Tim Berners-
Lee invented a network-based
implementation of the hypertext concept
• Since then, information can be retrieved by
‘following links’ instead of having to know its
exact location in advance
• Information is not stored at a single location; it is
dynamic and spread across machines

 Bots
 Webbots
• Also called web robots, WWW robots or bots: software
applications that run automated tasks over the
Internet

 Bots perform tasks that:
• Are simple
• Are structurally repetitive
• Must run at a much higher rate than would be possible
for a human
• Example: an automated script fetches, analyses and files
information from web servers at many times
the speed of a human

 Other uses:
• Chatbots / IM / Skype / Wiki bots
• Malicious bots and bot networks (Zombies)

 Bots
 A spam bot, called the ‘Zunker Bot’
• Is installed on unpatched Windows machines
• Controls the clients through a neat application
• Can install additional software and execute commands

 Spiders
 Webspiders
• Webspiders (crawlers) are programs or
automated scripts that browse the World
Wide Web in a methodical, automated
manner. A spider is one type of bot

 The spider starts with a list of
URLs to visit, called the seeds
• As the crawler visits these URLs, it identifies
all the hyperlinks in the page
• It adds them to the list of URLs to visit, called
the crawl frontier
• URLs from the frontier are recursively visited
according to a set of policies
• This process is called web crawling: in most
cases a means of collecting up-to-date data
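
The seed / frontier loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `TOY_WEB` dictionary is a made-up in-memory stand-in for real pages, the function names are hypothetical, and the only "policy" is a page budget plus a rule never to queue a URL twice.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}

def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    frontier = deque(seeds)   # URLs still to visit (the crawl frontier)
    seen = set(seeds)         # URLs already queued, so nothing is visited twice
    visited = []
    while frontier and len(visited) < max_pages:  # policy: simple page budget
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):               # hyperlinks found on the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["http://example.org/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

In a real spider, `get_links` would download the page and extract its hyperlinks; keeping it as a parameter makes the frontier logic testable without network access.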

 Spiders
 Use of webcrawlers:
• Mainly used to create a copy of all the visited pages for later processing by a
search engine that will index the downloaded pages to provide fast searches
• Automating maintenance tasks on a website, such as checking links or
validating HTML code
• Can be used to gather specific types of information from Web pages, such as
harvesting e-mail addresses

 The most commonly used crawler is probably the
Googlebot crawler
• Crawls
• Indexes (content + key content tags and attributes, such as Title tags and ALT
attributes)
• Serves results: PageRank Technology
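
The idea behind PageRank can be illustrated with the textbook power-iteration formulation. The sketch below is an assumption-laden toy, not Google's production system: the link graph is invented, the damping factor of 0.85 is the commonly quoted default, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration on a {page: [outlinks]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}          # start from a uniform rank
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:               # pass rank along each link
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n      # dangling page: spread evenly
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Made-up link graph: C is linked to by A, B and D; nothing links to D.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Pages that receive many links from well-ranked pages end up with a high score, which is the property a search engine can use to order its results.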

 PageRank

 Google
 Hardware
• Standard server hardware (2009): 16 GB RAM / 2 TB storage per server
• 2009 estimate: 450 000 servers – 2 million $/month electricity cost

 Software
• Web server (not Apache-based)
• Storage (Google File System / BigTable): distributed storage, mostly in
memory
• Borg: job scheduling and monitoring
• Indexing services: Caffeine / Percolator
• MapReduce: a cluster framework that splits a complex problem into ‘jobs’ sent to
worker nodes (Map); their answers are gathered and combined to solve the original
question (Reduce)
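
The Map and Reduce steps can be mimicked on a single machine with a word count, the usual introductory example. This is only a sketch of the programming model: the function names are hypothetical, the documents are made up, and a real cluster would run the map and reduce calls on separate worker nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single answer."""
    return key, sum(values)

documents = ["bots crawl the web", "spiders crawl the web too"]
mapped = [pair for doc in documents for pair in map_phase(doc)]  # one 'job' per document
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```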

 Automated querying
• Collecting information nowadays requires the ability to automatically query
data sources (databases, websites, Google, Ensembl or NCBI databases)
• Queries in web terms: HTTP GET / POST requests
• Web queries using Perl: the LWP library
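
The difference between GET and POST can be made concrete without sending anything over the network. The sketch below uses Python's standard `urllib` rather than the Perl LWP library used in the demos; the endpoint `http://example.org/search`, the parameter names and the user-agent string are all hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://example.org/search"   # hypothetical endpoint, not a real service
params = {"term": "methylation neuroblastoma", "format": "xml"}
query = urlencode(params)                # term=methylation+neuroblastoma&format=xml
headers = {"User-Agent": "course-demo-bot/0.1"}  # identify your bot to the server

# GET: the parameters travel in the URL itself.
get_request = Request(BASE_URL + "?" + query, headers=headers)

# POST: the same parameters travel in the request body instead.
post_request = Request(BASE_URL, data=query.encode("ascii"), headers=headers)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
# To actually send a request: urllib.request.urlopen(get_request)
```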

 LWP: set of Perl modules which provides a simple and
consistent application programming interface (API) to
the World-Wide Web
• Free LWP E-book: http://lwp.interglacial.com/

 LWP for newbies
• LWP::Simple (demo1)
• Go to a URL, fetch data, ready to parse
• Points of attention: HTML tags and regular expressions
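
Once a page has been fetched, its HTML has to be parsed. Regular expressions work for quick jobs but break easily on real-world markup, so the sketch below uses Python's built-in `html.parser` as one possible alternative. The HTML snippet is made up and the class name is hypothetical.

```python
from html.parser import HTMLParser

class LinkAndTitleParser(HTMLParser):
    """Collects the page title and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased, whatever the page used.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Made-up page with mixed-case tags and an anchor without href.
SAMPLE_HTML = """<html><head><title>Gene page</title></head>
<body><a href="/gene/TP53">TP53</a> <A HREF='/gene/MYCN' class="x">MYCN</A>
<a name="anchor-without-href">skip me</a></body></html>"""

parser = LinkAndTitleParser()
parser.feed(SAMPLE_HTML)
print(parser.title, parser.links)
```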

 Some more advanced features
• LWP::UserAgent (demo2 – show server access logs)
• Fill in forms and parse results
• Depending on content: follow hyperlinks to other pages and parse these
again,…
• Mechanize package: follow links; fill in forms,…

 Bioinformatics examples
• Use genome browser data (demo3) and sequences
• Get gene aliases and symbols from GeneCards (demo4)

 Why not make use of crawls, indexing and serving
technologies of others (e.g. Google)
• Google allows automated queries: 1000 queries per account per day
• Google uses snippets: the short pieces of text you get in the main search
results
• These are the result of its indexing and parsing algorithms
• Demo5: LWP and Google APIs combined and parsing the results

 API: Application Programming Interface
• Hides complexity by sharing ‘libraries’ with functions that can be applied within
another programming language
• Bridges programming languages – crosses abstraction layers
• Example: displaying on a screen; printing; querying Google or NCBI from within
a programming language

 Bots/spiders and bio-informatics APIs
 Google example used Google API
 NCBI API
• The NCBI Web service is a web program that enables developers to access
Entrez Utilities via the Simple Object Access Protocol (SOAP)
• Programmers may write software applications that access the E-Utilities using
any SOAP development tool
• Main tools (demo6):
– E-Search: Searches and retrieves primary IDs and term translations and
optionally retains results for future use in the user's environment
– E-Fetch: Retrieves records in the requested format from a list of one or
more primary IDs
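
As an illustration of the two tools, the sketch below builds E-Search and E-Fetch request URLs for NCBI's plain HTTP interface to the E-Utilities, which is an assumption made here instead of the SOAP route named above. The helper function names are hypothetical and the IDs are placeholders, not real PubMed records; nothing is sent over the network.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(db, term, retmax=20):
    """URL for E-Search: returns primary IDs matching a query term."""
    params = {"db": db, "term": term, "retmax": retmax, "usehistory": "y"}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)

def efetch_url(db, ids, retmode="xml"):
    """URL for E-Fetch: returns full records for a list of primary IDs."""
    params = {"db": db, "id": ",".join(str(i) for i in ids), "retmode": retmode}
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)

search_url = esearch_url("pubmed", "methylation AND neuroblastoma")
fetch_url = efetch_url("pubmed", [12345, 67890])   # placeholder IDs
print(search_url)
print(fetch_url)
```

A typical bot would request `search_url`, read the returned ID list, and pass those IDs to `efetch_url`; `usehistory=y` asks the server to retain the result set for follow-up calls.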

 Ensembl API (demo7)
• Uses ‘Slices’ and adaptors
• You have to know the ‘application’ or database (Compara/Core/…)

 Bots/spiders and bio-informatics APIs
 NCBI API
 A frequently used NCBI database is PubMed
• PubMed can be queried using E-Utils
• Uses the same query syntax as the regular PubMed website
• Returns the data in the same formats as the website (XML, plain text)
• Parse the XML results and apply more advanced text-mining techniques
• Demo8
• Parse results and present them in an interface
– Methylated genes in cancer:
– http://matrix.ugent.be/mate/methylome/result1.html
– miRNAs in cancer:
– http://matrix.ugent.be/mate/textmining/preprocess/
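
Parsing the XML that comes back can be sketched with Python's standard `xml.etree.ElementTree`. The record below is entirely made up; only the element names (`PubmedArticle`, `PMID`, `ArticleTitle`, `AbstractText`, `Author`, `LastName`) follow the PubMed XML layout, and real records contain many more fields.

```python
import xml.etree.ElementTree as ET

# Made-up record that mimics the structure of a PubMed XML result.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0000001</PMID>
      <Article>
        <ArticleTitle>Made-up title about gene methylation</ArticleTitle>
        <Abstract><AbstractText>Made-up abstract text.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
          <Author><LastName>Roe</LastName><Initials>R</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def parse_pubmed(xml_text):
    """Extract PMID, title, abstract and author surnames from each article."""
    records = []
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        records.append({
            "pmid": article.findtext(".//PMID"),
            "title": article.findtext(".//ArticleTitle"),
            # An abstract may be split over several AbstractText elements.
            "abstract": " ".join(t.text or "" for t in article.iter("AbstractText")),
            "authors": [a.findtext("LastName") for a in article.iter("Author")
                        if a.find("LastName") is not None],
        })
    return records

records = parse_pubmed(SAMPLE_XML)
print(records[0]["title"], records[0]["authors"])
```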

Part 2
Real-life case studies: the use of bots and
spiders in bio-informatics

 TextMining
 Create and translate query
• User query -> query suited for PubMed

 Query is executed, results are returned
• Results format: XML, TXT, MedLine, ASN,…
• Human-readable vs. machine-parsable (XML parsers)

 Parse results
• Extract information: authors, title, abstract
• Store results

 Analyse results
• Identify gene names, keywords, GO-terms,… -> score
• Semantic analysis / NLP processing / …

 Visualise results
• Highlighting, hierarchy, filters, searches, graphics
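
The steps above (translate the query, analyse the results, highlight hits) can be sketched as follows. Everything here is a simplified assumption: the alias table is a tiny hand-written example, the abstracts are invented, scoring is a plain mention count, and `[tiab]` is PubMed's title/abstract field tag.

```python
import re
from collections import Counter

# Hypothetical alias table; a real one would come from a gene nomenclature resource.
ALIASES = {"CDKN2A": ["CDKN2A", "p16", "INK4A"], "MGMT": ["MGMT"]}

def build_query(gene, topic):
    """Translate a user query into a PubMed query that includes all aliases."""
    names = " OR ".join(f'"{alias}"[tiab]' for alias in ALIASES[gene])
    return f"({names}) AND {topic}[tiab]"

def alias_pattern(names):
    """Regular expression matching any alias as a whole word."""
    return r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"

def count_mentions(abstracts):
    """Score each gene by the number of alias mentions across all abstracts."""
    counts = Counter()
    for text in abstracts:
        for gene, names in ALIASES.items():
            counts[gene] += len(re.findall(alias_pattern(names), text, flags=re.IGNORECASE))
    return counts

def highlight(text, names):
    """Mark every alias mention so an interface can display it emphasised."""
    return re.sub(alias_pattern(names), r"**\1**", text, flags=re.IGNORECASE)

# Made-up abstracts standing in for parsed PubMed results.
abstracts = [
    "p16 promoter methylation silences CDKN2A in tumours.",
    "MGMT methylation predicts response.",
]
query = build_query("CDKN2A", "cancer")
ranking = count_mentions(abstracts).most_common()
print(query)
print(ranking)
print(highlight(abstracts[0], ALIASES["CDKN2A"]))
```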

 TextMining
 Demonstration: GoldMine
 Web-application
 Translate query – find aliases for genes or miRNAs
and incorporate them in the search
 Query NCBI PubMed using E-fetch
 Get the results and process them
 Count
 Highlight
 Rank
 Visualization

 Data analysis
 NCBI GEO – Gene Expression Omnibus
 Raw expression data on FTP-server
 Annotation: can be queried using NCBI E-Utils
 Annotation: in Excel-files at FTP-server
 For specific experimental conditions, get all raw data
and annotations and perform an automated analysis
 Draw up a scheme of how you would proceed;
biological question: superficial vs.
infiltrating bladder cancer

 Case study: superficial vs. Infiltrating
bladder cancer
 Find experiments on GEO
 Annotation of samples: up to the submitters
 ‘Uniform’ sample sheet available (Matrix-file)
 Current update of GEO: view ‘factors’ in graphical
overview
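
Reading sample annotations from the matrix file can be sketched as below. The embedded text is a made-up fragment that imitates the tab-separated `!Sample_...` header lines of a GEO series matrix file; the accessions and annotation values are invented, and the sketch assumes the accession line comes before the characteristics lines and that characteristics use a `key: value` form, which submitters do not always follow.

```python
import csv
import io

# Made-up fragment imitating the header of a GEO series matrix file.
SAMPLE_MATRIX = (
    '!Series_title\t"Made-up bladder cancer series"\n'
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\t"GSM0000003"\n'
    '!Sample_characteristics_ch1\t"stage: superficial"\t"stage: infiltrating"\t"stage: superficial"\n'
    '!Sample_characteristics_ch1\t"sex: male"\t"sex: female"\t"sex: male"\n'
)

def sample_annotations(matrix_text):
    """Return {sample ID: {annotation key: value}} from matrix header lines."""
    ids, annotations = [], {}
    for row in csv.reader(io.StringIO(matrix_text), delimiter="\t"):
        if not row:
            continue
        if row[0] == "!Sample_geo_accession":
            ids = row[1:]                               # one column per sample
            annotations = {gsm: {} for gsm in ids}
        elif row[0] == "!Sample_characteristics_ch1" and ids:
            for gsm, cell in zip(ids, row[1:]):
                key, _, value = cell.partition(":")     # "stage: superficial"
                annotations[gsm][key.strip()] = value.strip()
    return annotations

annotations = sample_annotations(SAMPLE_MATRIX)
# Class labels for a two-group comparison: 0 = superficial, 1 = infiltrating.
class_labels = [0 if a["stage"] == "superficial" else 1 for a in annotations.values()]
print(annotations)
print(class_labels)
```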

 Use this to couple sample annotation features (stage,
age, risk, sex) to unique sampleID (GSMxxxxxxx)
 Get raw data for each sample in dataset
 Either txt files (uniform) or raw data files (such as Affy
CEL files)
 Depends on the platform used: GPLxxxx

 Platform / data files / samples / sample annotation
relationship
 Set up standardised analysis strategy
 Make use of sample annotations
 Combine studies or keep them separate?
 Normalisation
 RankProd analysis

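 library(affy); library(RankProd)  # just.rma() and exprs() come from affy/Biobase, RP() from RankProd
 data.justrma <- just.rma("GSM90305.CEL", "…  # SAMPLES
 expression <- exprs(data.justrma)  # NORMALISATION: just.rma() returns RMA-normalised, log2-scale values
 results[,2:103] <- expression
 library(hgu95av2.db)  # PLATFORM
 cl <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  # ANNOTATION: class label (0 or 1) per sample
 RP.out.stage <- RP(results[,3:104], cl, num.perm =
100, logged = TRUE, na.rm = FALSE, plot = TRUE, rand
= 123)  # ANALYSIS STRATEGY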
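
To show what a rank product is, here is a deliberately simplified sketch in Python. It is not the RankProd package: it takes already computed log fold changes per replicate comparison (the values below are invented), ranks the genes within each replicate, and returns the geometric mean of each gene's ranks. The permutation-based significance estimate that `num.perm` controls in RankProd is left out.

```python
import math

def rank_product(fold_changes):
    """Geometric mean of each gene's rank across replicate comparisons.

    fold_changes: {gene: [log fold change in replicate 1, replicate 2, ...]}
    Rank 1 is the most strongly up-regulated gene in a replicate.
    """
    genes = list(fold_changes)
    n_reps = len(next(iter(fold_changes.values())))
    log_rank_sum = {gene: 0.0 for gene in genes}
    for rep in range(n_reps):
        ordered = sorted(genes, key=lambda g: fold_changes[g][rep], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_rank_sum[gene] += math.log(rank)   # sum of logs = log of product
    return {gene: math.exp(log_rank_sum[gene] / n_reps) for gene in genes}

# Invented log fold changes for three genes in three replicate comparisons.
toy = {
    "geneA": [2.1, 1.8, 2.5],
    "geneB": [0.1, -0.2, 0.3],
    "geneC": [-1.5, -1.1, -2.0],
}
rp = rank_product(toy)
for gene in sorted(rp, key=rp.get):
    print(gene, round(rp[gene], 3))
```

A small rank product means the gene sits near the top of the list in every replicate, which is why the method is robust to differences in scale between studies.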

 Combine results across studies
 Biological question <> data analysis
 Scoring scheme, prioritisation
 Superficial vs. Infiltrating
 Metastasis vs. Primary cancer
 High stage vs. Low stage
 Normal vs. Cancer

 OncoMine

 Integrated analysis
Rank Meth Pca Lit Meth other Expression Pca Progression Rank 1 2 3 4 5 6 7 8
EXPRESSION RE-EXP CpG Pc
1 1 x 0.95 1 0.993 0.997 0.84 1
2 0.998 0.995 1 0.958 0.091 0.994
3 1 x x x 1 0.993 1 0.996 0.312
4 1 x x x 0.995 0.767 0.96 1 0.931 0.998 0.635
5 1 x 0.997 0.968 1 1 0.364 0.746 0.199
6 x 0.711 0.948 0.994 0.559 0.991 0.993
7 0.998 0.993 0.83 0.936 0.996
8 0.997 0.99 0.998 0.759 0.726 0.575
9 1 x x 0.886 0.995 0.997 1 0.7
10 1 0.998 0.409 0.99 0.88 0.998 0.779
11 1 x x 0.995 0.999 0.995 0.687
12 1 x x 0.997 0.999 0.999 0.257
13 1 x x x 0.799 0.996 0.969 0.994 0.848 0.981 0.887
14 1 x x 0.916 0.568 0.99 0.993 0.994 0.988 0.558
15 0.986 0.995 0.956 0.983 0.998
16 1 x 0.157 1 0.925 0.989 0.984 0.993

Acknowledgments

 CMGG
 Anneleen Decock
 Frank Speleman
 Jo Vandesompele

 BioBix
 Leander Van Neste
 Tim De Meyer
 Gerben Menschaert
 Geert Trooskens
 Wim Van Criekinge
