5. Overview
Bots and spiders
Introduction
Bots
Spiders
The Google case
Bots/spiders and bio-informatics
Automated querying
APIs
NCBI E-Utils (PubMed/GenBank)
Ensembl
6. Bots and spiders
Bots and spiders
The web history
• In 1989, while working at CERN, Tim Berners-
Lee invented a network-based
implementation of the hypertext concept
• Since then, information can be retrieved by
‘following links’ instead of having to know the
exact location beforehand
• Information is not at a single location, it is
dynamic and spread across machines
7. Bots and spiders
Bots
Webbots
• Also called web robots, WWW robots or bots: software
applications that run automated tasks over the
Internet
Bots perform tasks that are:
• Simple
• Structurally repetitive
• Carried out at a much higher rate than would be possible
for a human
• An automated script fetches, analyses and files
information from web servers at many times
the speed of a human
Other uses:
• Chatbots / IM / Skype / Wiki bots
• Malicious bots and bot networks (Zombies)
8. Bots and spiders
Bots
A spam bot, called the ‘Zunker Bot’
• Is installed on unpatched Windows machines
• Controls the clients through a neat application
• Can install additional software and execute commands
9. Bots and spiders
Spiders
Webspiders
• Webspiders (crawlers) are programs or
automated scripts that browse the World
Wide Web in a methodical, automated
manner. A spider is one type of bot
The spider starts with a list of
URLs to visit, called the seeds
• As the crawler visits these URLs, it identifies
all the hyperlinks in the page
• It adds them to the list of URLs to visit, called
the crawl frontier
• URLs from the frontier are recursively visited
according to a set of policies
• This process is called web crawling: in most
cases a means of collecting up-to-date data
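The seed list, crawl frontier and visiting policy described above can be sketched in a few lines. This is a minimal illustration in Python (not the Perl used in the demos), assuming a breadth-first policy and a hypothetical in-memory set of pages (`PAGES`) in place of real web requests; the names `LinkExtractor` and `crawl` are made up for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; `fetch` maps a URL to HTML text or None."""
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # avoids queueing the same URL twice
    visited = []              # URLs in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a>',
}

print(crawl(["http://example.org/"], PAGES.get))
```

A real crawler would replace `PAGES.get` with a function that downloads the page, and would add policies such as respecting robots.txt and limiting the request rate.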
11. Bots and spiders
Spiders
Use of webcrawlers:
• Mainly used to create a copy of all the visited pages for later processing by a
search engine that will index the downloaded pages to provide fast searches
• Automating maintenance tasks on a website, such as checking links or
validating HTML code
• Can be used to gather specific types of information from Web pages, such as
harvesting e-mail addresses
The most commonly used crawler is probably
Googlebot
• Crawls
• Indexes (content + key content tags and attributes, such as Title tags and ALT
attributes)
• Serves results: PageRank Technology
14. Bots and spiders
Google
Hardware
• Standard server hardware (2009): 16 GB RAM / 2 TB storage per server
• 2009 estimate: 450 000 servers – 2 million $/month electricity cost
Software
• Web server (not Apache-based)
• Storage (Google File System / BigTable): distributed storage – mostly in
memory
• Borg job scheduling and monitoring
• Indexing services: Caffeine / Percolator
• MapReduce: cluster computing system that splits a complex problem into ‘jobs’
sent to worker nodes (Map); the answers are gathered and combined to solve the
original question (Reduce)
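The Map and Reduce steps can be illustrated with the classic word-count example. The sketch below is in Python and runs on a single machine; it only mimics the structure (map, group by key, reduce) and is not Google's implementation. All function names are chosen for this illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single answer."""
    return key, sum(values)


def word_count(chunks):
    pairs = []
    for chunk in chunks:  # on a cluster, each chunk would go to its own worker
        pairs.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["bots and spiders", "spiders crawl", "bots query"]))
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread over many worker nodes.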
16. Bots and spiders
Bots/spiders and bio-informatics
Automated querying
• Collecting information nowadays means being able to query data sources
automatically (databases, websites, Google, Ensembl or NCBI databases)
• Query in web-terms: GET / POST
• Web-queries using Perl: LWP library
LWP: set of Perl modules which provides a simple and
consistent application programming interface (API) to
the World-Wide Web
• Free LWP E-book: http://lwp.interglacial.com/
LWP for newbies
• LWP::Simple (demo1)
• Go to a URL, fetch data, ready to parse
• Caution: parsing HTML tags with regular expressions is error-prone
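The difference between a GET and a POST query can be shown without contacting any server. The sketch below uses Python's standard `urllib` as a stand-in for Perl's LWP (it is not the code of demo1); the URL `http://example.org/search` and the parameter name `q` are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_get(base_url, params):
    """GET: the parameters travel in the query string of the URL."""
    return Request(base_url + "?" + urlencode(params))


def build_post(base_url, params):
    """POST: the parameters travel in the body of the request."""
    return Request(base_url, data=urlencode(params).encode("utf-8"))


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Placeholder URL and parameter; nothing is sent until fetch() is called.
get_req = build_get("http://example.org/search", {"q": "bladder cancer"})
post_req = build_post("http://example.org/search", {"q": "bladder cancer"})
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Calling `fetch(get_req)` would perform the actual request; the returned HTML is then ready to be parsed, preferably with an HTML parser instead of regular expressions.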
17. Bots and spiders
Bots/spiders and bio-informatics
Some more advanced features
• LWP::UserAgent (demo2 – show server access logs)
• Fill in forms and parse results
• Depending on content: follow hyperlinks to other pages and parse these
again,…
• Mechanize package: follow links; fill in forms,…
Bioinformatics examples
• Use genome browser data (demo3) and sequences
• Get gene aliases and symbols from GeneCards (demo4)
18. Bots and spiders
Bots/spiders and bio-informatics
Why not make use of crawls, indexing and serving
technologies of others (e.g. Google)
• Google allows automated queries: 1000 queries a day per account
• Google uses snippets: the short pieces of text shown in the main search
results
• Snippets are the result of its indexing and parsing algorithms
• Demo5: LWP and Google APIs combined and parsing the results
API: Application Programming Interface
• Hides complexity by sharing ‘libraries’ with functions that can be applied within
another programming language
• Bridges programming languages – crosses abstraction layers
• Example: displaying on a screen; printing; querying Google or NCBI from within
a programming language
19. Bots and spiders
Bots/spiders and bio-informatics APIs
Google example used Google API
NCBI API
• The NCBI Web service is a web program that enables developers to access
Entrez Utilities via the Simple Object Access Protocol (SOAP)
• Programmers may write software applications that access the E-Utilities using
any SOAP development tool
• Main tools (demo6):
– E-Search: Searches and retrieves primary IDs and term translations and
optionally retains results for future use in the user's environment
– E-Fetch: Retrieves records in the requested format from a list of one or
more primary IDs
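The E-Search then E-Fetch sequence can be sketched as follows. Note the assumption: this Python sketch uses the URL-based E-utilities interface (`esearch.fcgi` and `efetch.fcgi` with the `db`, `term`, `retmax`, `id` and `retmode` parameters) rather than the SOAP interface mentioned above, and it is not the code of demo6. The XML in `SAMPLE` is a made-up, shortened E-Search response with placeholder IDs, so the sketch runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL that searches `db` for `term` and returns primary IDs."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS + "/esearch.fcgi?" + query


def efetch_url(db, ids, retmode="xml"):
    """URL that retrieves the records for a list of primary IDs."""
    query = urlencode({"db": db, "id": ",".join(ids), "retmode": retmode})
    return EUTILS + "/efetch.fcgi?" + query


def parse_ids(esearch_xml):
    """Extract the primary IDs from an E-Search XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


# Made-up, shortened response; the IDs are placeholders, not real records.
SAMPLE = (
    "<eSearchResult><Count>2</Count>"
    "<IdList><Id>111</Id><Id>222</Id></IdList>"
    "</eSearchResult>"
)

ids = parse_ids(SAMPLE)
print(esearch_url("pubmed", "bladder cancer methylation"))
print(efetch_url("pubmed", ids))
```

In a real script the E-Search URL would be downloaded first, its IDs parsed, and the resulting E-Fetch URL downloaded next.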
Ensembl API (demo7)
• Uses ‘Slices’ and adaptors
• You have to know the ‘application’ or database (Compare/Core/…)
20. Bots and spiders
Bots/spiders and bio-informatics APIs
NCBI API
A frequently used NCBI database is PubMed
• PubMed can be queried using E-Utils
• Uses the same query syntax as the regular PubMed website
• Returns the data in the same formats as the website (XML, plain text)
• Parse XML results and apply more advanced Text-mining techniques
• Demo8
• Parse results and present them in an interface
– Methylated genes in cancer:
– http://matrix.ugent.be/mate/methylome/result1.html
– miRNAs in cancer:
– http://matrix.ugent.be/mate/textmining/preprocess/
26. Bots and spiders
TextMining
Demonstration: GoldMine
Web-application
Translate query – find aliases for genes or miRNAs
and incorporate them in the search
Query NCBI PubMed using E-fetch
Get the results and process them
Count
Highlight
Rank
Visualization
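The count, highlight and rank steps can be sketched on plain text. This is a minimal Python illustration, not the GoldMine implementation; the function names, the `**` highlight marker and the example abstracts are invented, and the alias pair CDKN2A / p16 only serves as an example of searching with gene aliases.

```python
import re


def _pattern(terms):
    """Case-insensitive whole-word pattern matching any of the terms."""
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def count_hits(text, terms):
    """Count: number of occurrences of any term (or alias) in the text."""
    return len(_pattern(terms).findall(text))


def highlight(text, terms, mark="**"):
    """Highlight: wrap every occurrence in a marker."""
    return _pattern(terms).sub(lambda m: mark + m.group(0) + mark, text)


def rank(abstracts, terms):
    """Rank: abstracts with at least one hit, most hits first."""
    scored = [(count_hits(a, terms), a) for a in abstracts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [a for score, a in scored if score > 0]


# Invented example texts; CDKN2A and p16 are used as gene aliases.
TERMS = ["CDKN2A", "p16"]
ABSTRACTS = [
    "p16 expression.",
    "No relevant gene here.",
    "CDKN2A methylation in bladder cancer; p16 loss is frequent.",
]

for abstract in rank(ABSTRACTS, TERMS):
    print(count_hits(abstract, TERMS), highlight(abstract, TERMS))
```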
27. Bots and spiders
Data analysis
NCBI GEO – Gene Expression Omnibus
Raw expression data on FTP-server
Annotation: can be queried using NCBI E-Utils
Annotation: in Excel-files at FTP-server
For specific experimental conditions, get all raw data
and annotations and perform an automated analysis
Draw up a scheme of how you would proceed:
biological question: superficial vs.
infiltrating bladder cancer
28. Bots and spiders
Case study: superficial vs. Infiltrating
bladder cancer
Find experiments on GEO
Annotation of samples: up to the submitters
‘Uniform’ sample sheet available (Matrix-file)
Current update of GEO: view ‘factors’ in graphical
overview
30. Bots and spiders
Case study: superficial vs. Infiltrating
bladder cancer
Use this to couple sample annotation features (stage,
age, risk, sex) to unique sampleID (GSMxxxxxxx)
Get raw data for each sample in dataset
Either txt files (uniform) or raw data files (such as Affy
CEL files)
Depends on the platform used: GPLxxxx
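Coupling the sample annotation features to the sample IDs could look like the sketch below. It assumes the Matrix-file is tab-separated, lists the sample IDs on a `!Sample_geo_accession` line and stores annotations as quoted `key: value` cells on `!Sample_characteristics_ch1` lines; check this against the actual file before relying on it. The GSM numbers and values in `SAMPLE` are invented placeholders.

```python
import csv
import io


def sample_annotations(matrix_text):
    """Map each GSM sample ID to a dict of its 'key: value' annotations."""
    sample_ids = []
    rows = []
    for fields in csv.reader(io.StringIO(matrix_text), delimiter="\t"):
        if not fields:
            continue
        if fields[0] == "!Sample_geo_accession":
            sample_ids = fields[1:]
        elif fields[0] == "!Sample_characteristics_ch1":
            rows.append(fields[1:])
    annotations = {gsm: {} for gsm in sample_ids}
    for row in rows:
        for gsm, cell in zip(sample_ids, row):
            key, sep, value = cell.partition(":")
            if sep:  # skip cells that are not in 'key: value' form
                annotations[gsm][key.strip()] = value.strip()
    return annotations


# Invented placeholder content in the assumed Matrix-file layout.
SAMPLE = (
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n'
    '!Sample_characteristics_ch1\t"stage: Ta"\t"stage: T2"\n'
    '!Sample_characteristics_ch1\t"sex: male"\t"sex: female"\n'
)

print(sample_annotations(SAMPLE))
```

The resulting dictionary can then be used to select, per study, which GSM samples belong to each group being compared.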
31. Bots and spiders
Case study: superficial vs. Infiltrating
bladder cancer
Platform / data files / samples / sample annotation
relationship
Set up standardised analysis strategy
Make use of sample annotations
Combine studies or keep them separate?
Normalisation
RankProd analysis
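The idea behind a rank product analysis can be sketched as follows: genes are ranked within each study, and the ranks of a gene are combined as their geometric mean, so that genes ranked consistently high get a small score. This Python sketch is a simplified illustration and not the RankProd package itself; it ignores ties and omits the permutation-based significance estimate. Gene names and fold changes are invented.

```python
import math


def ranks(values):
    """Rank 1 goes to the largest value (most up-regulated)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_product(fold_changes):
    """fold_changes: {study: {gene: log fold change}}, same genes per study."""
    genes = sorted(next(iter(fold_changes.values())))
    per_study = [
        ranks([study[gene] for gene in genes])
        for study in fold_changes.values()
    ]
    k = len(per_study)
    # Geometric mean of the ranks of each gene across the k studies.
    return {
        gene: math.prod(r[i] for r in per_study) ** (1.0 / k)
        for i, gene in enumerate(genes)
    }


# Invented example: two studies, three genes.
DATA = {
    "study1": {"geneA": 2.0, "geneB": 1.0, "geneC": -1.0},
    "study2": {"geneA": 1.5, "geneB": 2.5, "geneC": 0.1},
}

for gene, score in sorted(rank_product(DATA).items(), key=lambda kv: kv[1]):
    print(gene, round(score, 3))
```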
33. Bots and spiders
Case study: superficial vs. Infiltrating
bladder cancer
Combine results across studies
Biological question <> data analysis
Scoring scheme, prioritization
Superficial vs. Infiltrating
Metastasis vs. Primary cancer
High stage vs. Low stage
Normal vs. Cancer