Geared towards bioinformatics students and taking a somewhat humorous point of view, this presentation explains what bioinformaticians are and what they do.
1. How to be a bioinformatician
Christian Frech, PhD
St. Anna Children’s Cancer Research Institute, Vienna, Austria
Talk at University of Applied Sciences, Hagenberg, Austria
April 23rd, 2014
2. What is a bioinformatician?
[Figure: Venn diagram with the labels Informatician, Statistician, Biologist, and Data scientist]
Modified from http://blog.fejes.ca/?p=2418
3. Bioinformatician vs. computational biologist
Bioinformatician:
IT savvy
Builds & maintains biological databases & Web sites
Designs & implements clever algorithms
C/C++, Java, Python
Grasp of computational subjects: more
Grasp of biological subjects: less
Computational biologist:
Asks biological questions
Analyzes & interprets biological data
Runs existing programs
Ad hoc scripting
Perl, R, Python
Grasp of computational subjects: less
Grasp of biological subjects: more
… or vice versa
4. Why do we need bioinformaticians?
Amount of generated biological data requires sophisticated
computing for data management and analysis
Programmers lack biological knowledge
Biologists don't program
The two don't understand each other
Video: "Biologist talks to statistician" (http://www.youtube.com/watch?v=Hz1fyhVOjr4)
Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (Tb) of data in 6 days [1]!
[1] http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn
6. What are bioinformaticians doing?
Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014
7. Challenges as bioinformatician
Biology is complex, not black and white
As many exceptions as rules (e.g.: define “gene”)
No single optimal solution to a problem
Results interpretable in many ways (story telling, cherry picking)
Understanding the biological question
Field is moving incredibly fast
Lack of standards, immature/abandoned software
Standard of today obsolete tomorrow
Much time spent on collecting/cleaning up data, troubleshooting errors
Stay flexible, don't overinvest in single platform/technology
Hundreds of software tools and databases out there
Easy to get lost
Important to understand their strengths and weaknesses
10. Things to have in your bioinformatics toolbox
Must haves:
Linux command line
Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)
Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction
Sequence alignment (FASTA & BLAST)
Biological databases
Regular expressions
Sequencing technologies
Web technologies (HTML, XML, …)
Highly recommended:
Advanced R skills
Parallel/distributed computing
DBMS, SQL
(Semi-)compiled language (C/C++, Java)
Dimensionality reduction (e.g. PCA)
Cluster analysis
Support Vector Machines
Hidden Markov models
Web framework (e.g. Django)
Version control system (e.g. Git)
Advanced text editor (Emacs, vim)
IDE (e.g. Eclipse, NetBeans)
11. What programming language should I learn?
[Table: Requirement vs. Recommended language; only the Requirement column survives as text]
Speed matters, low-level programming
Rich-client enterprise application development
Text file processing (regex)
Statistical analysis, fancy plots
Rapid prototyping, readable & maintainable scripts
Workflow automation
Be a jack of all trades, master of ONE!
12. Perl on decline, R and Python gaining popularity
http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming-languages.html
http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png
Perl most popular bioinformatics
programming language in 2008
R and Python take the lead in 2014
13. Top 10 most common and/or annoying mistakes in bioinformatics
Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)
14. Top-10 most common/annoying mistakes in bioinformatics
#10
Using genome coordinates with the wrong genome version
(for example, using gene coordinates from human genome version hg18 but the reference sequence from version hg19)
15. Top-10 most common/annoying mistakes in bioinformatics
#9
Forgetting to process the second strand of a DNA sequence
16. Top-10 most common/annoying mistakes in bioinformatics
#8
Processing the second strand of a DNA sequence, but taking the reverse instead of the reverse complement sequence
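The difference between the two operations can be sketched in a few lines of Python. The function name `reverse_complement` is illustrative and not taken from the slides; the sketch assumes plain DNA letters plus 'N' and leaves every other character unchanged.

```python
# Complement table for DNA; 'N' (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in 5'-to-3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


seq = "ATGCCN"
print(seq[::-1])                # reverse only: NCCGTA
print(reverse_complement(seq))  # reverse complement: NGGCAT
```

Applying the function twice returns the original sequence, which makes a cheap sanity check.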
17. Top-10 most common/annoying mistakes in bioinformatics
#7
Not accounting for different human chromosome names between UCSC and Ensembl
Example:
UCSC: “chr1”
Ensembl: “1”
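One way to guard against this is to normalize names before comparing files. The sketch below is an assumption-laden illustration: the helper names `to_ucsc` and `to_ensembl` are hypothetical, the mitochondrial mapping (Ensembl "MT", UCSC "chrM") goes beyond the slide's example, and unplaced contigs and alternative haplotypes are not handled.

```python
def to_ucsc(chrom: str) -> str:
    """Convert an Ensembl-style name ('1', 'X', 'MT') to UCSC style ('chr1', 'chrX', 'chrM')."""
    if chrom.startswith("chr"):
        return chrom  # already UCSC style
    if chrom == "MT":
        return "chrM"  # assumption: mitochondrial naming differs as well
    return "chr" + chrom


def to_ensembl(chrom: str) -> str:
    """Convert a UCSC-style name ('chr1', 'chrM') to Ensembl style ('1', 'MT')."""
    if chrom == "chrM":
        return "MT"
    return chrom[3:] if chrom.startswith("chr") else chrom


print(to_ucsc("1"))        # chr1
print(to_ensembl("chrX"))  # X
```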
18. Top-10 most common/annoying mistakes in bioinformatics
#6
Assuming the alphabetical order of chromosome names is “chr1”, “chr2”, “chr3”, …
when in fact it is “chr1”, “chr10”, “chr11”, …
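A small Python sketch shows the effect and one possible fix. The key function `chrom_sort_key` is a hypothetical helper, assuming names of the form "chr" plus a number or letter; numbered chromosomes come first in numeric order, everything else follows alphabetically.

```python
def chrom_sort_key(name: str):
    """Sort key: numbered chromosomes numerically, all others alphabetically after them."""
    suffix = name[3:] if name.startswith("chr") else name
    if suffix.isdigit():
        return (0, int(suffix), "")
    return (1, 0, suffix)


names = ["chr10", "chr2", "chrX", "chr1", "chr11"]
print(sorted(names))                      # ['chr1', 'chr10', 'chr11', 'chr2', 'chrX']
print(sorted(names, key=chrom_sort_key))  # ['chr1', 'chr2', 'chr10', 'chr11', 'chrX']
```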
19. Top-10 most common/annoying mistakes in bioinformatics
#5
Assuming 'tab' field separator when in fact it is 'blank' (or vice versa)
(the two look almost identical in a text editor)
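In Python, `repr()` makes the otherwise invisible difference visible. The helper `describe_separator` below is a hypothetical sketch that only looks at a single line; it is not a general delimiter sniffer.

```python
def describe_separator(line: str) -> str:
    """Report which whitespace separators occur in a line: 'tab', 'blank', 'mixed' or 'none'."""
    has_tab = "\t" in line
    has_blank = " " in line
    if has_tab and has_blank:
        return "mixed"
    if has_tab:
        return "tab"
    if has_blank:
        return "blank"
    return "none"


line = "chr1 100\t200"
print(repr(line))                # 'chr1 100\t200'
print(describe_separator(line))  # mixed
print(line.split("\t"))          # ['chr1 100', '200']
print(line.split())              # ['chr1', '100', '200']
```

Splitting on the wrong separator silently yields the wrong number of fields, so checking the field count of the first few lines is a cheap safeguard.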
20. Top-10 most common/annoying mistakes in bioinformatics
#4
Assuming a DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth:
'N' for an unknown or missing base
('X' for an unknown or missing amino acid)
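A quick check before processing a sequence could look like the following sketch. The function name `non_acgt_positions` is hypothetical; the sketch treats every character other than A, C, G, T as noteworthy, which also catches other ambiguity codes.

```python
from collections import Counter


def non_acgt_positions(seq: str) -> list:
    """Return the 0-based positions of all characters other than A, C, G, T."""
    return [i for i, base in enumerate(seq.upper()) if base not in "ACGT"]


seq = "ACGTNNAC"
print(Counter(seq))             # base composition, including 'N'
print(non_acgt_positions(seq))  # [4, 5]
```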
21. Top-10 most common/annoying mistakes in bioinformatics
#3
Forgetting to use dos2unix on a Windows text file before processing it under Linux
plus spending 1 hour to debug the problem
plus being tricked by this multiple times
Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR).
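The following Python sketch shows how the problem can be detected and repaired from within a script. Both function names are hypothetical; `detect_line_ending` reports only the first convention it finds, and `to_unix` also converts lone CR characters, which goes slightly beyond what dos2unix does for Windows files.

```python
def detect_line_ending(data: bytes) -> str:
    """Return 'CRLF', 'CR', 'LF' or 'none' for the raw bytes of a text file."""
    if b"\r\n" in data:
        return "CRLF"
    if b"\r" in data:
        return "CR"
    if b"\n" in data:
        return "LF"
    return "none"


def to_unix(data: bytes) -> bytes:
    """Convert Windows (CR+LF) and classic Mac (CR) line breaks to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_text = b"gene1\r\ngene2\r\n"
print(detect_line_ending(windows_text))  # CRLF
print(to_unix(windows_text))             # b'gene1\ngene2\n'
# Typical symptom: a trailing '\r' survives and breaks string comparisons.
print("gene1\r\n".rstrip("\n") == "gene1")  # False
```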
22. Top-10 most common/annoying mistakes in bioinformatics
#2
When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it
(e.g., tumor suppressor gene “DEC1” will be converted to “1-DEC” on import)
Affects ~30 genes in total
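A gene list that has passed through a spreadsheet can be screened for this damage. The sketch below is a heuristic under stated assumptions: the function name `date_like_gene_names` is hypothetical, and the pattern only covers the day-month form shown in the example ("1-DEC"); other date formats are not detected.

```python
import re

_MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
# Matches values such as '1-DEC' or '2-Sep' (one or two digits, dash, month abbreviation).
_DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{_MONTHS})$", re.IGNORECASE)


def date_like_gene_names(names):
    """Return the entries that look like dates rather than gene symbols."""
    return [name for name in names if _DATE_LIKE.match(name.strip())]


column = ["TP53", "1-DEC", "BRCA1", "2-Sep", "DEC1"]
print(date_like_gene_names(column))  # ['1-DEC', '2-Sep']
```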
23. Top-10 most common/annoying mistakes in bioinformatics
#1
Off-by-one error
There are only two common problems in bioinformatics:
(1) lack of standards, (2) ID conversion, and (3) off-by-one errors
http://en.wikipedia.org/wiki/Off-by-one_error
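A frequent source of off-by-one errors is mixing coordinate conventions. As an illustration not taken from the slides: BED files use 0-based, half-open intervals, whereas GFF files use 1-based, closed intervals. The function names below are hypothetical.

```python
def half_open_to_closed(start: int, end: int):
    """Convert a 0-based, half-open interval (BED style) to 1-based, closed (GFF style)."""
    return start + 1, end


def closed_to_half_open(start: int, end: int):
    """Convert a 1-based, closed interval (GFF style) to 0-based, half-open (BED style)."""
    return start - 1, end


seq = "ACGTACGT"
start, end = 2, 5                       # 0-based, half-open
print(seq[start:end])                   # GTA
print(half_open_to_closed(start, end))  # (3, 5)
print(end - start)                      # length in the half-open convention: 3
print(5 - 3 + 1)                        # length in the closed convention: 3
```

Python slices follow the 0-based, half-open convention, so coordinates read from a 1-based file need the conversion before slicing.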
25. #1 - Learn Linux!
Most bioinformatics tools are not available on Windows
Linux file systems are better for many and/or very large files
Command line interface (CLI) has advantages over graphical user interface (GUI)
Recorded command history (reproducibility)
One keystroke to re-run an analysis, instead of repeating 100 mouse clicks
Linux CLI (shell) is much more powerful than the Windows CLI
26. #2 - Embrace the “Unix tools philosophy”
Small programs (“tools”) instead of monolithic applications
Designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.)
Many and well documented parameters
Combined with Unix pipes (read from STDIN, write to STDOUT)
cut -f 3 myfile.txt | sort | uniq
Advantages
Great flexibility, easy re-use of existing tools
Intermediate output can be stored and inspected for troubleshooting
Complex tasks can be performed quickly with shell 'one-liners'
This paradigm fits bioinformatics well, where often many heterogeneous data files need to be processed in many different ways
http://www.linuxdevcenter.com/lpt/a/302
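The same idea of chaining small, single-purpose steps can be mirrored in Python. The sketch below imitates the three stages of the `cut -f 3 myfile.txt | sort | uniq` pipeline with one small function per tool; the function names are hypothetical, every line is assumed to have at least three tab-separated fields, and Python's code-point ordering may differ from the shell's locale-dependent `sort`.

```python
from itertools import groupby


def cut(lines, field, sep="\t"):
    """Yield the given 1-based field of each line, like 'cut -f'."""
    for line in lines:
        yield line.rstrip("\n").split(sep)[field - 1]


def uniq(items):
    """Collapse runs of adjacent identical items, like 'uniq'."""
    for key, _group in groupby(items):
        yield key


lines = ["a\tb\tx\n", "c\td\ty\n", "e\tf\tx\n"]
print(list(uniq(sorted(cut(lines, 3)))))  # ['x', 'y']
```

As with the shell tools, `uniq` only removes adjacent duplicates, which is why the sort step comes first.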
27. Example NGS use case demonstrating the power of the Unix tools philosophy
samtools mpileup -uf ref.fa aln.bam |
bcftools view -cg - |
vcfutils.pl vcf2fq > cns.fq
http://samtools.sourceforge.net/mpileup.shtml
Explanation
'samtools mpileup' piles up short reads from the input BAM file for each position in the reference genome
'bcftools view' calls the variants
'vcfutils.pl vcf2fq' computes the consensus sequence
The resulting FASTQ sequence is redirected to the output file cns.fq
By knowing available tools and their parameters, bioinformatics 'wizards' can get complex stuff done in almost no time
28. #3 - Don’t reinvent the wheel
Coding is fun, but look around before you hack into your keyboard
Don't write the 29th FASTA file parser if proven solutions are available
BioPerl
BioPython
Bioconductor
29. #4 - If you happen to invent a wheel, …
Document source and parameters well
Use version control system (git, svn)
Deposit code in public repository
sourceforge.net
github.com
Write test cases
30. #5 - Automate pipelines with GNU Make
Developed in the 1970s to build executables from source files
Incredibly useful for data-driven workflows as well
Automatic error checking
Parallelization (utilize multiple cores)
Incremental builds (re-start your pipeline from point of failure)
Bug-free
Get started at
http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/
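The rule behind incremental builds is simple: a target is rebuilt only if it is missing or older than one of its inputs. The Python sketch below illustrates that rule only; it is not how Make is implemented, and the function name `needs_rebuild` and the file names in the comment are hypothetical.

```python
import os


def needs_rebuild(target: str, sources: list) -> bool:
    """Return True if the target is missing or older than any of its source files."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


# Hypothetical usage inside a pipeline script:
# if needs_rebuild("aln.bam", ["reads.fq", "ref.fa"]):
#     run_alignment()
```

Because finished steps are skipped, a pipeline that failed halfway can be restarted from the point of failure.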
31. #6 - Value your time
Architecture vs. accomplishment
“Perfect is the enemy of the good” -- Voltaire
OO design and normalized databases are nice, but can be overkill if requirements change from analysis to analysis
Automate what can be automated
Reproducibility
Easy to repeat analysis with slightly changed parameters
BUT: Don't spend two days automating a one-time analysis that can be done manually in 10 minutes
32. #7 - Make use of free online resources to learn about specialized topics
www.coursera.org
Bioinformatics Algorithms (https://www.coursera.org/course/bioinformatics)
Computing for Data Analysis (https://www.coursera.org/course/compdata)
R Programming (https://www.coursera.org/course/rprog)
https://www.edx.org/
Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)
Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x-introduction-biology-secret-1768#.U1TVL3V52R8)
http://rosalind.info/problems/locations/
33. #8 - Become an expert
Identify an area of interest and get really good at it
Work at places where you can learn from the best
Spend time abroad
Great experience
Labs/companies will hire you not only for what you know, but also for who you know
34. #9 - Decide early on if you want to stay in academia or go into industry
Academia:
• PhD highly recommended
• Take your time to find a compatible supervisor
+ Freedom to pursue own ideas
+ Very flexible working hours
+ Work independently
- Steep & competitive career ladder (postdoc >> PI/prof)
- Lower pay
- Publish or perish
Industry:
• PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases)
+ More frequent (positive) feedback
+ Higher pay
+ Job security
- More (external) deadlines
- Higher pressure to get things done
See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)
35. #10 - Stay informed & get connected
Follow literature and blogs
http://en.wikipedia.org/wiki/List_of_bioinformatics_journals
http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay-current-in-bioinformaticsgenomics/
Subscribe via RSS feeds
http://feedly.com or others
Platform independent (e.g. read on your phone)
Bioinformatics Q&A forums
http://www.biostars.org (highly recommended)
http://seqanswers.com/ (focus on NGS)
http://www.reddit.com/r/bioinformatics/ (student-oriented)
Other
http://bioinformatics.org – fosters collaboration in bioinformatics
http://www.researchgate.net – “Facebook” for researchers
German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin)
36. Conclusion
As a bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time
Biologists are overwhelmed with massive data sets
YOU will get to see exciting results first
Requires integration of knowledge from many domains
IT, biology, medicine, statistics, math, …
Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable
38. Further Reading
“So you want to be a computational biologist?”
http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
“What It Takes to Be a Bioinformatician”
http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/
“The alternative 'what it takes to be a bioinformatician'”
https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/
“So You Want To Be a Computational Biologist, Or A Bioinformatician?”
http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html
“Being a bioinformatician is hard”
http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/
“How not to be a bioinformatician”
http://www.scfbm.org/content/7/1/3
“Ten Simple Rules for Reproducible Computational Research”
http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
“Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia”
http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2E21C9E565378C7F714D76
“A Quick Guide for Developing Effective Bioinformatics Programming Skills”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589
“What Is Really the Salary of a Bioinformatician/Computational Biologist?”
http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational-biologist/
Editor's Notes
Version 5
Funny rant about bioinformatics, not to be taken literally: http://madhadron.com/posts/2012-03-26-a-farewell-to-bioinformatics.html