This document discusses bioinformatics and some of its key concepts and tools. It begins with definitions of bioinformatics as the intersection of biology, computer science, and information technology. It then discusses some of the data formats, tools, and skills used in bioinformatics, including working with nucleotide sequence data, translating sequences into amino acids, and analyzing large datasets. It also summarizes how ontologies are used to represent concepts and how various data types are organized and stored in databases for analysis.
1. BIOINFORMATICS
BIG HUMAN DATA AND METADATA
Shehab Anwer, MBBCh. MRes.
Research Data Manager / Clinical Research Fellow
Magdi Yacoub Heart Foundation, Aswan Heart Centre
3. What’s Bioinformatics?
Language
• Bio-: biology, the study of life and living organisms.
• Informatics: information science.
Concept
• The body of tools and algorithms needed to handle large and complex biological information.
Scientific Discipline
• The interaction of biology and computer science.
NCBI
• “The field of science in which biology, computer science, and information technology merge into a single discipline.”
6. IN TERMS OF SKILL SET
Know-how
• Using and developing practical tools
‘Cross-Cultural’ Exchange
• Language of Biomedical Research
• Language of Informatics
Solving Scientific Problems using Computers
• Database Inter-operation
• Process Modeling & Data Visualization
8. AFTER ANALYSIS: NUCLEOTIDE CODE FORMAT
AAACGTACGTATTCGGGCCATCGAGGCTAGCGGCACTTCGAGCGATCTATC
GGGAGCTTTGGCTATCGATCGGGCGATCGATGCTGACGTACGTAGCGCGCG
ATCGAGCGCGGCTAGCTAGCGGCATCGTAGCTACGTAGCTACGGCGCTATT
TCGATCGAGTCGTGTCTAGTCGGATATAGCTATGCATCTAGCTGAGGCGAT
CTGAGCGGATCGATGCTAGGGCGATCGGAGCTAGCTGAGCTAGCTAGCTGA
GCGCTAGCGAGCGTACGAGCGATCGAGCGAGTCTAGCGAGCGATTCTAGCG
ATCGAGCGTCTACGATCGTATGCTAGCTAGGGCTAGCATGCGGATCTATCG
AGCGGCTATCTGAGCGATTCGATCGAGCGATCTAGCGAGCTATCGATCGAG
CCGGCTCACCGTCGTAAATCTATGATCTGGCTTGGCCTGCAGTAGCTCTTT
CATTTCGGGCTTATCTAATGCTGACTGGTCGGTCCTGGCTACGCTCCA
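A sequence in this plain nucleotide format is easy to summarize programmatically. The following is a minimal Python sketch, not a method from the slides: the function name `base_composition` is a hypothetical choice, and the example input is just the first 20 bases of the sequence above. It counts each base and reports the GC fraction.

```python
from collections import Counter

def base_composition(seq):
    """Return (counts, gc_fraction) for a DNA string; whitespace is ignored."""
    cleaned = "".join(seq.split()).upper()
    counts = Counter(cleaned)
    gc = counts["G"] + counts["C"]
    # Guard against an empty input to avoid dividing by zero
    gc_fraction = gc / len(cleaned) if cleaned else 0.0
    return counts, gc_fraction

# First 20 bases of the sequence shown on this slide
counts, gc = base_composition("AAACGTACGTATTCGGGCCA")
print(dict(counts), gc)
```

The same function accepts the full multi-line sequence, since line breaks are stripped before counting.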
9. BIG DATA: CONCEPT BREAKDOWN
Previous slide:
• 500 characters
Human Genome:
• 6,000,000 slides
• Printed in 130 books
Large projects:
• Study the DNA and RNA of many people
• Pair-wise and higher-order relationships
20,000-25,000 active genes in the human genome
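The figures on this slide can be checked with simple arithmetic: 500 characters per slide times 6,000,000 slides gives about 3 billion characters, which matches the roughly 3 billion base pairs of the human genome. The short Python sketch below only multiplies out the numbers quoted on the slide; the variable names are illustrative.

```python
CHARS_PER_SLIDE = 500   # from the slide: one slide holds about 500 characters
SLIDES = 6_000_000      # from the slide: slides needed for one human genome
BOOKS = 130             # from the slide: number of printed books

total_bases = CHARS_PER_SLIDE * SLIDES
slides_per_book = SLIDES / BOOKS
bases_per_book = total_bases / BOOKS

print(f"{total_bases:,} characters in total")
print(f"about {slides_per_book:,.0f} slides per book")
print(f"about {bases_per_book:,.0f} characters per book")
```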
12. … AND IT EXPANDS!
Nucleotide sequences
Protein sequences
Patterns or motifs
Macro-molecular 3D structure
Gene expression data
Metabolic pathways
Proteomics data
13. EXAMPLES OF WHAT BIOINFORMATICS STUDY
1000s of studies
Consortia projects
• TCGA (The Cancer Genome Atlas) projects: profile 500 samples of each cancer
Level: DNA
• Findings: mutations, polymorphism
• Effect: outcomes, staging, response to therapy
Level: RNA
• Findings: specific micro-RNA or mRNA transcripts
16. TYPES OF ONTOLOGY
Domain-Specific Ontology
• Depends on domain background
• Example: use of a word with different meanings
Upper Ontology
• Foundation ontology
• Example: disease classification
Hybrid Ontology
• Combined ontology
• Standardised disease classification that differs according to practice
17. GENE ONTOLOGY
A major bioinformatics initiative that aims to unify the representation of gene and gene product attributes across all species.
• Maintain and develop its controlled vocabulary.
• Annotate genes and gene products, and assimilate and disseminate annotation data.
• Provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.
19. DATABASES FOR BIG DATA
Primary databases
• Experimental results entered directly into the database
Secondary databases
• Results of analysis of primary databases
Aggregate databases
• Links to other data items
• Combination of data
• Consolidation of data
20. COMPLICATED - DIFFERENT DATA AND REPOSITORIES
Primary database – DNA or protein sequence
Secondary - (derived information e.g. protein domains)
Protein structure or other (e.g. crystal coordinates)
21. WORKING WITH DATA
Small-scale: tens to thousands of points
Examples:
• Sequence of a gene
• Protein structure
Tools:
• Excel, R, Matlab
Medium-scale: thousands to millions of points
Examples:
• SNP list of a genome
• A protein-protein interaction network
Tools:
• Perl, Python, Java
Large-scale: millions to billions of points
Examples:
• Raw sequencing reads
• Whole-genome alignment of 10 species
Tools:
• C, Oracle, parallelized and tailor-made software/coding
22. SYSTEM DESIGN & IMPLEMENTATION
System Algorithms
Project Requirements & Problem Determination
DESIGN
System Configuration & Maintenance
DEVELOP
Usage & Outcomes
EVALUATE
User Interface
Data Distribution
Data Production & Data Gathering
Results & Interpretation
26. TOOLS TO SEARCH DATABASES
The dilemma: DNA or protein?
Search by similarity
• Using nucleotide sequence
• Using amino acid sequence
Is the comparison of two nucleotide sequences accurate?
By translating into amino acid sequence, are we losing information?
The genetic code is degenerate (two or more codons can represent the same amino acid).
Very different DNA sequences may code for similar protein sequences.
We certainly do not want to miss those cases!
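The degeneracy point on this slide can be illustrated with a small Python sketch. It builds the standard genetic code table, translates two DNA sequences, and compares them position by position. The two example sequences and the helper names `translate` and `percent_identity` are assumptions made for illustration; the identity measure here is a naive ungapped comparison, whereas real database search tools align the sequences first.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases T, C, A, G)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA string codon by codon; a trailing partial codon is dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def percent_identity(a, b):
    """Naive ungapped identity: matching positions over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / longest

# Two hypothetical sequences that use different codons for the same residues
seq1 = "ATGCTTTCTCGT"
seq2 = "ATGTTAAGCAGA"

print(translate(seq1), translate(seq2))
print(round(percent_identity(seq1, seq2), 1))
print(percent_identity(translate(seq1), translate(seq2)))
```

In this toy case fewer than half of the nucleotide positions match, yet both sequences translate to the same peptide, which is the kind of similarity a nucleotide-only search could miss.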
27. APPLYING ALGORITHMS TO ANALYZE GENOMICS DATA
You have just cloned a gene
Is it already in databases?
-Accession #?
-Annotation?
Are there similar sequences?
-% identity?
-Family member?
Are there conserved regions?
-Alignments?
-Domains?
Evolutionary relationship?
-Phylogenetic tree
Protein characteristics?
-Sub-localization
-Soluble?
-3D fold
Other information?
-Expression profile?
-Mutants?
A critical failure of current bioinformatics is the lack of a single software package that can perform all of these functions!
28. GENOMIC DATA REPOSITORIES
Gene Expression Data
• Microarray
• PCR
• RNA-Seq
• e.g. Gene Expression Omnibus (GEO), ArrayExpress
Sequencing Data
• Whole Genome
• Exome
• Targeted resequencing
• e.g. European Nucleotide Archive (ENA), DNA Data Bank of Japan (DDBJ)
30. WE ARE DROWNING IN DATA, BUT STARVING FOR KNOWLEDGE!
The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability from major sources of abundant data
Data mining
• Automated analysis of massive data sets
31. DATA MINING VS. DATA QUERY
Data Query
• A list of all patients with cancer who responded to drug X.
Data Mining
• What is the likelihood of a patient responding to drug X?
• What are the characteristics of these patients?
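The contrast on this slide can be sketched in a few lines of Python. Everything below is hypothetical: the patient records, the `marker` field, and the function names are invented for illustration only. The query simply retrieves matching records, while the second function is a crude stand-in for mining, estimating an observed response rate per patient characteristic. Real data mining would use proper statistical or machine-learning models.

```python
from collections import defaultdict

# Hypothetical toy records, invented purely for illustration
patients = [
    {"id": 1, "cancer": True,  "marker": "A", "responded": True},
    {"id": 2, "cancer": True,  "marker": "A", "responded": True},
    {"id": 3, "cancer": True,  "marker": "B", "responded": False},
    {"id": 4, "cancer": True,  "marker": "B", "responded": True},
    {"id": 5, "cancer": False, "marker": "A", "responded": False},
    {"id": 6, "cancer": True,  "marker": "B", "responded": False},
]

def query_responders(records):
    """Data query: list the ids of cancer patients who responded to drug X."""
    return [r["id"] for r in records if r["cancer"] and r["responded"]]

def response_rate_by(records, key):
    """Mining stand-in: observed response rate among cancer patients per value of `key`."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        if r["cancer"]:
            totals[r[key]] += 1
            hits[r[key]] += int(r["responded"])
    return {value: hits[value] / totals[value] for value in totals}

print(query_responders(patients))
print(response_rate_by(patients, "marker"))
```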
32. DATA WAREHOUSE
■ Data cleaning
■ Data integration
■ OLAP: On-Line Analytical Processing
– summarization
– consolidation
– aggregation
– view information from different angles
■ but additional data analysis tools are needed for
– classification
– clustering
– characterization of data changing over time
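The OLAP operations listed on this slide (summarization, aggregation, viewing information from different angles) amount to grouping the same facts along different dimensions. The Python sketch below is a minimal illustration under stated assumptions: the fact table, its column layout, and the `roll_up` helper are hypothetical, and a real warehouse would do this with a database engine rather than in-memory tuples.

```python
from collections import defaultdict

# Hypothetical fact table: (cancer_type, year, samples), invented for illustration
facts = [
    ("breast", 2012, 120),
    ("breast", 2013, 150),
    ("lung", 2012, 80),
    ("lung", 2013, 95),
]

def roll_up(rows, dim):
    """Sum the sample counts along one dimension (0 = cancer type, 1 = year)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim]] += row[2]
    return dict(totals)

by_type = roll_up(facts, 0)   # one angle: totals per cancer type
by_year = roll_up(facts, 1)   # another angle: totals per year
print(by_type)
print(by_year)
```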
35. DEVELOPMENTS IN COMPUTER HARDWARE
■ Powerful and affordable computers
■ Data collection equipment
■ Storage media
■ Communication and networking
37. HUMAN GENOME PROJECT
Identify all the approximately 20,000-25,000 genes in human DNA,
Determine the sequences of the 3 billion chemical base pairs that make up human DNA,
Store this information in databases,
Improve tools for data analysis,
Transfer related technologies to the private sector, and
Address the ethical, legal, and social issues that may arise from the project.
43. BACKGROUND
Computer & information science
• Need to understand existing tools, scientific approach, and needs of biological research
Biomedicine
• Need to learn a set of tools and skills
• May also need to understand the deeper scientific technical background
45. GAP TWO
Putting a biological investigator and a system implementer together in a room doesn’t solve the problem
• Barriers include:
• Language
• Methodology
• Conceptualization
46. GAP THREE
Computer science is a “science of the artificial”
• Mainly concerned with human artifacts, i.e. creations limited mainly by conceptualization and imagination
Biomedicine is a science of discovery
• Mainly concerned with how organisms function; the limiting factors are often a result of limits of investigative methods and tools