EiTESAL eHealth Conference 14&15 May 2017

BIOINFORMATICS
BIG HUMAN DATA AND METADATA
Shehab Anwer, MBBCh. MRes.
Research Data Manager / Clinical Research Fellow
Magdi Yacoub Heart Foundation, Aswan Heart Centre

What’s Bioinformatics?
Language
• Bio-: Biology, the study of life and living organisms.
• Informatics: Information science.
Concept
• The body of tools and algorithms needed to handle large and complex biological information.
Scientific Discipline
• Interaction of biology and computer science.
NCBI
• “The field of science in which biology, computer science, and information technology merge into a single discipline”

Integration in Bioinformatics
• Computer Science
• Physics
• Biological Sciences
• Chemistry
• Statistics
• Mathematics

IN TERMS OF TOOLS
• General Tools: Spreadsheets, AI, Instrumentation
• Communications: E-Mail, Networks, Internet & World Wide Web
• Databases: Storage, Organization
• Analysis Tools: Examination & Discovery

IN TERMS OF SKILLS’ SET
Know-how
• Using and developing Practical Tools
‘Cross-Cultural’ Exchange
• Language of Biomedical Research
• Language of Informatics
Solving Scientific Problems using Computers
• Database Inter-operation
• Process Modeling & Data Visualization

DATA FORMAT: FROM NUCLEOTIDES TO DATA

AFTER ANALYSIS: NUCLEOTIDE CODE FORMAT
AAACGTACGTATTCGGGCCATCGAGGCTAGCGGCACTTCGAGCGATCTATC
GGGAGCTTTGGCTATCGATCGGGCGATCGATGCTGACGTACGTAGCGCGCG
ATCGAGCGCGGCTAGCTAGCGGCATCGTAGCTACGTAGCTACGGCGCTATT
TCGATCGAGTCGTGTCTAGTCGGATATAGCTATGCATCTAGCTGAGGCGAT
CTGAGCGGATCGATGCTAGGGCGATCGGAGCTAGCTGAGCTAGCTAGCTGA
GCGCTAGCGAGCGTACGAGCGATCGAGCGAGTCTAGCGAGCGATTCTAGCG
ATCGAGCGTCTACGATCGTATGCTAGCTAGGGCTAGCATGCGGATCTATCG
AGCGGCTATCTGAGCGATTCGATCGAGCGATCTAGCGAGCTATCGATCGAG
CCGGCTCACCGTCGTAAATCTATGATCTGGCTTGGCCTGCAGTAGCTCTTT
CATTTCGGGCTTATCTAATGCTGACTGGTCGGTCCTGGCTACGCTCCA

BIG DATA: CONCEPT BREAKDOWN
Previous slide:
• 500 Characters
Human Genome:
• 6,000,000 Slides
• Printed in 130 books
Large projects:
• Study the DNA and RNA of many people
• Pair-wise and higher-order relationships.
20,000-25,000 active genes in human genome
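
The slide count above can be checked with simple arithmetic. This is a minimal sketch, assuming a genome of about 3 billion base pairs (the figure quoted for the Human Genome Project later in this talk) and about 500 characters per slide. The variable names are illustrative only.

```python
# Assumed inputs: ~3 billion base pairs, ~500 characters per slide, 130 books.
GENOME_BASES = 3_000_000_000
CHARS_PER_SLIDE = 500
BOOKS = 130

# How many slides like the previous one are needed to print the whole genome?
slides = GENOME_BASES // CHARS_PER_SLIDE
print(f"{slides:,} slides")  # 6,000,000 slides

# How many characters would each of the 130 books have to hold?
chars_per_book = GENOME_BASES / BOOKS
print(f"about {chars_per_book / 1e6:.0f} million characters per book")
```

Under these assumptions each book would hold roughly 23 million characters.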

CAN YOU SPOT “CGAGCGTC”?
AAACGTACGTATTCGGGCCATCGAGGCTAGCGGCACTTCGAGCGATCTATC
GGGAGCTTTGGCTATCGATCGGGCGATCGATGCTGACGTACGTAGCGCGCG
ATCGAGCGCGGCTAGCTAGCGGCATCGTAGCTACGTAGCTACGGCGCTATT
TCGATCGAGTCGTGTCTAGTCGGATATAGCTATGCATCTAGCTGAGGCGAT
CTGAGCGGATCGATGCTAGGGCGATCGGAGCTAGCTGAGCTAGCTAGCTGA
GCGCTAGCGAGCGTACGAGCGATCGAGCGAGTCTAGCGAGCGATTCTAGCG
ATCGAGCGTCTACGATCGTATGCTAGCTAGGGCTAGCATGCGGATCTATCG
AGCGGCTATCTGAGCGATTCGATCGAGCGATCTAGCGAGCTATCGATCGAG
CCGGCTCACCGTCGTAAATCTATGATCTGGCTTGGCCTGCAGTAGCTCTTT
CATTTCGGGCTTATCTAATGCTGACTGGTCGGTCCTGGCTACGCTCCA
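
What is hard for the eye is easy for a computer. The sketch below searches the sequence shown on this slide for the motif using plain string matching. It assumes the ten slide lines form one continuous sequence; `find_all` is a helper written for this illustration, not part of any bioinformatics package.

```python
# The ten lines of nucleotide code shown on the slide.
SLIDE_LINES = [
    "AAACGTACGTATTCGGGCCATCGAGGCTAGCGGCACTTCGAGCGATCTATC",
    "GGGAGCTTTGGCTATCGATCGGGCGATCGATGCTGACGTACGTAGCGCGCG",
    "ATCGAGCGCGGCTAGCTAGCGGCATCGTAGCTACGTAGCTACGGCGCTATT",
    "TCGATCGAGTCGTGTCTAGTCGGATATAGCTATGCATCTAGCTGAGGCGAT",
    "CTGAGCGGATCGATGCTAGGGCGATCGGAGCTAGCTGAGCTAGCTAGCTGA",
    "GCGCTAGCGAGCGTACGAGCGATCGAGCGAGTCTAGCGAGCGATTCTAGCG",
    "ATCGAGCGTCTACGATCGTATGCTAGCTAGGGCTAGCATGCGGATCTATCG",
    "AGCGGCTATCTGAGCGATTCGATCGAGCGATCTAGCGAGCTATCGATCGAG",
    "CCGGCTCACCGTCGTAAATCTATGATCTGGCTTGGCCTGCAGTAGCTCTTT",
    "CATTTCGGGCTTATCTAATGCTGACTGGTCGGTCCTGGCTACGCTCCA",
]
# Assumption: the lines are one continuous sequence.
SEQUENCE = "".join(SLIDE_LINES)
MOTIF = "CGAGCGTC"


def find_all(sequence, motif):
    """Return every 0-based start index of motif in sequence (overlaps allowed)."""
    hits = []
    start = sequence.find(motif)
    while start != -1:
        hits.append(start)
        start = sequence.find(motif, start + 1)
    return hits


hits = find_all(SEQUENCE, MOTIF)
for pos in hits:
    print(f"{MOTIF} found at position {pos + 1} (1-based)")
```

Exact matching like this is the simplest case; real searches usually also have to tolerate mismatches and gaps.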

… AND IT EXPANDS!
Nucleotide sequences
Protein sequences
Patterns or motifs
Macro-molecular 3D structure
Gene expression data
Metabolic pathways
Proteomics data

EXAMPLES OF WHAT BIOINFORMATICS STUDY
1000s of studies
Consortia projects
TCGA – The Cancer Genome Atlas projects: profile 500 samples of each cancer
Level: DNA
• Findings: Mutations, Polymorphism
Level: RNA
• Findings: Specific Micro-RNA or mRNA transcripts
Effect
• Outcomes
• Staging
• Response to Therapy

REPRESENTATION: FROM CONCEPTS TO ONTOLOGY

ONTOLOGY IS AN EXPLICIT SPECIFICATION OF A CONCEPTUALIZATION OF A DOMAIN

TYPES OF ONTOLOGY
Domain-Specific Ontology
• Depends on domain background
• Example: use of a word with different meanings.
Upper Ontology
• Foundation ontology
• Example: Disease Classification
Hybrid Ontology
• Combined ontology
• Standardised disease classification that differs according to practice.

GENE ONTOLOGY
• Maintain and develop its controlled vocabulary.
• Annotate genes and gene products, and assimilate and
disseminate annotation data.
• Provide tools for easy access to all aspects of the data provided
by the project, and to enable functional interpretation of
experimental data using the GO.
A major bioinformatics initiative that aims to unify the
representation of gene and gene product attributes across
all species.
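
To make the idea of a controlled vocabulary with annotations concrete, here is a minimal sketch of an ontology as a set of terms linked by "is a" relations. The term identifiers, term names and gene names are hypothetical placeholders, not real GO entries, and the functions are illustrative rather than any GO tool's API.

```python
# Hypothetical toy vocabulary: each term lists its direct "is a" parents.
IS_A = {
    "T:0001": [],            # root concept
    "T:0002": ["T:0001"],
    "T:0003": ["T:0002"],
    "T:0004": ["T:0001"],
}

# Hypothetical annotations: which terms each gene product is annotated with.
ANNOTATIONS = {
    "geneA": {"T:0003"},
    "geneB": {"T:0004"},
}


def ancestors(term):
    """Return all terms reachable from term by following 'is a' links."""
    seen = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A.get(parent, []))
    return seen


def genes_annotated_to(term):
    """Genes annotated to term itself or to any more specific term."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


print(genes_annotated_to("T:0002"))  # ['geneA']
print(genes_annotated_to("T:0001"))  # ['geneA', 'geneB']
```

Because the relations are explicit, a query on a general term also retrieves genes annotated with more specific terms, which is what makes a shared vocabulary useful across databases.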

DATABASES FOR BIG DATA
Primary databases
• Experimental results directly into database
Secondary databases
• Results of analysis of primary databases
Aggregate databases
• Links to other data items
• Combination of data
• Consolidation of data

COMPLICATED - DIFFERENT DATA AND REPOSITORIES
Primary database – DNA or protein sequence
Secondary - (derived information e.g. protein domains)
Protein structure or other (e.g. crystal coordinates)

WORKING WITH DATA
Small-scale
• Tens to thousands of points
• Examples: Sequence of a gene; Protein structure
• Tools: Excel, R, Matlab
Medium-scale
• Thousands to millions of points
• Examples: SNP list of a genome; A protein-protein interaction network
• Tools: Perl, Python, Java
Large-scale
• Millions to billions of points
• Data: Raw sequencing reads; Whole-genome alignment of 10 species
• Tools: C, Oracle, parallelized and tailor-made software/Coding

SYSTEM DESIGN & IMPLEMENTATION
Stages: DESIGN, DEVELOP, EVALUATE
• System Algorithms
• Project Requirements & Problem Determination
• System Configuration & Maintenance
• Usage & Outcomes
• User Interface
• Data Distribution
• Data Production & Data Gathering
• Results & Interpretation

DATABASES’ INTER-CONVERTIBLE DATA FORMATS
Flat-files
• Delimited
• Spreadsheets
Relational database
• SQL
Exchange/publication technologies
• HTML/XML
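
As a rough illustration of how these formats convert into one another, the sketch below reads a delimited flat file, loads it into a relational table, queries it with SQL, and exports it as XML. It uses only the Python standard library; the records, table name and column names are hypothetical.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical delimited flat file (comma-separated), held in memory.
FLAT_FILE = "gene,chromosome,length\nGENE1,1,1200\nGENE2,7,850\n"

# 1. Flat file -> list of records.
rows = list(csv.DictReader(io.StringIO(FLAT_FILE)))

# 2. Records -> relational database (in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (gene TEXT, chromosome TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO genes VALUES (:gene, :chromosome, :length)", rows
)

# 3. Query with SQL.
long_genes = conn.execute(
    "SELECT gene FROM genes WHERE length > 1000 ORDER BY gene"
).fetchall()
print(long_genes)

# 4. Relational table -> XML for exchange or publication.
root = ET.Element("genes")
for gene, chromosome, length in conn.execute(
    "SELECT gene, chromosome, length FROM genes ORDER BY gene"
):
    item = ET.SubElement(root, "gene", name=gene, chromosome=chromosome)
    item.text = str(length)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The same records pass through all three representations without loss, which is the sense in which the formats are inter-convertible.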

Getting useful “discovery” answers from large databases probably will depend on a concept model, i.e. an ontology.

Cloud Orchestration
Diagram components:
• Sequencing Centers
• Genome data
• Protected Data Clouds (Data, Compute)
• Cloud Orchestrator (queuing, monitoring)
• External Clouds/Sources
• Compute algorithms
• Results
• YOU!

TOOLS TO SEARCH DATABASES
The dilemma: DNA or protein?
Search by similarity
• Using nucleotide seq.
• Using amino acid seq.
Is the comparison of two nucleotide sequences accurate?
By translating into amino acid sequence, are we losing information?
The genetic code is degenerate (two or more codons can represent the same amino acid).
Very different DNA sequences may code for similar protein sequences.
We certainly do not want to miss those cases!
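
The effect of the degenerate genetic code can be shown in a few lines. This sketch builds the standard genetic code table, translates two short DNA sequences invented for this example, and compares them at both levels. The function names are illustrative, and identity is computed by simple position-by-position comparison of equal-length sequences, not by alignment.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percentage of identical positions in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Two made-up sequences that use different codons for the same amino acids.
dna_a = "ATGTTAAGTCGT"
dna_b = "ATGCTGTCCAGA"

print(f"DNA identity:     {percent_identity(dna_a, dna_b):.1f}%")
print(f"Protein identity: "
      f"{percent_identity(translate(dna_a), translate(dna_b)):.1f}%")
print(translate(dna_a), translate(dna_b))
```

Here the two DNA sequences agree at fewer than half of their positions, yet both translate to the same peptide, which is why a protein-level search can find relationships that a nucleotide-level search misses.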

APPLYING ALGORITHMS TO ANALYZE GENOMICS DATA
You have just cloned a gene
Is it already in databases?
-Accession #?
-Annotation?
Evolutionary relationship?
-Phylogenetic tree
Protein characteristics?
-Sub-localization
-Soluble?
-3D fold
Are there similar sequences?
-% identity?
-Family member?
Are there conserved regions?
-Alignments?
-Domains?
Other information?
-Expression profile?
-Mutants?
A critical failure of current bioinformatics is the lack of a single software package that can perform all of these functions!

GENOMIC DATA REPOSITORIES
Gene Expression Data
• Microarray
• PCR
• RNA-Seq
• e.g. Gene Expression Omnibus (GEO), ArrayExpress
Sequencing Data
• Whole Genome
• Exome
• Targeted resequencing
• e.g. European Nucleotide Archive (ENA), DNA Data Bank of Japan (DDBJ)

WE ARE DROWNING IN DATA, BUT STARVING FOR KNOWLEDGE!
The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability from major sources of abundant data
Data mining
• Automated analysis of massive data sets

DATA MINING VS. DATA QUERY
Data Query
• A list of all patients with cancer who responded to drug X.
Data Mining
• What is the likelihood of a patient to respond to drug X?
• What are the characteristics of these patients?
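
The difference can be sketched on a handful of records. The patient data below is entirely hypothetical, and the "mining" step is deliberately simplified to a response rate per characteristic; it stands in for the statistical and machine-learning methods a real analysis would use.

```python
from collections import defaultdict

# Hypothetical patient records, for illustration only.
PATIENTS = [
    {"id": 1, "cancer": True, "marker": "positive", "responded": True},
    {"id": 2, "cancer": True, "marker": "positive", "responded": True},
    {"id": 3, "cancer": True, "marker": "negative", "responded": False},
    {"id": 4, "cancer": True, "marker": "positive", "responded": False},
    {"id": 5, "cancer": True, "marker": "negative", "responded": False},
    {"id": 6, "cancer": False, "marker": "negative", "responded": False},
]

# Data query: retrieve the records that match a stated condition.
responders = [p["id"] for p in PATIENTS if p["cancer"] and p["responded"]]
print("Responders:", responders)


# Data mining (simplified): look for a pattern that was not asked for
# explicitly, here the response rate grouped by a patient characteristic.
def response_rate_by(field):
    counts = defaultdict(lambda: [0, 0])  # value -> [responders, total]
    for p in PATIENTS:
        if p["cancer"]:
            counts[p[field]][0] += int(p["responded"])
            counts[p[field]][1] += 1
    return {value: r / n for value, (r, n) in counts.items()}


rates = response_rate_by("marker")
for value, rate in rates.items():
    print(f"marker {value}: {rate:.0%} responded")
```

The query returns a list; the mining step returns an estimate that can be applied to a new patient.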

DATA WAREHOUSE
■ Data cleaning
■ Data integration
■ OLAP: On-Line Analytical Processing
– summarization
– consolidation
– aggregation
– view information from different angles
■ but additional data analysis tools are needed for
– classification
– clustering
– characterization of data changing over time
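
The OLAP operations listed above amount to aggregating the same facts along different dimensions. A minimal sketch, using a hypothetical fact table of admissions by centre, year and diagnosis; `roll_up` is an illustrative helper, not a function of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact table: (centre, year, diagnosis, admissions).
DIMENSIONS = ("centre", "year", "diagnosis")
FACTS = [
    ("Centre A", 2015, "cardiac", 120),
    ("Centre A", 2016, "cardiac", 150),
    ("Centre B", 2015, "cardiac", 80),
    ("Centre B", 2016, "oncology", 60),
]


def roll_up(*dims):
    """Sum admissions grouped by the chosen dimensions."""
    indexes = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in FACTS:
        key = tuple(row[i] for i in indexes)
        totals[key] += row[3]
    return dict(totals)


# The same data viewed from different angles.
print(roll_up("centre"))          # totals per centre
print(roll_up("year"))            # totals per year
print(roll_up("centre", "year"))  # finer-grained view
print(roll_up())                  # grand total
```

Summaries like these describe the data but do not classify or cluster it, which is why the slide notes that additional analysis tools are needed.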

• Goal identification: Creating a target data set
• Data preprocessing
• Data reduction and transformation
• Data Mining
• Presentation and Evaluation
• Gaining information!

DEVELOPMENTS IN COMPUTER HARDWARE
■ Powerful and affordable computers
■ Data collection equipment
■ Storage media
■ Communication and networking

HUMAN GENOME PROJECT
Identify all the approximately 20,000-25,000 genes in human DNA,
Determine the sequences of the 3 billion chemical base pairs that make up human DNA,
Store this information in databases,
Improve tools for data analysis,
Transfer related technologies to the private sector, and
Address the ethical, legal, and social issues that may arise from the project.

BACKGROUND
Computer & information science
• Need to understand existing tools, scientific approach, and needs of biological research
Biomedicine
• Need to learn a set of tools and skills
• May also need to understand the deeper scientific technical background

GAP ONE
Biological scientists and investigators can’t build their own tools
Computer scientists don’t know what tools to build

GAP TWO
Putting a biological investigator and a system implementer together in a room doesn’t solve the problem
• Barriers include:
• Language
• Methodology
• Conceptualization

GAP THREE
Computer science is a “science of the artificial”
• Mainly concerned with human artifacts, i.e. creations limited mainly by conceptualization and imagination
Biomedicine is a science of discovery
• Mainly concerned with how organisms function; the limiting factors are often a result of limits of investigative methods and tools

THANK YOU!
SHEHABANWER@GMAIL.COM
