Novozymes Enzyme Stability Prediction

Help identify the thermostable mutations in enzymes
Team members: 11 莊O禾, 18 郭O智, 19 林O泰, 20 張O豪

About Presentation
Content: About This Project!
Motivation: Why Are We Doing This Project?
Execute: How To Do This Project?

Content
About This Project!
Competition Enzyme

Novozymes Enzyme Stability Prediction
Kaggle is the largest data science competition website.
The goal of this competition is to predict the thermostability of enzyme variants (Tm).
A total of 1,331 teams took part.

What is Protein
Figure: the four levels of protein structure (primary, secondary, tertiary, and quaternary)

What is an Enzyme
◍ A large group of proteins
◍ Can accelerate chemical reactions
◍ Can be used as a biologic therapy
◍ pH and Tm are crucial factors

Motivation
Why Are We Doing This Project?
Market Demand

Market
The global biologics therapy market was estimated at US$ 366.43 billion in 2021 and is expected to exceed US$ 719.84 billion by 2030, a CAGR of 7.8% from 2022 to 2030.
Source: precedenceresearch/Biologics Market
Biologic Therapy Global Market 2021-2030
Market growth will accelerate at a CAGR of 7.8%
Growth contributed by North America: 57%
Incremental growth ($B): 719.84
Growth for 2021: 196.2%

Demand
Traditional development (at least 10 years in total)
Drug discovery: 3-5 years
Pre-clinical: 1-2 years
Clinical trials: 6-7 years
Regulatory approval: 1-2 years
AI development (5 years in total)
Drug discovery, pre-clinical, clinical trials, and regulatory approval, with AI-assisted stages shortened to 1-2 years and 2-3 years

Execute
How To Do This Project?
Process Scoring

Process
Input: protein sequence X, for example VPVNPEPDATSVENVALKTGSGDSQSDPIKADLEVKGQSALPFDVDCWAILCKGAPN...
Trained: models
Output: predicted protein melting point ŷTm

Models
XGBoost: a machine-learning model trained on different features derived from the protein sequence
PyRosetta: compares the 3D structures of the wild type and the mutant type
ProteinBERT: treats the protein sequence as "natural language"
3D-CNN: analyzes the protein's 3D structure to extract the protein's features

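XGBoost Process
Training data (Kaggle): protein sequence X, melting point y (Tm)
Data cleaning
1. Drop rows with pH > 9 or pH < 5.5
2. Embedding: length, entropy, aaindex1, atc, aac
Training: fit the model on the cleaned training data
Testing data: apply the same data cleaning, then run the model
Outcome: ŷTm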

Data Cleaning
Length: sequence length in amino acids
Entropy: Shannon entropy of each sequence in the dataset
atc: sum of atomic and bond compositions for each amino acid sequence
pH: drop rows with pH > 10 or pH < 6
aaindex1: 566 indices of physicochemical and biological properties of amino acids, each a set of 20 numerical values (one per amino acid)
aac: the frequency of each amino acid in each sequence in the dataset
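A minimal Python sketch of three of the features above (length, entropy, aac) and the pH filter. This is an illustration, not the team's code: the function names are hypothetical, entropy is assumed to be measured in bits, and the pH bounds are parameters because the slides mention both 5.5 to 9 and 6 to 10. Sequences are assumed to be non-empty.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the residue distribution of one sequence."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def amino_acid_composition(seq):
    """Fraction of the sequence made up of each standard amino acid (aac)."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def keep_row(ph, low=6.0, high=10.0):
    """True if the measurement pH lies inside the accepted range (assumed bounds)."""
    return low <= ph <= high


def featurize(seq):
    """Collect length, entropy, and aac into one flat feature dictionary."""
    features = {"length": len(seq), "entropy": shannon_entropy(seq)}
    for aa, fraction in amino_acid_composition(seq).items():
        features[f"aac_{aa}"] = fraction
    return features
```

A flat dictionary per sequence converts directly into one row of a feature table, which is the shape a gradient-boosted tree model expects.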

PyRosetta Process
Testing data: protein sequence X
1. Input data
2. Import model
3. Predict scores
PyRosetta
Test data → single-point mutation PDB → energy score function → scores
Wild type sequence: VPVNPEPDATSVENVALKTGSGDSQSDPIKADLEVKGQSALPFDVDCWAILCKGAPN...
Mutant sequence: VPVNPEPDATSVENVALKTGSGDSASDPIKADLEVKGQSALPFDVDCWAILCKGAPN...
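Before a single-point mutation can be scored, the substituted position has to be located. The sketch below compares the visible prefixes of the two sequences shown on the slide (the full sequences are truncated there). The helper name is hypothetical and it handles substitutions only, not insertions or deletions.

```python
def find_point_mutations(wild_type, mutant):
    """Return (1-based position, wild-type residue, mutant residue) for each substitution."""
    if len(wild_type) != len(mutant):
        raise ValueError("substitution-only comparison needs equal-length sequences")
    return [
        (i + 1, w, m)
        for i, (w, m) in enumerate(zip(wild_type, mutant))
        if w != m
    ]


# Visible prefixes of the sequences on the slide (the originals continue past this point).
wild = "VPVNPEPDATSVENVALKTGSGDSQSDPIKADL" + "EVKGQSALPFDVDCWAILCKGAPN"
mut = "VPVNPEPDATSVENVALKTGSGDSASDPIKADL" + "EVKGQSALPFDVDCWAILCKGAPN"

mutations = find_point_mutations(wild, mut)
print(mutations)  # [(25, 'Q', 'A')], i.e. the substitution Q25A
```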

ProteinBERT Process
Protein sequence X
1. Input data
2. Import model
4. Predict protein melting point (ŷTm)

seq_id  tm
0       75.7
1       50.5
2       40.5
3       47.2
4       49.5
5       48.4
6       45.7
7       55.9
8       48.1

NLP Model
Transfer Learning
◍ Protein Sequence vs NLP
○ Meaningful sentence
○ Bio-language
◍ BERT, a stronger NLP model pretrained on two tasks
○ Language modeling
○ Next sentence prediction
◍ Use Transfer Learning
○ Less training time
○ Significantly improves the efficiency of reinforcement learning

ProteinBERT
◍ BERT vs Protein
○ Protein sequences have no "chunks" (word- or sentence-like units)
○ Proteins have 3D structures, which makes them more complex than sentences
◍ Import Gene Ontology (GO)
○ Defines a protein's spatial distance within the organism
○ Separates global and local representations, then processes each
Flow: input protein → BERT module carries out fine-tuning → rank the test data
Steps: input the pretrained module, build the fine-tune module, set the fine-tune parameters
Reference: https://github.com/nadavbra/protein_bert/blob/master/ProteinBERT demo.ipynb

3D-CNN Process
Protein sequence X
1. Input data
2. 3D structure
3. Import model (3D-CNN)
4. Predict ΔΔG

seq_id  ΔΔG
31390   1.404995
31391   1.343793
31392   0.241666
31393   0.534203
31394   0.134588
31395   0.697623
31396   1.346896
31397   1.001297
31398   1.083319

3D Convolutional Neural Networks
Source: Bian Li, Yucheng T. Yang, John A. Capra, Mark B. Gerstein, "Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks"

Convolutional Neural Networks (CNN)
Source: https://www.youtube.com/watch?v=f0t-OCG79-U

Convolutional Neural Networks (CNN)
Source: https://www.louisbouchard.ai/densenet-explained/
Source: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53

3D Convolutional Neural Networks (CNN) Architecture

ΔΔG
ΔG
Source: https://pubs.acs.org/doi/10.1021/ja100744h
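The slide does not spell out how the two quantities relate, so here is one common way to write it. This is an assumed convention: ΔG is taken as the unfolding free energy, so a positive ΔΔG means the mutant is more stable than the wild type. Some sources use the opposite sign.

```latex
\Delta G = G_{\text{unfolded}} - G_{\text{folded}}, \qquad
\Delta\Delta G = \Delta G_{\text{mutant}} - \Delta G_{\text{wild type}}
```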

Figure: bar chart "Table of ranking Score" for XGBoost, ProteinBERT, Rosetta, and 3D-CNN (score axis from 0 to 0.6). Bar labels: 0.494, 0.471, 0.292, 0.168.

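Process
Training: each trained model (XGBoost, PyRosetta, ProteinBERT, 3D-CNN) produces its own predictions: Data X, Data PR, Data PB, Data CN
Ranking: sort each model's predictions into a table of rankings
Ensemble: combine the rankings with weights 𝛼1, 𝛼2, 𝛼3, 𝛼4
Outcome: output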
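The ranking-then-ensemble step can be sketched as a weighted sum of per-model ranks. This is a hedged illustration: the function names are hypothetical, the slides do not give the values of 𝛼1 to 𝛼4 or the exact combination rule, and every model's predictions are assumed to be oriented so that a larger value means a more stable variant.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every item tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def rank_ensemble(predictions, weights):
    """Weighted sum of per-model ranks; one inner list of predictions per model."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    combined = [0.0] * len(predictions[0])
    for model_predictions, weight in zip(predictions, weights):
        for idx, rank in enumerate(average_ranks(model_predictions)):
            combined[idx] += weight * rank
    return combined
```

Converting to ranks first puts models with different output scales (a Tm in degrees, a ΔΔG, an energy score) on a common footing before they are mixed.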

Figure: bar chart of scores for XGBoost, ProteinBERT, Rosetta, 3D-CNN, and Ensemble (score axis from 0 to 0.7). Bar labels: 0.592, 0.494, 0.471, 0.292, 0.168.

References
Kaggle:
 https://www.kaggle.com/code/cdeotte/protein-bert-finetune-lb-0-30
 https://www.kaggle.com/code/dschettler8845/novo-esp-residue-depth-and-more-w-biopython
 https://www.kaggle.com/code/vslaykovsky/nesp-thermonet
 https://www.kaggle.com/code/lucasmorin/nesp-changes-eda-and-baseline/notebook#Submission
ESM:
 https://www.pnas.org/doi/full/10.1073/pnas.2016239118
EVE:
 https://www.nature.com/articles/s41586-021-04043-8
Alphafold2:
 https://github.com/deepmind/alphafold
RoseTTAFold:
 https://github.com/RosettaCommons/RoseTTAFold

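What is an Enzyme
◍ Enzymes are a class of macromolecular biological catalysts, and most enzymes are proteins. They speed up chemical reactions. In modern industry, enzymes can replace chemicals as important production catalysts, creating more products from fewer resources while saving energy, reducing waste, and speeding up production.
◍ However, enzymes work only under very strict temperature conditions. When heated or exposed to chemical denaturants, the enzyme structure unfolds (denatures), the original structure is disrupted, and activity is usually lost. This limits the range of settings in which enzymes can be used.
◍ The molecular structure of a protein can be divided into four levels, each describing a different aspect.

Amino acid      3-letter  1-letter
Alanine         Ala       A
Arginine        Arg       R
Asparagine      Asn       N
Aspartate       Asp       D
Cysteine        Cys       C
Glutamic acid   Glu       E
Glutamine       Gln       Q
Glycine         Gly       G
Histidine       His       H
Isoleucine      Ile       I
Leucine         Leu       L
Lysine          Lys       K
Methionine      Met       M
Phenylalanine   Phe       F
Proline         Pro       P
Serine          Ser       S
Threonine       Thr       T
Tryptophan      Trp       W
Tyrosine        Tyr       Y
Valine          Val       V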

Spearman's rank
https://www.google.com/imgres?imgurl=https%3A%2F%2Fimg-blog.csdnimg.cn%2F2019032717063990.png&imgrefurl=https%3A%2F%2Fblog.csdn.net%2Fgaifuxi9518%2Farticle%2Fdetails%2F88849283&tbnid=SH0Pg3IWoAOU6M&vet=12ahUKEwj7t9Ker6_7AhWOAaYKHYVuCeAQMyhEegQIARBk..i&docid=JkOIIPPh0E6MzM&w=446&h=162&q=spearman%E7%9B%B8%E9%97%9C%E4%BF%82%E6%95%B8&ved=2ahUKEwj7t9Ker6_7AhWOAaYKHYVuCeAQMyhEegQIARBk
https://www.tes.com/teaching-resource/spearman-s-rank-correlation-cie-a-level-biology-12411879
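Spearman's rank correlation is the Pearson correlation computed on ranks, so it rewards getting the order of the variants right, not the exact Tm values. Below is a minimal standard-library sketch. The function names are hypothetical, and inputs are assumed to have at least two distinct values each.

```python
import math


def _average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return covariance / (spread_x * spread_y)
```

Any monotonic relationship gives a value close to 1, for example `spearman([1, 2, 3, 4], [1, 4, 9, 16])`, which is why models that predict ΔΔG or energy scores instead of Tm can still be scored this way.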
