4. Novozymes Enzyme Stability Prediction
Kaggle is the largest data science competition website
The goal of this competition is to predict the thermostability of enzyme variants (Tm)
1,331 teams in total
5. What is Protein
Primary structure
Secondary structure
Tertiary structure
Quaternary structure
6. What is Enzyme
A large group of proteins
Can accelerate chemical reactions
Can be used as a biologic therapy
pH and Tm are crucial factors
8. Market
The global biologics therapy market was estimated at US$ 366.43 billion in 2021 and is expected to exceed US$ 719.84 billion by 2030, a CAGR of 7.8% from 2022 to 2030.
Source: precedenceresearch/Biologics Market
[Figure: Biologic Therapy Global Market 2021-2030. Panel labels: market growth will accelerate at a CAGR of; growth contributed by North America; incremental growth ($B); growth for 2021. Values shown: 7.8%, 57%, 719.84, 196.2%.]
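As a quick sanity check on the market figures quoted on this slide, the sketch below computes the compound annual growth rate implied by the 2021 and 2030 market sizes. The function name and the choice of 9 compounding periods (2021 as the base year) are assumptions made for illustration; under them the result should land close to the quoted 7.8%.

```python
def implied_cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate implied by two values `periods` years apart."""
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Market sizes quoted on the slide, in billions of US dollars.
market_2021 = 366.43
market_2030 = 719.84

# Assumption: 9 compounding periods (2021 base year through 2030).
cagr = implied_cagr(market_2021, market_2030, periods=9)
print(f"Implied CAGR: {cagr:.1%}")
```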
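9. Traditional development vs. AI development
Traditional development (at least 10 years in total)
Demand → Drug discovery (3-5 years) → Pre-clinical (1-2 years) → Clinical trials (6-7 years) → Regulatory approval (1-2 years)
AI development (5 years in total)
Drug discovery → Pre-clinical → Clinical trials → Regulatory approval, with AI applied along the pipeline (stage durations shown: 1-2 years, 2-3 years)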
12. Models
XGBoost
A machine-learning model trained on different features derived from the protein sequence
PyRosetta
Compares the 3D structures of the wild type and the mutant type
ProteinBERT
Treats the protein sequence as a “natural language”
3D-CNN
Analyzes the protein's 3D structure to extract the protein’s features
13. XGBoost Process
Training data (Kaggle): X = protein sequence, y = Tm
Data cleaning
1. Drop rows with pH > 9 or pH < 5.5
2. Embedding: length, entropy, aaindex1, atc, aac
Training → Model
Testing data → Data cleaning → Model
Outcome: ŷ = predicted Tm (protein melting point)
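The pH filter in the data-cleaning step can be sketched in plain Python as follows. The rows are made-up toy values, the column names (`protein_sequence`, `pH`, `tm`) are assumptions about the training table, and dropping rows with a missing pH is an extra assumption not stated on the slide.

```python
# Toy rows standing in for the training table (made-up values, not real data).
# Column names are assumptions for this sketch.
rows = [
    {"protein_sequence": "MKVLAAGIV", "pH": 7.0, "tm": 48.4},
    {"protein_sequence": "MSTNPKPQR", "pH": 10.5, "tm": 55.9},
    {"protein_sequence": "MALWMRLLP", "pH": 3.0, "tm": 40.5},
    {"protein_sequence": "MGSSHHHHH", "pH": None, "tm": 50.5},
]


def drop_extreme_ph(rows, low=5.5, high=9.0):
    """Keep rows whose pH is known and lies within [low, high].

    Equivalent to dropping rows with pH > high or pH < low.
    Rows with a missing pH are also dropped (an assumption of this sketch).
    """
    return [r for r in rows if r["pH"] is not None and low <= r["pH"] <= high]


cleaned = drop_extreme_ph(rows)

# Split the cleaned rows into model inputs (X) and targets (y).
X = [r["protein_sequence"] for r in cleaned]
y = [r["tm"] for r in cleaned]
print(len(rows), "rows before cleaning,", len(cleaned), "after")
```

The thresholds are parameters, so a different cut-off pair can be tried by passing `low` and `high` explicitly.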
14. Data Cleaning
pH
Drop rows with pH > 10 or pH < 6
Length
Sequence length in amino acids
Entropy
Shannon entropy of each sequence in the dataset
aac
Frequency of each amino acid in each sequence in the dataset
atc
Sum of the atomic and bond compositions for each amino acid sequence
aaindex1
A database of 566 indices, each a set of 20 numerical values (one per amino acid) representing a physicochemical or biological property of amino acids
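Three of the sequence features listed here (length, Shannon entropy, and aac) can be computed with the standard library alone. This is a minimal sketch: the function names are hypothetical, and the entropy uses log base 2 (bits), which is an assumption since the slide does not state the base.

```python
import math
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_length(seq: str) -> int:
    """Sequence length in amino acids."""
    return len(seq)


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue distribution of one sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def amino_acid_composition(seq: str) -> dict:
    """Frequency of each of the 20 standard amino acids in one sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}


seq = "ACAC"
features = [sequence_length(seq), shannon_entropy(seq)]
features += list(amino_acid_composition(seq).values())
print(len(features), "features for", seq)
```

Concatenating the values into one flat list per sequence gives a fixed-width feature vector of the kind a tree-based model expects.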
15. PyRosetta Process
Training data → Training → Outcome
X: protein sequence
Testing data
1. Input data
2. Import model
3. Predict scores
17. ProteinBERT Process
Training data → Training → Outcome
X: protein sequence
1. Input data
2. Import model
4. Predict protein melting point (ŷ = predicted Tm)
seq_id tm
0 75.7
1 50.5
2 40.5
3 47.2
4 49.5
5 48.4
6 45.7
7 55.9
8 48.1
18. NLP Model and Transfer Learning
◍ Protein Sequence vs NLP
○ Meaningful sentence
○ Bio-language
◍ BERT, a stronger NLP model pretrained on two tasks
○ Language modeling
○ Next sentence prediction
◍ Use Transfer Learning
○ Less training time
○ Can significantly improve the efficiency of reinforcement learning
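To make the "protein sequence as language" idea concrete, the sketch below treats each amino acid as a token and turns a sequence into a fixed-length list of integer ids, the form of input a BERT-style model consumes. The vocabulary, special tokens, and function name are hypothetical and are not taken from ProteinBERT's own tokenizer.

```python
# Hypothetical vocabulary: four special tokens followed by the 20 amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def encode(seq: str, max_len: int) -> list:
    """Map a protein sequence to token ids, padded or truncated to max_len.

    Unknown residues map to <unk>. Truncation simply cuts the list,
    so the <end> token is lost for sequences that are too long.
    """
    ids = [VOCAB["<start>"]]
    ids += [VOCAB.get(aa, VOCAB["<unk>"]) for aa in seq]
    ids.append(VOCAB["<end>"])
    ids = ids[:max_len]
    return ids + [VOCAB["<pad>"]] * (max_len - len(ids))


print(encode("AC", max_len=6))
```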
19. ProteinBERT
◍ BERT vs Protein
○ Protein sequences do not have “chunks”
○ Proteins have 3D structures, which are more complex than sentences
◍ Import Gene Ontology (GO)
○ Defines the protein’s spatial distance in the organism
○ Separate global and local, then process each
◍ Fine-tuning steps
○ Input the pretrained model
○ Build the fine-tuning model
○ Set the fine-tuning parameters
○ Input the protein into the BERT model and carry out fine-tuning
○ Rank the test data
Reference: https://github.com/nadavbra/protein_bert/blob/master/ProteinBERT demo.ipynb
20. 3D-CNN Process
Training data → Training → Outcome
X: protein sequence
1. Input data
2. 3D structure
3. Import model (3D-CNN)
4. Predict ΔΔG
seq_id ΔΔG
31390 1.404995
31391 1.343793
31392 0.241666
31393 0.534203
31394 0.134588
31395 0.697623
31396 1.346896
31397 1.001297
31398 1.083319
21.
22. 3D Convolutional Neural Networks
Source: Bian Li, Yucheng T. Yang, John A. Capra, Mark B. Gerstein, “Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks”
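A 3D-CNN needs the protein structure on a regular grid, much as a 2D CNN needs pixels. The sketch below shows one simple way to do that: counting atoms per voxel in a cube centred on a chosen point, such as the mutation site. The function name, the grid size, and the voxel width are assumptions made for illustration and are not taken from the slides. A real pipeline would typically use several channels (for example, one per atom type) rather than a single occupancy count.

```python
import math


def voxelize(atoms, center, grid_size=16, voxel_width=1.0):
    """Count atoms per voxel in a cube of grid_size^3 voxels around `center`.

    atoms  : iterable of (x, y, z) coordinates
    center : (x, y, z) of the cube centre, e.g. the mutated residue
    Atoms that fall outside the cube are ignored.
    """
    half = grid_size * voxel_width / 2.0
    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for atom in atoms:
        # Shift into cube coordinates, then convert to integer voxel indices.
        idx = [math.floor((a - c + half) / voxel_width) for a, c in zip(atom, center)]
        if all(0 <= i < grid_size for i in idx):
            grid[idx[0]][idx[1]][idx[2]] += 1
    return grid


# Toy coordinates (made up): one atom near the centre, one far outside the cube.
grid = voxelize([(0.5, 0.5, 0.5), (100.0, 0.0, 0.0)], center=(0.0, 0.0, 0.0), grid_size=4)
occupied = sum(v for plane in grid for row in plane for v in row)
print("atoms inside the grid:", occupied)
```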
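33. What is Enzyme
Enzymes are a class of macromolecular biological catalysts, and most enzymes are proteins. They speed up chemical reactions. In modern industry, enzymes can replace chemicals as important production catalysts, creating more products from fewer resources while saving energy, reducing waste, and speeding up production.
However, enzymes work only under very strict temperature conditions. When an enzyme is heated or exposed to chemical denaturants, its structure unfolds (that is, it denatures): the original structure is disrupted and activity is usually lost as well. This limits the range of settings in which enzymes can be used.
The molecular structure of a protein can be divided into four levels (primary, secondary, tertiary, and quaternary), each describing a different aspect of the structure.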
Amino acid      3-letter  1-letter
Alanine         Ala       A
Arginine        Arg       R
Asparagine      Asn       N
Aspartate       Asp       D
Cysteine        Cys       C
Glutamic acid   Glu       E
Glutamine       Gln       Q
Glycine         Gly       G
Histidine       His       H
Isoleucine      Ile       I
Leucine         Leu       L
Lysine          Lys       K
Methionine      Met       M
Phenylalanine   Phe       F
Proline         Pro       P
Serine          Ser       S
Threonine       Thr       T
Tryptophan      Trp       W
Tyrosine        Tyr       Y
Valine          Val       V
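The code table above maps directly onto a lookup dictionary, which is handy when converting between the one-letter sequences used in the dataset and three-letter residue names used by structure tools. A minimal sketch, with a hypothetical helper name:

```python
# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(seq: str) -> str:
    """Spell out a one-letter sequence with three-letter codes.

    Raises KeyError for any letter that is not a standard amino acid.
    """
    return "-".join(ONE_TO_THREE[aa] for aa in seq)


print(to_three_letter("MKV"))
```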