2. About me
• Member of the AI group, CTBC Data R&D Center
• Past experience in
• Cyber security and defense industry
• Smartphone industry
• Familiar with
• Machine learning
• Natural language processing
• Software development
• Cloud native architecture design
3. Team
• The CTBC Data R&D Center AI group was founded in 2018
• The AI group is composed of data scientists and software developers
• Our mission is to deliver AI-based solutions for banking scenarios
• We currently focus on
• Computer Vision (CV)
• Natural Language Processing (NLP)
Retrieved from https://www.ithome.com.tw/news/131697
4. Achievement
NLP
• Pluto: A Deep Learning based Watchdog for Anti Money Laundering
• First vertical AI paradigm in the RegTech field at CTBC globally
• Reduces daily human effort on adverse media screening by 67%
• Publication
• https://www.aclweb.org/anthology/W19-5515
CV
• NIST Face Recognition Vendor Test (FRVT)
• Ranked 35th globally
• Ranked 2nd among Taiwanese companies
• X-ATM for fraud avoidance
Rank   Company                       Country   FRR
10     SenseTime (商湯)              China     0.0092
18     Face++ (曠視)                 China     0.0145
26     CyberLink (訊連)              Taiwan    0.0195
29     Tencent Deepsea (騰訊)        China     0.0215
35     CTBC BANK (中國信託)          Taiwan    0.0250
39     Gorilla Technology (大猩猩)   Taiwan    0.0291
55     Kneron Inc. (耐能)            Taiwan    0.0902
11. Key Components in NLU
• Intent Recognizer
• Classification problem
• Named Entity Extractor
• Sequence labeling problem
Approach
• Preprocessing: tokenizer, POS tagger
• Vectorization: embeddings
• Modeling: supervised learning methods
• Deep Neural Networks (DNN)
• Conditional Random Field (CRF)
• Recurrent Neural Network (RNN)
12. Data Preparation
• Intent dataset
• 1016 samples over 3 distinct classes
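• 試算匯兌 (exchange rate calculation), 查詢存款利率 (deposit interest rate inquiry), 查詢台外幣餘額 (TWD and foreign currency balance inquiry)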
• Named entity dataset
• 977 samples over 6 distinct entities
• amount, money, duration, currency, acnt_type, timestamp
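To make the two dataset shapes concrete, here is a minimal sketch of what one annotated sample of each kind could look like. The BIO tagging scheme, the specific tags assigned to each token, and the helper name `extract_entities` are assumptions for illustration only; the actual annotation format of the CTBC datasets is not described in these slides.

```python
from typing import List, Tuple

# One utterance, already tokenized.
tokens = ["現在", "美金", "一年期", "定存", "是", "多少"]

# Intent sample: ONE label for the whole utterance (classification).
intent_sample = {"tokens": tokens, "intent": "查詢存款利率"}

# Entity sample: ONE tag per token (sequence labeling).
# Hypothetical BIO tags using entity names from the list above.
entity_sample = {
    "tokens": tokens,
    "tags": ["O", "B-currency", "B-duration", "B-acnt_type", "O", "O"],
}


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse BIO tags into (entity_label, surface_text) pairs."""
    entities = []
    label, parts = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                entities.append((label, "".join(parts)))
            label, parts = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            parts.append(token)
        else:
            if label is not None:
                entities.append((label, "".join(parts)))
            label, parts = None, []
    if label is not None:
        entities.append((label, "".join(parts)))
    return entities


entities = extract_entities(entity_sample["tokens"], entity_sample["tags"])
print(intent_sample["intent"], entities)
```

The intent sample carries a single label, while the entity sample carries a tag sequence of the same length as the token sequence, which is why the two tasks call for different model types.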
Acknowledgment: many thanks to 數位金融處 (Digital Finance Division) and 個金數位營運處 (Retail Banking Digital Operations Division)
13. Intent Classification Techniques
• Preprocessing
• Tokenization (ckiptagger)
• Feature extraction
• Bag of Words (scikit-learn)
• Model
• Deep Neural Network (DNN) (tensorflow)
Feature engineering (word count encoding)
Text: 現在美金一年期定存是多少 ("What is the current rate for a one-year USD time deposit?")
Tokens: 現在 美金 一年期 定存 是 多少
Vocabulary: [ "現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少" ]
Feature vector: [ 1, 0, 1, 0, 1, 1, 1, 1 ]
Model training: the DNN is trained on the feature vectors
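The word count encoding on this slide can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the team's scikit-learn pipeline; the function name `bag_of_words` is hypothetical, and the token list is assumed to come from a tokenizer such as ckiptagger.

```python
from collections import Counter
from typing import List

# Vocabulary from the slide, in a fixed order: position i of the
# feature vector counts occurrences of VOCABULARY[i].
VOCABULARY = ["現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少"]


def bag_of_words(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Return the word count vector of `tokens` over `vocabulary`."""
    counts = Counter(tokens)
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocabulary]


tokens = ["現在", "美金", "一年期", "定存", "是", "多少"]
features = bag_of_words(tokens, VOCABULARY)
print(features)  # [1, 0, 1, 0, 1, 1, 1, 1]
```

Word order is discarded by this encoding, which is acceptable for intent classification because the label applies to the utterance as a whole.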
14. Named Entity Recognition Techniques
Model I: CRF for Word-Level Features
• Preprocessing
• Tokenization (ckiptagger)
• POS tagging (ckiptagger)
• Feature extraction
• Text and POS tags within context
• Model
• Conditional Random Field (CRF) (scikit-learn)
Feature engineering
Text: 現在美金一年期定存是多少
Tokens: 現在(Nd) 美金(Na) 一年期(Na) 定存(Na) 是(SHI) 多少(Neqa)
Feature vector (context window: 3 tokens): …, ( -1:現在, -1:Nd, 0:美金, 0:Na, 1:一年期, 1:Na ), …
Model training: the CRF is trained on the feature vectors
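The context-window features for the CRF can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `window_features` and the key format (`"-1:word"`, `"-1:pos"`) are hypothetical, chosen to mirror the slide's notation. A list of one feature dict per token is a format that CRF toolkits such as sklearn-crfsuite commonly accept.

```python
from typing import Dict, List

tokens = ["現在", "美金", "一年期", "定存", "是", "多少"]
pos_tags = ["Nd", "Na", "Na", "Na", "SHI", "Neqa"]


def window_features(tokens: List[str], pos_tags: List[str],
                    index: int, half_window: int = 1) -> Dict[str, str]:
    """Features for tokens[index]: word and POS tag of each token in the window.

    half_window=1 gives a 3-token window (previous, current, next).
    Positions that fall outside the sentence are skipped.
    """
    features = {}
    for offset in range(-half_window, half_window + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            features[f"{offset}:word"] = tokens[position]
            features[f"{offset}:pos"] = pos_tags[position]
    return features


# One feature dict per token in the sentence.
sentence_features = [window_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(sentence_features[1])
```

For the token 美金 (index 1) this yields the previous, current, and next word together with their POS tags, matching the example on the slide.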
15. Named Entity Recognition Techniques
Model II: Bi-LSTM-CRF for Word-Level Embeddings
• Preprocessing
• Tokenization (ckiptagger)
• Model
• Embedding layer (keras)
• Long Short-Term Memory (LSTM) layer (keras)
• CRF layer (keras)
Text: 現在美金一年期定存是多少
Tokens: 現在 美金 一年期 定存 是 多少
Model training: embedding learning, then feature learning
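To show what the CRF layer contributes on top of the LSTM, here is a small Viterbi decoding sketch in plain Python. It is not the Keras implementation: in a Bi-LSTM-CRF the per-token emission scores come from the LSTM and the transition scores are learned by the CRF layer, whereas every number below is made up for illustration, and the function name `viterbi_decode` is hypothetical.

```python
from typing import Dict, List, Tuple

TAGS = ["O", "B-currency", "I-currency"]


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[Tuple[str, str], float],
                   tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag] scores `tag` at token t; transitions[(prev, curr)]
    scores moving from `prev` to `curr` (missing pairs score 0.0).
    """
    best = [{tag: emissions[0][tag] for tag in tags}]
    backpointers = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for curr in tags:
            # Best previous tag for ending in `curr` at position t.
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, curr), 0.0))
            scores[curr] = best[-1][prev] + transitions.get((prev, curr), 0.0) + emissions[t][curr]
            pointers[curr] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Illustrative scores for the tokens 現在, 美金, 是.
emissions = [
    {"O": 2.0, "B-currency": 0.1, "I-currency": 0.0},
    {"O": 0.8, "B-currency": 1.0, "I-currency": 1.5},
    {"O": 2.0, "B-currency": 0.0, "I-currency": 0.3},
]
# An entity cannot continue (I-) directly after O, so that move is penalized.
transitions = {("O", "I-currency"): -5.0}

greedy = [max(TAGS, key=lambda tag: scores[tag]) for scores in emissions]
decoded = viterbi_decode(emissions, transitions, TAGS)
print(greedy)   # ['O', 'I-currency', 'O']
print(decoded)  # ['O', 'B-currency', 'O']
```

Picking the best tag per token independently produces an invalid sequence here (an I- tag right after O), while decoding with transition scores recovers a consistent one. That is the main reason to place a CRF layer after the LSTM for sequence labeling.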
24. Prototype
Why Rasa?
Rasa characteristics
• Own our data
• Preserve privacy
• Do not hand data over to big tech companies
• Open source
• Transparency
• Community support
• Extensible architecture
• Task-oriented dialogue architecture
• Customizable components
CTBC strategy
• Customize Mandarin-based components
• Integration with core technology
• Compliance with security and regulation
• Customized scenarios
• Ownership of core technology
28. Conclusion
Summary
• NLU is a key module in task-oriented dialogue systems
• The intent recognizer and entity extractor are the key components of NLU, built with machine learning techniques and annotated data
• DNNs generally outperform traditional methods, but not on every task
• Rasa, an open-source framework, supports building conversational assistants from scratch
29. Conclusion
What's next
• Transfer learning by initializing from pre-trained word embeddings
• Word-based vs. character-based embeddings
• Model engineering