Extreme-scale text-based classification of medical data
Anton Hristov & Svetla Boytcheva
18 May 2021
making sense of text and data
Presentation outline
o Medical Ontologies
o Linked Open Data
o Dataset generation
o Data Augmentation
o Text based classification
o Classification model
o Embeddings
o eXtreme scale classification
o About 80% of Electronic Health Records are in unstructured format
o Need for NLP tools for processing clinical text
o Lack of multilingual terminology resources and domain-specific ontologies
The automatic processing and knowledge extraction from medical records is a task with public importance
Clinical text
HISTORY OF PRESENT ILLNESS :The patient is an 80 female with
a history of diastolic function and heart failure , hypertension and
rheumatoid arthritis who presents from an outside hospital with
presyncope.
Clinical text
OPERATIONS / PROCEDURES :Dobutamine stress test , cardiac
ultrasound , EGD , chest x-ray , PICC placement .The patient is a
62-year-old female with a history of diabetes mellitus ,
hypertension , COPD , hypercholesterolemia , depression and CHF
Clinical text
HISTORY OF PRESENT ILLNESS :The patient is a 63 year-old
woman transferred for evaluation of thrombotic thrombocytopenic
purpura and bronchiolitis obliterans organizing pneumonia .
Why is the task of concept normalization so important?
o Disambiguation
o Usage of URI
o Data integration
o Reasoning
o Similarity search
o Phenotypes
Text-based classification
a process of assigning tags or categories to text
according to its content.
Standard Classification & Ontologies
SNOMED CT
Objective
To develop methods for automatically associating SNOMED CT codes
with textual descriptions of diagnoses
How to find training data?
o For 150,000 classes we will need a huge training dataset
o Clinical data are not publicly available due to GDPR issues
o There are very few manually annotated datasets
o We need to rely only on publicly available sources:
− Other standard classifications and ontologies
− Open data
ICD-10 CM
ICD-11
DOID
https://w.wiki/3Lyc
https://w.wiki/3Lyh
https://w.wiki/3Lys
Medical Ontologies Mappings
o 1:1
o 1:N
o N:M
o No mappings
Source: https://library.ahima.org/doc?oid=106975#.YKOy_agzaHu
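The mapping types listed above can be told apart mechanically once cross-references are available as pairs. The following is a minimal sketch, assuming a hypothetical list of (source code, target code) pairs; the codes are made-up placeholders, not real ICD or SNOMED CT identifiers, and the function name is illustrative only.

```python
from collections import defaultdict


def mapping_cardinality(pairs, all_sources):
    """Label each source code with the kind of mapping it takes part in."""
    targets_of = defaultdict(set)  # source code -> set of target codes
    sources_of = defaultdict(set)  # target code -> set of source codes
    for src, tgt in pairs:
        targets_of[src].add(tgt)
        sources_of[tgt].add(src)

    result = {}
    for src in all_sources:
        targets = targets_of.get(src, set())
        if not targets:
            result[src] = "no mapping"
            continue
        # A target that is also reached from another source makes the
        # relation many-to-many from the point of view of this sketch.
        shared = any(len(sources_of[t]) > 1 for t in targets)
        if shared:
            result[src] = "N:M"
        elif len(targets) == 1:
            result[src] = "1:1"
        else:
            result[src] = "1:N"
    return result


# Placeholder codes (assumption): "A*" from one ontology, "S*" from another.
pairs = [("A1", "S1"), ("A2", "S2"), ("A2", "S3"),
         ("A3", "S4"), ("A4", "S4"), ("A4", "S5")]
kinds = mapping_cardinality(pairs, ["A1", "A2", "A3", "A4", "A5"])
```

Codes that end up as "1:1" can be used directly as training links, while "1:N" and "N:M" cases are the ones that would need extra heuristics or cleaning.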
ExaMode dataset
Dataset version 1
• Summary:
– 22M+ data records
• 128K+ SNOMED codes
• 280K+ textual descriptions
- 17K+ undiscovered connections
Dataset Generation
o More data – more problems
o Data cleaning
o Unbalanced dataset
o Overrepresented vs underrepresented classes
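A first look at class imbalance can be as simple as counting samples per class. The sketch below assumes the dataset is available as a flat list of class labels; the thresholds `low` and `high` and the class names are arbitrary illustration values, not figures from the ExaMode dataset.

```python
from collections import Counter


def split_by_frequency(labels, low=2, high=5):
    """Return per-class counts plus under- and overrepresented classes."""
    counts = Counter(labels)
    under = sorted(c for c, n in counts.items() if n < low)
    over = sorted(c for c, n in counts.items() if n > high)
    return counts, under, over


# Toy label list (assumption): three classes with very different sizes.
labels = ["C1"] * 8 + ["C2"] * 3 + ["C3"]
counts, under, over = split_by_frequency(labels)
```

Underrepresented classes are natural candidates for augmentation, while overrepresented ones could be downsampled.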
Data Augmentation
o The original idea for dataset enlargement
− Datasets with images for Neural networks training
o Popular techniques:
− Flip
− Rotation
Data Augmentation
o Popular techniques:
− Scale
− Crop
− Translate
− Pixel/Region change (fill with constant)
− Pixel/Region swap
− ….
Types of data augmentation that are applicable
for textual data
o Swap random letters within a single word
o Swap random words within a text
o Replace word with its synonym
o Delete random letter within a single word
o Replace a random letter with a letter close to it on the keyboard
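The five text augmentations above can be sketched in a few lines each. This is an illustration under stated assumptions, not the pipeline used for the dataset: the keyboard-neighbour map and the synonym dictionary are tiny hypothetical stand-ins, the letter swap is limited to adjacent letters, and synonym replacement is applied to every word that has an entry.

```python
import random

# Hypothetical, partial QWERTY neighbour map (illustration only).
KEYBOARD_NEIGHBOURS = {"a": "qwsz", "e": "wrsd", "o": "iklp", "t": "rfgy"}

# Hypothetical synonym dictionary; a real one could come from an ontology.
SYNONYMS = {"tumor": "neoplasm", "high": "elevated"}


def swap_letters(word, rng):
    """Swap two adjacent letters at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def swap_words(text, rng):
    """Swap two randomly chosen words in a text."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def replace_synonym(text, synonyms=SYNONYMS):
    """Replace every word that has a known synonym."""
    return " ".join(synonyms.get(w, w) for w in text.split())


def delete_letter(word, rng):
    """Delete one random letter."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def keyboard_typo(word, rng, neighbours=KEYBOARD_NEIGHBOURS):
    """Replace one letter with a key that sits next to it."""
    positions = [i for i, ch in enumerate(word) if ch in neighbours]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + rng.choice(neighbours[word[i]]) + word[i + 1:]


rng = random.Random(0)  # fixed seed so runs are repeatable
augmented = [
    swap_letters("fracture", rng),
    swap_words("acute renal failure", rng),
    replace_synonym("malignant tumor"),
    delete_letter("fracture", rng),
    keyboard_typo("fracture", rng),
]
```

Each function keeps the class label unchanged, so every augmented string can be added to the training set under the code of its source description.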
ExaMode dataset
Dataset version 2
Remove noise
• Additional data augmentations
• Additional heuristics
• Additional data cleaning
• Split the dataset into 3 subgroups:
– Disorders
– Procedures
– Findings
ExaMode dataset
Dataset version 2
Summary:
– Disorders: ~105K SNOMED codes
– Procedures: ~67K SNOMED codes
– Findings: ~70K SNOMED codes
Text based classification
o Binary classification
o Multiclass classification
o Multilabel classification
Binary classification
o Each sample takes exactly 1 label out of 2 classes
Review                             | Sentiment
Delivered as expected              | Positive
Good quality                       | Positive
There are scratches on the surface | Negative
Works great                        | Positive
I do not recommend it              | Negative
Multiclass classification
o Each sample takes exactly 1 label out of a number of classes
Movie             | Rating
Palmer            | 7
Bad Trip          | 6
Godzilla vs. Kong | 6
Band of Brothers  | 9
Big fish          | 8
Multilabel classification
o Each sample takes one or more labels out of a number of classes
Movie             | Drama | Comedy | Action | Sci-Fi | War | Adventure | Fantasy
Palmer            | 1     | 0      | 0      | 0      | 0   | 0         | 0
Bad Trip          | 0     | 1      | 0      | 0      | 0   | 0         | 0
Godzilla vs. Kong | 0     | 0      | 1      | 1      | 0   | 0         | 0
Band of Brothers  | 1     | 0      | 1      | 0      | 1   | 0         | 0
Big fish          | 1     | 0      | 0      | 0      | 0   | 1         | 1
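The multilabel table above is a multi-hot encoding: one 0/1 position per class. A minimal sketch of building and decoding such vectors, using the movies and genres from the slide (the function names are illustrative):

```python
GENRES = ["Drama", "Comedy", "Action", "Sci-Fi", "War", "Adventure", "Fantasy"]

movie_genres = {
    "Palmer": ["Drama"],
    "Bad Trip": ["Comedy"],
    "Godzilla vs. Kong": ["Action", "Sci-Fi"],
    "Band of Brothers": ["Drama", "Action", "War"],
    "Big fish": ["Drama", "Adventure", "Fantasy"],
}


def multi_hot(labels, classes=GENRES):
    """Encode a list of labels as a 0/1 vector over all classes."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]


def decode(vector, classes=GENRES):
    """Turn a 0/1 vector back into the list of labels it marks."""
    return [c for c, bit in zip(classes, vector) if bit]


encoded = {movie: multi_hot(genres) for movie, genres in movie_genres.items()}
```

Binary and multiclass targets are special cases of the same layout: exactly one position is set to 1.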
Classification model
o BERT (Bidirectional Encoder Representations from
Transformers)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin.
Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805, 2018.
Classification model
o Why was BERT created?
o Big gap in the data
Classification model
o BERT core idea
Source: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019
Classification model
o BERT used for classification
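In the BERT paper, classification is done by feeding the final representation of the special [CLS] token into a linear layer followed by a softmax. The sketch below shows only that classification head in plain Python; the encoder is not reproduced, and the 3-dimensional `cls_vector`, the weights and the biases are made-up numbers for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, biases):
    """Linear layer plus softmax; returns (best class index, probabilities)."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical pooled [CLS] vector and a 3-class head (assumptions).
cls_vector = [0.2, -1.0, 0.5]
weights = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
biases = [0.0, 0.0, 0.0]
best, probs = classify(cls_vector, weights, biases)
```

During fine-tuning the head and the encoder weights are trained together; with hundreds of thousands of classes the size of this output layer is what makes the task "extreme".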
Classification model
o BERT advantages
− Incredible performance
− Open source
− Easy to pretrain with a small amount of medical data
Classification model
o BERT pretrained models:
− BioBERT
− Multilingual BERT
− SlavicBERT
− ClinicalBERT
− PubMedBERT
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019.
Mikhail Arkhipov, Maria Trofimova, Yurii Kuratov, and Alexey Sorokin. Tuning multilingual transformers for language-specific named entity recognition. 2019.
Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. Publicly available clinical BERT embeddings. In ClinicalNLP workshop at NAACL, 2019.
Yu Gu et al. Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779, 2020.
Embeddings
o Student: [2, 7]
o School: [3, 6]
o University: [1, 5]
o Dog: [6, 2.5]
o Cat: [5, 2]
o Fish: [7.5, 1]
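With the toy 2-D vectors above, "similar meaning" becomes "small distance". A minimal sketch that finds the nearest neighbour of a word by Euclidean distance; the choice of Euclidean distance is an assumption here, and cosine similarity is a common alternative.

```python
import math

# 2-D toy embeddings copied from the slide.
EMBEDDINGS = {
    "Student": [2, 7],
    "School": [3, 6],
    "University": [1, 5],
    "Dog": [6, 2.5],
    "Cat": [5, 2],
    "Fish": [7.5, 1],
}


def nearest(word, embeddings=EMBEDDINGS):
    """Return the other word whose vector is closest to the given word."""
    return min(
        (w for w in embeddings if w != word),
        key=lambda w: math.dist(embeddings[word], embeddings[w]),
    )


closest_to_student = nearest("Student")
closest_to_dog = nearest("Dog")
```

Education-related words end up next to each other and animals form a second group, which is the property that label clustering relies on later in the deck.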
Embeddings
o Deep learning embeddings
Figure is based on: Park, Dongju & Ahn, Chang Wook. Self-Supervised Contextual Data Augmentation for Natural Language Processing, 2019
eXtreme scale classification
o Labels clustering
o Dataset with 10K+ classes
Labels clustering
o Labels embeddings
o Embeddings clustering
Labels embeddings
Embeddings clustering
o Clustering algorithms:
− Agglomerative clustering
− DBSCAN
− K-Means
− Mean Shift
− Spectral Clustering
− ...
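As an illustration of clustering label embeddings, here is a small K-Means sketch in plain Python. It reuses the 2-D toy vectors from the Embeddings slide as stand-ins for label embeddings and starts from two hand-picked centroids; both choices are assumptions for the example, and a real setup would more likely use a library implementation on high-dimensional vectors.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain K-Means: assign points to the nearest centroid, then re-average."""

    def closest(p, cents):
        return min(range(len(cents)), key=lambda i: math.dist(p, cents[i]))

    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [
            [sum(axis) / len(cluster) for axis in zip(*cluster)] if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    assignment = [closest(p, centroids) for p in points]
    return centroids, assignment


labels = ["Student", "School", "University", "Dog", "Cat", "Fish"]
points = [[2, 7], [3, 6], [1, 5], [6, 2.5], [5, 2], [7.5, 1]]
# Hand-picked starting centroids (assumption): "Student" and "Dog".
centroids, assignment = kmeans(points, [[2, 7], [6, 2.5]])
clusters = {}
for label, idx in zip(labels, assignment):
    clusters.setdefault(idx, []).append(label)
```

Once labels are grouped, a classifier only has to choose among a few clusters first instead of among all classes at once.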
Refinement
o Possible solutions:
− Classical shallow ANN
− Deep learning approach
− Binary classifiers for every label
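The slides do not spell out how refinement is wired in. One possible reading of the last option is that, once a cluster has been predicted, a separate binary classifier is run for each label in that cluster. The sketch below follows that reading with hypothetical logistic models; the feature vector, label names, weights and threshold are all invented for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def refine(features, cluster_labels, label_models, threshold=0.5):
    """Score every label of the predicted cluster with its own binary model."""
    scores = {}
    for label in cluster_labels:
        weights, bias = label_models[label]
        z = sum(w * x for w, x in zip(weights, features)) + bias
        scores[label] = sigmoid(z)
    accepted = [label for label, s in scores.items() if s >= threshold]
    return accepted, scores


# Hypothetical per-label models: (weights, bias) for a 2-feature input.
label_models = {
    "label_a": ([2.0, 0.0], -1.0),
    "label_b": ([0.0, 2.0], -1.0),
}
accepted, scores = refine([1.0, 0.0], ["label_a", "label_b"], label_models)
```

Because only the labels of one cluster are scored, the cost per sample depends on the cluster size rather than on the total number of classes.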
Acknowledgements
o Alexander Tahchiev
o Andrey Avramov
o Hristo Papazov
o Pavlin Gyurov
o Todor Primov
o Stanislav Slavkov
https://www.datasciencesociety.net/
https://www.ontotext.com
Thank you!
See Ontotext Platform demos
Star Wars API: https://swapi-platform.ontotext.com/graphiql/
Platform monitoring: https://test-platform.ontotext.com/grafana/
