Medical Artificial Intelligence: How AI Is Transforming Medicine

Professor, SAHIST, Sungkyunkwan University
Director, Digital Healthcare Institute
Yoon Sup Choi, Ph.D.

“It's in Apple's DNA that technology alone is not enough.  
It's technology married with liberal arts.”

The Convergence of IT, BT and Medicine

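By Yoon Sup Choi
Medical Artificial Intelligence
Cover design: Choi Seung-hyeop
Yoon Sup Choi is a convergence life scientist, futurist of medicine, entrepreneur, angel investor, and evangelist whose central theme is creating innovation and social value in digital healthcare through the convergence of computer science, life science, and medicine. As one of Korea's leading experts in digital healthcare, he first introduced the field to Korea through active research, writing, and lectures.
He double-majored in computer science and life science at Pohang University of Science and Technology and earned a Ph.D. in computational biology from the School of Interdisciplinary Bioscience and Bioengineering at the same university's graduate school. He has been a visiting researcher at Stanford University, a research assistant professor at the Cancer Research Institute of Seoul National University College of Medicine, a team leader at the Convergence Laboratory of the KT Advanced Institute of Technology, and a research assistant professor at the Biomedical Research Institute of Seoul National University Hospital. He has published about ten papers in leading scientific journals, including Science.
He founded and directs the Yoon Sup Choi Digital Healthcare Institute, the first institute in Korea dedicated to digital healthcare research. He is also co-founder and managing partner of Digital Healthcare Partners, Korea's only accelerator specializing in healthcare startups, where he discovers, invests in, and nurtures innovative healthcare startups together with medical experts. He also serves as a visiting professor in the Department of Digital Health at Sungkyunkwan University.
He invests in and advises healthcare startups including VUNO, Zikto, 3billion, Surgical Mind, Doctor Diary, VRAD, Medihere, Souling, and Mobile Doctor, working to bring healthcare innovation to Korea. He writes actively on Korea's first blog dedicated to digital healthcare, "Yoon Sup Choi's Healthcare Innovation," and contributes a column to the Maeil Business Newspaper. His books include "Healthcare Innovation: The Future Has Already Begun" and "So I Became a Company Myself."
•Blog: http://www.yoonsupchoi.com/
•Facebook: https://www.facebook.com/yoonsup.choi
•Email: yoonsup.choi@gmail.com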
The present and future of medical AI, as presented by Dr. Yoon Sup Choi, futurist of medicine
Where medical deep learning and IBM Watson stand today
Will AI replace doctors?
Price: 20,000 won
ISBN 979-11-86269-99-2

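Medical Artificial Intelligence
•Part 1: The Second Machine Age and Medical AI

•Part 2: The Past and Present of Medical AI

•Part 3: How Should We Meet the Future?

Korean Society of Radiology Spring Conference, June 2017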

Vinod Khosla
Founder, 1st CEO of Sun Microsystems
Partner of KPCB, CEO of Khosla Ventures
Legendary Venture Capitalist in Silicon Valley

“Technology will replace 80% of doctors”

https://www.youtube.com/watch?time_continue=70&v=2HMPRXstSvQ
“We should stop training radiologists right now.
It is obvious that within five years deep learning will outperform radiologists.”
Hinton on Radiology

• Associated Press: robots write articles in place of humans
• Can produce 2,000 articles per second
• Earnings coverage expanded from 300 companies to 3,000 companies

• 1978
• As part of the obscure task of “discovery” —
providing documents relevant to a lawsuit — the
studios examined six million documents at a
cost of more than $2.2 million, much of it to pay
for a platoon of lawyers and paralegals who
worked for months at high hourly rates.
• 2011
• Now, thanks to advances in artificial intelligence,
“e-discovery” software can analyze documents
in a fraction of the time for a fraction of the
cost.
• In January, for example, Blackstone Discovery of
Palo Alto, Calif., helped analyze 1.5 million
documents for less than $100,000.

“At its height back in 2000, the U.S. cash equities trading desk at
Goldman Sachs’s New York headquarters employed 600 traders,
buying and selling stock on the orders of the investment bank’s
large clients. Today there are just two equity traders left”

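• Japan's Fukoku Life Insurance decided to lay off more than 30 claims assessors and hand the work to IBM Watson Explorer
• Watson decides whether to pay out claims based on medical records
• Replacing staff with AI raised productivity by 30%
• ROI expected within two years
• Year 1: 140m yen
• Year 2: 200m yen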

No choice but to bring AI into medicine

Martin Duggan,“IBM Watson Health - Integrated Care & the Evolution to Cognitive Computing”

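• Weak AI (Artificial Narrow Intelligence)
  • AI that excels in one specific domain
  • Chess, quizzes, mail filtering, product recommendation, autonomous driving
• Strong AI (Artificial General Intelligence)
  • Human-level AI across all domains
  • Reasoning, planning, problem solving, abstraction, learning complex concepts
• Superintelligence (Artificial Super Intelligence)
  • AI that surpasses humans in every area, including science, technology, and social skills
  • “Any sufficiently advanced technology is indistinguishable from magic.” - Arthur C. Clarke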

[Figure: When will machines attain human-level intelligence? Cumulative share of respondents (y-axis) by year, 2010 to 2100 (x-axis), with the 10%, 50%, and 90% levels marked. Survey groups: PT-AI, Philosophy and Theory of AI (2011); AGI, Artificial General Intelligence (2012); EETN, Greek Association for Artificial Intelligence; TOP100, survey of the 100 most frequently cited authors (2013); Combined.]
Source: Superintelligence, Nick Bostrom (2014)

Superintelligence: Science or Fiction?
Panelists: Elon Musk (Tesla, SpaceX), Bart Selman (Cornell), Ray Kurzweil (Google),
David Chalmers (NYU), Nick Bostrom (FHI), Demis Hassabis (DeepMind), Stuart
Russell (Berkeley), Sam Harris, and Jaan Tallinn (CSER/FLI)
January 6-8, 2017, Asilomar, CA
https://brunch.co.kr/@kakao-it/49
https://www.youtube.com/watch?v=h0962biiZa4

Panel responses (January 6-8, 2017, Asilomar, CA)
Q1: Is the realm of superintelligence attainable at all?
Q2: Do you think an entity with superintelligence could emerge?
Q3: Do you hope superintelligence will be realized?
Panelist | Q1 | Q2 | Q3
Elon Musk | YES | YES | Complicated
Stuart Russell | YES | YES | Complicated
Bart Selman | YES | YES | Complicated
Ray Kurzweil | YES | YES | YES
David Chalmers | YES | YES | Complicated
Nick Bostrom | YES | YES | YES
Demis Hassabis | YES | YES | YES
Sam Harris | YES | YES | Complicated
Jaan Tallinn | YES | YES | Complicated

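Three types of medical AI
•Analyzing complex medical data and deriving insights

•Analyzing and interpreting medical imaging and pathology data

•Monitoring continuous data for prevention and prediction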

Jeopardy!
In 2011, IBM Watson faced two human champions in a quiz match and won decisively

IBM Watson on Medicine
Watson learned...
600,000 pieces of medical evidence
2 million pages of text from 42 medical journals and clinical trials
69 guidelines, 61,540 clinical trials
+
1,500 lung cancer cases
physician notes, lab results and clinical research
+
14,700 hours of hands-on training

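IBM Watson Health Chronicle
[Figure: timeline from 2011 to 2018 in two tracks, academia/healthcare and industry. The placement of each event on the time axis is not recoverable from the extracted text. Events shown:]
• Jeopardy! win
• Collaboration with Memorial Sloan Kettering (MSK) Cancer Center, New York (lung cancer)
• Collaboration with MD Anderson (leukemia)
• MD Anderson pilot results presented at ASCO
• Collaboration with New York Genome Center (glioblastoma analysis)
• Collaboration with Cleveland Clinic (cancer genome analysis)
• Collaboration with Mayo Clinic (clinical trial matching)
• Mayo Clinic clinical trial matching results announced
• Collaboration with the Broad Institute announced (genome analysis, anticancer drug resistance)
• Manipal Hospital, India, adopts WFO
• Manipal Hospital announces WFO accuracy
• First WFO paper
• University of Tokyo adopts WFO
• Bumrungrad International Hospital, Thailand, adopts WFO
• Jupiter Medical Center adopts Watson
• Korean adopters: Gachon University Gil Hospital, Pusan National University Hospital, Daegu Catholic University Hospital, Daegu Dongsan Hospital, Konyang University Hospital, Chosun University Hospital, Chonnam National University Hospital
• Korea Watson consortium launched
• Ministry of Food and Drug Safety AI guideline draft; guideline
• IBM Korea establishes a Watson business unit
• Watson Health launched
• Acquisitions: Phytel, Explorys, Merge Healthcare, Truven Health
• Partnerships: J&J, Apple, Medtronic; Epic Systems and Mayo Clinic (EHR analysis); Teva (pharmaceutical); Under Armour
• Watson Fund investments: Welltok, Modernizing Medicine, Pathway Genomics
• GeneMD wins the Watson Mobile Developer Challenge
• Pathway Genomics OME closed alpha service begins
• Blood glucose management app demonstrated with Medtronic
• Medtronic launches Sugar.IQ
• Sleep study via Apple ResearchKit begins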


Annals of Oncology (2016) 27 (suppl_9): ix179-ix180. 10.1093/annonc/mdw601
Validation study to assess performance of IBM cognitive
computing system Watson for oncology with Manipal
multidisciplinary tumour board for 1000 consecutive cases:  
An Indian experience
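•Compared the "concordance rate" between physicians' and WFO's recommendations for 1,000 cancer patients at Manipal Hospital, India

•Breast cancer 638, colon cancer 126, rectal cancer 124, lung cancer 112

•Physician-Watson concordance

  •Recommended (50%), for consideration (28%), not recommended (17%)

  •5% of the physicians' treatment plans were not offered among Watson's recommendations

•Concordance differed by cancer type

  •Colon cancer (85%), lung cancer (17.8%)

  •Triple-negative breast cancer (67.9%), HER2-negative breast cancer (35%)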

San Antonio Breast Cancer Symposium, December 6-10, 2016
Concordance of WFO (@T2) and MMDT (@T1* vs. T2**) (N = 638 breast cancer cases)
Time point | REC, n | REC, % | REC + FC, n | REC + FC, %
T1* | 296 | 46 | 463 | 73
T2** | 381 | 60 | 574 | 90
* T1: time of the original treatment decision by MMDT in the past (last 1-3 years)
** T2: time (2016) of WFO's treatment advice and of MMDT's treatment decision upon blinded re-review of non-concordant cases

WFO in ASCO 2017
Performance of WFO in India
• Early experience with IBM WFO cognitive computing system for lung and colorectal cancer treatment (Manipal Hospital)
• Past three years: lung cancer (112), colon cancer (126), rectum cancer (124)
• lung cancer: localized 88.9%, metastatic 97.9%
• colon cancer: localized 85.5%, metastatic 76.6%
• rectum cancer: localized 96.8%, metastatic 80.6%
2017 ASCO Annual Meeting, J Clin Oncol 35, 2017 (suppl; abstr 8527)

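WFO in ASCO 2017
•Results of applying Watson to colon cancer and gastric cancer patients at Gachon University Gil Hospital

  • 340 colon cancer patients (stage II-IV)

  • 185 advanced gastric cancer patients (retrospective)
• Concordance with physicians

  • Colon cancer patients: 73%

    • 250 patients who received adjuvant chemotherapy: 85%

    • 90 metastatic patients: 40%
  • Gastric cancer patients: 49%

    • Trastuzumab/FOLFOX is not reimbursed under Korea's National Health Insurance

    • S-1 (tegafur, gimeracil and oteracil) + cisplatin:

      • Very routine in Korea; not used in the US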
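
The subgroup figures on this slide can be cross-checked against the overall colon cancer rate, because the overall concordance should equal the patient-weighted average of the adjuvant and metastatic subgroups. A minimal sketch in Python, assuming the subgroup rates are exactly the rounded values shown on the slide (the function name `pooled_rate` is illustrative and not from the source):

```python
def pooled_rate(groups):
    """Patient-weighted average rate.

    groups: list of (n_patients, rate) pairs, with rate between 0 and 1.
    """
    total = sum(n for n, _ in groups)
    agreed = sum(n * rate for n, rate in groups)
    return agreed / total


# Colon cancer subgroups from the slide (rates are rounded values).
colon_subgroups = [
    (250, 0.85),  # adjuvant chemotherapy
    (90, 0.40),   # metastatic
]

overall = pooled_rate(colon_subgroups)
print(f"Overall colon cancer concordance: {overall:.1%}")
```

The result, about 73.1%, is consistent with the 73% reported for all 340 colon cancer patients.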

ORIGINAL ARTICLE
Watson for Oncology and breast cancer treatment
recommendations: agreement with an expert
multidisciplinary tumor board
S. P. Somashekhar1*, M.-J. Sepúlveda2, S. Puglielli3, A. D. Norden3, E. H. Shortliffe4, C. Rohit Kumar1, A. Rauthan1, N. Arun Kumar1, P. Patil1, K. Rhee3 & Y. Ramya1
1 Manipal Comprehensive Cancer Centre, Manipal Hospital, Bangalore, India; 2 IBM Research (Retired), Yorktown Heights; 3 Watson Health, IBM Corporation, Cambridge; 4 Department of Surgical Oncology, College of Health Solutions, Arizona State University, Phoenix, USA
*Correspondence to: Prof. Sampige Prasannakumar Somashekhar, Manipal Comprehensive Cancer Centre, Manipal Hospital, Old Airport Road, Bangalore 560017, Karnataka, India. Tel: +91-9845712012; Fax: +91-80-2502-3759; E-mail: somashekhar.sp@manipalhospitals.com
Background: Breast cancer oncologists are challenged to personalize care with rapidly changing scientific evidence, drug
approvals, and treatment guidelines. Artificial intelligence (AI) clinical decision-support systems (CDSSs) have the potential to
help address this challenge. We report here the results of examining the level of agreement (concordance) between treatment
recommendations made by the AI CDSS Watson for Oncology (WFO) and a multidisciplinary tumor board for breast cancer.
Patients and methods: Treatment recommendations were provided for 638 breast cancers between 2014 and 2016 at the
Manipal Comprehensive Cancer Center, Bengaluru, India. WFO provided treatment recommendations for the identical cases in
2016. A blinded second review was carried out by the center’s tumor board in 2016 for all cases in which there was not
agreement, to account for treatments and guidelines not available before 2016. Treatment recommendations were considered
concordant if the tumor board recommendations were designated ‘recommended’ or ‘for consideration’ by WFO.
Results: Treatment concordance between WFO and the multidisciplinary tumor board occurred in 93% of breast cancer cases.
Subgroup analysis found that patients with stage I or IV disease were less likely to be concordant than patients with stage II or III
disease. Increasing age was found to have a major impact on concordance. Concordance declined significantly (P ≤ 0.02;
P < 0.001) in all age groups compared with patients <45 years of age, except for the age group 55–64 years. Receptor status
was not found to affect concordance.
Conclusion: Treatment recommendations made by WFO and the tumor board were highly concordant for breast cancer cases
examined. Breast cancer stage and patient age had significant influence on concordance, while receptor status alone did not.
This study demonstrates that the AI clinical decision-support system WFO may be a helpful tool for breast cancer treatment
decision making, especially at centers where expert breast cancer resources are limited.
Key words: Watson for Oncology, artiﬁcial intelligence, cognitive clinical decision-support systems, breast cancer,
concordance, multidisciplinary tumor board
Introduction
Oncologists who treat breast cancer are challenged by a large and
rapidly expanding knowledge base [1, 2]. As of October 2017, for
example, there were 69 FDA-approved drugs for the treatment of
breast cancer, not including combination treatment regimens
[3]. The growth of massive genetic and clinical databases, along
with computing systems to exploit them, will accelerate the speed
of breast cancer treatment advances and shorten the cycle time
for changes to breast cancer treatment guidelines [4, 5]. In addition,
these information management challenges in cancer care
are occurring in a practice environment where there is little time
available for tracking and accessing relevant information at the
point of care [6]. For example, a study that surveyed 1117 oncologists
reported that on average 4.6 h per week were spent keeping
© The Author(s) 2018. Published by Oxford University Press on behalf of the European Society for Medical Oncology.
All rights reserved. For permissions, please email: journals.permissions@oup.com.
Annals of Oncology 29: 418–423, 2018
doi:10.1093/annonc/mdx781
Published online 9 January 2018
•Annals of Oncology, January 2018

•The first and only paper on WFO accuracy published in a peer-reviewed journal

•Authors include IBM Chief Health Officer Dr. Kyu Rhee

Table 2. MMDT and WFO recommendations after the initial and blinded second reviews of breast cancer cases (N = 638)
Review | Concordant: recommended, n (%) | Concordant: for consideration, n (%) | Concordant: total, n (%) | Non-concordant: not recommended, n (%) | Non-concordant: not available, n (%) | Non-concordant: total, n (%)
Initial review (T1MMDT versus T2WFO) | 296 (46) | 167 (26) | 463 (73) | 137 (21) | 38 (6) | 175 (27)
Second review (T2MMDT versus T2WFO) | 397 (62) | 194 (30) | 591 (93) | 36 (5) | 11 (2) | 47 (7)
T1MMDT, original MMDT recommendation from 2014 to 2016; T2WFO, WFO advisor treatment recommendation in 2016; T2MMDT, MMDT treatment recommendation in 2016; MMDT, Manipal multidisciplinary tumor board; WFO, Watson for Oncology.
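
The concordance definition in the abstract (a case counts as concordant when the tumor board's choice is labeled "recommended" or "for consideration" by WFO) can be applied directly to the counts in Table 2. A minimal sketch in Python; the class name `Review` and its field names are illustrative and not from the paper:

```python
from dataclasses import dataclass


@dataclass
class Review:
    """Case counts per WFO category for one review round (Table 2)."""
    recommended: int
    for_consideration: int
    not_recommended: int
    not_available: int

    @property
    def total(self) -> int:
        return (self.recommended + self.for_consideration
                + self.not_recommended + self.not_available)

    @property
    def concordant(self) -> int:
        # Concordant = "recommended" or "for consideration".
        return self.recommended + self.for_consideration

    @property
    def concordance(self) -> float:
        return self.concordant / self.total


initial = Review(296, 167, 137, 38)  # T1 MMDT versus T2 WFO
second = Review(397, 194, 36, 11)    # T2 MMDT versus T2 WFO

for name, review in [("initial", initial), ("second", second)]:
    print(f"{name}: {review.concordant}/{review.total} = {review.concordance:.0%}")
```

Both rounds sum to 638 cases, and the computed rates round to the 73% and 93% reported in the table.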
[Figure 1: stacked bars showing the share of cases in each WFO category (not available, not recommended, for consideration, recommended), overall and by stage. Concordance: overall (n=638) 93%; stage I (n=61) 80%; stage II (n=262) 97%; stage III (n=191) 95%; stage IV (n=124) 86%.]
Figure 1. Treatment concordance between WFO and the MMDT overall and by stage. MMDT, Manipal multidisciplinary tumor board; WFO, Watson for Oncology.
[Figure 2: stacked bars showing the share of cases in each WFO category (not applicable, not recommended, for consideration, recommended) for non-metastatic and metastatic disease within each receptor group: HR(+), HER2/neu(+), and Triple(–). Concordance values shown, in panel order: 95%, 75%, 94%, 98%, 94%, 85%.]
Figure 2. Treatment concordance between WFO and the MMDT by stage and receptor status. HER2/neu, human epidermal growth factor receptor 2; HR, hormone receptor; MMDT, Manipal multidisciplinary tumor board; WFO, Watson for Oncology.

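Tentative conclusions
•Concordance between Watson for Oncology and physicians:

  •differs by cancer type.

  •differs by stage even within the same cancer type.

  •differs by hospital and by country even for the same cancer type.

  •may change over time.

Principles are needed
•For which patients should Watson be consulted?

•How far should Watson be trusted (by cancer type)?

•Should Watson's opinion be disclosed to the patient?

•What should be done when Watson and the medical team disagree?

•Can the use of Watson be reimbursed by insurance?
Quality of care and treatment outcomes may vary with these criteria,

yet each hospital currently applies its own.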

“Empowering the Oncology Community for Cancer Care”
Genomics
Oncology
Clinical Trial Matching
Watson Health’s oncology clients span more than 35 hospital systems
Andrew Norden, KOTRA Conference, March 2017, “The Future of Health is Cognitive”

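Three types of medical AI
• Analyzing complex medical data and deriving insights
• Analyzing and interpreting medical imaging and pathology data
• Monitoring continuous data for prevention and prediction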

Deep Learning
http://theanalyticsstore.ie/deep-learning/

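The relationship between AI and deep learning
• Artificial intelligence: machine learning, expert systems, cybernetics, …
• Machine learning: artificial neural networks, decision trees, support vector machines, …
• Deep learning: convolutional neural networks (CNN), recurrent neural networks (RNN), …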

Facebook's DeepFace
Taigman, Y. et al. (2014). DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR’14.
Figure 2. Outline of the DeepFace architecture. A front-end of a single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Colors illustrate feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.
[…] very few parameters. These layers merely expand the input into a set of simple local features.
The subsequent layers (L4, L5 and L6) are instead locally connected [13, 16], like a convolutional layer they apply a filter bank, but every location in the feature map learns a different set of filters. Since different regions of an aligned image have different local statistics, the spatial stationarity […]
The goal of training is to maximize the probability of the correct class (face id). We achieve this by minimizing the cross-entropy loss for each training sample. If k is the index of the true label for a given input, the loss is: $L = -\log p_k$. The loss is minimized over the parameters by computing the gradient of L w.r.t. the parameters and […]
Human: 95% vs. DeepFace in Facebook: 97.35%
Recognition accuracy on the Labeled Faces in the Wild (LFW) dataset (13,233 images, 5,749 people)
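
The cross-entropy loss quoted in the DeepFace excerpt, the negative log of the probability assigned to the true class k, can be illustrated in a few lines. This is a minimal sketch in plain Python, assuming the class probabilities come from a softmax over raw scores; it is not the authors' implementation, and the function names are illustrative:

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, k):
    """Loss L = -log p_k, where k is the index of the true class."""
    p = softmax(logits)
    return -math.log(p[k])


# Toy example with four identities; the true identity is index 2.
uncertain = cross_entropy([0.0, 0.0, 0.0, 0.0], k=2)   # uniform guess
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], k=2)   # strong correct score
print(f"uniform guess: {uncertain:.4f}, confident correct: {confident:.4f}")
```

A uniform guess over four classes gives a loss of log 4, and the loss falls toward zero as the score of the correct class rises, which is why minimizing it maximizes the probability of the correct identity.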

Schroff, F. et al. (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering
Human: 95% vs. FaceNet of Google: 99.63%
Figure 6. LFW errors. This shows all pairs of images that were
incorrectly classified on LFW. Only eight of the 13 errors shown
here are actual errors the other four are mislabeled in LFW.
5.7. Performance on Youtube Faces DB
We use the average similarity of all pairs of the first one
hundred frames that our face detector detects in each video.
This gives us a classification accuracy of 95.12%±0.39.
Using the first one thousand frames results in 95.18%.
Compared to [17] 91.4% who also evaluate one hundred
frames per video we reduce the error rate by almost half.
DeepId2+ [15] achieved 93.2% and our method reduces this
error by 30%, comparable to our improvement on LFW.
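
The "almost half" and "30%" claims in the paragraph above are relative reductions in error rate, not differences in accuracy. A minimal sketch in Python that reproduces the arithmetic from the accuracies quoted in the excerpt (the function name is illustrative):

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline error rate removed by the new method."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


facenet_acc = 0.9512  # FaceNet on YouTube Faces DB, first 100 frames

vs_ref17 = relative_error_reduction(0.914, facenet_acc)     # method [17]
vs_deepid2 = relative_error_reduction(0.932, facenet_acc)   # DeepId2+ [15]

print(f"vs [17]: {vs_ref17:.1%}, vs DeepId2+: {vs_deepid2:.1%}")
```

The results, roughly 43% and 28%, match the excerpt's rounded descriptions of "almost half" and "30%".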
5.8. Face Clustering
Our compact embedding lends itself to be used in order
to cluster a users personal photos into groups of people with
the same identity. The constraints in assignment imposed
by clustering faces, compared to the pure verification task,
lead to truly amazing results. Figure 7 shows one cluster in
a users personal photo collection, generated using agglomerative
clustering. It is a clear showcase of the incredible
invariance to occlusion, lighting, pose and even age.
Figure 7. Face Clustering. Shown is an exemplar cluster for one
user. All these images in the users personal photo collection were
clustered together.
6. Summary
We provide a method to directly learn an embedding into
an Euclidean space for face verification. This sets it apart
from other methods [15, 17] who use the CNN bottleneck
layer, or require additional post-processing such as concatenation
of multiple models and PCA, as well as SVM classification.
Our end-to-end training both simplifies the setup
and shows that directly optimizing a loss relevant to the task
at hand improves performance.
Another strength of our model is that it only requires
Google's FaceNet

Baidu's face recognition AI
Jingtuo Liu (2015) Targeting Ultimate Accuracy: Face Recognition via Deep Embedding
Human: 95% vs. Baidu: 99.77%
Although several algorithms have achieved nearly perfect accuracy in the 6000-pair verification task, a more practical […] can achieve 95.8% identification rate, relatively reducing the error rate by about 77%.
TABLE 3. COMPARISONS WITH OTHER METHODS ON SEVERAL EVALUATION TASKS
Method | Pair-wise Accuracy (%) | Rank-1 (%) | DIR (%) @ FAR = 1% | Verification (%) @ FAR = 0.1% | Open-set Identification (%) @ Rank = 1, FAR = 0.1%
IDL Ensemble Model | 99.77 | 98.03 | 95.8 | 99.41 | 92.09
IDL Single Model | 99.68 | 97.60 | 94.12 | 99.11 | 89.08
FaceNet [12] | 99.63 | NA | NA | NA | NA
DeepID3 [9] | 99.53 | 96.00 | 81.40 | NA | NA
Face++ [2] | 99.50 | NA | NA | NA | NA
Facebook [15] | 98.37 | 82.5 | 61.9 | NA | NA
Learning from Scratch [4] | 97.73 | NA | NA | 80.26 | 28.90
HighDimLBP [10] | 95.17 | NA | NA | 41.66 (reported in [4]) | 18.07 (reported in [4])
[Figure: misjudged image pairs with their scores, e.g. pair #113 (-0.060), pair #202 (-0.022), pair #656 (-0.034), pair #1230 (-0.031), pair #1862 (-0.073), pair #2499 (-0.091), pair #2551 (-0.024), pair #2552 (-0.036), pair #2610 (-0.089).]
• Of 6,000 pairs of face photos, Baidu's AI misjudged only 14 pairs

• It turned out that for 5 of those 14 pairs the ground-truth label itself was wrong, and the AI was actually correct (red box)
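
The 99.77% pair-wise accuracy follows directly from the error count on the slide. A minimal sketch in Python; the second calculation, which treats the 5 mislabeled pairs as correct answers, is an illustration under that assumption and is not a figure reported in the source:

```python
def accuracy(n_pairs, n_errors):
    """Share of pairs judged correctly."""
    return (n_pairs - n_errors) / n_pairs


N_PAIRS = 6000      # LFW verification pairs
N_MISJUDGED = 14    # pairs the model got "wrong" against the labels
N_MISLABELED = 5    # of those, pairs where the label itself was wrong

reported = accuracy(N_PAIRS, N_MISJUDGED)
# Assumption for illustration: count the mislabeled pairs as correct.
adjusted = accuracy(N_PAIRS, N_MISJUDGED - N_MISLABELED)

print(f"reported: {reported:.2%}, adjusted for label errors: {adjusted:.2%}")
```

At this level of accuracy, label noise in the benchmark accounts for a visible share of the remaining errors.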

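•AI that reads a hand X-ray image and calculates the patient's bone age

  • Traditionally, physicians read by comparing the X-ray with standard images, using methods such as Greulich-Pyle

  • The AI finds sex- and age-specific patterns in the reference standard images, shows the similarity as a probability, and retrieves the matching standard images

•Can help physicians diagnose precocious puberty or growth delay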

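Press Release
First approval of a domestically developed artificial intelligence (AI)-based medical device
- AI technology used to read bone age -
The Ministry of Food and Drug Safety (Minister Ryu Young-jin) announced that on […] it approved VUNO Med-BoneAge, medical image analysis software that applies AI technology and was developed by the domestic medical device company VUNO Inc.
VUNO Med-BoneAge is software in which AI analyzes an X-ray image and presents the patient's bone age, so that physicians can use this information to help diagnose precocious puberty or growth delay.
Until now, physicians read bone age manually by comparing an X-ray of the patient's left hand against reference standard images; the software automates this step and shortens reading time.
The product was selected in […] as a target for the guideline on approval and review of medical devices using big data and AI technology, and received tailored support from clinical trial design through approval.
VUNO Med-BoneAge was approved for the purpose of helping medical professionals judge a patient's bone age by analyzing an X-ray image of the patient's left hand.
For the analysis, the AI recognizes patterns in the acquired X-ray image, finds sex- and age-specific patterns in the bone age model reference standard images, which are classified by sex (male: […]; female: […]), and displays the similarity as a probability. The physician then combines the probability values, hormone levels, and other information to diagnose precocious puberty or growth delay.
In a clinical trial evaluating the product's accuracy (performance), its results differed from the bone age judged by physicians by an average of […] months. The product is designed so that the manufacturer periodically updates the image data from which the AI learns, narrowing the gap with physicians.
Including VUNO Med-BoneAge, […] clinical trial plans for AI-based medical devices have been approved to date.
The AI-based medical devices with approved clinical trials are software that classifies the type of cerebral infarction from magnetic resonance images ([…]) and software that assists lung nodule diagnosis from X-ray images ([…]).
For reference, to support rapid development of medical devices related to the Fourth Industrial Revolution, such as AI, virtual reality, and 3D printing, the Ministry operates programs including the Next-Generation […] Project and the approval helper for newly developed medical devices, which provide tailored support across the entire process from research and development to clinical trials and approval.
The Ministry said this approval should help analyze and determine individuals' bone age quickly, and that it will continue to actively support the development of advanced medical devices.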

Disclosure: I serve as an advisor to VUNO and hold an equity stake in the company.

AJR:209, December 2017 1
Since 1992, concerns regarding interob-
server variability in manual bone age esti-
mation [4] have led to the establishment of
several automatic computerized methods for
bone age estimation, including computer-as-
sisted skeletal age scores, computer-aided
skeletal maturation assessment systems, and
BoneXpert (Visiana) [5–14]. BoneXpert was
developed according to traditional machine-
learning techniques and has been shown to
have a good performance for patients of var-
ious ethnicities and in various clinical set-
tings [10–14]. The deep-learning technique
is an improvement in artificial neural net-
works. Unlike traditional machine-learning
techniques, deep-learning techniques allow
an algorithm to program itself by learning
from the images given a large dataset of la-
beled examples, thus removing the need to
specify rules [15].
Deep-learning techniques permit higher
levels of abstraction and improved predic-
tions from data. Deep-learning techniques
Computerized Bone Age
Estimation Using Deep Learning–
Based Program: Evaluation of the
Accuracy and Efficiency
Jeong Rye Kim1
Woo Hyun Shim1
Hee Mang Yoon1
Sang Hyup Hong1
Jin Seong Lee1
Young Ah Cho1
Sangki Kim2
Kim JR, Shim WH, Yoon MH, et al.
1
Department of Radiology and Research Institute of
Radiology, Asan Medical Center, University of Ulsan
College of Medicine, 88 Olympic-ro 43-gil, Songpa-gu,
Seoul 05505, South Korea. Address correspondence to
H. M. Yoon (espoirhm@gmail.com).
2
Vuno Research Center, Vuno Inc., Seoul, South Korea.
Pediatric Imaging • Original Research
Supplemental Data
Available online at www.ajronline.org.
AJR 2017; 209:1–7
0361–803X/17/2096–1
© American Roentgen Ray Society
B
one age estimation is crucial for
developmental status determina-
tions and ultimate height predic-
tions in the pediatric population,
particularly for patients with growth disor-
ders and endocrine abnormalities [1]. Two
major left-hand wrist radiograph-based
methods for bone age estimation are current-
ly used: the Greulich-Pyle [2] and Tanner-
Whitehouse [3] methods. The former is much
more frequently used in clinical practice.
Greulich-Pyle–based bone age estimation is
performed by comparing a patient’s left-hand
radiograph to standard radiographs in the
Greulich-Pyle atlas and is therefore simple
and easily applied in clinical practice. How-
ever, the process of bone age estimation,
which comprises a simple comparison of
multiple images, can be repetitive and time
consuming and is thus sometimes burden-
some to radiologists. Moreover, the accuracy
depends on the radiologist’s experience and
tends to be subjective.
Keywords: bone age, children, deep learning, neural
network model
DOI:10.2214/AJR.17.18224
J. R. Kim and W. H. Shim contributed equally to this work.
Received March 12, 2017; accepted after revision
July 7, 2017.
S. Kim is employed by Vuno, Inc., which created the deep
learning–based automatic software system for bone
age determination. J. R. Kim, W. H. Shim, H. M. Yoon,
S. H. Hong, J. S. Lee, and Y. A. Cho are employed by
Asan Medical Center, which holds patent rights for the
deep learning–based automatic software system for
bone age assessment.
OBJECTIVE. The purpose of this study is to evaluate the accuracy and efficiency of a new automatic software system for bone age assessment and to validate its feasibility in clinical practice.
MATERIALS AND METHODS. A Greulich-Pyle method–based deep-learning technique was used to develop the automatic software system for bone age determination. Using this software, bone age was estimated from left-hand radiographs of 200 patients (3–17 years old) using first-rank bone age (software only), computer-assisted bone age (two radiologists with software assistance), and Greulich-Pyle atlas–assisted bone age (two radiologists with Greulich-Pyle atlas assistance only). The reference bone age was determined by the consensus of two experienced radiologists.
RESULTS. First-rank bone ages determined by the automatic software system showed a 69.5% concordance rate and significant correlations with the reference bone age (r = 0.992; p < 0.001). Concordance rates increased with the use of the automatic software system for both reviewer 1 (63.0% for Greulich-Pyle atlas–assisted bone age vs 72.5% for computer-assisted bone age) and reviewer 2 (49.5% for Greulich-Pyle atlas–assisted bone age vs 57.5% for computer-assisted bone age). Reading times were reduced by 18.0% and 40.0% for reviewers 1 and 2, respectively.
CONCLUSION. Automatic software system showed reliably accurate bone age estimations and appeared to enhance efficiency by reducing reading times without compromising the diagnostic accuracy.
Kim et al.
Accuracy and Efficiency of Computerized Bone Age Estimation
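• Total number of patients: 200

• Reference: consensus of two experienced pediatric radiologists (18 and 4 years of experience)

• Doctor A: radiologist subspecialized in pediatric imaging (more than 500 bone age readings)

• Doctor B: second-year radiology resident (one day of training in the reading method plus 20 practice readings)

• AI: VUNO's deep learning model for bone age reading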
AJR Am J Roentgenol. 2017 Dec;209(6):1374-1380.

[Bar chart: AI vs. doctors, accuracy (%)]
AI: 69.5%
Doctor A (radiology fellow, pediatric imaging subspecialty): 63%
Doctor B (second-year radiology resident): 49.5%

Synergy between human doctors and AI in bone age reading
[Bar charts: AI vs. doctors, and AI + doctors, accuracy (%)]
AI: 69.5%
Doctor A (radiology fellow, pediatric imaging subspecialty): 63% alone, 72.5% with AI
Doctor B (second-year radiology resident): 49.5% alone, 57.5% with AI

Synergy between human doctors and AI in bone age reading

[Bar charts: total reading time (minutes), without AI vs. with AI]
Doctor A: 188 min without AI, 154 min with AI (saving 18% of time)
Doctor B: 180 min without AI, 108 min with AI (saving 40% of time)

Using AI for bone age reading can also reduce reading time
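The time savings on this slide follow directly from the total reading times. The short sketch below recomputes them. The pairing of each pair of bars with Doctor A or Doctor B is inferred from the percentages on the slide, and the function name `time_saving` is only an illustrative choice.

```python
def time_saving(minutes_without_ai, minutes_with_ai):
    """Fraction of total reading time saved when the reader uses the software."""
    return (minutes_without_ai - minutes_with_ai) / minutes_without_ai


# Total reading times in minutes (without AI, with AI), as shown on the slide.
# Assumption: 188 -> 154 belongs to Doctor A and 180 -> 108 to Doctor B.
readers = {"Doctor A": (188, 154), "Doctor B": (180, 108)}
savings = {name: time_saving(*minutes) for name, minutes in readers.items()}

for name, fraction in savings.items():
    print(f"{name}: {fraction:.1%} of reading time saved")
```

The less experienced reader gains the most in relative terms, which matches the 18% and 40% labels on the slide.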





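AI for diabetic retinopathy reading

Diabetic retinopathy
• A major complication of diabetes: develops in 90% of patients with a diabetes history of 30 years or more

• Ophthalmologists photograph the fundus (the inner back surface of the eye) and read the images

• Diagnosis is based on the extent of new microvessel formation, hemorrhage, and exudates in the retina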

Case Study: TensorFlow in Medicine - Retinal Imaging (TensorFlow Dev Summit 2017)

Copyright 2016 American Medical Association. All rights reserved.
Development and Validation of a Deep Learning Algorithm
for Detection of Diabetic Retinopathy
in Retinal Fundus Photographs
Varun Gulshan, PhD; Lily Peng, MD, PhD; Marc Coram, PhD; Martin C. Stumpe, PhD; Derek Wu, BS; Arunachalam Narayanaswamy, PhD;
Subhashini Venugopalan, MS; Kasumi Widner, MS; Tom Madams, MEng; Jorge Cuadros, OD, PhD; Ramasamy Kim, OD, DNB;
Rajiv Raman, MS, DNB; Philip C. Nelson, BS; Jessica L. Mega, MD, MPH; Dale R. Webster, PhD
IMPORTANCE Deep learning is a family of computational methods that allow an algorithm to
program itself by learning from a large set of examples that demonstrate the desired
behavior, removing the need to specify rules explicitly. Application of these methods to
medical imaging requires further assessment and validation.
OBJECTIVE To apply deep learning to create an algorithm for automated detection of diabetic
retinopathy and diabetic macular edema in retinal fundus photographs.
DESIGN AND SETTING A specific type of neural network optimized for image classification
called a deep convolutional neural network was trained using a retrospective development
data set of 128 175 retinal images, which were graded 3 to 7 times for diabetic retinopathy,
diabetic macular edema, and image gradability by a panel of 54 US licensed ophthalmologists
and ophthalmology senior residents between May and December 2015. The resultant
algorithm was validated in January and February 2016 using 2 separate data sets, both
graded by at least 7 US board-certified ophthalmologists with high intragrader consistency.
EXPOSURE Deep learning–trained algorithm.
MAIN OUTCOMES AND MEASURES The sensitivity and specificity of the algorithm for detecting
referable diabetic retinopathy (RDR), defined as moderate and worse diabetic retinopathy,
referable diabetic macular edema, or both, were generated based on the reference standard
of the majority decision of the ophthalmologist panel. The algorithm was evaluated at 2
operating points selected from the development set, one selected for high specificity and
another for high sensitivity.
RESULTS The EyePACS-1 data set consisted of 9963 images from 4997 patients (mean age, 54.4 years; 62.2% women; prevalence of RDR, 683/8878 fully gradable images [7.8%]); the Messidor-2 data set had 1748 images from 874 patients (mean age, 57.6 years; 42.6% women; prevalence of RDR, 254/1745 fully gradable images [14.6%]). For detecting RDR, the algorithm had an area under the receiver operating curve of 0.991 (95% CI, 0.988-0.993) for EyePACS-1 and 0.990 (95% CI, 0.986-0.995) for Messidor-2. Using the first operating cut point with high specificity, for EyePACS-1, the sensitivity was 90.3% (95% CI, 87.5%-92.7%) and the specificity was 98.1% (95% CI, 97.8%-98.5%). For Messidor-2, the sensitivity was 87.0% (95% CI, 81.1%-91.0%) and the specificity was 98.5% (95% CI, 97.7%-99.1%). Using a second operating point with high sensitivity in the development set, for EyePACS-1 the sensitivity was 97.5% and specificity was 93.4% and for Messidor-2 the sensitivity was 96.1% and specificity was 93.9%.
CONCLUSIONS AND RELEVANCE In this evaluation of retinal fundus photographs from adults
with diabetes, an algorithm based on deep machine learning had high sensitivity and
specificity for detecting referable diabetic retinopathy. Further research is necessary to
determine the feasibility of applying this algorithm in the clinical setting and to determine
whether use of the algorithm could lead to improved care and outcomes compared with
current ophthalmologic assessment.
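The abstract describes evaluating the algorithm at two operating points chosen on the development set, one favoring specificity and one favoring sensitivity. A minimal sketch of how such thresholds could be picked from a model's scores follows. It is not the procedure used by Gulshan et al.; the function names, the toy scores, and the target values are all illustrative assumptions.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_operating_point(scores, labels, metric, target):
    """Threshold that reaches `target` on `metric` ("sensitivity" or
    "specificity") while keeping the other metric as high as possible."""
    best = None  # (threshold, value of the other metric)
    for t in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, t)
        met, other = (sens, spec) if metric == "sensitivity" else (spec, sens)
        if met >= target and (best is None or other > best[1]):
            best = (t, other)
    return None if best is None else best[0]


# Hypothetical development-set scores (1 = referable diabetic retinopathy).
dev_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
dev_labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

high_sens_t = pick_operating_point(dev_scores, dev_labels, "sensitivity", 1.0)
high_spec_t = pick_operating_point(dev_scores, dev_labels, "specificity", 1.0)
print(high_sens_t, sensitivity_specificity(dev_scores, dev_labels, high_sens_t))
print(high_spec_t, sensitivity_specificity(dev_scores, dev_labels, high_spec_t))
```

Fixing both thresholds on development data and only then applying them to the validation sets keeps the reported sensitivity and specificity from being tuned on the test images.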
JAMA. doi:10.1001/jama.2016.17216
Published online November 29, 2016.
Editorial
Supplemental content
Author Affiliations: Google Inc,
Mountain View, California (Gulshan,
Peng, Coram, Stumpe, Wu,
Narayanaswamy, Venugopalan,
Widner, Madams, Nelson, Webster);
Department of Computer Science,
University of Texas, Austin
(Venugopalan); EyePACS LLC,
San Jose, California (Cuadros); School
of Optometry, Vision Science
Graduate Group, University of
California, Berkeley (Cuadros);
Aravind Medical Research
Foundation, Aravind Eye Care
System, Madurai, India (Kim); Shri
Bhagwan Mahavir Vitreoretinal
Services, Sankara Nethralaya,
Chennai, Tamil Nadu, India (Raman);
Verily Life Sciences, Mountain View,
California (Mega); Cardiovascular
Division, Department of Medicine,
Brigham and Women’s Hospital and
Harvard Medical School, Boston,
Massachusetts (Mega).
Corresponding Author: Lily Peng,
MD, PhD, Google Research, 1600
Amphitheatre Way, Mountain View,
CA 94043 (lhpeng@google.com).
Research
JAMA | Original Investigation | INNOVATIONS IN HEALTH CARE DELIVERY
Published in one of the world's top medical journals

Development of the fundus-reading AI
• A CNN was trained retrospectively on 128,175 fundus images

• The data were graded 3 to 7 times each by 54 US ophthalmologists

• The AI's readings were compared with those of 7 to 8 highly qualified ophthalmologists

• EyePACS-1 (9,963 images), Messidor-2 (1,748 images)

eFigure 2. Screenshot of the Second Screen of the Grading Tool, Which Asks Graders to Assess the Image for DR, DME and Other Notable Conditions or Findings
a) Fullscreen mode
b) Hit reset to reload this image. This will reset all of the grading.
c) Comment box for other pathologies you see

• AUC on EyePACS-1 and Messidor-2 = 0.991 and 0.990, respectively
• Sensitivity and specificity on par with the 7 to 8 ophthalmologists
• F-score: 0.95 (vs. 0.91 for the human doctors)
Additional sensitivity analyses were conducted for sev- effects of data set size on algorithm performance were exam-
Figure 2. Validation Set Performance for Referable Diabetic Retinopathy
[ROC plots; x-axis: 1 – Specificity, %; y-axis: Sensitivity, %; the high-sensitivity and high-specificity operating points are marked on each curve.]
A, EyePACS-1: AUC, 99.1%; 95% CI, 98.8%-99.3%
B, Messidor-2: AUC, 99.0%; 95% CI, 98.6%-99.5%
Performance of the algorithm (black curve) and ophthalmologists (colored
circles) for the presence of referable diabetic retinopathy (moderate or worse
diabetic retinopathy or referable diabetic macular edema) on A, EyePACS-1
(8788 fully gradable images) and B, Messidor-2 (1745 fully gradable images).
The black diamonds on the graph correspond to the sensitivity and specificity of
the algorithm at the high-sensitivity and high-specificity operating points.
In A, for the high-sensitivity operating point, specificity was 93.4% (95% CI,
92.8%-94.0%) and sensitivity was 97.5% (95% CI, 95.8%-98.7%); for the
high-specificity operating point, specificity was 98.1% (95% CI, 97.8%-98.5%)
and sensitivity was 90.3% (95% CI, 87.5%-92.7%). In B, for the high-sensitivity
operating point, specificity was 93.9% (95% CI, 92.4%-95.3%) and sensitivity
was 96.1% (95% CI, 92.4%-98.3%); for the high-specificity operating point,
specificity was 98.5% (95% CI, 97.7%-99.1%) and sensitivity was 87.0% (95%
CI, 81.1%-91.0%). There were 8 ophthalmologists who graded EyePACS-1 and 7
ophthalmologists who graded Messidor-2. AUC indicates area under the
receiver operating characteristic curve.
Research Original Investigation Accuracy of a Deep Learning Algorithm for Detection of Diabetic Retinopathy
Accuracy of the fundus-reading AI

00 MONTH 2017 | VOL 000 | NATURE | 1
LETTER doi:10.1038/nature21056
Dermatologist-level classification of skin cancer with deep neural networks
Andre Esteva1*, Brett Kuprel1*, Roberto A. Novoa2,3, Justin Ko2, Susan M. Swetter2,4, Helen M. Blau5 & Sebastian Thrun6
Skin cancer, the most common human malignancy1–3, is primarily
diagnosed visually, beginning with an initial clinical screening
and followed potentially by dermoscopic analysis, a biopsy and
histopathological examination. Automated classification of skin
lesions using images is a challenging task owing to the fine-grained
variability in the appearance of skin lesions. Deep convolutional
neural networks (CNNs)4,5 show potential for general and highly
variable tasks across many fine-grained object categories6–11.
Here we demonstrate classification of skin lesions using a single
CNN, trained end-to-end from images directly, using only pixels
and disease labels as inputs. We train a CNN using a dataset of
129,450 clinical images (two orders of magnitude larger than
previous datasets12) consisting of 2,032 different diseases. We
test its performance against 21 board-certified dermatologists on
biopsy-proven clinical images with two critical binary classification
use cases: keratinocyte carcinomas versus benign seborrheic
keratoses; and malignant melanomas versus benign nevi. The first
case represents the identification of the most common cancers, the
second represents the identification of the deadliest skin cancer.
The CNN achieves performance on par with all tested experts
across both tasks, demonstrating an artificial intelligence capable
of classifying skin cancer with a level of competence comparable to
dermatologists. Outfitted with deep neural networks, mobile devices
can potentially extend the reach of dermatologists outside of the
clinic. It is projected that 6.3 billion smartphone subscriptions will
exist by the year 2021 (ref. 13) and can therefore potentially provide
low-cost universal access to vital diagnostic care.
There are 5.4 million new cases of skin cancer in the United States2
every year. One in five Americans will be diagnosed with a cutaneous
malignancy in their lifetime. Although melanomas represent fewer than
5% of all skin cancers in the United States, they account for approximately
75% of all skin-cancer-related deaths, and are responsible for
over 10,000 deaths annually in the United States alone. Early detection
is critical, as the estimated 5-year survival rate for melanoma drops
from over 99% if detected in its earliest stages to about 14% if detected
in its latest stages. We developed a computational method which may
allow medical practitioners and patients to proactively track skin
lesions and detect cancer earlier. By creating a novel disease taxonomy,
and a disease-partitioning algorithm that maps individual diseases into
training classes, we are able to build a deep learning system for automated
dermatology.
Previous work in dermatological computer-aided classification12,14,15
has lacked the generalization capability of medical practitioners
owing to insufficient data and a focus on standardized tasks such as
dermoscopy16–18 and histological image classification19–22. Dermoscopy
images are acquired via a specialized instrument and histological
images are acquired via invasive biopsy and microscopy; whereby
both modalities yield highly standardized images. Photographic
images (for example, smartphone images) exhibit variability in factors
such as zoom, angle and lighting, making classification substantially
more challenging23,24. We overcome this challenge by using a data-driven
approach: 1.41 million pre-training and training images
make classification robust to photographic variability. Many previous
techniques require extensive preprocessing, lesion segmentation and
extraction of domain-specific visual features before classification. By
contrast, our system requires no hand-crafted features; it is trained
end-to-end directly from image labels and raw pixels, with a single
network for both photographic and dermoscopic images. The existing
body of work uses small datasets of typically less than a thousand
images of skin lesions16,18,19, which, as a result, do not generalize well
to new images. We demonstrate generalizable classification with a new
dermatologist-labelled dataset of 129,450 clinical images, including
3,374 dermoscopy images.
Deep learning algorithms, powered by advances in computation
and very large datasets25, have recently been shown to exceed human
performance in visual tasks such as playing Atari games26, strategic
board games like Go27 and object recognition6. In this paper we
outline the development of a CNN that matches the performance of
dermatologists at three key diagnostic tasks: melanoma classification,
melanoma classification using dermoscopy and carcinoma
classification. We restrict the comparisons to image-based classification.
We utilize a GoogleNet Inception v3 CNN architecture9 that was pre-trained
on approximately 1.28 million images (1,000 object categories)
from the 2014 ImageNet Large Scale Visual Recognition Challenge6,
and train it on our dataset using transfer learning28. Figure 1 shows the
working system. The CNN is trained using 757 disease classes. Our
dataset is composed of dermatologist-labelled images organized in a
tree-structured taxonomy of 2,032 diseases, in which the individual
diseases form the leaf nodes. The images come from 18 different
clinician-curated, open-access online repositories, as well as from
clinical data from Stanford University Medical Center. Figure 2a shows
a subset of the full taxonomy, which has been organized clinically and
visually by medical experts. We split our dataset into 127,463 training
and validation images and 1,942 biopsy-labelled test images.
To take advantage of fine-grained information contained within the
taxonomy structure, we develop an algorithm (Extended Data Table 1)
to partition diseases into fine-grained training classes (for example,
amelanotic melanoma and acrolentiginous melanoma). During
inference, the CNN outputs a probability distribution over these fine
classes. To recover the probabilities for coarser-level classes of interest
(for example, melanoma) we sum the probabilities of their descendants
(see Methods and Extended Data Fig. 1 for more details).
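The step of recovering a coarse class probability by summing over its fine-grained descendants can be illustrated with a few lines of code. This is a sketch under stated assumptions, not the authors' implementation: the two-level `TAXONOMY` mapping, the function name, and the probability values are invented for illustration, and only the class names are taken from the paper's Figure 1.

```python
# Hypothetical two-level slice of the taxonomy: inference class -> training classes.
TAXONOMY = {
    "malignant melanocytic lesion": [
        "acral-lentiginous melanoma",
        "amelanotic melanoma",
        "lentigo melanoma",
    ],
    "benign melanocytic lesion": ["blue nevus", "halo nevus", "Mongolian spot"],
}


def inference_probabilities(training_probs, taxonomy):
    """Sum the fine-grained (training class) probabilities under each coarse class."""
    return {
        parent: sum(training_probs.get(child, 0.0) for child in children)
        for parent, children in taxonomy.items()
    }


# Made-up softmax output over the six training classes (sums to 1).
softmax_output = {
    "acral-lentiginous melanoma": 0.50,
    "amelanotic melanoma": 0.30,
    "lentigo melanoma": 0.12,
    "blue nevus": 0.05,
    "halo nevus": 0.02,
    "Mongolian spot": 0.01,
}

coarse = inference_probabilities(softmax_output, TAXONOMY)
for name, p in coarse.items():
    print(f"{p:.0%} {name}")
```

Training on fine classes and summing at inference time lets one network serve several tasks, since only the grouping of leaves changes between tasks.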
We validate the effectiveness of the algorithm in two ways, using
nine-fold cross-validation. First, we validate the algorithm using a
three-class disease partition—the first-level nodes of the taxonomy,
which represent benign lesions, malignant lesions and non-neoplastic
1 Department of Electrical Engineering, Stanford University, Stanford, California, USA.
2 Department of Dermatology, Stanford University, Stanford, California, USA.
3 Department of Pathology, Stanford University, Stanford, California, USA.
4 Dermatology Service, Veterans Affairs Palo Alto Health Care System, Palo Alto, California, USA.
5 Baxter Laboratory for Stem Cell Biology, Department of Microbiology and Immunology, Institute for Stem Cell Biology and Regenerative Medicine, Stanford University, Stanford, California, USA.
6 Department of Computer Science, Stanford University, Stanford, California, USA.
*These authors contributed equally to this work.
© 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

LETTERH
his task, the CNN achieves 72.1±0.9% (mean±s.d.) overall
he average of individual inference class accuracies) and two
gists attain 65.56% and 66.0% accuracy on a subset of the
set. Second, we validate the algorithm using a nine-class
rtition—the second-level nodes—so that the diseases of
have similar medical treatment plans. The CNN achieves
two trials, one using standard images and the other using
images, which reflect the two steps that a dermatologist m
to obtain a clinical impression. The same CNN is used for a
Figure 2b shows a few example images, demonstrating th
distinguishing between malignant and benign lesions, whic
visual features. Our comparison metrics are sensitivity an
[Figure 1 diagram: Skin lesion image → Deep convolutional neural network (Inception v3; layer types: Convolution, AvgPool, MaxPool, Concat, Dropout, Fully connected, Softmax) → Training classes (757), for example acral-lentiginous melanoma, amelanotic melanoma, lentigo melanoma, blue nevus, halo nevus, Mongolian spot → Inference classes (varies by task), for example 92% malignant melanocytic lesion, 8% benign melanocytic lesion]
Deep CNN layout. Our classification technique is a
Data flow is from left to right: an image of a skin lesion
e, melanoma) is sequentially warped into a probability
over clinical classes of skin disease using Google Inception
hitecture pretrained on the ImageNet dataset (1.28 million
1,000 generic object classes) and fine-tuned on our own
29,450 skin lesions comprising 2,032 different diseases.
ning classes are defined using a novel taxonomy of skin disease
oning algorithm that maps diseases into training classes
(for example, acrolentiginous melanoma, amelanotic melano
melanoma). Inference classes are more general and are comp
or more training classes (for example, malignant melanocytic
class of melanomas). The probability of an inference class is c
summing the probabilities of the training classes according to
structure (see Methods). Inception v3 CNN architecture repr
from https://research.googleblog.com/2016/03/train-your-ow
classifier-with.html
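• Built an in-house dataset of 129,450 skin lesion images
• Data curated by 18 US dermatologists
• Images learned with a CNN (Inception v3)
• The AI's readings were compared with those of 21 dermatologists
• Distinguishing keratinocyte carcinoma from benign seborrheic keratosis
• Distinguishing malignant melanoma from benign lesions (standard images)
• Distinguishing malignant melanoma from benign lesions (dermoscopy images)
Development of the skin cancer-reading AI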

Skin cancer classification accuracy of deep learning and dermatologists

[Figure, panels a and b: sensitivity vs. specificity plots comparing the algorithm's curve with individual dermatologists and the average dermatologist. Panel titles: Melanoma: 130 images; Melanoma: 225 images; Melanoma: 111 dermoscopy images; Carcinoma: 707 images; Melanoma: 1,010 dermoscopy images; Carcinoma: 135 images. Legend entries: Algorithm: AUC = 0.96; Dermatologists (25); Dermatologists (22); Dermatologists (21); Average dermatologist.]

A considerable number of the 21 dermatologists were less accurate than the AI

The dermatologists' average performance was also worse than the AI's

Skin Cancer Image Classification (TensorFlow Dev Summit 2017)
Skin cancer classification performance of
the CNN and dermatologists.
https://www.youtube.com/watch?v=toK1OSLep3s&t=419s

Pathology
Biopsy: the supreme court justice that hands down the definitive diagnosis

A B C D
Benign without atypia / Atypia / DCIS (ductal carcinoma in situ) / Invasive Carcinoma
Interpretation?
Elmore et al. JAMA 2015
Diagnostic Concordance Among Pathologists
Reading breast cancer pathology data

Figure 4. Participating Pathologists’ Interpretations of Each of the 240 Breast Biopsy Test Cases
[Four stacked-bar panels, one bar per case, showing the percentage of pathologist interpretations (0-100%) falling in each category: benign without atypia, atypia, DCIS, invasive carcinoma.]
A, Benign without atypia: 72 cases, 2070 total interpretations
B, Atypia: 72 cases
C, DCIS: 73 cases
D, Invasive carcinoma: 23 cases
DCIS indicates ductal carcinoma in situ.
Diagnostic Concordance in Interpreting Breast Biopsies Original Investigation Research
Disagreement among pathologists in breast cancer reading

• Accuracy: 75.3%
(the reference answer was determined by consensus among three experienced pathologists)
spent on this activity was 16 (95% CI, 15-17); 43 participants were
awarded the maximum 20 hours.
Pathologists’ Diagnoses Compared With Consensus-Derived
Reference Diagnoses
The 115 participants each interpreted 60 cases, providing 6900 total individual interpretations for comparison with the consensus-derived reference diagnoses (Figure 3). Participants agreed with the consensus-derived reference diagnosis for 75.3% of the interpretations (95% CI, 73.4%-77.0%). Participants (n = 94) who completed the CME activity reported that
Patient and Pathologist Characteristics Associated With
Overinterpretation and Underinterpretation
The association of breast density with overall pathologists’ concordance (as well as both overinterpretation and underinterpretation rates) was statistically significant, as shown in Table 3 when comparing mammographic density grouped into 2 categories (low density vs high density). The overall concordance estimates also decreased consistently with increasing breast density across all 4 Breast Imaging-Reporting and Data System (BI-RADS) density categories: BI-RADS A, 81% (95% CI, 75%-86%); BI-RADS B, 77% (95%
Figure 3. Comparison of 115 Participating Pathologists’ Interpretations vs the Consensus-Derived Reference Diagnosis for 6900 Total Case Interpretations (a)

Rows: consensus reference diagnosis (b). Columns: participating pathologists’ interpretation.

Reference diagnosis   | Benign without atypia | Atypia | DCIS | Invasive carcinoma | Total
Benign without atypia | 1803                  | 200    | 46   | 21                 | 2070
Atypia                | 719                   | 990    | 353  | 8                  | 2070
DCIS                  | 133                   | 146    | 1764 | 54                 | 2097
Invasive carcinoma    | 3                     | 0      | 23   | 637                | 663
Total                 | 2658                  | 1336   | 2186 | 720                | 6900

DCIS indicates ductal carcinoma in situ.
(a) Concordance noted in 5194 of 6900 case interpretations or 75.3%.
(b) Reference diagnosis was obtained from consensus of 3 experienced breast pathologists.
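The 75.3% concordance figure can be reproduced from the Figure 3 counts. The sketch below treats the table as a confusion matrix and also splits the disagreements into over- and underinterpretation. Ordering the four categories from least to most severe is an assumption made here for illustration, and the variable names are not from the paper.

```python
# Categories ordered from least to most severe (ordering assumed for illustration).
CATEGORIES = ["benign without atypia", "atypia", "DCIS", "invasive carcinoma"]

# Rows: consensus reference diagnosis. Columns: participating pathologists' interpretation.
COUNTS = [
    [1803, 200, 46, 21],
    [719, 990, 353, 8],
    [133, 146, 1764, 54],
    [3, 0, 23, 637],
]

n = len(CATEGORIES)
total = sum(sum(row) for row in COUNTS)
concordant = sum(COUNTS[i][i] for i in range(n))
# Interpretation more severe than the reference (above the diagonal).
over = sum(COUNTS[i][j] for i in range(n) for j in range(n) if j > i)
# Interpretation less severe than the reference (below the diagonal).
under = sum(COUNTS[i][j] for i in range(n) for j in range(n) if j < i)

print(f"concordance: {concordant}/{total} = {concordant / total:.1%}")
print(f"overinterpretation: {over / total:.1%}, underinterpretation: {under / total:.1%}")
for i, name in enumerate(CATEGORIES):
    print(f"{name}: {COUNTS[i][i] / sum(COUNTS[i]):.1%} agreement with reference")
```

The per-category rates make the slide's point concrete: agreement is high for invasive carcinoma and much lower for atypia.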
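For a total of 240 pathology samples,

a total of 6,900 interpretations by 115 pathologists were compared with the reference diagnoses
Disagreement among pathologists in breast cancer reading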

ISBI Grand Challenge on
Cancer Metastases Detection in Lymph Node

International Symposium on Biomedical Imaging 2016
H&E Image Processing Framework
[Diagram]
Train: whole slide image → sampled normal and tumor patches → training data → Convolutional Neural Network
Test: whole slide image → overlapping image patches → Convolutional Neural Network → P(tumor) → tumor prob. map (scale 0.0 to 1.0)
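The test-time half of this framework (overlapping patches in, tumor probability map out) can be sketched in plain Python. This is an illustration under assumptions, not the challenge team's code: a tiny 2D grid stands in for a whole slide image, `mean_intensity` is a hypothetical stand-in for the trained CNN's P(tumor), and averaging overlapping predictions is one simple way to merge them.

```python
def tumor_probability_map(slide, patch, stride, predict):
    """Slide a patch x patch window over `slide` and average the per-patch
    predictions that cover each pixel."""
    h, w = len(slide), len(slide[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            window = [row[left:left + patch] for row in slide[top:top + patch]]
            p = predict(window)  # P(tumor) for this patch
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    total[r][c] += p
                    count[r][c] += 1
    return [
        [total[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(w)]
        for r in range(h)
    ]


def mean_intensity(window):
    """Hypothetical stand-in for the CNN: the mean pixel value of the patch."""
    values = [v for row in window for v in row]
    return sum(values) / len(values)


# Toy 4 x 4 "slide" with a bright 2 x 2 region in the bottom-right corner.
slide = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
prob_map = tumor_probability_map(slide, patch=2, stride=1, predict=mean_intensity)
for row in prob_map:
    print(" ".join(f"{p:.2f}" for p in row))
```

A stride smaller than the patch size gives the overlap shown in the diagram, so each pixel's value reflects several patch-level predictions.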

https://blogs.nvidia.com/blog/2016/09/19/deep-learning-breast-cancer-diagnosis/

Clinical study on ISBI dataset

Reader                                          | Error rate
Pathologist in competition setting              | 3.5%
Pathologists in clinical practice (n = 12)      | 13% - 26%
Pathologists on micro-metastasis (small tumors) | 23% - 42%
Beck Lab Deep Learning Model                    | 0.65%
Beck Lab’s deep learning model now outperforms pathologist
Andrew Beck, Machine Learning for Healthcare, MIT 2017

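Google's AI for breast pathology reading
• The localization score (FROC) for the algorithm reached 89%, which significantly
exceeded the score of 73% for a pathologist with no time constraint.

Sensitivity of the AI + specificity of the human
Yun Liu et al. Detecting Cancer Metastases on Gigapixel Pathology Images (2017)
• Google's AI greatly improved sensitivity (92.9%, 88.5%)

• @8FP: the sensitivity achievable when up to 8 false positives (FPs) per slide are tolerated

• FROC: the mean of the sensitivities when 1/4, 1/2, 1, 2, 4, and 8 FPs per slide are allowed

• In other words, if a few FPs are tolerated, the AI can reach very high sensitivity

• The human pathologist reached a sensitivity of 73%, but a specificity of nearly 100%
• Human pathologists and the AI pathologist are good at different things

• If the two cooperate, improvements can be expected in reading efficiency, consistency, sensitivity, and more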
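The FROC score described on this slide is the mean sensitivity over six false-positive budgets. A minimal sketch of that calculation is given below. The detection format, the function name `froc_score`, and the toy data are assumptions for illustration; real evaluations also need rules for matching detections to lesions and for tied scores, which are left out here.

```python
FP_RATES = (0.25, 0.5, 1, 2, 4, 8)  # average false positives allowed per slide


def froc_score(detections, n_lesions, n_slides, fp_rates=FP_RATES):
    """detections: (confidence, lesion_id) pairs pooled over all slides,
    with lesion_id = None for a false positive.
    Returns (mean sensitivity, sensitivity at each FP rate)."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    sensitivities = []
    for rate in fp_rates:
        fp_budget = rate * n_slides
        found, fps = set(), 0
        for _, lesion in ranked:
            if lesion is None:
                if fps + 1 > fp_budget:
                    break  # accepting this detection would exceed the FP budget
                fps += 1
            else:
                found.add(lesion)
        sensitivities.append(len(found) / n_lesions)
    return sum(sensitivities) / len(sensitivities), sensitivities


# Toy example: 4 slides containing 4 lesions ("a" to "d") in total.
toy_detections = [
    (0.9, "a"), (0.8, None), (0.7, "b"), (0.6, None),
    (0.5, "c"), (0.4, None), (0.3, None), (0.2, "d"),
]
score, per_rate = froc_score(toy_detections, n_lesions=4, n_slides=4)
print(per_rate, score)
```

In this scheme, the sensitivity at the last budget (8 FPs per slide) corresponds to the "@8FP" figure mentioned above.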

http://www.rolls-royce.com/about/our-technology/enabling-technologies/engine-health-management.aspx#sense
250 sensors to monitor the “health” of the GE turbines

Fig 1. What can consumer wearables do? Heart rate can be measured with an oximeter built into a ring [3], muscle activity with an electromyographi
sensor embedded into clothing [4], stress with an electodermal sensor incorporated into a wristband [5], and physical activity or sleep patterns via an
accelerometer in a watch [6,7]. In addition, a female’s most fertile period can be identified with detailed body temperature tracking [8], while levels of me
attention can be monitored with a small number of non-gelled electroencephalogram (EEG) electrodes [9]. Levels of social interaction (also known to a
PLOS Medicine 2016

SEPSIS
A targeted real-time early warning score (TREWScore) for septic shock
Katharine E. Henry,1 David N. Hager,2 Peter J. Pronovost,3,4,5 Suchi Saria1,3,5,6*
Sepsis is a leading cause of death in the United States, with mortality highest among patients who develop septic shock. Early aggressive treatment decreases morbidity and mortality. Although automated screening tools can detect patients currently experiencing severe sepsis and septic shock, none predict those at greatest risk of developing shock. We analyzed routinely available physiological and laboratory data from intensive care unit patients and developed “TREWScore,” a targeted real-time early warning score that predicts which patients will develop septic shock. TREWScore identified patients before the onset of septic shock with an area under the ROC (receiver operating characteristic) curve (AUC) of 0.83 [95% confidence interval (CI), 0.81 to 0.85]. At a specificity of 0.67, TREWScore achieved a sensitivity of 0.85 and identified patients a median of 28.2 [interquartile range (IQR), 10.6 to 94.2] hours before onset. Of those identified, two-thirds were identified before any sepsis-related organ dysfunction. In comparison, the Modified Early Warning Score, which has been used clinically for septic shock prediction, achieved a lower AUC of 0.73 (95% CI, 0.71 to 0.76). A routine screening protocol based on the presence of two of the systemic inflammatory response syndrome criteria, suspicion of infection, and either hypotension or hyperlactatemia achieved a lower sensitivity of 0.74 at a comparable specificity of 0.64. Continuous sampling of data from the electronic health records and calculation of TREWScore may allow clinicians to identify patients at risk for septic shock and provide earlier interventions that would prevent or mitigate the associated morbidity and mortality.
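The AUC values used to compare TREWScore with MEWS have a simple interpretation: the probability that a randomly chosen patient who goes on to develop septic shock receives a higher score than a randomly chosen patient who does not. The sketch below computes AUC directly from that definition. The function name and the toy scores are illustrative assumptions and have no connection to the study data.

```python
def auc(scores, labels):
    """Area under the ROC curve as the fraction of (positive, negative) pairs
    in which the positive case has the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy risk scores: label 1 = later developed septic shock, 0 = did not.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
print(auc(toy_scores, toy_labels))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which 0.83 and 0.73 should be read.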
INTRODUCTION
Seven hundred fifty thousand patients develop severe sepsis and septic shock in the United States each year. More than half of them are admitted to an intensive care unit (ICU), accounting for 10% of all ICU admissions, 20 to 30% of hospital deaths, and $15.4 billion in annual health care costs (1–3). Several studies have demonstrated that morbidity, mortality, and length of stay are decreased when severe sepsis and septic shock are identified and treated early (4–8). In particular, one study showed that mortality from septic shock increased by 7.6% with every hour that treatment was delayed after the onset of hypotension (9).
More recent studies comparing protocolized care, usual care, and early goal-directed therapy (EGDT) for patients with septic shock suggest that usual care is as effective as EGDT (10–12). Some have interpreted this to mean that usual care has improved over time and reflects important aspects of EGDT, such as early antibiotics and early aggressive fluid resuscitation (13). It is likely that continued early identification and treatment will further improve outcomes. However, the
Acute Physiology Score (SAPS II), Sequential Organ Failure Assessment (SOFA) scores, Modified Early Warning Score (MEWS), and Simple Clinical Score (SCS) have been validated to assess illness severity and risk of death among septic patients (14–17). Although these scores are useful for predicting general deterioration or mortality, they typically cannot distinguish with high sensitivity and specificity which patients are at highest risk of developing a specific acute condition.
The increased use of electronic health records (EHRs), which can be queried in real time, has generated interest in automating tools that identify patients at risk for septic shock (18–20). A number of “early warning systems,” “track and trigger” initiatives, “listening applications,” and “sniffers” have been implemented to improve detection and timeliness of therapy for patients with severe sepsis and septic shock (18, 20–23). Although these tools have been successful at detecting patients currently experiencing severe sepsis or septic shock, none predict which patients are at highest risk of developing septic shock.
The adoption of the Affordable Care Act has added to the growing
excitement around predictive models derived from electronic health
R E S E A R C H A R T I C L E

puted as new data became avail
when his or her score crossed t
dation set, the AUC obtained f
0.81 to 0.85) (Fig. 2). At a spec
of 0.33], TREWScore achieved a s
a median of 28.2 hours (IQR, 10
Identification of patients b
A critical event in the developme
related organ dysfunction (seve
been shown to increase after th
more than two-thirds (68.8%) o
were identified before any sepsi
tients were identified a median
(Fig. 3B).
Comparison of TREWScore
We evaluated the performance of
methods for the purpose of provid
use of TREWScore. We first com
to MEWS, a general metric used
of catastrophic deterioration (17)
oped for tracking sepsis, MEWS
tion of patients at risk for severe
Fig. 2. ROC for detection of septic shock before onset in the validation
set. The ROC curve for TREWScore is shown in blue, with the ROC curve for
MEWS in red. The sensitivity and specificity performance of the routine
screening criteria is indicated by the purple dot. Normal 95% CIs are shown
for TREWScore and MEWS. TPR, true-positive rate; FPR, false-positive rate.
A targeted real-time early warning score (TREWScore)
for septic shock
AUC=0.83
At a specificity of 0.67, TREWScore achieved a sensitivity of 0.85
and identified patients a median of 28.2 hours before onset.

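Sugar.IQ
Based on the user's past records of food intake, the resulting blood glucose changes,
insulin doses, and similar data,
Watson predicts how the user's blood glucose
will change after a meal

ADA 2017, San Diego, Courtesy of Taeho Kim (Seoul Medical Center)

• Released in the US as an iPhone app

• The key question is how cumbersome it is to use

• How long must it be used to be effective: two weeks? A lifetime?

• How will food logging and similar tasks be handled?

• The pricing model does not appear to have been disclosed yet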

Prediction of Ventricular Arrhythmia

An Algorithm Based on Deep Learning for Predicting In-Hospital
Cardiac Arrest
Joon-myoung Kwon, MD;* Youngnam Lee, MS;* Yeha Lee, PhD; Seungwoo Lee, BS; Jinsik Park, MD, PhD
Background-—In-hospital cardiac arrest is a major burden to public health, which affects patient safety. Although traditional track-
and-trigger systems are used to predict cardiac arrest early, they have limitations, with low sensitivity and high false-alarm rates.
We propose a deep learning–based early warning system that shows higher performance than the existing track-and-trigger
systems.
Methods and Results-—This retrospective cohort study reviewed patients who were admitted to 2 hospitals from June 2010 to July
2017. A total of 52 131 patients were included. Specifically, a recurrent neural network was trained using data from June 2010 to
January 2017. The result was tested using the data from February to July 2017. The primary outcome was cardiac arrest, and the
secondary outcome was death without attempted resuscitation. As comparative measures, we used the area under the receiver
operating characteristic curve (AUROC), the area under the precision–recall curve (AUPRC), and the net reclassification index.
Furthermore, we evaluated sensitivity while varying the number of alarms. The deep learning–based early warning system (AUROC:
0.850; AUPRC: 0.044) significantly outperformed a modified early warning score (AUROC: 0.603; AUPRC: 0.003), a random forest
algorithm (AUROC: 0.780; AUPRC: 0.014), and logistic regression (AUROC: 0.613; AUPRC: 0.007). Furthermore, the deep learning–
based early warning system reduced the number of alarms by 82.2%, 13.5%, and 42.1% compared with the modified early warning
system, random forest, and logistic regression, respectively, at the same sensitivity.
Conclusions: An algorithm based on deep learning had high sensitivity and a low false-alarm rate for detection of patients with
cardiac arrest in the multicenter study. (J Am Heart Assoc. 2018;7:e008678. DOI: 10.1161/JAHA.118.008678.)
Key Words: artificial intelligence • cardiac arrest • deep learning • machine learning • rapid response system • resuscitation
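
The abstract compares models by AUROC and AUPRC. As a rough sketch of what these two numbers measure, the Python below computes AUROC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, and approximates AUPRC by average precision. The data are hypothetical toy values and the function names are illustrative; this is not the evaluation code used in the study.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(scores, labels):
    """Mean of the precision values at each true positive, ranked by score.

    Assumes no tied scores; a common step-wise estimate of AUPRC.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / sum(labels)


# Toy risk scores and outcomes (1 = event), purely hypothetical.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(f"AUROC={auroc(scores, labels):.4f}")
print(f"AP={average_precision(scores, labels):.4f}")
```

For a classifier that ranks at random, AUROC is about 0.5 while AUPRC is about the event prevalence. With a rare outcome such as cardiac arrest, small absolute AUPRC values can therefore still separate models.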
In-hospital cardiac arrest is a major burden to public health, which affects patient safety [1–3]. More than a half of cardiac arrests result from respiratory failure or hypovolemic shock, and 80% of patients with cardiac arrest show signs of deterioration in the 8 hours before cardiac arrest [4–9]. However, 209 000 in-hospital cardiac arrests occur in the United States each year, and the survival discharge rate for patients with cardiac arrest is <20% worldwide [10,11]. Rapid response systems (RRSs) have been introduced in many hospitals to detect cardiac arrest using the track-and-trigger system (TTS) [12,13].
Two types of TTS are used in RRSs. For the single-parameter TTS (SPTTS), cardiac arrest is predicted if any single vital sign (eg, heart rate [HR], blood pressure) is out of the normal range [14]. The aggregated weighted TTS calculates a weighted score for each vital sign and then finds patients with cardiac arrest based on the sum of these scores [15]. The modified early warning score (MEWS) is one of the most widely used approaches among all aggregated weighted TTSs (Table 1) [16]; however, traditional TTSs including MEWS have limitations, with low sensitivity or high false-alarm rates [14,15,17]. Sensitivity and false-alarm rate interact: Increased sensitivity creates higher false-alarm rates and vice versa.
Current RRSs suffer from low sensitivity or a high false-alarm rate. An RRS was used for only 30% of patients before unplanned intensive care unit admission and was not used for 22.8% of patients, even if they met the criteria [18,19].
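
The two track-and-trigger styles described in the excerpt can be sketched in a few lines of Python. The scoring bands, the alarm threshold of 3, and all function names below are hypothetical values chosen for illustration; they are not the published MEWS table and are not clinical guidance.

```python
# Hypothetical scoring bands for illustration only; NOT the published MEWS table.
# Each band is (lower bound inclusive, upper bound exclusive, points).
SCORE_BANDS = {
    "heart_rate": [(0, 40, 3), (40, 50, 1), (50, 100, 0), (100, 120, 1), (120, 1000, 3)],
    "systolic_bp": [(0, 80, 3), (80, 100, 1), (100, 200, 0), (200, 1000, 1)],
    "resp_rate": [(0, 8, 3), (8, 10, 1), (10, 20, 0), (20, 30, 1), (30, 1000, 3)],
    "temperature": [(0.0, 35.0, 3), (35.0, 36.0, 1), (36.0, 38.0, 0),
                    (38.0, 39.0, 1), (39.0, 100.0, 3)],
}


def points_for(name, value):
    """Look up the points assigned to one vital sign value."""
    for low, high, points in SCORE_BANDS[name]:
        if low <= value < high:
            return points
    raise ValueError(f"{name}={value} is outside every band")


def single_parameter_trigger(vitals):
    """SPTTS style: alarm if ANY single vital sign is outside its zero-point band."""
    return any(points_for(name, value) > 0 for name, value in vitals.items())


def aggregated_weighted_score(vitals):
    """Aggregated weighted style: sum the per-vital points."""
    return sum(points_for(name, value) for name, value in vitals.items())


def aggregated_trigger(vitals, threshold=3):
    """Alarm when the summed score reaches the (hypothetical) threshold."""
    return aggregated_weighted_score(vitals) >= threshold


# Two made-up patients: A is mildly abnormal on three vitals, B on one.
patient_a = {"heart_rate": 115, "systolic_bp": 95, "resp_rate": 26, "temperature": 37.0}
patient_b = {"heart_rate": 105, "systolic_bp": 120, "resp_rate": 16, "temperature": 36.8}

for label, vitals in (("A", patient_a), ("B", patient_b)):
    print(label, single_parameter_trigger(vitals),
          aggregated_weighted_score(vitals), aggregated_trigger(vitals))
```

In this toy setup the single-parameter rule alarms for both patients, while the aggregated score alarms only for patient A. That illustrates the trade-off between sensitivity and false alarms that the excerpt describes.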
From the Departments of Emergency Medicine (J.-m.K.) and Cardiology (J.P.), Mediplex Sejong Hospital, Incheon, Korea; VUNO, Seoul, Korea (Youngnam L., Yeha L.,
S.L.).
*Dr Kwon and Mr Youngnam Lee contributed equally to this study.
Correspondence to: Joon-myoung Kwon, MD, Department of Emergency medicine, Mediplex Sejong Hospital, 20, Gyeyangmunhwa-ro, Gyeyang-gu, Incheon 21080,
Korea. E-mail: kwonjm@sejongh.co.kr
Received January 18, 2018; accepted May 31, 2018.
© 2018 The Authors. Published on behalf of the American Heart Association, Inc., by Wiley. This is an open access article under the terms of the Creative Commons
Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for
commercial purposes.
DOI: 10.1161/JAHA.118.008678 Journal of the American Heart Association 1
ORIGINAL RESEARCH

• Number of patients: 86,290
• Cardiac arrests: 633
• Input: Heart rate, Respiratory rate, Body temperature, Systolic Blood Pressure
(source: VUNO)
Cardiac Arrest Prediction Accuracy

• The accuracy gap is larger at the alarm volumes a university hospital rapid response team can handle (points A and B)
• A: DEWS 33.0%, MEWS 0.3%
• B: DEWS 42.7%, MEWS 4.0%
(source: VUNO)
APPH (Alarms Per Patient Per Hour)
(source: VUNO)
Fewer False Alarms

(source: VUNO)
Change in DEWS prediction over time

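•Will AI replace doctors?

•What is the new role of human doctors?

•Who bears responsibility for the outcome?

•The deskilling problem

•How can clinical utility be demonstrated?

•How should it be approved and regulated?

•The black-box problem
Issues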

Will AI replace doctors?

•Can AI replace doctors? Yes.

•Can AI replace all doctors? No.

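• J&J launched 'Sedasys', a sedation robot, in 2014
• A medical robot for anesthesia that induces sedation by administering propofol during colonoscopy and endoscopy
• Adjusts the dose according to the patient's vital signs, such as blood oxygen level and heart rate
• After FDA approval in 2013, deployed to hospitals in the US, Australia, Canada, and elsewhere from 2014
• Cut the cost of sedated endoscopy to one-tenth ($2,000 vs. $150-200)
• Anesthesiologist associations and others ran a large-scale opposition campaign and lobbied politicians for regulation
• The Wall Street Journal: "J&J lost its fight with anesthesiologists who stood to lose a source of income"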

If machines take over all the mechanical work,
what role is left for humans?
What roles do doctors play today?
• Roles that will disappear
• Roles that will remain
• New roles

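•Judgments based on evidence and logic

•Anything that can be drawn as a flowchart

•Roles based on visual perception

•Anything for which the answer to the questions below is YES

•'Can you logically explain why you made that decision?'

•'Would other doctors make a similar decision?'

•'Would I make the same decision if I looked at it again a month later?'
Roles that will disappear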

NCCN Guidelines Version 4.2014, Non-Small Cell Lung Cancer, page NSCL-2 (flowchart): clinical assessment (Stage IA, IB, I, II, IIB), pretreatment evaluation, and initial treatment. © National Comprehensive Cancer Network, Inc. 2014.

NCCN Guidelines Version 4.2014, Non-Small Cell Lung Cancer, page NSCL-8 (flowchart): mediastinal biopsy findings, initial treatment, and adjuvant treatment. © National Comprehensive Cancer Network, Inc. 2014.

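•The final medical decision
•Humane work that only humans can do

•Human touch

•Communication and empathy
•Roles other than examining and treating patients

•Basic research

•Creating new data and standards
Roles that will remain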

Over the course of a career, an oncologist may impart bad news an average of 20,000 times,
but most practicing oncologists have never received any formal training to help them
prepare for such conversations.

High levels of empathy in primary care physicians correlate with  
better clinical outcomes for their patients with diabetes

the three levels of physicians’ empathy
was highly significant (χ²(4) = 22.04, P <
.001). The likelihood of good control
(A1c < 7.0%) was significantly greater in
the patients of physicians with high
empathy scores than in the patients of
physicians with low scores (56% and
40%, respectively; z = 4.0, P < .01).
Conversely, the likelihood of poor
control (A1c > 9) was significantly lower
in the patients of physicians with high
empathy scores than it was in the patients
of physicians in the low-scoring group
Statistical control for gender, age, and
type of insurance
Logistic regression was used to examine
the unique contribution of levels of
physicians’ empathy in predicting
optimal clinical outcomes after
controlling for physicians’ and patients’
gender and age, and patients’ health
insurance. In the first logistic model, the
outcomes of the hemoglobin A1c test
were dichotomized according to
whether they had achieved good
scoring category of physic
the odds of good control o
by 50%. Physicians’ gende
was associated with good
patients’ A1c outcome), p
(younger age was associat
control of patients’ A1c),
type of insurance (Medica
associated with good cont
contributed significantly t
Patients’ gender and age d
contribute. The Hosmer–
goodness-of-fit test showe
model was mathematically
(χ²(8) = 7.03, P = .53). Th
indicated that the physicia
empathy was a unique and
contributor to the predict
control of hemoglobin A1
patients, beyond the contr
gender and age of the phy
patients, and type of patie
insurance.
In another logistic regress
classified the results of the
into two categories in whi
test result of less than 100
good control. The same p
in the previous model wer
the independent variables
results of this analysis are
Table 3.
The odds ratios for physic
Table 2
Frequency and Percent Distributions of the Hemoglobin A1c and LDL-C Test Results for 891 Diabetic Patients, Treated Between July 2006 and June 2009, by Levels of Their Physicians’ Empathy*

No. (%) of patients by levels of physicians’ empathy

Patient outcome        High (n = 205)   Moderate (n = 282)   Low (n = 404)
Hemoglobin A1c†
  <7.0%                115 (56)         139 (49)             163 (40)
  ≥7.0% and ≤9.0%      59 (29)          99 (35)              135 (34)
  >9.0%                31 (15)          44 (16)              106 (26)
LDL-C‡
  <100                 121 (59)         149 (53)             180 (44)
  ≥100 and ≤130        56 (27)          86 (30)              128 (32)
  >130                 28 (14)          47 (17)              96 (24)

* From a study of physicians’ empathy and patients’ outcomes, Jefferson Medical College.
† χ²(4) = 22.04, P < .001.
‡ χ²(4) = 15.55, P < .001.
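
The chi-square statistics in the footnotes of Table 2 can be recomputed from the patient counts in the table. The Python below is a minimal sketch of Pearson's chi-square test of independence on a contingency table. The function and variable names are illustrative, and only the counts come from Table 2.

```python
def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Rows: physician empathy High, Moderate, Low.
# Columns: outcome bands in the order listed in Table 2.
a1c_counts = [[115, 59, 31], [139, 99, 44], [163, 135, 106]]
ldl_counts = [[121, 56, 28], [149, 86, 47], [180, 128, 96]]

a1c_chi2, a1c_dof = pearson_chi_square(a1c_counts)
ldl_chi2, ldl_dof = pearson_chi_square(ldl_counts)
print(f"A1c:   chi2({a1c_dof}) = {a1c_chi2:.2f}")   # about 22.04
print(f"LDL-C: chi2({ldl_dof}) = {ldl_chi2:.2f}")   # about 15.55
```

Both values agree with the footnoted statistics, each with 4 degrees of freedom.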
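•A 2011 study of 891 patients with diabetes

•Patients treated by physicians with high empathy

•had better blood glucose control (hemoglobin A1c)

•and lower levels of "bad" cholesterol (LDL-C).

•Other variables besides physician empathy (physician gender, patient gender, insurance status, etc.) showed no difference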
