May 1, 2012 • 2012 HMORN Conference • Seattle, Washington

Use of SAS-Based Natural Language Processing
to Identify Incident and Recurrent Malignancies
Justin A. Strauss, MA
Research Associate III
Kaiser Permanente Southern California
Co-Authors
• Chun R. Chao, PhD
• Marilyn L. Kwan, PhD
• Syed A. Ahmed, MD
• Joanne E. Schottinger, MD
• Virginia P. Quinn, PhD
Acknowledgements & Funding
• Mayra Martinez, Michelle McGuire, Melissa
  Preciado, Nirupa Ghai, and Jeff Slezak
  (KPSC); Lawrence Kushi (KPNC); Debra
  Ritzwoller (KPCO); Joan Warren (NCI);
  Jianyu Rao and Jiaoti Huang (UCLA)
• Funding was provided by KPSC
  Community Benefit and the Cancer
  Research Network
Malignancy Identification
• Malignancy identification is important for clinical
  and epidemiologic cancer research.
• Limited quality and availability of incident and
  recurrent malignancy data within health plans.
   • Delayed availability of incident malignancy data from
     cancer registries.
   • Few registries track cancer recurrences.
   • Manual chart abstraction is slow and expensive.
   • Previous research has shown electronic diagnosis codes
     (e.g., ICD-9) to be unreliable.
Natural Language Processing
• Natural language processing (NLP) can be used to identify
  and extract information from electronic clinical text,
  including incident and recurrent malignancy data.
• Increasing opportunity for NLP with adoption of electronic
  clinical systems in patient care delivery.
• Despite its potential value in clinical and research settings,
  NLP usage has been relatively sparse. Contributing factors
  may include:
       • Technical complexity
       • Systems integration requirements
       • Habitual use of existing methods
SCENT Overview
• A SAS-based coding, extraction, and nomenclature tool
  (SCENT) was developed to identify incident and recurrent
  malignancies using text from pathology reports.

• SCENT is currently being implemented in two research
  studies at Kaiser Permanente Southern California (KPSC):
   • Intervention to improve medication adherence among breast
     cancer patients.

   • Differences in the prognosis of prostate cancer patients
     according to their genetic factors.

• Use of SAS programming minimizes implementation
  barriers and increases availability for multisite research.
Description of Methods
• SCENT identifies non-negated clinical concepts within
  pathology report text.

• Built using SAS Base (does not require Text Miner add-on).
   • Makes extensive use of SAS hash objects and regular expressions.

• Includes components for preprocessing, matching, negation
  and uncertainty detection, extracting diagnostic information
  (e.g., staging and Gleason score), and classifying report
  malignancy status.

• Flexibility to assign codes using a variety of coding systems.
   • Validation used a subset of SNOMED 3.x (~1000 concepts).
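
The dictionary-driven matching described above can be sketched roughly as follows. This is a minimal illustration in Python, not the SCENT source: a Python dict stands in for a SAS hash object, the three patterns and the helper names (CONCEPTS, preprocess, find_concepts) are hypothetical, and only the SNOMED codes are taken from the sample report slide.

```python
import re

# Hypothetical mini-dictionary: regex pattern -> (SNOMED code, concept type).
# A Python dict stands in for the SAS hash object; the codes are the ones
# shown on the sample report slide, the patterns are assumptions.
CONCEPTS = {
    r"left breast": ("T04030", "T"),
    r"biops(y|ies)": ("P1140", "P"),
    r"duct(al)? carcinomas?": ("M85003", "M"),
}

def preprocess(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def find_concepts(segment):
    """Return (code, type, matched text) for each concept found, in text order."""
    clean = preprocess(segment)
    hits = []
    for pattern, (code, ctype) in CONCEPTS.items():
        m = re.search(r"\b" + pattern + r"\b", clean)
        if m:
            hits.append((m.start(), code, ctype, m.group(0)))
    return [hit[1:] for hit in sorted(hits)]

found = find_concepts("Left breast, core biopsy: invasive ductal carcinoma.")
```

A full tool would also need negation handling and report-level classification; this sketch covers only the lookup step.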
SCENT Process Diagram
[Flow diagram; the recoverable labels and example values are listed below.]
• Clinical Concepts (Excel)
   • Type: Morphology, topography, or procedural
   • Code: SNOMED 3.x
   • Class: Malignant, basaloid, benign, or N/A
   • Description: Concept description
   • Example: Code M-85033, Description "intraductal papillary adenocarcinoma with invasion"
• Concept Dictionary (SAS), steps shown for the example description:
   • Tokenize Words: [intraductal] [papillary] [adenocarcinoma] [with] [invasion]
   • Clean: [intraductal] [papillary] [adenocarcinoma] [with] [invasion]
   • Enhance: [((intra)?duct(al)?)] [papillar(y|ies)] [adenocarcinoma[ls]?]
• Pathology Text (Research Database)
   • Text: Raw text segment from report
   • Line: Sequential text segment identifier
   • Example segments: [moderately-differentiated ductal adenocarcinoma with papillary] [features.] [the tumor involves 0.6 cm of one core.]
• Text processing, steps shown for the example segments:
   • Regular Expressions, Preprocessed Text: [moderately-differentiated ductal adenocarcinoma with papillary features.] [the tumor involves 0.6 cm of one core.]
   • Examine Segments
   • Tokenize Words: [moderately] [differentiated] [ductal] [adenocarcinoma] with [papillary] [features]
   • Loop Concepts, Match Tokens: [((intra)?duct(al)?)] [papillar(y|ies)] [adenocarcinoma[ls]?]
   • Check Negation: free (of|from); not? (support[a-z]*|identified); non(?!small|hodgkins)
   • Code Matches: moderately differentiated <nlp snm=m85033 type=m class=3>ductal adenocarcinoma with papillary</nlp snm=m85033> features
   • Extract Data: Tumor Staging, Gleason Score, Disease Extent, Diagnostic Certainty
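
The three negation patterns shown in the diagram can be tried out with a short sketch. The patterns are copied from the slide; everything else is an assumption made for illustration: the Python stand-in for SAS regular expression functions, the hypothetical helper name is_negated, and the choice to test a whole segment rather than the text near a matched concept, which the slide does not specify.

```python
import re

# Negation patterns as they appear on the process diagram slide.
NEGATION_PATTERNS = [
    re.compile(r"free (of|from)"),
    re.compile(r"not? (support[a-z]*|identified)"),
    # Negative lookahead: "non" counts as negation unless it begins
    # "nonsmall" or "nonhodgkins", which are disease names, not negations.
    re.compile(r"non(?!small|hodgkins)"),
]

def is_negated(segment):
    """Segment-level check (an assumption): True if any pattern matches."""
    text = segment.lower()
    return any(p.search(text) for p in NEGATION_PATTERNS)

examples = [
    "Margins are free of carcinoma",
    "Findings do not support malignancy",
    "nonsmall cell carcinoma",
    "invasive ductal carcinoma",
]
flags = [is_negated(s) for s in examples]
```

Here the first two examples are flagged as negated and the last two are not. A segment-level test like this is crude; scoping negation to the matched concept would likely be needed in practice.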
Sample Report Coding
Preprocessed Text
 LEFT BREAST CORE BIOPSY TWO O CLOCK.
 <BR>
 INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2.
 <BR>
 NO CALCIFICATION IS IDENTIFIED.
 <BR>
 NO VASCULAR INVASION IS IDENTIFIED.
 <BR>
 HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.

Coded Text
 <NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE <NLP SNM=P1140 TYPE=P>BIOPSY</NLP
 SNM=P1140> TWO O CLOCK.
 <BR>
 INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM
 GRADE 2.
 <BR>
 NO CALCIFICATION IS IDENTIFIED.
 <BR>
 NO VASCULAR INVASION IS IDENTIFIED.
 <BR>
 HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
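
The tag format in the coded text above can be reproduced with a small sketch. The function name tag_concept and its parameters are hypothetical, and Python is used here in place of SAS; only the tag layout (SNM, TYPE, and CLASS attributes, with the code repeated in the closing tag) follows the slide.

```python
import re

def tag_concept(text, pattern, code, ctype, cclass=None):
    """Wrap the first match of pattern in an <NLP ...> tag like the slide's."""
    attrs = f"SNM={code} TYPE={ctype}"
    if cclass is not None:
        attrs += f" CLASS={cclass}"

    def wrap(match):
        return f"<NLP {attrs}>{match.group(0)}</NLP SNM={code}>"

    return re.sub(pattern, wrap, text, count=1)

line = "INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2."
coded = tag_concept(line, r"DUCTAL CARCINOMA", "M85003", "M", cclass=3)
```

Applied to the second line of the preprocessed text, this yields the same tagged line shown in the coded text.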
Validation Study
• To validate SCENT, trained chart abstractors reviewed
  electronic pathology reports.
• Random samples of breast (n=400) and prostate
  (n=400) cancer patients.
   • Patients diagnosed at KPSC between 2000 and 2007.
   • Reports included from six months post-diagnosis
     through end of 2008.
• In total, 206 breast and 186 prostate cancer patients
  contributed 490 and 425 eligible reports, respectively.
• SCENT classifications were compared with those of
  abstractors.
Classification Concordance
Columns: abstractor classifications (% of column total and N). Rows: SCENT classifications.

SCENT Classifications         Benign                Cancer Recurrence     Other Primary Cancer  Suspicious        Kappa
                                 %       N             %       N             %       N             %       N
Breast Cancer (Total)                (436)                  (32)                  (18)                   (4)      0.96
  Benign                      99.8     435             -       -             -       -          25.0       1
  Cancer Recurrence              -       -         100.0      32             -       -             -       -
  Other Primary Cancer         0.2       1             -       -         100.0      18          50.0       2
  Suspicious                     -       -             -       -             -       -          25.0       1
Prostate Cancer (Total)              (356)                  (29)                  (36)                   (4)      0.95
  Benign                      99.4     354             -       -           5.6       2             -       -
  Cancer Recurrence              -       -          96.6      28           2.8       1             -       -
  Other Primary Cancer         0.6       2           3.4       1          91.7      33             -       -
  Suspicious                     -       -             -       -             -       -         100.0       4

Note: incident contralateral breast malignancies were considered to be recurrences.
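
The kappa values can be checked from the report counts in the table. The sketch below assumes the statistic is unweighted Cohen's kappa over the four categories, which the slide does not state; under that assumption the counts give values that round to the reported 0.96 and 0.95. The function and variable names are illustrative.

```python
def cohen_kappa(matrix):
    """Unweighted Cohen's kappa for a square agreement matrix.

    Rows are SCENT classifications, columns are abstractor classifications,
    both ordered: benign, cancer recurrence, other primary cancer, suspicious.
    """
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

# Report counts transcribed from the concordance table.
BREAST = [
    [435, 0, 0, 1],
    [0, 32, 0, 0],
    [1, 0, 18, 2],
    [0, 0, 0, 1],
]
PROSTATE = [
    [354, 0, 2, 0],
    [0, 28, 1, 0],
    [2, 1, 33, 0],
    [0, 0, 0, 4],
]

breast_kappa = round(cohen_kappa(BREAST), 2)
prostate_kappa = round(cohen_kappa(PROSTATE), 2)
```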
SCENT Performance Metrics
                   Sensitivity*       Specificity*       PPV*               NPV*
Breast Cancer      1.00 (0.93-1.00)   0.99 (0.98-1.00)   0.94 (0.85-0.98)   1.00 (0.99-1.00)
Prostate Cancer    0.97 (0.89-0.99)   0.99 (0.98-1.00)   0.97 (0.89-0.99)   0.99 (0.98-1.00)

* Shown with Wilson's 95% confidence interval.
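
For readers unfamiliar with the Wilson score interval, a minimal sketch follows. The counts in the example are hypothetical: the slide does not give the numerators and denominators behind these metrics, so this illustrates the formula only and does not reproduce the table.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.959964 gives 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical counts: 50 of 50 true cases detected.
low, high = wilson_interval(50, 50)
```

Unlike the simple normal approximation, the Wilson interval stays informative when the observed proportion is 1.00, which is why the lower bound in this example falls below 1.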
Conclusions
• Favorable results suggest SCENT can identify and extract
  information about primary and recurrent malignancies from
  pathology reports.
   • Rapid cancer case identification.
   • Improved measurement accuracy of a common study endpoint.

• SCENT has the potential to expedite chart reviews by
  narrowing the search and highlighting relevant concepts.
• Generalized utility for extracting standardized disease
  scores and other clinical information.
• SCENT is proof of concept for SAS-based NLP that can be
  easily shared between institutions to support research.
Limitations & Next Steps
• SCENT has a number of limitations, including:
   • Unable to disambiguate and contextualize identified clinical concepts
     without part-of-speech (POS) tagging.
   • More susceptible to changes in text structure and increased linguistic
     variability than statistical NLP approaches.
       • General purpose NLP (e.g., cTAKES) likely to perform better outside of pathology.

• Next steps include:
   • Release SCENT source code and requisite support files.
   • Optimize current functionality and assess feasibility of adding methods
     (e.g., POS tagging, n-grams, statistical classifiers).
   • Attempt to identify non-pathologically diagnosed malignancies using
     radiology reports and clinical progress notes.
   • Quantify cost savings associated with SCENT-assisted chart reviews.
Questions?

Mais conteúdo relacionado

Semelhante a Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS

Refined blood-borne miRNome of human diseases via PCA-based feature extraction
Refined blood-borne miRNome of human diseases via PCA-based feature extractionRefined blood-borne miRNome of human diseases via PCA-based feature extraction
Refined blood-borne miRNome of human diseases via PCA-based feature extractionY-h Taguchi
 
TriStar Presentation 2011
TriStar Presentation 2011TriStar Presentation 2011
TriStar Presentation 2011thnkstudios
 
2013 machine learning_choih
2013 machine learning_choih2013 machine learning_choih
2013 machine learning_choihHongyoon Choi
 
Thorax cardio nsclc yw hang
Thorax cardio nsclc yw hangThorax cardio nsclc yw hang
Thorax cardio nsclc yw hangJFIM
 
Applying Deep Learning Techniques in Automated Analysis of CT scan images for...
Applying Deep Learning Techniques in Automated Analysis of CT scan images for...Applying Deep Learning Techniques in Automated Analysis of CT scan images for...
Applying Deep Learning Techniques in Automated Analysis of CT scan images for...NEHA Kapoor
 
Whole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisWhole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisEfi Athieniti
 
Iaetsd classification of lung tumour using
Iaetsd classification of lung tumour usingIaetsd classification of lung tumour using
Iaetsd classification of lung tumour usingIaetsd Iaetsd
 
Interpretable Spiculation Quantification for Lung Cancer Screening
Interpretable Spiculation Quantification for Lung Cancer ScreeningInterpretable Spiculation Quantification for Lung Cancer Screening
Interpretable Spiculation Quantification for Lung Cancer ScreeningWookjin Choi
 
Automatic System for Detection and Classification of Brain Tumors
Automatic System for Detection and Classification of Brain TumorsAutomatic System for Detection and Classification of Brain Tumors
Automatic System for Detection and Classification of Brain TumorsFatma Sayed Ibrahim
 
The Fédération Nationale des Centres de Lutte Contre le Cancer Grading
The Fédération Nationale des Centres de Lutte Contre le Cancer GradingThe Fédération Nationale des Centres de Lutte Contre le Cancer Grading
The Fédération Nationale des Centres de Lutte Contre le Cancer GradingDevitaWidjaja1
 
Cardiovascular Imaging
Cardiovascular ImagingCardiovascular Imaging
Cardiovascular ImagingMuhammad Ayub
 
bristol myerd squibb American Society of Clinical Oncology (ASCO) Highlights
bristol myerd squibb American Society of Clinical Oncology (ASCO) Highlightsbristol myerd squibb American Society of Clinical Oncology (ASCO) Highlights
bristol myerd squibb American Society of Clinical Oncology (ASCO) Highlightsfinance13
 

Semelhante a Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS (15)

Refined blood-borne miRNome of human diseases via PCA-based feature extraction
Refined blood-borne miRNome of human diseases via PCA-based feature extractionRefined blood-borne miRNome of human diseases via PCA-based feature extraction
Refined blood-borne miRNome of human diseases via PCA-based feature extraction
 
TriStar Presentation 2011
TriStar Presentation 2011TriStar Presentation 2011
TriStar Presentation 2011
 
Atlas drenaje ganglionar de Martínez - Monge
Atlas drenaje ganglionar de Martínez - MongeAtlas drenaje ganglionar de Martínez - Monge
Atlas drenaje ganglionar de Martínez - Monge
 
2013 machine learning_choih
2013 machine learning_choih2013 machine learning_choih
2013 machine learning_choih
 
Thorax cardio nsclc yw hang
Thorax cardio nsclc yw hangThorax cardio nsclc yw hang
Thorax cardio nsclc yw hang
 
Applying Deep Learning Techniques in Automated Analysis of CT scan images for...
Applying Deep Learning Techniques in Automated Analysis of CT scan images for...Applying Deep Learning Techniques in Automated Analysis of CT scan images for...
Applying Deep Learning Techniques in Automated Analysis of CT scan images for...
 
Whole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisWhole Genome Sequencing Analysis
Whole Genome Sequencing Analysis
 
Iaetsd classification of lung tumour using
Iaetsd classification of lung tumour usingIaetsd classification of lung tumour using
Iaetsd classification of lung tumour using
 
Interpretable Spiculation Quantification for Lung Cancer Screening
Interpretable Spiculation Quantification for Lung Cancer ScreeningInterpretable Spiculation Quantification for Lung Cancer Screening
Interpretable Spiculation Quantification for Lung Cancer Screening
 
Automatic System for Detection and Classification of Brain Tumors
Automatic System for Detection and Classification of Brain TumorsAutomatic System for Detection and Classification of Brain Tumors
Automatic System for Detection and Classification of Brain Tumors
 
The Fédération Nationale des Centres de Lutte Contre le Cancer Grading
The Fédération Nationale des Centres de Lutte Contre le Cancer GradingThe Fédération Nationale des Centres de Lutte Contre le Cancer Grading
The Fédération Nationale des Centres de Lutte Contre le Cancer Grading
 
Cardiovascular Imaging
Cardiovascular ImagingCardiovascular Imaging
Cardiovascular Imaging
 
bristol myerd squibb American Society of Clinical Oncology (ASCO) Highlights
bristol myerd squibb American Society of Clinical Oncology (ASCO) Highlightsbristol myerd squibb American Society of Clinical Oncology (ASCO) Highlights
bristol myerd squibb American Society of Clinical Oncology (ASCO) Highlights
 
Barth imt
Barth imtBarth imt
Barth imt
 
Barth imt
Barth imtBarth imt
Barth imt
 

Mais de HMO Research Network

New Rules Dealing with Conflicts of Interest in Public Health Service Funded ...
New Rules Dealing with Conflicts of Interest in Public Health Service Funded ...New Rules Dealing with Conflicts of Interest in Public Health Service Funded ...
New Rules Dealing with Conflicts of Interest in Public Health Service Funded ...HMO Research Network
 
Evaluation of the Validity of the Gestational Length Assumptions Based Upon A...
Evaluation of the Validity of the Gestational Length Assumptions Based Upon A...Evaluation of the Validity of the Gestational Length Assumptions Based Upon A...
Evaluation of the Validity of the Gestational Length Assumptions Based Upon A...HMO Research Network
 
Comparative Safety of Infliximaband Etanercept on the Risk of Serious Infecti...
Comparative Safety of Infliximaband Etanercept on the Risk of Serious Infecti...Comparative Safety of Infliximaband Etanercept on the Risk of Serious Infecti...
Comparative Safety of Infliximaband Etanercept on the Risk of Serious Infecti...HMO Research Network
 
A Multi State Markov Model for Analyzing Patterns of Use of Opiod Treatments ...
A Multi State Markov Model for Analyzing Patterns of Use of Opiod Treatments ...A Multi State Markov Model for Analyzing Patterns of Use of Opiod Treatments ...
A Multi State Markov Model for Analyzing Patterns of Use of Opiod Treatments ...HMO Research Network
 
A Descriptive Study of Vaccinations Occuring During Pregnancy HENNINGER
A Descriptive Study of Vaccinations Occuring During Pregnancy HENNINGERA Descriptive Study of Vaccinations Occuring During Pregnancy HENNINGER
A Descriptive Study of Vaccinations Occuring During Pregnancy HENNINGERHMO Research Network
 
The Use of Administrative Data and Natural Language Processing to Estimate th...
The Use of Administrative Data and Natural Language Processing to Estimate th...The Use of Administrative Data and Natural Language Processing to Estimate th...
The Use of Administrative Data and Natural Language Processing to Estimate th...HMO Research Network
 
Patient Views of KRAS Testing for Treatment of Metastatic Colorectal Cancer L...
Patient Views of KRAS Testing for Treatment of Metastatic Colorectal Cancer L...Patient Views of KRAS Testing for Treatment of Metastatic Colorectal Cancer L...
Patient Views of KRAS Testing for Treatment of Metastatic Colorectal Cancer L...HMO Research Network
 
Comparative Effectiveness of Chemotherapy Regimens for Advanced Lung Cancer C...
Comparative Effectiveness of Chemotherapy Regimens for Advanced Lung Cancer C...Comparative Effectiveness of Chemotherapy Regimens for Advanced Lung Cancer C...
Comparative Effectiveness of Chemotherapy Regimens for Advanced Lung Cancer C...HMO Research Network
 
CER HUB An Informatics Platform for Conducting Compartive Effectiveness with ...
CER HUB An Informatics Platform for Conducting Compartive Effectiveness with ...CER HUB An Informatics Platform for Conducting Compartive Effectiveness with ...
CER HUB An Informatics Platform for Conducting Compartive Effectiveness with ...HMO Research Network
 
An Application of Doubly Robust Estimation JOHNSON
An Application of Doubly Robust Estimation JOHNSONAn Application of Doubly Robust Estimation JOHNSON
An Application of Doubly Robust Estimation JOHNSONHMO Research Network
 
Risk Factors for Short Term Virologic Outcomes Among HIV Infected Patients Un...
Risk Factors for Short Term Virologic Outcomes Among HIV Infected Patients Un...Risk Factors for Short Term Virologic Outcomes Among HIV Infected Patients Un...
Risk Factors for Short Term Virologic Outcomes Among HIV Infected Patients Un...HMO Research Network
 
Expanding SEER Reporting with Comorbidity Data Colorectal Cancer HORNBROOK
Expanding SEER Reporting with Comorbidity Data Colorectal Cancer HORNBROOKExpanding SEER Reporting with Comorbidity Data Colorectal Cancer HORNBROOK
Expanding SEER Reporting with Comorbidity Data Colorectal Cancer HORNBROOKHMO Research Network
 
Drug Characteristics Associated with Medication Adherence Across Eight Diseas...
Drug Characteristics Associated with Medication Adherence Across Eight Diseas...Drug Characteristics Associated with Medication Adherence Across Eight Diseas...
Drug Characteristics Associated with Medication Adherence Across Eight Diseas...HMO Research Network
 
Feasibility of Implementing Screening Brief Intervention and Referral to Trea...
Feasibility of Implementing Screening Brief Intervention and Referral to Trea...Feasibility of Implementing Screening Brief Intervention and Referral to Trea...
Feasibility of Implementing Screening Brief Intervention and Referral to Trea...HMO Research Network
 
eCare for Heart Wellness A Trial to Test the Feasibility of Web Based Dietici...
eCare for Heart Wellness A Trial to Test the Feasibility of Web Based Dietici...eCare for Heart Wellness A Trial to Test the Feasibility of Web Based Dietici...
eCare for Heart Wellness A Trial to Test the Feasibility of Web Based Dietici...HMO Research Network
 
A Telephone Based Diabetes Prevention Program and Social Support for Weight L...
A Telephone Based Diabetes Prevention Program and Social Support for Weight L...A Telephone Based Diabetes Prevention Program and Social Support for Weight L...
A Telephone Based Diabetes Prevention Program and Social Support for Weight L...HMO Research Network
 
Technological Resources & Personnel Costs Required to Implement an Automated ...
Technological Resources & Personnel Costs Required to Implement an Automated ...Technological Resources & Personnel Costs Required to Implement an Automated ...
Technological Resources & Personnel Costs Required to Implement an Automated ...HMO Research Network
 
Online Patient Access to their Medical Record and Health Providers is Associa...
Online Patient Access to their Medical Record and Health Providers is Associa...Online Patient Access to their Medical Record and Health Providers is Associa...
Online Patient Access to their Medical Record and Health Providers is Associa...HMO Research Network
 
Documentations of Advanced Heath Care Directives Where Are They TAI_SEALE
Documentations of Advanced Heath Care Directives Where Are They TAI_SEALEDocumentations of Advanced Heath Care Directives Where Are They TAI_SEALE
Documentations of Advanced Heath Care Directives Where Are They TAI_SEALEHMO Research Network
 

Mais de HMO Research Network (20)

New Rules Dealing with Conflicts of Interest in Public Health Service Funded ...
New Rules Dealing with Conflicts of Interest in Public Health Service Funded ...New Rules Dealing with Conflicts of Interest in Public Health Service Funded ...
New Rules Dealing with Conflicts of Interest in Public Health Service Funded ...
 
From Populations to Patients
From Populations to PatientsFrom Populations to Patients
From Populations to Patients
 
Evaluation of the Validity of the Gestational Length Assumptions Based Upon A...
Evaluation of the Validity of the Gestational Length Assumptions Based Upon A...Evaluation of the Validity of the Gestational Length Assumptions Based Upon A...
Evaluation of the Validity of the Gestational Length Assumptions Based Upon A...
 
Comparative Safety of Infliximaband Etanercept on the Risk of Serious Infecti...
Comparative Safety of Infliximaband Etanercept on the Risk of Serious Infecti...Comparative Safety of Infliximaband Etanercept on the Risk of Serious Infecti...
Comparative Safety of Infliximaband Etanercept on the Risk of Serious Infecti...
 
A Multi State Markov Model for Analyzing Patterns of Use of Opiod Treatments ...
A Multi State Markov Model for Analyzing Patterns of Use of Opiod Treatments ...A Multi State Markov Model for Analyzing Patterns of Use of Opiod Treatments ...
A Multi State Markov Model for Analyzing Patterns of Use of Opiod Treatments ...
 
A Descriptive Study of Vaccinations Occuring During Pregnancy HENNINGER
A Descriptive Study of Vaccinations Occuring During Pregnancy HENNINGERA Descriptive Study of Vaccinations Occuring During Pregnancy HENNINGER
A Descriptive Study of Vaccinations Occuring During Pregnancy HENNINGER
 
The Use of Administrative Data and Natural Language Processing to Estimate th...
The Use of Administrative Data and Natural Language Processing to Estimate th...The Use of Administrative Data and Natural Language Processing to Estimate th...
The Use of Administrative Data and Natural Language Processing to Estimate th...
 
Patient Views of KRAS Testing for Treatment of Metastatic Colorectal Cancer L...
Patient Views of KRAS Testing for Treatment of Metastatic Colorectal Cancer L...Patient Views of KRAS Testing for Treatment of Metastatic Colorectal Cancer L...
Patient Views of KRAS Testing for Treatment of Metastatic Colorectal Cancer L...
 
Comparative Effectiveness of Chemotherapy Regimens for Advanced Lung Cancer C...
Comparative Effectiveness of Chemotherapy Regimens for Advanced Lung Cancer C...Comparative Effectiveness of Chemotherapy Regimens for Advanced Lung Cancer C...
Comparative Effectiveness of Chemotherapy Regimens for Advanced Lung Cancer C...
 
CER HUB An Informatics Platform for Conducting Compartive Effectiveness with ...
CER HUB An Informatics Platform for Conducting Compartive Effectiveness with ...CER HUB An Informatics Platform for Conducting Compartive Effectiveness with ...
CER HUB An Informatics Platform for Conducting Compartive Effectiveness with ...
 
An Application of Doubly Robust Estimation JOHNSON
An Application of Doubly Robust Estimation JOHNSONAn Application of Doubly Robust Estimation JOHNSON
An Application of Doubly Robust Estimation JOHNSON
 
Risk Factors for Short Term Virologic Outcomes Among HIV Infected Patients Un...
Risk Factors for Short Term Virologic Outcomes Among HIV Infected Patients Un...Risk Factors for Short Term Virologic Outcomes Among HIV Infected Patients Un...
Risk Factors for Short Term Virologic Outcomes Among HIV Infected Patients Un...
 
Expanding SEER Reporting with Comorbidity Data Colorectal Cancer HORNBROOK
Expanding SEER Reporting with Comorbidity Data Colorectal Cancer HORNBROOKExpanding SEER Reporting with Comorbidity Data Colorectal Cancer HORNBROOK
Expanding SEER Reporting with Comorbidity Data Colorectal Cancer HORNBROOK
 
Drug Characteristics Associated with Medication Adherence Across Eight Diseas...
Drug Characteristics Associated with Medication Adherence Across Eight Diseas...Drug Characteristics Associated with Medication Adherence Across Eight Diseas...
Drug Characteristics Associated with Medication Adherence Across Eight Diseas...
 
Feasibility of Implementing Screening Brief Intervention and Referral to Trea...
Feasibility of Implementing Screening Brief Intervention and Referral to Trea...Feasibility of Implementing Screening Brief Intervention and Referral to Trea...
Feasibility of Implementing Screening Brief Intervention and Referral to Trea...
 
eCare for Heart Wellness A Trial to Test the Feasibility of Web Based Dietici...
eCare for Heart Wellness A Trial to Test the Feasibility of Web Based Dietici...eCare for Heart Wellness A Trial to Test the Feasibility of Web Based Dietici...
eCare for Heart Wellness A Trial to Test the Feasibility of Web Based Dietici...
 
A Telephone Based Diabetes Prevention Program and Social Support for Weight L...
A Telephone Based Diabetes Prevention Program and Social Support for Weight L...A Telephone Based Diabetes Prevention Program and Social Support for Weight L...
A Telephone Based Diabetes Prevention Program and Social Support for Weight L...
 
Technological Resources & Personnel Costs Required to Implement an Automated ...
Technological Resources & Personnel Costs Required to Implement an Automated ...Technological Resources & Personnel Costs Required to Implement an Automated ...
Technological Resources & Personnel Costs Required to Implement an Automated ...
 
Online Patient Access to their Medical Record and Health Providers is Associa...
Online Patient Access to their Medical Record and Health Providers is Associa...Online Patient Access to their Medical Record and Health Providers is Associa...
Online Patient Access to their Medical Record and Health Providers is Associa...
 
Documentations of Advanced Heath Care Directives Where Are They TAI_SEALE
Documentations of Advanced Heath Care Directives Where Are They TAI_SEALEDocumentations of Advanced Heath Care Directives Where Are They TAI_SEALE
Documentations of Advanced Heath Care Directives Where Are They TAI_SEALE
 

Use of SAS Based Natural Language Processing to Identify Incident and Recurrent Malignancies STRAUSS

  • 1. May 1, 2012 • 2012 HMORN Conference • Seattle, Washington Use of SAS-Based Natural Language Processing to Identify Incident and Recurrent Malignancies Justin A. Strauss, MA Research Associate III Kaiser Permanente Southern California
  • 2. Co-Authors & Funding • Chun R. Chao, PhD • Marilyn L. Kwan, PhD • Syed A. Ahmed, MD • Joanne E. Schottinger, MD • Virginia P. Quinn, PhD
  • 3. Acknowledgements & Funding • Mayra Martinez, Michelle McGuire, Melissa Preciado, Nirupa Ghai, and Jeff Slezak (KPSC); Lawrence Kushi (KPNC); Debra Ritzwoller (KPCO); Joan Warren (NCI); Jianyu Rao and Jiaoti Huang (UCLA) • Funding was provided by KPSC Community Benefit and the Cancer Research Network
  • 4. Malignancy Identification • Malignancy identification is important for clinical and epidemiologic cancer research. • Limited quality and availability of incident and recurrent malignancy data within health plans. • Delayed availability of incident malignancy data from cancer registries. • Few registries track cancer recurrences. • Manual chart abstraction slow and expensive. • Previous research has shown electronic diagnosis codes (e.g., ICD-9) to be unreliable.
  • 5. Natural Language Processing • Natural language processing (NLP) can be used to identify and extract information from electronic clinical text, including incident and recurrent malignancy data. • Increasing opportunity for NLP with adoption of electronic clinical systems in patient care delivery. • Despite its potential value in clinical and research settings, NLP usage has been relatively sparse. Contributing factors may include: • Technical complexity • Systems integration requirements • Habitual use of existing methods
  • 6. SCENT Overview • A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to identify incident and recurrent malignancies using text from pathology reports. • SCENT is currently being implemented in two research studies at Kaiser Permanente Southern California (KPSC): • Intervention to improve medication adherence among breast cancer patients. • Differences in the prognosis of prostate cancer patients according to their genetic factors • Use of SAS programming minimizes implementation barriers and increases availability for multisite research.
  • 7. Description of Methods • SCENT identifies non-negated clinical concepts within pathology report text. • Built using SAS Base (does not require Text Miner add-on). • Makes extensive use of SAS hash objects and regular expressions. • Includes components for preprocessing, matching, negation and uncertainty detection, extracting diagnostic information (e.g., staging and Gleason score), and classifying report malignancy status. • Flexibility to assign codes using variety of coding systems. • Validation used subset of SNOMED 3.x (~1000 concepts).
  • 8. SCENT Process Diagram [Flow diagram; only its text labels survive in this transcript.]
    Inputs:
      Clinical Concepts (Excel): Type (morphology, topology, or procedural); Code (SNOMED 3.X); Class (malignant, basaloid, benign, or N/A); Description (concept description).
      Pathology Text (Research Database): Text (raw text segment from report); Line (sequential text segment identifier).
    Step labels: Regular Expressions; Preprocessed Text; Clean; Tokenize Words; Enhance Tokens; Concept Dictionary (SAS); Match Concepts; Check Negation (loop); Code Matches; Examine Segments; Extract Data (Disease Extent, Tumor Staging, Gleason Score, Diagnostic Certainty).
    Worked example:
      Concept: Code M-85033, Description "intraductal papillary adenocarcinoma with invasion"; tokens [intraductal] [papillary] [adenocarcinoma] [with] [invasion].
      Enhanced tokens: [((intra)?duct(al)?)] [papillar(y|ies)] [adenocarcinoma[ls]?]
      Report segments: [moderately-differentiated ductal adenocarcinoma with papillary features.] [the tumor involves 0.6 cm of one core.]
      Report words: [moderately] [differentiated] [ductal] [adenocarcinoma] [with] [papillary] [features]
      Coded output: moderately differentiated <nlp snm=m85033 type=m class=3>ductal adenocarcinoma with papillary</nlp snm=m85033> features
    Negation patterns shown: free (of|from) • not? (support[a-z]*|identified) • non(?!small|hodgkins)
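The enhanced-token and negation patterns in the diagram can be tried out directly. The sketch below is in Python rather than SAS and is not SCENT's implementation: the function names and the segment-level search strategy are assumptions made for illustration. The regular expressions are copied from the diagram (with the stray space in the `papillar` pattern removed) and may be simplified relative to the real tool.

```python
import re

# Enhanced token patterns as they appear in the process diagram.
ENHANCED_TOKENS = {
    "adenocarcinoma": r"adenocarcinoma[ls]?",
    "intraductal": r"((intra)?duct(al)?)",
    "papillary": r"papillar(y|ies)",
}

# Negation cue patterns as they appear in the process diagram.
NEGATION_CUES = [
    r"free (of|from)",
    r"not? (support[a-z]*|identified)",
    r"non(?!small|hodgkins)",
]


def token_matches(concept_token, word):
    """Hypothetical helper: does a report word match a concept token's pattern?"""
    pattern = ENHANCED_TOKENS[concept_token]
    return re.fullmatch(pattern, word, re.IGNORECASE) is not None


def has_negation_cue(segment):
    """Hypothetical helper: does any negation cue occur anywhere in the segment?"""
    return any(re.search(p, segment, re.IGNORECASE) for p in NEGATION_CUES)


if __name__ == "__main__":
    print(token_matches("intraductal", "ductal"))
    print(has_negation_cue("margins are free of carcinoma"))
    print(has_negation_cue("nonsmall cell carcinoma"))
```

The enhanced pattern lets the dictionary token "intraductal" match the report word "ductal", which is how the example concept M-85033 can be found in text that never uses the word "intraductal". The negative lookahead in the third cue keeps "nonsmall" from being read as a negation.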
  • 9. Sample Report Coding
    Preprocessed Text:
      LEFT BREAST CORE BIOPSY TWO O CLOCK. <BR> INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2. <BR> NO CALCIFICATION IS IDENTIFIED. <BR> NO VASCULAR INVASION IS IDENTIFIED. <BR> HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
    Coded Text:
      <NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE <NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK. <BR> INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM GRADE 2. <BR> NO CALCIFICATION IS IDENTIFIED. <BR> NO VASCULAR INVASION IS IDENTIFIED. <BR> HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
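The inline markup in the coded text can be produced by a single dictionary-driven pass over the preprocessed text. The following is a minimal Python sketch, not the SAS code behind SCENT: the three-entry dictionary, the `tag_concepts` name, and the word-boundary matching are assumptions for illustration. Only the codes, types, and class value come from the sample report above.

```python
import re

# Illustrative concept dictionary: (pattern, code, type, class or None).
# Codes and attributes are taken from the sample coded text; the real
# dictionary is far larger (~1000 SNOMED 3.x concepts in the validation).
CONCEPTS = [
    (r"LEFT BREAST", "T04030", "T", None),
    (r"BIOPSY", "P1140", "P", None),
    (r"DUCTAL CARCINOMA", "M85003", "M", "3"),
]

# One alternation with a named group per concept, so the text is scanned
# once and inserted tags are never re-scanned.
COMBINED = re.compile(
    "|".join(f"(?P<c{i}>\\b{c[0]}\\b)" for i, c in enumerate(CONCEPTS))
)


def tag_concepts(text):
    """Wrap each matched concept in <NLP ...>...</NLP ...> markup."""

    def wrap(match):
        index = int(match.lastgroup[1:])  # group name "c2" -> 2
        _, code, concept_type, concept_class = CONCEPTS[index]
        attrs = f"SNM={code} TYPE={concept_type}"
        if concept_class is not None:
            attrs += f" CLASS={concept_class}"
        return f"<NLP {attrs}>{match.group(0)}</NLP SNM={code}>"

    return COMBINED.sub(wrap, text)


if __name__ == "__main__":
    sample = (
        "LEFT BREAST CORE BIOPSY TWO O CLOCK. <BR> "
        "INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2."
    )
    print(tag_concepts(sample))
```

Keeping the code in the closing tag, as the sample output does, means each tagged span can be recovered even if spans were ever to nest or overlap.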
  • 10. Validation Study • To validate SCENT, trained chart abstractors reviewed electronic pathology reports. • Random samples of breast (n=400) and prostate (n=400) cancer patients. • Patients diagnosed at KPSC between 2000 and 2007. • Reports included from six months post-diagnosis through end of 2008. • In total, 206 breast and 186 prostate cancer patients contributed 490 and 425 eligible reports, respectively. • SCENT classifications were compared with those of abstractors.
  • 11. Classification Concordance
    Rows are SCENT classifications; columns are abstractor classifications. Cells show % (N); "-" marks an empty cell.

    Breast Cancer (Kappa = 0.96)
    SCENT classification     Benign (N=436)   Cancer Recurrence (N=32)   Other Primary Cancer (N=18)   Suspicious (N=4)
    Benign                   99.8 (435)       -                          -                             25.0 (1)
    Cancer Recurrence        -                100.0 (32)                 -                             -
    Other Primary Cancer     0.2 (1)          -                          100.0 (18)                    50.0 (2)
    Suspicious               -                -                          -                             25.0 (1)

    Prostate Cancer (Kappa = 0.95)
    SCENT classification     Benign (N=356)   Cancer Recurrence (N=29)   Other Primary Cancer (N=36)   Suspicious (N=4)
    Benign                   99.4 (354)       -                          5.6 (2)                       -
    Cancer Recurrence        -                96.6 (28)                  2.8 (1)                       -
    Other Primary Cancer     0.6 (2)          3.4 (1)                    91.7 (33)                     -
    Suspicious               -                -                          -                             100.0 (4)

    Note: incident contralateral breast malignancies were considered to be recurrences.
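The kappa values on this slide can be checked from the cell counts. The slide does not say which kappa statistic was used; the Python sketch below assumes unweighted Cohen's kappa, and the matrix layout (rows are SCENT, columns are abstractors, both ordered Benign, Cancer Recurrence, Other Primary Cancer, Suspicious) is a transcription of the counts above.

```python
def cohen_kappa(matrix):
    """Unweighted Cohen's kappa for a square agreement matrix of counts."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of reports on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement from the row and column marginals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1 - expected)


# Counts from the concordance slide (rows: SCENT, columns: abstractor).
BREAST = [
    [435, 0, 0, 1],
    [0, 32, 0, 0],
    [1, 0, 18, 2],
    [0, 0, 0, 1],
]
PROSTATE = [
    [354, 0, 2, 0],
    [0, 28, 1, 0],
    [2, 1, 33, 0],
    [0, 0, 0, 4],
]

if __name__ == "__main__":
    print(round(cohen_kappa(BREAST), 2))    # 0.96
    print(round(cohen_kappa(PROSTATE), 2))  # 0.95
```

Under that assumption the computed values round to the reported 0.96 and 0.95.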
  • 12. SCENT Performance Metrics
                        Sensitivity*        Specificity*        PPV*                NPV*
    Breast Cancer       1.00 (0.93-1.00)    0.99 (0.98-1.00)    0.94 (0.85-0.98)    1.00 (0.99-1.00)
    Prostate Cancer     0.97 (0.89-0.99)    0.99 (0.98-1.00)    0.97 (0.89-0.99)    0.99 (0.98-1.00)
    * Shown with Wilson's 95% confidence interval.
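The intervals on this slide are Wilson score intervals, which can be computed from counts alone. The Python sketch below makes one assumption the slides do not state: that "Cancer Recurrence" and "Other Primary Cancer" were pooled as malignant and "Benign" and "Suspicious" as non-malignant. The 2x2 counts are derived from the concordance counts on slide 11 under that assumption, which reproduces the reported values.

```python
from math import sqrt

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wilson_interval(successes, n, z=Z_95):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def metrics(tp, fp, fn, tn):
    """Point estimate and Wilson interval for each performance metric."""
    pairs = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    return {
        name: (k / n, wilson_interval(k, n)) for name, (k, n) in pairs.items()
    }


# Assumed 2x2 collapse of the slide 11 counts (malignant vs. non-malignant).
BREAST = dict(tp=50, fp=3, fn=0, tn=437)
PROSTATE = dict(tp=63, fp=2, fn=2, tn=358)

if __name__ == "__main__":
    for label, counts in (("Breast", BREAST), ("Prostate", PROSTATE)):
        for name, (estimate, (low, high)) in metrics(**counts).items():
            print(f"{label} {name}: {estimate:.2f} ({low:.2f}-{high:.2f})")
```

Unlike the simple normal-approximation interval, the Wilson interval stays within 0 to 1 and remains informative when the observed proportion is exactly 1.00, as with breast cancer sensitivity here.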
  • 13. Conclusions • Favorable results suggest SCENT can identify and extract information about primary and recurrent malignancies from pathology reports. • Rapid cancer case identification. • Improved measurement accuracy of common study endpoint. • SCENT has the potential to expedite chart reviews by narrowing the search and highlighting relevant concepts. • Generalized utility for extracting standardized disease scores and other clinical information. • SCENT is proof of concept for SAS-based NLP that can be easily shared between institutions to support research.
  • 14. Limitations & Next Steps • SCENT has a number of limitations, including: • Unable to disambiguate and contextualize identified clinical concepts without part-of-speech (POS) tagging. • More susceptible to changes in text structure and increased linguistic variability than statistical NLP approaches. • General purpose NLP (e.g., cTAKES) likely to perform better outside of pathology. • Next steps include: • Release SCENT source code and requisite support files. • Optimize current functionality and assess feasibility of adding methods (e.g., POS tagging, n-grams, statistical classifiers). • Attempt to identify non-pathologically diagnosed malignancies using radiology reports and clinical progress notes. • Quantify cost savings associated with SCENT-assisted chart reviews.