Quick Start Tutorial of KH Coder:
Quantitative Content Analysis or Text Mining
of English Language Data
Koichi Higuchi
Preface
• This presentation is part of a series of tutorials on using KH Coder.
• KH Coder is free software for quantitative content analysis or text mining. It is also used in computational linguistics.
• Details and downloads: http://khc.sourceforge.net/en/
Table of Contents
Configure KH Coder for English-speaking users / English data
• 1. Change the interface language to English
• 2. Settings for analyzing English text
• Notes on the stopwords
Create a new project and prepare for analysis
• 3. Create a new project
• 4. Run pre-processing
Frequently appearing words and co-occurrences
• 5. Word frequency list
• 6. KWIC and collocation stats
• 7. Co-occurrence network of words
• Methods for exploring co-occurrences of words
Characteristics of each chapter
• 8. Distinctive words of each chapter
• 9. Correspondence analysis of words and chapters
Coding Rules
• Use coding rules to count concepts
• 10. Search documents with coding rules
• 11. Cross tabulation of codes
1. Change the Interface Language to English
Go to [Project] [Settings] in the menubar.
Choose “English” here and restart KH Coder.
If you prefer the Japanese interface, you may skip this step.
You may also change the interface font.
2. Settings for Analyzing English Text
(1) Go to [Project] [Settings] in the menubar.
(2) Select “Lemmatization.”
(3) Click “config.”
(4) Open the “tutorial_en” folder, then drag the file “stopwords_sample_en.txt” and drop it here. (Or just paste the content of the file here.)
(5) Click “OK.”
(6) Click “OK.”
Notes on the Stopwords
You can specify any words as stopwords in KH Coder.
Stopwords are given the special POS tag “OTHER.”
Words with the “OTHER” tag are excluded from analyses by default.
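As a rough illustration of what the stopword setting does, the Python sketch below tags any listed word as “OTHER” and drops it before further counting. The function names, the tiny stopword set, and the placeholder tag “WORD” are assumptions made for illustration; they are not KH Coder's internals.

```python
# Hypothetical sketch: how a stopword list might map words to the "OTHER" tag.
# The real sample list lives in stopwords_sample_en.txt; this set is invented.
STOPWORDS = {"the", "a", "of", "be"}

def tag_words(lemmas, stopwords=STOPWORDS):
    """Return (lemma, tag) pairs; stopwords get the tag "OTHER"."""
    return [(w, "OTHER" if w.lower() in stopwords else "WORD") for w in lemmas]

def drop_other(tagged):
    """Keep only words whose tag is not "OTHER" (the default exclusion)."""
    return [w for w, tag in tagged if tag != "OTHER"]

tagged = tag_words(["the", "teacher", "be", "angry"])
kept = drop_other(tagged)
```

Here `kept` holds only the words that would still take part in an analysis.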
3. Create a New Project
(1) Go to [Project] [New] in the menubar.
(2) Click “Browse” and open the file “tutorial_en/botchan_en.txt”.
(3) Fill in whatever memo you like.
(4) Click “OK.”
In this tutorial we analyze the novel “Botchan” by Soseki. “botchan_en.txt” contains all 11 chapters of the novel. Chapter headings are marked with the h1 tag.
Next time you start KH Coder, go to [Project] [Open] in the menubar and open the project you have created here.
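The note on chapter headings can be pictured with a short Python sketch that splits a text into chapters at lines marked with h1 tags. The exact markup (a heading written as `<h1>...</h1>` on its own line) and the sample text are assumptions here; please check “botchan_en.txt” for the actual layout.

```python
import re

def split_chapters(text):
    """Split text into (heading, body) pairs at <h1>...</h1> lines (assumed markup)."""
    chapters = []
    heading, body = None, []
    for line in text.splitlines():
        m = re.fullmatch(r"\s*<h1>(.*?)</h1>\s*", line, flags=re.IGNORECASE)
        if m:
            # A new heading closes the chapter collected so far.
            if heading is not None:
                chapters.append((heading, "\n".join(body).strip()))
            heading, body = m.group(1).strip(), []
        elif heading is not None:
            body.append(line)
    if heading is not None:
        chapters.append((heading, "\n".join(body).strip()))
    return chapters

# Invented sample text, not taken from the novel.
sample = "<h1>Chapter I</h1>\nFirst text.\n<h1>Chapter II</h1>\nSecond text."
chapters = split_chapters(sample)
```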
4. Run Pre-Processing
Go to [Pre-Processing] [Run Pre-Processing] in the menubar. Then click “OK.”
Sentence splitting, tokenization, POS tagging, and lemmatization are performed.
The results are compiled into a MySQL database for searching and statistical analysis.
When processing data, KH Coder “concentrates” on the job, so it sometimes looks frozen. This is normal while the CPU or disk is busy.
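To give a feel for the pre-processing stages named above, here is a deliberately naive Python sketch of sentence splitting, tokenization, and lemmatization. It uses regular expressions and a toy lookup table, which are assumptions for illustration only; KH Coder relies on dedicated tools for these steps, and POS tagging is omitted here.

```python
import re

# Toy lemma lookup (assumed); a real lemmatizer covers the whole vocabulary.
LEMMAS = {"was": "be", "were": "be", "teachers": "teacher"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", sentence.lower())

def lemmatize(tokens):
    """Map each token to its base form when the lookup knows it."""
    return [LEMMAS.get(t, t) for t in tokens]

text = "The teachers were angry. I was not!"
processed = [lemmatize(tokenize(s)) for s in split_sentences(text)]
```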
5. Word Frequency List
Go to [Tools] [Words] [Frequency List] in the menubar.
These are counts of base forms / lemmas
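Counting base forms can be sketched in a few lines of Python. The lemma list below is invented for illustration; in practice the counts come from the pre-processed project.

```python
from collections import Counter

# Invented list of base forms (lemmas) after pre-processing.
lemmas = ["teacher", "be", "angry", "teacher", "school", "be", "teacher"]

freq = Counter(lemmas)      # lemma -> number of occurrences
top = freq.most_common(2)   # the two most frequent lemmas
```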
6. KWIC and Collocation Stats 1/2
(1) Go to [Tools] [Words] [KWIC Concordance] in the menubar.
(2) Input the base form of a word and hit “Enter” on the keyboard.
When you change the sort options, click the “Search” button again.
Double-click any line to view a wider context. You can change the viewing Units below.
(3) Click “Stats” to open the collocation stats.
6. KWIC and Collocation Stats 2/2
(1) Follow the steps in the previous slide to open the collocation stats.
(2) You can filter words by POS tags.
“L1” stands for “Left 1.” Numbers in this column indicate how many times each word appeared just before the Node Word (left side, distance 1).
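The KWIC view and the “L1” column can be illustrated with a small Python sketch. The function names and the sample sentence are hypothetical; the sketch only shows the idea of listing contexts around a node word and counting the words at a fixed distance from it.

```python
from collections import Counter

def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for each hit of `node`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

def position_counts(tokens, node, offset):
    """Count words at a fixed offset from the node word (offset -1 is "L1")."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        j = i + offset
        if tok == node and 0 <= j < len(tokens):
            counts[tokens[j]] += 1
    return counts

# Invented sample text.
tokens = "the old teacher met the new teacher near the school".split()
lines = kwic(tokens, "teacher", width=2)
l1 = position_counts(tokens, "teacher", -1)
```

With `offset=1` the same function would give the “R1” counts, that is, the words just after the node word.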
7. Co-Occurrence Network of Words
(1) Go to [Tools] [Words] [Co-Occurrence Network] in the menubar.
(2) Select “Paragraphs” as Unit, then click “OK.”
Now you can see a co-occurrence network of high-frequency words in the text.
The color changes from blue (low) to pink (high) to indicate the centrality index.
(3) Click “Config” and check “Larger nodes for higher frequency words,” then click “OK.”
(4) Click “Config” and increase “edges” (co-occurrences) to “top 100,” then click “OK.”
(5) Select “Community: modularity” as “color.”
Which version did you like?
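The idea behind the network's edges can be sketched in Python: count how often two words appear in the same paragraph, score each pair, and keep only the strongest pairs (as with the “top 100” setting). Scoring pairs with the Jaccard coefficient is an assumption in this sketch, and the paragraphs are invented; drawing the network, centrality, and community detection are left out.

```python
from itertools import combinations

def cooccurrence_edges(paragraphs, top_n=3):
    """Rank word pairs by Jaccard coefficient over paragraphs (assumed measure)."""
    docs = [set(p) for p in paragraphs]
    vocab = sorted(set().union(*docs))
    edges = []
    for a, b in combinations(vocab, 2):
        both = sum(1 for d in docs if a in d and b in d)
        either = sum(1 for d in docs if a in d or b in d)
        if both:
            edges.append((both / either, a, b))
    # Strongest pairs first; ties are broken alphabetically.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))
    return edges[:top_n]

# Each paragraph is a list of base forms (invented data).
paragraphs = [
    ["teacher", "school", "student"],
    ["teacher", "school"],
    ["student", "fight"],
]
edges = cooccurrence_edges(paragraphs, top_n=2)
```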
Methods for Exploring Co-Occurrences of Words
To explore co-occurrences of words, you can also use:
• hierarchical cluster analysis
• multidimensional scaling (MDS)
[Figure panels: co-occurrence network, cluster analysis, MDS]
By interpreting these results, you may find major themes of the text from groups of words that tend to appear together.
KH Coder uses R as a back end to execute these multivariate methods.
8. Distinctive Words of Each Chapter
(1) Go to [Tools] [Variables & Headings] [List] in the menubar.
(2) Click “Heading 1.”
(3) Select “Sentences.”
(4) Select “catalogue: Excel.”
The top 10 distinctive words of each chapter are tabulated. The “distinctiveness” is calculated using the Jaccard index. Basically, if a word has a higher probability of appearing in a specific chapter, it is considered distinctive.
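The Jaccard-based distinctiveness mentioned on this slide can be sketched as follows. The sketch assumes the index is taken between two sets of sentences, those that contain the word and those that belong to the chapter; the data and the function name are invented for illustration.

```python
def jaccard(word, chapter, sentences):
    """Jaccard index between "sentence contains word" and "sentence is in chapter".

    `sentences` is a list of (chapter_label, set_of_words) pairs.
    """
    both = sum(1 for ch, words in sentences if ch == chapter and word in words)
    either = sum(1 for ch, words in sentences if ch == chapter or word in words)
    return both / either if either else 0.0

# Invented sentences, two per chapter.
sentences = [
    ("I", {"school", "teacher"}),
    ("I", {"school", "boy"}),
    ("II", {"train", "teacher"}),
    ("II", {"train", "station"}),
]
score_school_I = jaccard("school", "I", sentences)
score_teacher_I = jaccard("teacher", "I", sentences)
```

In this toy data “school” appears in every sentence of chapter I and nowhere else, so it scores higher for chapter I than “teacher,” which is spread over both chapters.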
9. Correspondence Analysis of Words and Chapters
(1) Go to [Tools] [Words] [Correspondence Analysis] in the menubar.
(2) Click “OK.”
(3) Click “Config,” then reduce words to “Top 30,” check “Bubble plot,” uncheck “Size of variables...,” and click “OK.” (This step is optional.)
Using correspondence analysis, you can visually interpret the characteristics of each chapter.
Use Coding Rules to Count Concepts
In some cases, we have to count concepts, not words.
To count concepts, you can compose “coding rules” like this:
*shopping
store or shop or ( merchandise and not develop )
The first line (*shopping) indicates the name of this code.
The second line gives the conditions for attaching this code. Cases that contain the word “store” or “shop” are given the code “shopping.” The parenthetical notation means that cases should contain the word “merchandise” but should not contain the word “develop.”
If a case is acceptable under multiple coding rules, multiple codes will be given to the case.
We use “tutorial_en/themes.txt” as example coding rules in this tutorial. Please open this file and check the content.
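The logic of the *shopping rule can be written out by hand in Python, which may help when checking what a rule will and will not match. This is a hand translation of the rule shown above, not KH Coder's rule parser; the function and variable names are hypothetical.

```python
def code_shopping(words):
    """Hand-translated condition of the *shopping rule."""
    w = set(words)
    return "store" in w or "shop" in w or ("merchandise" in w and "develop" not in w)

# Code name -> condition; more rules could be added to this mapping.
RULES = {"shopping": code_shopping}

def assign_codes(words, rules=RULES):
    """Return every code whose condition the case satisfies."""
    return [name for name, cond in rules.items() if cond(words)]

a = assign_codes(["visit", "shop"])           # contains "shop"
b = assign_codes(["develop", "merchandise"])  # "merchandise" but also "develop"
c = assign_codes(["sell", "merchandise"])     # "merchandise" without "develop"
```

Because `assign_codes` returns a list, a case that satisfies several rules receives several codes.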
10. Search Documents with Coding Rules
(1) Go to [Tools] [Documents] [Search Documents] in the menubar.
(2) Click “Browse” and select “tutorial_en/themes.txt”.
(3) Select “Paragraphs.”
(4) Double-click a code.
(5) Double-click a result to view the whole paragraph.
When you compose a coding rule, it is important to search for and check the actual documents that are acceptable under the rule.
11. Cross Tabulation of Codes
(1) Go to [Tools] [Coding] [Crosstab] in the menubar.
(2) Click “Browse” and select “tutorial_en/themes.txt”.
(3) Select “Sentences.”
(4) Click “Run.”
(5) Click “all” to make a graph.
In the latter half of the novel, it looks like “aggression” overwhelms “positive affect” and forms the climax of the story in chapter X.
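A cross tabulation of codes by chapter can be sketched in Python as the percentage of sentences in each chapter that receive each code. The two rules and the sample sentences below are invented for illustration; the real conditions are in “tutorial_en/themes.txt.”

```python
from collections import defaultdict

def crosstab(cases, rules):
    """Percentage of cases per chapter that receive each code."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for chapter, words in cases:
        totals[chapter] += 1
        for name, cond in rules.items():
            if cond(set(words)):
                hits[chapter][name] += 1
    return {
        ch: {name: 100.0 * hits[ch][name] / totals[ch] for name in rules}
        for ch in totals
    }

# Invented rules; they only mimic the code names used in this tutorial.
rules = {
    "aggression": lambda w: "fight" in w or "hit" in w,
    "positive affect": lambda w: "happy" in w or "glad" in w,
}
# Each case is (chapter label, words of one sentence); invented data.
cases = [
    ("I", ["happy", "boy"]),
    ("I", ["glad", "fight"]),
    ("X", ["fight", "teacher"]),
    ("X", ["hit", "back"]),
]
table = crosstab(cases, rules)
```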
Acknowledgement
I am grateful to the students who attended the 2011 “text mining” class at Doshisha University (Faculty of Culture and Information Science) for giving me some hints on composing coding rules for “Botchan.”
Questions or Comments?
Please feel free to post questions or comments on the web forum:
https://sourceforge.net/p/khc/discussion/
