Since the mid-1990s, researchers have been using the Linguistic Inquiry and Word Count (LIWC, pronounced “luke”) software tool to explore various text corpora for hidden insights from linguistic patterns. The LIWC tool has evolved over the years. Simultaneously, research using computational text analysis has evolved and shed light on areas of deception, threat assessment, personality, predictive analytics, and other areas. This presentation will highlight some of the applications of LIWC in the research literature and showcase the tool on some original text sets.
4. Content Overview
• Notes About Language
• Reading = Decoding; Writing = Encoding
• Computational Linguistic Analysis
• Curation of Text Sets
• LIWC: Linguistic Inquiry & Word Count
• Work Space in LIWC2015
• Structure of a Custom External Dictionary
• Insights from Experiences Working on a Custom Dictionary* (added)
• Creating .dic Files* (added)
• LIWC in Applied Research
• A Basic Walk-through with LIWC2015
• Live Demos
• Some Types of Askable Questions
5. Content Overview (cont.)
• Challenges with Internal and External Validation
• Some Other Computational Linguistic Analysis Tools
• Some Other Approaches to the Data
• Conclusion (and some Newbie Observations)
• Addendum: An Applied Example
7. Some Generalities About Language
• Most language is natural language, which evolves common practices and structures over time through human interaction (vs. constructed language, like Esperanto, or those created for the silver screen).
• Language evolves over time based on human usage, particularly in local geographical areas.
• Unique dialects may develop locally in particular regions or within certain social groups.
• Language itself tends to be patterned but not necessarily internally logical.
• Modern languages originate from language families and are influenced by other languages.
• Languages are shared codes (oral and written) for people to communicate and exchange information.
• Because languages have to be understood broadly, they tend to be highly patterned.
• Modern languages tend to have written and phonological aspects; they tend to include both content (semantic) and structure (syntactic) aspects.
8. Some Generalities About Language (cont.)
• Only 200 of the world’s 6,000 – 7,000 languages have a written version; most are / have been oral only.
• Language is social; it plays a core role in how people make meaning and interact with each other.
• Changes in a language (based on new technologies, interactions between cultures, and fashion) are often adopted first orally and then integrated into more formal written forms.
• The world’s languages are disappearing as their users abandon them in favor of more commonly shared languages (“Lists of endangered languages”).
• Globalization has complex effects on world languages.
9. Some Generalities About Language (cont.)
• Semantic terms tend towards polysemy (having multiple meanings) and nuance, and so are inherently ambiguous.
• Words must be understood in context (that is, in proximity to the target term) to understand their particular respective word sense (connotative application vs. only denotative).
• There are statistical probabilities for which meaning of a word is likely being used, and based on the proximity of terms to the target term, it is possible to “understand” the particular meaning of a term in a context.
• Language contains high-dimensionality data; it involves many facets.
• Language has text and subtext as well, so the meanings conveyed are not only surface ones but also some hidden (or latent) aspects.
• People wield language in non-obvious ways, such as by using humor, irony, symbolism, historical referencing, tone, and other devices.
10. Changing Roles of Writing in Societies
• Writing used to be practiced by those with political and social power, and their creations were based on formal structures and conventions.
• Originally focused on religious issues
• Broadened to address issues of interest for the literate upper and political classes
• Writing is now far more widely practiced by the masses, who are much more broadly literate.
• Topics cover anything of interest but still fall along certain code-able topics (in terms of library system labeling, and others) for formal publishing.
11. Common Forms of Writing
Non-fiction
• Journalism
• Essay writing
• Autobiography, memoir
• Biography
• Research writing
Fiction
• Poetry
• Short Stories
• Novellas
• Novels
• Plays and scripts
12. Common Forms of Writing (cont.)
Non-fiction
• Documents
• Letters
• Oral histories
• Interviews
• Manifestos
• Statements, and others
Fiction
• Songs
• Jokes
• Synthetic data for dummy case- or scenario-based research, and others
13. Reading = Decoding Text; Writing = Encoding Text
“A language is a fecund, redolent buzzing mess of a thing, in every facet, glint, and corner, even in single words.”
-- John McWhorter, What Language Is (2011)
encode <-> decode <-> encode <-> decode <-> encode <-> decode <-> encode <-> decode <-> encode <-> decode <-> …
The two activities hold each other in tension and constrain each other.
14. Complementary Human and Machine Reading
Human: Close Reading
• Informed by training, experience, personality, intellect, emotion, and other factors
• Full sensory: sight, smell, taste, touch, hearing, and proprioception (embodied)
Computer: Distant Reading
• May include supervised or unsupervised machine learning
• More efficient and scalable than human reading
• Results in objective counts
15. Complementary Human and Machine Reading (cont.)
Human: Close Reading
• Interpretive and subjective, filtered through the person
Computer: Distant Reading
• Objectivist
• Reproducible (theoretically and practically)
16. Some Common Distant Reading Approaches
• Human and computer (supervised machine learning):
• XML tagging and data queries of the tagged texts
• Literary analysis, including on dimensions of time, characters, dialogue, locations, and other aspects (such as in the digital humanities)
• Coding by existing pattern (with original human coding: emergent, a priori, or mixed)
17. Some Common Distant Reading Approaches (cont.)
• Human and computer (data queries):
• Text frequency counts for issues of main focus
• Both:
• Absolute frequency counts
• Relative frequency counts (relative to the document and corpus)
• Mitigation of counts based on how frequently a term appears in a document and corpus, with more common word appearances diluting the informational importance of that word (TF-IDF)
• Word search to find all word contexts for word and phrase disambiguation
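The TF-IDF weighting mentioned above can be sketched in a few lines (a minimal illustration on toy data, not any particular tool's implementation; the unsmoothed `log(N/df)` variant is one common choice among several):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF weights for each term in each document.

    corpus: list of documents, each a list of word tokens.
    Returns a list of {term: weight} dicts, one per document.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            # Term frequency scaled down by how common the term is corpus-wide.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
w = tf_idf(docs)
# A term that appears in every document (here "the") gets weight 0:
# its presence carries no distinguishing information.
```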
18. Some Common Distant Reading Approaches (cont.)
• Computer (unsupervised machine learning):
• Sentiment analysis (against a pre-coded sentiment word set)
• Emotion analysis (based on a number of different models)
• Derivation of gender, personality [Big 5 Personality Traits; Dark Triad personality traits (narcissism, Machiavellianism, and psychopathy), with evidence of “within-person stability” that enables profiling and comparisons/contrasts between people], age, cultural background, and others
• Remote profiling (with “zero interaction”)
• Predictive analytics, such as anticipation of actions by leaders based on public and private signaling (extrapolation of intentionality)
19. Some Common Distant Reading Approaches (cont.)
• Computer (unsupervised machine learning) (cont.):
• Deception detection and analysis (such as through “pronoun drop”)
• Stylometry (including author identification): convergence of linguistic features to an (un)identified author
• “Psychological signatures” of writers based on their written works and communications, extending to profiles
• Domain mapping (topic extraction)
• Theme and subtheme extraction / topic modeling (unsupervised machine learning)
• Machine extraction of textual features that are “tells” for certain outcomes
21. Brief History of Quantitative Approaches and Linguistics
• Manual or “counting by hand” approaches back in the day
• Computational systems tested against human experts in a particular field or domain
• Computational linguistics began in the 1950s, at the start of the Cold War, to translate texts from foreign languages into English (machine translation)
• Used in translating spoken language to text (speech-to-text)
• Used in summarizing texts (topic modeling) at scale for search tools (“Computational linguistics,” Apr. 1, 2016)
22. Computational Linguistic Research Design Built on Theories, Models, and Empirical Research
• Informed by research in language, (social) psychology, computer science, and other fields
• May involve text exploration, discovery, or targeted research questions (or some combination)
• Building on theories, models, and empirical research
• Hypothesizing based on theories and models
• Grouping writing based on particular outcome variables to identify differences in writing, using selected observed indicators in (written and spoken) language as potential indicators of difference between the groups with the differing outcomes
• Using a combination of insights from theories, models, and empirical research
23. Computational Linguistic Research Design Built on Theories, Models, and Empirical Research (cont.)
• Relationship between natural language expression and…
• hidden internal states of people (gender, personality, cognition, state of mind, intentionality, etc.) and hidden internal states of groups and cultures
• health
• genres of writing
• different language structures
• gender differences
• Language features as certain “tells” (indicators, signs, signals)
• Reverse engineering backwards in time
• Predictive analytics forwards in time
24. Computational Linguistic Research Design Built on Theories, Models, and Empirical Research (cont.)
• So essentially: plaintext = code (indicators) of latent (hidden) realities
• So can profile various genres of text for general characteristics / baselines
• So can compare new exemplars of particular texts against baselines
• So can profile an “unknown” text based on its quantitative characteristics
• So can compare historical texts against future ones
• So can compare historical occurrences and related texts…and possibly apply in predictive ways into the future
25. Creation of Dedicated Dictionaries
• Informed by in-world texts
• Suggested words and stems and synonyms
• Vetted by people
• Empirically tested for research value
• Does the dictionary provide practical research insights?
26. Consumptive and Non-Consumptive Text Analysis
Consumptive Text Analysis
• Access to the analytics AND the underlying text set(s)
Non-Consumptive Text Analysis
• No access to the underlying text set
• Google Books Ngram Viewer is one popular example of non-consumptive computational text analysis (with access to the shadow text set of ngrams only)
27. General Sequence
• Theoretical underpinning
• Research design
• Collection of target text documents into corpora
• May need to negotiate the release of particular rights
• Preservation of raw data into a pristine master set
• Development of familiarity or intimacy with the text sets (through close reading and other types of explorations)
• Translation of non-base languages to the base language (or separation for different data runs using different language dictionaries)
28. General Sequence (cont.)
Text cleaning
• Separating each text bit into its own file based on a unit of analysis (quote, paragraph, article, section, novel, or play, etc.)
• Data normalization / spell check
• Clear and representative file naming protocols
• De-identification of data (if relevant)
• Cleaning of notes from transcripts, and others
Text file transcoding (with close observations of data capture and data loss at each phase)
• Images with word content are often not represented as textual contents
• Metadata may / may not be captured
• Sequences of text processing, software used, and such, affect the word counts
• Preferred: .pdf -> MS Word (less lossy)
• Not preferred: .pdf -> .txt (lossy)
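The first cleaning step, one file per unit of analysis, can be scripted; a minimal sketch, where the blank-line paragraph delimiter and the file-naming scheme are assumptions for illustration:

```python
from pathlib import Path

def split_into_units(raw_text, out_dir, prefix="unit"):
    """Split a raw text into one file per paragraph (the unit of analysis),
    using a clear, representative, zero-padded file-naming protocol."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Paragraphs are assumed to be separated by blank lines.
    units = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    for i, unit in enumerate(units, start=1):
        (out / f"{prefix}_{i:04d}.txt").write_text(unit, encoding="utf-8")
    return len(units)
```

The same pattern extends to other units (quotes, sections) by swapping in a different delimiter or a proper parser.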
29. General Sequence (cont.)
• File formatting (.txt, .rtf, .pdf, .doc, .docx, .csv, .xls, .xlsx; NOT .pptx, .ppt, .wpd, .jpg, .png) for LIWC2015
• Versioning of text corpora for different queries
• “Bag of words” paradigm or structure / context preservation
• End-sentence markers for sentence length
• Extraction of data tables
• Creation of data visualizations from the extracted data
• Interpretations and analyses
• (In)validation of the linguistic analysis
30. Sense-making from Linguistic Patterning
• Starting with known information and prior research (and prior theory) that will inform the analyses; may be based on a stated hypothesis
• Selecting relevant texts that are comparable along particular dimensions
• May use from-world text corpora (such as based on dichotomous nonparametric outcome variables or multi-factor outcomes)…and looking for linguistic differences and similarities
31. Sense-making from Linguistic Patterning (cont.)
• Identifying linguistic “markers” / indicators that are a “tell” for a particular construct or state-of-the-world
• Setting baselines (controls) for certain types of texts and then comparing a subset against the general baselines
• Looking at text clustering as factor loading and applying human understandings of the respective factors
• Comparisons and contrasts across dimensions of texts (such as text types across cultures)
32. “Linguistic Style”
• Study of semantic terms makes sense to human readers and aligns with how the human brain works (in terms of what is noticed / perceived and remembered in a text), but semantic terms and unique phrasing are easily emulated and manipulated
• The point is to find indicators that are not so easily tampered with by people who may want to manage impressions
• Going to “function words” or particles (articles, pronouns, prepositions, conjunctions, auxiliary / helping verbs, adverbs, negations, etc.)
34. Selected Texts from a Domain
• Delimiting of targeted texts helps focus the function of the software
• All included texts should relate to the topic being studied
• Said to work better with natural language text than non-natural language text
• Texts do not have to be only one type, but if they are mixed text sets, that should be noted in the analysis
• Types of text data should be reflected in the types of dictionary dimensions (and categories) applied…as well as the selected language dictionary
36. Curation of Text Sets
1. Gathering of textual and non-textual (such as multimedia) data
2. Selection of relevant texts
3. Arrangement of rights releases as needed (staying legal)
4. Transcoding of multimedia content to textual, and textual-to-textual
5. (Non-destructive) data cleaning: normalization of terms, spelling, foreign language translation, treatment of symbols and punctuation, and others (with archival of all raw files in pristine format prior to any normalizing, for “non-destructiveness”)
6. Data segmentation / grouping
7. Data / file labeling
37. Curation of Text Sets (cont.)
8. Data formatting / file conversions
• Must be searchable (machine-readable, with optical character recognition / OCR) files in any of the following formats: .pdf, .doc, .docx, .txt, .rtf, .xls, .xlsx, .csv, etc. (Ability to read within and across columns in spreadsheet formats is a new LIWC2015 capability.)
• Considerations for digital preservation, with common strategies of going to lowest-common-denominator and open-source file formats if possible (.txt, .rtf, .csv, .html, .xml)
9. Metadata creation (linked to the respective files or kept in a README file with the textual data)
38. Curation of Text Sets (cont.)
10. Conducting of research on the text set: inherent enablements for certain types of data queries, data explorations, and research based on dataset contents and structure (such as test text sets to train new models)
11. Scrub of dataset for publishing
12. Descriptions of the text set: its origins, its contents, its quantitative numbers, its standards for text inclusion, its copyright releases, its prior uses, its potential uses, proper citation methods, and originator’s / originators’ contact information, among others; datasets named based on the curated textual and other contents, the data curator, or some other naming method (for easier reference, for building up a user base)
39. A “Sanity Check” for Text Processing
• Transcoding from one document type to another often results in information loss because of how each software program handles the transferred information.
• Data lossiness:
• There is some degree of expected lossiness. For example, text in images will not be recognized unless optical character recognition (OCR) is applied.
• Embedded videos will not have a text equivalency unless a transcript is also downloaded and included with the text document, corpus, or corpora.
• In multi-lingual text files, messages that are not in the main base language may not be transferred accurately.
• Some valid words may turn to garble in the transcoding.
40. A “Sanity Check” for Text Processing (cont.)
• Added extraneous data: there is some degree of extraneous information included, such as web page data in between-page gutters captured in a web-to-PDF capture.
• Comments made to a .pdf may be included in a transcoding context, say, to Word.
• There may also be extra information in web pages if they are not captured in print style but include advertising in designated ad spaces and pop-up windows.
• Print styles of web pages are not always included as an option.
41. A “Sanity Check” for Text Processing (cont.)
• If batch processing, check results with smaller sets first. Check the outputs as well.
• In automation, it is possible to lose information if this is done unthinkingly.
• To see if there are systematic challenges with losing and / or gaining text during transcoding, run a “sanity check” after processing data to see how much of the original was preserved.
• One type of “sanity check” is to run a simple average word count of a particular unit (document, message, or other).
• Does the per-unit word count jibe with what is observed by the researcher? If not, it’s important to figure out a less lossy way of capturing query-able files.
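The average-word-count sanity check above can be automated in a few lines (a sketch; the one-file-per-unit folder layout and UTF-8 encoding are assumptions):

```python
from pathlib import Path

def average_word_count(folder, pattern="*.txt"):
    """Average words per file in a folder of transcoded units.
    Compare the result against what is observed by eyeballing the
    originals; a large gap suggests a lossy transcoding path."""
    counts = [len(p.read_text(encoding="utf-8", errors="replace").split())
              for p in Path(folder).glob(pattern)]
    if not counts:
        return 0.0
    return sum(counts) / len(counts)
```

Running this before and after a format conversion gives a quick, if rough, signal of how much text survived.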
42. A “Sanity Check” for Text Processing (cont.)
• There may be differences between “Save as,” “Export as,” “Print as,” “Send to,” and other sorts of functions that enable transcoding between file types.
• Allowing character substitution or not will affect the transcoded contents from MS Word.
• For text files with UTF-8 characters, it is important to ensure that the encoding preserves UTF-8 characters.
• There are more optimal sequences and technologies to move text from one file type to another, so researchers should experiment with what works best for them.
• From a PDF file, go to .docx (via MS Word) to capture much more recognizable text (vs. a .txt or an .rtf).
43. Creation of Text Metadata from Multimedia Sources
• Types of files:
• Digital imagery, audio files, video files, slideshows, games, simulations, and others
• Analog-to-digital files (transcoding)
• Text versioning: metadata descriptors, transcripts, locational information, coding (whether manual, automated, or mixed), and others
• Using the extracted text transcripts for linguistic analysis…but a step or two out from the original source
• Updated capabilities to read image files (in PDF) to text in an automated way
45. LIWC and its History
• Developed in the early-to-mid 1990s by Martha E. Francis (then a grad student and programmer) and James W. Pennebaker (1993) to study possible therapeutic uses of language
• Named LIWC (Linguistic Inquiry and Word Count), helpfully descriptive but also disambiguated; “LIWC” is pronounced “luke” (according to its makers)
• Comprised of two parts: (1) a processing component and (2) dictionaries (based on certain categories of data and / or constructs)
• Factors broadened rapidly in v. 1 to 80 factors (variables)
46. LIWC and its History (cont.)
• v. 2 evolved with an expanded dictionary and more modern software processing capabilities (2001), also known as SLIWC (Second LIWC)
• LIWC2007 had even broader dictionary capabilities, by James W. Pennebaker, Roger J. Booth, and Martha E. Francis (2007)
47. LIWC and its History (cont.)
• The most recent version is LIWC2015 (Pennebaker, Boyd, Jordan, & Blackburn, 2015), with new software and a new dictionary (vs. an upgrade) and extensive documentation
• The LIWC2015 dictionary contains nearly 6,400 words, word stems, and emoticons
• “Each dictionary entry additionally defines one or more word categories or subdictionaries”
• Includes a feature to load customized dictionaries
48. LIWC and its History (cont.)
• The systematic process of LIWC2015 dictionary creation involved building on LIWC2007; having 2-6 judges individually generate word lists; having the collected words analyzed by a group of 4-8 judges; applying a Meaning Extraction Helper to set base rates of word usage in the wild; creating candidate word lists of terms possibly missed by judges; psychometric evaluation of respective words’ influences on the constructs; then refinement of the terms and a re-review of the prior steps to catch potential errors (Pennebaker, Boyd, Jordan, & Blackburn, 2015, pp. 5-6)
• It is a well-documented software tool (a rarity)
• It is tested for both internal validity (based on real-world text sets) and external validity (based on research designs, with validity applicable not across-the-board but on a case-by-case basis)
49. LIWC and its History (cont.)
• Tested against large text corpora to set baselines: blogs, “expressive writing,” novels, natural speech, the NY Times, and Twitter
• LIWC2015 captures “on average, over 86 percent of the words people use in writing and speech” (Pennebaker, Boyd, Jordan, & Blackburn, 2015, p. 10)
• Fairly high correlations between LIWC2007 and LIWC2015 means (p. 13)
• Removal of categories “largely due to their consistently low base rates, low internal reliability, or their infrequent use by researchers” (past tense verbs, present tense verbs, future tense verbs, human words, inhibition words, inclusives, exclusives, which versions 2001 and 2007 enabled)
• Internally validated on a variety of psychometric dimensions; backstopped by empirical research across a number of modern languages
• Informed by decades of empirical research
52. LIWC and its History (cont.)
• As a commercial product
• May be purchased (http://liwc.wpengine.com/) from Pennebaker Conglomerates, Inc.
• A free trial version may be accessed (http://www.liwc.net/tryonline.php) but with text size limits
• Includes a LIWC API with all LIWC2015 variables “plus 30+ additional validated measures of psychology, personality, emotion, tone, sentiment and more—all in real time” and access to “social media integration, time-series analysis, statistical models, machine learning models and more” (through Receptiviti)
• Is not open-source
• Runs on both Windows and Mac OSes via the Java Virtual Machine
• May be downloaded to the local machine or accessed as a web-based version
53. LIWC and its History (cont.)
• Processes text files sequentially: finds each word, looks to see if it is in its built-in dictionary, counts that word, increments a straight count, and then applies that count to a simple percentage function (% of the complete document) or a variable-based scale function (based on dedicated algorithms) for psychometric, psycholinguistic, or other human-related measures
• Outputs one of the following:
• Raw counts
• Frequency percentages
• Processed scores (percentiles)…but no access to the coded text sets
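The counting sequence described above can be approximated in a few lines (a simplified sketch of the idea, not LIWC itself; the toy dictionary and tokenizer are invented for illustration):

```python
import re

def liwc_style_counts(text, dictionary):
    """Scan tokens sequentially, check each against a category
    dictionary, and report raw counts plus percentages of the
    whole document.

    dictionary: {word: category_name}
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = {}
    for tok in tokens:
        cat = dictionary.get(tok)
        if cat is not None:
            raw[cat] = raw.get(cat, 0) + 1
    # Percentage of all words in the document, as in LIWC's simple
    # percentage output.
    pct = {cat: 100.0 * n / len(tokens) for cat, n in raw.items()}
    return raw, pct

raw, pct = liwc_style_counts("I said I was happy",
                             {"i": "pronoun", "happy": "posemo"})
```

The variable-based scale scores mentioned above would require the tool's dedicated algorithms and are not reproduced here.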
54. LIWC and its History (cont.)
• Evolution of built-in dictionaries over time based on documented standards and researcher majority consensus
• Based on English (with 171,000 “English” words in use, 100,000 English words used by the average native English speaker)
• Considered one of the foremost linguistic analysis tools in use today
• Backstopped by hundreds of research articles
• Can handle any language representable by the UTF-8 charset / character set (but analytics are done in a base language)
55. Human-Created Non-English Translations Based on LIWC2001 or LIWC2007
Available
• Spanish
• German
• Dutch
• Norwegian
• Italian
• Portuguese
In Process
• Arabic
• Korean
• Turkish
• Chinese
56. Downloadable External Dictionaries (.dic) from LIWC2007 and LIWC2001
LIWC2007
• Spanish
• French
• Russian
• Italian
• Dutch
LIWC2001
• German
57. Downloadable Customized Dictionaries
• A dedicated site for dictionary downloads also enables access to user-created dictionaries
• Four were available as of mid-2016
• Two were coherent (structurally and conceptually), and of those, one was a sample to show how to set up a dictionary for use in LIWC2015
• Linguistic analysis dictionary-creators need to be expert in an area of research
• They need a clear grasp of the language that they are using
• They need to work with others to ensure that the linguistic analysis dictionary is as comprehensive and as accurate as possible
• Such dictionaries (like any research instrument) should be fully documented and tested for validity (of construct) and reliability (of consistent results); the former is based on the subject matter field, and the latter is based on counting (which usually has very high reliability for item counting)
59. Exporting Pre-Built Internal Dictionaries
• Can export internal dictionaries (LIWC2001, LIWC2007, and LIWC2015) as “posters” in secured (non-editable) .pdf files
60. Original Customized External Dictionaries
• How to create:
• Conceptualize a construct
• Identify terms that fit that construct
• In a text editor, list in the proper format (next slide): first the constructs and then the terms in each of those constructs
• Be sure to place the opening and closing % delimiters properly
• Version the file as a .dic file (changing the file extension of a basic text file; BUT see Slide 75 for the easiest way to create a .dic file that works); in LIWC2015, .doc, .docx, or .txt files may be used as dictionary files as long as the other formatting is in place
61. Original Customized External Dictionaries (cont.)
• Adding to LIWC (for the analyses)
• Dictionary -> Load New Dictionary
• Can only run one dictionary at a time, but can run various dictionaries over the same text set for different insights
• Conduct research and test the respective constructs for internal reliability and external validity
62. Structure of a Custom External Dictionary
%
1 Dimensiona
2 Dimensionb
3 Dimensionc
%
Word 1
Word 1 2 3
Word 3
Word 3
Word 2
Word 1
• Custom Dictionary Structure:
• Constructs or dimensions in the top section; words representing the various constructs below
• Use of unusual characters ($, #, %, ?, ^, *, etc.) to separate the dimensions or categories from the words themselves, with these delimiters on their own lines
• May represent any language depicted with the UTF-8 charset (of Unicode)
• May use created characters for imaginary created languages (vs. natural languages)…but this hasn’t yet been seen in the LIWC custom dictionary collection
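A minimal, hypothetical .dic file following this structure (the construct names and words are invented for illustration; words carrying multiple numbers indicate multiple constructs, and the trailing asterisk is the stem notation covered on Slide 66):

```
%
1 PosEmo
2 NegEmo
3 Social
%
friend* 3
happy 1
hate 2
sad 2
talk* 1 3
```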
63. Structure of a Custom External Dictionary (cont.)
• Selected words indicative of particular dimensions or categories (single dimensions or multiple ones) in the bottom section
• Words may indicate several constructs, but multiple counting of terms will mean more noise in the data (as compared to signal)…and will err on the side of recall vs. precision (in terms of an F-measure)
• May use empirical data and sources to stock the lists, then add synonyms to expand the dictionary’s transferability beyond the “training data”
• The word list should be alphabetized for easier perusal and for elimination of repeated words
64. Structure of a Custom External Dictionary (cont.)
• Helpful to have custom dictionaries constructed multi-dimensionally to capture a full and complex issue
• May run multiple dictionaries against a corpus or combined corpora
• May divide up corpora into separate documents and sets for different sorts of queries
• For example, use separate corpora to analyze different time periods, with sets representing different time periods
• Use the creation of corpora and the separation of documents and datasets into different sets…as a way to enhance LIWC capabilities
65. Structure of a Custom External Dictionary (cont.)
• Results are straight raw word / phrase / emoticon counts and computed percentages of occurrences against the entire corpora (not scores)
• For transferability and research efficacy, need to validate / invalidate a custom dictionary through pilot-testing and usage
• Testing may involve
• Review by experts in the field
• Application of the dictionary against various text sets
• Statistical testing for whether words represent the respective constructs
66. Required Notation in Customized Dictionaries
• Category names must be one word (and can be written as several words in camel case)
• Separate words by spaces, tabs, or new lines / hard returns (but be consistent)
• Stemmed words may be counted (separately from the core word)
• Stemmed words are created by changes to a word’s form, such as the addition of prefixes or suffixes, pluralization, the expression of verb tense, and other transformations of a core or base term
• An asterisk (*) tells LIWC2015 to ignore all subsequent letters, to capture all forms of the word based on a base form, lemma, or stem (so that differently inflected word forms may be treated as a single item)
• Telephon*
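The asterisk rule can be expressed as a tiny matcher (a sketch of the documented notation, not LIWC's own code):

```python
def matches(entry, token):
    """LIWC stem notation: a trailing '*' means 'ignore all subsequent
    letters', so 'telephon*' matches telephone, telephones, telephoning.
    Without the asterisk, only an exact match counts."""
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry
```

Note the asymmetry this creates: `run` does not match `running`, but `run*` does (and also matches `rune`), so stems need to be chosen carefully to avoid overcapture.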
67. Required Notation in Customized Dictionaries (cont.)
• Inclusion of multi-word phrases is possible, such as for specific compound terms or n-gram sequences
• Single-form versions of those terms will not be counted separately, and phrases are ultimately treated as one-word units
• Words should be in alphabetical order in the customized dictionaries
69. Dictionaries for the Study of Various Constructs and Dimensions
Built-in Dictionaries
• Selectively coded word sets that define a particular category
• Dictionaries affect the fundamental tool capabilities
• Validated
Customizable Dictionaries
• May apply custom dictionaries to the analyses
• External dictionaries are plain text files delimited by % and %
70. Insights from Experiences Working on a Custom Dictionary
• Study the issue in depth, both in the formal and informal literature. Use a “greedy” and “voracious” capture for sources.
• Create constructs (to be sufficiently mutually exclusive but also to cover the research topic as comprehensively as possible). Write these as single words or phrases using camel case.
• Capture words that indicate the respective constructs from all possible respectable sources.
• Go beyond text to images and multimedia. Code everything that is relevant.
• Capture as many natural language words that represent the various constructs as possible.
• If using social media as the source, pay attention to abbreviations (from everywhere), #hashtags, @expressions, emoticons, and a range of other details…
71. Insights from Experiences Working on a
Custom Dictionary (cont.)
• Avoid early lock-in or early finalization of a dictionary. (Assume that a
custom dictionary is never really finalized.)
• If a word applies to multiple constructs, include it in the multiple
constructs. (Don’t commit a word to only one construct.)
• Build a table in Word or Excel. Do not number the cells. Keep this as
freeform and inclusive as possible. Extend the brainstorming stage as
long as possible, so that there is no premature commitment to an early
draft.
71
72. Insights from Experiences Working on a
Custom Dictionary (cont.)
72
   Construct(s)   Related Words, Phrases, Symbols, Numbers, etc. to the Construct(s)
1  ConstructA     …words…phrases…symbols…numbers, and others
2  ConstructB     "
3  ConstructC     "
4  ConstructD     "
5  ConstructF     "
6  ConstructG     "
73. Insights from Experiences Working on a
Custom Dictionary (cont.)
• When the table is complete (at least for this round)…and the
dictionary has to be collated…
• Assign numbers to the constructs.
• Assign numbers to the related words showing their respective
relationships to the respective constructs.
• List the constructs in numerical order.
• You now have the top part of the custom dictionary.
73
74. Insights from Experiences Working on a
Custom Dictionary (cont.)
• Make a “bag of words” of all the words (with
their assigned construct numbers in the
adjacent cells in Excel).
• Sort the column of words into alphabetical
order, choosing the “Expand the Selection”
option so that all the row data follows the
sorted column.
• Take the alphabetized word list, and you have
the bottom part of the custom dictionary.
• Test this in LIWC… by loading the new dictionary
and selecting the type of analysis desired…
74
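The collation steps above (assign construct numbers, alphabetize the bag of words, stack the two parts) can be sketched in Python; the construct names and words are placeholders, and the output follows the % … % layout described earlier:

```python
# Sketch: collate a construct-to-words table into a LIWC-style
# external dictionary (.dic). Placeholder constructs/words only.

def build_dic(table: dict[str, list[str]]) -> str:
    # Assign numbers to the constructs in listed order.
    numbers = {name: i + 1 for i, name in enumerate(table)}

    # Top part: numbered construct list between % delimiters.
    lines = ["%"]
    for name, num in numbers.items():
        lines.append(f"{num}\t{name}")
    lines.append("%")

    # Bottom part: a "bag of words", each word tagged with every
    # construct it indicates, listed in alphabetical order.
    word_tags: dict[str, set[int]] = {}
    for name, words in table.items():
        for w in words:
            word_tags.setdefault(w.lower(), set()).add(numbers[name])
    for w in sorted(word_tags):
        tags = "\t".join(str(n) for n in sorted(word_tags[w]))
        lines.append(f"{w}\t{tags}")
    return "\n".join(lines)

demo = {"ConstructA": ["happy", "joy*"], "ConstructB": ["sad*", "happy"]}
print(build_dic(demo))
```

Note that a word appearing under multiple constructs (here "happy") gets both numbers, matching the advice not to commit a word to only one construct.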
75. Creating .dic Files
• Open MS Word.
• Click “File” tab in the ribbon.
• Click “Options” at the bottom left.
• Select “Proofing” in the “Word Options” window.
• Click the “Custom Dictionaries” button.
• Indicate a “New” dictionary.
• Give the new dictionary a name and save it to the correct location
with the .dic file format.
75
76. Creating .dic Files (cont.)
• Open the .dic file in Word and paste the dictionary (with new words
on each line) into the file. Save. Load. Run.
• If you’ll be making multiple dictionaries, make a few extras with
generic names to serve as templates!
76
82. Dimensions of Language in LIWC2015:
Summary Language Variables
• Four Summary Language
Variables (standardized
composite scores based on
algorithms created from prior
linguistic analysis research and
large “training” text sets)
• Reported out as percentiles from 0
to 100
• Relative and comparative standing
of a target text document or text
set (against training text set) vs.
any “absolute” measure
• These summary language
variables include the following:
Analytic, Clout, Authentic, and
Tone
• These variables each have unique
meanings, so it is important to
read the official manuals to
understand the respective
meanings
• These are “black box” features,
so the underlying algorithms are
not available
82
83. Dimensions of Language in LIWC2015:
Summary Language Variables (cont.)
• Analytic (formerly categorical
dynamic index or “CDI”):
• high score: formal, logical, hierarchical
• low score: informal, personal, narrative
thinking
• Clout:
• high score: “perspective of high
expertise”
• low score: tentative or humble style
• may be indicative of relative social status,
confidence, and leadership
• Authentic:
• high score: honest and disclosing (being
“personal, humble, and vulnerable” and
authentic)
• low score: more guarded “distanced
form of discourse”
• (Sentiment and Emotional) Tone:
• high score (>50): positive emotion
• low score (<50): “greater anxiety, sadness,
or hostility”
• at 50: “suggests either a lack of
emotionality or different levels of
ambivalence” (LIWC2015 Operator’s
Manual)
• in short, below 50 trends negative; above 50 trends positive
83
84. Dimensions of Language in LIWC2015:
Summary Counts to Indicate “Structural
Composition” and Complexity
• WC (total word count) (raw count)
• length of the particular text used as a
proxy for how in-depth that work may
be in addressing the target topic
• WPS (words per sentence)
(average)
• used as a proxy for sentence
complexity
• Sixltr (words longer than six letters)
(percentage of words)
• used as a proxy for word
complexity
• Dic (dictionary words count)
(percentage of target words
captured by the applied dictionary
/ dictionary words)
• used as an understanding of how
much of a text was addressed in the
LIWC analyses, assuming that the
various counts were all applied
84
85. Dimensions of Language in LIWC2015:
Understanding Most Output Numbers
• 90 output variables in LIWC2015
• Most are percentages of certain words in the total document or text
set (text corpus or corpora) (“Interpreting LIWC Output,” 2015; “How
it Works,” 2015)
85
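The percentage outputs can be illustrated with a toy calculation; the category words and sample text here are invented for the sketch, and real LIWC tokenization and dictionaries are more involved:

```python
# Toy illustration of how most LIWC output variables are computed:
# (words in the text that hit a category) / (total word count) * 100.

def category_percentage(text: str, category_words: set[str]) -> float:
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in category_words)
    return 100.0 * hits / len(tokens)

sample = "we feel happy and we feel calm"
# 2 category hits out of 7 tokens -> about 28.57
print(category_percentage(sample, {"happy", "calm"}))
```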
86. Dimensions of Language in LIWC2015:
Percentages of Standard Linguistic Dimensions
• Function words (pronouns,
articles, helping / auxiliary verbs,
and others)
• Othergram (other grammar),
including verbs, adjectives,
comparisons, interrogatives,
numbers, and quantifiers
86
87. Dimensions of Language in LIWC2015:
Percentages of Psychological Constructs
• Affect (including positive
emotions, negative emotions
and particularly anxiety, anger,
and sadness)
• Social (including family, friends,
female, male)
• Perceptual processes (including
seeing, hearing, and feeling)
• Drives (affiliation, achievement,
power, reward, risk)
87
88. Dimensions of Language in LIWC2015:
Percentages of Other Human-Based Constructs
• Biological Processes (including
body, health, sexual, ingestion)
• Time Orientation (including
past, present, or future focus)
• Relativity (including motion,
space, time)
• Personal Concerns (including
work, leisure, home, money,
religion, death)
• Informal Language [including
swearing, netspeak, assent,
nonfluencies (meaningless filler
words), and filler words]
• Cognitive Processes (including
insight, causal, discrepancies,
tentativeness, certainty, and
differentiation)
88
89. Dimensions of Language in LIWC2015:
Punctuation Marks
• Punctuation marks (12
categories)
• Considered part of “structural
composition”
89
90. Meaning in the Dimensions
• Based on empirical research
• Based on constructs within
particular fields (particularly
psychology, linguistics)
• Based on the selected text
corpus or corpora
• Dimensions are applied singly
and in combination with other
descriptors and analytical
approaches to create value-
added understandings.
90
91. Additional LIWC Dictionaries
• Dictionary -> Get More Dictionaries
• Download as .dic (dictionary) files
• Dictionary -> Load New Dictionary
91
92. Beyond English
• Spinoff dictionaries exist as translations from English terms, but they were
not natively created or natively coded
• Some are spinoffs of the English sentiment core, with added grammatical and
cultural variables
• External dictionary in LIWC2001: German
• External dictionaries in LIWC2007: Spanish, French, Russian, Italian,
and Dutch
• Versioned in some other languages like KLIWC for Korean LIWC,
Tagalog, and others based on custom research (according to articles
in the research literature)
92
95. Some Types of Research with Computational Linguistic Analysis:
Research Approaches
• Lab-based (and / or classroom-
based) capture of text sets based
on particular directions for
eliciting writing
• Stream-of-consciousness writing,
free writing, diary writing /
journaling, deceptive vs. non-
deceptive writing, responding to
visual prompts, completing
cliffhangers, and others
• Uses of the Electronically
Activated Recorder (EAR)
• Pre- and post- experimental
methods
• Categorical outcomes used to
separate text sets and the study
of various linguistic variable
associations (“markers” or
“indicators”) with particular
outcomes; application of
statistical analysis for
significance and correlation
effect size (r)
95
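The statistical step named at the end of this slide can be sketched: an independent-samples t statistic for one linguistic variable across two outcome groups, converted to the correlation effect size r = sqrt(t² / (t² + df)). The group data below are invented for illustration only:

```python
import math
from statistics import mean, variance

# Sketch: compare one linguistic variable (e.g., a category percentage)
# between two outcome groups with a pooled-variance t statistic, then
# convert t to the correlation effect size r.

def t_and_r(group_a: list[float], group_b: list[float]) -> tuple[float, float]:
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    r = math.sqrt(t * t / (t * t + df))  # effect size r from t and df
    return t, r

# Invented example: "I"-pronoun percentages for two groups of writers.
t, r = t_and_r([4.1, 3.8, 4.5, 4.0], [2.9, 3.1, 2.7, 3.3])
print(round(t, 2), round(r, 2))
```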
96. Some Types of Research with Computational Linguistic Analysis:
Research Approaches (cont.)
• Sometimes used as a part of a
research sequence, not as the
main research
96
97. Some Types of Research with Computational Linguistic Analysis:
Baseline / Control Setting
• Baseline / control setting for
how males write / talk vs. how
females write / talk
• Within languages
• Between languages
• Status indicators in language
use; power vs. powerlessness
• Language-based baselines
• Cultural-based baselines
• Genre writing baselines
• General age trajectories
baselines and language use
97
98. Some Types of Research with Computational Linguistic Analysis:
Efficacy of Writing Interventions
• Whether writing has therapeutic
value; what types of writing have
therapeutic value
• Upper and lower boundaries of
therapeutic writing
98
99. Some Types of Research with Computational Linguistic Analysis:
Predictive Analytics
• Handling of individual trauma;
handling of collective trauma
• Authorship attribution (through
psycholinguistic profiling)
• Deception detection
• Fraud detection
• Male / female authorship inference
• Suicidality detection
• College student performance
prediction
• Employee performance prediction
• Research article popularity based
on writing fluency
• Threat detection
• Remote personality reading
(including author / speaker
cognition, psychological health,
and others)
• Reading of mental and emotional
states
99
100. Some Types of Research with Computational Linguistic Analysis:
Predictive Analytics (cont.)
• Belongingness, social realities
• Cognitive judgment
• Attitudes
• Motives and intentions, and
others
• (An easy starter book on this is
The Secret Life of Pronouns by
J.W. Pennebaker, one of the
main originators of LIWC.)
100
101. Some Types of Research with Computational Linguistic Analysis:
Some Origins of Extant From-world Text Sets
• Historical documents
• Court records
• Research articles
• Journalism text sets
• Genres of fiction writing
• Gray (informal) literature
• Company or organization-based
writing
• Grants
• Personal writing, like letters
• Large-scale writing sets from
college students, K-12 students
• Applications for college entry
• Writing for standardized testing
• Synthetic data (created to test
particular research hypotheses)
• Computer-generated
• Crowd-generated, and others
• Related text sets across languages,
also between languages
101
102. Some Types of Research with Computational Linguistic Analysis:
Some Origins of Extant From-world Text Sets (cont.)
• Spoken speech
• Speeches
• Debates
• Panel discussions
• Focus groups
• Meeting agendas and discussions
• Television programs
• Telephone transcripts
• Music lyrics, and others
• Social media text sets
• Web pages and sites
• Crowd-sourced blog entries
• Web encyclopedia pages
• Tweetstreams and microblogging
message collections
• Social network user accounts
• SMS datasets
• Email sets
• Sub-Reddits and discussion threads
• Image tags,
• Video tags, and others
102
111. General Process (redux)
• Theoretical underpinning(s)
• Research design
• Text collection (searchable file types, file naming protocols)
• Text cleaning (normalization)
• LIWC runs and re-runs (word counts, percentages of words in content
categories)
• Analytic conclusions
• Further research within and beyond LIWC
111
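The "text cleaning (normalization)" step in the process above can be sketched; the specific choices here (Unicode normalization, quote and dash unification, whitespace collapse) are assumptions a researcher might make, not LIWC requirements:

```python
import re
import unicodedata

# A minimal pre-processing sketch for the "text cleaning" step.
# The exact normalization choices are research-design decisions.

def clean_text(raw: str) -> str:
    text = unicodedata.normalize("NFKC", raw)
    # Unify curly quotes and long dashes to plain ASCII forms.
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2014", " - ").replace("\u2013", " - ")
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()

print(clean_text("LIWC\u2019s  output\nacross   lines"))
```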
114. Some Types of Askable Questions
From Computational Linguistic Analysis
114
115. Some Types of Basic Askable Questions
Pre-existent (or “found”) text:
• Are there statistically significant differences in linguistic writing styles
between authors (author style profiles)?
• Authors of different genders? Age groups? Cultures? Languages?
Backgrounds? Experiences?
• If so, what are the differences? Are these consistent differences? Do these
differences hold across different conditions and contexts? What could these
differences mean?
• What is the text profile of “successful” vs. “unsuccessful” genres of
writing? Do such differentiating text profiles exist in a meaningful
way? Are these effects explainable based on the linguistic features?
115
116. Some Types of Basic Askable Questions (cont.)
Pre-existent (or “found”) text (cont.):
• Are there linguistic markers / indicators in text sets that may indicate
particular outcomes in terms of reception of the text / text sets?
Outcomes for the authors of the text sets? Other outcomes?
• Are there linguistic patterns from certain genres of writing? Genres
of writing in certain time periods?
• Are there patterned observable differences between spoken words
and written ones in a particular context? How spontaneous (raw) or
edited (processed) were the respective source texts?
116
117. Some Types of Basic Askable Questions (cont.)
Pre-existent (or “found”) text (cont.):
• What are some summary features (descriptions) of the document or
text corpus? How are function words used in the text set?
• What are some observed sentiment features of the text set? How do
these features correlate with features in the real-world?
• What are some observable psychometric features of the text set?
How do these features correlate with features in the real-world?
• How do the various features of one document or text set compare
and contrast against another?
117
118. Some Types of Basic Askable Questions (cont.)
Elicited text:
• What are some linguistic features of the elicited texts?
• What are some creative prompts for elicited spoken words (like think-
aloud prompts) vs. elicited written words?
• Are there identifiable patterns that may be found in those elicited
texts? Do different prompts result in identifiably different types of
texts, and if so, how?
• How does writing change over time (in terms of observed linguistic
features)?
118
119. Some Types of Basic Askable Questions (cont.)
Elicited text (cont.):
• How does writing change in a pre- and post- intervention scenario?
• What role can writing play as an intervention itself?
• Are there different writing patterns that may be identified among
different people groups (such as based on demographic factors or
categorical factors)? What might this mean?
119
121. Some Challenges with the Word Counting
Method
• Inherent lexical ambiguity and polysemous nature of language (a
counted term can be understood different ways based on the context,
author intention, and usage)
• Focus on the single unigram / one-gram (instead of two-grams, three-
grams, four-grams, and so on, as phrases)
• A lack of contextual awareness in a “bag of words” paradigmatic
approach (except for counts within documents in sets comprised of
stand-alone documents)
• Some small mitigation in terms of the color coding of terms found in a
document that are in the LIWC2015 dictionary, but this requires human “close
reading” of the document (whether academic reading, skimming, or
scanning)
121
122. Some Challenges with the Word Counting
Method (cont.)
• A core base language has to be selected even though there are
dictionaries in English and non-English languages
• Multi-language datasets cannot be run simultaneously (but may be
run individually, with findings applied in a complementary way)
122
124. Definitions: Validation / (In)Validation
Internal Validity
• How well the words represent
the constructs that they are
supposed to represent
• How solidly does LIWC2015
work based on its
conceptualization, creation,
testing, and design
External Validity
• How well identified textual
indicators predict “ground truth” or
“state of the world”
• How well the findings may be
generalized to the world (or the slice
of the world that is being studied)
• Also how applicable the findings
may be to other similar (~) cases
• How findings compare to base rates
of particular textual phenomena in
particular text genres
124
125. Internal (In)validation
• Evaluation of each of the steps to the process, the execution at each
step, and the overall work
• Theoretical underpinning
• Research design
• Text collection
• Text cleaning
• Text analysis instrument functioning
• LIWC runs and re-runs
• Text set treated as individual files and as a collection
• Analytic conclusions
125
126. Testing of Predictive Modeling Accuracy
• Testing predictive modeling based on other measures (created by people or by
other programs)
• Both precision and recall are important for a predictive construct but there may be tradeoffs
between these two features
• To ensure that predicted positives are actual positives, a threshold may be set too high, leaving out many
actual positives but resulting in fewer false positives (so high precision, but low recall)
• To ensure that all the positives that exist in a set are captured, a threshold for inclusion may be set
too low, capturing a lot of false positives (so high recall, but low precision)
• Ideally, both precision and recall should be as high as possible
• In a perfect balance, all positives are true positives, and every single actual positive is identified from a
set
• F-measure / F1 score / F-score (“weighted harmonic mean” between precision
and recall)
• Is expressed as a number between 0 and 1 (where 1 is perfect precision and perfect recall)
• 0 ≤ p ≤ 1
• 0 ≤ r ≤ 1
126
127. F-measure: Precision and Recall
Precision “p” (predicted positive results):
true positives / (true positives + false
positives)
• How sensitive is the test to the
identification of true positives (without
confusing false positives with the true)?
• How much noise is in the results? Is the
test overweighted towards finding
positives and so falsely categorizing false
positives (undesirable)?
• High precision means that an identified
positive is highly likely to actually be a
true positive (and not a false positive).
Low precision means that an identified
positive could well be a false positive.
Recall “r” (capturing of actual positive
results): true positives / (true positives +
false negatives)
• How many of the true positives
have been identified (from the full
set of all true positives
possibilities)?
• High recall means that most or all
of the true positives are identified
by the test.
• Low recall means that many of the
true positives were missed. In this
case, the test is not trusted to
include all possible true positives
because many are missed.
127
128. Testing of Predictive Modeling Accuracy (cont.)
• F1 = 2 / (1/recall + 1/precision)
• F1 simplified = 2 × (Precision × Recall) / (Precision + Recall)
• OR F1 simplified: F = 2 * [ (pr) / (p + r) ]
• An ideal test identifies the target phenomena accurately (p) and
thoroughly (r).
128
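The precision, recall, and F1 arithmetic above is straightforward to implement; the confusion counts in the example are invented for illustration:

```python
# Precision, recall, and F1 from raw confusion counts, matching the
# formulas on the preceding slides.

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    # Harmonic mean: F1 = 2 / (1/p + 1/r) = 2pr / (p + r)
    return 2 * p * r / (p + r)

# Invented example: 80 true positives, 20 false positives, 40 false negatives.
p, r = precision(80, 20), recall(80, 40)
print(p, round(r, 3), round(f1(p, r), 3))  # 0.8 0.667 0.727
```

Note how F1 sits between the two inputs but closer to the lower one, which is why it punishes the precision/recall tradeoffs described above.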
129. Some Limits to Dictionary-Based Classifiers
• Dictionary-based classifier systems tend to be high on “precision” but
low on “recall” (Aoqui, 2012)
• Natural language (particularly in an age of social media) evolves quickly, with
new terms and new word usages occurring constantly
• Dictionaries used in classifiers are often updated through rigorous manual
(“by hand”) processes, which require large human investments and
efforts…and time
129
130. External (In)validation
• Light “sanity check” of text counts
• Comparisons against extant
baselines (if available)
• Comparisons against text results of
a control group (vs. the
experimental group)
• Comparisons against human coding
(if feasible)
• Comparisons with other similar
research (if available)
• Comparisons of phenomena based
on other similar / dissimilar text
sets
• Testing of linguistic indicators in
other similar (~) contexts for
applicability
• Fine-tuning
• Testing of linguistic insights with
“ground truth” (assessed by other
means)
130
132. Some Other Common
Computational Linguistic Analysis Tools
• The Computational Social Science
Lab (CSSL) at the University of
Southern California’s Text Analysis,
Crawling and Interpretation Tool
(TACIT)
• http://tacit.usc.edu/
• Free and open-source
• Art Graesser’s Coh-Metrix
program (coherence metrics in
text)
• http://cohmetrix.com/
• Rod Hart’s DICTION program
• http://www.dictionsoftware.com
• Tom Landauer’s Latent Semantic
Analysis
• http://lsa.colorado.edu
• CASOS’ AutoMap (network text
analysis)
• http://www.casos.cs.cmu.edu/projects/automap/
• Free
132
134. File Export Formats
• No direct way to save a project in LIWC2015 so need to “Save Results”
• “Analyze Text” and “Categorize Words” data export as the following
file types:
• .txt (ASCII text)
• .csv (comma separated values)
• .xlsx (XML spreadsheet file format in Excel 2007 onwards)
• “Color-Code Text” function data results cannot be saved out directly
but may be copied and pasted in MS Word with color intact (not in
Notepad or simple text editors)
134
135. From LIWC2015 -> Other analytics
• Light analytics and counts in LIWC2015
• No access to coded text sets from which numbers are extrapolated
(so a kind of black-box processing except for the software manual
documentation)
• Export…depending on the text curation process…sometimes requiring data
restructuring…sometimes requiring information and assumptions beyond the
extracted texts…
• …In Excel: mostly descriptive and comparison data
• Range of quant processing (averaging data, summing data, and others)
• Data visualizations (stacked bar charts, line graphs, and others)
135
136. From LIWC2015 -> Other analytics (cont.)
• …In SPSS: asking harder questions based on the studied texts
• Statistical significance computations
• Chi-square computations
• Factor analyses
• Content analysis through human “close reading”
• LIWC2015 variables involve (reproducible) counts and some psychometrics,
but its use is always through a researcher interpretive lens
• Researcher identifies what is relevant and why
• Researcher brings knowledge of the topic and field to the interpretation
136
138. Initial Impressions of LIWC
• Functions are simple and mechanistic: counting and tallying
• User interface is simple
• Processes are simple
• Potential is intriguing and promising:
• Power lies in the dictionaries and the domain-based insights
• Power lies in the respective text sets for sufficiency of data
• Power lies in discovery and hypothesis-testing
• Power lies in surfacing insights that would not be knowable otherwise in an
efficient way
138
139. Conceptualizing the Extracted Data
• Because these tools offer predictions based on probability, however,
such insights will never be definitive. “In the final analysis, our
situation is much like that of economists,” (James W.) Pennebaker
says. “It’s too early to come up with a standardized analysis. But at
the end of the day, we all are making educated guesses, the same way
economists can understand, explain and predict economic ups and
downs.”
• – Jan Dönges, “What Your Choice of Words Says about Your Personality,” July
1, 2009, Scientific American
139
140. Some Newbie Observations about the
Software Tool
• LIWC2015 (and computational linguistics) is young yet and still very much
in the exploratory phases
• There are challenges with setting baselines against which subsets and new
sets of texts may be compared for insights; likewise, there are challenges in
setting control groups for experimental research
• Some insights are domain-specific (and culture specific), and others may be
general and cross-domain
• A major trap involves over-asserting from limited observations (and limited
text sets)
• Facile interpretations are risky and should be avoided
• J.W. Pennebaker: “Don’t trust your instincts.”
• LIWC2015 is a lot of fun to use (maybe a little dangerously so)!
140
141. Some Newbie Observations for Researchers
• As with some software tools, it is easy to come up with a lot of data
with just a few clicks…but accurate analysis will require the following:
• Understanding
• the strengths and weaknesses of the tool
• the strengths and weaknesses of the curated text sets
• with some text sets more amenable to the application of computational linguistic
analysis than others
• with Heaps’ law / Herdan’s law implications to the amount of text and diminishing
returns on numbers of distinct vocabulary elements after a certain amount of collected
text
• the particular domain / discipline
• the research context
141
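The Heaps’ / Herdan’s law point above (diminishing returns on distinct vocabulary after a certain amount of collected text) can be observed directly by tracking distinct word types against the running token count; the sample text below is a stand-in:

```python
# Track vocabulary growth (distinct word types) against running token
# count to observe Heaps'/Herdan's-law-style diminishing returns.

def vocab_growth(tokens: list[str]) -> list[int]:
    seen: set[str] = set()
    growth = []
    for t in tokens:
        seen.add(t.lower())
        growth.append(len(seen))
    return growth

# Stand-in text: repetition makes the curve flatten as tokens accumulate.
tokens = ("the cat sat on the mat and the cat saw the rat " * 3).split()
g = vocab_growth(tokens)
print(g[11], g[-1])  # types after 12 tokens vs. after all 36 tokens
```

On real corpora the curve keeps rising, but sublinearly, which is the practical warning here: past a point, more text adds little new vocabulary.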
142. Some Newbie Observations for Researchers
(cont.)
• Accurate analysis will require understanding… (cont.)
• what to select as relevant from the mass of numerical data
• ways to test the findings (internal and external validation)
• Internal: the text set(s)
• External: the world (what the text findings may indicate about the author…the state of
the genre…the state of the world, often through statistical means)
• ways to hypothesize about and interpret the findings
142
143. Going to the Source
• “The Development and Psychometric Properties of LIWC2015” by James
W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn
addresses the following and more:
• Psychometric baselines
• Examples
• Test text corpora details
• Internal consistency measures of the various psychological constructs (based on
uncorrected α and corrected α)
• Uncorrected α based on average of standard Cronbach’s alpha calculation for the term
across corpora
• Challenges of highly variant base rates of word occurrences across documents and corpora
• Corrected α based on Spearman-Brown prediction (prophecy) formula which takes into
account the amount of underlying text (“test length”) in attributing internal reliability (the
more data, the stronger the potential reliability)
143
144. Internal Consistency of Words as Indicators of
Psychometric Constructs…and Confidence Levels
• The built-in psychological constructs carry a range of confidence based
on prior research across large and diverse text corpora.
• The higher the internal consistency measures (usually measuring correlations
between different items on the same test / words in the same construct, and
measured on a restricted scale of 0 – 1 outputs), the greater the reliability of the
software tool in using language to identify a psychological construct (and
therefore the higher the confidence users may have in the output).
• Uncorrected α tends to “grossly underestimate reliability in language categories
due (to) the highly variable base rates of word usage within any given category”
• Corrected α is based on the Spearman-Brown prediction formula and this is
considered “a more accurate approximation of each category’s ‘true’ internal
consistency” (Pennebaker, Boyd, Jordan, & Blackburn, 2015, p. 8).
144
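The Spearman-Brown prediction (prophecy) formula cited on these slides projects reliability when test length changes: ρ* = kρ / (1 + (k − 1)ρ), where k is the factor by which the length (here, the amount of underlying text) grows. A sketch with illustrative numbers:

```python
# Spearman-Brown prediction (prophecy) formula: projected reliability
# when a test is lengthened by a factor k. More data -> stronger
# potential reliability, as the slides note.

def spearman_brown(rho: float, k: float) -> float:
    return (k * rho) / (1 + (k - 1) * rho)

# Doubling the text length lifts a 0.50 reliability to about 0.667.
print(round(spearman_brown(0.50, 2), 3))
```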
145. Basic Approaches for Researchers
Researchers…
• study the built-in dictionary and other available .dics in LIWC2015
• read up on the related academic research literature (and extract the
methods that make the most sense)
• experiment with testable hypotheses about particular textual
datasets in particular disciplines
• trial-run the tool on datasets about which the researcher is already
intimate to get a sense of the tool
• sample broadly in terms of texts *and* sample texts in a targeted and
strategic way, too
145
146. Basic Approaches for Researchers (cont.)
Researchers… (cont.)
• practice cleaning (pre-processing) textual data for processing in
LIWC2015
• formulate and temper hypotheses based on the LIWC findings
• work to find counter-evidence to one’s hypotheses (both seeking
validation and invalidation)
• use in-world knowledge to interpret and test the findings from
LIWC2015
146
147. Advanced Approaches for Researchers
Researchers…
• design their own research approaches using LIWC2015
• use a variety of research methods to capture the desired knowledge
more fully (so not just using LIWC alone)
• create their own dictionaries for limited and targeted questions
147
148. Interpreting One Document
(this slideshow in an earlier iteration)
• Scores vs. counts (with computed percentages)
• Scores involving calculations (some algorithmic processing) of raw counts
• Counts (raw) within groups of dimensions (and computed percentages)
148
149. Addendum: An Applied Example
… based on this slideshow v. 1 and v. 2…
149
Data Visualizations: The following data visualizations were mostly created after a second run of the slideshow
was done in LIWC2015, so there are some small discrepancies between the posted numerical data and
the numerical data used to create the data visualizations (in Excel). The slideshow itself changed with the
addition of the analytical data from LIWC…and from updates as the presenter started to better understand
the software (still as a neophyte). Sorry about any inconvenience from the data discrepancy.
150. A LIWC Run on this Slideshow
150
LIWC2015 output row for “LIWC-ing at Texts for Insights from Linguistic Patterns.pdf” (Segment 1), regrouped from the original spreadsheet row:
• Summary: WC 5888, Analytic 95.05, Clout 51.70, Authentic 22.36, Tone 37.80, WPS 46.36, Sixltr 35.39, Dic 62.42
• Function words: function 31.10, pronoun 2.79, ppron 0.39, i 0.02, we 0.05, you 0.07, shehe 0.00, they 0.25, ipron 2.39, article 4.55, prep 13.20, auxverb 3.52, adverb 2.09, conj 6.22, negate 0.54
• Other grammar: verb 8.14, adj 5.06, compare 2.55, interrog 0.87, number 6.08, quant 2.55
• Affect: affect 3.09, posemo 1.80, negemo 1.12, anx 0.27, anger 0.20, sad 0.17
• Social: social 4.45, family 0.07, friend 0.05, female 0.07, male 0.07
• Cognitive processes: cogproc 14.45, insight 4.26, cause 3.16, discrep 0.61, tentat 3.31, certain 1.00, differ 3.57
• Perceptual processes: percept 1.14, see 0.51, hear 0.25, feel 0.12
• Biological processes: bio 0.46, body 0.10, health 0.24, sexual 0.03, ingest 0.08
• Drives: drives 4.79, affiliation 0.88, achieve 1.26, power 1.70, reward 0.97, risk 0.27
• Time orientation: focuspast 1.32, focuspresent 4.38, focusfuture 0.85
• Relativity: relativ 9.21, motion 0.95, space 6.40, time 1.75
• Personal concerns: work 5.25, leisure 0.39, home 0.10, money 0.14, relig 0.05, death 0.08
• Informal language: informal 0.68, swear 0.02, netspeak 0.54, assent 0.02, nonflu 0.12, filler 0.00
• Punctuation: AllPunc 32.69, Period 2.43, Comma 5.01, Colon 2.38, SemiC 0.39, QMark 1.02, Exclam 0.02, Dash 2.09, Quote 1.46, Apostro 0.19, Parenth 7.46, OtherP 10.26
151. Summary Language Variables
Scores (/100, as a percentile measure)
• Analytic: 95.26 (score)
• Clout: 55.01 (score)
• Authentic: 22.15 (score)
• (Emotional) Tone: 40.94 (score)
Counts and percentages
• WC (word count): 4281 (raw count)
• WPS (average words per
sentence): 47.57 (average)
• Sixltr (words > 6 letters): 36.32
(percentage)
• Dic (percentage of words
captured by the built-in
dictionary, i.e., the application of
the dictionary to the text): 63.68
(percentage)
151
152. Linguistic Dimensions
(% of words in a text that fit in a certain linguistic category, including 21 possible dimensions)
• Function (vs. non-function words): 31.51
• Pronoun: 2.66
• Ppron: 0.40
• I: 0.00
• We: 0.05
• You: 0.07
• Shehe: 0.00
• They: 0.28
• Ipron (impersonal pronouns): 2.27
• Article: 4.58
• Prep: 13.50
• Auxverb: 3.53
• Adverb: 2.13
• Conj: 6.49
• Negate: 0.35
152
159. Social Processes
• Social: 4.56 (summary percentage for the category, computed before
rounding, so the rounded subcategory numbers do not sum to it exactly)
• Family: 0.05
• Friend: 0.05
• Female: 0.07
• Male: 0.07
159
167. Drives
• Drives: 4.25 (summary percentage for the category, computed before
rounding)
• Affiliation: 0.79
• Achieve: 1.17
• Power: 1.47
• Reward: 0.68
• Risk: 0.30
167
176. Extrapolated Content Summary (rough)
176
[Bar chart: Extrapolated Content Summary (rough) — percentages for function, affect, social, cogproc, insight, percept, bio, drives, "timeorient", relativ, "perscons", informal, and AllPunc, on a y-axis from 0.00 to 35.00]
177. Extrapolated Content Summary (rough) (cont.)
• So: a slideshow focused on cognitive processes, relativity, time
orientation, personal concerns, drives, social processes, and insight, in
descending order (based on the prior summary, which does not sum
cleanly to 100% because of rounding) (prior slide)
177
[Bar chart: A Rough Content Summary — y-axis from 0.00 to 35.00]
179. Four Summary Language Dimensions of this
Slideshow (in a spider chart)
179
[Spider chart, 0–100 scale: Analytic 95.26, Clout 55.01, Authentic 22.15, Emotional Tone 40.94 — Four Summary Language Dimensions of this Slideshow]
180. Four Summary Language Dimensions of this
Slideshow
Descriptive?
• Do these summary dimensions
capture the human close-
reading sense of the slideshow?
• If so, what are its insights?
• If not, where does it fall short?
Prescriptive?
• Analytic: Should this work be so
highly analytic? (score: 95.26)
• Clout: Should there be more
effort to raise the clout score to
indicate more expertise,
confidence, status, and
leadership? (even though the
presenter is a newbie to LIWC)
(score: 55.01)
180
181. Four Summary Language Dimensions of this
Slideshow (cont.)
Descriptive?
• What are ways to capture other
aspects of the text or texts
beyond the summary language
dimensions? (and outside of
LIWC)
Prescriptive?
• Authentic: Should the language in
this slideshow be more personable
and authentic? Less guarded?
More vulnerable? (score: 22.15)
• Tone (emotion): Given its
emotional tone, which is trending a
little negative (< 50), should more
effort be made to make it trend
more positive? (score: 40.94)
181
182. Comments? Questions?
• Any insights about challenges to interpreting the data without
baselines? Control text sets? Without comparatives?
• Insight about why the color coding of the target document or text set
may be helpful?
• Ideas for new applications of LIWC2015? Fresh research ideas?
• Strengths of the tool and research methodology? Weaknesses? Ways
to strengthen this approach?
182
183. Conclusion and Contact
• Dr. Shalin Hai-Jew
• iTAC, Kansas State University
• 212 Hale / Farrell Library
• shalin@k-state.edu
• 785-532-5262
• No ties: The presenter has no tie to the maker of LIWC (Pennebaker Conglomerates, Inc.).
• Data visualizations: The simple data visualizations were created in Microsoft Excel using a re-run
of the LIWC tool over this revised slideshow, so the numbers in the data visualizations are slightly
different from the numbers in the first set used in the Addendum: An Applied Example (Slide 149
onwards).
• Newbie alert! Also, the presenter is a newcomer to LIWC and is still LIWC-ing around. If you see
an error, please contact the presenter, so the slideshow may be corrected. Thanks!
• And less-newbie learning: A new section was added as the presenter worked on her first custom
dictionary, so if you downloaded an earlier version, please download this current one.
183