SlideShare uma empresa Scribd logo
1 de 183
Baixar para ler offline
LIWC-ing at Texts for Insights from
Linguistic Patterns
Overview
• Since the mid-1990s, researchers have been using the Linguistic
Inquiry and Word Count (LIWC, pronounced “luke”) software tool to
explore various text corpora for hidden insights from linguistic
patterns. The LIWC tool has evolved over the years. Simultaneously,
research using computational text analysis has evolved and shed light
on areas of deception, threat assessment, personality, predictive
analytics, and others. This presentation will highlight some of the
applications of LIWC in the research literature and showcase the tool
on some original text sets.
2
3
Content Overview
• Notes About Language
• Reading = Decoding; Writing =
Encoding
• Computational Linguistic Analysis
• Curation of Text Sets
• LIWC: Linguistic Inquiry & Word
Count
• Work Space in LIWC2015
• Structure of a Custom External
Dictionary
• Insights from Experiences Working
on a Custom Dictionary* (added)
• Creating .dic Files* (added)
• LIWC in Applied Research
• A Basic Walk-through with
LIWC2015
• Live Demos
• Some Types of Askable Questions
4
Content Overview(cont.)
• Challenges with Internal and
External Validation
• Some Other Computational
Linguistic Analysis Tools
• Some Other Approaches to the
Data
• Conclusion (and some Newbie
Observations)
• Addendum: An Applied Example
5
Notes About Language
6
Some Generalities About Language
• Most language is natural language which evolves common practices and
structures over time based on human interaction (vs. constructed
language, like Esperanto, or those created for the silver screen).
• Language evolves over time based on human usage, particularly in local geographical
areas.
• Unique dialects may develop locally in particular regions or within certain social groups.
• Language itself tends to be patterned but not necessarily internally logical.
• Modern languages originate from language families and are influenced by other
languages.
• Languages are shared codes (oral and written) for people to communicate
and exchange information.
• Because languages have to be understood broadly, they tend to be highly patterned.
• Modern languages tend to have written and phonology aspects; they tend to include
both content (semantic) and structure (syntactic) aspects.
7
Some Generalities About Language (cont.)
• Only 200 of the world’s 6,000 – 7,000 languages have a written
version; most are / have been oral only.
• Language is social; it plays a core role in how people make meaning and
interact with each other.
• Changes in a language (based on new technologies, interactions between
cultures, and fashion) are often adopted first orally and then integrated in
more formal written forms.
• World’s languages are disappearing as their users discontinue the
uses of the languages for more commonly shared languages (“Lists of
endangered languages”).
• Globalization has complex effects on world languages.
8
Some Generalities About Language (cont.)
• Semantic terms tend towards polysemy (being multi-meaninged) and
nuance, and so are inherently ambiguous.
• Words must be understood in context (translate: proximity to the target term) to
understand their particular respective word sense (connotative application vs. only
denotative).
• There are statistical probabilities for which meaning of a word is likely being used,
and based on proxemics terms to the target term, it is possible to “understand” the
particular meaning of a term in a context.
• Language contains high dimensionality data; it involves many facets.
• Language has text and subtext as well, so the meanings conveyed are not only
surface ones but some hidden (or latent) aspects.
• People wield language in non-obvious ways, such as by using humor, irony,
symbolism, historical referencing, tone, and other aspects.
9
Changing Roles of Writing in Societies
• Writing used to be practiced by those with political and social power,
and their creations were based on formal structures and conventions.
• Originally focused on religious issues
• Broadened to address issues of interest for the literate upper and political
classes
• Writing is now way more practiced by the masses, who are much
more broadly literate.
• Topics cover anything of interest but still along certain code-able topics (in
terms of library system labeling, and others) for formal publishing.
10
Common Forms of Writing
Non-fiction
• Journalism
• Essay writing
• Autobiography, memoir
• Biography
• Research writing
Fiction
• Poetry
• Short Stories
• Novelas
• Novels
• Plays and scripts
11
Common Forms of Writing (cont.)
Non-fiction
• Documents
• Letters
• Oral histories
• Interviews,
• Manifestos,
• Statements, and others
Fiction
• Songs
• Jokes
• Synthetic data for dummy case-
or scenario- research, and others
12
Reading = Decoding Text
Writing = Encoding Text
“A language is a fecund, redolent buzzing mess of a thing, in every facet, glint, and corner, even in single
words.”
-- John McWhorter, What Language Is (2011)
13
encode <-> decode <-> encode <-> decode <-> encode <-> decode <-> encode <-> decode <-> encode <-> decode <-> …
The two activities hold each other in tension and constrain themselves and the other.
Complementary Human and Machine
Reading
Human: Close Reading
• Informed by training,
experience, personality,
intellect, emotion, and other
factors
• Full sensory: sight, smell, taste,
touch, hearing, and
proprioception (embodied)
Computer: Distant Reading
• May include supervised or
unsupervised machine learning
• More efficient and scalable than
human reading
• Results in objective counts
14
Complementary Human and Machine
Reading (cont.)
Human: Close Reading
• Interpretive and subjective,
filtered through the person
Computer: Distant Reading
• Objectivist
• Reproducible (theoretically and
practically)
15
Some Common Distant Reading Approaches
• Human and computer (supervised machine learning):
• XML tagging and data queries of the tagged texts
• Literary analysis including on dimensions of time, characters, dialogue, locations, and
other aspects (such as in the digital humanities)
• Coding by existing pattern (with original human coding: emergent, a priori, or
mixed)
16
Some Common Distant Reading Approaches
(cont.)
• Human and computer (data queries):
• Text frequency counts for issues of main focus
• Both:
• Absolute frequency counts
• Relative frequency counts (relative to the document and corpus)
• Mitigation of counts based on how frequently a term appears in a document and corpus,
with more common word appearances diluting the informational importance of that
word (TF-IDF)
• Word search to find all word contexts for word and phrase disambiguation
17
Some Common Distant Reading Approaches
(cont.)
• Computer (unsupervised machine learning):
• Sentiment analysis (against a pre-coded sentiment word set)
• Emotion analysis (based on a number of different models)
• Derivation of gender, personality [Big 5 Personality Traits, Dark Triad
Personality traits (narcissism, Machiavellianism, and psychopathy), with
evidence of “within-person stability” that enables profiling and
comparisons/contrasts between people], age, cultural background, and
others
• Remote profiling (with “zero interaction”)
• Predictive analytics, such as anticipation of actions by leaders based on public
and private signaling (extrapolation of intentionality)
18
Some Common Distant Reading Approaches
(cont.)
• Computer (unsupervised machine learning) (cont.):
• Deception detection and analysis (such as through “pronoun drop”)
• Stylometry (including author identification): convergence of linguistic
features to an (un)identified author
• “psychological signatures” of writers based on their written works and communications
and extending to profiles
• Domain mapping (topic extraction)
• Theme and subtheme extraction / topic modeling (unsupervised machine
learning)
• Machine-extraction of textual features that are “tells” for certain
outcomes
19
Computational Linguistic Analysis
(as a subset of “distant reading” capabilities)
20
Brief History of Quantitative Approaches and
Linguistics
• Manual or “counting by hand” back in the day
• Computational systems tested against human experts in a particular
field or domain
• Computational linguistics in 1950s at beginning of the Cold War to
translate texts from foreign languages into English (machine
translation)
• Used in translating spoken language to text translation
• Used in summarizing texts (topic modeling) at scale for search tools
(“Computational linguistics,” Apr. 1, 2016)
21
Computational Linguistic Research Design Built on
Theories, Models, and Empirical Research
• Informed by research in language, (social) psychology, computer
science, and other fields
• May involve text exploration, discovery, or targeted research
questions (or some combination)
• Building on theories, models, and empirical research
• Hypothesizing based on theories and models
• Grouping writing based on particular outcome variables to identify
differences in writing, using selected observed indicators in (written and
spoken) language as potential indicators of difference between the groups
with the differing outcomes
• Using a combination of insights from theories, models, and empirical research
22
Computational Linguistic Research Design Built on
Theories, Models, and Empirical Research(cont.)
• Relationship between natural language expression and…
• hidden internal states of people (gender, personality, cognition, state of mind,
intentionality, etc.) and hidden internal states of groups and cultures
• health
• genres of writing
• different language structures
• gender differences
• Language features as certain “tells” (indicators, signs, signals)
• Reverse engineering backwards in time
• Predictive analytics forwards in time
23
Computational Linguistic Research Design Built on
Theories, Models, and Empirical Research(cont.)
• So essentially: plaintext = code (indicators) of latent (hidden) realities
• So can profile various genres of text for general characteristics / baselines
• So can compare new exemplars of particular texts against baselines
• So can profile an “unknown” text based on its quant characteristics
• So can compare historical texts against future ones
• So can compare historical occurrences and related texts…and possibly apply
in predictive ways into the future
24
Creation of Dedicated Dictionaries
• Informed by in-world texts
• Suggested words and stems and synonyms
• Vetted by people
• Empirically tested for research value
• Does the dictionary provide practical research insights?
25
Consumptive and Non-Consumptive Text
Analysis
Consumptive Text Analysis
• Access to the analytics AND the
underlying text set(s)
Non-Consumptive Text Analysis
• No access to the underlying text
set
• Google Books Ngram Viewer is
one popular example of non-
consumptive computational text
analysis (with access to the
shadow text set of ngrams only)
26
General Sequence
• Theoretical underpinning
• Research design
• Collection of target text
documents into corpora
• May need to negotiate the release
of particular rights
• Preservation of raw data into a
pristine master set
• Development of familiarity or
intimacy with the text sets
(through close reading and other
types of explorations)
• Translation of non-base
languages to the base language
(or separation for different data
runs using different language
dictionaries)
27
General Sequence(cont.)
Text cleaning
• Separating each text bit into its own
file based on a unit of analysis (quote,
paragraph, article, section, novel, or
play, etc.)
• Data normalization / spell check
• Clear and representative file naming
protocols
• De-identification of data (if relevant)
• Cleaning of notes from transcripts,
and others
Text file transcoding (with close
observations of data capture and data
loss at each phase)
• Images with word content are often
not represented as textual contents
• Metadata may / may not be captured
• Sequences of text processing,
software used, and such, affect the
word counts
• Preferred: .pdf -> MS Word (less lossy)
• Not preferred: .pdf -> .txt (lossy)
28
General Sequence(cont.)
• File formatting (.txt., rtf, .pdf,
.doc, .docx, .csv, .xl, xlsx, NOT
.pptx, .ppt, .wpd, .jpg, .png) for
LIWC2015
• Versioning of text corpora for
different queries
• “Bag of words” paradigm or
structure / context preservation
• End-sentence markers for
sentence length
• Extraction of data tables
• Creation of data visualizations
from the extracted data
• Interpretations and analyses
• (In)validation of the linguistic
analysis
29
Sense-making from Linguistic Patterning
• Starting with known information and prior research (and prior theory)
that will inform the analyses; may be based on a stated hypothesis
• Selecting relevant texts that are comparable along particular
dimensions
• May use from-world text corpora (such as based on dichotomous
nonparametric outcome variables or multi-factor outcomes)…and
looking for linguistic differences and similarities
30
Sense-making from Linguistic Patterning (cont.)
• Identifying linguistic “markers” / indicators that are a “tell” for a
particular construct or state-of-the-world
• Setting baselines (controls) for certain types of texts and then
comparing a subset against the general baselines
• Looking at text clustering as factor loading and applying human
understandings of the respective factors
• Comparisons and contrasts across dimensions of texts (such as text
types across cultures)
31
“Linguistic Style”
• Study of semantic terms makes sense to human readers and aligns
with how the human brain works (in terms of what is noticed /
perceived and remembered in a text), but semantic terms and unique
phrasing are eminently emulatable and manipulate-able
• Point is to find indicators that are not so easily tampered with by people who
may want to manage impressions
• Going to “function words” or particles (articles, pronouns,
prepositions, conjunctions, auxiliary / helping verbs, adverbs,
negations, etc.)
32
Curation of Text Sets
33
Selected Texts from a Domain
• Delimiting of targeted texts helps focus the function of the software
• All included texts should relate to the topic being studied
• Said to work better with natural language text than non-natural
language text
• Texts do not have to be only one type, but if they are mixed text sets,
that should be noted in the analysis
• Types of text data should be reflected in the types of dictionary dimensions
(and categories) applied…as well as the selected language dictionary
34
35
Curation of Text Sets
1. Gathering of textual and non-textual (such as multimedia) data
2. Selection of relevant texts
3. Arrangement of rights releases as needed (staying legal)
4. Transcoding of multimedia content to textual and textual-to-textual
5. (Non-destructive) data cleaning: normalization of terms, spelling,
foreign language translation, treatment of symbols and punctuation,
and others (with archival of all raw files in pristine format prior to
any normalizing, for “non-destructiveness”)
6. Data segmentation / grouping
7. Data / file labeling
36
Curation of Text Sets (cont.)
8. Data formatting / file conversions
• Must be searchable (machine readable, with optical character recognition /
OCR) files in any of the following formats: .pdf, .doc, .docx, .txt, .rtf, .xl, .xlsx,
.csv, etc. (Ability to read within and across columns in spreadsheet formats is
a new LIWC2015 capability.)
• Considerations for digital preservation, with common strategies of going to
the lowest common denominator files and open source if possible (.txt, .rtf,
.csv, .html, .xml)
9. Metadata creation (linked to the respective files or kept in a
README file with the textual data)
37
Curation of Text Sets (cont.)
10. Conducting of research on the text set: Inherent enablements for certain
types of data queries and data explorations and research based on
dataset contents and structure (such as test text sets to train new
models)
11. Scrub of dataset for publishing
12. Descriptions of the text set: Its origins, its contents, its quantitative
numbers, its standards for text inclusion, its copyright releases, its prior
uses, its potential uses, proper citation methods, originator’s /
originators’ contact information, and others; datasets named based on
curated textual and other contents or the data curator or some other
naming method (for easier reference, for building up a user base)
38
A “Sanity Check” for Text Processing
• Transcoding from one document type to another often results in
information loss because of how each software program handles the
transferred information.
• Data lossiness:
• There is some degree of expected lossiness. For example, text in images will not be
recognized unless there is optical character recognition (OCR) applied.
• Embedded videos will not have a text equivalency unless a transcript is also downloaded
and included with the text document, corpus, or corpora.
• In multi-lingual text files, messages that are not in the main base language may not be
transferred accurately.
• Some valid words may turn to garble in the transcoding.
39
A “Sanity Check” for Text Processing (cont.)
• Added extraneous data: There is some degree of extraneous information
included, such as web page data in between-page gutters captured in a web-
to-PDF capture.
• Comments made to a .pdf may be included in a transcoding context, say, to
Word.
• There may also be extra information in web pages if they are not captured in
print style but include advertising in designated ad spaces and pop-up
windows.
• Print styles of web pages are not always included as an option.
40
A “Sanity Check” for Text Processing (cont.)
• If batch processing, check results with smaller sets first. Check the
outputs as well.
• In automation, it is possible to lose information if this is done unthinkingly.
• To see if there are systematic challenges with losing and / or gaining
text during transcoding, run a “sanity check” after processing data to
see how much of the original was preserved.
• One type of “sanity check” is to run a simple average word count of a
particular unit (document, message, or other).
• Does the per-unit word count jibe with what is observed by the researcher?
If not, it’s important to figure out a less lossy way of capturing query-able
files.
41
A “Sanity Check” for Text Processing (cont.)
• There may be differences between “Save as,” “Export as,” “Print as,” “Send
to,” and other sorts of functions that enable transcoding between file
types.
• Allowing character substitution or not will affect the transcoded contexts
from MS Word.
• For those text files with UTF-8 characters, it is important to ensure that the
coding enables UTF-8 characters.
• There are more optimal sequences and technologies to move text from one
file type to another, so researchers should experiment with what works
best for them.
• In a PDF file, go to .docx (via MS Word) to capture much more recognizable text (vs. a
.txt or a .rtf).
42
Creation of Text Metadata from Multimedia
Sources
• Types of files:
• Digital imagery, audio files, video files, slideshows, games, simulations, and
others
• Analog-to-digital files (transcoding)
• Text versioning: metadata descriptors, transcripts, locational
information, coding (whether manual, automated, or mixed) and
others
• Using the extracted text transcripts for linguistic analysis…but a step or two
out from the original source
• Updated capabilities to read image files (in PDF) to text in an automated way
43
LIWC
Linguistic Inquiry and Word Count
44
LIWC and its History
• Developed in early-to-mid 1990s by Martha E. Francis (then a grad
student and programmer) and James W. Pennebaker (1993) to study
possible therapeutic use of language
• Named LIWC (Linguistic Inquiry and Word Count), helpfully descriptive but
also disambiguated; “LIWC” pronounced “luke” (according to its makers)
• Comprised of two parts: (1) a processing component and (2) dictionaries
(based on certain categories of data and / or constructs)
• Factors broadening rapidly in v. 1 to 80 factors (variables)
45
LIWC and its History(cont.)
• v. 2 evolved with an expanded dictionary and more modern software
processing capabilities (2001), also known as SLIWC (Second LIWC)
• LIWC2007 has even broader dictionary capabilities by James W.
Pennebaker, Roger J. Booth, and Martha E. Francis (2007)
46
LIWC and its History(cont.)
• Most recent version is LIWC2015 (Pennebaker, Boyd, Jordan, &
Blackburn, 2015), with new software and new dictionary (vs. an
upgrade) and extensive documentation
• LIWC2015 dictionary contains nearly 6,400 words, word stems, and emoticons
• “Each dictionary entry additionally defines one or more word categories or
subdictionaries”
• Includes a feature to include customized dictionaries
47
LIWC and its History(cont.)
• Systematic process of LIWC2015 dictionary creation involved building on
LIWC2007, having 2-6 judges individually generating word lists, having
collected words analyzed by a group of 4-8 judges, application of a
Meaning Extraction Helper to set base rates of word usage in the wild,
creating candidate word lists of terms possibly missed by judges, and
psychometric evaluation of respective words’ influences on the constructs,
then refinement of the terms, and a re-review of the prior steps to catch
potential errors (Pennebaker, Boyd, Jordan, & Blackburn, 2015, pp. 5 – 6)
• Is a well documented software tool (rare)
• Is tested for both internal validity (based on real-world text sets) and
external validity (based on research designs, with validity not applicable
just across-the-board but in case-by-case bases)
48
LIWC and its History(cont.)
• Tested again large text corpuses: blogs, “expressive writing,” novels, natural
speech, NY Times, and Twitter to set baselines
• LIWC2015 captures “on average, over 86 percent of the words people use in writing and
speech” (Pennebaker, Boyd, Jordan, & Blackburn, 2015, p. 10)
• Fairly high correlations between LIWC2007 and LIWC2015 means (p. 13)
• Removal of categories “largely due to their consistently low base rates, low
internal reliability, or their infrequent use by researchers” (past tense verbs,
present tense verbs, future tense verbs, human words, inhibition words,
inclusives, exclusives); versions 2001 and 2007 enabled
• Internally validated on a variety of psychometric dimensions;
backstopped by empirical research across a number of modern
languages
• Informed by decades of empirical research
49
Internal Consistency Measures across
Variables
50
Textual Baselining
51
LIWC and its History(cont.)
• As a commercial product
• May be purchased (http://liwc.wpengine.com/) from Pennebaker Conglomerates,
Inc.
• Free trial version may be accessed (http://www.liwc.net/tryonline.php) but with text
size limits
• Includes a LIWC API with all LIWC2015 variables “plus 30+ additional validated
measures of psychology, personality, moetion, tone, sentiment and more—all in real
time” and access to “social media integration, time-series analysis, statistical models,
machine learning models and more” (through Receptiviti)
• Is not open-source
• Runs on both Windows PCs and Macs OSes via the Java Virtual Machine
• May be downloaded to the local machine or accessed as a web-based
version
52
LIWC and its History(cont.)
• Processes text files sequentially, finds each word, looks to see if it is
its built-in dictionary, counts that word, and increments as a straight
count and then applies that count to a simple percentage function (%
of the complete document) or a variable-based scale function (based
on dedicated algorithms) for psychometric, psycholinguistic, or other
human-related measures
• Outputs one of the following:
• Raw counts
• Frequency percentages
• Processed scores (percentiles)…but no access to the coded text sets
53
LIWC and its History(cont.)
• Evolution of built-in dictionaries over time based on documented
standards and researcher majority consensus
• Based on English (with 171,000 “English” words in use, 100,000 English words
used by the average native English speaker)
• Considered the foremost linguistics analysis tools in use today
• Backstopped by hundreds of research articles
• Can handle any language representable by UTF-8 charset / character set (but
analytics done in a base language)
54
Human-Created Non-English Translations
based on LIWC2001 or LIWC2007
Available
• Spanish
• German
• Dutch
• Norwegian
• Italian
• Portuguese
In Process
• Arabic
• Korean
• Turkish
• Chinese
55
Downloadable External Dictionaries (.dic)
from LIWC2007 and LIWC2001
LIWC2007
• Spanish
• French
• Russian
• Italian
• Dutch
LIWC2001
• German
56
Downloadable Customized Dictionaries
• Dedicated site for dictionary downloads also enables access to user-
created dictionaries
• Four were available as of mid-2016
• Two were coherent (structurally and conceptually), and of those, one was a
sample one to show how to set up a dictionary for use in LIWC2015
• Linguistic analysis dictionary-creators need to be expert in an area of research
• They need a clear grasp of the language that they are using
• They need to work with others to ensure that the linguistic analysis dictionary is as
comprehensive and as accurate as possible
• Such dictionaries—like any research instrument—should be fully documented and tested
for validity (of construct) and reliability (of consistent results); the first is based on the
subject matter field, and the latter is based on counting (which is usually very high
reliability for item counting)
57
58
Exporting Pre-Built Internal Dictionaries
• Can export internal dictionaries (LIWC2001, LIWC2007, and
LIWC2015) as “posters” in secured (non-editable) .pdf files
59
Original Customized External Dictionaries
• How to create:
• Conceptualize a construct
• Identify terms that fit that construct
• In a text editor, list in the proper format (next slide): first the constructs and
then the terms in each of those constructs
• Be sure to have the proper placement of the % and %
• Version the file as a .dic file (changing the text file extension of a basic text file
BUT—go to Slide 75 for the easiest way to create a .dic file that works); in
LIWC2015, .doc or .docx or .txt files may be used as dictionary files as long as
the other formatting is in place
60
Original Customized External Dictionaries (cont.)
• Adding to LIWC (for the analyses)
• Dictionary -> Load New Dictionary
• Can only run one dictionary at a time, but can run various dictionaries over
the same text set for different insights
• Conduct research and test the respective constructs for internal
reliability and external validity
61
Structure of a Custom External Dictionary
%
1 Dimensiona
2 Dimensionb
3 Dimensionc
%
Word 1
Word 1 2 3
Word 3
Word 3
Word 2
Word 1
• Custom Dictionary Structure:
• Constructs or dimensions on the top
section; words representing the various
constructs below
• Use of unusual characters ($, #, %, ?, ^, *,
etc.) to separate dimensions or
categories from the words themselves,
with these on their own lines
• May represent any language depicted
with UTF-8 charset (of Unicode)
• May use created characters for
imaginary created languages (vs.
natural languages)…but haven’t seen
this yet in the LIWC custom dictionary
collection
62
Structure of a Custom External Dictionary (cont.)
%
1 Dimensiona
2 Dimensionb
3 Dimensionc
%
Word 1
Word 1 2 3
Word 3
Word 3
Word 2
Word 1
• Selected words indicative of particular
dimensions or categories (single
dimensions or multiple ones) for the
bottom section
• Words may indicate several constructs,
but multiple counting of terms will mean
more noise in the data (as compared to
signal)…and will err on the side of recall
vs. precision (in terms of an f-measure)
• May use empirical data and sources to
stock lists, then add synonyms to
expand the dictionary’s transferability
beyond the “training data”
• Word list should be alphabetized for
easier perusal and for elimination of
repeated words
63
Structure of a Custom External Dictionary (cont.)
%
1 Dimensiona
2 Dimensionb
3 Dimensionc
%
Word 1
Word 1 2 3
Word 3
Word 3
Word 2
Word 1
• Helpful to have custom dictionaries
constructed multi-dimensionally to
capture a full and complex issue
• May run multiple dictionaries against
a corpus or combined corpora
• May divide up corpuses into separate
documents and sets for different sorts
of queries
• For example, use separate corpora to
analyze different time periods with sets
representing different time periods
• Use the creation of corpora and the
separation of documents and datasets
into different sets…as a way to enhance
LIWC capabilities
64
Structure of a Custom External Dictionary (cont.)
%
1 Dimensiona
2 Dimensionb
3 Dimensionc
%
Word 1
Word 1 2 3
Word 3
Word 3
Word 2
Word 1
• Results as straight raw word or phrase
or emoticon counts and computed
percentages of occurrences against
the entire corpora (not scores)
• For transferability and research
efficacy, need to validate / invalidate a
custom dictionary through pilot-
testing and usage
• Testing may involve
• Review by experts in the field
• Application of dictionary against various
text sets
• Statistical testing for whether words
represent the respective constructs
65
Required Notation in Customized Dictionaries
• Category names must be one word (and can be written as several words in
camel case)
• Separate words by spaces or tabs or new lines / hard returns (but be
consistent)
• Stemmed words may be counted (separate from the core word)
• Stemmed words are created with changes to a word’s form, such as with the addition
of prefixes, suffixes; the adding of count (pluralizing); the expression of verb tense;
and other transformations from a core or base term
• Asterisk (*) tells LIWC2015 to ignore all subsequent letters that follow to
capture all forms of the word based on a base form or lemma or stem (so
the differently inflected word forms may be treated as a single item)
• Telephon*
66
Required Notation in Customized Dictionaries
(cont.)
• Inclusion of multi-word phrases possible, such as for specific
compound terms or n-gram sequences
• Single-form versions of those terms will not be counted separately, and
phrases are ultimately treated as one-word units
• Words in alphabetical order in the customized dictionaries
67
68
Dictionaries for the Study of Various
Constructs and Dimensions
Built-in Dictionaries
• Selective coded word set that
define a particular category
• Dictionaries affect the
fundamental tool capabilities
• Validated
Customizable Dictionaries
• May apply custom dictionaries
to the analyses
• External dictionaries as plain text
files delimited by % and %
69
Insights from Experiences Working on a
Custom Dictionary
• Study the issue in depth, both in the formal and informal literature. Use a
“greedy” and “voracious” capture for sources.
• Create constructs (to be sufficiently mutually exclusive but also to cover
the research topic as comprehensively as possible). Write these as single
words or phrases using camel case.
• Capture words that indicate the respective constructs from all possible
respectable sources.
• Go beyond text to images and multimedia. Code everything that is
relevant.
• Capture as many natural language words that represent the various
constructs as possible.
• If using social media as the source, pay attention to abbreviations (from
everywhere), #hashtags, @expressions, emoticons, and a range of other details…
70
Insights from Experiences Working on a
Custom Dictionary (cont.)
• Avoid early lock-in or early finalization of a dictionary. (Assume that a
custom dictionary is never really finalized.)
• If a word applies to multiple constructs, include it in the multiple
constructs. (Don’t commit a word to only one construct.)
• Build a table in Word or Excel. Do not number the cells. Keep this as
freeform and inclusive as possible. Extend the brainstorm stage as
long as possible, so that there is not early commitment to an early
draft.
71
Insights from Experiences Working on a
Custom Dictionary (cont.)
72
Construct(s) | Related Words, Phrases, Symbols, Numbers, etc. Related to the Construct(s)
1 ConstructA | …words…phrases…symbols…numbers, and others
2 ConstructB | "
3 ConstructC | "
4 ConstructD | "
5 ConstructF | "
6 ConstructG | "
Insights from Experiences Working on a
Custom Dictionary (cont.)
• When the table is complete (at least for this round)…and the
dictionary has to be collated…
• Assign numbers to the constructs.
• Assign numbers to the related words showing their respective
relationships to the respective constructs.
• List the constructs in numerical order.
• You now have the top part of the custom dictionary.
73
Insights from Experiences Working on a
Custom Dictionary (cont.)
• Make a “bag of words” of all the words (with
their assigned construct numbers in the
adjacent cells in Excel).
• Filter the column of words into alphabetical
order and include the “Expand the Selection”
command so all the row data follows the sorted
file.
• Take the alphabetized word list, and you have
the bottom part of the custom dictionary.
• Test this in LIWC… by loading the new dictionary
and selecting the type of analysis desired…
74
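The collation steps above (number the constructs, attach the numbers to the words, alphabetize) can also be scripted. The construct names and words here are invented placeholders:

```python
# Sketch of collating a construct/word table into the two parts of a
# custom dictionary. Construct names and words are invented placeholders.
construct_words = {
    "RiskTalk":   ["danger", "threat*", "unsafe"],
    "SafetyTalk": ["secure", "protect*", "safe*"],
}

# Assign numbers to the constructs (the top part of the dictionary).
numbered = {name: i + 1 for i, name in enumerate(construct_words)}

# Bag of words with construct numbers, alphabetized (the Excel sort with
# "Expand the Selection" step) -- the bottom part of the dictionary.
rows = sorted(
    (word, num)
    for name, num in numbered.items()
    for word in construct_words[name]
)

lines = ["%"]
lines += [f"{num}\t{name}" for name, num in numbered.items()]
lines += ["%"]
lines += [f"{word}\t{num}" for word, num in rows]
print("\n".join(lines))
```

The printed result has the same shape as a loadable external dictionary and could be saved with a .dic extension for testing in LIWC.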
Creating .dic Files
• Open MS Word.
• Click “File” tab in the ribbon.
• Click “Options” at the bottom left.
• Select “Proofing” in the “Word Options” window.
• Click the “Custom Dictionaries” button.
• Indicate a “New” dictionary.
• Give the new dictionary a name and save it to the correct location
with the .dic file format.
75
Creating .dic Files (cont.)
• Open the .dic file in Word and paste the dictionary (with new words
on each line) into the file. Save. Load. Run.
• If you’ll be making multiple dictionaries, make a few extras with
generic names to serve as templates!
76
77
78
79
80
81
Dimensions of Language in LIWC2015:
Summary Language Variables
• Four Summary Language
Variables (standardized
composite scores based on
algorithms created from prior
linguistic analysis research and
large “training” text sets)
• Reported out as percentiles from 0
to 100
• Relative and comparative standing
of a target text document or text
set (against training text set) vs.
any “absolute” measure
• These summary language
variables include the following:
Analytic, Clout, Authentic, and
Tone
• These variables each have unique
meanings, so it is important to
read the official manuals to
understand the respective
meanings
• These are “black box” features,
so the underlying algorithms are
not available
82
Dimensions of Language in LIWC2015:
Summary Language Variables (cont.)
• Analytic (formerly categorical
dynamic index or “CDI”):
• high score: formal, logical, hierarchical
• low score: informal, personal, narrative
thinking
• Clout:
• high score: “perspective of high
expertise”
• low score: tentative or humble style
• may be indicative of relative social status,
confidence, and leadership
• Authentic:
• high score: honest and disclosing (being
“personal, humble, and vulnerable” and
authentic)
• low score: more guarded “distanced
form of discourse”
• (Sentiment and Emotional) Tone:
• high score (>50): positive emotion
• low score(<50): “greater anxiety, sadness,
or hostility”
• at 50: “suggests either a lack of
emotionality or different levels of
ambivalence” (LIWC2015 Operator’s
Manual)
• below 50 is negative, above 50 is positive
83
Dimensions of Language in LIWC2015:
Summary Counts to Indicate “Structural
Composition” and Complexity
• WC (total word count) (raw count)
• length of the particular text used as a
proxy for how in-depth that work may
be in addressing the target topic
• WPS (words per sentence)
(average)
• used as a proxy for sentence
complexity
• Sixltr (words longer than six letters)
(percentage)
• used as a proxy for word
complexity
• Dic (dictionary words count)
(percentage of target words
captured by the applied dictionary
/ dictionary words)
• used as an understanding of how
much of a text was addressed in the
LIWC analyses, assuming that the
various counts were all applied
84
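As a rough illustration of these structural counts, a small Python sketch (this approximates, and does not reproduce, LIWC's own tokenization and sentence-splitting rules):

```python
import re

def summary_counts(text):
    """Approximate WC, WPS, and Sixltr for a text (a sketch only;
    LIWC2015's own tokenization rules differ in the details)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    wc = len(words)
    wps = wc / len(sentences) if sentences else 0
    sixltr = 100 * sum(len(w) > 6 for w in words) / wc if wc else 0
    return wc, wps, sixltr

wc, wps, sixltr = summary_counts(
    "Short words here. Considerably lengthier vocabulary follows!"
)
print(wc, wps, round(sixltr, 1))   # 7 3.5 57.1
```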
Dimensions of Language in LIWC2015:
Understanding Most Output Numbers
• 90 output variables in LIWC2015
• Most are percentages of certain words in the total document or text
set (text corpus or corpora) (“Interpreting LIWC Output,” 2015; “How
it Works,” 2015)
85
Dimensions of Language in LIWC2015:
Percentages of Standard Linguistic Dimensions
• Function words (pronouns,
articles, helping / auxiliary verbs,
and others)
• Othergram (other grammar),
including verbs, adjectives,
comparisons, interrogatives,
numbers, and quantifiers
86
Dimensions of Language in LIWC2015:
Percentages of Psychological Constructs
• Affect (including positive
emotions, negative emotions
and particularly anxiety, anger,
and sadness)
• Social (including family, friends,
female, male)
• Perceptual processes (including
seeing, hearing, and feeling)
• Drives (affiliation, achievement,
power, reward, risk)
87
Dimensions of Language in LIWC2015:
Percentages of Other Human-Based Constructs
• Biological Processes (including
body, health, sexual, ingestion)
• Time Orientation (including
past, present, or future focus)
• Relativity (including motion,
space, time)
• Personal Concerns (including
work, leisure, home, money,
religion, death)
• Informal Language [including
swearing, netspeak, assent,
nonfluencies (meaningless filler
words), and filler words]
• Cognitive Processes (including
insight, causal, discrepancies,
tentativeness, certainty, and
differentiation)
88
Dimensions of Language in LIWC2015:
Punctuation Marks
• Punctuation marks (12
categories)
• Considered part of “structural
composition”
89
Meaning in the Dimensions
• Based on empirical research
• Based on constructs within
particular fields (particularly
psychology, linguistics)
• Based on the selected text
corpus or corpora
• Dimensions are applied singly
and in combination with other
descriptors and analytical
approaches to create value-
added understandings.
90
Additional LIWC Dictionaries
• Dictionary -> Get More Dictionaries
• Download as .dic (dictionary) files
• Dictionary -> Load New Dictionary
91
Beyond English
• Spinoff dictionaries as translations from English terms, but not
natively created and not natively coded
• Some are spinoffs of the English sentiment core, with added grammatical and
cultural variables
• External dictionary in LIWC2001: German
• External dictionaries in LIWC2007: Spanish, French, Russian, Italian,
and Dutch
• Versioned in some other languages like KLIWC for Korean LIWC,
Tagalog, and others based on custom research (according to articles
in the research literature)
92
LIWC in Applied Research
93
94
Some Types of Research with Computational Linguistic Analysis:
Research Approaches
• Lab-based (and / or classroom-
based) capture of text sets based
on particular directions for
eliciting writing
• Stream-of-consciousness writing,
free writing, diary writing /
journaling, deceptive vs. non-
deceptive writing, responding to
visual prompts, completing
cliffhangers, and others
• Uses of Electronically Activated
Recorder (EAR)
• Pre- and post- experimental
methods
• Categorical outcomes used to
separate text sets and the study
of various linguistic variable
associations (“markers” or
“indicators”) with particular
outcomes; application of
statistical analysis for
significance and correlation
effect size (r)
95
Some Types of Research with Computational Linguistic Analysis:
Research Approaches (cont.)
• Sometimes used as a part of a
research sequence, not as the
main research
96
Some Types of Research with Computational Linguistic Analysis:
Baseline / Control Setting
• Baseline / control setting for
how males write / talk vs. how
females write / talk
• Within languages
• Between languages
• Status indicators in language
use; power vs. powerlessness
• Language-based baselines
• Cultural-based baselines
• Genre writing baselines
• General age trajectories
baselines and language use
97
Some Types of Research with Computational Linguistic Analysis:
Efficacy of Writing Interventions
• Whether writing has therapeutic
value; what types of writing have
therapeutic value
• Upper and lower boundaries of
therapeutic writing
98
Some Types of Research with Computational Linguistic Analysis:
Predictive Analytics
• Handling of individual trauma;
handling of collective trauma
• Authorship attribution (through
psycholinguistic profiling)
• Deception detection
• Fraud detection
• Male / female authorship inference
• Suicidality detection
• College student performance
prediction
• Employee performance prediction
• Research article popularity based
on writing fluency
• Threat detection
• Remote personality reading
(including author / speaker
cognition, psychological health,
and others)
• Reading of mental and emotional
states
99
Some Types of Research with Computational Linguistic Analysis:
Predictive Analytics (cont.)
• Belongingness, social realities
• Cognitive judgment
• Attitudes
• Motives and intentions, and
others
• (An easy starter book on this is
The Secret Life of Pronouns by
J.W. Pennebaker, one of the
main originators of LIWC.)
100
Some Types of Research with Computational Linguistic Analysis:
Some Origins of Extant From-world Text Sets
• Historical documents
• Court records
• Research articles
• Journalism text sets
• Genres of fiction writing
• Gray (informal) literature
• Company or organization-based
writing
• Grants
• Personal writing, like letters
• Large-scale writing sets from
college students, K-12 students
• Applications for college entry
• Writing for standardized testing
• Synthetic data (created to test
particular research hypotheses)
• Computer-generated
• Crowd-generated, and others
• Related text sets across languages,
also between languages
101
Some Types of Research with Computational Linguistic Analysis:
Some Origins of Extant From-world Text Sets (cont.)
• Spoken speech
• Speeches
• Debates
• Panel discussions
• Focus groups
• Meeting agendas and discussions
• Television programs
• Telephone transcripts
• Music lyrics, and others
• Social media text sets
• Web pages and sites
• Crowd-sourced blog entries
• Web encyclopedia pages
• Tweetstreams and microblogging
message collections
• Social network user accounts
• SMS datasets
• Email sets
• Sub-Reddits and discussion threads
• Image tags,
• Video tags, and others
102
Work Space in LIWC2015
103
User Interface
• Simple user interface
• Clear nomenclature
• Text pre-processing outside of
LIWC
104
105
106
Analyze Text…to Pre-Coded Categories
107
Categorize Words (Unigrams)…to Pre-Coded Categories
108
Color-Code Text…in Original Document Structure
109
A Basic Walk-through with
LIWC2015
110
General Process (redux)
• Theoretical underpinning(s)
• Research design
• Text collection (searchable file types, file naming protocols)
• Text cleaning (normalization)
• LIWC runs and re-runs (word counts, percentages of words in content
categories)
• Analytic conclusions
• Further research within and beyond LIWC
111
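The text-cleaning (normalization) step in the process above can take many forms; here is a sketch of a few common pre-LIWC clean-up operations (the exact choices depend on the research design and the source of the texts):

```python
import re
import unicodedata

def clean_text(raw):
    """A sketch of common pre-LIWC cleaning steps; the right choices
    depend on the research design and the source of the texts."""
    text = unicodedata.normalize("NFKC", raw)    # normalize unicode forms
    text = re.sub(r"https?://\S+", "", text)     # drop URLs
    text = text.replace("\u2019", "'")           # curly to straight quote
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text

print(clean_text("Visit https://example.com  now\u2026 it\u2019s   great"))
# Visit now... it's great
```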
Live Demos
112
113
Some Types of Askable Questions
From Computational Linguistic Analysis
114
Some Types of Basic Askable Questions
Pre-existent (or “found”) text:
• Are there statistically significant differences in linguistic writing styles
between authors (author style profiles)?
• Authors of different genders? Age groups? Cultures? Languages?
Backgrounds? Experiences?
• If so, what are the differences? Are these consistent differences? Do these
differences hold across different conditions and contexts? What could these
differences mean?
• What is the text profile of “successful” vs. “unsuccessful” genres of
writing? Do such differentiating text profiles exist in a meaningful
way? Are these effects explainable based on the linguistic features?
115
Some Types of Basic Askable Questions (cont.)
Pre-existent (or “found”) text (cont.):
• Are there linguistic markers / indicators in text sets that may indicate
particular outcomes in terms of reception of the text / text sets?
Outcomes for the authors of the text sets? Other outcomes?
• Are there linguistic patterns from certain genres of writing? Genres
of writing in certain time periods?
• Are there patterned observable differences between spoken words
and written ones in a particular context? How spontaneous (raw) or
edited (processed) were the respective source texts?
116
Some Types of Basic Askable Questions (cont.)
Pre-existent (or “found”) text (cont.):
• What are some summary features (descriptions) of the document or
text corpus? How are function words used in the text set?
• What are some observed sentiment features of the text set? How do
these features correlate with features in the real-world?
• What are some observable psychometric features of the text set?
How do these features correlate with features in the real-world?
• How do the various features of one document or text set compare
and contrast against another?
117
Some Types of Basic Askable Questions (cont.)
Elicited text:
• What are some linguistic features of the elicited texts?
• What are some creative prompts for elicited spoken words (like think-
aloud prompts) vs. elicited written words?
• Are there identifiable patterns that may be found in those elicited
texts? Do different prompts result in identifiably different types of
texts, and if so, how?
• How does writing change over time (in terms of observed linguistic
features)?
118
Some Types of Basic Askable Questions (cont.)
Elicited text (cont.):
• How does writing change in a pre- and post- intervention scenario?
• What role can writing play as an intervention itself?
• Are there different writing patterns that may be identified among
different people groups (such as based on demographic factors or
categorical factors)? What might this mean?
119
Challenges with
Internal and External Validation
120
Some Challenges with the Word Counting
Method
• Inherent lexical ambiguity and polysemous nature of language (a
counted term can be understood in different ways based on the
context, author intention, and usage)
• Focus on the single unigram / one-gram (instead of two-grams, three-
grams, four-grams, and so on, as phrases)
• A lack of contextual awareness in a “bag of words” paradigmatic
approach (except for counts within documents in sets comprised of
stand-alone documents)
• Some small mitigation in terms of the color coding of terms found in a
document that are in the LIWC2015 dictionary, but this requires human “close
reading” of the document (whether academic reading, skimming, or
scanning)
121
Some Challenges with the Word Counting
Method (cont.)
• A core base language has to be selected even though there are
dictionaries in English and non-English languages
• Multi-language datasets cannot be run simultaneously (but may be
run individually, with findings applied in a complementary way)
122
123
Definitions: Validation / (In)Validation
Internal Validity
• How well the words represent
the constructs that they are
supposed to represent
• How solidly does LIWC2015
work based on its
conceptualization, creation,
testing, and design
External Validity
• How well identified textual
indicators predict “ground truth” or
“state of the world”
• How well the findings may be
generalized to the world (or the slice
of the world that is being studied)
• Also how applicable the findings
may be to other similar (~) cases
• How findings compare to base rates
of particular textual phenomena in
particular text genres
124
Internal (In)validation
• Evaluation of each of the steps to the process, the execution at each
step, and the overall work
• Theoretical underpinning
• Research design
• Text collection
• Text cleaning
• Text analysis instrument functioning
• LIWC runs and re-runs
• Text set treated as individual files and as a collection
• Analytic conclusions
125
Testing of Predictive Modeling Accuracy
• Testing predictive modeling based on other measures (created by people or by
other programs)
• Both precision and recall are important for a predictive construct but there may be tradeoffs
between these two features
• To ensure that predicted positives are actual positives, a threshold may be set too high, leaving out many
actual positives but resulting in fewer false positives (so high precision, but low recall)
• To ensure that all the positives that exist in a set are captured, a threshold for inclusion may be set
too low, capturing a lot of false positives (so high recall, but low precision)
• Ideally, both precision and recall should be as high as possible
• In a perfect balance, all positives are true positives, and every single actual positive is identified from a
set
• F-measure / F1 score / F-score (“weighted harmonic mean” between precision
and recall)
• Expressed as a number between 0 and 1 (where 1 is perfect precision and perfect recall)
• 0 ≤ p ≤ 1
• 0 ≤ r ≤ 1
126
F-measure: Precision and Recall
Precision “p” (predicted positive results):
true positives / (true positives + false
positives)
• How sensitive is the test to the
identification of true positives (without
confusing false positives with the true)?
• How much noise is in the results? Is the
test overweighted towards finding
positives and so falsely categorizing false
positives (undesirable)?
• High precision means that an identified
positive is highly likely to actually be a
true positive (and not a false positive).
Low precision means that an identified
positive could well be a false positive.
Recall “r” (capturing of actual positive
results): true positives / (true positives +
false negatives)
• How many of the true positives
have been identified (from the full
set of all true positives
possibilities)?
• High recall means that most or all
of the true positives are identified
by the test.
• Low recall means that many of the
true positives were missed. In this
case, the test is not trusted to
include all possible true positives
because many are missed.
127
Testing of Predictive Modeling Accuracy (cont.)
• F1 = 2 / [ (1 / recall) + (1 / precision) ]
• F1 simplified = 2 × (Precision × Recall) / (Precision + Recall)
• OR F1 simplified: F = 2 * [ (pr) / (p + r) ]
• An ideal test identifies the target phenomena accurately (p) and
thoroughly (r).
128
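The harmonic-mean relationship above can be checked with a few lines; the counts here are invented to illustrate the high-precision / low-recall tradeoff described on the earlier slide:

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision (p) and recall (r)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A strict (high-threshold) classifier: few false positives, many misses,
# so precision is high (8/9) but recall is low (8/15).
print(round(f1_score(tp=8, fp=1, fn=7), 3))   # 0.667
```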
Some Limits to Dictionary-Based Classifiers
• Dictionary-based classifier systems tend to be high on “precision” but
low on “recall” (Aoqui, 2012)
• Natural language (particularly in an age of social media) evolves quickly, with
new terms and new word usages occurring constantly
• Dictionaries used in classifiers are often updated through rigorous manual
(“by hand”) processes, which require large human investments and
efforts…and time
129
External (In)validation
• Light “sanity check” of text counts
• Comparisons against extant
baselines (if available)
• Comparisons against text results of
a control group (vs. the
experimental group)
• Comparisons against human coding
(if feasible)
• Comparisons with other similar
research (if available)
• Comparisons of phenomena based
on other similar / dissimilar text
sets
• Testing of linguistic indicators in
other similar (~) contexts for
applicability
• Fine-tuning
• Testing of linguistic insights with
“ground truth” (assessed by other
means)
130
Some Other Computational
Linguistic Analysis Tools
131
Note: Most of these were not tested by the author.
Some Other Common
Computational Linguistic Analysis Tools
• Computational Social Science
Lab (CSSL) at the University of
Southern California: the Text Analysis,
Crawling and Interpretation Tool
(TACIT)
• http://tacit.usc.edu/
• Free and open-source
• Art Graesser’s Coh-Metrix
program (coherence metrics in
text)
• http://cohmetrix.com/
• Rod Hart’s DICTION program
• http://www.dictionsoftware.com
• Tom Landauer’s Latent Semantic
Analysis
• http://lsa.colorado.edu
• CASOS’ AutoMap (network text
analysis)
• http://www.casos.cs.cmu.edu/proj
ects/automap/
• Free
132
Some Other Approaches to the
Data
133
File Export Formats
• No direct way to save a project in LIWC2015, so use “Save Results”
• “Analyze Text” and “Categorize Words” data export as following file
types:
• .txt (ASCII text)
• .csv (comma separated values)
• .xlsx (XML spreadsheet file format in Excel 2007 onwards)
• “Color-Code Text” function data results cannot be saved out directly
but may be copied and pasted in MS Word with color intact (not in
Notepad or simple text editors)
134
From LIWC2015 -> Other analytics
• Light analytics and counts in LIWC2015
• No access to coded text sets from which numbers are extrapolated
(so a kind of black-box processing except for the software manual
documentation)
• Export…depending on the text curation process…sometimes requiring data
restructuring…sometimes requiring information and assumptions beyond the
extracted texts…
• …In Excel: mostly descriptive and comparison data
• Range of quant processing (averaging data, summing data, and others)
• Data visualizations (stacked bar charts, line graphs, and others)
135
From LIWC2015 -> Other analytics (cont.)
• …In SPSS: asking harder questions based on the studied texts
• Statistical significance computations
• Chi-square computations
• Factor analyses
• Content analysis through human “close reading”
• LIWC2015 variables involve (reproducible) counts and some psychometrics,
but its use is always through a researcher interpretive lens
• Researcher identifies what is relevant and why
• Researcher brings knowledge of the topic and field to the interpretation
136
Conclusion
And some Newbie Observations
137
Initial Impressions of LIWC
• Functions are simple and mechanistic: counting and tallying
• User interface is simple
• Processes are simple
• Potential is intriguing and promising:
• Power lies in the dictionaries and the domain-based insights
• Power lies in the respective text sets for sufficiency of data
• Power lies in discovery and hypothesis-testing
• Power lies in surfacing insights that would not be knowable otherwise in an
efficient way
138
Conceptualizing the Extracted Data
• Because these tools offer predictions based on probability, however,
such insights will never be definitive. “In the final analysis, our
situation is much like that of economists,” (James W.) Pennebaker
says. “It’s too early to come up with a standardized analysis. But at
the end of the day, we all are making educated guesses, the same way
economists can understand, explain and predict economic ups and
downs.”
• – Jan Dönges, “What Your Choice of Words Says about Your Personality,” July
1, 2009, Scientific American
139
Some Newbie Observations about the
Software Tool
• LIWC2015 (and computational linguistics) is young yet and still very much
in the exploratory phases
• There are challenges with setting baselines against which subsets and new
sets of texts may be compared for insights; likewise, there are challenges in
setting control groups for experimental research
• Some insights are domain-specific (and culture specific), and others may be
general and cross-domain
• A major trap involves over-asserting from limited observations (and limited
text sets)
• Facile interpretations are risky and should be avoided
• J.W. Pennebaker: “Don’t trust your instincts.”
• LIWC2015 is a lot of fun to use (maybe a little dangerously so)!
140
Some Newbie Observations for Researchers
• As with some software tools, it is easy to come up with a lot of data
with just a few clicks…but accurate analysis will require the following:
• Understanding
• the strengths and weaknesses of the tool
• the strengths and weaknesses of the curated text sets
• with some text sets more amenable to the application of computational linguistic
analysis than others
• with Heaps’ law / Herdan’s law implications to the amount of text and diminishing
returns on numbers of distinct vocabulary elements after a certain amount of collected
text
• the particular domain / discipline
• the research context
141
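The Heaps' law / Herdan's law point above (diminishing vocabulary returns as collected text grows) can be illustrated numerically. The parameters K and beta below are assumed values for illustration, not fitted to any corpus:

```python
# Heaps' / Herdan's law sketch: distinct vocabulary V grows roughly as
# K * N**BETA for N collected tokens. K and BETA here are assumed values
# (real corpora are typically fitted with beta around 0.4 - 0.6).
K, BETA = 10, 0.5

def vocab_estimate(n_tokens):
    return K * n_tokens ** BETA

for n in (1_000, 10_000, 100_000):
    # 10x more text yields only ~3.2x more distinct vocabulary
    print(n, int(vocab_estimate(n)))
```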
Some Newbie Observations for Researchers
(cont.)
• Accurate analysis will require understanding… (cont.)
• what to select as relevant from the mass of numerical data
• ways to test the findings (internal and external validation)
• Internal: the text set(s)
• External: the world (what the text findings may indicate about the author…the state of
the genre…the state of the world, often through statistical means)
• ways to hypothesize about and interpret the findings
142
Going to the Source
• “The Development and Psychometric Properties of LIWC2015” by James
W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn
addresses the following and more:
• Psychometric baselines
• Examples
• Test text corpora details
• Internal consistency measures of the various psychological constructs (based on
uncorrected α and corrected α)
• Uncorrected α based on average of standard Cronbach’s alpha calculation for the term
across corpora
• Challenges of highly variant base rates of word occurrences across documents and corpora
• Corrected α based on Spearman-Brown prediction (prophecy) formula which takes into
account the amount of underlying text (“test length”) in attributing internal reliability (the
more data, the stronger the potential reliability)
143
Internal Consistency of Words as Indicators of
Psychometric Constructs…and Confidence Levels
• The built-in psychological constructs have a range of confidence based
on prior research with large and diverse text corpora.
• The higher the internal consistency measures (usually measuring correlations
between different items on the same test / words in the same construct, and
measured on a restricted scale of 0 – 1 outputs), the greater reliability of the
software tool in using language to identify a psychological construct (and
therefore the higher confidence users may have in the output).
• Uncorrected α tends to “grossly underestimate reliability in language categories
due (to) the highly variable base rates of word usage within any given category”
• Corrected α is based on the Spearman-Brown prediction formula and this is
considered “a more accurate approximation of each category’s ‘true’ internal
consistency” (Pennebaker, Boyd, Jordan, & Blackburn, 2015, p. 8).
144
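The Spearman-Brown prediction (prophecy) formula mentioned above has a simple closed form; here is a sketch with an illustrative reliability value:

```python
def spearman_brown(r, n):
    """Predicted reliability of a test lengthened by factor n, given a
    current reliability r: n*r / (1 + (n - 1)*r)."""
    return n * r / (1 + (n - 1) * r)

# Illustrative value only: doubling the amount of underlying text (n=2)
# for an item reliability of 0.5 predicts a reliability of 2/3.
print(round(spearman_brown(0.5, 2), 3))   # 0.667
```

This is the sense in which more underlying text strengthens the "corrected" internal reliability estimates reported for LIWC2015 categories.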
Basic Approaches for Researchers
Researchers…
• study the built-in dictionary and other available .dics in LIWC2015
• read up on the related academic research literature (and extract the
methods that make the most sense)
• experiment with testable hypotheses about particular textual
datasets in particular disciplines
• trial-run the tool on datasets about which the researcher is already
intimate to get a sense of the tool
• sample broadly in terms of texts *and* sample texts in a targeted and
strategic way, too
145
Basic Approaches for Researchers (cont.)
Researchers… (cont.)
• practice cleaning (pre-processing) textual data for processing in
LIWC2015
• formulate and temper hypotheses based on the LIWC findings
• work to find counter-evidence to one’s hypotheses (both seeking
validation and invalidation)
• use in-world knowledge to interpret and test the findings from
LIWC2015
146
Advanced Approaches for Researchers
Researchers…
• design their own research approaches using LIWC2015
• use a variety of research methods to capture the desired knowledge
more fully (so not just using LIWC alone)
• create their own dictionaries for limited and targeted questions
147
Interpreting One Document
(this slideshow in an earlier iteration)
• Scores vs. counts (with computed percentages)
• Scores involving calculations (some algorithmic processing) of raw counts
• Counts (raw) within groups of dimensions (and computed percentages)
148
Addendum: An Applied Example
… based on this slideshow v. 1 and v. 2…
149
Data Visualizations: The following data visualizations were mostly created after a second run of the slideshow
was done in LIWC2015, so there are some small discrepancies between the posted numerical data and
the numerical data used to create the data visualizations (in Excel). The slideshow itself changed with the
addition of the analytical data from LIWC…and from updates as the presenter started to better understand
the software (still as a neophyte). Sorry about any inconvenience from the data discrepancy.
A LIWC Run on this Slideshow
150
(Full LIWC2015 output row for “LIWC-ing at Texts for Insights from
Linguistic Patterns.pdf”: 90+ variables covering the summary scores,
function words, psychological processes, personal concerns, informal
language, and punctuation. Key values: Segment 1, WC 5888, Analytic 95.05,
Clout 51.70, Authentic 22.36, Tone 37.80, WPS 46.36, Sixltr 35.39, Dic 62.42.)
Summary Language Variables
Scores (/100, as a percentile measure)
• Analytic: 95.26 (score)
• Clout: 55.01 (score)
• Authentic: 22.15 (score)
• (Emotional) Tone: 40.94 (score)
Counts and Percentages
• WC (word count): 4281 (raw count)
• WPS (average words per
sentence): 47.57 (average)
• Sixltr (words > 6 letters): 36.32
(percentage)
• Dic (percentage of words out of 100
captured by the built-in dictionary,
application of the dictionary to
the text): 63.68 (percentage)
151
Linguistic Dimensions
(% of words in a text that fit in a certain linguistic category, including 21 possible dimensions)
• Function (vs. non-function words): 31.51
• Pronoun: 2.66
• Ppron: 0.40
• I: 0.00
• We: 0.05
• You: 0.07
• Shehe: 0.00
• They: 0.28
• Ipron (impersonal pronouns): 2.27
152
• Article: 4.58
• Prep: 13.50
• Auxverb: 3.53
• Adverb: 2.13
• Conj: 6.49
• Negate: 0.35
153
(Bar chart: Pronoun Use, covering pronoun, ppron, i, we, you, shehe, they, ipron)
154
(Bar chart: Non-Pronoun Function Words, covering article, prep, auxverb, adverb, conj, negate)
Linguistic Dimensions: Other Grammar
• Verb: 8.76
• Adj: 5.02
• Compare: 2.45
• Interrog: 0.93
• Number: 4.34
• Quant: 2.27
• Common verbs
• Common adjectives
• Comparisons
• Interrogatives
• Numbers
• Quantifiers (few, many, much)
155
156
(Bar chart: Other Grammar, covering verb, adj, compare, interrog, number, quant)
Psychological Processes
• Affect: 2.71 (summary percentage from the category but not with
rounded-up numbers)
• Posemo: 1.71
• Negemo: 0.86
• Anx: 0.23
• Anger: 0.21
• Sad: 0.14
157
158
Social Processes
• Social: 4.56 (summary percentage from the category but not with
rounded-up numbers)
• Family: 0.05
• Friend: 0.05
• Female: 0.07
• Male: 0.07
159
160
(Chart: Social, covering family, friend, female, male)
Cognitive Processes
• Cogproc: 14.55 (summary percentage from the category but not with
rounded-up numbers)
• Insight: 4.70
• Cause: 3.27
• Discrep: 0.58
• Tentat: 3.34
• Certain: 0.86
• Differ: 3.06
161
162
Perceptual Processes
• Percept: 1.24 (summary percentage from the category but not with
rounded-up numbers)
• See: 0.58
• Hear: 0.26
• Feel: 0.14
163
164
(Bar chart: Perceptual Processes, with percept 1.14, see 0.51, hear 0.25, feel 0.12)
Biological Processes
• Bio: 0.42 (summary percentage from the category but not with
rounded-up numbers)
• Body: 0.09
• Health: 0.23
• Sexual: 0.02
• Ingest: 0.07
165
(Bar chart: Biological Processes, with bio 0.46, body 0.10, health 0.24,
sexual 0.03, ingest 0.08; y-axis: percentage of found terms in target document)
Note: This was visualized from a re-run of the revised slideshow, so the
numbers are slightly different from those in the slide.
Drives
• Drives: 4.25 (summary percentage for the category, computed from unrounded values, so the rounded subcategory figures below do not sum to it exactly)
• Affiliation: 0.79
• Achieve: 1.17
• Power: 1.47
• Reward: 0.68
• Risk: 0.30
167
Time Orientation
• Focuspast: 1.35
• Focuspresent: 0.93
• Relativ: 9.23
168
169
(Bar chart: Time Orientation — focuspast, focuspresent, focusfuture)
Relativity
• Motion: 0.98
• Space: 6.14
• Time: 2.01
170
Personal Concerns
• Work: 5.98
• Leisure: 0.42
• Home: 0.09
• Money: 0.16
• Relig: 0.07
• Death: 0.09
171
(Filled radar chart: Personal Concerns — work, leisure, home, money, relig, death)
Informal Language
• Informal: 0.61 (0.63 if summed from the rounded values below)
• Swear: 0.00
• Netspeak: 0.49
• Assent: 0.02
• Nonflu: 0.12
• Filler: 0.00
172
173
(Bar chart: Informal Language — informal, swear, netspeak, assent, nonflu, filler)
Punctuation Categories
• AllPunc: 29.95 (above the 21.35 average in the LIWC2015 training-set analysis)
• Period: 2.50
• Comma: 5.40
• Colon: 0.86
• SemiC: 0.33
• Qmark: 0.75
• Exclam: 0.02
• Dash: 2.03
• Quote: 1.45
• Apostro: 0.23
• Parenth: 6.89
• OtherP: 9.48
174
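LIWC reports punctuation relative to the word count rather than the character count, which is how AllPunc can approach 30. A rough sketch of that style of counting (the mark set and tokenization here are simplified assumptions; LIWC's own handling of dashes, quotes, and in-word apostrophes is more nuanced):

```python
import re

def punctuation_percentages(text):
    """Count selected punctuation marks as a percentage of word count.

    Simplification: a hyphen counts as a dash, and apostrophes inside
    contractions are counted like any other apostrophe.
    """
    words = re.findall(r"[A-Za-z']+", text)
    marks = {
        "Period": ".", "Comma": ",", "Colon": ":", "SemiC": ";",
        "QMark": "?", "Exclam": "!", "Dash": "-",
    }
    n = max(len(words), 1)
    return {name: round(100 * text.count(ch) / n, 2)
            for name, ch in marks.items()}

sample = "Wait - really? Yes, really! It works; mostly."
print(punctuation_percentages(sample))
```

With 7 words and one of each mark except the colon, each present mark scores 100/7 ≈ 14.29 — per-word percentages, so punctuation-heavy texts can exceed 100 on AllPunc in principle.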
175
(Bar chart: Punctuation Categories — AllPunc, Period, Comma, Colon, SemiC, QMark, Exclam, Dash, Quote, Apostro, Parenth, OtherP)
Extrapolated Content Summary (rough)
176
(Bar chart: Extrapolated Content Summary (rough) — function, affect, social, cogproc, insight, percept, bio, drives, "timeorient", relativ, "perscons", informal, AllPunc)
Extrapolated Content Summary (rough) (cont.)
• Based on the prior slide's summary, this slideshow emphasizes, in descending order: cognitive processes, relativity, time orientation, personal concerns, drives, social processes, and insight. (The percentages do not sum cleanly to 100% because of rounding.)
177
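The descending-order reading above can be reproduced by sorting the summary percentages reported on earlier slides (this uses only the subset of values listed in this deck, with function words excluded since they are grammatical rather than topical):

```python
# Summary percentages taken from the slides above (rounded values).
summary = {
    "function": 31.51, "affect": 2.71, "social": 4.56, "cogproc": 14.55,
    "percept": 1.24, "bio": 0.42, "drives": 4.25, "relativ": 9.23,
    "informal": 0.61,
}

# Content categories ranked in descending order of prevalence:
ranking = sorted((k for k in summary if k != "function"),
                 key=summary.get, reverse=True)
print(ranking)
# → ['cogproc', 'relativ', 'social', 'drives', 'affect',
#    'percept', 'informal', 'bio']
```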
(Bar chart: A Rough Content Summary, Color-Coded by Time Dimension)
178
Four Summary Language Dimensions of this
Slideshow (in a spider chart)
179
(Spider chart: Four Summary Language Dimensions of this Slideshow — Analytic 95.26, Clout 55.01, Authentic 22.15, Emotional Tone 40.94)
Four Summary Language Dimensions of this
Slideshow
Descriptive?
• Do these summary dimensions
capture the human close-
reading sense of the slideshow?
• If so, what are its insights?
• If not, where does it fall short?
Prescriptive?
• Analytic: Should this work be so
highly analytic? (score: 95.26)
• Clout: Should there be more
effort to raise the clout score to
indicate more expertise,
confidence, status, and
leadership? (even though the
presenter is a newbie to LIWC)
(score: 55.01)
180
Four Summary Language Dimensions of this
Slideshow (cont.)
Descriptive?
• What are ways to capture other
aspects of the text or texts
beyond the summary language
dimensions? (and outside of
LIWC)
Prescriptive?
• Authentic: Should the language in
this slideshow be more personable
and authentic? Less guarded?
More vulnerable? (score: 22.15)
• Tone (emotion): Given its
emotional tone, which is trending a
little negative (< 50), should more
effort be made to make it trend
more positive? (score: 40.94)
181
Comments? Questions?
• Any insights about challenges to interpreting the data without
baselines? Control text sets? Without comparatives?
• Insight about why the color coding of the target document or text set
may be helpful?
• Ideas for new applications of LIWC2015? Fresh research ideas?
• Strengths of the tool and research methodology? Weaknesses? Ways
to strengthen this approach?
182
Conclusion and Contact
• Dr. Shalin Hai-Jew
• iTAC, Kansas State University
• 212 Hale / Farrell Library
• shalin@k-state.edu
• 785-532-5262
• No ties: The presenter has no tie to the maker of LIWC (Pennebaker Conglomerates, Inc.).
• Data visualizations: The simple data visualizations were created in Microsoft Excel using a re-run
of the LIWC tool over this revised slideshow, so the numbers in the data visualizations are slightly
different from the numbers in the first set used in the Addendum: An Applied Example (Slide 149
onwards).
• Newbie alert! Also, the presenter is a newcomer to LIWC and is still LIWC-ing around. If you see
an error, please contact the presenter, so the slideshow may be corrected. Thanks!
• And less-newbie learning: A new section was added as the presenter worked on her first custom dictionary, so if you downloaded an earlier version, please download this current one.
183

Human-Machine Collaboration: Using art-making AI (CrAIyon) as cited work, o...
 
Getting Started with Augmented Reality (AR) in Online Teaching and Learning i...
Getting Started with Augmented Reality (AR) in Online Teaching and Learning i...Getting Started with Augmented Reality (AR) in Online Teaching and Learning i...
Getting Started with Augmented Reality (AR) in Online Teaching and Learning i...
 

Último

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 

LIWC-ing at Texts for Insights from Linguistic Patterns

  • 7. Some Generalities About Language • Most language is natural language which evolves common practices and structures over time based on human interaction (vs. constructed language, like Esperanto, or those created for the silver screen). • Language evolves over time based on human usage, particularly in local geographical areas. • Unique dialects may develop locally in particular regions or within certain social groups. • Language itself tends to be patterned but not necessarily internally logical. • Modern languages originate from language families and are influenced by other languages. • Languages are shared codes (oral and written) for people to communicate and exchange information. • Because languages have to be understood broadly, they tend to be highly patterned. • Modern languages tend to have written and phonological aspects; they tend to include both content (semantic) and structure (syntactic) aspects. 7
  • 8. Some Generalities About Language (cont.) • Only 200 of the world’s 6,000 – 7,000 languages have a written version; most are / have been oral only. • Language is social; it plays a core role in how people make meaning and interact with each other. • Changes in a language (based on new technologies, interactions between cultures, and fashion) are often adopted first orally and then integrated in more formal written forms. • The world’s languages are disappearing as their users abandon them in favor of more commonly shared languages (“Lists of endangered languages”). • Globalization has complex effects on world languages. 8
  • 9. Some Generalities About Language (cont.) • Semantic terms tend towards polysemy (being multi-meaninged) and nuance, and so are inherently ambiguous. • Words must be understood in context (translate: proximity to the target term) to understand their particular respective word sense (connotative application vs. only denotative). • There are statistical probabilities for which meaning of a word is likely being used, and based on the proximity of other terms to the target term, it is possible to “understand” the particular meaning of a term in a context. • Language contains high dimensionality data; it involves many facets. • Language has text and subtext as well, so the meanings conveyed are not only surface ones but some hidden (or latent) aspects. • People wield language in non-obvious ways, such as by using humor, irony, symbolism, historical referencing, tone, and other aspects. 9
  • 10. Changing Roles of Writing in Societies • Writing used to be practiced by those with political and social power, and their creations were based on formal structures and conventions. • Originally focused on religious issues • Broadened to address issues of interest for the literate upper and political classes • Writing is now far more widely practiced by the masses, who are much more broadly literate. • Topics cover anything of interest but still along certain code-able topics (in terms of library system labeling, and others) for formal publishing. 10
  • 11. Common Forms of Writing Non-fiction • Journalism • Essay writing • Autobiography, memoir • Biography • Research writing Fiction • Poetry • Short Stories • Novellas • Novels • Plays and scripts 11
  • 12. Common Forms of Writing (cont.) Non-fiction • Documents • Letters • Oral histories • Interviews • Manifestos • Statements, and others Fiction • Songs • Jokes • Synthetic data for dummy case- or scenario-research, and others 12
  • 13. Reading = Decoding Text Writing = Encoding Text “A language is a fecund, redolent buzzing mess of a thing, in every facet, glint, and corner, even in single words.” -- John McWhorter, What Language Is (2011) 13 encode <-> decode <-> encode <-> decode <-> encode <-> decode <-> encode <-> decode <-> encode <-> decode <-> … The two activities hold each other in tension, each constraining the other.
  • 14. Complementary Human and Machine Reading Human: Close Reading • Informed by training, experience, personality, intellect, emotion, and other factors • Full sensory: sight, smell, taste, touch, hearing, and proprioception (embodied) Computer: Distant Reading • May include supervised or unsupervised machine learning • More efficient and scalable than human reading • Results in objective counts 14
  • 15. Complementary Human and Machine Reading (cont.) Human: Close Reading • Interpretive and subjective, filtered through the person Computer: Distant Reading • Objectivist • Reproducible (theoretically and practically) 15
  • 16. Some Common Distant Reading Approaches • Human and computer (supervised machine learning): • XML tagging and data queries of the tagged texts • Literary analysis including on dimensions of time, characters, dialogue, locations, and other aspects (such as in the digital humanities) • Coding by existing pattern (with original human coding: emergent, a priori, or mixed) 16
  • 17. Some Common Distant Reading Approaches (cont.) • Human and computer (data queries): • Text frequency counts for issues of main focus • Both: • Absolute frequency counts • Relative frequency counts (relative to the document and corpus) • Mitigation of counts based on how frequently a term appears in a document and corpus, with more common word appearances diluting the informational importance of that word (TF-IDF) • Word search to find all word contexts for word and phrase disambiguation 17
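The frequency-count ideas on the slide above — absolute counts, relative counts, and TF-IDF's dilution of common words — can be sketched in a few lines. This is a minimal illustration of the arithmetic, not any particular tool's implementation; the toy corpus and the (unsmoothed) IDF formula are my own choices:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the relative frequency of a term within a document; IDF
    discounts terms that appear in many documents, so very common
    words carry less informational weight.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
w = tf_idf(docs)
# "the" appears in two of the three documents, so its weight is low;
# "cat" appears in only one document, so it is weighted more heavily.
```

A term that appears in every document gets an IDF of log(1) = 0, which is exactly the "dilution" of informational importance described above.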
  • 18. Some Common Distant Reading Approaches (cont.) • Computer (unsupervised machine learning): • Sentiment analysis (against a pre-coded sentiment word set) • Emotion analysis (based on a number of different models) • Derivation of gender, personality [Big 5 Personality Traits, Dark Triad Personality traits (narcissism, Machiavellianism, and psychopathy), with evidence of “within-person stability” that enables profiling and comparisons/contrasts between people], age, cultural background, and others • Remote profiling (with “zero interaction”) • Predictive analytics, such as anticipation of actions by leaders based on public and private signaling (extrapolation of intentionality) 18
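Dictionary-based sentiment analysis of the kind listed above reduces to simple word-count arithmetic: the percentage of a text's words that fall in pre-coded category word lists. A minimal sketch follows; the tiny POSITIVE/NEGATIVE sets are invented stand-ins, not any validated sentiment dictionary:

```python
import re

# Illustrative stand-in word lists; real tools use large, validated dictionaries.
POSITIVE = {"happy", "good", "great", "love"}
NEGATIVE = {"sad", "bad", "terrible", "hate"}

def sentiment_rates(text):
    """Return positive/negative word rates as percentages of all words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {"posemo": 0.0, "negemo": 0.0}
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return {
        "posemo": 100 * pos / len(words),
        "negemo": 100 * neg / len(words),
    }

rates = sentiment_rates("I love this great tool, but the manual is bad.")
# 10 words total: 2 positive ("love", "great"), 1 negative ("bad")
# → {'posemo': 20.0, 'negemo': 10.0}
```

Reporting rates rather than raw counts is what makes documents of different lengths comparable.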
  • 19. Some Common Distant Reading Approaches (cont.) • Computer (unsupervised machine learning) (cont.): • Deception detection and analysis (such as through “pronoun drop”) • Stylometry (including author identification): convergence of linguistic features to an (un)identified author • “psychological signatures” of writers based on their written works and communications and extending to profiles • Domain mapping (topic extraction) • Theme and subtheme extraction / topic modeling (unsupervised machine learning) • Machine-extraction of textual features that are “tells” for certain outcomes 19
  • 20. Computational Linguistic Analysis (as a subset of “distant reading” capabilities) 20
  • 21. Brief History of Quantitative Approaches and Linguistics • Manual or “counting by hand” back in the day • Computational systems tested against human experts in a particular field or domain • Computational linguistics emerged in the 1950s, at the beginning of the Cold War, to translate texts from foreign languages into English (machine translation) • Used in translating spoken language to text (speech-to-text) • Used in summarizing texts (topic modeling) at scale for search tools (“Computational linguistics,” Apr. 1, 2016) 21
  • 22. Computational Linguistic Research Design Built on Theories, Models, and Empirical Research • Informed by research in language, (social) psychology, computer science, and other fields • May involve text exploration, discovery, or targeted research questions (or some combination) • Building on theories, models, and empirical research • Hypothesizing based on theories and models • Grouping writing based on particular outcome variables to identify differences in writing, using selected observed indicators in (written and spoken) language as potential indicators of difference between the groups with the differing outcomes • Using a combination of insights from theories, models, and empirical research 22
  • 23. Computational Linguistic Research Design Built on Theories, Models, and Empirical Research(cont.) • Relationship between natural language expression and… • hidden internal states of people (gender, personality, cognition, state of mind, intentionality, etc.) and hidden internal states of groups and cultures • health • genres of writing • different language structures • gender differences • Language features as certain “tells” (indicators, signs, signals) • Reverse engineering backwards in time • Predictive analytics forwards in time 23
  • 24. Computational Linguistic Research Design Built on Theories, Models, and Empirical Research(cont.) • So essentially: plaintext = code (indicators) of latent (hidden) realities • So can profile various genres of text for general characteristics / baselines • So can compare new exemplars of particular texts against baselines • So can profile an “unknown” text based on its quant characteristics • So can compare historical texts against future ones • So can compare historical occurrences and related texts…and possibly apply in predictive ways into the future 24
  • 25. Creation of Dedicated Dictionaries • Informed by in-world texts • Suggested words and stems and synonyms • Vetted by people • Empirically tested for research value • Does the dictionary provide practical research insights? 25
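A custom external dictionary of the kind described above is typically a plain-text .dic file: a %-delimited header maps category numbers to category names, then each word or stem (stems end in *) lists the category numbers it loads on. The sketch below writes and re-parses a minimal file of this shape; the category names and words are invented examples, and a production parser would need more error handling:

```python
# A minimal .dic-style custom dictionary (invented categories and words).
DIC = """%
1 posemo
2 negemo
%
happ* 1
great 1
sad 2
terribl* 2
"""

def parse_dic(text):
    """Parse a minimal LIWC-style .dic into (categories, word -> category ids)."""
    # Split on the two '%' delimiter lines: header block, then word block.
    header, body = text.split("%\n")[1:3]
    categories = dict(
        line.split(None, 1) for line in header.strip().splitlines()
    )
    words = {}
    for line in body.strip().splitlines():
        token, *cat_ids = line.split()
        words[token] = cat_ids
    return categories, words

cats, words = parse_dic(DIC)
```

Round-tripping a hand-built dictionary through a small parser like this is one way to vet it before empirical testing.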
  • 26. Consumptive and Non-Consumptive Text Analysis Consumptive Text Analysis • Access to the analytics AND the underlying text set(s) Non-Consumptive Text Analysis • No access to the underlying text set • Google Books Ngram Viewer is one popular example of non- consumptive computational text analysis (with access to the shadow text set of ngrams only) 26
  • 27. General Sequence • Theoretical underpinning • Research design • Collection of target text documents into corpora • May need to negotiate the release of particular rights • Preservation of raw data into a pristine master set • Development of familiarity or intimacy with the text sets (through close reading and other types of explorations) • Translation of non-base languages to the base language (or separation for different data runs using different language dictionaries) 27
  • 28. General Sequence(cont.) Text cleaning • Separating each text bit into its own file based on a unit of analysis (quote, paragraph, article, section, novel, or play, etc.) • Data normalization / spell check • Clear and representative file naming protocols • De-identification of data (if relevant) • Cleaning of notes from transcripts, and others Text file transcoding (with close observations of data capture and data loss at each phase) • Images with word content are often not represented as textual contents • Metadata may / may not be captured • Sequences of text processing, software used, and such, affect the word counts • Preferred: .pdf -> MS Word (less lossy) • Not preferred: .pdf -> .txt (lossy) 28
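The text-cleaning steps above — one file per unit of analysis, with a clear and representative file-naming protocol — can be sketched as follows. The paragraph-as-unit choice and the naming scheme are invented for illustration:

```python
from pathlib import Path

def split_into_units(source, out_dir, label):
    """Write each paragraph of `source` to its own file, named by a
    simple protocol: <label>_<zero-padded index>.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    text = Path(source).read_text(encoding="utf-8")
    # Unit of analysis here: blank-line-separated paragraphs.
    units = [p.strip() for p in text.split("\n\n") if p.strip()]
    for i, unit in enumerate(units, start=1):
        (out / f"{label}_{i:04d}.txt").write_text(unit, encoding="utf-8")
    return len(units)
```

Keeping the pristine master file untouched and writing units to a separate folder preserves the raw data, per the sequence above.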
  • 29. General Sequence(cont.) • File formatting (.txt, .rtf, .pdf, .doc, .docx, .csv, .xls, .xlsx, NOT .pptx, .ppt, .wpd, .jpg, .png) for LIWC2015 • Versioning of text corpora for different queries • “Bag of words” paradigm or structure / context preservation • End-sentence markers for sentence length • Extraction of data tables • Creation of data visualizations from the extracted data • Interpretations and analyses • (In)validation of the linguistic analysis 29
  • 30. Sense-making from Linguistic Patterning • Starting with known information and prior research (and prior theory) that will inform the analyses; may be based on a stated hypothesis • Selecting relevant texts that are comparable along particular dimensions • May use from-world text corpora (such as based on dichotomous nonparametric outcome variables or multi-factor outcomes)…and looking for linguistic differences and similarities 30
  • 31. Sense-making from Linguistic Patterning (cont.) • Identifying linguistic “markers” / indicators that are a “tell” for a particular construct or state-of-the-world • Setting baselines (controls) for certain types of texts and then comparing a subset against the general baselines • Looking at text clustering as factor loading and applying human understandings of the respective factors • Comparisons and contrasts across dimensions of texts (such as text types across cultures) 31
  • 32. “Linguistic Style” • Study of semantic terms makes sense to human readers and aligns with how the human brain works (in terms of what is noticed / perceived and remembered in a text), but semantic terms and unique phrasing are eminently emulatable and manipulate-able • Point is to find indicators that are not so easily tampered with by people who may want to manage impressions • Going to “function words” or particles (articles, pronouns, prepositions, conjunctions, auxiliary / helping verbs, adverbs, negations, etc.) 32
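The shift described above — from easily manipulated semantic content to function words (particles) that writers rarely monitor consciously — is straightforward to operationalize as rates per word. A minimal sketch; the word sets are small illustrative samples of much larger real categories:

```python
import re

# Small illustrative samples; real function-word categories are far larger.
PRONOUNS = {"i", "me", "my", "we", "you", "he", "she", "it", "they", "them"}
ARTICLES = {"a", "an", "the"}

def function_word_rates(text):
    """Percentage of tokens that are pronouns or articles."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"pronouns": 0.0, "articles": 0.0}
    return {
        "pronouns": 100 * sum(t in PRONOUNS for t in tokens) / len(tokens),
        "articles": 100 * sum(t in ARTICLES for t in tokens) / len(tokens),
    }

r = function_word_rates("I told them the plan, and they accepted it.")
# 9 tokens: 4 pronouns ("i", "them", "they", "it"), 1 article ("the")
```

Because a writer faking a style can swap nouns and verbs far more easily than pronoun and article rates, these particle rates are harder to tamper with, which is exactly the point made above.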
  • 33. Curation of Text Sets 33
  • 34. Selected Texts from a Domain • Delimiting of targeted texts helps focus the function of the software • All included texts should relate to the topic being studied • Said to work better with natural language text than non-natural language text • Texts do not have to be only one type, but if they are mixed text sets, that should be noted in the analysis • Types of text data should be reflected in the types of dictionary dimensions (and categories) applied…as well as the selected language dictionary 34
  • 36. Curation of Text Sets 1. Gathering of textual and non-textual (such as multimedia) data 2. Selection of relevant texts 3. Arrangement of rights releases as needed (staying legal) 4. Transcoding of multimedia content to textual and textual-to-textual 5. (Non-destructive) data cleaning: normalization of terms, spelling, foreign language translation, treatment of symbols and punctuation, and others (with archival of all raw files in pristine format prior to any normalizing, for “non-destructiveness”) 6. Data segmentation / grouping 7. Data / file labeling 36
  • 37. Curation of Text Sets (cont.) 8. Data formatting / file conversions • Must be searchable (machine readable, with optical character recognition / OCR) files in any of the following formats: .pdf, .doc, .docx, .txt, .rtf, .xls, .xlsx, .csv, etc. (Ability to read within and across columns in spreadsheet formats is a new LIWC2015 capability.) • Considerations for digital preservation, with common strategies of going to the lowest common denominator files and open source if possible (.txt, .rtf, .csv, .html, .xml) 9. Metadata creation (linked to the respective files or kept in a README file with the textual data) 37
  • 38. Curation of Text Sets (cont.) 10. Conducting of research on the text set: Inherent enablements for certain types of data queries and data explorations and research based on dataset contents and structure (such as test text sets to train new models) 11. Scrub of dataset for publishing 12. Descriptions of the text set: Its origins, its contents, its quantitative numbers, its standards for text inclusion, its copyright releases, its prior uses, its potential uses, proper citation methods, originator’s / originators’ contact information, and others; datasets named based on curated textual and other contents or the data curator or some other naming method (for easier reference, for building up a user base) 38
  • 39. A “Sanity Check” for Text Processing • Transcoding from one document type to another often results in information loss because of how each software program handles the transferred information. • Data lossiness: • There is some degree of expected lossiness. For example, text in images will not be recognized unless there is optical character recognition (OCR) applied. • Embedded videos will not have a text equivalency unless a transcript is also downloaded and included with the text document, corpus, or corpora. • In multi-lingual text files, messages that are not in the main base language may not be transferred accurately. • Some valid words may turn to garble in the transcoding. 39
  • 40. A “Sanity Check” for Text Processing (cont.) • Added extraneous data: There is some degree of extraneous information included, such as web page data in between-page gutters captured in a web- to-PDF capture. • Comments made to a .pdf may be included in a transcoding context, say, to Word. • There may also be extra information in web pages if they are not captured in print style but include advertising in designated ad spaces and pop-up windows. • Print styles of web pages are not always included as an option. 40
  • 41. A “Sanity Check” for Text Processing (cont.) • If batch processing, check results with smaller sets first. Check the outputs as well. • In automation, it is possible to lose information if this is done unthinkingly. • To see if there are systematic challenges with losing and / or gaining text during transcoding, run a “sanity check” after processing data to see how much of the original was preserved. • One type of “sanity check” is to run a simple average word count of a particular unit (document, message, or other). • Does the per-unit word count jibe with what is observed by the researcher? If not, it’s important to figure out a less lossy way of capturing query-able files. 41
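The per-unit word-count "sanity check" described above can be automated: compute the average word count per file before and after transcoding and compare. A sketch, assuming one .txt file per unit; the folder names in the usage comment are hypothetical:

```python
import re
from pathlib import Path

def average_word_count(folder):
    """Mean word count per .txt file in a folder (one file per unit)."""
    counts = [
        len(re.findall(r"\S+", p.read_text(encoding="utf-8", errors="ignore")))
        for p in Path(folder).glob("*.txt")
    ]
    return sum(counts) / len(counts) if counts else 0.0

# Hypothetical layout: raw originals vs. transcoded copies.
# before = average_word_count("corpus/raw")
# after = average_word_count("corpus/transcoded")
# A large drop in the average flags a lossy transcoding step.
```

If the per-unit average does not jibe with what the researcher observes in the documents themselves, that is the cue to find a less lossy capture route.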
  • 42. A “Sanity Check” for Text Processing (cont.) • There may be differences between “Save as,” “Export as,” “Print as,” “Send to,” and other sorts of functions that enable transcoding between file types. • Allowing character substitution or not will affect the transcoded contents from MS Word. • For those text files with UTF-8 characters, it is important to ensure that the encoding supports UTF-8 characters. • There are more optimal sequences and technologies to move text from one file type to another, so researchers should experiment with what works best for them. • From a PDF file, convert to .docx (via MS Word) to capture much more recognizable text (vs. a .txt or a .rtf). 42
  • 43. Creation of Text Metadata from Multimedia Sources • Types of files: • Digital imagery, audio files, video files, slideshows, games, simulations, and others • Analog-to-digital files (transcoding) • Text versioning: metadata descriptors, transcripts, locational information, coding (whether manual, automated, or mixed) and others • Using the extracted text transcripts for linguistic analysis…but a step or two out from the original source • Updated capabilities to read image files (in PDF) to text in an automated way 43
  • 45. LIWC and its History • Developed in early-to-mid 1990s by Martha E. Francis (then a grad student and programmer) and James W. Pennebaker (1993) to study possible therapeutic use of language • Named LIWC (Linguistic Inquiry and Word Count), helpfully descriptive but also disambiguated; “LIWC” pronounced “luke” (according to its makers) • Comprised of two parts: (1) a processing component and (2) dictionaries (based on certain categories of data and / or constructs) • Factors broadened rapidly in v. 1 to 80 factors (variables) 45
  • 46. LIWC and its History(cont.) • v. 2 evolved with an expanded dictionary and more modern software processing capabilities (2001); also known as SLIWC (Second LIWC) • LIWC2007 offered even broader dictionary capabilities, developed by James W. Pennebaker, Roger J. Booth, and Martha E. Francis (2007) 46
  • 47. LIWC and its History(cont.) • Most recent version is LIWC2015 (Pennebaker, Boyd, Jordan, & Blackburn, 2015), with new software and new dictionary (vs. an upgrade) and extensive documentation • LIWC2015 dictionary contains nearly 6,400 words, word stems, and emoticons • “Each dictionary entry additionally defines one or more word categories or subdictionaries” • Includes a feature to include customized dictionaries 47
  • 48. LIWC and its History(cont.) • The systematic process of LIWC2015 dictionary creation involved building on LIWC2007; having 2-6 judges individually generate word lists; having the collected words analyzed by a group of 4-8 judges; applying a Meaning Extraction Helper to set base rates of word usage in the wild; creating candidate word lists of terms possibly missed by judges; psychometric evaluation of the respective words’ influences on the constructs; refinement of the terms; and a re-review of the prior steps to catch potential errors (Pennebaker, Boyd, Jordan, & Blackburn, 2015, pp. 5 – 6) • Is a well-documented software tool (rare) • Is tested for both internal validity (based on real-world text sets) and external validity (based on research designs, with validity applied not just across-the-board but on a case-by-case basis) 48
  • 49. LIWC and its History(cont.) • Tested against large text corpora: blogs, “expressive writing,” novels, natural speech, NY Times, and Twitter to set baselines • LIWC2015 captures “on average, over 86 percent of the words people use in writing and speech” (Pennebaker, Boyd, Jordan, & Blackburn, 2015, p. 10) • Fairly high correlations between LIWC2007 and LIWC2015 means (p. 13) • Removal of categories “largely due to their consistently low base rates, low internal reliability, or their infrequent use by researchers” (past tense verbs, present tense verbs, future tense verbs, human words, inhibition words, inclusives, exclusives); the 2001 and 2007 versions remain enabled • Internally validated on a variety of psychometric dimensions; backstopped by empirical research across a number of modern languages • Informed by decades of empirical research 49
  • 50. Internal Consistency Measures across Variables 50
  • 52. LIWC and its History(cont.) • As a commercial product • May be purchased (http://liwc.wpengine.com/) from Pennebaker Conglomerates, Inc. • Free trial version may be accessed (http://www.liwc.net/tryonline.php) but with text size limits • Includes a LIWC API with all LIWC2015 variables “plus 30+ additional validated measures of psychology, personality, emotion, tone, sentiment and more—all in real time” and access to “social media integration, time-series analysis, statistical models, machine learning models and more” (through Receptiviti) • Is not open-source • Runs on both Windows and Mac OSes via the Java Virtual Machine • May be downloaded to the local machine or accessed as a web-based version 52
  • 53. LIWC and its History(cont.) • Processes text files sequentially: finds each word, looks to see if it is in its built-in dictionary, counts that word, and increments a straight count; then applies that count to a simple percentage function (% of the complete document) or a variable-based scale function (based on dedicated algorithms) for psychometric, psycholinguistic, or other human-related measures • Outputs one of the following: • Raw counts • Frequency percentages • Processed scores (percentiles)…but no access to the coded text sets 53
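The processing loop described above can be sketched as follows. This is a minimal illustration, not LIWC's actual implementation, and the toy two-category dictionary is invented:

```python
# Minimal word-counting sketch: tokenize a text, look each token up in
# a category dictionary, and report raw counts plus percentages of the
# total word count (the "% of the complete document" output).
import re
from collections import Counter

DICTIONARY = {  # toy category -> word-set mapping; illustrative only
    "posemo": {"happy", "good"},
    "negemo": {"sad", "bad"},
}

def analyze(text: str):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for tok in tokens:
        for category, words in DICTIONARY.items():
            if tok in words:
                counts[category] += 1
    total = len(tokens)
    percents = {c: 100.0 * n / total for c, n in counts.items()}
    return total, dict(counts), percents

total, raw, pct = analyze("I am happy today, not sad at all")
```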
  • 54. LIWC and its History(cont.) • Evolution of built-in dictionaries over time based on documented standards and researcher majority consensus • Based on English (with 171,000 “English” words in use, 100,000 English words used by the average native English speaker) • Considered among the foremost linguistic analysis tools in use today • Backstopped by hundreds of research articles • Can handle any language representable by the UTF-8 charset / character set (but analytics done in a base language) 54
  • 55. Human-Created Non-English Translations based on LIWC2001 or LIWC2007 Available • Spanish • German • Dutch • Norwegian • Italian • Portuguese In Process • Arabic • Korean • Turkish • Chinese 55
  • 56. Downloadable External Dictionaries (.dic) from LIWC2007 and LIWC2001 LIWC2007 • Spanish • French • Russian • Italian • Dutch LIWC2001 • German 56
  • 57. Downloadable Customized Dictionaries • Dedicated site for dictionary downloads also enables access to user-created dictionaries • Four were available as of mid-2016 • Two were coherent (structurally and conceptually), and of those, one was a sample to show how to set up a dictionary for use in LIWC2015 • Linguistic analysis dictionary-creators need to be expert in an area of research • They need a clear grasp of the language that they are using • They need to work with others to ensure that the linguistic analysis dictionary is as comprehensive and as accurate as possible • Such dictionaries—like any research instrument—should be fully documented and tested for validity (of construct) and reliability (of consistent results); the first is based on the subject matter field, and the latter is based on counting (which usually has very high reliability for item counting) 57
  • 59. Exporting Pre-Built Internal Dictionaries • Can export internal dictionaries (LIWC2001, LIWC2007, and LIWC2015) as “posters” in secured (non-editable) .pdf files 59
  • 60. Original Customized External Dictionaries • How to create: • Conceptualize a construct • Identify terms that fit that construct • In a text editor, list in the proper format (next slide): first the constructs and then the terms in each of those constructs • Be sure to place the opening and closing % delimiters properly • Save the file as a .dic file (by changing the extension of a basic text file, BUT—see Slide 75 for the easiest way to create a .dic file that works); in LIWC2015, .doc or .docx or .txt files may be used as dictionary files as long as the other formatting is in place 60
  • 61. Original Customized External Dictionaries (cont.) • Adding to LIWC (for the analyses) • Dictionary -> Load New Dictionary • Can only run one dictionary at a time, but can run various dictionaries over the same text set for different insights • Conduct research and test the respective constructs for internal reliability and external validity 61
  • 62. Structure of a Custom External Dictionary • Layout sketch: an opening % line; numbered dimensions, one per line (1 DimensionA, 2 DimensionB, 3 DimensionC); a closing % line; then the words, each followed by the number(s) of the dimension(s) it indicates • Custom Dictionary Structure: • Constructs or dimensions in the top section; words representing the various constructs below • Use of unusual characters ($, #, %, ?, ^, *, etc.) to separate dimensions or categories from the words themselves, with these on their own lines • May represent any language depicted with the UTF-8 charset (of Unicode) • May use created characters for imaginary constructed languages (vs. natural languages)…but this hasn’t been seen yet in the LIWC custom dictionary collection 62
  • 63. Structure of a Custom External Dictionary (cont.) • Selected words indicative of particular dimensions or categories (single dimensions or multiple ones) for the bottom section • Words may indicate several constructs, but multiple counting of terms will mean more noise in the data (as compared to signal)…and will err on the side of recall vs. precision (in terms of an f-measure) • May use empirical data and sources to stock lists, then add synonyms to expand the dictionary’s transferability beyond the “training data” • Word list should be alphabetized for easier perusal and for elimination of repeated words 63
  • 64. Structure of a Custom External Dictionary (cont.) • Helpful to have custom dictionaries constructed multi-dimensionally to capture a full and complex issue • May run multiple dictionaries against a corpus or combined corpora • May divide up corpora into separate documents and sets for different sorts of queries • For example, use separate corpora to analyze different time periods, with sets representing different time periods • Use the creation of corpora and the separation of documents and datasets into different sets…as a way to enhance LIWC capabilities 64
  • 65. Structure of a Custom External Dictionary (cont.) • Results as straight raw word or phrase or emoticon counts and computed percentages of occurrences against the entire corpora (not scores) • For transferability and research efficacy, need to validate / invalidate a custom dictionary through pilot-testing and usage • Testing may involve • Review by experts in the field • Application of the dictionary against various text sets • Statistical testing for whether words represent the respective constructs 65
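The two-part layout sketched on these slides can be written out and parsed as a small .dic file. The categories and words below are invented for illustration, and the whitespace-delimited parsing is an assumption about the format rather than LIWC's own loader:

```python
# Invented example of the two-part .dic layout: category IDs between
# % delimiters, then words tagged with the IDs of every category they
# indicate (a word may indicate several constructs).
SAMPLE_DIC = """%
1 Certainty
2 Doubt
3 Emotion
%
absolutely 1
happy 3
maybe 2
perhaps 2 3
"""

def parse_dic(text: str):
    """Parse the category section and the word section of a .dic text."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    first, second = [i for i, ln in enumerate(lines) if ln == "%"][:2]
    categories = {}
    for ln in lines[first + 1 : second]:
        num, name = ln.split(None, 1)
        categories[int(num)] = name
    words = {}
    for ln in lines[second + 1 :]:
        parts = ln.split()
        words[parts[0]] = [int(n) for n in parts[1:]]
    return categories, words

cats, words = parse_dic(SAMPLE_DIC)
```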
  • 66. Required Notation in Customized Dictionaries • Category names must be one word (and can be written as several words in camel case) • Separate words by spaces or tabs or new lines / hard returns (but be consistent) • Stemmed words may be counted (separately from the core word) • Stemmed words are created through changes to a word’s form, such as the addition of prefixes or suffixes, changes in number (pluralization), the expression of verb tense, and other transformations of a core or base term • The asterisk (*) tells LIWC2015 to ignore all subsequent letters so as to capture all forms of the word based on a base form or lemma or stem (so the differently inflected word forms may be treated as a single item) • Telephon* 66
  • 67. Required Notation in Customized Dictionaries (cont.) • Inclusion of multi-word phrases possible, such as for specific compound terms or n-gram sequences • Single-form versions of those terms will not be counted separately, and phrases are ultimately treated as one-word units • Words in alphabetical order in the customized dictionaries 67
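The asterisk convention above can be sketched as a simple prefix match. This is an assumption-level reimplementation for illustration, not LIWC's internal matcher:

```python
# "telephon*" matches telephone, telephoning, etc.; entries without an
# asterisk must match the token exactly.
def matches(entry: str, token: str) -> bool:
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry

hits = [t for t in ["telephone", "telephoning", "telegraph"]
        if matches("telephon*", t)]
```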
  • 69. Dictionaries for the Study of Various Constructs and Dimensions Built-in Dictionaries • Selectively coded word sets that define particular categories • Dictionaries affect the fundamental tool capabilities • Validated Customizable Dictionaries • May apply custom dictionaries to the analyses • External dictionaries as plain text files delimited by % and % 69
  • 70. Insights from Experiences Working on a Custom Dictionary • Study the issue in depth, both in the formal and informal literature. Use a “greedy” and “voracious” capture for sources. • Create constructs (to be sufficiently mutually exclusive but also to cover the research topic as comprehensively as possible). Write these as single words or phrases using camel case. • Capture words that indicate the respective constructs from all possible respectable sources. • Go beyond text to images and multimedia. Code everything that is relevant. • Capture as many natural language words that represent the various constructs as possible. • If using social media as the source, pay attention to abbreviations (from everywhere), #hashtags, @expressions, emoticons, and a range of other details… 70
  • 71. Insights from Experiences Working on a Custom Dictionary (cont.) • Avoid early lock-in or early finalization of a dictionary. (Assume that a custom dictionary is never really finalized.) • If a word applies to multiple constructs, include it in the multiple constructs. (Don’t commit a word to only one construct.) • Build a table in Word or Excel. Do not number the cells. Keep this as freeform and inclusive as possible. Extend the brainstorm stage as long as possible, so that there is not early commitment to an early draft. 71
  • 72. Insights from Experiences Working on a Custom Dictionary (cont.) 72 Construct(s) Related Words, Phrases, Symbols, Numbers, etc. to the Construct(s) 1 ConstructA …words…phrases…symbols…numbers, and others 2 ConstructB " 3 ConstructC " 4 ConstructD " 5 ConstructF " 6 ConstructG "
  • 73. Insights from Experiences Working on a Custom Dictionary (cont.) • When the table is complete (at least for this round)…and the dictionary has to be collated… • Assign numbers to the constructs. • Assign numbers to the related words showing their respective relationships to the respective constructs. • List the constructs in numerical order. • You now have the top part of the custom dictionary. 73
  • 74. Insights from Experiences Working on a Custom Dictionary (cont.) • Make a “bag of words” of all the words (with their assigned construct numbers in the adjacent cells in Excel). • Sort the column of words into alphabetical order, choosing the “Expand the Selection” option so that all the row data follows the sorted column. • Take the alphabetized word list, and you have the bottom part of the custom dictionary. • Test this in LIWC… by loading the new dictionary and selecting the type of analysis desired… 74
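The collation steps above (numbering the constructs, tagging words with construct numbers, alphabetizing) can be sketched in a few lines; the constructs and words here are invented examples, not from any real dictionary:

```python
# Collate a construct -> word-list table into two-part dictionary text:
# numbered constructs on top, alphabetized tagged words on the bottom.
def build_dic(table: dict) -> str:
    ids = {name: i for i, name in enumerate(table, start=1)}
    word_ids = {}
    for name, words in table.items():
        for w in words:
            word_ids.setdefault(w, []).append(ids[name])
    top = [f"{i} {name}" for name, i in ids.items()]
    bottom = [f"{w} {' '.join(map(str, sorted(nums)))}"
              for w, nums in sorted(word_ids.items())]
    return "\n".join(["%", *top, "%", *bottom])

dic_text = build_dic({
    "Certainty": ["absolutely", "always"],
    "Doubt": ["maybe", "always"],  # a word may belong to several constructs
})
```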
  • 75. Creating .dic Files • Open MS Word. • Click “File” tab in the ribbon. • Click “Options” at the bottom left. • Select “Proofing” in the “Word Options” window. • Click the “Custom Dictionaries” button. • Indicate a “New” dictionary. • Give the new dictionary a name and save it to the correct location with the .dic file format. 75
  • 76. Creating .dic Files(cont.) • Open the .dic file in Word and paste the dictionary (with new words on each line) into the file. Save. Load. Run. • If you’ll be making multiple dictionaries, make a few extras with generic names to serve as templates! 76
  • 82. Dimensions of Language in LIWC2015: Summary Language Variables • Four Summary Language Variables (standardized composite scores based on algorithms created from prior linguistic analysis research and large “training” text sets) • Reported out as percentiles from 0 to 100 • Relative and comparative standing of a target text document or text set (against training text set) vs. any “absolute” measure • These summary language variables include the following: Analytic, Clout, Authentic, and Tone • These variables each have unique meanings, so it is important to read the official manuals to understand the respective meanings • These are “black box” features, so the underlying algorithms are not available 82
  • 83. Dimensions of Language in LIWC2015: Summary Language Variables(cont.) • Analytic (formerly categorical dynamic index or “CDI”): • high score: formal, logical, hierarchical • low score: informal, personal, narrative thinking • Clout: • high score: “perspective of high expertise” • low score: tentative or humble style • may be indicative of relative social status, confidence, and leadership • Authentic: • high score: honest and disclosing (being “personal, humble, and vulnerable” and authentic) • low score: more guarded “distanced form of discourse” • (Sentiment and Emotional) Tone: • high score (>50): positive emotion • low score (<50): “greater anxiety, sadness, or hostility” • at 50: “suggests either a lack of emotionality or different levels of ambivalence” (LIWC2015 Operator’s Manual) • below 50 is negative, above 50 is positive 83
  • 84. Dimensions of Language in LIWC2015: Summary Counts to Indicate “Structural Composition” and Complexity • WC (total word count) (raw count) • length of the particular text used as a proxy for how in-depth that work may be in addressing the target topic • WPS (words per sentence) (average) • used as a proxy for sentence complexity • Sixltr (words longer than six letters) (raw count) • count used as a proxy for word complexity • Dic (dictionary words count) (percentage of target words captured by the applied dictionary / dictionary words) • used as an understanding of how much of a text was addressed in the LIWC analyses, assuming that the various counts were all applied 84
  • 85. Dimensions of Language in LIWC2015: Understanding Most Output Numbers • 90 output variables in LIWC2015 • Most are percentages of certain words in the total document or text set (text corpus or corpora) (“Interpreting LIWC Output,” 2015; “How it Works,” 2015) 85
  • 86. Dimensions of Language in LIWC2015: Percentages of Standard Linguistic Dimensions • Function words (pronouns, articles, helping / auxiliary verbs, and others) • Othergram (other grammar), including verbs, adjectives, comparisons, interrogatives, numbers, and quantifiers 86
  • 87. Dimensions of Language in LIWC2015: Percentages of Psychological Constructs • Affect (including positive emotions, negative emotions and particularly anxiety, anger, and sadness) • Social (including family, friends, female, male) • Perceptual processes (including seeing, hearing, and feeling) • Drives (affiliation, achievement, power, reward, risk) 87
  • 88. Dimensions of Language in LIWC2015: Percentages of Other Human-Based Constructs • Biological Processes (including body, health, sexual, ingestion) • Time Orientation (including past, present, or future focus) • Relativity (including motion, space, time) • Personal Concerns (including work, leisure, home, money, religion, death) • Informal Language [including swearing, netspeak, assent, nonfluencies (meaningless filler words), and filler words] • Cognitive Processes (including insight, causal, discrepancies, tentativeness, certainty, and differentiation) 88
  • 89. Dimensions of Language in LIWC2015: Punctuation Marks • Punctuation marks (12 categories) • Considered part of “structural composition” 89
  • 90. Meaning in the Dimensions • Based on empirical research • Based on constructs within particular fields (particularly psychology, linguistics) • Based on the selected text corpus or corpora • Dimensions are applied singly and in combination with other descriptors and analytical approaches to create value- added understandings. 90
  • 91. Additional LIWC Dictionaries • Dictionary -> Get More Dictionaries • Download as .dic (dictionary) files • Dictionary -> Load New Dictionary 91
  • 92. Beyond English • Spinoff dictionaries as translations from English terms, but not natively created and not natively coded • Some are spinoffs of the English sentiment core, with added grammatical and cultural variables • External dictionary in LIWC2001: German • External dictionaries in LIWC2007: Spanish, French, Russian, Italian, and Dutch • Versioned in some other languages, like KLIWC for Korean LIWC, Tagalog, and others, based on custom research (according to articles in the research literature) 92
  • 93. LIWC in Applied Research 93
  • 95. Some Types of Research with Computational Linguistic Analysis: Research Approaches • Lab-based (and / or classroom- based) capture of text sets based on particular directions for eliciting writing • Stream-of-consciousness writing, free writing, diary writing / journaling, deceptive vs. non- deceptive writing, responding to visual prompts, completing cliffhangers, and others • Uses of Electronically Activated Recorder (EAR) • Pre- and post- experimental methods • Categorical outcomes used to separate text sets and the study of various linguistic variable associations (“markers” or “indicators”) with particular outcomes; application of statistical analysis for significance and correlation effect size (r) 95
  • 96. Some Types of Research with Computational Linguistic Analysis: Research Approaches(cont.) • Sometimes used as a part of a research sequence, not as the main research 96
  • 97. Some Types of Research with Computational Linguistic Analysis: Baseline / Control Setting • Baseline / control setting for how males write / talk vs. how females write / talk • Within languages • Between languages • Status indicators in language use; power vs. powerlessness • Language-based baselines • Cultural-based baselines • Genre writing baselines • General age trajectories baselines and language use 97
  • 98. Some Types of Research with Computational Linguistic Analysis: Efficacy of Writing Interventions • Whether writing has therapeutic value; what types of writing have therapeutic value • Upper and lower boundaries of therapeutic writing 98
  • 99. Some Types of Research with Computational Linguistic Analysis: Predictive Analytics • Handling of individual trauma; handling of collective trauma • Authorship attribution (through psycholinguistic profiling) • Deception detection • Fraud detection • Male / female authorship inference • Suicidality detection • College student performance prediction • Employee performance prediction • Research article popularity based on writing fluency • Threat detection • Remote personality reading (including author / speaker cognition, psychological health, and others) • Reading of mental and emotional states 99
  • 100. Some Types of Research with Computational Linguistic Analysis: Predictive Analytics (cont.) • Belongingness, social realities • Cognitive judgment • Attitudes • Motives and intentions, and others • (An easy starter book on this is The Secret Life of Pronouns by J.W. Pennebaker, one of the main originators of LIWC.) 100
  • 101. Some Types of Research with Computational Linguistic Analysis: Some Origins of Extant From-world Text Sets • Historical documents • Court records • Research articles • Journalism text sets • Genres of fiction writing • Gray (informal) literature • Company or organization-based writing • Grants • Personal writing, like letters • Large-scale writing sets from college students, K-12 students • Applications for college entry • Writing for standardized testing • Synthetic data (created to test particular research hypotheses) • Computer-generated • Crowd-generated, and others • Related text sets across languages, also between languages 101
  • 102. Some Types of Research with Computational Linguistic Analysis: Some Origins of Extant From-world Text Sets (cont.) • Spoken speech • Speeches • Debates • Panel discussions • Focus groups • Meeting agendas and discussions • Television programs • Telephone transcripts • Music lyrics, and others • Social media text sets • Web pages and sites • Crowd-sourced blog entries • Web encyclopedia pages • Tweetstreams and microblogging message collections • Social network user accounts • SMS datasets • Email sets • Sub-Reddits and discussion threads • Image tags, • Video tags, and others 102
  • 103. Work Space in LIWC2015 103
  • 104. User Interface • Simple user interface • Clear nomenclature • Text pre-processing outside of LIWC 104
  • 107. Analyze Text…to Pre-Coded Categories 107
  • 108. Categorize Words (Unigrams)…to Pre-Coded Categories 108
  • 109. Color-Code Text…in Original Document Structure 109
  • 110. A Basic Walk-through with LIWC2015 110
  • 111. General Process (redux) • Theoretical underpinning(s) • Research design • Text collection (searchable file types, file naming protocols) • Text cleaning (normalization) • LIWC runs and re-runs (word counts, percentages of words in content categories) • Analytic conclusions • Further research within and beyond LIWC 111
  • 114. Some Types of Askable Questions From Computational Linguistic Analysis 114
  • 115. Some Types of Basic Askable Questions Pre-existent (or “found”) text: • Are there statistically significant differences in linguistic writing styles between authors (author style profiles)? • Authors of different genders? Age groups? Cultures? Languages? Backgrounds? Experiences? • If so, what are the differences? Are these consistent differences? Do these differences hold across different conditions and contexts? What could these differences mean? • What is the text profile of “successful” vs. “unsuccessful” genres of writing? Do such differentiating text profiles exist in a meaningful way? Are these effects explainable based on the linguistic features? 115
  • 116. Some Types of Basic Askable Questions(cont.) Pre-existent (or “found”) text (cont.): • Are there linguistic markers / indicators in text sets that may indicate particular outcomes in terms of reception of the text / text sets? Outcomes for the authors of the text sets? Other outcomes? • Are there linguistic patterns from certain genres of writing? Genres of writing in certain time periods? • Are there patterned observable differences between spoken words and written ones in a particular context? How spontaneous (raw) or edited (processed) were the respective source texts? 116
  • 117. Some Types of Basic Askable Questions (cont.) Pre-existent (or “found”) text (cont.): • What are some summary features (descriptions) of the document or text corpus? How are function words used in the text set? • What are some observed sentiment features of the text set? How do these features correlate with features in the real-world? • What are some observable psychometric features of the text set? How do these features correlate with features in the real-world? • How do the various features of one document or text set compare and contrast against another? 117
  • 118. Some Types of Basic Askable Questions (cont.) Elicited text: • What are some linguistic features of the elicited texts? • What are some creative prompts for elicited spoken words (like think-aloud prompts) vs. elicited written words? • Are there identifiable patterns that may be found in those elicited texts? Do different prompts result in identifiably different types of texts, and if so, how? • How does writing change over time (in terms of observed linguistic features)? 118
  • 119. Some Types of Basic Askable Questions (cont.) Elicited text (cont.): • How does writing change in a pre- and post- intervention scenario? • What role can writing play as an intervention itself? • Are there different writing patterns that may be identified among different people groups (such as based on demographic factors or categorical factors)? What might this mean? 119
  • 120. Challenges with Internal and External Validation 120
  • 121. Some Challenges with the Word Counting Method • Inherent lexical ambiguity and polysemous nature of language (a counted term can be understood different ways based on the context, author intention, and usage) • Focus on the single unigram / one-gram (instead of two-grams, three- grams, four-grams, and so on, as phrases) • A lack of contextual awareness in a “bag of words” paradigmatic approach (except for counts within documents in sets comprised of stand-alone documents) • Some small mitigation in terms of the color coding of terms found in a document that are in the LIWC2015 dictionary, but this requires human “close reading” of the document (whether academic reading, skimming, or scanning) 121
  • 122. Some Challenges with the Word Counting Method(cont.) • A core base language has to be selected even though there are dictionaries in English and non-English languages • Multi-language datasets cannot be run simultaneously (but may be run individually, with findings applied in a complementary way) 122
  • 124. Definitions: Validation / (In)Validation Internal Validity • How well the words represent the constructs that they are supposed to represent • How solidly does LIWC2015 work based on its conceptualization, creation, testing, and design External Validity • How well identified textual indicators predict “ground truth” or “state of the world” • How well the findings may be generalized to the world (or the slice of the world that is being studied) • Also how applicable the findings may be to other similar (~) cases • How findings compare to base rates of particular textual phenomena in particular text genres 124
  • 125. Internal (In)validation • Evaluation of each of the steps to the process, the execution at each step, and the overall work • Theoretical underpinning • Research design • Text collection • Text cleaning • Text analysis instrument functioning • LIWC runs and re-runs • Text set treated as individual files and as a collection • Analytic conclusions 125
  • 126. Testing of Predictive Modeling Accuracy • Testing predictive modeling based on other measures (created by people or by other programs) • Both precision and recall are important for a predictive construct, but there may be tradeoffs between these two features • To ensure that predicted positives are actual positives, a threshold may be set too high, leaving out many actual positives but resulting in fewer false positives (so high precision, but low recall) • To ensure that all the positives that exist in a set are captured, a threshold for inclusion may be set too low, capturing a lot of false positives (so high recall, but low precision) • Ideally, both precision and recall should be as high as possible • In a perfect balance, all identified positives are true positives, and every single actual positive is identified from a set • F-measure / F1 score / F-score (“weighted harmonic mean” between precision and recall) • Is expressed as a number between 0 and 1 (where 1 is perfect precision and perfect recall) • 0 ≤ p ≤ 1 • 0 ≤ r ≤ 1 126
  • 127. F-measure: Precision and Recall Precision “p” (predicted positive results): true positives / (true positives + false positives) • How sensitive is the test to the identification of true positives (without confusing false positives with the true)? • How much noise is in the results? Is the test overweighted towards finding positives and so falsely categorizing false positives (undesirable)? • High precision means that an identified positive is highly likely to actually be a true positive (and not a false positive). Low precision means that an identified positive could well be a false positive. Recall “r” (capturing of actual positive results): true positives / (true positives + false negatives) • How many of the true positives have been identified (from the full set of all true-positive possibilities)? • High recall means that most or all of the true positives are identified by the test. • Low recall means that many of the true positives were missed. In this case, the test is not trusted to include all possible true positives because many are missed. 127
  • 128. Testing of Predictive Modeling Accuracy (cont.) • F1 = 2 / [ (1 / recall) + (1 / precision) ] • F1 simplified = (2 × Precision × Recall) / (Precision + Recall) • OR F1 simplified: F = 2 * [ (pr) / (p + r) ] • An ideal test identifies the target phenomena accurately (p) and thoroughly (r). 128
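The definitions above work out numerically as follows, with an invented confusion count used purely for illustration:

```python
# Precision, recall, and their harmonic mean (F1) from raw confusion counts.
def precision(tp, fp): return tp / (tp + fp)
def recall(tp, fn): return tp / (tp + fn)
def f1(p, r): return 2 * p * r / (p + r)

p = precision(tp=8, fp=2)   # 8 of 10 flagged items were real positives
r = recall(tp=8, fn=8)      # but only 8 of 16 real positives were found
score = f1(p, r)            # high precision, low recall -> middling F1
```

Here precision is 0.8 and recall is 0.5, so F1 lands between them at about 0.615, illustrating the precision/recall tradeoff the slides describe.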
  • 129. Some Limits to Dictionary-Based Classifiers • Dictionary-based classifier systems tend to be high on “precision” but low on “recall” (Aoqui, 2012) • Natural language (particularly in an age of social media) evolves quickly, with new terms and new word usages occurring constantly • Dictionaries used in classifiers are often updated through rigorous manual (“by hand”) processes, which require large human investments and efforts…and time 129
  • 130. External (In)validation • Light “sanity check” of text counts • Comparisons against extant baselines (if available) • Comparisons against text results of a control group (vs. the experimental group) • Comparisons against human coding (if feasible) • Comparisons with other similar research (if available) • Comparisons of phenomena based on other similar / dissimilar text sets • Testing of linguistic indicators in other similar (~) contexts for applicability • Fine-tuning • Testing of linguistic insights with “ground truth” (assessed by other means) 130
  • 131. Some Other Computational Linguistic Analysis Tools 131 Note: Most of these were not tested by the author.
  • 132. Some Other Common Computational Linguistic Analysis Tools • Computational Social Science Lab (CSSL) at the University of Southern California’s Text Analysis, Crawling and Interpretation Tool (TACIT) • http://tacit.usc.edu/ • Free and open-source • Art Graesser’s Coh-Metrix program (coherence metrics in text) • http://cohmetrix.com/ • Rod Hart’s DICTION program • http://www.dictionsoftware.com • Tom Landauer’s Latent Semantic Analysis • http://lsa.colorado.edu • CASOS’ AutoMap (network text analysis) • http://www.casos.cs.cmu.edu/proj ects/automap/ • Free 132
  • 133. Some Other Approaches to the Data 133
  • 134. File Export Formats • No direct way to save a project in LIWC2015 so need to “Save Results” • “Analyze Text” and “Categorize Words” data export as following file types: • .txt (ASCII text) • .csv (comma separated values) • .xlsx (XML spreadsheet file format in Excel 2007 onwards) • “Color-Code Text” function data results cannot be saved out directly but may be copied and pasted in MS Word with color intact (not in Notepad or simple text editors) 134
  • 135. From LIWC2015 -> Other analytics • Light analytics and counts in LIWC2015 • No access to coded text sets from which numbers are extrapolated (so a kind of black-box processing except for the software manual documentation) • Export…depending on the text curation process…sometimes requiring data restructuring…sometimes requiring information and assumptions beyond the extracted texts… • …In Excel: mostly descriptive and comparison data • Range of quant processing (averaging data, summing data, and others) • Data visualizations (stacked bar charts, line graphs, and others) 135
  • 136. From LIWC2015 -> Other analytics (cont.) • …In SPSS: asking harder questions based on the studied texts • Statistical significance computations • Chi-square computations • Factor analyses • Content analysis through human “close reading” • LIWC2015 variables involve (reproducible) counts and some psychometrics, but its use is always through a researcher interpretive lens • Researcher identifies what is relevant and why • Researcher brings knowledge of the topic and field to the interpretation 136
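The export-then-analyze workflow above (LIWC2015 → .csv → descriptive statistics) can be sketched with Python's standard library. The column names mirror typical LIWC2015 summary output, but the file contents below are invented stand-in values:

```python
import csv
import io

# A miniature stand-in for a LIWC2015 "Save Results" .csv export
# (column names follow LIWC2015 summary output; the values are invented).
liwc_csv = io.StringIO(
    "Filename,WC,Analytic,Clout,Authentic,Tone\n"
    "doc1.txt,500,92.1,48.0,25.3,40.0\n"
    "doc2.txt,750,88.3,52.0,30.1,44.0\n"
)

rows = list(csv.DictReader(liwc_csv))

def column_mean(rows, column):
    """Average a numeric LIWC output column across documents."""
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)

print(round(column_mean(rows, "Analytic"), 2))  # mean Analytic score across docs
```

The same loaded rows could then feed harder questions (chi-square tests, factor analyses) in a statistics package, as the slide notes.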
  • 137. Conclusion And some Newbie Observations 137
  • 138. Initial Impressions of LIWC • Functions are simple and mechanistic: counting and tallying • User interface is simple • Processes are simple • Potential is intriguing and promising: • Power lies in the dictionaries and the domain-based insights • Power lies in the respective text sets for sufficiency of data • Power lies in discovery and hypothesis-testing • Power lies in surfacing insights that would not be knowable otherwise in an efficient way 138
  • 139. Conceptualizing the Extracted Data • Because these tools offer predictions based on probability, however, such insights will never be definitive. “In the final analysis, our situation is much like that of economists,” (James W.) Pennebaker says. “It’s too early to come up with a standardized analysis. But at the end of the day, we all are making educated guesses, the same way economists can understand, explain and predict economic ups and downs.” • – Jan Dönges, “What Your Choice of Words Says about Your Personality,” July 1, 2009, Scientific American 139
  • 140. Some Newbie Observations about the Software Tool • LIWC2015 (and computational linguistics) is young yet and still very much in the exploratory phases • There are challenges with setting baselines against which subsets and new sets of texts may be compared for insights; likewise, there are challenges in setting control groups for experimental research • Some insights are domain-specific (and culture specific), and others may be general and cross-domain • A major trap involves over-asserting from limited observations (and limited text sets) • Facile interpretations are risky and should be avoided • J.W. Pennebaker: “Don’t trust your instincts.” • LIWC2015 is a lot of fun to use (maybe a little dangerously so)! 140
  • 141. Some Newbie Observations for Researchers • As with some software tools, it is easy to come up with a lot of data with just a few clicks…but accurate analysis will require the following: • Understanding • the strengths and weaknesses of the tool • the strengths and weaknesses of the curated text sets • with some text sets more amenable to the application of computational linguistic analysis than others • with Heaps’ law / Herdan’s law implications for the amount of text and diminishing returns on the number of distinct vocabulary elements after a certain amount of collected text • the particular domain / discipline • the research context 141
  • 142. Some Newbie Observations for Researchers (cont.) • Accurate analysis will require understanding… (cont.) • what to select as relevant from the mass of numerical data • ways to test the findings (internal and external validation) • Internal: the text set(s) • External: the world (what the text findings may indicate about the author…the state of the genre…the state of the world, often through statistical means) • ways to hypothesize about and interpret the findings 142
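The Heaps’ / Herdan’s law point (diminishing vocabulary returns as text accumulates) can be illustrated with a tiny sketch. The law models distinct vocabulary size V as roughly K * N**beta for N collected tokens; the constants here are illustrative assumptions, not values fitted to any corpus:

```python
# Heaps' law: distinct vocabulary V grows sublinearly with text size N,
# roughly V ≈ K * N**beta (beta is often cited around 0.4-0.6 for English).
# K and beta below are illustrative only; real values must be fit per corpus.
K, beta = 30, 0.5

for n_tokens in (1_000, 10_000, 100_000, 1_000_000):
    v = K * n_tokens ** beta
    print(f"{n_tokens:>9} tokens -> ~{int(v)} distinct words")
```

With beta = 0.5, a thousandfold increase in collected text yields only about a thirtyfold increase in distinct vocabulary, which is the "diminishing returns" caveat for text-set sufficiency.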
  • 143. Going to the Source • “The Development and Psychometric Properties of LIWC2015” by James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn addresses the following and more: • Psychometric baselines • Examples • Test text corpora details • Internal consistency measures of the various psychological constructs (based on uncorrected α and corrected α) • Uncorrected α based on average of standard Cronbach’s alpha calculation for the term across corpora • Challenges of highly variant base rates of word occurrences across documents and corpora • Corrected α based on Spearman-Brown prediction (prophecy) formula which takes into account the amount of underlying text (“test length”) in attributing internal reliability (the more data, the stronger the potential reliability) 143
  • 144. Internal Consistency of Words as Indicators of Psychometric Constructs…and Confidence Levels • The built-in psychological constructs have a range of confidence based on prior research using large and diverse text corpora. • The higher the internal consistency measures (usually measuring correlations between different items on the same test / words in the same construct, and reported on a restricted 0 – 1 scale), the greater the reliability of the software tool in using language to identify a psychological construct (and therefore the higher the confidence users may have in the output). • Uncorrected α tends to “grossly underestimate reliability in language categories due (to) the highly variable base rates of word usage within any given category” • Corrected α is based on the Spearman-Brown prediction formula, and this is considered “a more accurate approximation of each category’s ‘true’ internal consistency” (Pennebaker, Boyd, Jordan, & Blackburn, 2015, p. 8). 144
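The Spearman-Brown prediction (prophecy) formula mentioned above has a standard form: predicted reliability = k·r / (1 + (k − 1)·r), where r is the observed reliability and k is the factor by which test length (here, the amount of underlying text) is multiplied. This sketch shows the formula itself; how LIWC2015 chooses k for its corrected α is documented in the Pennebaker et al. (2015) paper, not reproduced here, and the numbers below are invented:

```python
def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by k,
    given observed reliability r (Spearman-Brown prophecy formula)."""
    return (k * r) / (1 + (k - 1) * r)

# Illustration: a category with a low uncorrected alpha of 0.30 looks far
# more reliable if the amount of underlying text were (hypothetically) tripled.
print(round(spearman_brown(0.30, 3), 2))  # → 0.56
```

This is why corrected α can sit well above uncorrected α for low-base-rate word categories: more underlying text raises the predicted internal reliability.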
  • 145. Basic Approaches for Researchers Researchers… • study the built-in dictionary and other available .dics in LIWC2015 • read up on the related academic research literature (and extract the methods that make the most sense) • experiment with testable hypotheses about particular textual datasets in particular disciplines • trial-run the tool on datasets about which the researcher is already intimate to get a sense of the tool • sample broadly in terms of texts *and* sample texts in a targeted and strategic way, too 145
  • 146. Basic Approaches for Researchers (cont.) Researchers… (cont.) • practice cleaning (pre-processing) textual data for processing in LIWC2015 • formulate and temper hypotheses based on the LIWC findings • work to find counter-evidence to one’s hypotheses (both seeking validation and invalidation) • use in-world knowledge to interpret and test the findings from LIWC2015 146
  • 147. Advanced Approaches for Researchers Researchers… • design their own research approaches using LIWC2015 • use a variety of research methods to capture the desired knowledge more fully (so not just using LIWC alone) • create their own dictionaries for limited and targeted questions 147
  • 148. Interpreting One Document (this slideshow in an earlier iteration) • Scores vs. counts (with computed percentages) • Scores involving calculations (some algorithmic processing) of raw counts • Counts (raw) within groups of dimensions (and computed percentages) 148
  • 149. Addendum: An Applied Example … based on this slideshow v. 1 and v. 2… 149 Data Visualizations: The following data visualizations were mostly created after a second run of the slideshow was done in LIWC2015, so there are some small discrepancies between the posted numerical data and the numerical data used to create the data visualizations (in Excel). The slideshow itself changed with the addition of the analytical data from LIWC…and from updates as the presenter started to better understand the software (still as a neophyte). Sorry about any inconvenience from the data discrepancy.
  • 150. A LIWC Run on this Slideshow 150
  Filename: LIWC-ing at Texts for Insights from Linguistic Patterns.pdf (Segment 1)
  Summary: WC 5888, Analytic 95.05, Clout 51.70, Authentic 22.36, Tone 37.80, WPS 46.36, Sixltr 35.39, Dic 62.42
  Function words: function 31.10, pronoun 2.79, ppron 0.39, i 0.02, we 0.05, you 0.07, shehe 0.00, they 0.25, ipron 2.39, article 4.55, prep 13.20, auxverb 3.52, adverb 2.09, conj 6.22, negate 0.54
  Other grammar: verb 8.14, adj 5.06, compare 2.55, interrog 0.87, number 6.08, quant 2.55
  Affect: affect 3.09, posemo 1.80, negemo 1.12, anx 0.27, anger 0.20, sad 0.17
  Social: social 4.45, family 0.07, friend 0.05, female 0.07, male 0.07
  Cognitive: cogproc 14.45, insight 4.26, cause 3.16, discrep 0.61, tentat 3.31, certain 1.00, differ 3.57
  Perceptual: percept 1.14, see 0.51, hear 0.25, feel 0.12
  Biological: bio 0.46, body 0.10, health 0.24, sexual 0.03, ingest 0.08
  Drives: drives 4.79, affiliation 0.88, achieve 1.26, power 1.70, reward 0.97, risk 0.27
  Time orientation: focuspast 1.32, focuspresent 4.38, focusfuture 0.85
  Relativity: relativ 9.21, motion 0.95, space 6.40, time 1.75
  Personal concerns: work 5.25, leisure 0.39, home 0.10, money 0.14, relig 0.05, death 0.08
  Informal: informal 0.68, swear 0.02, netspeak 0.54, assent 0.02, nonflu 0.12, filler 0.00
  Punctuation: AllPunc 32.69, Period 2.43, Comma 5.01, Colon 2.38, SemiC 0.39, QMark 1.02, Exclam 0.02, Dash 2.09, Quote 1.46, Apostro 0.19, Parenth 7.46, OtherP 10.26
  • 151. Summary Language Variables Scores (/100, as a percentile measure) • Analytic: 95.26 (score) • Clout: 55.01 (score) • Authentic: 22.15 (score) • (Emotional) Tone: 40.94 (score) Counts (raw counts) • WC (word count): 4281 (count) • WPS (average words per sentence): 47.57 (count) • Sixltr (words > 6 letters): 36.32 (count) • Dic (number of words out of 100 in the built-in dictionary, application of the dictionary to the text): 63.68 (count) 151
  • 152. Linguistic Dimensions (% of words in a text that fit in a certain linguistic category, including 21 possible dimensions) • Function (vs. non-function words): 31.51 • Pronoun: 2.66 • Ppron: 0.40 • I: 0.00 • We: 0.05 • You: 0.07 • Shehe: 0.00 • They: 0.28 • Ipron (impersonal pronouns): 2.27 152 • Article: 4.58 • Prep: 13.50 • Auxverb: 3.53 • Adverb: 2.13 • Conj: 6.49 • Negate: 0.35
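The "% of words in a text that fit in a certain linguistic category" counting described above can be illustrated with a minimal sketch. Note this is a toy: LIWC2015's actual tokenization, stem-matching ("word*" patterns), and dictionaries are more elaborate, and the tiny category word lists below are invented for illustration:

```python
import re

# Toy category dictionaries in the spirit of LIWC categories (invented, tiny).
categories = {
    "ppron":  {"i", "we", "you", "she", "he", "they"},
    "negate": {"no", "not", "never"},
}

def category_percentages(text, categories):
    """Percentage of words in `text` falling into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        name: 100.0 * sum(w in vocab for w in words) / len(words)
        for name, vocab in categories.items()
    }

sample = "We never said you should not trust the counts."
print(category_percentages(sample, categories))
```

Here "we" and "you" hit ppron and "never" and "not" hit negate, so each category scores 2 of 9 words, about 22.2%; the real tool does the same kind of tally across dozens of built-in categories.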
  • 153. [Bar chart: Pronoun Use (pronoun, ppron, i, we, you, shehe, they, ipron)] 153
  • 154. [Bar chart: Non Pronoun Function Words (article, prep, auxverb, adverb, conj, negate)] 154
  • 155. Linguistic Dimensions: Other Grammar • Verb: 8.76 • Adj: 5.02 • Compare: 2.45 • Interrog: 0.93 • Number: 4.34 • Quant: 2.27 • Common verbs • Common adjectives • Comparisons • Interrogatives • Numbers • Quantifiers (few, many, much) 155
  • 157. Psychological Processes • Affect: 2.71 (summary percentage from the category but not with rounded-up numbers) • Posemo: 1.71 • Negemo: 0.86 • Anx: 0.23 • Anger: 0.21 • Sad: 0.14 157
  • 159. Social Processes • Social: 4.56 (summary percentage from the category but not with rounded-up numbers) • Family: 0.05 • Friend: 0.05 • Female: 0.07 • Male: 0.07 159
  • 161. Cognitive Processes • Cogproc: 14.55 (summary percentage from the category but not with rounded-up numbers) • Insight: 4.70 • Cause: 3.27 • Discrep: 0.58 • Tentat: 3.34 • Certain: 0.86 • Differ: 3.06 161
  • 163. Perceptual Processes • Percept: 1.24 (summary percentage from the category but not with rounded-up numbers) • See: 0.58 • Hear: 0.26 • Feel: 0.14 163
  • 164. [Bar chart: Perceptual Processes — percept 1.14, see 0.51, hear 0.25, feel 0.12] 164
  • 165. Biological Processes • Bio: 0.42 (summary percentage from the category but not with rounded-up numbers) • Body: 0.09 • Health: 0.23 • Sexual: 0.02 • Ingest: 0.07 165
  • 166. [Bar chart: Biological Processes — bio 0.46, body 0.10, health 0.24, sexual 0.03, ingest 0.08; y-axis: Percentage of Found Terms in Target Document] Note: This was visualized from a re-run of the revised slideshow, so the numbers are slightly different from those in the slide.
  • 167. Drives Drives: 4.25 (summary percentage from the category but not with rounded-up numbers) Affiliation: 0.79 Achieve: 1.17 Power: 1.47 Reward: 0.68 Risk: 0.30 167
  • 168. Time Orientation • Focuspast: 1.35 • Focuspresent: 0.93 • Relativ: 9.23 168
  • 169. [Bar chart: Time Orientation (focuspast, focuspresent, focusfuture)] 169
  • 170. Relativity • Motion: 0.98 • Space: 6.14 • Time: 2.01 170
  • 171. Personal Concerns • Work: 5.98 • Leisure: 0.42 • Home: 0.09 • Money: 0.16 • Relig: 0.07 • Death: 0.09 171 [Filled radar chart: Personal Concerns (work, leisure, home, money, relig, death)]
  • 172. Informal Language • Informal: 0.61 (0.63 if rounded up from numbers below) • Swear: 0.00 • Netspeak: 0.49 • Assent: 0.02 • Nonflu: 0.12 • Filler: 0.00 172
  • 174. Punctuation Categories • AllPunc: 29.95 (> 21.35 average in LIWC2015 analysis of training set) • Period: 2.50 • Comma: 5.40 • Colon: 0.86 • SemiC: 0.33 • Qmark: 0.75 • Exclam: 0.02 • Dash: 2.03 • Quote: 1.45 • Apostro: 0.23 • Parenth: 6.89 • OtherP: 9.48 174
  • 175. [Bar chart: Punctuation Categories (AllPunc, Period, Comma, Colon, SemiC, QMark, Exclam, Dash, Quote, Apostro, Parenth, OtherP)] 175
  • 176. Extrapolated Content Summary (rough) 176 [Bar chart: Extrapolated Content Summary (rough) — function, affect, social, cogproc, insight, percept, bio, drives, "timeorient", relativ, "perscons", informal, AllPunc]
  • 177. Extrapolated Content Summary (rough) (cont.) • So, a slideshow focused on cognitive processes, relativity, time orientation, personal concerns, drives, social processes, and insights, in descending order (based on the prior summary slide, which does not sum cleanly to 100% because of rounding) 177 [Bar chart: A Rough Content Summary]
  • 178. Color-Coded by Time Dimension 178
  • 179. Four Summary Language Dimensions of this Slideshow (in a spider chart) 179 [Spider chart: Analytic 95.26, Clout 55.01, Authentic 22.15, Emotional Tone 40.94]
  • 180. Four Summary Language Dimensions of this Slideshow Descriptive? • Do these summary dimensions capture the human close- reading sense of the slideshow? • If so, what are its insights? • If not, where does it fall short? Prescriptive? • Analytic: Should this work be so highly analytic? (score: 95.26) • Clout: Should there be more effort to raise the clout score to indicate more expertise, confidence, status, and leadership? (even though the presenter is a newbie to LIWC) (score: 55.01) 180
  • 181. Four Summary Language Dimensions of this Slideshow (cont.) Descriptive? • What are ways to capture other aspects of the text or texts beyond the summary language dimensions? (and outside of LIWC) Prescriptive? • Authentic: Should the language in this slideshow be more personable and authentic? Less guarded? More vulnerable? (score: 22.15) • Tone (emotion): Given its emotional tone, which is trending a little negative (< 50), should more effort be made to make it trend more positive? (score: 40.94) 181
  • 182. Comments? Questions? • Any insights about challenges to interpreting the data without baselines? Control text sets? Without comparatives? • Insight about why the color coding of the target document or text set may be helpful? • Ideas for new applications of LIWC2015? Fresh research ideas? • Strengths of the tool and research methodology? Weaknesses? Ways to strengthen this approach? 182
  • 183. Conclusion and Contact • Dr. Shalin Hai-Jew • iTAC, Kansas State University • 212 Hale / Farrell Library • shalin@k-state.edu • 785-532-5262 • No ties: The presenter has no tie to the maker of LIWC (Pennebaker Conglomerates, Inc.). • Data visualizations: The simple data visualizations were created in Microsoft Excel using a re-run of the LIWC tool over this revised slideshow, so the numbers in the data visualizations are slightly different from the numbers in the first set used in the Addendum: An Applied Example (Slide 149 onwards). • Newbie alert! Also, the presenter is a newcomer to LIWC and is still LIWC-ing around. If you see an error, please contact the presenter, so the slideshow may be corrected. Thanks! • And less-newbie learning: A new section was added as the presenter worked on her first custom dictionary, so if you downloaded an earlier version, please download this current one.  183