Paper: https://arxiv.org/abs/1803.06525
A recent research trend has emerged to identify developers’ emotions by applying sentiment analysis to the content of communication traces left in collaborative development environments. To overcome the limitations of off-the-shelf sentiment analysis tools, researchers have recently started to develop their own tools for the software engineering domain. In this paper, we report a benchmark study assessing the performance and reliability of three sentiment analysis tools specifically customized for software engineering. Furthermore, we offer a reflection on the open challenges, as they emerge from a qualitative analysis of misclassified texts.
A Benchmark Study on Sentiment Analysis for Software Engineering Research
1. A Benchmark Study on Sentiment Analysis for Software Engineering Research
Nicole Novielli
@NicoleNovielli
Filippo Lanubile
@lanubile
Daniela Girardi
@DanielaGirard91
2. Sentiment analysis for software engineering
Collaborative software development
– Security concerns detection (Pletea et al., MSR’14)
– Impact on productivity (Ortu et al., MSR‘15)
– Early burnout discovery (Mantyla et al. MSR’15)
– Anger detection (Gachechiladze et al., ICSE-NIER‘17)
Collaborative knowledge sharing
– Empirically-driven guidelines for question writing (Calefato et al., IST 2018)
Requirements engineering
– User feedback (Guzman and Maalej, RE‘14)
– App improvement (Panichella et al., ICSME ‘14)
3. Off-the-shelf tools for sentiment analysis
Tool | Approach | Output | Validated on
NLTK (http://text-processing.com/) | Supervised learning, bag-of-words | Probabilities: p(positive), p(negative), p(neutral) | Movie reviews, tweets
Stanford NLP (http://nlp.stanford.edu/sentiment/) | Supervised learning | Sentiment score in [0,4]: 0 = very negative, 2 = neutral, 4 = very positive | Movie reviews
SentiStrength (http://sentistrength.wlv.ac.uk/) | Lexicon-based: dictionaries with a priori polarity scores in [-5,5] | Sentiment scores: negative in [-5,-1], positive in [1,5], neutral in (-1,1) | Social media: YouTube, Twitter, MySpace, …
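The lexicon-based approach above can be sketched in a few lines. The tiny lexicon and the single aggregated score below are illustrative simplifications, not SentiStrength's actual dictionaries (the real tool reports separate positive and negative scores):

```python
# Minimal sketch of lexicon-based sentiment scoring: each word carries an
# a priori polarity score in [-5, 5]; the summed score drives a trinary label.
# The lexicon entries here are invented for illustration.
LEXICON = {"good": 2, "great": 3, "happy": 3, "bad": -2, "awful": -4, "bug": -1}

def polarity(text):
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score <= -1:
        return "negative"
    if score >= 1:
        return "positive"
    return "neutral"

print(polarity("this code looks good"))     # positive
print(polarity("awful bug in the parser"))  # negative
```

Words absent from the lexicon contribute nothing, which is why purely technical sentences tend to land in the neutral band.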
Are off-the-shelf sentiment analysis tools
reliable for software engineering research?
4. RQ1: Do different sentiment analysis tools agree with emotions of software developers?
RQ2: Do sentiment analysis tools agree with each other?
RQ3: Do different sentiment analysis tools lead to contradictory results in software engineering studies?
RQ4: How does the choice of a sentiment analysis tool affect conclusion validity?
Findings:
• The tools disagree with each other
• Poor performance on technical texts
• Disagreement can lead to diverging conclusions
Need for software engineering (SE) specific tools for sentiment analysis
5. SE-specific sentiment analysis tools
• Senti4SD (Calefato et al., EMSE 2017): supervised
• SentiCR (Ahmed et al., ASE ’17): supervised
• SentiStrength-SE (Islam and Zibran, MSR ’17): lexicon-based
F. Calefato, F. Lanubile, F. Maiorano, N. Novielli. Sentiment Polarity Detection for Software Development. EMSE, 2017.
T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi. SentiCR: a customized sentiment analysis tool for code review interactions. ASE, 2017.
M.D.R. Islam and M.F. Zibran. Leveraging automated sentiment analysis in software engineering. MSR, 2017.
6. Our replication
Research questions
Original study:
• RQ1: Do different sentiment analysis tools agree with emotions of software developers?
• RQ2: Do sentiment analysis tools agree with each other?
Our replication:
• RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
• RQ2: Do SE-specific sentiment analysis tools agree with each other?
Tools
Off-the-shelf (original study):
• NLTK
• Stanford NLP
• Alchemy API
• SentiStrength
SE-specific (our replication):
• Senti4SD (Calefato et al., EMSE 2017)
• SentiCR (Ahmed et al., ASE ’17)
• SentiStrength-SE (Islam and Zibran, MSR ’17)
• SentiStrength (baseline)
7. Our replication
Gold standard datasets
392 comments
(Murgia et al., MSR’14)
5869 comments
(Murgia et al., MSR’16)
4423 questions, answers, and comments
(Calefato et al., EMSE 2017)
Model-driven annotation
8. Model-driven annotation of emotions
Mapping emotions to polarity (Shaver et al., 1987):

Emotion | Original study | Our replication
Love | Positive | Positive
Joy | Positive | Positive
Surprise | Positive | Ambiguous
Anger | Negative | Negative
Sadness | Negative | Negative
Fear | Negative | Negative
No emotion | Neutral | Neutral

Example: “I'm happy with the approach and the code looks good”
→ Joy (Shaver's Joy category includes Happiness and Satisfaction) → positive polarity
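The mapping is small enough to express directly. A minimal sketch, with emotion names taken from Shaver et al.'s top-level categories as listed on the slide:

```python
# Polarity mapping for Shaver et al.'s basic emotions, as used in the
# replication: Surprise is treated as ambiguous rather than positive.
EMOTION_POLARITY = {
    "love": "positive",
    "joy": "positive",
    "surprise": "ambiguous",
    "anger": "negative",
    "sadness": "negative",
    "fear": "negative",
    "no emotion": "neutral",
}

def to_polarity(emotion):
    """Map an emotion label (case-insensitive) to a polarity class."""
    return EMOTION_POLARITY[emotion.lower()]

# The example sentence above was annotated with Joy, hence positive polarity.
print(to_polarity("Joy"))  # positive
```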
9. Our replication
Gold standard datasets
392 comments
(Murgia et al., MSR’14)
5869 comments
(Murgia et al., MSR’16)
4423 questions, answers, and comments
(Calefato et al., EMSE 2017)
Model-driven annotation
1500 sentences from Q&A about Java libraries (Lin et al., ICSE’18)
1600 comments from
code review (Ahmed et al., ASE’17)
Ad-hoc annotation
10. Model-driven vs. ad-hoc annotation

 | Model-driven | Ad-hoc
Theoretical models | Yes | No
Training of raters | Yes | No
Guidelines for annotation | Based on taxonomy | Based on subjective perception
11. Our replication
Research questions
Original study:
• RQ1: Do different sentiment analysis tools agree with emotions of software developers?
• RQ2: Do sentiment analysis tools agree with each other?
Our replication:
• RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
• RQ2: Do SE-specific sentiment analysis tools agree with each other?
• RQ3: To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
12. Metrics
Original study:
• Weighted Cohen’s kappa (Cohen, 1968)
Our replication:
• Weighted Cohen’s kappa (Cohen, 1968)
• Text categorization metrics (Sebastiani, 2002): precision, recall, F-measure
J. Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 4, 213-220.
F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34, 1, 1-47.
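The text categorization metrics can be computed per polarity class as follows. A minimal sketch; the gold and predicted label sequences are invented for illustration:

```python
def per_class_prf(gold, predicted, label):
    """Precision, recall, and F-measure for one polarity class
    (Sebastiani, 2002), computed from paired label sequences."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "negative"]
print(per_class_prf(gold, pred, "positive"))  # (1.0, 0.5, 0.666...)
```

Reporting these per class, rather than plain accuracy, matters here because the neutral class dominates developer communication datasets.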
13. Weighted Cohen’s kappa
Disagreement weights: strong (2) vs. mild (1)

 | negative | neutral | positive
negative | 0 | 1 | 2
neutral | 1 | 0 | 1
positive | 2 | 1 | 0
Interpretation (Viera and Garrett, 2005)
• less than chance if κ ≤ 0
• slight if 0.01 ≤ κ ≤ 0.20
• fair if 0.21 ≤ κ ≤ 0.40
• moderate if 0.41 ≤ κ ≤ 0.60
• substantial if 0.61 ≤ κ ≤ 0.80
• almost perfect if 0.81 ≤ κ ≤ 1
A.J. Viera, J.M. Garrett. 2005. Understanding interobserver agreement: the kappa statistic. Family Medicine, 37,5, 360–363
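Weighted kappa with the disagreement weights above can be sketched as follows (a minimal implementation, not the authors' code):

```python
# Weighted Cohen's kappa (Cohen, 1968) with the slide's weights:
# 2 for negative<->positive, 1 for disagreements involving neutral,
# 0 for agreement. Labels: 0 = negative, 1 = neutral, 2 = positive.
WEIGHTS = [[0, 1, 2],
           [1, 0, 1],
           [2, 1, 0]]

def weighted_kappa(rater_a, rater_b, n_classes=3):
    n = len(rater_a)
    # Observed joint distribution of the two raters' labels
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_classes)]  # rater A marginals
    col = [sum(observed[i][j] for i in range(n_classes))
           for j in range(n_classes)]                   # rater B marginals
    w_obs = sum(WEIGHTS[i][j] * observed[i][j]
                for i in range(n_classes) for j in range(n_classes))
    # Expected weighted disagreement if the raters labeled independently
    w_exp = sum(WEIGHTS[i][j] * row[i] * col[j]
                for i in range(n_classes) for j in range(n_classes))
    return 1.0 - w_obs / w_exp

print(weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1]))  # 1.0 (perfect agreement)
```

With these weights, a tool that confuses negative with positive is penalized twice as much as one that merely drifts into neutral, which matches the strong-vs-mild distinction on the slide.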
15. RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
Original study (off-the-shelf tools): fair agreement
Our replication (SE-specific tools vs. manual annotation): substantial agreement
16. RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
Our replication: SE-specific tools vs. manual annotation
Note: opportunistic sampling using SentiStrength (Calefato et al., EMSE 2017)
17. Our replication: SE-specific tools vs. manual annotation
RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
• SE-specific optimization improves the classification accuracy
• Retraining supervised tools produces better performance
• Comparable performance for SentiStrength-SE (lexicon-based)
19. RQ2: Do SE-specific sentiment analysis tools agree with each other?
Original study (off-the-shelf tools): from less than chance to fair agreement
Our replication (SE-specific tools): from substantial to perfect agreement
20. RQ3: To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
(Results figure: model-driven annotation vs. ad-hoc annotation)
21. RQ3: To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
Model-driven annotation:
• From substantial to perfect agreement, also between supervised and lexicon-based tools
Ad-hoc annotation:
• From fair to moderate agreement
• Better agreement for supervised approaches
24. Error analysis
Polar facts but neutral sentiment: sentences that report a problem without conveying the writer’s emotion, and are therefore labeled neutral.
‘I tried the following and it returns nothing’
---
‘This creates an unnecessary garbage list. Sets.newHashSet should accept an Iterable.’
28. Lessons learned
• Reliable sentiment analysis in software engineering is possible
• Tuning of tools for software engineering improves classification accuracy
• SE-specific tools agree with manual annotation
• SE-specific tools agree with each other
• Grounding research on theoretical models of affect is recommended
• The choice of tool depends on the research goals: polarity vs. fine-grained emotions, emotions vs. attitudes, etc.
• A preliminary sanity check is always recommended