Paper: https://arxiv.org/abs/1803.06525
A recent research trend has emerged to identify developers’ emotions by applying sentiment analysis to the content of communication traces left in collaborative development environments. To overcome the limitations of off-the-shelf sentiment analysis tools, researchers have recently started to develop their own tools for the software engineering domain. In this paper, we report a benchmark study assessing the performance and reliability of three sentiment analysis tools specifically customized for software engineering. Furthermore, we offer a reflection on the open challenges, as they emerge from a qualitative analysis of misclassified texts.
A Benchmark Study on Sentiment Analysis for Software Engineering Research
1. A Benchmark Study on Sentiment Analysis for Software Engineering Research
Nicole Novielli
@NicoleNovielli
Filippo Lanubile
@lanubile
Daniela Girardi
@DanielaGirard91
2. Sentiment analysis for software engineering
Collaborative software development
– Security concerns detection (Pletea et al., MSR’14)
– Impact on productivity (Ortu et al., MSR‘15)
– Early burnout discovery (Mantyla et al. MSR’15)
– Anger detection (Gachechiladze et al., ICSE-NIER‘17)
Collaborative knowledge sharing
– Empirically-driven guidelines for question writing (Calefato et al., IST 2018)
Requirements engineering
– User feedback (Guzman and Maalej, RE‘14)
– App improvement (Panichella et al., ICSME ‘14)
3. Off-the-shelf tools for sentiment analysis
Tool | Approach | Output | Validated on
NLTK (http://text-processing.com/) | Supervised learning, bag-of-words | Probabilities: p(positive), p(negative), p(neutral) | Movie reviews, tweets
Stanford NLP (http://nlp.stanford.edu/sentiment/) | Supervised learning | Sentiment score in [0,4]: 0 = very negative, 2 = neutral, 4 = very positive | Movie reviews
SentiStrength (http://sentistrength.wlv.ac.uk/) | Lexicon-based: dictionaries with a priori polarity scores in [-5,5] | Sentiment scores: negative in [-5,-1], positive in [1,5], neutral in (-1,1) | Social media: YouTube, Twitter, MySpace, …
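The lexicon-based approach above can be sketched in a few lines. The tiny lexicon and the single aggregated score below are illustrative simplifications, not SentiStrength's actual dictionaries (the real tool reports separate positive and negative scores):

```python
# Minimal sketch of lexicon-based sentiment scoring: each word carries an
# a priori polarity score in [-5, 5]; the summed score drives a trinary label.
# The lexicon entries here are invented for illustration.
LEXICON = {"good": 2, "great": 3, "happy": 3, "bad": -2, "awful": -4, "bug": -1}

def polarity(text):
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score <= -1:
        return "negative"
    if score >= 1:
        return "positive"
    return "neutral"

print(polarity("this code looks good"))     # positive
print(polarity("awful bug in the parser"))  # negative
```

Words absent from the lexicon contribute nothing, which is why purely technical sentences tend to land in the neutral band.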
Are off-the-shelf sentiment analysis tools
reliable for software engineering research?
4. RQ1: Do different sentiment analysis tools agree with emotions of software developers?
RQ2: Do sentiment analysis tools agree with each other?
RQ3: Do different sentiment analysis tools lead to contradictory results in software engineering studies?
RQ4: How does the choice of a sentiment analysis tool affect conclusion validity?
Findings:
• The tools disagree with each other
• Poor performance on technical texts
• Disagreement can lead to diverging conclusions
Need for software engineering (SE) specific tools for sentiment analysis
5. SE-specific sentiment analysis tools
• Senti4SD (Calefato et al., EMSE 2017): supervised
• SentiCR (Ahmed et al., ASE ’17): supervised
• SentiStrength-SE (Islam and Zibran, MSR ’17): lexicon-based
F. Calefato, F. Lanubile, F. Maiorano, N. Novielli. Sentiment Polarity Detection for Software Development. EMSE, 2017.
T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi. SentiCR: a customized sentiment analysis tool for code review interactions. ASE, 2017.
M.D.R. Islam and M.F. Zibran. Leveraging automated sentiment analysis in software engineering. MSR, 2017.
6. Our replication
Research questions
Original study:
• RQ1: Do different sentiment analysis tools agree with emotions of software developers?
• RQ2: Do sentiment analysis tools agree with each other?
Our replication:
• RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
• RQ2: Do SE-specific sentiment analysis tools agree with each other?
Tools
Off-the-shelf (original study):
• NLTK
• Stanford NLP
• Alchemy API
• SentiStrength
SE-specific (our replication):
• Senti4SD (Calefato et al., EMSE 2017)
• SentiCR (Ahmed et al., ASE ’17)
• SentiStrength-SE (Islam and Zibran, MSR ’17)
• SentiStrength (baseline)
7. Our replication
Gold standard datasets
392 comments
(Murgia et al., MSR’14)
5869 comments
(Murgia et al., MSR’16)
4423 questions, answers, and comments
(Calefato et al., EMSE 2017)
Model-driven annotation
8. Model-driven annotation of emotions
Mapping emotions to polarity (Shaver et al., 1987):

Emotion | Original study | Our replication
Love | Positive | Positive
Joy | Positive | Positive
Surprise | Positive | Ambiguous
Anger | Negative | Negative
Sadness | Negative | Negative
Fear | Negative | Negative
No emotion | Neutral | Neutral

Example: “I'm happy with the approach and the code looks good”
→ Joy (Shaver's Joy category includes Happiness and Satisfaction) → positive polarity
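The mapping is small enough to express directly. A minimal sketch, with emotion names taken from Shaver et al.'s top-level categories as listed on the slide:

```python
# Polarity mapping for Shaver et al.'s basic emotions, as used in the
# replication: Surprise is treated as ambiguous rather than positive.
EMOTION_POLARITY = {
    "love": "positive",
    "joy": "positive",
    "surprise": "ambiguous",
    "anger": "negative",
    "sadness": "negative",
    "fear": "negative",
    "no emotion": "neutral",
}

def to_polarity(emotion):
    """Map an emotion label (case-insensitive) to a polarity class."""
    return EMOTION_POLARITY[emotion.lower()]

# The example sentence above was annotated with Joy, hence positive polarity.
print(to_polarity("Joy"))  # positive
```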
9. Our replication
Gold standard datasets
392 comments
(Murgia et al., MSR’14)
5869 comments
(Murgia et al., MSR’16)
4423 questions, answers, and comments
(Calefato et al., EMSE 2017)
Model-driven annotation
1500 sentences from Q&A about Java libraries (Lin et al., ICSE’18)
1600 comments from
code review (Ahmed et al., ASE’17)
Ad-hoc annotation
10. Model-driven vs. ad-hoc annotation

 | Model-driven | Ad-hoc
Theoretical models | Yes | No
Training of raters | Yes | No
Guidelines for annotation | Based on taxonomy | Based on subjective perception
11. Our replication
Research questions
Original study:
• RQ1: Do different sentiment analysis tools agree with emotions of software developers?
• RQ2: Do sentiment analysis tools agree with each other?
Our replication:
• RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
• RQ2: Do SE-specific sentiment analysis tools agree with each other?
• RQ3: To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
12. Metrics
Original study:
• Weighted Cohen’s kappa (Cohen, 1968)
Our replication:
• Weighted Cohen’s kappa (Cohen, 1968)
• Text categorization metrics (Sebastiani, 2002): precision, recall, F-measure
J. Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 4, 213-220.
F. Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34, 1, 1-47.
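The text categorization metrics can be computed per polarity class as follows. A minimal sketch; the gold and predicted label sequences are invented for illustration:

```python
def per_class_prf(gold, predicted, label):
    """Precision, recall, and F-measure for one polarity class
    (Sebastiani, 2002), computed from paired label sequences."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "negative"]
print(per_class_prf(gold, pred, "positive"))  # (1.0, 0.5, 0.666...)
```

Reporting these per class, rather than plain accuracy, matters here because the neutral class dominates developer communication datasets.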
13. Weighted Cohen’s kappa
Disagreement weights: strong (2) vs. mild (1)

 | negative | neutral | positive
negative | 0 | 1 | 2
neutral | 1 | 0 | 1
positive | 2 | 1 | 0
Interpretation (Viera and Garrett, 2005)
• less than chance if κ ≤ 0
• slight if 0.01 ≤ κ ≤ 0.20
• fair if 0.21 ≤ κ ≤ 0.40
• moderate if 0.41 ≤ κ ≤ 0.60
• substantial if 0.61 ≤ κ ≤ 0.80
• almost perfect if 0.81 ≤ κ ≤ 1
A.J. Viera, J.M. Garrett. 2005. Understanding interobserver agreement: the kappa statistic. Family Medicine, 37,5, 360–363
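Weighted kappa with the disagreement weights above can be sketched as follows (a minimal implementation, not the authors' code):

```python
# Weighted Cohen's kappa (Cohen, 1968) with the slide's weights:
# 2 for negative<->positive, 1 for disagreements involving neutral,
# 0 for agreement. Labels: 0 = negative, 1 = neutral, 2 = positive.
WEIGHTS = [[0, 1, 2],
           [1, 0, 1],
           [2, 1, 0]]

def weighted_kappa(rater_a, rater_b, n_classes=3):
    n = len(rater_a)
    # Observed joint distribution of the two raters' labels
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_classes)]  # rater A marginals
    col = [sum(observed[i][j] for i in range(n_classes))
           for j in range(n_classes)]                   # rater B marginals
    w_obs = sum(WEIGHTS[i][j] * observed[i][j]
                for i in range(n_classes) for j in range(n_classes))
    # Expected weighted disagreement if the raters labeled independently
    w_exp = sum(WEIGHTS[i][j] * row[i] * col[j]
                for i in range(n_classes) for j in range(n_classes))
    return 1.0 - w_obs / w_exp

print(weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1]))  # 1.0 (perfect agreement)
```

With these weights, a tool that confuses negative with positive is penalized twice as much as one that merely drifts into neutral, which matches the strong-vs-mild distinction on the slide.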
15. RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
Original study (off-the-shelf tools): fair agreement
Our replication (SE-specific tools vs. manual annotation): substantial agreement
16. RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
Our replication: SE-specific tools vs. manual annotation
Note: opportunistic sampling using SentiStrength (Calefato et al., EMSE 2017)
17. Our replication: SE-specific tools vs. manual annotation
RQ1: Do SE-specific sentiment analysis tools agree with emotions of software developers?
• SE-specific optimization improves the classification accuracy
• Retraining supervised tools produces better performance
• Comparable performance for SentiStrength-SE (lexicon-based)
19. RQ2: Do SE-specific sentiment analysis tools agree with each other?
Original study (off-the-shelf tools): from less than chance to fair agreement
Our replication (SE-specific tools): from substantial to perfect agreement
20. RQ3: To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
(Results figure: model-driven annotation vs. ad-hoc annotation)
21. RQ3: To what extent does the labeling approach impact the performance of SE-specific sentiment analysis tools?
Model-driven annotation:
• From substantial to perfect agreement, also between supervised and lexicon-based tools
Ad-hoc annotation:
• From fair to moderate agreement
• Better agreement for supervised approaches
24. Error analysis
Polar facts but neutral sentiment: sentences that report a problem without conveying the writer’s emotion, and are therefore labeled neutral.
‘I tried the following and it returns nothing’
---
‘This creates an unnecessary garbage list. Sets.newHashSet should accept an Iterable.’
28. Lessons learned
• Reliable sentiment analysis in software engineering is possible
• Tuning of tools for software engineering improves classification accuracy
• SE-specific tools agree with manual annotation
• SE-specific tools agree with each other
• Grounding research on theoretical models of affect is recommended
• The choice of tool depends on the research goals: polarity vs. fine-grained emotions, emotions vs. attitudes, etc.
• A preliminary sanity check is always recommended