Empirical Sentiment Accuracy Bounds
- 2. Visible’s Sentiment Approach
Visible was one of the first Social Media Monitoring solutions in the market.
Algorithms
• State of the art
• Beyond overhyped NLP
Features
• Deep experience
• Social NLP & Context
Data
• Massive proprietary data
A sentiment model based on years of labeling social data for enterprises: 10^7+ labels, 10^5+ topics, 10^2+ enterprises.
Copyright © 2011 Visible. All rights reserved.
- 3. Visible’s Sentiment Approach
We have 10s of millions of human annotated social media posts.
- 4. Visible’s Sentiment Approach
Basically all breakthroughs in the last two decades have come from better data.
- 5. Sentiment, The Accuracy Disconnect
There is a disconnect between the hype and the experience in the marketplace.
• Claims: “We have 97% Accuracy”
• Experience: “The best vendor tested had 50% accuracy at the post level”
• Experience: Sentiment accuracy is the most dissatisfying feature according to Forrester research; only 45% are satisfied with vendor sentiment accuracy
- 6. Key Findings
After spending several years of research with the best available data, here are some of the key findings.
1. Solve relevance first, sentiment second.
2. Accuracy is the wrong measure to optimize.
3. Sentiment is more subjective than you think it is.
- 7. Key Findings
We won’t have time to cover the first two. The third could be an alternate title for this talk.
- 8. Audit Findings, Large Financial Institution
A typical study.
Double Blind, Multi-Reviewer Study:
1. Same posts labeled by both human labeling practice and automation.
2. At least two auditors grade each label. Blind to label source.
No statistically significant difference between human labeled and AI labeled sentiment.
Reviewers can’t tell the difference between Visible’s statistical models and human annotators.
- 9. Audit Findings, Large Financial Institution
So is Sentiment “solved”?
But…
Auditors agree with each other only 73% of the time [95% CI: 69%-77%].
No. Auditors think people and automation are both poor. And they don’t agree with each other.
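An agreement rate with a confidence interval, like the 73% [69%-77%] figure, can be computed along these lines. This is a minimal sketch, not the study's actual method: it assumes a simple normal-approximation interval, and the sample size of 475 is a hypothetical value (the deck does not state how many labels were audited). The function names are illustrative.

```python
import math

def agreement_rate(grades_a, grades_b):
    """Fraction of items on which two auditors gave the same grade."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two equal-length, non-empty grade lists")
    matches = sum(a == b for a, b in zip(grades_a, grades_b))
    return matches / len(grades_a)

def normal_ci(p, n, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 gives ~95%)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical sample size; the deck does not report n.
low, high = normal_ci(0.73, 475)
print(f"{low:.0%} - {high:.0%}")
```

Under this approximation, an interval of about four points either side of 73% corresponds to a few hundred doubly graded labels.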
- 10. Key Audit Findings, Large Financial Institution
Social Media Professionals Grading Human Annotations
Another way of looking at the same study.
• Both auditors agree with label only 58% of the time (proxy for “hard” graders)
• At least one auditor agrees with label 91% of the time (proxy for “easy” graders)
58% - 91% is a huge range.
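The two ends of that range are just two ways of counting the same audit records. A minimal sketch, assuming each label was graded by exactly two auditors and each grade is recorded as agree/disagree; the toy data and the function name are hypothetical, not taken from the study.

```python
def audit_summary(graded):
    """graded: one (auditor1_agrees, auditor2_agrees) pair of booleans per label.

    Returns (share where both auditors agree with the label,
             share where at least one auditor agrees with the label).
    """
    n = len(graded)
    both = sum(a and b for a, b in graded) / n
    at_least_one = sum(a or b for a, b in graded) / n
    return both, at_least_one

# Made-up toy audit of five labels, for illustration only.
toy = [(True, True), (True, False), (False, True), (True, True), (False, False)]
both, at_least_one = audit_summary(toy)
print(both, at_least_one)
```

The "both" figure behaves like a strict grader and the "at least one" figure like a lenient one, which is why the same labels can be reported at either end of the range.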
- 11. True Across a Wide Variety of Problems
This talk promised bounds and here they are.
Multi-Reviewer 3rd party audits across a variety of Brands consistently show relatively low agreement rates.
About 81% Inter-Annotator Agreement [IQR: 78% - 83%]
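A figure of the form "about 81% [IQR: 78% - 83%]" summarizes per-audit agreement rates by their median and interquartile range. A minimal sketch of that summary using Python's standard library; the list of rates below is made up for illustration and is not the audit data behind the slide.

```python
import statistics

def median_and_iqr(rates):
    """Return (median, (first quartile, third quartile)) of agreement rates."""
    q1, q2, q3 = statistics.quantiles(rates, n=4, method="inclusive")
    return q2, (q1, q3)

# Made-up per-brand agreement rates, for illustration only.
rates = [0.74, 0.78, 0.81, 0.83, 0.88]
median, (q1, q3) = median_and_iqr(rates)
print(f"median {median:.0%}, IQR {q1:.0%} - {q3:.0%}")
```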
- 12. True Across a Wide Variety of Problems
80% is also consistent with academic research.
- 13. Take Aways
1. Yes, your team
2. Evaluating sentiment takes care
3. Accuracy claims in the 90s are either exaggerated or naïve (over-fit)
4. It will take effort to get your team in tight agreement on sentiment definitions
5. Real breakthroughs in sentiment accuracy will come from personalization
We all think we’re better than average drivers. Similarly, although most of us have heard something like the 80% agreement statistic, we don’t think it applies to us. The main thing I want you to take away from this talk is that it does apply to you. People within your department, your team, sitting in the cube next to you, disagree with you on sentiment about 20% of the time.
- 14. Take Aways
The implications are also worth taking to heart. When people claim accuracies much higher than 80% they are either lying or they don’t know what they are doing (overfit to one dataset).
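One way to see why agreement acts as a ceiling: a model that reproduces one annotator perfectly, when graded by a second annotator who agrees with the first about 80% of the time, scores about 80%. The simulation below is a minimal sketch under that assumption. The three-label scheme, the disagreement model, and all names are hypothetical, not Visible's method.

```python
import random

LABELS = ["positive", "neutral", "negative"]

def simulate_ceiling(agreement=0.8, n_posts=20000, seed=0):
    """Measured accuracy of a model that mimics annotator A, graded by annotator B."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_posts):
        label_a = rng.choice(LABELS)  # annotator A's label
        if rng.random() < agreement:
            label_b = label_a         # B agrees with A
        else:
            # B disagrees: pick one of the other labels
            label_b = rng.choice([lab for lab in LABELS if lab != label_a])
        model_label = label_a         # model perfectly reproduces A
        hits += model_label == label_b
    return hits / n_posts

print(f"{simulate_ceiling():.1%}")
```

Even a model with nothing left to learn from annotator A lands near the agreement rate, so a much higher reported accuracy suggests the grader and the training labels were not independent.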
- 15. Take Aways
Similar to what has happened in Search, real breakthroughs will come through personalization. Deeper linguistics (dealing with sarcasm, humor, contextual knowledge) are interesting but can’t help break the 80% barrier.
If teams put the work into getting tight, consistent sentiment definitions (with >80% agreement), only then do algorithms have a chance to do that well.
- 16. @shawnrut
@Visible
VisibleTechnologies.com
Thank You!