Is search always the right solution? There are many things you can do with a hammer, but it’s not so great if you need to turn a screw.
Text Classification is an alternative to search that may be more appropriate for social media data analysis. Text classification is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world. Using text classification as the foundation for analysis – i.e., teaching a machine to categorize posts the way humans do – can dramatically improve your ability to gather the right data and, ultimately, increase the chances that you’ll uncover what you need to know.
Search vs. Text Classification
White Paper
Increasing the signal, decreasing the noise
1 West Street New York NY 10004 | 646-545-3900 | info@networkedinsights.com | networkedinsights.com
Since the advent of the World Wide Web, businesses and consumers have used
a variety of ways to find information. These various methods of discovery
have trained us to think and behave in ways that make understanding
analytics challenging. In fact, what makes retrieving information easy for
individuals is not the manner in which we should examine social data.
Confused?

In the infancy of the commercial public Web, navigation was nearly
impossible without directories and then information portals. With the
explosion of the Web in the late 1990s, keyword searching and using search
engines became as ubiquitous as the Internet itself. While the underlying
methods of search have evolved over the years, its primary use has stayed
constant since the early days of companies like Yahoo!, Altavista, Lycos,
Excite and Google. Because it is so popular and so widely understood,
search is often the first tool applied to a wide variety of data
challenges.

But is search always the right solution? There are many things you can do
with a hammer, but it’s not so great if you need to turn a screw.

To learn what customers think about your products and services, you may
need to apply sentiment analysis across millions of social media posts.
Or, to guide your media buying, you might use topic discovery to uncover
market trends in the social conversation.

In either case, using search to identify the set of posts you’ll submit to
scrutiny could send your social media analysis down the wrong path from
the start. Your approach to conducting sentiment analysis or topic
discovery could be spot on. But if it’s based on a number of posts that
aren’t actually about what you think they are, which typically happens
with search, the resulting noise can undermine the inferences and
conclusions you ultimately draw.

Text classification is an alternative to search that may be more
appropriate for social media data analysis. Text classification is the
task of assigning predefined categories to free-text documents. It can
provide conceptual views of document collections and has important
applications in the real world. Using text classification as the
foundation for analysis (i.e., teaching a machine to categorize posts the
way humans do) can dramatically improve your ability to gather the right
data and, ultimately, increase the chances that you’ll uncover what you
need to know.

Topic discovery: letting data speak for itself

Topic discovery is a valuable type of semantic analysis based on text
classification. Whereas sentiment analysis simply reveals people’s likes
and dislikes, semantic analysis refers to a group of methods that allow
machines to discover the fundamental patterns of words or phrases that act
as building blocks in a large set of text. Topics, themes, sentiment and
similar elements of meaning appear as intricate weavings of those
fundamental patterns. So semantic analysis is the summarization of large
amounts of text by automatically discovering the topics and themes within.

By grouping social media posts based on semantic similarity, rather than
preset sentiment categories such as positive, negative and neutral, topic
discovery can help companies uncover important information, for example:
what exactly people are saying about a product or service; where and how
they use it; the features they use most; and the enhancements or new
offerings they’re interested in. All of this information can ultimately
drive product development, new revenue streams and strategies for
marketing, advertising and media planning.
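
To make "assigning predefined categories to free-text documents" concrete,
here is a minimal sketch of a text classifier. It uses a small naive Bayes
model, chosen only because it fits in a few lines; the white paper does not
say which classification method Networked Insights uses. The class name,
the three categories and the training posts are all made up for
illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


class NaiveBayesClassifier:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of posts
        self.vocabulary = set()

    def train(self, labeled_posts):
        """Learn word frequencies from (text, category) pairs labeled by a human."""
        for text, category in labeled_posts:
            self.doc_counts[category] += 1
            for word in tokenize(text):
                self.word_counts[category][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        """Return the predefined category with the highest log-probability."""
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # prior for this category
            total_words = sum(self.word_counts[category].values())
            for word in tokenize(text):
                count = self.word_counts[category][word]
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical hand-labeled posts; real training sets would be far larger.
training_posts = [
    ("the baby kicked all night and my due date is next week", "pregnancy and newborns"),
    ("our newborn finally slept through the night", "pregnancy and newborns"),
    ("found a great coupon for diapers at the store sale", "shopping and money"),
    ("trying to stick to a grocery budget this month", "shopping and money"),
    ("my son has a fever and a bad cough", "illness and injury"),
    ("she sprained her ankle at practice", "illness and injury"),
]

classifier = NaiveBayesClassifier()
classifier.train(training_posts)

print(classifier.classify("coupon sale at the grocery store"))
print(classifier.classify("my daughter has a cough and a fever"))
```

The point of the sketch is the workflow rather than the model: people label
example posts, the machine learns from those labels, and every new post is
then assigned to one of the predefined categories instead of being matched
against a keyword query.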
The impact of bad data
A look at several related but distinct topics illustrates how seriously the
problems of search can impact analysis.
A Networked Insights analyst designed search queries for five topics that
moms typically discuss – pregnancy and newborns; school-aged children;
food, nutrition and health; shopping and money; and illness and injury.
Searches were run on the five topics, then another analyst reviewed
the results under two test scenarios to determine how well the search
delivered posts fitting the intended criteria as defined by the query.
In the first test, the analyst reviewed only the top 20 results returned
by each search as ordered by the search engine. In the second test, the
analyst reviewed a random sample of 200 results returned by the search.
In each case, the analyst was asked to judge whether each resulting post
was appropriate for the intended category or if it fit better in a different
one. The percent of appropriate posts is a measure of the “precision” of
the search.
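
The precision measure described above is simply the share of reviewed posts
that the analyst judged appropriate for the intended topic. A minimal
sketch follows; the function name and the example judgments are
hypothetical and are not taken from the study.

```python
def precision(judgments):
    """Fraction of reviewed posts judged appropriate for the intended topic.

    `judgments` holds one boolean per reviewed post: True if the analyst
    considered the post a fit for the category, False otherwise.
    """
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(judgments) / len(judgments)


# Hypothetical review of eight posts returned by a search.
reviewed = [True, True, False, True, False, False, True, False]
print(f"precision = {precision(reviewed):.1%}")
```

With four of the eight hypothetical posts judged appropriate, the script
reports a precision of 50.0%.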
The test results (Table 1) reveal search’s severe limitations. Precision
was high when only the top 20 results were examined (90 percent or
higher), but fell precipitously when a larger number of randomly sampled
posts was examined. In only one search, pregnancy and newborns, did the
results yield a somewhat reliable level of precision (86.5 percent). In
three of the five searches, precision rates were under 50 percent.

Significant problems arise with search when you’re after a broad
collection of similar posts, not a handful of the best ones.

In practical terms, these results mean there’s a greater chance that a
randomly selected search result will not meet the intended criteria than
that it will. Said another way, search might be used to support other
analyses by returning a large number of posts assumed to cover the same
basic topic. The problem: the majority of the data isn’t relevant to the
topic you want to understand.
Table 1. Keyword Search Precision

Desired Topic              Top 20 Results Only    Random Sample
Pregnancy and newborns     95%                    86.5%
School-aged children       95%                    19.5%
Food, nutrition, health    90%                    39.5%
Shopping and money         100%                   57.5%
Illness and injury         100%                   41%
Overall                    96%                    48.8%
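
The figures in Table 1 can be checked with a few lines of arithmetic. The
sketch below copies the percentages from the table and assumes the Overall
row is a simple unweighted mean of the five topic rows; the paper does not
state how it was computed, but the numbers are consistent with that
reading. The post counts are likewise inferred from the stated random
sample of 200 results and are not reported in the paper.

```python
from statistics import mean

SAMPLE_SIZE = 200  # random-sample size stated in the text

# (top-20 precision %, random-sample precision %) copied from Table 1
table_1 = {
    "Pregnancy and newborns": (95.0, 86.5),
    "School-aged children": (95.0, 19.5),
    "Food, nutrition, health": (90.0, 39.5),
    "Shopping and money": (100.0, 57.5),
    "Illness and injury": (100.0, 41.0),
}

# Assumption: "Overall" is the unweighted mean across the five topics.
overall_top20 = mean([top for top, _ in table_1.values()])
overall_random = mean([rnd for _, rnd in table_1.values()])

# Number of appropriate posts implied by each random-sample percentage.
implied_counts = {
    topic: round(rnd * SAMPLE_SIZE / 100) for topic, (_, rnd) in table_1.items()
}

# Topics where a randomly chosen result is more likely wrong than right.
below_half = [topic for topic, (_, rnd) in table_1.items() if rnd < 50]

print(f"Overall, top 20:        {overall_top20:.1f}%")
print(f"Overall, random sample: {overall_random:.1f}%")
for topic, count in implied_counts.items():
    print(f"{topic}: about {count} of {SAMPLE_SIZE} posts on topic")
print(f"Topics under 50% precision: {len(below_half)}")
```

Under these assumptions the computed averages reproduce the Overall row (96
percent and 48.8 percent), and three of the five topics fall below 50
percent precision in the random sample, matching the text.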