Supported by artificial intelligence, intelligent recommendation systems are a boon for e-commerce and other business activities. This paper examines how clustering, data classification, and collaborative filtering as well as a number of algorithms are harnessed to make recommendation systems powerful tools.
Business Model Canvas (BMC)- A new venture concept
How to Develop Online Recommendation Systems that Deliver Superior Business Performance
1. • Cognizant 20-20 Insights
How to Develop Online Recommendation
Systems that Deliver Superior Business
Performance
Executive Summary combating this issue is what is known as a recom-
mendation system.
Over the past two decades, the Internet has
emerged as the mainstream medium for online Many major e-commerce Websites are already
shopping, social networking, e-mail and more. using recommendation systems to provide
Corporations also view the Web as a potential relevant suggestions to their customers. The
business accelerator. They see the huge volume of recommendations could be based on various
transactional and interaction data generated by parameters, such as items popular on the
the Internet as R&D that informs the creation of company’s Website; user characteristics such as
new and more competitive services and products. geographical location or other demographic infor-
mation; or past buying behavior of top customers.
Several “e-movement” crusaders have discovered
that customers spend significant amounts of This white paper presents an overview of how
time researching products they seek before we are helping a Fortune 500 organization to
purchasing. In a bid to assist customers in these implement a recommendation system. Moreover,
efforts, and conserve precious time, these orga- this paper also sheds light on key challenges that
nizations offer users suggestions of products may be encountered during implementation of a
they may be interested in. This serves the dual recommendation system built on the open source
purpose of not just attracting browsers but Apache Mahout,1 a large-data library of statistical
converting them into buyers. For instance, an and analytical algorithms.
online bookstore may know that a customer has
interest in mobile technology based on previous Recommendation Systems
site visits and suggest relevant titles to purchase. Recommendation systems can be considered
An uninitiated user may be impressed by such as a valuable extension of traditional informa-
suggestions. Suggestions (or “recommenda- tion systems used in industries such as travel
tions” as they are popularly known) predict likes and hospitality. However, recommendation
and dislikes of users. To offer meaningful rec- systems have mathematical roots and are more
ommendations to site visitors, these companies akin to artificial intelligence (AI) than any other
need to store huge amounts of data pertaining IT discipline. A recommendation system learns
to different user profiles and their correspond- from a customer’s behavior and recommends a
ing interests. This eventually culminates in infor- product in which users may be interested. At the
mation overload, or difficulty in understanding heart of recommendation systems are machine-
and making informed decisions. One solution to
cognizant 20-20 insights | january 2012
2. learning constructs. Leading e-commerce players Classification is a technique used to decide
use recommendation engines that sift users’ past whether new input or a search term matches
purchase histories to recommend products such a previously observed pattern. It is also used
as magazine articles, books, goods, etc. Here is to detect suspicious network activity. Yahoo!
how major e-commerce companies use recom- Mail8 uses classification to decide if an incoming
mendation engines to improve their sales and message is spam. Image sharing sites like Picasa9
their customers’ shopping experience. use classification techniques to determine
whether photos contain human faces. They
• Amazon.com: Depending on past purchases
then offer recommendations of people that are
and user activity, the site recommends prod-
identified in the user contacts list.10
ucts of user interest.2
• Netflix: Recommends DVDs in which a user A Robust System to Counter
may be interested by category like drama, Information Overload
comedy, action, etc. Netflix went so far as to We are working with a leading multinational man-
offer a $1 million3 prize to researchers who ufacturing company that has numerous product
could improve its recommendation engine. research labs with many scientists and research-
• eBay: Collects user feedback about its prod- ers in numerous countries working on different
ucts which is then used to recommend prod- technologies. To help facilitate scientific research,
ucts to users who have exhibited similar be- and to buy the latest technology information, this
haviors.4 client partnered with information providers such
as Scopus, Knovel, etc. But despite these data
Online companies that leverage recommendation sources, scientists and researchers were often
systems can increase sales by 8% to12%.5 unable to find the right information to improve
their research. Also, scientists across the globe
Companies that succeed with recommendation were unable to collaborate and share technical
engines are those that can quickly and efficiently information with each other. This situation is
turn vast amounts of data into actionable infor- typical of companies dealing with information
mation. overload.
Anatomy of a Recommendation Engine To increase the informational awareness of
The key component of a recommendation system scientists and other employees, the client wanted
is data. This data may be garnered by a variety to create a system to recommend resources like
of means such as customer ratings of products, patents, articles and journals from paid content
feedback/reviews from purchasers, etc. This data providers. A successful system needed to learn
will serve as the basis for recommendations to from user searches and be intelligent enough to
users. After data collection, recommendation recommend popular resources similar to the ones
systems use machine-learning algorithms to that a user is currently working from. The system
find similarities and affinities between products was also expected to provide scientists with
and users. Recommender logic programs are useful insights on information other scientists
then used to build suggestions for specific user across the globe are using. Finally, the system
profiles. This technique of filtering the input data was to serve as a platform to connect scientists
and giving recommendations to users is also working on similar technologies.
known as “collaborative filtering.”6
We helped to design and develop the system,
Along with collaborative filtering, recommenda- which was dubbed “intelligent recommendation
tion systems also use other machine-learning system” (IRS). Many of the problems the orga-
techniques such as clustering and classification nization faced originated from the multiple pref-
of data. Clustering is a technique which is used to erences and needs of users pertaining to their
bundle large amounts of data together into similar individual research topics. To make the system
categories. It is also used to see data patterns and adaptive to specific user requirements, the
render huge amounts of data simpler to manage. solution proposed was to use a recommendation
For instance, Google News7 creates clusters of system. As a first step towards the solution, the
similar news information when grouping diverse large information base possessed by the client
arrays of news articles. Many other search was categorized/grouped by specific criteria.
engines use clustering to group results for similar After much contemplation of data and size of the
search terms. user base, we decided to implement the system
using the Apache Mahout framework.
cognizant 20-20 insights 2
3. The Mahout framework is highly flexible and lets highly scalable. Mahout is considered a superior
developers customize outcomes according to way of building recommendation systems
their ad hoc requirements. We then developed because it implements all the three machine
a customized algorithm to recommend relevant learning techniques — collabora-
resources to scientists and researchers. tive filtering, clustering and clas- Mahout is considered
sification. Collaborative filtering a superior way
Building an Intelligent is the primary technique used
Recommendation System by Apache Mahout to provide
of building
The Solution recommendations. Given rating recommendation
data along with a set of users systems because
The most important purpose of an intelligent
and items, collaborative filtering
recommendation system (IRS) is to increase
generates recommendations in
it implements all
awareness among scientists about the areas in
which they are exploring, technologies on which
one of the following four ways. the three machine
their colleagues are working, and information learning techniques
• User-based: Recommenda-
about experts and their views on that particular tions are made based on us- — collaborative
discipline. Despite the possession of huge
amounts of data, there was very little insight on
ers with similar characteris- filtering, clustering
tics.
the information scientists were seeking. There and classification.
are fundamental differences designing a recom- • Item-based: Recommenda-
mendation system compared with traditional tions are based on similar items.
software design. The overall system architec- • Slope-one: A fast technique that offers rec-
ture depends heavily on the choice of algorithms ommendations based on previous user-rated
and the system architecture employed. By using items.
Apache Mahout, the team selected a conven-
tional open source framework that implements • Model-based: This approach compares the
profile of an active user to aggregate user
machine-learning algorithms.
clusters, rather than the concrete profiles.
Apache Mahout is a new Apache Software There are many algorithms that are used to
Foundation (ASF) project whose primary goal is calculate similarities between two entities. The
to create scalable machine-learning algorithms choice of algorithm plays a vital role in deciding
that are free to use under the Apache license. the quality of the recommendation that is most
The term “Mahout” is derived from the Hindi suited for a given scenario. Since the IRS for
word that means elephant driver. The Mahout this client needed to offer recommendations to
project started in 2008 as a subproject of scientists based on their multiple preferences,
Apache’s Lucene project, which is a popular we adopted the user-based collaborative filtering
search engine. Given the amount of data that the (see Figure 1).
client possessed, it was imperative that the IRS be
Recommendation System Execution Flow
Data Model
Neighborhoods
Intelligent Similarity
User Recommendation Alogrithms
System
Recommendations
Evaluator Recommender
Program Logic Program
Figure 1
cognizant 20-20 insights 3
4. Providing Recommendations daily to create the rating data for new resources.
Creating an Input Finding Similarity Between Users/Items
Within the process of generating recommenda- After creating a data model, the next step in
tions, a crucial step is for the IRS to generate building an IRS is finding similarities and patterns
rating and usage data. Such data could be applied between users and items. There are many
to which articles are popular and which ones had algorithms which can be used to find item simi-
the most read counts. To collect and consolidate larities: GenericUserSimilarity, TanimotoCoef-
this information, the IRS we ficientSimilarity, PearsonCorrelationSimilarity,
Our IRS used used applied the Boolean SpearsonCorrelationSimilarity and EuclideanDis-
data model. This technique
a customized tracks which article/product tanceSimilarity, to name a few. By far the most
common algorithm is TanimotoCoefficientSimi-
recommender logic was visited by each particular larity as it takes all types of input data, such as
program which user. This strategy is widely binary or numbers. TanimatoCoefficient provides
used in “social bookmarking”
analyses and matches sites. In the IRS developed similarity based on the ratings of each item. In the
IRS we were building, due to the lack of rating data
the ratings of one for this client, articles were for all items, our team decided to customize the
user with similar rated between a range of 1 to similarity algorithm based on user demographic
2, depending on the number
user ratings and of times an article was visited properties. The demographic properties such as
country, division, job title, department, etc. were
recommends articles/ or read. A decimal number in used to establish similarity among users. The
items which match the range closer to 1 signifies output of this algorithm is pivotal as it is the base
fewer hits while numbers
the user interest. closer to 2 signify higher on top of which recommendations are built.
usage. This rated data is then Generating Recommendations
grouped into one of the following data models. In Once similarity is established, special recom-
this context, data model refers to the abstraction mender logic programs are used to recommend
layer which encapsulates product data and cor- items to users. This is very similar to user prefer-
responding user ratings. ences. Our IRS used a customized recommender
logic program which analyses and matches the
The following are the data models that can be
ratings of one user with similar user ratings and
used to create an IRS depending on input data.
recommends articles/items which match the user
interest. The output of this recommender logic
• In-memory data model: Data is prepared
using programs which make objects of the data program are the actual recommendations for
with product properties like product id, userid the user. In the case of multiple domains such as
and ratings. fields of science (ceramics, nanotechnology, etc.),
recommendations might not be accurate since
• File-based data: Data is provided in a comma-
user interests and expertise vary significantly.
separated file along with a preference file.
For such scenarios, recommendations should
Implementation of a data refresh module will
be evaluated with user domain or user demo-
be required to reload data into the comma-
graphics. Programs which aid in cross-verifica-
separated file.
tion of the results of recommendation engines
• Database-based data: Relational databases are known as “evaluators.” Our IRS validates
will be a good choice if the data is in the order recommended items with different relevance
of gigabytes. However, the implication is that parameters such as scientist domain, technology
a recommendation engine based on this model and scientist interest. The output of this program
will be much slower than in-memory data rep- proved beneficial in offering concrete recommen-
resentations. dations to scientists and researchers and, hence,
conserved time.
For our client, an IRS was created using a file-based
data model. This approach delivers superior per- Key Challenges
formance and execution time compared with The problems faced during development were
a database-based model. Also, to sustain the compounded by the complexity of the algorithms
freshness of the file and thereby return newer used, as well as the common problems of
articles and resources with reduced latency, a software design. Apart from these developments,
scheduled refresh module was used which runs issues unique to AI applications, such as the need
cognizant 20-20 insights 4
5. for good quality input data, also arose. Common The next challenge is providing recommenda-
algorithms such as Tanimoto and Pearson tions to a user who is not part of a clear group;
similarity have several advantages, such as the these users can be called “grey sheep.” The
ability to detect the quality and features of an challenge of giving recommendations to new or
item at the time of recommendation, especially unknown users is called the
when a user rating on a scale of 1 to 5 is available. “grey sheep problem.” This Data clustering requires
Another advantage is that these algorithms are problem mostly arises in
useful in domains where analyzing data is difficult Internet applications with
multiple machines to
or costly, such as gathering data about science large and diverse customer handle massive data
articles where domain knowledge is required to bases that access the IRS sets and stores all
rate an article. application. A solution to this
problem is to look for similar
resources that are
The first challenge in the development of our users based on similar demo- required to provide
IRS is that the system was brand new and lacked graphic information or char- recommendations.
user rating information. Huge amount of input acteristics. These users can
data, but few users and no ratings is referred be technically called “neighbors.” To determine
to as “Cold Start Problem.” A solution to this similarity, a “classification” machine-learning
problem was to use the TanimotoCoefficient11 mechanism can be used. Similar neighbors can
algorithm and binary dataset to feed the system be found by implementing different similarity
with the much-sought-after data during the algorithms or by checking similar user properties.
initial use of the system. In the course of time, The probability of this problem is significantly
user rating data will be obtained after which the lower in intranet applications. The classification
similarity algorithm and input data structure can algorithm trains the computer to classify similar
be changed. This initial dataset can be prepared users in a particular group and teaches the IRS
by collecting data such as maximum hits of a to find the specific group when a user is new to
particular product or tracking visits of articles. the system.
If a binary dataset is adopted then most suitable
algorithms are TanimotoCoefficientSimilarity or Mahout currently supports two types of classifier
LogLikelihoodSimilarity12 algorithms, since these approaches: Naive Bayes classifier and Comple-
are specially defined for binary data and use the mentary Naive Bayes classifier. Naive’s classifier
Jaccard13 coefficient formula. As the usage of the is the faster method for the problems such as
recommendation system grows, huge amounts of grey sheep recommendations. This classifier
data will be collected. It is advisable at this point includes two parts. First, it analyzes the existing
to use clustering mechanisms to sustain per- documents or data and finds the common
formance. A solution to deal with massive data features among them. This method is also called
by means of data clustering is Hadoop.14 Data as preparing training examples. Secondly, clas-
clustering requires multiple machines to handle sification is performed using the model that
massive data sets and stores all resources that was created in the first step using the Bayes
are required to provide recommendations. theorem15 to predict the category of a new item
(see Figure 2).
Diagram of Classification Technique
Input Training Training
Predictor Algorithms
Documents/Items Examples Variable e.g. Bayes
Based on Demographic Use User’s Property
Properties of User to Classify
New Predicted
Items/ Predictor Model Estimated Category/
Documents Variable Variable Decision
Similar Users are
Detect Criteria to Classi ed in Predicted Property
Classify the Users Speci c Group on which User Can
Be Classi ed
Figure 2
cognizant 20-20 insights 5
6. Finally, sometimes providing recommendations to can be an important means of bolstering sales
users only on product ratings will not be accurate; and increasing information awareness among
problems may arise when the recommended customers.
items originate from another domain to which
the user is not related. This problem is known as In this paper, we examined how recommendation
a “blind recommendation” problem. This problem systems can be applied to the scientific research
is mostly prevalent in the post-production phase domain. The successful implementation of an
of recommendation systems development. A IRS provided our client with flexibility in using
solution to this problem is to use evaluators that pre-existing information resources. It solved the
continuously verify and validate recommenda- problem of information overload by enabling
tions with actual user details. scientists and researchers to use information
resources more efficiently for their research.
Conclusion Our IRS offers testimony that a recommendation
Online businesses are working very hard to system that provides customizable recommen-
increase their customer bases by providing dations can enable online companies to perform
competitive prices, products and services. In business more effectively.
this frantic effort, recommendation systems
Footnotes
1
http://mahout.apache.org/
2
https://cwiki.apache.org/MAHOUT/mahout-on-elastic-mapreduce.html as of August 2011
3
http://www.netflixprize.com/
4
http://developer.ebay.com/DevZone/xml/docs/Reference/ebay/AddSellingManagerTemplate.html
as of August 2011
5
http://www.practicalecommerce.com/articles/1942-10-Questions-on-Product-Recommendations
as of August 2011
6
http://en.wikipedia.org/wiki/Collaborative_filtering
7
http://www2007.org/papers/paper570.pdf
8
http://www.manning.com/owen/Mahout_MEAP_CH01.pdf as of August 2011
9
http://blogoscoped.com/archive/2007-03-12-n67.html
10
http://news.cnet.com/8301-27076_3-10358662-248.html
11
http://en.wikipedia.org/wiki/Jaccard_index
12
http://en.wikipedia.org/wiki/Likelihood-ratio_test
13
http://en.wikipedia.org/wiki/Jaccard_index
14
http://hadoop.apache.org/
15
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
References:
Mahout: http://www.ibm.com/developerworks/opensource/library/j-mahout/index.html
http://ilk.uvt.nl/~toine/phd-thesis/phd-thesis.pdf
GroupLens Research Project: http://www.grouplens.org/papers/pdf/ec-99.pdf
http://en.wikipedia.org/wiki/Recommender_system
cognizant 20-20 insights 6