Jonathas Magalhães, Rubens Pessoa, Cleyton Souza, Evandro Costa, Joseana Fechine

A Recommender System for Predicting User Engagement in Twitter

The RecSys Challenge is a traditional competition among Recommender Systems (RS) researchers. The 2014 edition focuses on predicting the amount of interaction achieved by tweets related to movies. In this paper, we present our approach to the 2014 RecSys Challenge. The approach consists of three steps: (i) binary classification methods split the tweets into two lists, those with user engagement equal to zero and those with user engagement different from zero; (ii) each list is sorted using regression methods; and (iii) the two lists are concatenated and the tweets are sorted. To validate our approach, we tested 126 configurations and verified that the configuration using the MovieTweetings dataset, the Naïve Bayes classifier, and Linear Regression obtained the best result: nDCG@10 = 0.9037242.
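The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Weka-based implementation: the `rank_tweets` function, the `is_engaged` and `predict_engagement` callables, and the toy `followers_count` rules are hypothetical stand-ins for a trained classifier and regressor. Placing the predicted non-zero list ahead of the predicted zero list is also an assumption, since the abstract does not state the concatenation order.

```python
from typing import Callable, Dict, List

Tweet = Dict[str, float]  # hypothetical feature dictionary for one tweet


def rank_tweets(
    tweets: List[Tweet],
    is_engaged: Callable[[Tweet], bool],
    predict_engagement: Callable[[Tweet], float],
) -> List[Tweet]:
    """Classify, sort each class by a regression score, then concatenate."""
    # Step (i): split by the classifier's zero / non-zero engagement prediction.
    engaged = [t for t in tweets if is_engaged(t)]
    not_engaged = [t for t in tweets if not is_engaged(t)]

    # Step (ii): sort each list by predicted engagement, highest first.
    engaged.sort(key=predict_engagement, reverse=True)
    not_engaged.sort(key=predict_engagement, reverse=True)

    # Step (iii): concatenate (assumption: predicted non-zero tweets go first).
    return engaged + not_engaged


# Toy stand-ins for the trained models (hypothetical rules, not learned).
tweets = [
    {"id": 1, "followers_count": 10},
    {"id": 2, "followers_count": 500},
    {"id": 3, "followers_count": 2000},
    {"id": 4, "followers_count": 50},
]
ranked = rank_tweets(
    tweets,
    is_engaged=lambda t: t["followers_count"] >= 500,
    predict_engagement=lambda t: t["followers_count"] / 100.0,
)
ranked_ids = [t["id"] for t in ranked]
```

Keeping the classifier and the regressor as separate callables mirrors the experimental design, in which different classifiers and regressors are paired to form the tested configurations.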

RECSYS CHALLENGE 2014

Federal University of Campina Grande
Federal University of Alagoas
Intelligent, Personalized and Social Technologies Group
Corresponding author: Jonathas Magalhães

INTRODUCTION

The 2014 RecSys Challenge [1] consists of ordering tweets shared by users on IMDb according to the amount of interaction that they received. The interaction of a tweet is defined as the sum of the number of retweets and favorites that it received. Our objective is to present a contestant approach to the 2014 RecSys Challenge.

COMPOSING AND PRE-PROCESSING THE DATASET

We use two datasets:
● The expanded MovieTweetings dataset [2] distributed by the organizers of the challenge, with the following attributes: movie id, movie rating, crawled time, tweet time, followers count, statuses count, favourites count and engagement.
● The IMDb dataset, which consists of additional information about the movies referenced by the tweets and complements the MovieTweetings dataset, with the following attributes: IMDb rating, IMDb votes count and movie year.

OVERVIEW OF THE RECOMMENDER SYSTEM

Our approach is divided into three steps:
● Classification;
● Regression;
● Ordering Results.
In the classification and regression steps we use the Weka API to train the models.

Figure 1: Overview of the Recommender System.

CLASSIFICATION STEP

We use three classifiers: Naïve Bayes, Support Vector Machines (SVM) and the nearest-neighbor algorithm IBk.

Table 1: Classification models and their parameters.

We also implement a classifier that combines them using Voting. In other words, an instance is assigned to a given class if that class obtains the required majority among the models presented.

REGRESSION STEP

In this work we use three different regressors: Linear Regression, Pace Regression and the model-tree induction algorithm M5Base, which is an extension of Quinlan's algorithm to the regression task.

Table 2: Regression models and their parameters.

Besides the models presented in Table 2, we implemented three methods to combine them: Average, Median and Ranking.

METHODOLOGY

Table 3 summarizes the factors and the levels used in each one. Considering the factors and levels used, we have an experimental design with 2 * 7 * 9 = 126 treatments without replication. We use the normalized Discounted Cumulative Gain (nDCG) metric to compare the methods.

Table 3: Experimental factors and their levels.

RESULTS

Table 4 presents the nDCG@10 results of the ten best configurations of our approach.

Table 4: The nDCG@10 of the 10 best configurations.

REFERENCES

[1] A. Said, S. Dooms, B. Loni, and D. Tikk. Recommender systems challenge 2014. In Proceedings of the Eighth ACM Conference on Recommender Systems, RecSys '14, New York, NY, USA, 2014. ACM.
[2] S. Dooms, T. De Pessemier, and L. Martens. MovieTweetings: a movie rating dataset collected from Twitter. In Workshop on Crowdsourcing and Human Computation for Recommender Systems, CrowdRec at RecSys 2013, 2013.
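As a rough illustration of the nDCG@10 metric used to compare the configurations, the sketch below computes nDCG@k for one ranked list of true engagement values. It assumes one common formulation (linear gain with a log2 position discount); the exact variant used by the challenge evaluation is not stated here, and the function names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(engagements: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain of the first k items (linear gain, log2 discount)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(engagements[:k], start=1)
    )


def ndcg_at_k(ranked_engagements: Sequence[float], k: int = 10) -> float:
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(ranked_engagements, reverse=True), k)
    if ideal == 0:
        # Assumption: a list with no engagement at all scores 0.0;
        # other conventions exist for this edge case.
        return 0.0
    return dcg_at_k(ranked_engagements, k) / ideal


# True engagement of each tweet, listed in the order the system ranked them.
perfect = ndcg_at_k([3, 2, 1])   # already in ideal order
swapped = ndcg_at_k([1, 3])      # most engaging tweet ranked second
```

A ranking that already lists tweets by decreasing true engagement scores 1.0, and any misordering within the top k lowers the score.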