Topik: Top-k Data Search in Geo-tagged Social Media
Saurabh Deochake
Rutgers University
Piscataway, NJ, USA
Email: saurabh.deochake@rutgers.edu
Saurabh Singh
Rutgers University
Piscataway, NJ, USA
Email: s.singh@rutgers.edu
Shreepad Patil
Rutgers University
Piscataway, NJ, USA
Email: shreepad.patil@rutgers.edu
Abstract— The increasing popularity of social media platforms
and location-based services has made it possible to query the
data generated by social media interactions in novel ways
beyond traditional search. With the advent of data science
techniques, data scientists can delve deep into the data
generated by social media and inspect it to find the different ways
users interact on social media and, in turn, the Internet. In
this project, we make use of data generated by social
media interactions, for example tweets, geo-tagged
Instagram photos, or Facebook posts, on a particular
topic from a specific region, and produce a top-k ranking of social
media data (users/tweets/posts). Given a location l in the form
(latitude, longitude), a distance d, and a set of keywords K,
the algorithm used in our module fetches the top-k users who
have posted media relevant to the desired keywords k ⊆ K
in proximity to location l within distance d.
We aggregate, analyze, and present the data to the end user.
Finally, we conduct an experimental study using real
social media data sets, such as Twitter, to evaluate the module
under development.
I. Project Description
Social media has an ever-growing effect on
our daily lives. Around 6,000 tweets are
posted on Twitter every second. In general, people use social media
to keep in touch with their friends and stay updated on
social events.
Nowadays, with the widespread use of GPS-enabled
mobile devices, we also obtain data about the locations
of users. Geo-tagged data plays an important role
in analyzing social media data in scenarios such as
friend recommendation and spatial decision making. For
example, Facebook activates its Safety Check feature
during a disaster, which helps users mark themselves as safe.
Many location-based services are also becoming popular
these days.
Traditional search looks for keywords in a document;
Twitter likewise offers search and surfaces
trending topics, but it lacks a geo-spatial component.
With geo-spatial search it becomes possible to provide
novel search services.
Our approach is to recommend Twitter users by
taking into account users' tweet contents, locations,
and tweet popularity, which indicates the users' social
influence. In particular, we study the search for top-k
local users in tweets that are associated with latitude/longitude
information, namely geo-tagged tweets.
To that end, we devise a scalable architecture for
hybrid index construction in order to handle high-volume
geo-tagged tweets, and we propose functions
to score tweets as well as methods to rank users that
integrate social relationships, keyword relevance, and
spatial proximity for processing such social searches.
Even though this is a vast topic of research and
development, a group of three members can build this
project in around two months. The major stumbling blocks
we expect as the project advances through development
and testing are: a) how to store the huge volume of
geo-spatial data generated by analyzing and processing
the social media data, and b) how to represent the
processed data graphically in a form that a computer
science novice can readily understand. As assumed above,
the project shall be completed within two months (before
1 May 2016). We plan to finish the requirement gathering
and project design phases by the end of March 2016, and
we will carry out infrastructure and backend development
from the first week of April 2016 until 25 April 2016.
The testing phase will run from 25 April 2016 to
1 May 2016, carrying out all the designed test cases.
The project has four stages: Requirement Gathering,
Design, Implementation, and User Interface.
A. Stage1 - The Requirement Gathering Stage.
This section describes in detail the work done in the
requirement gathering phase. We start by describing
the proposed system and the types of users who will
work on the system. We then move on to the real-world
scenarios and the planned project phases.
• System Description:
This project will fetch top-k social media users
from the vast data pool captured via social media
platforms like Twitter. It will be a Flask-based
web project with a UI developed in Bootstrap
and JavaScript. The backend will be implemented
in PostgreSQL, and the project's pivotal functionality
will be implemented in Python.
• Types of Users:
There are two types of users, which are described
in detail as follows:
1) End User: The end user inputs a keyword into the
system along with the desired location in the form
(lat, long). This user has limited
access to the system, restricted to the input given
to the system and the output generated by the
query.
2) Super User: The super user mode is suitable
for any user who wants to control the behaviour
of the system. This includes inspecting
the log files of the whole system in conjunction
with important data logs such as PostgreSQL's
WAL (Write-Ahead Logging) XLOG records. The
super user mode gives the user the power to further
inspect the data values generated by the system
and perform data science, data mining, and pattern
recognition operations.
• User Interaction Mode:
The end user will be able to interact with the system
via a web-based portal built with the technologies
mentioned in the System Description section above.
• Real-World Scenarios:
The real-world scenarios are described below,
grouped by user type:
Type: End User
Scenario: 1
Scenario Description: Top-k people/social media
posts in a locality about a query
System Data Input: A query - "People Talking
about Donald Trump" - and the latitude, longitude of a
desired area
Input Data Type: String and a Tuple
System Data Output: List of users and
tweets/social media posts about the query in
the area
Output Data Type: List and JSON Strings
Scenario: 2
Scenario Description: Query-targeted
advertisements in a locality
System Data Input: A query- "Burrito" and latitude,
longitude of a desired area
Input Data Type: String and a Tuple
System Data Output: List of users and tweets/social
media posts about the query in the area
Output Data Type: List and JSON Strings
Type: Super User
Scenario: 1
Scenario Description: Inspecting Log file of
PostgreSQL
System Data Input: Click on the "button" input
type on the form
Input Data Type: HTML Code
System Data Output: Opening the Log file on the
console
Output Data Type: HTML Code
Scenario: 2
Scenario Description: Managing the session and
API keys
System Data Input: Click on the "button" input
type on the form
Input Data Type: HTML Code
System Data Output: List session and API keys
used by the system
Output Data Type: HTML Code
• Project Timeline:
Stage 1, viz. requirement gathering, shall be
finished by around 21 March 2016, followed by
project design, which shall be completed by 1 April
2016. We will start development of the infrastructure
and back-end on 1 April 2016 and finish by
25 April 2016. From 25 April 2016 to 1 May 2016,
we will thoroughly test the system against the test
cases developed by one of our team members.
The planned division of labor is as below:
Saurabh Deochake: Requirement Gathering,
Project Design, Infrastructure Implementation
including front-end and back-end, Documentation.
Shreepad Patil: Requirement Gathering, Project
Design, Infrastructure Implementation including
front-end and back-end, Documentation.
Saurabh Singh: Requirement Gathering, Project
Design, Front-end Infrastructure development,
Testing.
B. Stage2 - The Design Stage.
This section explains the architecture of the project
and talks about the algorithm and data structure to be
used in the implementation of the project. We also talk
about the constraints in the projects
• Project Description: We describe the main components
of the project here. The architecture is
also illustrated in Figure 1.
Twitter API: We use the Twitter streaming API to
access tweets from Twitter, applying a filter on the
stream to keep only data relevant to our project. The
pseudo code for the filter is shown in Algorithm 1.
Parser: Once we acquire the filtered stream, we
parse it to extract the posts. The parser removes
stop words and stems the remaining words. Each
post is stored in the format p(uid, t, l, W), where
uid is the user ID, t is the timestamp, l is the location, and W
is the set of words in the tweet. We store the acquired
posts in a database. The pseudo code for this
component is shown in Algorithm 2.
Graph Generator: The Graph Generator then
converts these posts into a graph. The graph is
stored as G(U, E), where U is the set of users and
E is the set of edges; (u, v) ∈ E if u has 'replied
to' or 'forwarded' v's post. This component
appends vertices and edges to an existing graph
structure. The graph is stored as an adjacency list.
The pseudo code for this component is shown in
Algorithm 3.
Analyzer: This component is the core of
the architecture. The analyzer receives the
query from the query processor. Internally, the
analyzer finds all the posts that were posted within the
query radius and assigns a distance score to each of
these posts.
A second score, which we call the keyword relevance
score, is also calculated for each post. This score is
a function of the tweet's popularity and the number of
words the post and the query have in common. Tweet
popularity is based on the number of replies and
retweets of the post. The pseudo code for this
component is shown in Algorithm 4.
Query Processor: We plan to use a top-k query
on an aggregate score. The aggregate score is a
weighted combination of the distance score and the
keyword relevance score, i.e., score = α · dist_score +
(1 − α) · keyword_score, where α = 0.5 by default in
our implementation. In response to the query,
the analyzer responds with the top-k users
relevant to the query. The algorithm is briefly
described in Algorithm 5.
User Interface: The user interface takes
user input in the form q(W, l, r), where W is the set
of words, l is the location, and r is the distance. Once
the user submits the query, the interface passes it to
the query processor and displays the output to the
user.
The time complexity of the algorithms is linearithmic
in the number of posts; however, the number of posts
is expected to be very high. We denote the time complexity
of the algorithms as O(n log n), where n is the number of posts.

Figure 1. Proposed architecture for Topik
Algorithm 1: GetStream
Data: Filter F: a broad enough filter to acquire the
social media feed from the API in order to generate and
store data.
Result: Subset of posts from the feed.
begin
    Q ← empty data structure holding posts to pass
    to the parser
    p(uid, t, l, W) ← getCurrentTweet()
    if p satisfies F then
        return p(uid, t, l, W)
    else
        return ∅
    end
end
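As a concrete illustration of GetStream, below is a minimal sketch using Tweepy's streaming API (the library we use in Stage 3). The credential placeholders, the example keyword, the example bounding box, and the handle_post hand-off are assumptions for illustration, not our production configuration:

import tweepy

# Placeholder credentials; real keys are issued via Twitter's developer portal.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

class FilteredListener(tweepy.StreamListener):
    """Plays the role of the filter F in Algorithm 1."""
    def on_status(self, status):
        if status.coordinates is not None:   # p satisfies F: keep geo-tagged tweets
            # Hand p(uid, t, l, W) off to the parser stage (hypothetical helper).
            handle_post(status.user.id_str, status.created_at,
                        status.coordinates, status.text)

    def on_error(self, status_code):
        if status_code == 420:               # rate limited: disconnect and back off
            return False

stream = tweepy.Stream(auth, FilteredListener())
# Example filter: one keyword plus a bounding box roughly around New Jersey.
stream.filter(track=["project"], locations=[-75.6, 38.9, -73.9, 41.4])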
Algorithm 2: ParsePost
Data: tweet t from GetStream() after applying a filter.
Result: Refined tweet p(uid, t, l, W).
begin
    for ∀i ∈ tweet words W do
        if !IsRelevant(i) then
            W ← W − i
        end
    end
    return p(uid, t, l, W)
end
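A minimal Python sketch of ParsePost; the report does not name a stemming library, so the use of NLTK's stop-word list and Porter stemmer here is an assumption:

import re
from nltk.corpus import stopwords   # assumes the NLTK stopwords corpus is installed
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def parse_post(uid, t, l, text):
    """Refine a raw tweet into p(uid, t, l, W): drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    W = {STEMMER.stem(w) for w in words if w not in STOP}   # the IsRelevant(i) check
    return (uid, t, l, W)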
• Major Constraints: We describe the major
constraints that will be tackled here:
– Twitter Rate limit: We are limited by the
Twitter API when fetching streaming data.
Rate limits are divided into 15-minute intervals.
Additionally, all endpoints require authentication,
so there is no concept of unauthenticated calls
and rate limits. Clients that do not implement
backoff and attempt to reconnect as often as
possible will have their connections rate limited
for a small number of minutes; rate-limited clients
receive HTTP 420 responses for all connection
requests. Clients that break a connection and then
reconnect frequently (to change query parameters,
for example) also run the risk of being rate limited.
Algorithm 3: GraphGenerator
Data: p: refined tweet post from the parser; G: if the
graph already exists, the algorithm appends to it.
Result: An updated graph G(U, E).
begin
    for ∀u ∈ U do
        if p.user replied to or forwarded u's post then
            E ← E ∪ {(p.user, u)}
        end
    end
    return G
end
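A minimal sketch of the adjacency-list update in GraphGenerator, assuming each refined post carries the IDs of the users it replied to or retweeted (the field names are illustrative):

from collections import defaultdict

# Adjacency list: G[u] is the set of users v such that u replied to
# or forwarded v's post, i.e., (u, v) ∈ E.
G = defaultdict(set)

def add_post(graph, uid, replied_to_uid=None, retweeted_uid=None):
    """Append the edges implied by one refined post to the existing graph G(U, E)."""
    for v in (replied_to_uid, retweeted_uid):
        if v is not None:
            graph[uid].add(v)   # adds edge (u, v) to E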
Algorithm 4: Analyzer and QueryProcessor
Data: k: number of top posts needed in the result;
radius: radial distance from the user's location within
which posts are analyzed;
W: set of keywords provided as input by the user;
thresh: threshold score above which a post is
inserted into the result set.
Result: A sorted set Ak containing the top-k results.
begin
    CalculateDistanceScore(radius)
    CalculateKeywordScore(W)
    L1 ← SortByDistScore()
    L2 ← SortByKeywordScore()
    Run Top-k()
    return top-k results
end
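A minimal sketch of the two scores computed in Algorithm 4. The exact normalizations are assumptions on our part (the text only states that distance, keyword overlap, and popularity matter), so treat this as illustrative rather than as the production scoring:

def distance_score(post_dist, radius):
    """Score in [0, 1]: 1 at the query location, falling to 0 at the query radius."""
    return max(0.0, 1.0 - post_dist / radius)

def keyword_score(post_words, query_words, replies, retweets):
    """Keyword relevance weighted by tweet popularity (replies and retweets)."""
    overlap = len(set(post_words) & set(query_words))
    popularity = 1 + replies + retweets
    return overlap * popularity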
– Size constraints: The data gathered from
social media is humongous, but we will be
working on local machines with limited storage
capacity, so we will store limited data
and apply constraints on the streaming data
we fetch.
Algorithm 5: Top-k Query Processing using Threshold
Algorithm
Data: Sorted lists L1 and L2
Result: Top-k users related to the query
begin
    (1) Do sorted access in parallel to each of the 2
    sorted lists. As a new object o is seen under
    sorted access in some list, do random access
    (calculate the other score) to the other list to
    find pi(o) in the other list Li. Compute the score
    F(o) = F(p1, p2) of object o. If this score is
    among the k highest scores seen so far, then
    remember object o and its score F(o) (ties are
    broken arbitrarily, so that only k objects and
    their scores are remembered at any time).
    (2) For each list Li, let pi be the score of the
    last object seen under sorted access. Define
    the threshold value T to be F(p1, p2). As soon
    as at least k objects have been seen with
    scores at least equal to T, halt.
    (3) Let Ak be the set containing the k seen
    objects with the highest scores. The output is
    the sorted set {(o, F(o)) | o ∈ Ak}.
end
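A compact Python sketch of the threshold algorithm described above, under some simplifying assumptions: each list holds (object, score) pairs sorted by score descending, random access is a dictionary lookup, and F is a plain sum for illustration (our implementation uses the α-weighted combination instead):

import heapq

def threshold_topk(L1, L2, k):
    """L1, L2: lists of (object, score) pairs sorted by score descending."""
    s1, s2 = dict(L1), dict(L2)      # random-access view of each list
    F = lambda p1, p2: p1 + p2       # aggregate score (a plain sum, for illustration)
    seen = {}                        # F(o) for every object seen so far
    for (o1, p1), (o2, p2) in zip(L1, L2):
        # Step (1): sorted access; the dict lookup plays the role of random access.
        for o in (o1, o2):
            seen[o] = F(s1.get(o, 0.0), s2.get(o, 0.0))
        topk = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Step (2): threshold T from the last scores seen under sorted access.
        T = F(p1, p2)
        if len(topk) >= k and all(score >= T for _, score in topk):
            return topk              # Step (3): the sorted set {(o, F(o))}
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])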
C. Stage3 - The Implementation Stage.
We are using a mixed framework for the implemen-
tation. The specification is as below:
• Python Tweepy for back-end
• MySQL MariaDB for back-end
• Python Flask for creating web services
• HTML/CSS, javascript, jquery and twitter boot-
strap for UI
The workings are shown in the sections below:
• Data snippet: The data is taken from the Twitter
firehose streaming API. The streaming API delivers the
data in JSON format, which is then processed by
our parser. The data looks like the snippet below:
{
    "id_str": "721061993747120128",
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id_str": "51235119",
        "name": "Heber R",
        "screen_name": "EoSean",
        "profile_image_url": "http://pbs.twimg.com/profile_images/688464505676812288/jLsNWXDs_normal.jpg"
    },
    "geo": null,
    "coordinates": null,
    "place": {
        "bounding_box": {
            "type": "Polygon",
            "coordinates": [..]
        },
        "attributes": {}
    }
}
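As a small illustration, the fields our parser needs can be pulled out of one such JSON object with the standard library. Only keys visible in the snippet above are used; the timestamp and text fields are elided from the snippet, so they are omitted here:

import json

def extract_fields(raw):
    """Pull the user ID and location out of one streamed JSON object."""
    tweet = json.loads(raw)
    uid = tweet["user"]["id_str"]
    l = tweet["coordinates"]   # often null in practice
    if l is None and tweet.get("place"):
        # Fall back to the place bounding-box polygon.
        l = tweet["place"]["bounding_box"]["coordinates"]
    return uid, l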
• Sample small output: The output is given by the
back-end via a web service in JSON format:

[
    {
        'scoreUB': 0.483731326757061,
        'LB': 1,
        'user': '192842961',
        'scoreLB': 0.481231326757061
    },
    {
        'scoreUB': 0.4376959174838795,
        'LB': 1,
        'user': '519165974',
        'scoreLB': 0.4364459174838795
    },
    {
        'scoreUB': 0.4271013729546355,
        'LB': 1,
        'user': '212309060',
        'scoreLB': 0.4271013729546355
    }
]
• Working code: One of the main functions is shown
below:
import MySQLdb  # MySQL driver used by the back-end

def findTopK(self, kLim=5, alpha=0.5):
    self.findTopTextResults()   # Calculate text scores
    self.findTopDistResults()   # Calculate distance scores

    LDist = []
    LText = []
    topKList = []

    db = MySQLdb.connect(host="localhost", user="spd", passwd="****", db="topik")
    curDist = db.cursor()
    curText = db.cursor()

    # The two sorted lists for the top-k algorithm: distance and text scores.
    curDist.execute("SELECT uid, AVG(score) FROM temp_distScore "
                    "GROUP BY uid HAVING AVG(score) > 0 "
                    "ORDER BY AVG(score) DESC;")
    curText.execute("SELECT uid, (SUM(score)/40) FROM temp_textScore "
                    "GROUP BY uid HAVING SUM(score) > 0 "
                    "ORDER BY SUM(score) DESC;")

    LDist.extend(list(curDist.fetchall()))  # sorted list L1
    LText.extend(list(curText.fetchall()))  # sorted list L2

    k = 0
    buff = []   # candidate buffer of {user, scoreLB, scoreUB, LB}
    Maxlength = max(len(LDist), len(LText))

    # Top-k algorithm implementation (NRA-style: sorted access only)
    for i in range(Maxlength):
        if i < len(LDist):
            UserL1, ScoreL1 = LDist[i][0], LDist[i][1]
        else:
            UserL1, ScoreL1 = '', 0

        if i < len(LText):
            UserL2, ScoreL2 = LText[i][0], LText[i][1]
        else:
            UserL2, ScoreL2 = '', 0

        threshold = alpha * ScoreL1 + (1 - alpha) * ScoreL2

        if UserL1 == UserL2:
            buff.append({"user": UserL1, "scoreLB": threshold,
                         "scoreUB": threshold, "LB": 1})
        else:
            indexUserL1 = next((j for j, v in enumerate(buff)
                                if v["user"] == UserL1), None)
            if indexUserL1 is not None:   # "is not None": index 0 is falsy
                buff[indexUserL1]["scoreLB"] += alpha * ScoreL1
                buff[indexUserL1]["scoreUB"] = buff[indexUserL1]["scoreLB"]
            else:
                buff.append({"user": UserL1, "scoreLB": alpha * ScoreL1,
                             "scoreUB": threshold, "LB": 1})

            indexUserL2 = next((j for j, v in enumerate(buff)
                                if v["user"] == UserL2), None)
            if indexUserL2 is not None:
                buff[indexUserL2]["scoreLB"] += (1 - alpha) * ScoreL2
                buff[indexUserL2]["scoreUB"] = buff[indexUserL2]["scoreLB"]
            else:
                buff.append({"user": UserL2, "scoreLB": (1 - alpha) * ScoreL2,
                             "scoreUB": threshold, "LB": 2})

        # Update upper bounds using the latest score seen in the other list
        for b in buff:
            if b["LB"] == 1:
                b["scoreUB"] = b["scoreLB"] + (1 - alpha) * ScoreL2
            else:
                b["scoreUB"] = b["scoreLB"] + alpha * ScoreL1

        # Move candidates whose lower bound beats the threshold into the result
        tempBuff = []
        for b in buff:
            if b["scoreLB"] > threshold:
                topKList.append(b)
                k += 1
            else:
                tempBuff.append(b)
        buff = tempBuff

        if k >= kLim:
            return topKList
    return topKList
• Demo and sample findings:
– Data size: The data is collected from the Twitter
stream; after 2 hours of streaming, we had gathered
approximately 11 MB of data in the database.
– Interesting Findings: Our program is efficient
in terms of space complexity because we store
only stemmed words in the database, obviating
storage of the full text retrieved from the Twitter
API. In terms of time complexity, our model
overall runs in O(n log n). We run the No
Random Access (NRA) algorithm to process the
data stored in the database and find the top-k
results. The NRA algorithm further strengthens
our model by reducing data accesses in the database
from O(mn) to O(m + n). As this project demands
operating on big data originating from the Twitter
API, using a hash-like data structure, i.e., a
Python dictionary, further helped us reduce
the overall time complexity. We also realized
that techniques such as locality-sensitive
hashing over geo-tagged streaming data would
further improve our model in terms of time and
space complexity.
D. Stage4 - User Interface.
The project is implemented as a client-server model.
The back-end runs on a server and the user interface
can be opened in a browser(client). To open the user
interface, the users will go to the link. Once the web
page is open, the user can input his/her query as a
series of inputs in the given fields. Words, latitude and
longitude, radius, and k are taken as inputs. Once the
user clicks on submit button, the output is displayed as
a table. This stage include the following items :
• The user inputs the text queries in the designated
fields and clicks on the Submit button
• If the inputs are not acceptable for the query,
appropriate error messages are displayed to the
user in the form of alerts
– The Words field cannot be empty and cannot contain
any special characters. An alert is given if valid
input is not obtained
– Latitude and Longitude cannot be empty and
cannot contain values outside the valid range
(-180 to 180 for longitude and
-85 to 85 for latitude). An error message will
be displayed if the input is invalid
– The Radius field cannot be empty or negative. If
an invalid input is obtained, an appropriate
message is displayed
– The K field cannot be empty or negative. If an
invalid input is obtained, an appropriate
message is displayed
• If every input is valid, a query is triggered and
the results are displayed to the user in a table
– The Handle field shows the Twitter handle of
the user returned by the query
– The Name field shows the full name of the user
returned by the query
– The Image field shows the profile picture of the
user returned by the query
• The query is handled in a Flask-based Python
web server. The web server calls the top-k query
processor and executes the query; the results are
then returned in JSON format. A minimal sketch of
such a route follows this list.
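Below is a minimal sketch of such a Flask route with server-side versions of the validation rules above. The route name, parameter names, and the query_processor hand-off are illustrative assumptions, not the exact production endpoint:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/topk")
def topk():
    words = request.args.get("words", "")
    try:
        lat = float(request.args["lat"])
        lon = float(request.args["lon"])
        radius = float(request.args["radius"])
        k = int(request.args["k"])
    except (KeyError, ValueError):
        return jsonify(error="missing or non-numeric parameter"), 400
    # Server-side counterparts of the UI validation rules.
    if not words or not words.replace(" ", "").isalnum():
        return jsonify(error="keywords can not contain special characters or can't be empty"), 400
    if not (-85 <= lat <= 85) or not (-180 <= lon <= 180):
        return jsonify(error="longitude/latitude not valid (-180/-85 to 180/85)"), 400
    if radius <= 0 or k <= 0:
        return jsonify(error="radius and k must be positive numbers"), 400
    # Hypothetical hand-off to the Stage 3 query processor.
    results = query_processor.findTopK(kLim=k)
    return jsonify(results=results)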
Deliverables for Stage4 are as follows:
• To activate the UI, make sure that the MySQL
DB and Flask server are online, then open the
frontendTopik.html file in a JavaScript-enabled
browser.
Figure 2. Initial screen for the application
Figure 3. Screen with error message for invalid Words input
• Use case: As it is finals week at Rutgers, many
students are working on project submissions, so we
ran a query to find the top 5 users tweeting with
the keyword "project" within a distance of 5
miles. We show the validation cases for running
this query.
• When the user gives improper input, error
messages are shown as below; refer to Figures 3-6. If all
the inputs are valid, the output is shown as
described below; refer to Figure 7.
– "keywords can not contain special characters
or can’t be empty":
– The error message explanation (upon which
violation it takes place): The error occurs
when user gives invalid input in the words field
– "longitude/latitude not valid (-180/-85 to
180/85) or can’t be empty":
– The error message explanation (upon which
violation it takes place): The error occurs
when user gives invalid input in the longitude
or latitude field
– "Radius not valid(Input a positive number) or
can’t be empty":
Figure 4. Screen with error message for invalid Latitude/Longitude
input
Figure 5. Screen with error message for invalid radius input
– The error message explanation (upon which
violation it takes place): The error occurs
when user gives invalid input in the radius field
– "k not valid(Input a positive number) or can’t
be empty":
– The error message explanation (upon which
violation it takes place): The error occurs
when user gives invalid input in the k field
Figure 6. Screen with error message for invalid k input
Figure 7. Screen with the full output as a table
• When the user hits the Submit button with
correct input in all fields, the output event is triggered.
– The output: a table with the top-k users tweeting
about the words within the radius
– The result consists of three column fields in
the table, with Image, Name, and the Twitter
handle as the output.