Lecture @ International Hellenic University
Thessaloniki, 8 May 2014
Social Media Crawling and Mining
Overview of Hands-on Workshop
Symeon (Akis) Papadopoulos, Manos Schinas, Katerina Iliakopoulou,
Yiannis Kompatsiaris
Information Technologies Institute (ITI)
Centre for Research & Technologies Hellas (CERTH)
IHU SocialSensor Seminar – May 2014 CERTH-ITI
workshop objectives
• get a glimpse into the problem of social media
monitoring and mining
• get a feeling for the nature of social data
• understand the basic operations/tasks involved
• look into underlying research problems
• gain some practical experience with new database
technologies (mongo, Solr)
• be motivated to explore further problems – become
experts
additional material
https://www.dropbox.com/s/ip0ah4u5r5mqi6i/dbs.zip
http://bit.ly/1nkyfQ7 (~230MB)
https://www.dropbox.com/s/d0d3586fxqtlx4u/ihu-material.zip
http://bit.ly/1uEfo4t (~180MB)
problem setting
• input: streams of content from social sources
• output: statistics/insights
operations:
• data management (pre-processing, indexing)
• data access/retrieval
• mining:
– basic analytics
– trend detection
basic concepts I
• Item:
– The unit of social content: Can refer to a tweet, a Facebook post, a
Google+ post, an Instagram post, etc.
– It is typically associated with some attributes: ID, title, description, tags,
contributor (here to be referred to as StreamUser), publication date, etc.
– There are some attributes that are social network-specific: location,
retweets, likes, shares, etc.
• MediaItem:
– For Items that point to multimedia content (image/video), we make use of
a special type of object, called MediaItem to store the associated URL of
the image/video along with additional attributes (e.g. image size, video
duration, etc.).
– Depending on the social network, an Item may have no associated
MediaItem (e.g. a Facebook post) or must always be associated
with a MediaItem (e.g. a YouTube video).
• Webpage:
– Represents a webpage. It is linked to the Item that shared it and may
contain one or more MediaItems.
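basic concepts I (diagram)
[Figure: relationships between the four object types. StreamUser shares/creates Item; Item links/contains MediaItem; Item links Webpage; Webpage contains MediaItem. The edges carry 1 and N cardinality labels.]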
basic concepts II
• crawling:
– Typically this refers to a process that “explores” the Web
by starting from a seed set of webpages and following the
included links (in a recursive way).
– In a social media context, we will use crawling in a relaxed
manner to mean “collection (in a focused way) of content
shared by social network users”.
– Collection from social media can be done in many ways. In
this workshop, we will use the paradigm of the “stream
manager”.
basic concepts III
• stream manager:
– This is a process that continuously retrieves Items from a social
source based on a given configuration. The collected Items
along with the embedded MediaItems and Webpages can be
stored to the selected databases for further processing.
– For instance, in the case of Twitter the configuration may
specify a set of keywords/users and/or locations to track (as
supported by the Streaming API). The configuration options vary
depending on the social network in question.
– Once a new Item is obtained, its further handling may include
the following: a) storing to a DB, b) indexing of its text, c)
extraction of contained URLs and/or MediaItems, d) further
analysis (e.g. sentiment detection), etc.
– In this workshop we will restrict ourselves to Twitter data.
basic concepts IV
• indexing:
– There are different types of indices that serve different
access requirements.
– Here, we will deal with the following types of indices:
• Full-text index: This supports free text queries and will be based
on the Solr framework.
• Numerical value index: This supports interval/threshold queries
(e.g. retrieve all tweets with more than 100 retweets) and is
applied on numerical (int/double) fields. The same index can be
used for temporal filtering if the Unix timestamp is indexed.
• Text similarity index: This supports similarity-based queries using
the whole text of an input Item, e.g. bring the most similar tweets.
Our implementation relies on Locality Sensitive Hashing (LSH).
• Visual similarity index: This supports search by example using the
visual content of an image (will not be used in this workshop).
basic concepts V
• mining:
– Mining refers to the processing of multiple Items with the
goal of extracting some insights, higher-level conclusions
about the data collection, and ultimately about the real
world (e.g. understand which candidate is more popular).
– In this workshop, we will deal with two popular and
relatively straightforward mining problems:
• basic analytics: this involves the computation of basic aggregate
statistics and the extraction of most “important” objects in a given
dataset, e.g. most important contributors, hashtags, Items.
• trend detection: this involves the detection of keywords or
phrases that attract increasing interest during a specific interval.
Those may refer to news stories, events, persons (e.g. celebrities),
memes, etc. We often refer to those as topics.
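The "most important hashtags" part of basic analytics can be sketched in a few lines of plain Java. This is only an illustration under simple assumptions: tweets are given as plain strings, hashtags are whitespace-separated tokens starting with "#", and the class and method names (TopHashtags, countHashtags, top) are hypothetical, not part of the workshop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the workshop code base.
final class TopHashtags {

    /** Counts hashtag occurrences (case-insensitive) over a collection of tweet texts. */
    static Map<String, Integer> countHashtags(List<String> texts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            for (String token : text.split("\\s+")) {
                // Crude rule: a hashtag is any whitespace-separated token starting with '#'.
                if (token.length() > 1 && token.startsWith("#")) {
                    counts.merge(token.toLowerCase(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns the n most frequent hashtags, most frequent first (ties in alphabetical order). */
    static List<String> top(Map<String, Integer> counts, int n) {
        List<String> tags = new ArrayList<>(counts.keySet());
        tags.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(tags.subList(0, Math.min(n, tags.size())));
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
            "Talks continue #Ukraine #Crimea",
            "Latest from #ukraine",
            "Price drops again #bitcoin");
        System.out.println(top(countHashtags(texts), 2));
    }
}
```

The same counting pattern applies to contributors or retweeted Items; only the key that is counted changes.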
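overview of architecture
[Figure: architecture diagram with three stages: crawling, indexing, mining. Crawling: the Stream Manager, configured through input.conf.xml (crawling configuration) and streams.conf.xml (DB & OSN credentials configuration). Indexing: Mongo DAO and Solr Handler in front of mongo and Solr. Mining: basic analytics and trend detection, which access mongo and Solr through a Mongo DAO and a Solr Handler.]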
useful techniques I: tokenization
• Split an input text into lexical units (=tokens)
– useful for indexing (happens behind the scenes)
– useful for trending keyword detection
• Most crude implementation:
String[] tokens = message.split(" ");
• Available standard implementations in Solr, e.g.:
– Standard Tokenizer
– Letter Tokenizer
– N-Gram Tokenizer (produces character n-grams; for n-grams of N consecutive tokens, Solr offers the Shingle Filter)
• For Twitter, Twokenizer is popular.
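A step up from splitting on spaces, still far from a real tweet tokenizer, is a regular expression that keeps URLs, @mentions and #hashtags intact. The sketch below uses only the Java standard library; the class name SimpleTweetTokenizer and the exact pattern are illustrative assumptions and do not reproduce any of the tokenizers listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not one of the tokenizers named on the slide.
final class SimpleTweetTokenizer {

    // Alternatives, tried in order: URLs, @mentions/#hashtags,
    // plain words (letters or digits, optionally with an inner apostrophe).
    private static final Pattern TOKEN = Pattern.compile(
        "https?://\\S+|[@#]\\w+|[\\p{L}\\p{N}]+(?:'\\p{L}+)?",
        Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> tokenize(String message) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(message);
        while (m.find()) {
            String token = m.group();
            boolean isUrl = token.startsWith("http://") || token.startsWith("https://");
            // URLs are case-sensitive, so keep them as-is; lowercase everything else.
            tokens.add(isUrl ? token : token.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Breaking: #Ukraine talks, says @BBC! http://t.co/abc"));
    }
}
```

Punctuation is dropped, while hashtags and mentions survive as single tokens, which is what trending keyword detection usually needs. A URL followed directly by punctuation would keep that punctuation, so this remains a crude approximation.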
useful techniques II: entity detection
• entity detection (or named entity detection) is the
marking of particular tokens that refer to things like
persons, organizations (e.g. companies), locations
– useful for giving more importance to these tokens
– useful for filtering noise (e.g. tweets that contain no
entities)
– named entities make good query terms (e.g. for retrieving
content from external sources)
• standard implementations are available
– perhaps the most popular is the Stanford NER library
– others include Balie, GATE, OpenNLP
useful techniques III: LSH (a)
• Locality Sensitive Hashing (LSH) is a set of probabilistic
methods for the hashing of high-dimensional data.
• The basic idea is that similar items according to some
metric are hashed to the same value with high
probability.
• There are available hash functions for several distance
measures: Jaccard Coefficient (MinHash), L1 and L2
distance, Cosine Similarity (random projection).
• Random projection: For an input vector u of length d,
and using K random d-dimensional vectors (with K<<d)
we create a signature of length K, e.g. for K=4: hash(u) =
{1, 0, 1, 1}
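The random projection scheme can be sketched as follows: each of the K random vectors defines a hyperplane, and bit i of the signature records on which side of hyperplane i the input vector falls (the sign of the dot product). This is a minimal illustration, not the workshop implementation; the class name RandomProjectionHash, the Gaussian sampling and the fixed seed are assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration of random-projection LSH for cosine similarity.
final class RandomProjectionHash {

    private final double[][] planes; // K random d-dimensional vectors

    RandomProjectionHash(int k, int d, long seed) {
        Random rnd = new Random(seed);
        planes = new double[k][d];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                planes[i][j] = rnd.nextGaussian();
            }
        }
    }

    /** Returns a K-bit signature: bit i is 1 if the dot product of u with vector i is >= 0. */
    int[] hash(double[] u) {
        int[] signature = new int[planes.length];
        for (int i = 0; i < planes.length; i++) {
            if (u.length != planes[i].length) {
                throw new IllegalArgumentException("expected a vector of length " + planes[i].length);
            }
            double dot = 0.0;
            for (int j = 0; j < u.length; j++) {
                dot += planes[i][j] * u[j];
            }
            signature[i] = dot >= 0 ? 1 : 0;
        }
        return signature;
    }

    public static void main(String[] args) {
        RandomProjectionHash lsh = new RandomProjectionHash(4, 6, 42L);
        double[] u = {0.5, 1.0, 0.0, 2.0, 0.0, 1.5};
        System.out.println(Arrays.toString(lsh.hash(u)));
    }
}
```

Because only the sign of each dot product matters, scaling a vector by a positive constant leaves its signature unchanged, and vectors separated by a small angle agree on most bits.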
useful techniques III: LSH (b)
Applications
• Approximate Nearest-Neighbour search
– Typical values for K are 4, 6, 8, which give 2^K = 16, 64, 256 unique
signatures. Given this partitioning of the input collection,
instead of searching the whole dataset for the nearest
item, one may search only in the subset of items with the
same signature.
– To increase recall, one could use L different hash functions
and merge the results.
• Near Duplicate Detection
– Items with the same or similar signature are considered
near duplicates. For this to be more precise, we need
higher values of K, e.g. K=12 gives 2^12 = 4,096 unique signatures (16,384 would need K=14).
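The bucket lookup behind both applications can be sketched with a hash map keyed by signature. The class SignatureIndex and its methods are hypothetical names used for illustration; signatures are assumed to be arrays of 0/1 bits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of signature-based bucketing.
final class SignatureIndex {

    // Bucket key (signature rendered as text) -> ids of the items hashed to that bucket.
    private final Map<String, List<String>> buckets = new HashMap<>();

    void add(String itemId, int[] signature) {
        buckets.computeIfAbsent(Arrays.toString(signature), key -> new ArrayList<>()).add(itemId);
    }

    /** Ids of the items sharing the query signature: the only ones compared in detail. */
    List<String> candidates(int[] signature) {
        return buckets.getOrDefault(Arrays.toString(signature), Collections.emptyList());
    }

    /** Number of distinct signatures (buckets) possible with K bits. */
    static long maxBuckets(int k) {
        return 1L << k;
    }

    public static void main(String[] args) {
        SignatureIndex index = new SignatureIndex();
        index.add("tweet-1", new int[] {1, 0, 1, 1});
        index.add("tweet-2", new int[] {1, 0, 1, 1});
        index.add("tweet-3", new int[] {0, 0, 1, 1});
        System.out.println(index.candidates(new int[] {1, 0, 1, 1}));
        System.out.println(maxBuckets(8));
    }
}
```

To raise recall with L hash functions, one would keep L such indices, each fed by a different hash function, and merge their candidate lists before the detailed comparison.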
useful techniques IV: querying mongo (a)
• MongoDB stores data in the form of documents,
which are JSON-like field and value pairs.
useful techniques V: querying Solr (a)
• For Solr, each Item is a document with a set of
indexed fields, which can be queried in combination.
• Example:
String query = "(title:Crimea) OR (description:Crimea)";
SolrQuery solrQuery = new SolrQuery(query);
• Number of results & sorting:
solrQuery.setRows(100);
solrQuery.addSortField("publicationTime", ORDER.desc);
useful techniques V: querying Solr (b)
Additional examples:
• Restrict to a selected time period:
String query = "((title:Crimea) OR (description:Crimea)) AND publicationTime:["
    + minDateTime + " TO " + maxDateTime + "]"; // minDateTime, maxDateTime: long values
• Retrieve results in the order of descending retweets:
solrQuery.addSortField("retweetsCount", ORDER.desc);
• Retrieve only those Items with at least 100 retweets (the range [100 TO *] includes 100):
solrQuery.addFilterQuery("retweetsCount: [100 TO *]");
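The query strings used above can be assembled by small helper methods, which keeps the parentheses and range brackets in one place. This is a sketch in plain Java with no SolrJ dependency; the class and method names are hypothetical, the field names are the ones used in these slides, and the keyword is assumed to be a single term without characters that need escaping. The mixed-bracket range for a strictly-greater-than filter assumes a Solr version that supports it (4.0 or later).

```java
// Hypothetical helper that only builds query strings; it does not talk to Solr.
final class SolrQueryStrings {

    /** Keyword in title or description, restricted to a publication time interval (inclusive). */
    static String keywordInTimeRange(String keyword, long minDateTime, long maxDateTime) {
        // The outer parentheses around the OR clause keep the AND from binding to one side only.
        return "((title:" + keyword + ") OR (description:" + keyword + "))"
            + " AND publicationTime:[" + minDateTime + " TO " + maxDateTime + "]";
    }

    /** Inclusive lower bound, e.g. retweetsCount:[100 TO *] matches 100 and above. */
    static String atLeast(String field, long threshold) {
        return field + ":[" + threshold + " TO *]";
    }

    /** Exclusive lower bound, e.g. retweetsCount:{100 TO *] matches 101 and above. */
    static String moreThan(String field, long threshold) {
        return field + ":{" + threshold + " TO *]";
    }

    public static void main(String[] args) {
        System.out.println(keywordInTimeRange("Crimea", 1393351200000L, 1393437600000L));
        System.out.println(moreThan("retweetsCount", 100));
    }
}
```

The resulting strings would be passed to new SolrQuery(...) and addFilterQuery(...) exactly as in the examples above.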
topic detection: basics
• Trending topic: A news story or discussion that attracts a lot
of activity during a given time interval.
• How is it represented?
– Headline (“Bomb explosion in the Z Embassy of X.”)
– Set of keywords (“bomb”, “explosion”, “embassy”, etc.)
– N-grams (“bomb explosion”, “Z Embassy”, etc.)
– Set of characteristic Items and possibly MediaItems
• There are different categories of topic detection methods:
– document-pivot: Cluster incoming Items into groups referring to the
same topic and try to extract a topic representation by processing the
group Items.
– feature-pivot: Try to extract frequently occurring or trending
keywords or phrases that are supposed to correspond to topics.
– probabilistic-generative: Try to find a generative topic model that
underlies the topic distribution on the set of documents (e.g. LDA).
topic detection: document-pivot
• Objective: Cluster together Items referring to the same topic.
• Simple technique (incremental clustering):
– Compare each incoming Item with existing topics-clusters and if
similarity is higher than a given threshold then assign it to the most
similar, otherwise create a new topic.
– Two issues: a) threshold selection, b) what do you compare with: i) a
representative Item per cluster, ii) an aggregate cluster representation
(e.g. centroid)
• Other conventional clustering techniques could be tried out:
k-means, DBSCAN, hierarchical agglomerative clustering, but
most of them are not easy to apply incrementally.
• Another challenge stems from the short length of Tweets,
which makes similarity between them high only when they
are practically near-duplicate. This leads to the well-known
problem of topic fragmentation.
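The incremental clustering technique above can be sketched as follows. This is an illustration under stated assumptions rather than the workshop implementation: similarity is the Jaccard coefficient between sets of lowercased whitespace-separated tokens, each cluster is represented by its first Item (option i above), and the class name IncrementalClusterer is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of threshold-based incremental clustering.
final class IncrementalClusterer {

    private final double threshold;
    private final List<Set<String>> representatives = new ArrayList<>(); // first Item of each cluster
    private final List<List<String>> clusters = new ArrayList<>();

    IncrementalClusterer(double threshold) {
        this.threshold = threshold;
    }

    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    /** Assigns the text to the most similar cluster, or opens a new one; returns the cluster index. */
    int add(String text) {
        Set<String> itemTokens = tokens(text);
        int best = -1;
        double bestSimilarity = 0.0;
        for (int i = 0; i < representatives.size(); i++) {
            double similarity = jaccard(itemTokens, representatives.get(i));
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                best = i;
            }
        }
        if (best >= 0 && bestSimilarity > threshold) {
            clusters.get(best).add(text);
            return best;
        }
        representatives.add(itemTokens);
        clusters.add(new ArrayList<>(Arrays.asList(text)));
        return clusters.size() - 1;
    }

    List<List<String>> clusters() {
        return clusters;
    }

    public static void main(String[] args) {
        IncrementalClusterer clusterer = new IncrementalClusterer(0.5);
        clusterer.add("bomb explosion at embassy");
        clusterer.add("bomb explosion at the embassy");
        clusterer.add("bitcoin price falls");
        System.out.println(clusterer.clusters());
    }
}
```

With a threshold of 0.5, two tweets must share most of their words to be merged, which illustrates the fragmentation problem: differently worded tweets about the same story end up in separate clusters.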
topic detection: feature-pivot
• Objective: Try to find terms or phrases that appear very
prominently in the dataset.
• Simple technique:
– Select N most frequent keywords or hashtags.
– Problem-1: If you consider all keywords, you’ll end up with too
many generic words or very broad topics.
– Problem-2: A single keyword is often insufficient to describe a
topic accurately.
• More advanced technique:
– Select those keywords and phrases that are used more
frequently now (in the current time slot) than in
previous time slots, cf. BN-gram approach (Aiello et al., 2013).
– Need to compute n-gram frequencies per time slot (use Solr n-gram
analysis, e.g. the Shingle Filter for word n-grams, and retrieve the most frequent n-grams).
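The idea of comparing the current time slot with the previous one can be sketched with a simple smoothed frequency ratio. This is a deliberately simplified stand-in, not the BN-gram scoring of Aiello et al.; the class name BurstyKeywords, the add-one smoothing and the two thresholds (minCount, minRatio) are assumptions made for illustration. Term counts per slot are assumed to be already available, e.g. from the index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of bursty keyword detection between two time slots.
final class BurstyKeywords {

    /** Add-one smoothed ratio of the current-slot count to the previous-slot count. */
    static double burstRatio(int previousCount, int currentCount) {
        return (currentCount + 1.0) / (previousCount + 1.0);
    }

    /** Terms that are frequent enough now and grew by at least minRatio, strongest burst first. */
    static List<String> trending(Map<String, Integer> previous, Map<String, Integer> current,
                                 int minCount, double minRatio) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> entry : current.entrySet()) {
            int now = entry.getValue();
            double ratio = burstRatio(previous.getOrDefault(entry.getKey(), 0), now);
            if (now >= minCount && ratio >= minRatio) {
                scores.put(entry.getKey(), ratio);
            }
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Integer> previous = new HashMap<>();
        previous.put("the", 10);
        previous.put("ukraine", 2);
        Map<String, Integer> current = new HashMap<>();
        current.put("the", 11);
        current.put("ukraine", 9);
        current.put("crimea", 5);
        System.out.println(trending(previous, current, 3, 2.0));
    }
}
```

Generic words such as "the" are frequent in both slots, so their ratio stays close to 1 and they are filtered out, which addresses Problem-1 above; the same scoring works unchanged when the map keys are n-grams instead of single keywords.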
dataset: SNOW 2014 Data Challenge
• A set of ~1M tweets collected using a list of 5000 UK-
focused “news hounds” and the keywords “Syria”,
“terror”, “Ukraine”, and “bitcoin” for a period of 24
hours starting from Feb 25, 18:00.
• Average rate: ~720 tweets/minute
• Number of unique twitter accounts: ~556K
• Number of retweets: ~648K
• Number of replies: ~135K
• Ground truth topics:
http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755
overview of hands-on workshop
• introduction (30-40 mins)
– how to use the stream manager
– how to index and query content in Solr and mongoDB
– how to import an existing dataset in the system
• basic analytics (~45 mins)
– how to compute and maintain statistics for the most active and
influential users, top hashtags, top tweets
– how to create activity timelines
• trend detection (~90 mins)
– existing solutions: a) document-pivot, b) keywords-based
– work on own implementation: bursty keyword detection
• future work (~10 mins)
our tutors
Manos Schinas (manosetro@iti.gr)
Katerina Iliakopoulou (ailiakop@iti.gr)
Questions?
project idea I
• improve topic detection
– read literature (cf. references at the end)
– make use of external sources as input
• RSS feeds
• Gazetteer of named entities (e.g. Wikipedia) to detect tokens that
refer to persons, locations, organizations
– make use of machine learning
• train classifiers that can separate high-quality Items from noise
• train classifiers that can separate trustworthy sources
(StreamUsers) from less reliable ones
• how to create training set?
project idea II
• integrate additional sources
– there are several wrappers available for collecting content
from multiple social networks
– try to programmatically use the wrappers to collect social
content around events of interest (e.g. get lists of RSS
feeds as input)
– present collected content in a Web UI
– cf. “meteoroid on steroids” paper (References)
project idea III
• create an alerting system
– monitor a set of keywords that relate to specific types of
events, e.g. fires, explosions, etc. [try also with Greek
keywords]
– check whether collected Items indeed refer to these
events or maybe are irrelevant
• use topic models (needs training)
• use writing style rules (to check quality of writing)
– if the number of Items is considerably larger than the mean value,
raise an alarm!
• automatically send email (also include sample Items)
• create an RSS feed
• automatically tweet about it, see for instance @WikiLiveMon
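One way to make "considerably larger than mean value" concrete is to compare the current count against the mean plus a multiple of the standard deviation of past counts. The sketch below is an assumption about how such a rule could look, not a prescribed solution; the class name ActivityAlarm and the multiplier k are hypothetical.

```java
// Hypothetical illustration of a simple alerting rule over per-slot Item counts.
final class ActivityAlarm {

    static double mean(int[] values) {
        double sum = 0.0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation of the given counts. */
    static double standardDeviation(int[] values) {
        double m = mean(values);
        double squared = 0.0;
        for (int v : values) {
            squared += (v - m) * (v - m);
        }
        return Math.sqrt(squared / values.length);
    }

    /** True if the current count exceeds mean + k * standard deviation of the history. */
    static boolean shouldAlert(int[] history, int current, double k) {
        if (history.length == 0) {
            return false; // nothing to compare against yet
        }
        return current > mean(history) + k * standardDeviation(history);
    }

    public static void main(String[] args) {
        int[] itemsPerSlot = {10, 12, 8, 10};
        System.out.println(shouldAlert(itemsPerSlot, 30, 3.0));
        System.out.println(shouldAlert(itemsPerSlot, 12, 3.0));
    }
}
```

The multiplier k trades sensitivity for false alarms; the actions listed above (email, RSS feed, tweet) would be triggered whenever shouldAlert returns true.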
project idea IV
• create a geo-topic detector
– monitor geotagged Items (around a given list of bounding
boxes)
– find trending topics per location
– monitor these locations for a longer time
– find persistent topics per location
– find unique topics per location (i.e. topics that do not
appear in other locations)
– visualize the results on a web UI
– See: http://trendsmap.com/
project idea V
• create a twitter account profiler
– monitor a set of selected twitter accounts
– analyze tweets from these accounts with respect to
keywords and shared URLs
– categorize tweets by these accounts (e.g. economy,
politics, sports, etc.)
– create topic profiles for each account (e.g. user X: 10%
sports, 60% economy, 30% politics)
– create a user profile search engine (e.g. “give me accounts
that mostly discuss sports”)
– See: http://wefollow.com/
Thank You!
Contact: papadop@iti.gr, @sympapadopoulos
Check out: https://github.com/socialsensor/, http://www.slideshare.net/sympapadopoulos/
Acknowledgements
references
• L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, A. Jaimes. "Sensing trending topics in Twitter." IEEE Transactions on Multimedia 15(6), 2013: 1268-1282.
• SNOW 2014 Data Challenge Proceedings: http://ceur-ws.org/Vol-1150/
• T. Steiner. "A meteoroid on steroids: ranking media items stemming from multiple social networks." In Proceedings of the 22nd International Conference on World Wide Web Companion, 2013: 31-34. http://www2013.org/companion/p31.pdf

 
Twitter-based Sensing of City-level Air Quality
Twitter-based Sensing of City-level Air QualityTwitter-based Sensing of City-level Air Quality
Twitter-based Sensing of City-level Air QualitySymeon Papadopoulos
 
Aggregating and Analyzing the Context of Social Media Content
Aggregating and Analyzing the Context of Social Media ContentAggregating and Analyzing the Context of Social Media Content
Aggregating and Analyzing the Context of Social Media ContentSymeon Papadopoulos
 
Verifying Multimedia Content on the Internet
Verifying Multimedia Content on the InternetVerifying Multimedia Content on the Internet
Verifying Multimedia Content on the InternetSymeon Papadopoulos
 
A Web-based Service for Image Tampering Detection
A Web-based Service for Image Tampering DetectionA Web-based Service for Image Tampering Detection
A Web-based Service for Image Tampering DetectionSymeon Papadopoulos
 
Learning to detect Misleading Content on Twitter
Learning to detect Misleading Content on TwitterLearning to detect Misleading Content on Twitter
Learning to detect Misleading Content on TwitterSymeon Papadopoulos
 
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersSymeon Papadopoulos
 
Verifying Multimedia Use at MediaEval 2016
Verifying Multimedia Use at MediaEval 2016Verifying Multimedia Use at MediaEval 2016
Verifying Multimedia Use at MediaEval 2016Symeon Papadopoulos
 
Placing Images with Refined Language Models and Similarity Search with PCA-re...
Placing Images with Refined Language Models and Similarity Search with PCA-re...Placing Images with Refined Language Models and Similarity Search with PCA-re...
Placing Images with Refined Language Models and Similarity Search with PCA-re...Symeon Papadopoulos
 
In-depth Exploration of Geotagging Performance
In-depth Exploration of Geotagging PerformanceIn-depth Exploration of Geotagging Performance
In-depth Exploration of Geotagging PerformanceSymeon Papadopoulos
 
Perceived versus Actual Predictability of Personal Information in Social Netw...
Perceived versus Actual Predictability of Personal Information in Social Netw...Perceived versus Actual Predictability of Personal Information in Social Netw...
Perceived versus Actual Predictability of Personal Information in Social Netw...Symeon Papadopoulos
 
Web and Social Media Image Forensics for News Professionals
Web and Social Media Image Forensics for News ProfessionalsWeb and Social Media Image Forensics for News Professionals
Web and Social Media Image Forensics for News ProfessionalsSymeon Papadopoulos
 
Predicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online DiscussionsPredicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online DiscussionsSymeon Papadopoulos
 
Finding Diverse Social Images at MediaEval 2015
Finding Diverse Social Images at MediaEval 2015Finding Diverse Social Images at MediaEval 2015
Finding Diverse Social Images at MediaEval 2015Symeon Papadopoulos
 

Mais de Symeon Papadopoulos (20)

DeepFake Detection: Challenges, Progress and Hands-on Demonstration of Techno...
DeepFake Detection: Challenges, Progress and Hands-on Demonstration of Techno...DeepFake Detection: Challenges, Progress and Hands-on Demonstration of Techno...
DeepFake Detection: Challenges, Progress and Hands-on Demonstration of Techno...
 
Deepfakes: An Emerging Internet Threat and their Detection
Deepfakes: An Emerging Internet Threat and their DetectionDeepfakes: An Emerging Internet Threat and their Detection
Deepfakes: An Emerging Internet Threat and their Detection
 
Knowledge-based Fusion for Image Tampering Localization
Knowledge-based Fusion for Image Tampering LocalizationKnowledge-based Fusion for Image Tampering Localization
Knowledge-based Fusion for Image Tampering Localization
 
Deepfake Detection: The Importance of Training Data Preprocessing and Practic...
Deepfake Detection: The Importance of Training Data Preprocessing and Practic...Deepfake Detection: The Importance of Training Data Preprocessing and Practic...
Deepfake Detection: The Importance of Training Data Preprocessing and Practic...
 
COVID-19 Infodemic vs Contact Tracing
COVID-19 Infodemic vs Contact TracingCOVID-19 Infodemic vs Contact Tracing
COVID-19 Infodemic vs Contact Tracing
 
Similarity-based retrieval of multimedia content
Similarity-based retrieval of multimedia contentSimilarity-based retrieval of multimedia content
Similarity-based retrieval of multimedia content
 
Twitter-based Sensing of City-level Air Quality
Twitter-based Sensing of City-level Air QualityTwitter-based Sensing of City-level Air Quality
Twitter-based Sensing of City-level Air Quality
 
Aggregating and Analyzing the Context of Social Media Content
Aggregating and Analyzing the Context of Social Media ContentAggregating and Analyzing the Context of Social Media Content
Aggregating and Analyzing the Context of Social Media Content
 
Verifying Multimedia Content on the Internet
Verifying Multimedia Content on the InternetVerifying Multimedia Content on the Internet
Verifying Multimedia Content on the Internet
 
A Web-based Service for Image Tampering Detection
A Web-based Service for Image Tampering DetectionA Web-based Service for Image Tampering Detection
A Web-based Service for Image Tampering Detection
 
Learning to detect Misleading Content on Twitter
Learning to detect Misleading Content on TwitterLearning to detect Misleading Content on Twitter
Learning to detect Misleading Content on Twitter
 
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN LayersNear-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers
 
Verifying Multimedia Use at MediaEval 2016
Verifying Multimedia Use at MediaEval 2016Verifying Multimedia Use at MediaEval 2016
Verifying Multimedia Use at MediaEval 2016
 
Multimedia Privacy
Multimedia PrivacyMultimedia Privacy
Multimedia Privacy
 
Placing Images with Refined Language Models and Similarity Search with PCA-re...
Placing Images with Refined Language Models and Similarity Search with PCA-re...Placing Images with Refined Language Models and Similarity Search with PCA-re...
Placing Images with Refined Language Models and Similarity Search with PCA-re...
 
In-depth Exploration of Geotagging Performance
In-depth Exploration of Geotagging PerformanceIn-depth Exploration of Geotagging Performance
In-depth Exploration of Geotagging Performance
 
Perceived versus Actual Predictability of Personal Information in Social Netw...
Perceived versus Actual Predictability of Personal Information in Social Netw...Perceived versus Actual Predictability of Personal Information in Social Netw...
Perceived versus Actual Predictability of Personal Information in Social Netw...
 
Web and Social Media Image Forensics for News Professionals
Web and Social Media Image Forensics for News ProfessionalsWeb and Social Media Image Forensics for News Professionals
Web and Social Media Image Forensics for News Professionals
 
Predicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online DiscussionsPredicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online Discussions
 
Finding Diverse Social Images at MediaEval 2015
Finding Diverse Social Images at MediaEval 2015Finding Diverse Social Images at MediaEval 2015
Finding Diverse Social Images at MediaEval 2015
 

Último

Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

Social Media Crawling & Mining Seminar

  • 1. Lecture @ International Hellenic University Thessaloniki, 8 May 2014 Social Media Crawling and Mining Overview of Hands-on Workshop Symeon (Akis) Papadopoulos, Manos Schinas, Katerina Iliakopoulou, Yiannis Kompatsiaris Information Technologies Institute (ITI) Centre for Research & Technologies Hellas (CERTH)
  • 2. IHU SocialSensor Seminar – May 2014 CERTH-ITI: workshop objectives
      • get a glimpse into the problem of social media monitoring and mining
      • get a feel for the nature of social data
      • understand the basic operations/tasks involved
      • look into the underlying research problems
      • gain some practical experience with new database technologies (MongoDB, Solr)
      • be motivated to explore further problems – become experts
  • 4. additional material
      • https://www.dropbox.com/s/ip0ah4u5r5mqi6i/dbs.zip
        http://bit.ly/1nkyfQ7 (~230MB)
      • https://www.dropbox.com/s/d0d3586fxqtlx4u/ihu-material.zip
        http://bit.ly/1uEfo4t (~180MB)
  • 5. problem setting
      • input: streams of content from social sources
      • output: statistics/insights
      operations:
      • data management (pre-processing, indexing)
      • data access/retrieval
      • mining:
          – basic analytics
          – trend detection
  • 6. basic concepts I
      • Item:
          – The unit of social content: it can refer to a tweet, a Facebook post, a Google+ post, an Instagram post, etc.
          – It is typically associated with some attributes: ID, title, description, tags, contributor (referred to here as StreamUser), publication date, etc.
          – Some attributes are social network-specific: location, retweets, likes, shares, etc.
      • MediaItem:
          – For Items that point to multimedia content (image/video), we use a special type of object, called MediaItem, to store the associated URL of the image/video along with additional attributes (e.g. image size, video duration, etc.).
          – Depending on the social network, an Item may have no associated MediaItem (e.g. a Facebook post) or must always be associated with one (e.g. a YouTube video).
      • Webpage:
          – Represents a webpage. It is linked to the Item that shared it and may contain one or more MediaItems.
  • 7. basic concepts I (diagram): the entities Item, MediaItem, Webpage and StreamUser, connected by the relations "shares/creates", "links/contains", "contains" and "links", annotated with 1 and N cardinalities.
  • 8. basic concepts II
      • crawling:
          – Typically this refers to a process that "explores" the Web by starting from a seed set of webpages and recursively following the links they contain.
          – In a social media context, we use crawling in a looser sense to mean "focused collection of content shared by social network users".
          – Collection from social media can be done in many ways. In this workshop, we will use the paradigm of the "stream manager".
  • 9. basic concepts III
      • stream manager:
          – This is a process that continuously retrieves Items from a social source based on a given configuration. The collected Items, along with the embedded MediaItems and Webpages, can be stored in the selected databases for further processing.
          – For instance, in the case of Twitter the configuration may specify a set of keywords/users and/or locations to track (as supported by the Streaming API). The configuration options vary depending on the social network in question.
          – Once a new Item is obtained, its further handling may include: a) storing it in a DB, b) indexing its text, c) extracting the URLs and/or MediaItems it contains, d) further analysis (e.g. sentiment detection), etc.
          – For this workshop we will restrict ourselves to Twitter data.
  • 10. basic concepts IV
      • indexing:
          – There are different types of indices that serve different access requirements.
          – Here, we will deal with the following types of indices:
              • Full-text index: This supports free text queries and will be based on the Solr framework.
              • Numerical value index: This supports interval/threshold queries (e.g. retrieve all tweets with more than 100 retweets) and is applied on numerical (int/double) fields. The same index can be used for temporal filtering if the Unix timestamp is indexed.
              • Text similarity index: This supports similarity-based queries using the whole text of an input Item, e.g. bring the most similar tweets. Our implementation relies on Locality Sensitive Hashing (LSH).
              • Visual similarity index: This supports search by example using the visual content of an image (will not be used in this workshop).
  • 11. basic concepts V
      • mining:
          – Mining refers to the processing of multiple Items with the goal of extracting insights and higher-level conclusions about the data collection, and ultimately about the real world (e.g. understanding which candidate is more popular).
          – In this workshop, we will deal with two popular and relatively straightforward mining problems:
              • basic analytics: this involves the computation of basic aggregate statistics and the extraction of the most "important" objects in a given dataset, e.g. the most important contributors, hashtags, Items.
              • trend detection: this involves the detection of keywords or phrases that attract increasing interest during a specific interval. Those may refer to news stories, events, persons (e.g. celebrities), memes, etc. We often refer to those as topics.
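As a small illustration of the "basic analytics" task above, the following Java sketch counts hashtags over a list of tweet texts and returns the most frequent ones. The class and method names (`HashtagStats`, `countHashtags`, `topN`) are hypothetical and not part of the workshop code; hashtags are found by a plain whitespace split, so a tag followed by punctuation is counted as a separate token.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the workshop code base.
class HashtagStats {

    // Counts lower-cased tokens that start with '#' (whitespace split only).
    static Map<String, Integer> countHashtags(List<String> tweets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tweet : tweets) {
            for (String token : tweet.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.length() > 1 && token.startsWith("#")) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns up to n hashtags by descending count; ties are broken alphabetically.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

The same counting pattern would apply to contributors or retweeted Items by swapping the key that is extracted from each tweet.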
  • 12. overview of architecture (diagram)
      • components: Stream Manager, Mongo DAO, Solr Handler, mongo, Solr, basic analytics, trend detection
      • stages: crawling, indexing, mining
      • configuration files: input.conf.xml (crawling configuration), streams.conf.xml (DB & OSN credentials configuration)
  • 13. useful techniques I: tokenization
      • Split an input text into lexical units (tokens)
          – useful for indexing (happens behind the scenes)
          – useful for trending keyword detection
      • Most crude implementation: String[] tokens = message.split(" ");
      • Standard implementations available in Solr, e.g.:
          – Standard Tokenizer
          – Letter Tokenizer
          – N-Gram Tokenizer (emits character n-grams; for n-grams in the sense of sequences of N tokens, Solr offers the Shingle Filter)
      • For Twitter, Twokenizer is popular.
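A slightly less crude tokenizer than a split on single spaces can be sketched in plain Java as follows. This is only an illustration under simple assumptions (lower-casing, keeping `#` and `@` so that hashtags and mentions survive); the class name `SimpleTokenizer` is hypothetical and this is neither the Solr Standard Tokenizer nor Twokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the workshop code base.
class SimpleTokenizer {

    // Lower-cases the text and splits on runs of characters that are not
    // letters, digits, '#', '@' or '_', so hashtags and mentions stay intact.
    static List<String> tokenize(String message) {
        List<String> tokens = new ArrayList<>();
        String[] parts = message.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}#@_]+");
        for (String part : parts) {
            if (!part.isEmpty()) { // a leading separator yields one empty string
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

Unlike `message.split(" ")`, this drops punctuation and does not produce empty tokens for repeated spaces; URLs, however, are broken into pieces, which a Twitter-specific tokenizer would avoid.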
  • 14. useful techniques II: entity detection
      • entity detection (or named entity detection) is the marking of particular tokens that refer to things like persons, organizations (e.g. companies), locations
          – useful for giving more importance to these tokens
          – useful for filtering noise (e.g. tweets that contain no entities)
          – named entities make good query terms (e.g. for retrieving content from external sources)
      • standard implementations are available
          – perhaps the most popular is the Stanford NER library
          – others include Balie, GATE, OpenNLP
  • 15. useful techniques III: LSH (a)
      • Locality Sensitive Hashing (LSH) is a set of probabilistic methods for the hashing of high-dimensional data.
      • The basic idea is that items that are similar according to some metric are hashed to the same value with high probability.
      • Hash functions are available for several distance measures: Jaccard Coefficient (MinHash), L1 and L2 distance, Cosine Similarity (random projection).
      • Random projection: For an input vector u of length d, and using K random d-dimensional vectors (with K<<d), we create a signature of length K. Each bit of the signature is the sign of the dot product of u with one of the random vectors, e.g. for K=4: hash(u) = {1, 0, 1, 1}
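The random projection scheme can be sketched in a few lines of Java. This is a minimal illustration, not the workshop's implementation: the class name `RandomProjectionHash`, the Gaussian sampling of the random vectors and the fixed seed are assumptions made for the example.

```java
import java.util.Random;

// Hypothetical sketch of random-projection LSH for cosine similarity.
class RandomProjectionHash {
    private final double[][] planes; // K random d-dimensional vectors

    RandomProjectionHash(int k, int d, long seed) {
        Random rnd = new Random(seed);
        planes = new double[k][d];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                planes[i][j] = rnd.nextGaussian();
            }
        }
    }

    // Bit i is 1 if the dot product of u with random vector i is non-negative.
    int[] signature(double[] u) {
        int[] bits = new int[planes.length];
        for (int i = 0; i < planes.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < u.length; j++) {
                dot += planes[i][j] * u[j];
            }
            bits[i] = dot >= 0 ? 1 : 0;
        }
        return bits;
    }
}
```

Because only the sign of each dot product is kept, scaling a vector by a positive constant leaves its signature unchanged, which is what makes the scheme suitable for cosine similarity.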
  • 16. useful techniques III: LSH (b)
      Applications
      • Approximate Nearest-Neighbour search
          – Typical values for K are 4, 6, 8, which give 2^K = 16, 64, 256 unique signatures. Given this partitioning of the input collection, instead of searching the whole dataset for the nearest item, one may search only in the subset of items with the same signature.
          – To increase recall, one could use L different hash functions and merge the results.
      • Near Duplicate Detection
          – Items with the same or similar signature are considered near duplicates. For this to be more precise, we need higher values of K, e.g. K=12, which gives 2^12 = 4,096 unique signatures.
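The bucket lookup behind approximate nearest-neighbour search can be sketched as below. It assumes each Item already has a K-bit signature (however it was computed); the class `LshBuckets` and its methods are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: groups item IDs by their LSH signature.
class LshBuckets {
    private final Map<String, List<String>> buckets = new HashMap<>();

    // Turns a bit signature such as {1, 0, 1, 1} into the bucket key "1011".
    static String key(int[] signature) {
        StringBuilder sb = new StringBuilder();
        for (int bit : signature) {
            sb.append(bit);
        }
        return sb.toString();
    }

    void add(String itemId, int[] signature) {
        buckets.computeIfAbsent(key(signature), k -> new ArrayList<>()).add(itemId);
    }

    // Only items in the same bucket are candidates for the exact comparison.
    List<String> candidates(int[] signature) {
        return buckets.getOrDefault(key(signature), Collections.emptyList());
    }

    // A K-bit signature can take at most 2^K distinct values.
    static long maxBuckets(int k) {
        return 1L << k;
    }
}
```

Using L independent hash functions, as suggested above for higher recall, would amount to keeping L such tables and merging the candidate lists.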
  • 17. useful techniques IV: querying mongo (a)
      • MongoDB stores data in the form of documents, which are JSON-like field and value pairs.
  • 18. useful techniques IV: querying mongo (b)
  • 19. useful techniques V: querying Solr (a)
      • For Solr, each Item is a document with a set of indexed fields, which can be queried in combination.
      • Example:
          String query = "(title:Crimea) OR (description:Crimea)";
          SolrQuery solrQuery = new SolrQuery(query);
      • Number of results & sorting:
          solrQuery.setRows(100);
          solrQuery.addSortField("publicationTime", ORDER.desc);
  • 20. useful techniques V: querying Solr (b)
      Additional examples:
      • Restrict to a selected time period (the extra parentheses make the time range apply to both fields):
          String query = "((title:Crimea) OR (description:Crimea)) AND publicationTime:[minDateTime (long value) TO maxDateTime (long value)]";
      • Retrieve results in the order of descending retweets:
          solrQuery.addSortField("retweetsCount", ORDER.desc);
      • Retrieve only those Items with at least 100 retweets (square brackets denote an inclusive range):
          solrQuery.addFilterQuery("retweetsCount:[100 TO *]");
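The query strings above can be assembled with a small helper, sketched here in plain Java without any Solr dependency. The class and method names are hypothetical; the field names (`title`, `description`, `publicationTime`, `retweetsCount`) are the ones used on the slides, and the keyword is assumed to be a single term without special characters, since no escaping is done.

```java
// Hypothetical helper that only builds query strings; it does not talk to Solr.
class SolrQueryStrings {

    // Builds: ((title:kw) OR (description:kw)) AND publicationTime:[min TO max]
    // The outer parentheses make the time range apply to both fields.
    static String keywordInPeriod(String keyword, long minTime, long maxTime) {
        return "((title:" + keyword + ") OR (description:" + keyword + "))"
                + " AND publicationTime:[" + minTime + " TO " + maxTime + "]";
    }

    // Builds an inclusive lower-bound filter, e.g. retweetsCount:[100 TO *]
    static String minRetweetsFilter(int minRetweets) {
        return "retweetsCount:[" + minRetweets + " TO *]";
    }
}
```

The first string would be passed to the `SolrQuery` constructor and the second to `addFilterQuery`, as in the slide examples.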
  • 21. useful techniques V: querying Solr (c)
  • 22. topic detection: basics
      • Trending topic: A news story or discussion that attracts a lot of activity during a given time interval.
      • How is it represented?
          – Headline ("Bomb explosion in the Z Embassy of X.")
          – Set of keywords ("bomb", "explosion", "embassy", etc.)
          – N-grams ("bomb explosion", "Z Embassy", etc.)
          – Set of characteristic Items and possibly MediaItems
      • There are different categories of topic detection methods:
          – document-pivot: Cluster incoming Items into groups referring to the same topic and try to extract a topic representation by processing the group Items.
          – feature-pivot: Try to extract frequently occurring or trending keywords or phrases that are supposed to correspond to topics.
          – probabilistic-generative: Try to find a generative topic model that underlies the topic distribution on the set of documents (e.g. LDA).
  • 23. topic detection: document-pivot
      • Objective: Cluster together Items referring to the same topic.
      • Simple technique (incremental clustering):
          – Compare each incoming Item with the existing topic clusters; if the highest similarity exceeds a given threshold, assign the Item to the most similar cluster, otherwise create a new topic.
          – Two issues: a) threshold selection, b) what to compare with: i) a representative Item per cluster, ii) an aggregate cluster representation (e.g. centroid)
      • Other conventional clustering techniques could be tried out: k-means, DBSCAN, hierarchical agglomerative clustering, but most of them are not easy to apply incrementally.
      • Another challenge stems from the short length of tweets: similarity between two tweets is high only when they are practically near-duplicates. This leads to the well-known problem of topic fragmentation.
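The incremental clustering idea can be sketched as follows. This is a deliberately simple illustration under stated assumptions: similarity is the Jaccard coefficient over whitespace-separated, lower-cased tokens, and each cluster is represented by the first Item assigned to it (option i above). The class name `IncrementalClusterer` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of threshold-based incremental clustering.
class IncrementalClusterer {
    private final double threshold;
    // One representative token set per cluster (the first Item assigned to it).
    private final List<Set<String>> representatives = new ArrayList<>();

    IncrementalClusterer(double threshold) {
        this.threshold = threshold;
    }

    // Jaccard coefficient: |A intersect B| / |A union B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    // Assigns the text to the most similar cluster if the similarity exceeds
    // the threshold, otherwise opens a new cluster. Returns the cluster index.
    int assign(String text) {
        Set<String> tokens =
                new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
        int best = -1;
        double bestSim = 0.0;
        for (int i = 0; i < representatives.size(); i++) {
            double sim = jaccard(tokens, representatives.get(i));
            if (sim > bestSim) {
                bestSim = sim;
                best = i;
            }
        }
        if (best >= 0 && bestSim > threshold) {
            return best;
        }
        representatives.add(tokens);
        return representatives.size() - 1;
    }
}
```

With short texts such as tweets, even a moderate threshold tends to group only near-duplicates, which illustrates the fragmentation problem mentioned above.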
  • 24. topic detection: feature-pivot
      • Objective: Try to find terms or phrases that appear very prominently in the dataset.
      • Simple technique:
          – Select the N most frequent keywords or hashtags.
          – Problem-1: If you consider all keywords, you'll end up with too many generic words or very broad topics.
          – Problem-2: A single keyword is often insufficient to describe a topic accurately.
      • More advanced technique:
          – Select those keywords and phrases that are used more frequently now (in the current time slot) compared to the previous one, cf. BN-gram approach (Aiello et al., 2013).
          – Need to compute n-gram frequencies per time slot (use Solr n-gram analysis, i.e. the N-Gram Tokenizer or, for word-level n-grams, the Shingle Filter, and retrieve the most frequent n-grams).
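A very simple form of bursty keyword detection compares term counts between the previous and the current time slot. The sketch below uses a smoothed frequency ratio as the score; this is an assumption made for illustration and is not the scoring used by the BN-gram approach. The class name `BurstyKeywords` and the parameters `minRatio` and `minCount` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: flags terms whose count grew sharply between two slots.
class BurstyKeywords {

    // A term is bursty if it occurs at least minCount times in the current slot
    // and (current + 1) / (previous + 1) >= minRatio. The +1 smoothing avoids
    // division by zero for terms unseen in the previous slot.
    static List<String> bursty(Map<String, Integer> previous,
                               Map<String, Integer> current,
                               double minRatio, int minCount) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            int now = e.getValue();
            int before = previous.getOrDefault(e.getKey(), 0);
            double ratio = (now + 1.0) / (before + 1.0);
            if (now >= minCount && ratio >= minRatio) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result); // deterministic, alphabetical order
        return result;
    }
}
```

The `minCount` cut-off addresses Problem-1 only partially: generic words are filtered out because their counts are stable across slots, while rare terms are dropped because a large ratio on a handful of occurrences is unreliable.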
  • 25. dataset: SNOW 2014 Data Challenge
    • A set of ~1M tweets collected using a list of 5000 UK-focused “news hounds” and the keywords “Syria”, “terror”, “Ukraine”, and “bitcoin” for a period of 24 hours starting from Feb 25, 18:00.
    • Average rate: ~720 tweets/minute
    • Number of unique twitter accounts: ~556K
    • Number of retweets: ~648K
    • Number of replies: ~135K
    • Ground truth topics: http://figshare.com/articles/SNOW_2014_Data_Challenge/1003755
  • 26. overview of hands-on workshop
    • introduction (30-40 mins)
      – how to use the stream manager
      – how to index and query content in Solr and mongoDB
      – how to import an existing dataset in the system
    • basic analytics (~45 mins)
      – how to compute and maintain statistics for most active-influential users, top hashtags, top tweets
      – how to create activity timelines
    • trend detection (~90 mins)
      – existing solutions: a) document-pivot, b) keywords-based
      – work on own implementation: bursty keyword detection
    • future work (~10 mins)
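As a rough idea of what the basic analytics part involves, the sketch below computes top hashtags, most active users, and a per-minute activity timeline in one pass. It is an in-memory illustration only; the workshop itself works against mongoDB and Solr. The Item layout (a dict with `user`, `text`, and `time` keys, where `time` is a `datetime`) and the function name are assumptions for this example.

```python
import re
from collections import Counter
from datetime import datetime

HASHTAG = re.compile(r"#\w+")


def basic_stats(items, top_k=3):
    """Compute top hashtags, most active users and a per-minute timeline."""
    hashtags, users, timeline = Counter(), Counter(), Counter()
    for item in items:
        # Hashtags are compared case-insensitively.
        hashtags.update(tag.lower() for tag in HASHTAG.findall(item["text"]))
        users[item["user"]] += 1
        # Truncate the timestamp to the minute to obtain the timeline bucket.
        minute = item["time"].replace(second=0, microsecond=0)
        timeline[minute] += 1
    return {
        "top_hashtags": hashtags.most_common(top_k),
        "most_active_users": users.most_common(top_k),
        "timeline": sorted(timeline.items()),
    }
```

Since all three statistics are plain counters, they can be updated incrementally as new Items arrive instead of being recomputed from scratch.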
  • 27. our tutors
    • Manos Schinas (manosetro@iti.gr)
    • Katerina Iliakopoulou (ailiakop@iti.gr)
  • 28. Questions?
  • 29. project idea I
    • improve topic detection
      – read literature (cf. references at the end)
      – make use of external sources as input
        • RSS feeds
        • Gazetteer of named entities (e.g. Wikipedia) to detect tokens that refer to persons, locations, organizations
      – make use of machine learning
        • train classifiers that can separate high-quality Items from noise
        • train classifiers that can separate trustworthy sources (StreamUsers) from less reliable ones
        • how to create training set?
  • 30. project idea II
    • integrate additional sources
      – there are several wrappers available for collecting content from multiple social networks
      – try to programmatically use the wrappers to collect social content around events of interest (e.g. get lists of RSS feeds as input)
      – present collected content in a Web UI
      – cf. “meteoroid on steroids” paper (References)
  • 31. project idea III
    • create an alerting system
      – monitor a set of keywords that relate to specific types of events, e.g. fires, explosions, etc. [try also with Greek keywords]
      – check whether collected Items indeed refer to these events or are irrelevant
        • use topic models (needs training)
        • use writing style rules (to check quality of writing)
      – if the number of Items is considerably larger than the mean value, set alarm!
        • automatically send email (also include sample Items)
        • create an RSS feed
        • automatically tweet about it, see for instance @WikiLiveMon
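One possible reading of "considerably larger than the mean value" is a threshold of the mean plus a multiple of the standard deviation of past counts. The sketch below assumes that interpretation; the function name, the factor `k`, and the minimum history length are hypothetical choices, not part of the slides.

```python
from statistics import mean, pstdev


def should_alert(history, current_count, k=3.0, min_history=5):
    """Return True when current_count exceeds mean(history) + k * std(history).

    history holds the number of matching Items in previous time slots.
    """
    if len(history) < min_history:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    # Use a floor of 1.0 so a perfectly flat history does not trigger
    # an alarm on a change of a single Item.
    spread = max(pstdev(history), 1.0)
    return current_count > baseline + k * spread
```

A call such as `should_alert(counts_last_hours, count_this_hour)` would then decide whether to send the email, update the RSS feed, or post the tweet.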
  • 32. project idea IV
    • create a geo-topic detector
      – monitor geotagged Items (around a given list of bounding boxes)
      – find trending topics per location
      – monitor these locations for a longer time
      – find persistent topics per location
      – find unique topics per location (i.e. topics that do not appear in other locations)
      – visualize the results on a web UI
      – See: http://trendsmap.com/
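The bounding-box assignment and the "unique topics per location" step could look roughly like this. The sketch uses hashtags as a stand-in for topics and assumes each Item is a dict with `lat`, `lon`, and `hashtags` keys; boxes are given as `(min_lat, min_lon, max_lat, max_lon)` tuples. These names and the layout are illustrative assumptions, and boxes crossing the 180° meridian are not handled.

```python
from collections import Counter, defaultdict


def locate(lat, lon, boxes):
    """Return the name of the first bounding box containing the point, else None."""
    for name, (min_lat, min_lon, max_lat, max_lon) in boxes.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return name
    return None


def hashtags_per_location(items, boxes):
    """Count hashtags separately for every monitored bounding box."""
    counts = defaultdict(Counter)
    for item in items:
        place = locate(item["lat"], item["lon"], boxes)
        if place is not None:  # Items outside all boxes are ignored
            counts[place].update(item["hashtags"])
    return counts


def unique_hashtags(counts):
    """For each location, list the hashtags that appear in no other location."""
    unique = {}
    for place, counter in counts.items():
        elsewhere = set()
        for other, other_counter in counts.items():
            if other != place:
                elsewhere.update(other_counter)
        unique[place] = sorted(set(counter) - elsewhere)
    return unique
```

Trending topics per location can reuse any of the trend detection methods discussed earlier, applied to each location's Items separately.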
  • 33. project idea V
    • create a twitter account profiler
      – monitor a set of selected twitter accounts
      – analyze tweets from these accounts with respect to keywords and shared URLs
      – categorize tweets by these accounts (e.g. economy, politics, sports, etc.)
      – create topic profiles for each account (e.g. user X → 10% sports, 60% economy, 30% politics)
      – create a user profile search engine (e.g. “give me accounts that mostly discuss sports”)
      – See: http://wefollow.com/
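A topic profile of the kind shown above (user X → 10% sports, 60% economy, 30% politics) can be derived from per-tweet categories. The sketch below uses a deliberately naive keyword lexicon for the categorization step; a trained classifier would replace it in a real project. The lexicon format and both function names are assumptions for this example.

```python
import re
from collections import Counter


def categorize(text, lexicon):
    """Return the category whose keywords occur most often in the text, or None."""
    if not lexicon:
        return None
    tokens = re.findall(r"\w+", text.lower())
    hits = Counter()
    for category, keywords in lexicon.items():
        hits[category] = sum(1 for token in tokens if token in keywords)
    category, count = hits.most_common(1)[0]
    return category if count > 0 else None


def topic_profile(tweets, lexicon):
    """Fraction of categorized tweets per category for one account."""
    labels = [categorize(tweet, lexicon) for tweet in tweets]
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}
```

With one profile per account, the search engine idea reduces to sorting accounts by the fraction stored for the requested category.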
  • 34. Thank You!
    • Contact: papadop@iti.gr, @sympapadopoulos
    • Check out: https://github.com/socialsensor/ and http://www.slideshare.net/sympapadopoulos/
    • Acknowledgements
  • 35. references
    • L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, A. Jaimes. “Sensing trending topics in Twitter.” IEEE Transactions on Multimedia 15(6), 2013: 1268-1282.
    • SNOW 2014 Data Challenge Proceedings: http://ceur-ws.org/Vol-1150/
    • T. Steiner. “A meteoroid on steroids: ranking media items stemming from multiple social networks.” In Proceedings of the 22nd International Conference on World Wide Web Companion (pp. 31-34), 2013. http://www2013.org/companion/p31.pdf