This presentation provided an overview of research being conducted at the Apex Lab at Shanghai Jiao Tong University. It began with introductions to Shanghai Jiao Tong University and the Apex Lab itself. It then discussed the lab's research areas including traditional web search, social web search, semantic web search, and machine learning. Specific projects in each area were briefly described. The presentation concluded with a demo of Hermes, a semantic search engine designed to handle billions of triples from heterogeneous data sources on the open web.
9. Apex Lab
Ph.D. Students
Haofen Wang
Jing Lu
Jia Chen
Guangcan Liu
Xian Wu
Yunbo Cao
Ruihua Song
35 Master Students
10. Agenda
Introduction to SJTU
Introduction to Apex Lab
Research
Demo
11. Research
Traditional Web
Social Web
Semantic Web
Machine Learning
13. Search on Traditional Web
Focus on how to
improve search relevance? rank pages?
integrate mining technologies into search?
search finer grained objects instead of documents?
Search Applications
General search engine
Vertical search engine
Meta search engine
15. Expert Search (introduction)
Treat web page as bag of words
Queries are not fully understood
16. Expert Search (motivation)
Searching for Experts:
• A more and more important information need
• PM search for Dev
• Patient search for Doctor
• Student search for Professor
• ……
• Not only in Enterprise
• But also on WWW
17. Query -> Ranked List of Experts
An evidence: an expert and a query co-occur in a document under a certain relation constraint
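The evidence idea above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's method: the function name rank_experts, the token-window constraint, and the count-based score are all assumptions standing in for the "relation constraint" the slide mentions.

```python
from collections import Counter


def rank_experts(documents, experts, query, window=10):
    """Rank experts by how often they co-occur with all query terms.

    Hypothetical relation constraint: every query term must appear within
    `window` tokens of a mention of the expert's name.
    """
    query_terms = query.lower().split()
    scores = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for expert in experts:
            name = expert.lower().split()
            for start in range(len(tokens) - len(name) + 1):
                if tokens[start:start + len(name)] != name:
                    continue
                lo = max(0, start - window)
                hi = start + len(name) + window
                context = tokens[lo:hi]
                # One piece of evidence: expert and query co-occur nearby.
                if all(term in context for term in query_terms):
                    scores[expert] += 1
    return [expert for expert, _ in scores.most_common()]


docs = [
    "Alice Smith leads semantic search research",
    "Bob Lee maintains the build system",
    "semantic search talk by Alice Smith",
]
ranking = rank_experts(docs, ["Alice Smith", "Bob Lee"], "semantic search")
```

Experts with no evidence are left out of the ranking; a real system would also weight evidence by document quality and by the type of relation.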
18. Research
Traditional Web
Social Web
Semantic Web
Machine Learning
19. The Emergence of Web 2.0
Web gets social
Web 1.0 -> Web 2.0
Publishing -> Participation
Personal Websites -> Blogging
Content Management Systems -> Wikis
Britannica Online -> Wikipedia
Directories (taxonomy) -> Tagging ("folksonomy")
Lower the barrier for contribution.
More people are involved. They are less professional.
20. Search on Web 2.0
Focus on how to
exploit user-contributed data?
search on new social media?
23. Emotion Analysis on the Blog
Blogs can be a source of news, but also a stage for expressing emotion
Enhance blog search for different users
24. Blog Search
Informative article
News that is similar to the news on traditional news websites
Technical descriptions, e.g. programming techniques
Commonsense knowledge
Objective comments on the events in the world
Affective article
Diaries about personal affairs
Self-feelings or self-emotions descriptions
26. Intent-driven blog search (WWW 2007)
Rank | Informative Sense | Snippet
1 | 1.00 | The catalogue of IBM certification: DB2 Database Administrator DB2 Application Developer MQSeries Engineer VisualAge For Java …
2 | -0.94 | Crazy Me! I have hesitated between Acer and smuggled IBM for one week. I wouldn’t have taken into account the price, quality or service if I had enough money …
3 | 1.00 | Selling IBM laptop, t22p3-900, , dvd S3/, independent accelerating display card. 3550 YUAN. (Post fee not included). Please contact 30316255. We guarantee the quality. This product is only sold within Tianjing city ...
4 | -0.35 | I got a laptop from my friend this week. Although outdated, it is still a classical one in IBM enthusiast’s mind. There are many second hand IBM laptops in the market. Although I have sold many IBM laptops …
5 | -0.53 | Doctor said that I should make more preparations mentally. You have stayed with me for three years, leaving without any words. Do you feel fair for me? Do you remember the moments we were together? You are heartless, I hate you! ...
27.
Rank | Informative Sense | Snippet
1 | 1.00 | The catalogue of IBM certification: DB2 Database Administrator DB2 Application Developer MQSeries Engineer VisualAge For Java …
2 | 1.00 | Biao Lin is a military talent. Stalin called him “the gifted general”. Americans called him “the unbeaten general”. Chiang Kai-shek called him “devil of war”. Biao Lin is a special person in modern history …
3 | 0.99 | Microsoft’s hotmail can only be registered with suffix “@hotmail.com” by default. You can register @msn.com by visiting…
4 | 0.95 | Yi Shang is still sending the file to me. I will practice it later. 1. Start up Instance (db2inst1) db2start; 2. Stop Instance (db2inst1) db2stop …
5 | 0.84 | Name: Lei Zhang. Student number: 5030309959. Class number: 007. The analysis and review about the tendency of Jilin Chemical Industry’s stock in 2005. Date, Increasing and Decreasing ranges, Open Price, Close Price, Amount of deals …
6 | 0.01 | Recently I like reading the Buddhist Scripture. I can learn philosophies in it. It makes me comfortable. It is from ...
7 | -0.11 | It’s out of my mind when I first saw it. The water seemed to be exuding from the building. There was much water on the floor of education building. Water was all around us, anywhere you can touch had water. …
8 | -0.51 | I read an article about the last emperor Po-yee today. I have watched “The Last Emperor” before, which realistically described his life without losing artistry. His love impressed me. As an emperor, he can’t choose the one he loved …
9 | -0.53 | She is 164 in height with white skin, black hair and long limp leg. I like the girl who has long hair and likes sport and dancing. I like sweet girls. …
10 | -0.94 | I have many things to do at the end of this semester. There are five final examinations, Discrete Mathematics, Communication Theory, Architecture of Computer, Algorithm and Law. I know little about them. OMG! Only four weeks are left. There are also two projects, Compiler and Operation System. Compiler can be easily completed but Operation System …
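One way such scores could drive intent-driven search is sketched below in Python. The slides do not say how the informative sense score is computed or used, so this is only an illustration: it assumes scores lie in [-1, 1], that positive means informative and negative means affective, and the function name filter_by_intent and the threshold parameter are hypothetical.

```python
def filter_by_intent(results, intent, threshold=0.0):
    """Keep and order snippets that match the user's intent.

    results: list of (informative_sense, snippet) pairs.
    Assumption: positive scores are informative, negative are affective.
    """
    if intent == "informative":
        kept = [r for r in results if r[0] > threshold]
        return sorted(kept, key=lambda r: r[0], reverse=True)
    if intent == "affective":
        kept = [r for r in results if r[0] < -threshold]
        return sorted(kept, key=lambda r: r[0])
    raise ValueError("intent must be 'informative' or 'affective'")


# Scores and (shortened) snippets taken from the table above.
results = [
    (1.00, "The catalogue of IBM certification"),
    (-0.94, "Crazy Me! I have hesitated between Acer and smuggled IBM"),
    (-0.35, "I got a laptop from my friend this week"),
]
informative = filter_by_intent(results, "informative")
affective = filter_by_intent(results, "affective")
```

Raising the threshold would drop near-neutral snippets (such as the one scored 0.01) from both result lists.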
28. Research
Traditional Web
Social Web
Semantic Web
Machine Learning
29. Our Vision of Semantic Web Search
• It covers most of the important topics in SW
• A lot of tools are built in each layer
• 10+ top papers (WWW’09, SIGMOD’09, SIGMOD’08, VLDB’07, ICDE’09, ISWC’07, etc.)
30. Knowledge Engineering Layer
Ontology Engineering
Orient: Integrating Ontology Engineering into Industry Tooling Environment (ISWC 2004)
Ontology Learning & Population
EachWiki: Facilitating Semantics Reuse for Wikipedia Authoring (ISWC/ASWC 2007)
PORE: Semi-supervised Positive Only Relation Extraction from Wikipedia (ISWC/ASWC 2007)
HS Explorer: Unsupervised Hierarchical Semantics Explorer for Social Annotations (ISWC/ASWC 2007)
Catriple: Extracting Triples from Wikipedia Categories (ASWC 2008)
32. Indexing and Search Layer
Ontology Query Engine based on DBMS
SOR: A Practical System for OWL Ontology Storage, Reasoning and Search (VLDB 2007, SIGMOD 2008)
Annotation-based Semantic Search Engine (DB + IR)
CE2: Towards Large Scale Annotation-based Semantic Search (CIKM 2008)
An Extension to IR index for Relational Search
Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data (ISWC/ASWC 2007, ASWC 2008, WWW 2009, JWS)
Pattern-based RDF Store
33. SOR
Semantic Object Repository
Based on IBM DB2
Supports T-Box reasoning
34. Semplore
Extension to traditional IR engine
Ranking is considered
35. CE^2
Search over
semantically annotated
corpus
Combination of DB and
IR search engines
36. Pattern-based RDF store
Learning to materialize join results
Efficient retrieval of pattern matches
Reasonable extra space -> Significant performance increase (on some datasets)
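The space-for-time trade-off above can be illustrated with a small Python sketch. It is not the lab's store: the function name materialize_join and the two-pattern join shape are assumptions, and the "learning" step that decides which patterns are worth materializing is not shown.

```python
from collections import defaultdict


def materialize_join(triples, pred_a, pred_b):
    """Precompute matches of the pattern (?x pred_a ?y) . (?y pred_b ?z).

    Returns a dict from ?x to a list of (?y, ?z) pairs, so that a later
    query for this pattern is a lookup instead of a join.
    """
    # Index the second triple pattern by subject.
    by_subject = defaultdict(list)
    for s, p, o in triples:
        if p == pred_b:
            by_subject[s].append(o)
    # Join the first pattern's objects against that index.
    table = defaultdict(list)
    for s, p, o in triples:
        if p == pred_a:
            for z in by_subject.get(o, []):
                table[s].append((o, z))
    return dict(table)


triples = [
    ("paper1", "author", "alice"),
    ("alice", "affiliation", "sjtu"),
    ("paper2", "author", "bob"),
]
author_affiliation = materialize_join(triples, "author", "affiliation")
```

The materialized table costs extra space proportional to the number of pattern matches, which is why only frequently queried patterns would be worth storing.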
37. Query Interface and User Interaction Layer
Keyword Interface for Semantic Search
Q2Semantic: Lightweight Ontology based Keyword
Interpretation for Semantic Search (ESWC 2008, ICDE
2009)
Natural Language Interface for Semantic Search
PANTO: A Portable Natural Language Interface to
Ontologies (ESWC 2007)
Snippet Generation
Snippet Generation for Semantic Web Search Engines
(ASWC 2008)
Ontology Presentation
ZoomRDF: Semantic-driven Fisheye Zooming for RDF Data
(WWW 2010)
38. Q2Semantic
Structured queries vs.
keyword queries
Structural data
39. RDF Snippet
Representation of
search results
How will you know
which answers are
most relevant?
41. Research
Traditional Web
Social Web
Semantic Web
Machine Learning
42. Agenda
Introduction to SJTU
Introduction to Apex Lab
Research
Demo
43. How to make them as a whole?
We focused on Semantic Web search
Closed corpus / one single data source involved
Scales only to millions of triples
Uncertainty is not fully considered or used
We need Semantic Web search, however
More than 11 million data sources (Web heterogeneity)
More than 2 billion triples (Scalability)
Uncertainty everywhere
Thus, we carefully consider the following topics
Pay as you go for semantic data integration
Semantic search engine towards billion triples
User-friendly query interface for Semantic Web
[Slide callouts: “Missing”, “Let’s Forget”]
44. Hermes (2nd place Billion Triple Challenge, SIGMOD 2009, JWS)
1. Integrate and index data sources 2. Understand user’s need 3. Search and refine
User interaction: Input keywords -> Select a query -> Refine or navigate
[Figure: example session with the keywords “Article”, “Stanford”, “Turing Award”; the screenshot shows results (Rudi Studer, Semantic Web, ...) and suggestions (Affiliations, ...)]
Architecture components:
Graph Data Processing: Data Graph Summarization, Element Label Extraction, Graph Element Scoring, Mapping Discovery (Schema-level Mapping, Data-level Mapping)
Keyword Translation: Keyword Mapping, Top-k Query Graph Search
Distributed Query Processing: Query Graph Decomposition, Query Planning & Optimization, Local Query Processing, Result Combination & Ranking
Indexing (Internal Indices): Keyword Index, Schema Index, Structure Index, Mapping Index, Graph Indices
46. Machine Learning Team in APEX
Focus on machine learning and its application in Web mining and IR.
Transfer learning
Advertising Techniques in Web
Short text classification & clustering
Multilingual search result integration
47. Outline
Introduction to heterogeneous transfer learning
Cross media: Text -> Image
Clustering
Classification
Cross language: English -> Chinese
Application: Visual Contextual Advertising
49. Traditional machine learning
Training data and test data follow the same distribution.
[Figure: training data and test data examples; only the label “news” is recoverable]
50. Transfer learning
Transfer learning: the training and test distributions are not identical.
[Figure: training data and test data examples; only the label “news” is recoverable]
51. Heterogeneous Transfer Learning
Learning across different feature spaces.
A fixed-wing aircraft, typically called an airplane, aeroplane or simply plane, is an aircraft capable of flight using forward motion that generates lift as the wing moves through the air…
An automobile, motor car or car is a wheeled motor vehicle used for transporting passengers, which also carries its own engine or motor...
[Figure: training data (text) and test data; remaining labels not recoverable]
52. Related Areas of Heterogeneous Learning
Multiple Domain Data, divided by feature space among domains:
Heterogeneous feature spaces, divided by instance correspondences among domains:
Each instance in one domain has its correspondences in other domains: Multi-view Learning
There are few or no instance correspondences among domains: Heterogeneous Transfer Learning
Homogeneous feature spaces, divided by data distribution among domains:
Different: Transfer Learning across Different Distributions
Same: Traditional Machine Learning
Example: Source Domain: “Apple is a fruit that can be found …”, “Banana is the common name for…”; Target Domain
57. Outline
Introduction to heterogeneous transfer learning
Cross media: Text -> Image
Classification
Clustering
Cross language: English -> Chinese
Application: Visual Contextual Advertising
58. Text to Images
[Dai et al. NIPS 2008] [Lin et al. APWeb 2010]
Mining and learning from multimedia data is becoming increasingly important
Limited by scarce labeled image data, can we use abundant text data on the Web?
Our answer is YES
59. Objective
[Figure: learning diagram with inputs and outputs; remaining text not recoverable]
65. Experiments: TAIC
Data sets: 9 binary classification data sets and 5 six-class classification data sets
Image data from Caltech-256 and Fifteen scene
Auxiliary text data from Open Directory Project
Baseline methods
Base classifiers: Naïve Bayes (NBC) and Support vector machine (SVM)
67. Outline
Introduction to heterogeneous transfer learning
Cross media: Text -> Image
Classification
Clustering
Cross language: English -> Chinese
Application: Visual Contextual Advertising
68. Text-aided Image Clustering
[Yang et al. ACL 2009]
Image clustering is an effective method for improving the accessibility of image search results
[Figure: “Apple” = (image) OR (image)]
But traditional clustering methods do not work well with a small amount of data
We consider using annotated images from the social Web to help image clustering
69. Annotated PLSA Model for Clustering
Leveraging the auxiliary text data (from Flickr) by using the topics Z as a bridge
[Figure: words from auxiliary data and image features, connected through topics]
70. Making the transfer…
Log-likelihood objective function
Two parts: image features and auxiliary text features
Image feature to image instance correlation: A
Word feature to image feature correlation: B
\[
\mathcal{L} = \sum_{j} \left[ \lambda \sum_{i} \frac{A_{ij}}{\sum_{j'} A_{ij'}} \log P(f_j \mid v_i) + (1 - \lambda) \sum_{l} \frac{B_{lj}}{\sum_{j'} B_{lj'}} \log P(f_j \mid w_l) \right]
\]
Here \lambda is the trade-off parameter, the denominators are normalization terms, and the two log terms are the likelihood parts for image instances v_i and for words w_l.
71. Experiment Setup
Data sets:
Generated from Caltech-256 and 15-scene corpora
Baseline methods
Baseline clustering methods: KMeans, PLSA and STC
Strategies:
clustering on target image data only
combined: clustering target image data and annotated image
data together and evaluate result for target image data
72. Experimental Result
[Figure: bar chart of entropy (axis from 0 to 2) for KM_Seperate, KM_Combine, PLSA_Seperate, PLSA_Combine, STC and aPLSA]
Average entropy: Heterogeneous TL 0.741, No-Heterogeneous TL 0.786
5.7% entropy reduction
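For readers unfamiliar with entropy as a clustering measure, here is a minimal Python sketch. It assumes the common definition (size-weighted entropy of the true labels inside each cluster, lower is better) and base-2 logarithms; the slides do not state the exact definition used, and the function names are hypothetical. The last helper reproduces the relative reduction between the two reported averages.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy (in bits) of the true-label distribution inside one cluster."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def average_entropy(clusters):
    """Size-weighted average entropy over all clusters."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# A pure cluster and a fully mixed two-label cluster.
example = average_entropy([["frog", "frog"], ["frog", "bear"]])
# Averages reported on the slide: 0.786 without and 0.741 with heterogeneous TL.
reduction_percent = round(100 * relative_reduction(0.786, 0.741), 1)
```

The reported figures give (0.786 - 0.741) / 0.786, which rounds to the 5.7% stated on the slide.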
73. Clustering Results
on Caltech256 [Griffin et al. TR 2007]
[Figure: example clusters for the categories frog, kayak, jesus-christ, bear and watch]
74. Outline
Introduction to heterogeneous transfer learning
Cross media: Text -> Image
Clustering
Classification
Cross language: English -> Chinese
Application: Visual Contextual Advertising
75. Cross-language Classification
[Ling et al. WWW 2008]
Text classification: learn a classifier from labeled Chinese Web pages, then classify unlabeled Chinese Web pages
76. Cross-language Classification
Much labelled data in English, but little in Chinese.
Labeled Data | English | Chinese
News | Reuters-21578 | ?
Newsgroups | 20 Newsgroups | ?
Web pages | Open Directory Project data (> 1M) | Very few ODP data (< 20k, ~ 1%)
77. Cross-language Classification
Cross-language classification: learn a classifier from labeled English Web pages, then classify unlabeled Chinese Web pages
78. Cross-language Classification
Information Bottleneck
X : signals to be encoded (Web pages)
X̃ : codewords (class labels)
Y : features related to X (terms)
79. Cross-language Classification
Optimization
[Figure: objective to minimize, shown as the distance between two mutual-information terms; the term labels are truncated]
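The quantities involved can be made concrete with a short Python sketch. It assumes the distance being minimized is the loss of mutual information I(X;Y) - I(X̃;Y) from the standard information bottleneck formulation, which the truncated slide does not confirm; the function names and the toy joint distribution are hypothetical.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[x] * py[y]))
    return mi


def compress(joint, assignment, n_codes):
    """Merge rows of joint[x][y] into codewords: assignment[x] is x's codeword."""
    merged = [[0.0] * len(joint[0]) for _ in range(n_codes)]
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            merged[assignment[x]][y] += p
    return merged


def ib_loss(joint, assignment, n_codes):
    """Information about Y lost by replacing X with its codewords."""
    return mutual_information(joint) - mutual_information(
        compress(joint, assignment, n_codes))


# Four pages (X) and two terms (Y): pages 0-1 use term 0, pages 2-3 use term 1.
joint = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
good = ib_loss(joint, [0, 0, 1, 1], 2)   # labels follow term usage
bad = ib_loss(joint, [0, 1, 0, 1], 2)    # labels ignore term usage
```

In this toy case the labeling that groups pages by term usage loses no information about the terms, while the other labeling loses all of it.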
81. Outline
Introduction to heterogeneous transfer learning
Cross media: Text -> Image
Clustering
Classification
Cross language: English -> Chinese
Application: Visual Contextual Advertising
82. Application: Visual Contextual Advertising
[Chen et al. AAAI 2010]
Previous research focused on advertising for text Web pages.
With the booming of multimedia data, we need to recommend advertisements for these data
Difficulty: images and text are in different feature spaces
Use the co-occurrence data to bridge these two feature spaces
85. Experimental Results
Co-occurrence data from Flickr.
Test images from Flickr and the Fifteen scene data set
Advertisements are crawled from the MSN search engine with queries chosen from the AOL query log.
88. Thank you
For more details of APEXLAB
http://apex.sjtu.edu.cn/apex_wiki/FrontPage
Our works
http://apex.sjtu.edu.cn/apex_wiki/Papers