SlideShare a Scribd company logo
1 of 7
Download to read offline
PRE-RANKING DOCUMENTS
VALORIZATION IN THE INFORMATION
RETRIEVAL PROCESS
Chkiwa Mounira1, Jedidi Anis1 and Faiez Gargouri1
1

Multimedia, InfoRmation systems and Advanced Computing Laboratory
Sfax University, Tunisia
m.chkiwa@gmail.com, jedidianis@gmail.com, faiez.gargouri@isimsf.rnu.tn

ABSTRACT
In this short paper we present three methods to valorise score relevance of some documents
basing on their characteristics in order to enhance their ranking. Our framework is an
information retrieval system dedicated to children. The valorisation methods aim to increase the
relevance score of some documents by an additional value which is proportional to the number
of multimedia objects included, the number of objects linked to the user particulars and the
included topics. All of the three valorization methods use fuzzy rules to identify the valorization
value.

KEYWORDS
Information Retrieval, Pre-ranking valorization, Fuzzy Logic, Fuzzy Rules

1. INTRODUCTION
In the information retrieval process, younger users have particularities concerning what they
really need in results. In this paper we study those particularities in order to have maximum
coverage of relevant documents. To do it, we present the Pre-ranking Documents Valorization
which aims to increase some documents relevance score by an additional value in order to
enhance their ranking. In order to make the additional value be proportional to the document
characteristics we use fuzzy logic principles. The pre-ranking document valorization takes place
after running the querying process which finds the relevant documents. Our framework
materializes collaboration between two axes: the Semantic Web [1 and 2] and the Fuzzy Logic [3,
4 and 5] .We use semantic web technologies as RDF [6 and 7] to annotate semantically our
collection of web documents and SPARQL [8] for querying the annotation. Also, we use Fuzzy
rules to find the exact value added to a score relevance in order to enhance the ranking of the
correspondent document. The rest of the paper is organized as follows: in section 2 we introduce
some preliminaries about our framework which is an information retrieval system dedicated for
children. Section 3 we present the pre-ranking document valorization by explaining its three
types. Finally section 4 concludes the paper.

David C. Wyld et al. (Eds) : CST, ITCS, JSE, SIP, ARIA, DMS - 2014
pp. 353–359, 2014. © CS & IT-CSCP 2014

DOI : 10.5121/csit.2014.4132
354

Computer Science & Information Technology (CS & IT)

2. FRAMEWORK
To ensure a high quality of use, available information retrieval systems dedicated for children
have common highlighted facts:
−

−
−
−

Security: Since the web covers a large amount of uncontrollable data, security represents the
first factor taken into account to create a safe information retrieval system dedicated for
children.
Design: it represents the main visual factor to call users attention.
Profile: it represents the common way used to personalize a search process taking into
account the user’s personal interests.
Querying: this factor makes the difference between information search engines even if it
follows outwardly the same demarche: the annotation, the query/document matching, the
ranking and the result display.

In addition of considering the cited facts, our particularity resides in the “pre-ranking document
valorisation” which is localized in the figure below.
User
query Q

Querying process
Search information
interface

relevant
documents

1. Query/ document matching => score relevance RS
2. pre-ranking document valorization => added value V to RS
3. ranking process

Figure 1. Our querying particularity

In order to explore young web users searching interests, we do a survey covering 723 children
between 8 and 15 years. The results show that more than 75% of them prefer results rich by
multimedia objects (especially pictures and video sequences). Likewise, more than 70% of
surveyed children want to see in returned results information dealing with their own particulars
(own country, avenue, friends …) especially with the success of social networks use. In the next
section we present two flexible methods that give the priority to relevant results rich by
multimedia objects and/or dealing with the current user particulars. In the other side, we consider
the fact that a returned document may be exhaustive and deal with different topics; this fact could
misplace a user attention especially if it is a child. We propose therefore a method that avoid
“multi-topic” documents and make priority to documents focusing only on query main topic. The
three pre-ranking valorization methods are submitted to fuzzy rules which mainly use variables
representing the user ages; Figure 2 shows the membership function representing web young
user’s age as input set which ranges from 3 to 14
Computer Science & Information Technology (CS & IT)

355

1.2
preschool

1

main childhood

preteen

0.8
0.6
0.4
0.2
0
1

2

3

4

5

6

7

8

9

10 11 12 13 14 15

Figure 2. Fuzzy sets representing the different childhood periods of young web users.

3. PRE-RANKING DOCUMENT VALORIZATION
After running a classical query/document process, we get the score relevance of each document to
the query. The pre-ranking document valorization aims to increase score relevance of relevant
documents depending on their particularities. It’s much easier and time-saver for the system to
handle only relevant documents and not all the collection documents. For that, we choose to not
run the document valorization while the querying process but after it. We define three types of
document valorization: the “multimedia richer” document valorization, the “Nearest personal
information” document valorization and “same-topic” document valorization. All of the three
types of the pre-ranking document valorization are submitted to the application of fuzzy rules.

3.1. “Multimedia Richer” Document Valorization
Given that younger user are interested more in multimedia objects (images, video sequences …)
while the information retrieval process. The “multimedia richer” document valorization aims to
increase the relevance score of relevant document including more multimedia objects. This fact is
submitted to fuzzy rules aiming to decide the value added V to score relevance proportionally to
the user age UA and the Number of Multimedia Objects in the Document NMOD.

─ If UA is in preschool period and NMOD is high then V is high.
─ If UA is in main childhood period and NMOD is high then V is medium.
─ If UA is in preteen period and NMOD is high or medium then V is low.
The Number of Multimedia Objects in the Document NMOD is an input set represented by
triangular membership functions which ranges from Min to Max (see figure 3). Min and Max are
variables representing respectively the minimum number of multimedia objects included in a
document in the collection and the maximum number of multimedia objects included in a
document in the collection.
356

Computer Science & Information Technology (CS & IT)
1.2
Low

1

medium

high

0.8
0.6
0.4
0.2
NMOD

0
Min

(min+max)*0,25

(min+max)*0,5

(min+max)*0,75

Max

Figure 3. Fuzzy sets representing the variation range of NMOD.

3.2. “Nearest Personal Information” Document Valorization
As we mention before, more than 70% of surveyed children want to see in returned results
information dealing with their own particulars (own country, own avenue, own school, friends
…). The idea is to familiarize the information searching concept to children through the
maximum coverage of their own particular in the information searching process. Referring to this
observation, we suppose that results will be considered relevant if they include some personal
data. The nearest personal information document valorization exploits the user age UA and the
Number of Personal information Items found in a Document NPID as numerical inputs of the
fuzzification operation submitted to the fuzzy rules. A personal information item found in a
document may be a piece of text, a picture, or any multimedia object dealing with the current user
particularities. The fuzzy rules listed below make decision about the value V added to a document
relevance score in order to valorize documents dealing with user personal information:
─ If UA is in main childhood period and NPID is high then V is medium.
─ If UA in preteen period and NPID is high then V is high.
─ If UA in preteen period and NPID is medium or low then V is low.
As the NMOD variable, The NPID is an input set represented by triangular membership functions
which ranges from 0 to Max (see figure 4). Max is a variable representing the maximum number
of personal information items concerning a user found in a document of the collection and 0
means that a document didn’t include any information items concerning the current user.
1.2

Low

medium

high

1
0.8
0.6
0.4
0.2
NPID

0
0

max*0,25

max*0,5

max*0,75

max

Figure 4. Fuzzy sets representing the variation range of NPID.
Computer Science & Information Technology (CS & IT)

357

3.3. “Same-Topic” Document Valorization
A “Same-topic” document valorization aims to increase relevance score of document referring to
the topics mentioned therein. The Figure 5 gives a structural view of this type of document
valorization. The meta-document is introduced in [9] and it is able to annotate multimedia objects
as well as web documents in a way that ensures its reusability. The querying process matches the
user query with the meta-documents in order to identify the score relevance of the document to
the query. We define the “topic cloud” as groups of weighted terms concerning a given topic.
Simply, we collect potential terms representing a given topic to construct a topic cloud. The
terms’ weights express the ability of each term to represent the topic. After running a usual
querying process matching the query and the meta-documents, we get the relevance score for
each annotated resource or document. At this point, the topic clouds are used to enhance ranking
results in the benefit of relevant documents focusing mainly on query interests.
WD1T1

Document
1

MetaDocument 1

topic
1

Document
2
.
.
.
.
.
.
.
.
.
.

MetaDocument 2
.
.
.
.
.
.
.
.
.
.

Topic
2
.
.
.
.
.
.
.
.
.
.

Document
n

MetaDocument n

Topic
m

WQT1
Query Q
WQT2

Figure 5. The “Same-topic” document valorization.

To run the “Same-topic” document valorization, first, we establish the meta-document/topic
weighted links WDT. WDT expresses the potential topics mentioned by the document. To assign a
weight WDT to a meta-document/topic link, we simply sum the weights of topic terms existing in
the meta-document. Then we establish query/topics weighted links which express the ability of
each topic to represent the query. To assign a weight WQT to a Query/Topic link, we use the
classic similarity measure between two weighted terms vectors:
W୕୘ = simሺQ, Tiሻ =

෍

౪

౪

ౠసభ

୛౧ౠ ∗ ୛౪౟ౠ

ሺ୛౧ౠ ሻమ ∗ ෍

ඨ෍

ౠసభ

౪

ሺ୛౪౟ౠ ሻమ

(1)

ౠసభ

The next step of the “Same topic” document valorization is to calculate for each document his
topic similarity relative to the query in order to increase or decrease its relevance score in terms
of the value of the topic similarity. The topic similarity TS is calculated as follow:
୩

TS ሺQ, Dሻ = ෍

หW୕୘౟ − Wୈ୘౟ ห

୧ୀଵ

(2)

The main goal of a “Same topic” document valorization is to increase relevance of documents
focusing on the same query topics and valorize “mono-topic” documents to users at an early age
in order to facilitate its comprehension. The TS (Q, D) value is optimal when its value is minimal;
this means that the query and the document are focusing on the same topics with approached
values. Contrariwise, if the TS is high, this means that the document deals with other topics in
addition to the query topics. Finally, the increase value V affected to a document Relevance Score
RS is based on the following fuzzy rules:
─ If UA is in (pre-school or main childhood period) and TS is low then V is high
─ If UA is in preteen period and TS is low or medium then V is medium
358

Computer Science & Information Technology (CS & IT)

Remains to mention that this valorization method is inspired from our previous work [10], except
that we exclude the rule that decrease relevance score of documents having a high TS value. This
exclusion allows to not penalizing some document because of the diversity of topics included
therein. Also, we include the UA variable in this valorization method instead of score relevance
RS. The TS variable is an input set represented by triangular membership functions which ranges
from 0 to Max (see figure 6). Maximum TS means that the current document deals with several
topics in addition to the query topics and a null TS means that the document and the query are
dealing with the same topics.
1.2

Low

medium

1
0.8
0.6
0.4
0.2
TS

0
0

max*0,25

max*0,5

max*0,75

max

Figure 6. Fuzzy sets representing the variation range of TS.
After the application of valorization methods separately to a document Di we sum the eventually
deduced values to get the value Vi which is added to the document score relevance in order to get
the new score relevance. After applying the valorization process to the relevant documents set
found in the querying process, we pass finally to the ranking procedure. Table 1 summarizes the
document valorization process.
Table 1. Recapitulation of the pre-ranking document valorization process.
Valorization method
Multimedia Richer
Nearest Personal Information
Same-Topic
Valorization process

Inputs variables
UA
NMOD

output
v1

NPID

UA

v2

TS

UA

v3

Document Di

Vi=∑ଷ v୩
௞ୀଵ

4. CONCLUSION
In this paper, we present three methods to valorize score relevance of some documents depending
on their characteristics concerning the multimedia included objects, the user particulars and the
topics mentioned therein. In this work, we use fuzzy rules to define in a flexible way the value
added to the score relevance of valorized documents. Our framework is under development and it
represents an information retrieval system dedicated for kids. We are working in the short-term
on the identification of the relevant range and shape of the membership function representing the
added value V usable on the three valorization methods.

REFERENCES
[1]
[2]

Tim Berners-Lee, James Hendler and Ora Lassila the Semantic Web. Scientific American: Feature
Article: The Semantic Web: May 2001
W3C. Semantic Web http://www.w3.org/standards/semanticweb/
Computer Science & Information Technology (CS & IT)
[3]

359

Zadeh, L.A.: Fuzzy sets. Information and Control (1965) 338–353
http://www-bisc.cs.berkeley.edu/Zadeh-1965.pdf
[4] Lotfi A. Zadeh: Knowledge Representation in Fuzzy Logic. IEEE Trans. Knowl. Data Eng. 1(1): 89100 (1989)
[5] Lotfi A. Zadeh: A Summary and Update of "Fuzzy Logic". GrC 2010: 42-44
[6] Manola, F. Miller, E., Beckett, D. and Herman, I. 2007. RDF Primer
http://www.w3.org/2007/02/turtle/primer/
[7] RDF Working Group. Resource Description Framework (RDF) http://www.w3.org/RDF/
[8] Eric Prud'hommeaux, W3C. SPARQL Query Language for RDF W3C Recommendation 15 January
2008 http://www.w3.org/TR/rdf-sparql-query/
[9] Jedidi A. (July 2005). « modélisation générique de documents multimédia par des métadonnées :
mécanismes d’annotation et d’interrogation » Thesis of « Université TOULOUSE III Paul Sabatier »,
France.
[10] Chkiwa, M., Jedidi, A., Gargouri, F. (October 2013). Handling Uncertainty in Semantic Information
Retrieval Process. URSW 2013: 29-33, Sydney Australia

More Related Content

What's hot

Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
BAIRAVI T
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
dannyijwest
 
Bioinformatioc: Information Retrieval
Bioinformatioc: Information RetrievalBioinformatioc: Information Retrieval
Bioinformatioc: Information Retrieval
Dr. Rupak Chakravarty
 
STATS415-Final_report
STATS415-Final_reportSTATS415-Final_report
STATS415-Final_report
Yilei Zhang
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
alaa223
 
Information Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept SimilarityInformation Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept Similarity
rahulmonikasharma
 

What's hot (19)

Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
 
Automatic detection of online abuse and analysis of problematic users in wiki...
Automatic detection of online abuse and analysis of problematic users in wiki...Automatic detection of online abuse and analysis of problematic users in wiki...
Automatic detection of online abuse and analysis of problematic users in wiki...
 
Data Collection Methods for Building a Free Response Training Simulation
Data Collection Methods for Building a Free Response Training SimulationData Collection Methods for Building a Free Response Training Simulation
Data Collection Methods for Building a Free Response Training Simulation
 
Information retrieval-systems notes
Information retrieval-systems notesInformation retrieval-systems notes
Information retrieval-systems notes
 
Introduction abstract
Introduction abstractIntroduction abstract
Introduction abstract
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
Mam assign
Mam assignMam assign
Mam assign
 
An effective pre processing algorithm for information retrieval systems
An effective pre processing algorithm for information retrieval systemsAn effective pre processing algorithm for information retrieval systems
An effective pre processing algorithm for information retrieval systems
 
Bioinformatioc: Information Retrieval
Bioinformatioc: Information RetrievalBioinformatioc: Information Retrieval
Bioinformatioc: Information Retrieval
 
Lec 2
Lec 2Lec 2
Lec 2
 
Lec1,2
Lec1,2Lec1,2
Lec1,2
 
STATS415-Final_report
STATS415-Final_reportSTATS415-Final_report
STATS415-Final_report
 
Hu3414421448
Hu3414421448Hu3414421448
Hu3414421448
 
Ijetcas14 446
Ijetcas14 446Ijetcas14 446
Ijetcas14 446
 
SOFTWARE ENGINEERING AND SOFTWARE PROJECT MANAGEMENT
SOFTWARE ENGINEERING AND SOFTWARE PROJECT MANAGEMENTSOFTWARE ENGINEERING AND SOFTWARE PROJECT MANAGEMENT
SOFTWARE ENGINEERING AND SOFTWARE PROJECT MANAGEMENT
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
 
Information Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept SimilarityInformation Retrieval on Text using Concept Similarity
Information Retrieval on Text using Concept Similarity
 
2. an efficient approach for web query preprocessing edit sat
2. an efficient approach for web query preprocessing edit sat2. an efficient approach for web query preprocessing edit sat
2. an efficient approach for web query preprocessing edit sat
 
Kp3518241828
Kp3518241828Kp3518241828
Kp3518241828
 

Similar to PRE-RANKING DOCUMENTS VALORIZATION IN THE INFORMATION RETRIEVAL PROCESS

Evidence Based Healthcare Design
Evidence Based Healthcare DesignEvidence Based Healthcare Design
Evidence Based Healthcare Design
Carmen Martin
 
Final review m score
Final review m scoreFinal review m score
Final review m score
azhar4010
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
Subrat Swain
 

Similar to PRE-RANKING DOCUMENTS VALORIZATION IN THE INFORMATION RETRIEVAL PROCESS (20)

Characterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesCharacterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining Techniques
 
Evidence Based Healthcare Design
Evidence Based Healthcare DesignEvidence Based Healthcare Design
Evidence Based Healthcare Design
 
IRJET- Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...
IRJET-  	  Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...IRJET-  	  Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...
IRJET- Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...
 
Designing a Survey Study to Measure the Diversity of Digital Learners
Designing a Survey Study to Measure the Diversity of Digital LearnersDesigning a Survey Study to Measure the Diversity of Digital Learners
Designing a Survey Study to Measure the Diversity of Digital Learners
 
Designing a Survey Study to Measure the Diversity of Digital Learners
Designing a Survey Study to Measure the Diversity of Digital Learners Designing a Survey Study to Measure the Diversity of Digital Learners
Designing a Survey Study to Measure the Diversity of Digital Learners
 
Projection Multi Scale Hashing Keyword Search in Multidimensional Datasets
Projection Multi Scale Hashing Keyword Search in Multidimensional DatasetsProjection Multi Scale Hashing Keyword Search in Multidimensional Datasets
Projection Multi Scale Hashing Keyword Search in Multidimensional Datasets
 
Final review m score
Final review m scoreFinal review m score
Final review m score
 
Achieving Privacy in Publishing Search logs
Achieving Privacy in Publishing Search logsAchieving Privacy in Publishing Search logs
Achieving Privacy in Publishing Search logs
 
Big Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and IssuesBig Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and Issues
 
Two-Phase TDS Approach for Data Anonymization To Preserving Bigdata Privacy
Two-Phase TDS Approach for Data Anonymization To Preserving Bigdata PrivacyTwo-Phase TDS Approach for Data Anonymization To Preserving Bigdata Privacy
Two-Phase TDS Approach for Data Anonymization To Preserving Bigdata Privacy
 
Using Randomized Response Techniques for Privacy-Preserving Data Mining
Using Randomized Response Techniques for Privacy-Preserving Data MiningUsing Randomized Response Techniques for Privacy-Preserving Data Mining
Using Randomized Response Techniques for Privacy-Preserving Data Mining
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
 
Meliorating usable document density for online event detection
Meliorating usable document density for online event detectionMeliorating usable document density for online event detection
Meliorating usable document density for online event detection
 
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdfKIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
 
Big Data: Privacy and Security Aspects
Big Data: Privacy and Security AspectsBig Data: Privacy and Security Aspects
Big Data: Privacy and Security Aspects
 
Introduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycleIntroduction to Data Analytics and data analytics life cycle
Introduction to Data Analytics and data analytics life cycle
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
 
RESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWRESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEW
 
RESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWRESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEW
 
RESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEWRESEARCH IN BIG DATA – AN OVERVIEW
RESEARCH IN BIG DATA – AN OVERVIEW
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 

PRE-RANKING DOCUMENTS VALORIZATION IN THE INFORMATION RETRIEVAL PROCESS

  • 1. PRE-RANKING DOCUMENTS VALORIZATION IN THE INFORMATION RETRIEVAL PROCESS Chkiwa Mounira1, Jedidi Anis1 and Faiez Gargouri1 1 Multimedia, InfoRmation systems and Advanced Computing Laboratory Sfax University, Tunisia m.chkiwa@gmail.com, jedidianis@gmail.com, faiez.gargouri@isimsf.rnu.tn ABSTRACT In this short paper we present three methods to valorise score relevance of some documents basing on their characteristics in order to enhance their ranking. Our framework is an information retrieval system dedicated to children. The valorisation methods aim to increase the relevance score of some documents by an additional value which is proportional to the number of multimedia objects included, the number of objects linked to the user particulars and the included topics. All of the three valorization methods use fuzzy rules to identify the valorization value. KEYWORDS Information Retrieval, Pre-ranking valorization, Fuzzy Logic, Fuzzy Rules 1. INTRODUCTION In the information retrieval process, younger users have particularities concerning what they really need in results. In this paper we study those particularities in order to have maximum coverage of relevant documents. To do it, we present the Pre-ranking Documents Valorization which aims to increase some documents relevance score by an additional value in order to enhance their ranking. In order to make the additional value be proportional to the document characteristics we use fuzzy logic principles. The pre-ranking document valorization takes place after running the querying process which finds the relevant documents. Our framework materializes collaboration between two axes: the Semantic Web [1 and 2] and the Fuzzy Logic [3, 4 and 5] .We use semantic web technologies as RDF [6 and 7] to annotate semantically our collection of web documents and SPARQL [8] for querying the annotation. Also, we use Fuzzy rules to find the exact value added to a score relevance in order to enhance the ranking of the correspondent document. The rest of the paper is organized as follows: in section 2 we introduce some preliminaries about our framework which is an information retrieval system dedicated for children. Section 3 we present the pre-ranking document valorization by explaining its three types. Finally section 4 concludes the paper. David C. Wyld et al. (Eds) : CST, ITCS, JSE, SIP, ARIA, DMS - 2014 pp. 353–359, 2014. © CS & IT-CSCP 2014 DOI : 10.5121/csit.2014.4132
  • 2. 354 Computer Science & Information Technology (CS & IT) 2. FRAMEWORK To ensure a high quality of use, available information retrieval systems dedicated for children have common highlighted facts: − − − − Security: Since the web covers a large amount of uncontrollable data, security represents the first factor taken into account to create a safe information retrieval system dedicated for children. Design: it represents the main visual factor to call users attention. Profile: it represents the common way used to personalize a search process taking into account the user’s personal interests. Querying: this factor makes the difference between information search engines even if it follows outwardly the same demarche: the annotation, the query/document matching, the ranking and the result display. In addition of considering the cited facts, our particularity resides in the “pre-ranking document valorisation” which is localized in the figure below. User query Q Querying process Search information interface relevant documents 1. Query/ document matching => score relevance RS 2. pre-ranking document valorization => added value V to RS 3. ranking process Figure 1. Our querying particularity In order to explore young web users searching interests, we do a survey covering 723 children between 8 and 15 years. The results show that more than 75% of them prefer results rich by multimedia objects (especially pictures and video sequences). Likewise, more than 70% of surveyed children want to see in returned results information dealing with their own particulars (own country, avenue, friends …) especially with the success of social networks use. In the next section we present two flexible methods that give the priority to relevant results rich by multimedia objects and/or dealing with the current user particulars. In the other side, we consider the fact that a returned document may be exhaustive and deal with different topics; this fact could misplace a user attention especially if it is a child. We propose therefore a method that avoid “multi-topic” documents and make priority to documents focusing only on query main topic. The three pre-ranking valorization methods are submitted to fuzzy rules which mainly use variables representing the user ages; Figure 2 shows the membership function representing web young user’s age as input set which ranges from 3 to 14
  • 3. Computer Science & Information Technology (CS & IT) 355 1.2 preschool 1 main childhood preteen 0.8 0.6 0.4 0.2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Figure 2. Fuzzy sets representing the different childhood periods of young web users. 3. PRE-RANKING DOCUMENT VALORIZATION After running a classical query/document process, we get the score relevance of each document to the query. The pre-ranking document valorization aims to increase score relevance of relevant documents depending on their particularities. It’s much easier and time-saver for the system to handle only relevant documents and not all the collection documents. For that, we choose to not run the document valorization while the querying process but after it. We define three types of document valorization: the “multimedia richer” document valorization, the “Nearest personal information” document valorization and “same-topic” document valorization. All of the three types of the pre-ranking document valorization are submitted to the application of fuzzy rules. 3.1. “Multimedia Richer” Document Valorization Given that younger user are interested more in multimedia objects (images, video sequences …) while the information retrieval process. The “multimedia richer” document valorization aims to increase the relevance score of relevant document including more multimedia objects. This fact is submitted to fuzzy rules aiming to decide the value added V to score relevance proportionally to the user age UA and the Number of Multimedia Objects in the Document NMOD. ─ If UA is in preschool period and NMOD is high then V is high. ─ If UA is in main childhood period and NMOD is high then V is medium. ─ If UA is in preteen period and NMOD is high or medium then V is low. The Number of Multimedia Objects in the Document NMOD is an input set represented by triangular membership functions which ranges from Min to Max (see figure 3). Min and Max are variables representing respectively the minimum number of multimedia objects included in a document in the collection and the maximum number of multimedia objects included in a document in the collection.
  • 4. 356 Computer Science & Information Technology (CS & IT) 1.2 Low 1 medium high 0.8 0.6 0.4 0.2 NMOD 0 Min (min+max)*0,25 (min+max)*0,5 (min+max)*0,75 Max Figure 3. Fuzzy sets representing the variation range of NMOD. 3.2. “Nearest Personal Information” Document Valorization As we mention before, more than 70% of surveyed children want to see in returned results information dealing with their own particulars (own country, own avenue, own school, friends …). The idea is to familiarize the information searching concept to children through the maximum coverage of their own particular in the information searching process. Referring to this observation, we suppose that results will be considered relevant if they include some personal data. The nearest personal information document valorization exploits the user age UA and the Number of Personal information Items found in a Document NPID as numerical inputs of the fuzzification operation submitted to the fuzzy rules. A personal information item found in a document may be a piece of text, a picture, or any multimedia object dealing with the current user particularities. The fuzzy rules listed below make decision about the value V added to a document relevance score in order to valorize documents dealing with user personal information: ─ If UA is in main childhood period and NPID is high then V is medium. ─ If UA in preteen period and NPID is high then V is high. ─ If UA in preteen period and NPID is medium or low then V is low. As the NMOD variable, The NPID is an input set represented by triangular membership functions which ranges from 0 to Max (see figure 4). Max is a variable representing the maximum number of personal information items concerning a user found in a document of the collection and 0 means that a document didn’t include any information items concerning the current user. 1.2 Low medium high 1 0.8 0.6 0.4 0.2 NPID 0 0 max*0,25 max*0,5 max*0,75 max Figure 4. Fuzzy sets representing the variation range of NPID.
  • 5. Computer Science & Information Technology (CS & IT) 357 3.3. “Same-Topic” Document Valorization A “Same-topic” document valorization aims to increase relevance score of document referring to the topics mentioned therein. The Figure 5 gives a structural view of this type of document valorization. The meta-document is introduced in [9] and it is able to annotate multimedia objects as well as web documents in a way that ensures its reusability. The querying process matches the user query with the meta-documents in order to identify the score relevance of the document to the query. We define the “topic cloud” as groups of weighted terms concerning a given topic. Simply, we collect potential terms representing a given topic to construct a topic cloud. The terms’ weights express the ability of each term to represent the topic. After running a usual querying process matching the query and the meta-documents, we get the relevance score for each annotated resource or document. At this point, the topic clouds are used to enhance ranking results in the benefit of relevant documents focusing mainly on query interests. WD1T1 Document 1 MetaDocument 1 topic 1 Document 2 . . . . . . . . . . MetaDocument 2 . . . . . . . . . . Topic 2 . . . . . . . . . . Document n MetaDocument n Topic m WQT1 Query Q WQT2 Figure 5. The “Same-topic” document valorization. To run the “Same-topic” document valorization, first, we establish the meta-document/topic weighted links WDT. WDT expresses the potential topics mentioned by the document. To assign a weight WDT to a meta-document/topic link, we simply sum the weights of topic terms existing in the meta-document. Then we establish query/topics weighted links which express the ability of each topic to represent the query. To assign a weight WQT to a Query/Topic link, we use the classic similarity measure between two weighted terms vectors: W୕୘ = simሺQ, Tiሻ = ෍ ౪ ౪ ౠసభ ୛౧ౠ ∗ ୛౪౟ౠ ሺ୛౧ౠ ሻమ ∗ ෍ ඨ෍ ౠసభ ౪ ሺ୛౪౟ౠ ሻమ (1) ౠసభ The next step of the “Same topic” document valorization is to calculate for each document his topic similarity relative to the query in order to increase or decrease its relevance score in terms of the value of the topic similarity. The topic similarity TS is calculated as follow: ୩ TS ሺQ, Dሻ = ෍ หW୕୘౟ − Wୈ୘౟ ห ୧ୀଵ (2) The main goal of a “Same topic” document valorization is to increase relevance of documents focusing on the same query topics and valorize “mono-topic” documents to users at an early age in order to facilitate its comprehension. The TS (Q, D) value is optimal when its value is minimal; this means that the query and the document are focusing on the same topics with approached values. Contrariwise, if the TS is high, this means that the document deals with other topics in addition to the query topics. Finally, the increase value V affected to a document Relevance Score RS is based on the following fuzzy rules: ─ If UA is in (pre-school or main childhood period) and TS is low then V is high ─ If UA is in preteen period and TS is low or medium then V is medium
  • 6. 358 Computer Science & Information Technology (CS & IT) Remains to mention that this valorization method is inspired from our previous work [10], except that we exclude the rule that decrease relevance score of documents having a high TS value. This exclusion allows to not penalizing some document because of the diversity of topics included therein. Also, we include the UA variable in this valorization method instead of score relevance RS. The TS variable is an input set represented by triangular membership functions which ranges from 0 to Max (see figure 6). Maximum TS means that the current document deals with several topics in addition to the query topics and a null TS means that the document and the query are dealing with the same topics. 1.2 Low medium 1 0.8 0.6 0.4 0.2 TS 0 0 max*0,25 max*0,5 max*0,75 max Figure 6. Fuzzy sets representing the variation range of TS. After the application of valorization methods separately to a document Di we sum the eventually deduced values to get the value Vi which is added to the document score relevance in order to get the new score relevance. After applying the valorization process to the relevant documents set found in the querying process, we pass finally to the ranking procedure. Table 1 summarizes the document valorization process. Table 1. Recapitulation of the pre-ranking document valorization process. Valorization method Multimedia Richer Nearest Personal Information Same-Topic Valorization process Inputs variables UA NMOD output v1 NPID UA v2 TS UA v3 Document Di Vi=∑ଷ v୩ ௞ୀଵ 4. CONCLUSION In this paper, we present three methods to valorize score relevance of some documents depending on their characteristics concerning the multimedia included objects, the user particulars and the topics mentioned therein. In this work, we use fuzzy rules to define in a flexible way the value added to the score relevance of valorized documents. Our framework is under development and it represents an information retrieval system dedicated for kids. We are working in the short-term on the identification of the relevant range and shape of the membership function representing the added value V usable on the three valorization methods. REFERENCES [1] [2] Tim Berners-Lee, James Hendler and Ora Lassila the Semantic Web. Scientific American: Feature Article: The Semantic Web: May 2001 W3C. Semantic Web http://www.w3.org/standards/semanticweb/
  • 7. Computer Science & Information Technology (CS & IT) [3] 359 Zadeh, L.A.: Fuzzy sets. Information and Control (1965) 338–353 http://www-bisc.cs.berkeley.edu/Zadeh-1965.pdf [4] Lotfi A. Zadeh: Knowledge Representation in Fuzzy Logic. IEEE Trans. Knowl. Data Eng. 1(1): 89100 (1989) [5] Lotfi A. Zadeh: A Summary and Update of "Fuzzy Logic". GrC 2010: 42-44 [6] Manola, F. Miller, E., Beckett, D. and Herman, I. 2007. RDF Primer http://www.w3.org/2007/02/turtle/primer/ [7] RDF Working Group. Resource Description Framework (RDF) http://www.w3.org/RDF/ [8] Eric Prud'hommeaux, W3C. SPARQL Query Language for RDF W3C Recommendation 15 January 2008 http://www.w3.org/TR/rdf-sparql-query/ [9] Jedidi A. (July 2005). « modélisation générique de documents multimédia par des métadonnées : mécanismes d’annotation et d’interrogation » Thesis of « Université TOULOUSE III Paul Sabatier », France. [10] Chkiwa, M., Jedidi, A., Gargouri, F. (October 2013). Handling Uncertainty in Semantic Information Retrieval Process. URSW 2013: 29-33, Sydney Australia