Published in:
Proceedings International Joint Conference, 7th Ibero-American Conference, 15th Brazilian Symposium on AI, IBERAMIA-SBIA 2000, Open Discussion Track Proceedings, p. 217-226, Atibaia - Sao Paulo (Brasil), November 19-22 2000. ISBN: 85-87837-03-6.
Download:
http://gplsi.dlsi.ua.es/almacenes/ver.php?pdf=1
EL CICLO PRÁCTICO DE UN MOTOR DE CUATRO TIEMPOS.pptx
Clustering of Similar Values, in Spanish, for the Improvement of Search Systems
1. Clustering of Similar Values, in Spanish,
for the Improvement of Search Systems
Sergio Luján-Mora & Manuel Palomar
(sergio.lujan@ua.es / @sergiolujanmora)
Department of Software and Computing Systems
University of Alicante, Spain
Published in:
Proceedings International Joint Conference, 7th IberoAmerican Conference, 15th Brazilian Symposium on AI,
IBERAMIA-SBIA 2000, Open Discussion Track Proceedings,
p. 217-226, Atibaia - Sao Paulo (Brasil), November 19-22
2000. ISBN: 85-87837-03-6.
Download:
http://gplsi.dlsi.ua.es/almacenes/ver.php?pdf=1
1
2. Clustering of Similar Values,
in Spanish,
for the Improvement of
Search Systems
Sergio Luján-Mora & Manuel Palomar
Department of Software and Computing Systems
University of Alicante, Spain
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
2
3. Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
3
4. Introduction
• Information systems Rapid and
precise access
• Databases Find information
• Inconsistency: a term represented by
different values
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
4
5. Introduction
• Term
– Universidad de Alicante
• Different values found in databases:
– Universidad Alicante
– Unibersidad de Alicante
– Universitat d’Alacant
– University of Alicante
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
5
6. Introduction
• The problem:
– Data redundancy Inconsistency
– Integration of different databases into a
common repository (e.g. data
warehouses):
• different criteria data redundancy
Dpto. de Lenguajes y Sistemas Informáticos
Inconsistency
Universidad de Alicante (España)
6
7. Introduction
• We use clustering within an automatic
method for reducing on inconsistency
1. Values that refer to a same term are
clustered
2. All values are replaced by the cluster
sample
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
7
8. Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
8
9. Taxonomy of different values
• Omission or inclusion of the written
accent:
Asociación Astronómica
Asociacion Astronomica
• Lower-case / upper-case:
Departamento de Lenguajes y Sistemas
Departamento de lenguajes y sistemas
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
9
10. Taxonomy of different values
• Abbreviations and acronyms:
Dpto. de Derecho Civil
Departamento de Derecho Civil
• Word order:
Miguel de Cervantes Saavedra
Cervantes Saavedra, Miguel de
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
10
11. Taxonomy of different values
• Different denominations:
Unidad de Registro Sismológico
Unidad de Registro Sísmico
• Punctuation marks:
Laboratorio Multimedia (mmlab)
Laboratorio Multimedia - mmlab
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
11
12. Taxonomy of different values
• Errors (misspelling, typing or printing
errors):
Gabinete de imagen
Gavinete de imagen
• Different languages:
Universidad de Alicante
University of Alicante
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
12
13. Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
13
14. The solution
1. Preparation
Main
step
2. Reading
3. Sorting
4. Clustering
5. Checking
6. Updating
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
14
15. Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
15
16. The clustering algorithm
• Similarity:
– Edit distance or Levenshtein distance (LD)
– Invariant distance from word position
(IDWP)
Universidad de Alicante
Alicante, Universidad de
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
16
17. The clustering algorithm
• Filtering:
– Length distance (LEND)
– Transposition-invariant distance (TID)
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
17
18. The clustering algorithm
Input:
C: Sorted strings in descending order by frequency (c1…cm)
Output:
G: Set of clusters (g1…gn)
STEPS
1 Select ci, the first string in C, and insert it into the new
cluster gk
2 Remove ci from C
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
18
19. The clustering algorithm
3. For each string cj in C
If LEND(ci, cj) < α LEND(ci, cj) then
If TID(ci, cj) < α TID(ci, cj) then
If LD(ci, cj) < α LD(ci, cj) then
Insert cj into cluster gk
Remove cj from C
Else If IDWP(ci, cj) < α IDWP(ci, cj) then
Insert cj into cluster gk
Dpto. de Lenguajes y Sistemas Informáticos
Remove c from C
Universidad de Alicante j (España)
19
20. Contents
• Introduction
• Taxonomy of different values
• The solution
• The clustering algorithm
• Results
• Conclusions
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
20
21. Results
Indexes for measuring the cluster complexity
CI: Consistency Index
FCI: File Consistency Index
∑∑ LD( x , x )
n
CI =
n
i
i =1 j =1
j
n
∑x
i =1
m
i
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
FCI =
∑ CI
i =1
i
m
21
22. Results
• File A
• File B
– Without
• FCI: 0.31
– With
• FCI: 0.12
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
– Without
• FCI: 1.72
– With
• FCI: 1.11
22
23. Results
• Evaluation measures:
– ONC: optimal number of clusters
– NC: number of clusters generated
– NCC: number of completely correct
clusters
– NIC: number of incorrect clusters
– NES: number of erroneous strings
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
23
24. Results
• Precision: NCC / ONC
• Error: NIC / ONC
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
24
25. Results
• File A
• File B
– Without
– Without
• Precision: 70.7%
• Precision: 67.4%
• Error: 7.6%
• Error: 8.7%
– With
– With
• Precision: 84.8%
• Precision: 72.8%
• Error: 0%
• Error: 6.5%
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
25
26. Contents
• Introduction
• The problem: causes
• The solution
• The clustering algorithm
• Results
• Conclusions
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
26
27. Conclusions
• Achieves good results: improves on
data quality
• Review obtained clusters
• Expansion of abbreviations
• Parameters
Dpto. de Lenguajes y Sistemas Informáticos
Universidad de Alicante (España)
27
Notas do Editor
{"27":"This method achieves good results in all experiments done, but it does not eliminate the need to review the obtained clusters.\nThe expansion of abbreviations improves on the results.\n","16":"The similarity between two strings must be evaluated. We use the EDIT DISTANCE and the INVARIANT DISTANCE FROM WORD POSITION. \nThe edit distance of two strings is a measure of similarity that is given by the minimal number of simple edit operations needed to transform an string into the other.\nThe simple editing operations considered are: the insertion of a character, the deletion of a character and the substitution of one character with another.\n","5":"For example, if we consult a database that stores information about university researches, we may easily find that there are different values for the same university:\n- Universidad Alicante (without the preposition)\n- Unibersidad de Alicante (with misspelling)\n- Universitat d’Alacant (in Catalan)\n- and even University of Alicante (in English)\nIf a database suffers inconsistency, a search using a given value will not provide all the available information about the term.\n","22":"As we can see, the clusters of file B are more complex than those of file.\nIn both files, the FCI is reduced when expanding the abbreviations.\n","11":"- Different denominations.\n- Punctuation marks: hyphens, commas, semi-colons, brackets, exclamation marks and so on.\n","17":"These distances speed up the clustering. The expensive computation of LD and IDWP can be avoided. \n","6":"The problem of the inconsistency in the values stored in databases may have two origins:\n- Different people may insert the same term with different values in a database.\n- When we try to integrate different databases, they may use different values for representing the same term.\n","23":"We have evaluated the clusters obtained by using four measures that are obtained by comparing the clusters produced with the optimal clusters (handcrafted).\n","12":"- Errors: misspelling, typing or printing errors.\n- Use of different languages.\n","1":"Perhaps we should begin.\nGood morning everyone. Thanks for coming. I am a member of the Deparment of Software and Computing Systems at the University of Alicante in Spain.\nThe work I’m going to present is Clustering of Similar Values, in Spanish, for the Improvement of Search Systems.\n","7":"We present an automatic method for reducing on the inconsistency found in existing databases, and thus, improving data quality. \nAll the values that refer to a same term are clustered by measuring their degree of similarity.\nThe clustered values can be assigned to a common value which, in principle, could be substituted for the original values.\n","24":"From the previous measures, we obtain Precision: NCC divided by ONC and Error: NIC divided by ONC.\n","13":"The method we propose can be divided into six steps.\n","2":"Perhaps we should begin.\nGood morning everyone. Thanks for coming. I am a member of the Deparment of Software and Computing Systems at the University of Alicante in Spain.\nThe work I’m going to present is Clustering of Similar Values, in Spanish, for the Improvement of Search Systems.\n","8":"After analysing several databases with information both in Spanish, we have noticed that the different values that appear for a given term are due to a combination of the following causes.\n","25":"In both files, the expansion of abbreviations produces improvements: it increases the precision and reduces the error.\nFor file A, a maximum precision of 70.7% and 84.8% is obtained without and with expansion of abbreviations.\nFor file B, a maximum precision of 67.4% and 72.8% is obtained without and with expansion of abbreviations.\n","14":"1. Preparation. It may be necessary to prepare the strings before applying the clustering algorithm.\n2. Reading. The following process is repeated for each of the strings contained in the input file: Read a string, Expand abbreviations and acronyms, Remove accents, Shift string to lower-case, Store the string.\n3. Sorting. The strings are sorted, in descending order, by frequency of appearance.\n","3":"First of all, I would like to outline the main points of my talk. \nI have structured my talk into six sections. Firstly, I will give an introduction to my research and I will make a few observations about the inconsistency problem. Then I will go on to explain the main causes of the problem. Next, I will talk about our proposed method for reducing inconsistency in databases. And then, I will present our clustering algorithm and the distance metrics it uses. Finally, I will highlight the main results of our method and the conclusions of our research.\nLet’s move on to the first part of the presentation.\n","9":"- The omission or inclusion of the written accent.\n- The use of lower-case and upper-case letters.\n","15":"The standard method of detecting exact duplicates in a table is to sort the table and then to check if neighboring tuples are identical. This approach can be extended to detect approximate duplicates.\n","4":"Existing information systems provide rapid and precise access to information stored in databases.\nOne of the main uses of databases is find information.\nIf a database has a bad design, it is very likely that it suffers inconsistency: the very same term may have different values, due to misspelling, a permuted word order, spelling variants and so on.\n","21":"We have used two files for evaluating our method. They contain data from two databases with inconsistency problems.\nWe have developed a coefficient named CONSISTENCY INDEX that permits the evaluation of the complexity of a cluster: the greater the value of the coefficient, the more different the strings that form the cluster are.\nThe FILE CONSISTENCY INDEX is defined as the average of the consistency indexes of all the existing clusters in the file.\n","10":"- The use of abbreviations and acronyms.\n- Different word order.\n"}