SlideShare uma empresa Scribd logo
1 de 47
Baixar para ler offline
The CORFU technique | MTSR 2013

Towards a stepwise method for unifying and reconciling
corporate names in public contracts metadata.
The CORFU technique.
Michalis Vafopoulos (speaker)
and
{Jose María Álvarez-Rodríguez, Patricia Ordoñez De Pablos and
Jose Emilio Labra-Gayo}
MTSR 2013 | 7th Metadata and Semantics Research Conference
Track on Metadata and Semantics for Open Repositories, Research Information Systems
and Data Infrastructures

1 / 43
The CORFU technique | MTSR 2013

1

Introduction

2

Related Work

3

The CORFU technique
Use Case: the Public Spending initiative

4

Evaluation and Discussion

5

Conclusions and Future Work

6

Metadata and Information

2 / 43
The CORFU technique | MTSR 2013
Introduction

The Problem...What is the “Big Name”?
“Oracle (Corp) Aust Pty Ltd”
“Oracle Corp (Aust) Pty Ltd”
“Oracle Corp Aust Pty Ltd”
“Oracle Corp. Australia”
“Oracle Corp. Australia Pty.Ltd.”
“Oracle Corpoartion (Aust) Pty Ltd”

“Oracle”

“Oracle Corporate Aust Pty Ltd”
“Oracle Corporation”
“Oracle Risk Consultants”
“ORACLE SYSTEMS (AUSTRALIA) PTY LTD”
“Oracle University”
...
“Accenture”
“Accenture Aust Holdings”
“Accenture Aust Holdings”
“Accenture Aust Holdings Pty Ltd”
“Accenture Australia Holding P/L”

“Accenture”

“Accenture Australia Holdings P/Ltd”
“Accenture Australia Holdings Pty Lt”
“Accenture Australia Limited”
...
...

...
3 / 43
The CORFU technique | MTSR 2013
Introduction

...and if you have 400K corporate names

4 / 43
The CORFU technique | MTSR 2013
Introduction

Scope
Public Procurement
1

e-Procurement is a strategic sector (17 % of the GDP in Europe).

2

Action Plans 2004 and 2020.

3

Projects: E-Certis, Fiscalis 2013, E-Prior, PEPPOL, STORK,etc.

4

Other actions: TED, RAMON metadata server, CPV, NUTS, etc.

5

Legal Framework.

6

Boost participation with special focus on SMEs.s

...but it also requires...
1

Accomplish with Open Data principles.

2

Improve transparency of public bodies.

3

Track where public money goes.

4

...
5 / 43
The CORFU technique | MTSR 2013
Introduction

Scope
Public Procurement
1

e-Procurement is a strategic sector (17 % of the GDP in Europe).

2

Action Plans 2004 and 2020.

3

Projects: E-Certis, Fiscalis 2013, E-Prior, PEPPOL, STORK,etc.

4

Other actions: TED, RAMON metadata server, CPV, NUTS, etc.

5

Legal Framework.

6

Boost participation with special focus on SMEs.s

...but it also requires...
1

Accomplish with Open Data principles.

2

Improve transparency of public bodies.

3

Track where public money goes.

4

...
6 / 43
The CORFU technique | MTSR 2013
Introduction

The Problem...
How can we track public procurement processes?
1
2

Data and information is already out there.
Relevant metadata can be (re)used:
Normalized product scheme classifications such as the CPV 2008
(Common Procurement Vocabulary) [1].
Territorial units (NUTS).
Currency.

3

...

...and “names”?
Both Payer and Payee names are not usually normalized.

Normalized and unified names (“Big Name”)...
...with the aim of tracking both payers and payees.
7 / 43
The CORFU technique | MTSR 2013
Introduction

The Problem...
How can we track public procurement processes?
1
2

Data and information is already out there.
Relevant metadata can be (re)used:
Normalized product scheme classifications such as the CPV 2008
(Common Procurement Vocabulary) [1].
Territorial units (NUTS).
Currency.

3

...

...and “names”?
Both Payer and Payee names are not usually normalized.

Normalized and unified names (“Big Name”)...
...with the aim of tracking both payers and payees.
8 / 43
The CORFU technique | MTSR 2013
Introduction

The Problem...
How can we track public procurement processes?
1
2

Data and information is already out there.
Relevant metadata can be (re)used:
Normalized product scheme classifications such as the CPV 2008
(Common Procurement Vocabulary) [1].
Territorial units (NUTS).
Currency.

3

...

...and “names”?
Both Payer and Payee names are not usually normalized.

Normalized and unified names (“Big Name”)...
...with the aim of tracking both payers and payees.
9 / 43
The CORFU technique | MTSR 2013
Introduction

The Problem...
Some remarks...
1

It is not a mere problem of reconciling entities (dealing with)...
Misspelling errors.
Name/acronym mismatches.
...

2

3

...but to create a unique name or link (n string literals → 1
company → 1 URI).
E.g.
“Oracle” and “Oracle University” could be respectively aligned to the
entities <Oracle_Corporation> and <Oracle_University>
...but the problem of grouping by a unique (Big ) name, identifier or
resource still remains.

10 / 43
The CORFU technique | MTSR 2013
Introduction

The Public Spending initiative...

...a joint effort trying to answer...
1

Who really gets the public money?

2

For what? From whom?

3

Can we compare them?

4

Is public spending effective?

5

...

Learn more: http://publicspending.net/

11 / 43
The CORFU technique | MTSR 2013
Related Work

Natural Language Processing, Computational Linguistics
and Entity Reconciliation.
Existing works and APIs to deal with natural language issues
1

Misspelling errors [13, 8].

2

Name/acronym mismatches [17].

3

APIs such as NLTK for Python, Lingpipe, OpenNLP or Gate for Java and search engines
such as Apache Lucene/Solr.

4

Extraction of clinical terms [16] for electronic health records.

5

Creation of bibliometrics [4] or identification of gene names [7, 5].

6

Entity reconciliation processes [6, 2] using the DBPedia [10] or URI comparison [9].

7

Tools such as Google Refine, etc.

Preliminary evaluation...
Algorithms to deal with natural language heterogeneities are already available.
Existing works are usually focused in some domain (prototypes cannot be easily
customized to other domain).
...but methodologies and NLP algorithms can be re-applied to new domains.
12 / 43
The CORFU technique | MTSR 2013
Related Work

Natural Language Processing, Computational Linguistics
and Entity Reconciliation.
Existing works and APIs to deal with natural language issues
1

Misspelling errors [13, 8].

2

Name/acronym mismatches [17].

3

APIs such as NLTK for Python, Lingpipe, OpenNLP or Gate for Java and search engines
such as Apache Lucene/Solr.

4

Extraction of clinical terms [16] for electronic health records.

5

Creation of bibliometrics [4] or identification of gene names [7, 5].

6

Entity reconciliation processes [6, 2] using the DBPedia [10] or URI comparison [9].

7

Tools such as Google Refine, etc.

Preliminary evaluation...
Algorithms to deal with natural language heterogeneities are already available.
Existing works are usually focused in some domain (prototypes cannot be easily
customized to other domain).
...but methodologies and NLP algorithms can be re-applied to new domains.
13 / 43
The CORFU technique | MTSR 2013
Related Work

Corporate Information.
Corporate Databases
1

Some corporate databases: The Spanish Chambers of Commerce, “Empresia.es” or
“Axesor.es” to name a few (just in Spain).

2

The DBPedia and the Orgpedia [3].

3

The CrocTail [12] effort (part of the “Corporate Research Project”).

4

“The Open Database Of The Corporate World” [14].

5

Forbes, Google Places, Google Maps, Foursquare, Linkedin Companies or Facebook.

6

Similar initiatives: Openspending.net, LOD2 project e-Procurement, etc.

7

...

Preliminary evaluation...
Corporate information is public but access is restricted or under a fee (valuable
metadata)...
Large databases (“infobesity?”) but...
...the problem of mapping ,(n string literals → 1 company → 1 URI) as a human would
do, still remains.
14 / 43
The CORFU technique | MTSR 2013
Related Work

Corporate Information.
Corporate Databases
1

Some corporate databases: The Spanish Chambers of Commerce, “Empresia.es” or
“Axesor.es” to name a few (just in Spain).

2

The DBPedia and the Orgpedia [3].

3

The CrocTail [12] effort (part of the “Corporate Research Project”).

4

“The Open Database Of The Corporate World” [14].

5

Forbes, Google Places, Google Maps, Foursquare, Linkedin Companies or Facebook.

6

Similar initiatives: Openspending.net, LOD2 project e-Procurement, etc.

7

...

Preliminary evaluation...
Corporate information is public but access is restricted or under a fee (valuable
metadata)...
Large databases (“infobesity?”) but...
...the problem of mapping ,(n string literals → 1 company → 1 URI) as a human would
do, still remains.
15 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Company, ORganization and Firm name Unifier-CORFU (I)

16 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Company, ORganization and Firm name Unifier-CORFU (II)

17 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Example step by step...
Scenario: Australian supplier names (400K).
1st try: use of Google Refine+Open Corporates reconciliation service
(just an 8 % of unified names, see Figure below).
2nd try: design of the CORFU technique using Python NLTK and
other third-party APIs for NLP processing.

18 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 0: Load corporate names

Load
Input: a list of corporate names as raw text (one per line).
Output: a set of names represented as strings.
Example:

“Accenture Australia Holding P/L”
“Oracle (Corp) Aust Pty Ltd”

19 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 1: Normalize raw text and remove duplicates
Normalize
Input: a set of names represented as strings.
Process: this step is comprised of:
1 Remove strange characters and punctuation marks but
keeping those that are part of a word avoiding potential
changes in abbreviations or acronyms;
2 Lowercase raw text (although some semantics can be
lost previous works and empirical tests show that this is
the best approach).
3 Remove duplicates.
4 Lemmatize the corporate name.
Output: a set of normalized corporate names.
Example:

“Accenture Australia Holding PL”
“Oracle Corp Aust Pty Ltd”
20 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 2: Filter the basic set of common stopwords in English
Filter
Input: a set of normalized corporate names and a set of stopwords.
Process: load a set of stopwords (a minimal set of stopwords from the
Python NLTK API has been used):
Including common English stopwords but...
Avoiding to filter relevant words.
Output: a set of cleaned and normalized corporate names.
Example:

“Accenture Australia Holding PL”
“Oracle Corp Aust Pty Ltd”

21 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 3: Dictionary-based expansion of common acronyms
and filtering
Dictionary-based expansion
Input: a set of cleaned and normalized corporate names and a
dictionary of acronyms.
Process: load the dictionary of acronyms and expand.
Output: a set of cleaned, normalized and acronym-expanded corporate
names.
Example:

“Accenture Australia Holding Proprietary Company
Limited”
“Oracle Corporation Aust Proprietary Company Limited”

22 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 4: Filter the expanded set of most common words in
the dataset
Filter
Input: a set of cleaned, normalized and acronym-expanded corporate
names.
Process: extract statistics of “most used words” in the input dataset,
expand those words and filter.
Output: a set of cleaned, normalized and acronym-expanded corporate
names.
Example:

“Accenture Australia Holding”
“Oracle Aust”

23 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 5: Identification of contextual information and filtering

Context filtering
Input: a set of cleaned, normalized and acronym-expanded corporate
names.
Output: a set of cleaned, normalized and acronym-expanded corporate
names without contextual information.
Example:

“Accenture Holding”
“Oracle”

24 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 6: Spell checking (optional)

Spell checking
Input: a set of cleaned, normalized and acronym-expanded corporate
names without contextual information.
Output: the previous set without spelling errors.
Example:

“Accenture Holding”
“Oracle”

25 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 7: Pos-tagging parts of speech according to a grammar
and filtering the non-relevant ones

Pos-tagging
Input: a set of cleaned, normalized and acronym-expanded corporate
names without contextual information.
Output: a tree according to a pre-defined grammar for corporate
names that only contains nouns.
Example:

“Accenture”
“Oracle”

26 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 8: Cluster corporate names

Clustering
Input: a set of strings derivate from the aforementioned tree.
Output: a set of clusters for each extracted corporate name.
Example:

“Accenture”
“Oracle”

27 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Step 9: Validate and reconcile the generated corporate name
via an existing reconcile service (optional)

Validate and reconcile
Input: a set of clusters for each extracted corporate name.
Output: (n string literals → 1 company → 1 URI).
Example:

(“Accenture”, dbpedia-res:Accenture)
(“Oracle”, dbpedia-res:Oracle_Corporation)

28 / 43
The CORFU technique | MTSR 2013
The CORFU technique

Representing and querying the corporate information in
RDF...
§
:o1 a o r g : O r g anization ;
skos :prefL abel ‘‘ Oracle ’ ’;
skos:altLabel ‘‘ Oracle Corporation ’ ’ , ‘‘ Oracle ( Corp )
s ko s: c lo s eM atch dbpedia - res: Or a cl e _C o rp o ra t io n ;
...
.

¦

§

SELECT str (? label ) ( COUNT (? org ) as ? pCount ) WHERE {
? ppn : rewarded - to ? org .
? org rdf : type org : Organization .
? org skos : prefLabel ? label .
...
}
GROUP BY str (? label )
ORDER BY desc (? pCount )

¦


¤
Aust Pty Ltd ’ ’ ,

...;

¥
¤

¥

29 / 43
The CORFU technique | MTSR 2013
The CORFU technique
Use Case: the Public Spending initiative

Demo and application...

Learn more: http://publicspending.net/
30 / 43
The CORFU technique | MTSR 2013
Evaluation and Discussion

The case of Australian supplier names...
Step

Name

Customization

0

Load corporate names

430188 full names and 77526 unique names
(period 2004-2012)

1

Normalize raw text and remove duplicates

Default

2

Filter the basic set of common stopwords in
English

Default

3

Filter the expanded set of most common
words in the dataset

Two stopwords sets: 355 words (manually)
and words with more than n = 50 apparitions (automatically)

4

Dictionary-based expansion
acronyms and filtering

Set of 50 acronyms variations (manually)

5

Identification of contextual information and
filtering

Use of Geonames REST service

6

Spell checking (optional)

Train dataset of 128457 words provided by
Peter Norvig’s spell-checker [13].

7

Pos-tagging parts of speech according to a
grammar and filtering the non-relevant ones

Default

8

Cluster corporate names

Default

9

Validate and reconcile the generated corporate name via an existing reconcile service
(optional)

Python client and Google Refine

of

common

31 / 43
The CORFU technique | MTSR 2013
Evaluation and Discussion

Research Design
1
2

Configure and execute the CORFU technique.
Validate (manually) the dump of unified names and calculate:
Precision, see Eq. 1, is “the number of supplier names that have been
correctly unified under the same name”
Recall is, see Eq. 2, “the number of supplier names that have not been
correctly classified under a proper name” and F1 score, see Eq. 3,
where...
. . . tp is “the number of corporate names properly unified”
. . . fp is “the number of corporate names wrongly unified”
. . . tn is “the number of corporate names properly non-unified” and
. . . fn is “the number of corporate names wrongly non-unified”.

Precision =

tp
tp + fp

(1)

F1 = 2 ·

Recall =

Precision · Recall
Precision + Recall

tp
tp + fn

(2)
(3)
32 / 43
The CORFU technique | MTSR 2013
Evaluation and Discussion

Results of applying the CORFU approach to the Australian
supplier names.
Total number
of companies

Unique
names

430188

77526

430188

299
77526

CORFU
unified
names

Precision

Recall

F1 score

40277
in

%
of
unified
names
48 %

0,762

0,311

0,441

68

100 %

0,926

0,926

0,926

Comments
A 48 % (77526 − 40278 = 37248) of supplier names have been unified with a precision of
0,762 and a recall of 0,311 (best values must be close to 1).
The first 100 companies in the Forbes list, actually 68 companies were found in the
dataset with 299 appearances.
Results a , in this second case, show a better performance: precision, 0,926, and recall,
0,926.
a

Best values must be close to 1.
33 / 43
The CORFU technique | MTSR 2013
Evaluation and Discussion

Graphical view after applying the CORFU technique...

34 / 43
The CORFU technique | MTSR 2013
Evaluation and Discussion

Graphical view after applying the CORFU technique to the
first 100 Forbes companies...

35 / 43
The CORFU technique | MTSR 2013
Evaluation and Discussion

...the first 100 Forbes companies in bubbles...

36 / 43
The CORFU technique | MTSR 2013
Evaluation and Discussion

Discussion
Advantages
1

A custom technique for a particular domain.

2

Unification works pretty nice.

3

It enables the possibility of comparing companies in public
procurement.

Drawbacks
1

Execution time ( 20’ for Australian corporate names).

2

It is necessary to test with other datasets.

3

It requires the use of more advanced data mining techniques for
machine learning.

4

It should be applied to other domains (Bibliometrics?).
37 / 43
The CORFU technique | MTSR 2013
Evaluation and Discussion

Discussion
Advantages
1

A custom technique for a particular domain.

2

Unification works pretty nice.

3

It enables the possibility of comparing companies in public
procurement.

Drawbacks
1

Execution time ( 20’ for Australian corporate names).

2

It is necessary to test with other datasets.

3

It requires the use of more advanced data mining techniques for
machine learning.

4

It should be applied to other domains (Bibliometrics?).
38 / 43
The CORFU technique | MTSR 2013
Conclusions and Future Work

Conclusions
1

The public e-Procurement sector is seeking for new methods to:
...improve interoperability
...boost transparency
...or increase participation to name a few.

2

The PublicSpending initiative is addressing some of the challenges in
the e-Procurement sector.

3

The CORFU technique is a key enabler to ease the comparison of
countries, payers, payees, etc.

4

...a technique that helps to take the most of data.

5

It must be technically improved and extended to cover more datasets
and to be “smarter”.

39 / 43
The CORFU technique | MTSR 2013
Conclusions and Future Work

Future Work
1

Application to new public procurement datasets.

2

Reuse the Opencorporates reconciliation service (it has been updated).

3

Contribute to the e-Procurement sector and the PublicSpending
initiative [15, 11].

4

Add more advanced NLP techniques: n − grams.

5

Add machine learning algorithms to automatically classify new
corporate names.

6

Improve the execution time (performance).

7

...

40 / 43
The CORFU technique | MTSR 2013
End of the presentation...

41 / 43
The CORFU technique | MTSR 2013
Metadata and Information

Acknowledgements...
Dr. Michalis Vafopoulos
Leader of the Public Spending initiative.
E-mail: vafo@me.com
WWW: http://vafopoulos.org/
The Public Spending team:
Marios Meimaris
Giannis Xidias
Giorgos Vafeiadis
Michalis Klonaras
Panagiotis Kranidiotis
Prodromos Tsiavos
Georgia Lioliou
WWW: http://publicspending.org/
http://publicspending.net

42 / 43
The CORFU technique | MTSR 2013
Metadata and Information

Roster...
Dr. Jose María Alvarez-Rodríguez
SEERC (until August, 2013) and Carlos III
University of Madrid, Spain
E-mail: josemaria.alvarez@uc3m.es
WWW: http://www.josemalvarez.es

Prof. Dr. Patricia Ordoñez De Pablos
University of Oviedo, Spain
E-mail: patriop@uniovi.es

Prof. Dr. Jose Emilio Labra-Gayo
University of Oviedo, Spain
E-mail: labra@uniovi.es
WWW: http://www.di.uniovi.es/~labra
43 / 43
The CORFU technique | MTSR 2013
References

J. M. Alvarez-Rodríguez, J. E. Labra-Gayo, A. Rodríguez-González, and P. O. D. Pablos.
Empowering the access to public procurement opportunities by means of linking controlled vocabularies. A
case study of Product Scheme Classifications in the European e-Procurement sector.
Computers in Human Behavior, (0):–, 2013.
S. Araujo, J. Hidders, D. Schwabe, and A. P. De Vries.
SERIMI – Resource Description Similarity , RDF Instance Matching and Interlinking.
WebDB 2012, 2011.
J. Erickson.
TWC RPI’s OrgPedia Technology Demonstrator, May 2013.
http://tw.rpi.edu/orgpedia/.
C. Galvez and F. Moya-Anegón.
The unification of institutional addresses applying parametrized finite-state graphs (P-FSG).
Scientometrics, 69(2):323–345, 2006.
C. Galvez and F. Moya-Anegón.
A Dictionary-Based Approach to Normalizing Gene Names in One Domain of Knowledge from the Biomedical
Literature.
Journal of Documentation, 68(1):5–30, 2012.
R. Isele, A. Jentzsch, and C. Bizer.
Silk Server - Adding missing Links while consuming Linked Data.
In COLD, 2010.
M. Krauthammer and G. Nenadic.
Term identification in the biomedical literature.
J. of Biomedical Informatics, 37(6):512–526, Dec. 2004.
S. N. L. P. Lecture.
44 / 43
The CORFU technique | MTSR 2013
References

Spelling Correction and the Noisy Channel. The Spelling Correction Task, Mar. 2013.
http://www.stanford.edu/class/cs124/lec/spelling.pdf.
F. Maali, R. Cyganiak, and V. Peristeras.
Re-using Cool URIs: Entity Reconciliation Against LOD Hubs.
In C. Bizer, T. Heath, T. Berners-Lee, and M. Hausenblas, editors, LDOW, CEUR Workshop Proceedings.
CEUR-WS.org, 2011.
P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer.
DBpedia spotlight: shedding light on the web of documents.
In Proc. of the 7th International Conference on Semantic Systems, I-Semantics ’11, pages 1–8, New York,
NY, USA, 2011. ACM.
M. M. Michail Vafopoulos and G. X. et al.
Publicspending. gr: interconnecting and visualizing Greek public expenditure following Linked Open Data
directives, jul 2012.
G. Michalec and S. Bender-deMoll.
Browser and API for CorpWatch, May 2013.
http://croctail.corpwatch.org/.
P. Norvig.
How to Write a Spelling Corrector, Mar. 2013.
http://norvig.com/spell-correct.html.
C. Taggart and R. McKinnon.
The Open Database Of The Corporate World, May 2013.
http://opencorporates.com/.
M. Vafopoulos.
The Web economy: goods, users, models and policies. Foundations and Trends R in Web Science, volume 1.
45 / 43
The CORFU technique | MTSR 2013
References

Now Publishers Inc., 2012.
Y. Wang.
Annotating and recognising named entities in clinical notes.
In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, ACLstudent ’09, pages 18–26,
Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
S. Yeates.
Automatic Extraction of Acronyms from Text.
In University of Waikato, pages 117–124, 1999.

46 / 43
The CORFU technique | MTSR 2013
References

Towards a stepwise method for unifying and reconciling
corporate names in public contracts metadata.
The CORFU technique.
Michalis Vafopoulos (speaker)
and
{Jose María Álvarez-Rodríguez, Patricia Ordoñez De Pablos and
Jose Emilio Labra-Gayo}
MTSR 2013 | 7th Metadata and Semantics Research Conference
Track on Metadata and Semantics for Open Repositories, Research Information Systems
and Data Infrastructures

47 / 43

Mais conteúdo relacionado

Mais procurados

Data Mining with Excel 2010 and PowerPivot
Data Mining with Excel 2010 and PowerPivotData Mining with Excel 2010 and PowerPivot
Data Mining with Excel 2010 and PowerPivotMark Tabladillo
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) SkillsOscar Corcho
 
Ron Kohavi, PhD
Ron Kohavi, PhDRon Kohavi, PhD
Ron Kohavi, PhDbutest
 
Fair webinar, Ted slater: progress towards commercial fair data products and ...
Fair webinar, Ted slater: progress towards commercial fair data products and ...Fair webinar, Ted slater: progress towards commercial fair data products and ...
Fair webinar, Ted slater: progress towards commercial fair data products and ...Pistoia Alliance
 
Fighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial IntelligenceFighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial Intelligencevty
 
PROPEL . Austrian's Roadmap for Enterprise Linked Data
PROPEL . Austrian's Roadmap for Enterprise Linked DataPROPEL . Austrian's Roadmap for Enterprise Linked Data
PROPEL . Austrian's Roadmap for Enterprise Linked DataSemantic Web Company
 
Research Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibilityResearch Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibilityOscar Corcho
 
D.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationD.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationPRELIDA Project
 
Reinventing Laboratory Data To Be Bigger, Smarter & Faster
Reinventing Laboratory Data To Be Bigger, Smarter & FasterReinventing Laboratory Data To Be Bigger, Smarter & Faster
Reinventing Laboratory Data To Be Bigger, Smarter & FasterOSTHUS
 
Oop principles a good book
Oop principles a good bookOop principles a good book
Oop principles a good booklahorisher
 
Session 4.2 unleash the triple: leveraging a corporate discovery interface....
Session 4.2   unleash the triple: leveraging a corporate discovery interface....Session 4.2   unleash the triple: leveraging a corporate discovery interface....
Session 4.2 unleash the triple: leveraging a corporate discovery interface....semanticsconference
 
Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Towards Knowledge Graph based Representation, Augmentation and Exploration of...Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Towards Knowledge Graph based Representation, Augmentation and Exploration of...Sören Auer
 
II-SDV 2012 Towards Unified Access Systems for Data Exploration
II-SDV 2012 Towards Unified Access Systems for Data ExplorationII-SDV 2012 Towards Unified Access Systems for Data Exploration
II-SDV 2012 Towards Unified Access Systems for Data ExplorationDr. Haxel Consult
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 
Setting up Dataverse repository for research data
Setting up Dataverse repository for research dataSetting up Dataverse repository for research data
Setting up Dataverse repository for research datavty
 
Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...Artificial Intelligence Institute at UofSC
 
Next-generation Enterprise Information Systems - Research opportunities and c...
Next-generation Enterprise Information Systems - Research opportunities and c...Next-generation Enterprise Information Systems - Research opportunities and c...
Next-generation Enterprise Information Systems - Research opportunities and c...Milan Zdravković
 

Mais procurados (20)

Graham Cousins
Graham CousinsGraham Cousins
Graham Cousins
 
Data Mining with Excel 2010 and PowerPivot
Data Mining with Excel 2010 and PowerPivotData Mining with Excel 2010 and PowerPivot
Data Mining with Excel 2010 and PowerPivot
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) Skills
 
Ron Kohavi, PhD
Ron Kohavi, PhDRon Kohavi, PhD
Ron Kohavi, PhD
 
Fair webinar, Ted slater: progress towards commercial fair data products and ...
Fair webinar, Ted slater: progress towards commercial fair data products and ...Fair webinar, Ted slater: progress towards commercial fair data products and ...
Fair webinar, Ted slater: progress towards commercial fair data products and ...
 
The Future of LOD
The Future of LODThe Future of LOD
The Future of LOD
 
Fighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial IntelligenceFighting COVID-19 with Artificial Intelligence
Fighting COVID-19 with Artificial Intelligence
 
PROPEL . Austrian's Roadmap for Enterprise Linked Data
PROPEL . Austrian's Roadmap for Enterprise Linked DataPROPEL . Austrian's Roadmap for Enterprise Linked Data
PROPEL . Austrian's Roadmap for Enterprise Linked Data
 
Research Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibilityResearch Objects for improved sharing and reproducibility
Research Objects for improved sharing and reproducibility
 
D.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationD.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital Preservation
 
Mike Bennett
Mike BennettMike Bennett
Mike Bennett
 
Reinventing Laboratory Data To Be Bigger, Smarter & Faster
Reinventing Laboratory Data To Be Bigger, Smarter & FasterReinventing Laboratory Data To Be Bigger, Smarter & Faster
Reinventing Laboratory Data To Be Bigger, Smarter & Faster
 
Oop principles a good book
Oop principles a good bookOop principles a good book
Oop principles a good book
 
Session 4.2 unleash the triple: leveraging a corporate discovery interface....
Session 4.2   unleash the triple: leveraging a corporate discovery interface....Session 4.2   unleash the triple: leveraging a corporate discovery interface....
Session 4.2 unleash the triple: leveraging a corporate discovery interface....
 
Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Towards Knowledge Graph based Representation, Augmentation and Exploration of...Towards Knowledge Graph based Representation, Augmentation and Exploration of...
Towards Knowledge Graph based Representation, Augmentation and Exploration of...
 
II-SDV 2012 Towards Unified Access Systems for Data Exploration
II-SDV 2012 Towards Unified Access Systems for Data ExplorationII-SDV 2012 Towards Unified Access Systems for Data Exploration
II-SDV 2012 Towards Unified Access Systems for Data Exploration
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Setting up Dataverse repository for research data
Setting up Dataverse repository for research dataSetting up Dataverse repository for research data
Setting up Dataverse repository for research data
 
Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...Mastering the variety dimension of Big Data with semantic technologies: high ...
Mastering the variety dimension of Big Data with semantic technologies: high ...
 
Next-generation Enterprise Information Systems - Research opportunities and c...
Next-generation Enterprise Information Systems - Research opportunities and c...Next-generation Enterprise Information Systems - Research opportunities and c...
Next-generation Enterprise Information Systems - Research opportunities and c...
 

Destaque (7)

Simple Presentation for Slideshare
Simple Presentation for SlideshareSimple Presentation for Slideshare
Simple Presentation for Slideshare
 
HTML5 Audio & Vídeo
HTML5 Audio & VídeoHTML5 Audio & Vídeo
HTML5 Audio & Vídeo
 
Internet, Web 2.0 y Salud 2.0
Internet, Web 2.0 y Salud 2.0Internet, Web 2.0 y Salud 2.0
Internet, Web 2.0 y Salud 2.0
 
Ejemplos prácticos de Búsqueda en Salud
Ejemplos prácticos de Búsqueda en SaludEjemplos prácticos de Búsqueda en Salud
Ejemplos prácticos de Búsqueda en Salud
 
HTML5-Aplicaciones web
HTML5-Aplicaciones webHTML5-Aplicaciones web
HTML5-Aplicaciones web
 
Introducción a Sistemas de Información
Introducción a Sistemas de InformaciónIntroducción a Sistemas de Información
Introducción a Sistemas de Información
 
Introducción a "La Web como una Base de Datos"
Introducción a "La Web como una Base de Datos"Introducción a "La Web como una Base de Datos"
Introducción a "La Web como una Base de Datos"
 

Semelhante a CORFU-MTSR 2013

AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAINAN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAINcscpconf
 
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...Wim Laurier
 
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...Wim Laurier
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And FootballAmanda Gray
 
Sharing Advisory Board newsletter #8
Sharing Advisory Board newsletter #8Sharing Advisory Board newsletter #8
Sharing Advisory Board newsletter #8Carlo Vaccari
 
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | QuboleO'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | QuboleVasu S
 
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...Findwise
 
Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...
Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...
Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...hamidnazary2002
 
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...DataWorks Summit
 
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityOntology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityBarry Smith
 
IEEE augmented reality learning experience model (ARLEM)
IEEE augmented reality learning experience model (ARLEM)IEEE augmented reality learning experience model (ARLEM)
IEEE augmented reality learning experience model (ARLEM)fridolin.wild
 
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...ijnlc
 
Processing Open Data using Terradue Cloud Platform
Processing Open Data using Terradue Cloud PlatformProcessing Open Data using Terradue Cloud Platform
Processing Open Data using Terradue Cloud Platformterradue
 
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...Digitalmikkeli
 
Notes for talk on 12th June 2013 to Open Innovation meeting, Glasgow
Notes for talk on 12th June 2013 to Open Innovation meeting, GlasgowNotes for talk on 12th June 2013 to Open Innovation meeting, Glasgow
Notes for talk on 12th June 2013 to Open Innovation meeting, GlasgowPeterWinstanley1
 
Record Deduplication and Record Linkage
Record Deduplication and  Record LinkageRecord Deduplication and  Record Linkage
Record Deduplication and Record LinkageCRISLANIO MACEDO
 
Extraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web EngineeringExtraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web EngineeringIRJET Journal
 

Semelhante a CORFU-MTSR 2013 (20)

AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAINAN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
AN ONTOLOGY-BASED DATA WAREHOUSE FOR THE GRAIN TRADE DOMAIN
 
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
 
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
Thought Leadership Session: Enterprise Semantics & Ontology, The Power of Und...
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And Football
 
Sharing Advisory Board newsletter #8
Sharing Advisory Board newsletter #8Sharing Advisory Board newsletter #8
Sharing Advisory Board newsletter #8
 
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | QuboleO'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
O'Reilly ebook: Machine Learning at Enterprise Scale | Qubole
 
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
 
Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...
Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...
Enterprise and Data Mining Ontology Integration to Extract Actionable Knowled...
 
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
 
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityOntology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
 
Rdaeu russia_fg_1_july2014_final
Rdaeu  russia_fg_1_july2014_finalRdaeu  russia_fg_1_july2014_final
Rdaeu russia_fg_1_july2014_final
 
IEEE augmented reality learning experience model (ARLEM)
IEEE augmented reality learning experience model (ARLEM)IEEE augmented reality learning experience model (ARLEM)
IEEE augmented reality learning experience model (ARLEM)
 
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
 
Processing Open Data using Terradue Cloud Platform
Processing Open Data using Terradue Cloud PlatformProcessing Open Data using Terradue Cloud Platform
Processing Open Data using Terradue Cloud Platform
 
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
Datajalostamo-seminaari 5.6.2014: Tutkimusdatan avoimuus – globaalit tutkimus...
 
Notes for talk on 12th June 2013 to Open Innovation meeting, Glasgow
Notes for talk on 12th June 2013 to Open Innovation meeting, GlasgowNotes for talk on 12th June 2013 to Open Innovation meeting, Glasgow
Notes for talk on 12th June 2013 to Open Innovation meeting, Glasgow
 
Record Deduplication and Record Linkage
Record Deduplication and  Record LinkageRecord Deduplication and  Record Linkage
Record Deduplication and Record Linkage
 
The Value and Benefits of Data-to-Text Technologies
The Value and Benefits of Data-to-Text TechnologiesThe Value and Benefits of Data-to-Text Technologies
The Value and Benefits of Data-to-Text Technologies
 
Database Essay
Database EssayDatabase Essay
Database Essay
 
Extraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web EngineeringExtraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web Engineering
 

Mais de CARLOS III UNIVERSITY OF MADRID

Engineering 4.0: Digitization through task automation and reuse
Engineering 4.0:  Digitization through task automation and reuseEngineering 4.0:  Digitization through task automation and reuse
Engineering 4.0: Digitization through task automation and reuseCARLOS III UNIVERSITY OF MADRID
 
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...CARLOS III UNIVERSITY OF MADRID
 
Sailing the V: Engineering digitalization through task automation and reuse i...
Sailing the V: Engineering digitalization through task automation and reuse i...Sailing the V: Engineering digitalization through task automation and reuse i...
Sailing the V: Engineering digitalization through task automation and reuse i...CARLOS III UNIVERSITY OF MADRID
 
AI4SE: Challenges and opportunities in the integration of Systems Engineering...
AI4SE: Challenges and opportunities in the integration of Systems Engineering...AI4SE: Challenges and opportunities in the integration of Systems Engineering...
AI4SE: Challenges and opportunities in the integration of Systems Engineering...CARLOS III UNIVERSITY OF MADRID
 
Challenges in the integration of Systems Engineering and the AI/ML model life...
Challenges in the integration of Systems Engineering and the AI/ML model life...Challenges in the integration of Systems Engineering and the AI/ML model life...
Challenges in the integration of Systems Engineering and the AI/ML model life...CARLOS III UNIVERSITY OF MADRID
 
OSLC KM: Elevating the meaning of data and operations within the toolchain
OSLC KM: Elevating the meaning of data and operations within the toolchainOSLC KM: Elevating the meaning of data and operations within the toolchain
OSLC KM: Elevating the meaning of data and operations within the toolchainCARLOS III UNIVERSITY OF MADRID
 
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...CARLOS III UNIVERSITY OF MADRID
 
Systems and Software Architecture: an introduction to architectural modelling
Systems and Software Architecture: an introduction to architectural modellingSystems and Software Architecture: an introduction to architectural modelling
Systems and Software Architecture: an introduction to architectural modellingCARLOS III UNIVERSITY OF MADRID
 
Detection of fraud in financial blockchain-based transactions through big dat...
Detection of fraud in financial blockchain-based transactions through big dat...Detection of fraud in financial blockchain-based transactions through big dat...
Detection of fraud in financial blockchain-based transactions through big dat...CARLOS III UNIVERSITY OF MADRID
 
News headline generation with sentiment and patterns: A case study of sports ...
News headline generation with sentiment and patterns: A case study of sports ...News headline generation with sentiment and patterns: A case study of sports ...
News headline generation with sentiment and patterns: A case study of sports ...CARLOS III UNIVERSITY OF MADRID
 

Mais de CARLOS III UNIVERSITY OF MADRID (20)

Proyecto IVERES-UC3M
Proyecto IVERES-UC3MProyecto IVERES-UC3M
Proyecto IVERES-UC3M
 
RTVE: Sustainable Development Goal Radar
RTVE: Sustainable Development Goal  RadarRTVE: Sustainable Development Goal  Radar
RTVE: Sustainable Development Goal Radar
 
Engineering 4.0: Digitization through task automation and reuse
Engineering 4.0:  Digitization through task automation and reuseEngineering 4.0:  Digitization through task automation and reuse
Engineering 4.0: Digitization through task automation and reuse
 
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
LOTAR-PDES: Engineering digitalization through task automation and reuse in t...
 
SESE 2021: Where Systems Engineering meets AI/ML
SESE 2021: Where Systems Engineering meets AI/MLSESE 2021: Where Systems Engineering meets AI/ML
SESE 2021: Where Systems Engineering meets AI/ML
 
Sailing the V: Engineering digitalization through task automation and reuse i...
Sailing the V: Engineering digitalization through task automation and reuse i...Sailing the V: Engineering digitalization through task automation and reuse i...
Sailing the V: Engineering digitalization through task automation and reuse i...
 
Deep Learning Notes
Deep Learning NotesDeep Learning Notes
Deep Learning Notes
 
H2020-AHTOOLS Use Case 3 Functional Design
H2020-AHTOOLS Use Case 3 Functional DesignH2020-AHTOOLS Use Case 3 Functional Design
H2020-AHTOOLS Use Case 3 Functional Design
 
AI4SE: Challenges and opportunities in the integration of Systems Engineering...
AI4SE: Challenges and opportunities in the integration of Systems Engineering...AI4SE: Challenges and opportunities in the integration of Systems Engineering...
AI4SE: Challenges and opportunities in the integration of Systems Engineering...
 
INCOSE IS 2019: AI and Systems Engineering
INCOSE IS 2019: AI and Systems EngineeringINCOSE IS 2019: AI and Systems Engineering
INCOSE IS 2019: AI and Systems Engineering
 
Challenges in the integration of Systems Engineering and the AI/ML model life...
Challenges in the integration of Systems Engineering and the AI/ML model life...Challenges in the integration of Systems Engineering and the AI/ML model life...
Challenges in the integration of Systems Engineering and the AI/ML model life...
 
Blockchain en la Industria Musical
Blockchain en la Industria MusicalBlockchain en la Industria Musical
Blockchain en la Industria Musical
 
OSLC KM: Elevating the meaning of data and operations within the toolchain
OSLC KM: Elevating the meaning of data and operations within the toolchainOSLC KM: Elevating the meaning of data and operations within the toolchain
OSLC KM: Elevating the meaning of data and operations within the toolchain
 
Blockchain y sector asegurador
Blockchain y sector aseguradorBlockchain y sector asegurador
Blockchain y sector asegurador
 
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
 
Systems and Software Architecture: an introduction to architectural modelling
Systems and Software Architecture: an introduction to architectural modellingSystems and Software Architecture: an introduction to architectural modelling
Systems and Software Architecture: an introduction to architectural modelling
 
Detection of fraud in financial blockchain-based transactions through big dat...
Detection of fraud in financial blockchain-based transactions through big dat...Detection of fraud in financial blockchain-based transactions through big dat...
Detection of fraud in financial blockchain-based transactions through big dat...
 
News headline generation with sentiment and patterns: A case study of sports ...
News headline generation with sentiment and patterns: A case study of sports ...News headline generation with sentiment and patterns: A case study of sports ...
News headline generation with sentiment and patterns: A case study of sports ...
 
Blockchain y la industria musical
Blockchain y la industria musicalBlockchain y la industria musical
Blockchain y la industria musical
 
Preparing your Big Data start-up pitch
Preparing your Big Data start-up pitchPreparing your Big Data start-up pitch
Preparing your Big Data start-up pitch
 

Último

4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxJanEmmanBrigoli
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxElton John Embodo
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxruthvilladarez
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 

Último (20)

4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptx
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docx
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docx
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 

CORFU-MTSR 2013

  • 1. The CORFU technique | MTSR 2013 Towards a stepwise method for unifying and reconciling corporate names in public contracts metadata. The CORFU technique. Michalis Vafopoulos (speaker) and {Jose María Álvarez-Rodríguez, Patricia Ordoñez De Pablos and Jose Emilio Labra-Gayo} MTSR 2013 | 7th Metadata and Semantics Research Conference Track on Metadata and Semantics for Open Repositories, Research Information Systems and Data Infrastructures 1 / 43
  • 2. The CORFU technique | MTSR 2013 1 Introduction 2 Related Work 3 The CORFU technique Use Case: the Public Spending initiative 4 Evaluation and Discussion 5 Conclusions and Future Work 6 Metadata and Information 2 / 43
  • 3. The CORFU technique | MTSR 2013 Introduction The Problem...What is the “Big Name”? “Oracle (Corp) Aust Pty Ltd” “Oracle Corp (Aust) Pty Ltd” “Oracle Corp Aust Pty Ltd” “Oracle Corp. Australia” “Oracle Corp. Australia Pty.Ltd.” “Oracle Corpoartion (Aust) Pty Ltd” “Oracle” “Oracle Corporate Aust Pty Ltd” “Oracle Corporation” “Oracle Risk Consultants” “ORACLE SYSTEMS (AUSTRALIA) PTY LTD” “Oracle University” ... “Accenture” “Accenture Aust Holdings” “Accenture Aust Holdings” “Accenture Aust Holdings Pty Ltd” “Accenture Australia Holding P/L” “Accenture” “Accenture Australia Holdings P/Ltd” “Accenture Australia Holdings Pty Lt” “Accenture Australia Limited” ... ... ... 3 / 43
  • 4. The CORFU technique | MTSR 2013 Introduction ...and if you have 400K corporate names 4 / 43
  • 5. The CORFU technique | MTSR 2013 Introduction Scope Public Procurement 1 e-Procurement is a strategic sector (17 % of the GDP in Europe). 2 Action Plans 2004 and 2020. 3 Projects: E-Certis, Fiscalis 2013, E-Prior, PEPPOL, STORK,etc. 4 Other actions: TED, RAMON metadata server, CPV, NUTS, etc. 5 Legal Framework. 6 Boost participation with special focus on SMEs.s ...but it also requires... 1 Accomplish with Open Data principles. 2 Improve transparency of public bodies. 3 Track where public money goes. 4 ... 5 / 43
  • 6. The CORFU technique | MTSR 2013 Introduction Scope Public Procurement 1 e-Procurement is a strategic sector (17 % of the GDP in Europe). 2 Action Plans 2004 and 2020. 3 Projects: E-Certis, Fiscalis 2013, E-Prior, PEPPOL, STORK,etc. 4 Other actions: TED, RAMON metadata server, CPV, NUTS, etc. 5 Legal Framework. 6 Boost participation with special focus on SMEs.s ...but it also requires... 1 Accomplish with Open Data principles. 2 Improve transparency of public bodies. 3 Track where public money goes. 4 ... 6 / 43
  • 7. The CORFU technique | MTSR 2013 Introduction The Problem... How can we track public procurement processes? 1 2 Data and information is already out there. Relevant metadata can be (re)used: Normalized product scheme classifications such as the CPV 2008 (Common Procurement Vocabulary) [1]. Territorial units (NUTS). Currency. 3 ... ...and “names”? Both Payer and Payee names are not usually normalized. Normalized and unified names (“Big Name”)... ...with the aim of tracking both payers and payees. 7 / 43
  • 8. The CORFU technique | MTSR 2013 Introduction The Problem... How can we track public procurement processes? 1 2 Data and information is already out there. Relevant metadata can be (re)used: Normalized product scheme classifications such as the CPV 2008 (Common Procurement Vocabulary) [1]. Territorial units (NUTS). Currency. 3 ... ...and “names”? Both Payer and Payee names are not usually normalized. Normalized and unified names (“Big Name”)... ...with the aim of tracking both payers and payees. 8 / 43
  • 9. The CORFU technique | MTSR 2013 Introduction The Problem... How can we track public procurement processes? 1 2 Data and information is already out there. Relevant metadata can be (re)used: Normalized product scheme classifications such as the CPV 2008 (Common Procurement Vocabulary) [1]. Territorial units (NUTS). Currency. 3 ... ...and “names”? Both Payer and Payee names are not usually normalized. Normalized and unified names (“Big Name”)... ...with the aim of tracking both payers and payees. 9 / 43
  • 10. The CORFU technique | MTSR 2013 Introduction The Problem... Some remarks... 1 It is not a mere problem of reconciling entities (dealing with)... Misspelling errors. Name/acronym mismatches. ... 2 3 ...but to create a unique name or link (n string literals → 1 company → 1 URI). E.g. “Oracle” and “Oracle University” could be respectively aligned to the entities <Oracle_Corporation> and <Oracle_University> ...but the problem of grouping by a unique (Big ) name, identifier or resource still remains. 10 / 43
  • 11. The CORFU technique | MTSR 2013 Introduction The Public Spending initiative... ...a joint effort trying to answer... 1 Who really gets the public money? 2 For what? From whom? 3 Can we compare them? 4 Is public spending effective? 5 ... Learn more: http://publicspending.net/ 11 / 43
  • 12. The CORFU technique | MTSR 2013 Related Work Natural Language Processing, Computational Linguistics and Entity Reconciliation. Existing works and APIs to deal with natural language issues 1 Misspelling errors [13, 8]. 2 Name/acronym mismatches [17]. 3 APIs such as NLTK for Python, Lingpipe, OpenNLP or Gate for Java and search engines such as Apache Lucene/Solr. 4 Extraction of clinical terms [16] for electronic health records. 5 Creation of bibliometrics [4] or identification of gene names [7, 5]. 6 Entity reconciliation processes [6, 2] using the DBPedia [10] or URI comparison [9]. 7 Tools such as Google Refine, etc. Preliminary evaluation... Algorithms to deal with natural language heterogeneities are already available. Existing works are usually focused in some domain (prototypes cannot be easily customized to other domain). ...but methodologies and NLP algorithms can be re-applied to new domains. 12 / 43
  • 13. The CORFU technique | MTSR 2013 Related Work Natural Language Processing, Computational Linguistics and Entity Reconciliation. Existing works and APIs to deal with natural language issues 1 Misspelling errors [13, 8]. 2 Name/acronym mismatches [17]. 3 APIs such as NLTK for Python, Lingpipe, OpenNLP or Gate for Java and search engines such as Apache Lucene/Solr. 4 Extraction of clinical terms [16] for electronic health records. 5 Creation of bibliometrics [4] or identification of gene names [7, 5]. 6 Entity reconciliation processes [6, 2] using the DBPedia [10] or URI comparison [9]. 7 Tools such as Google Refine, etc. Preliminary evaluation... Algorithms to deal with natural language heterogeneities are already available. Existing works are usually focused in some domain (prototypes cannot be easily customized to other domain). ...but methodologies and NLP algorithms can be re-applied to new domains. 13 / 43
  • 14. The CORFU technique | MTSR 2013 Related Work Corporate Information. Corporate Databases 1 Some corporate databases: The Spanish Chambers of Commerce, “Empresia.es” or “Axesor.es” to name a few (just in Spain). 2 The DBPedia and the Orgpedia [3]. 3 The CrocTail [12] effort (part of the “Corporate Research Project”). 4 “The Open Database Of The Corporate World” [14]. 5 Forbes, Google Places, Google Maps, Foursquare, Linkedin Companies or Facebook. 6 Similar initiatives: Openspending.net, LOD2 project e-Procurement, etc. 7 ... Preliminary evaluation... Corporate information is public but access is restricted or under a fee (valuable metadata)... Large databases (“infobesity?”) but... ...the problem of mapping ,(n string literals → 1 company → 1 URI) as a human would do, still remains. 14 / 43
  • 15. The CORFU technique | MTSR 2013 Related Work Corporate Information. Corporate Databases 1 Some corporate databases: The Spanish Chambers of Commerce, “Empresia.es” or “Axesor.es” to name a few (just in Spain). 2 The DBPedia and the Orgpedia [3]. 3 The CrocTail [12] effort (part of the “Corporate Research Project”). 4 “The Open Database Of The Corporate World” [14]. 5 Forbes, Google Places, Google Maps, Foursquare, Linkedin Companies or Facebook. 6 Similar initiatives: Openspending.net, LOD2 project e-Procurement, etc. 7 ... Preliminary evaluation... Corporate information is public but access is restricted or under a fee (valuable metadata)... Large databases (“infobesity?”) but... ...the problem of mapping ,(n string literals → 1 company → 1 URI) as a human would do, still remains. 15 / 43
  • 16. The CORFU technique | MTSR 2013 The CORFU technique Company, ORganization and Firm name Unifier-CORFU (I) 16 / 43
  • 17. The CORFU technique | MTSR 2013 The CORFU technique Company, ORganization and Firm name Unifier-CORFU (II) 17 / 43
  • 18. The CORFU technique | MTSR 2013 The CORFU technique Example step by step... Scenario: Australian supplier names (400K). 1st try: use of Google Refine+Open Corporates reconciliation service (just an 8 % of unified names, see Figure below). 2nd try: design of the CORFU technique using Python NLTK and other third-party APIs for NLP processing. 18 / 43
  • 19. The CORFU technique | MTSR 2013 The CORFU technique Step 0: Load corporate names Load Input: a list of corporate names as raw text (one per line). Output: a set of names represented as strings. Example: “Accenture Australia Holding P/L” “Oracle (Corp) Aust Pty Ltd” 19 / 43
  • 20. The CORFU technique | MTSR 2013 The CORFU technique Step 1: Normalize raw text and remove duplicates Normalize Input: a set of names represented as strings. Process: this step is comprised of: 1 Remove strange characters and punctuation marks but keeping those that are part of a word avoiding potential changes in abbreviations or acronyms; 2 Lowercase raw text (although some semantics can be lost previous works and empirical tests show that this is the best approach). 3 Remove duplicates. 4 Lemmatize the corporate name. Output: a set of normalized corporate names. Example: “Accenture Australia Holding PL” “Oracle Corp Aust Pty Ltd” 20 / 43
  • 21. The CORFU technique | MTSR 2013 The CORFU technique Step 2: Filter the basic set of common stopwords in English Filter Input: a set of normalized corporate names and a set of stopwords. Process: load a set of stopwords (a minimal set of stopwords from the Python NLTK API has been used): Including common English stopwords but... Avoiding to filter relevant words. Output: a set of cleaned and normalized corporate names. Example: “Accenture Australia Holding PL” “Oracle Corp Aust Pty Ltd” 21 / 43
  • 22. The CORFU technique | MTSR 2013 The CORFU technique Step 3: Dictionary-based expansion of common acronyms and filtering Dictionary-based expansion Input: a set of cleaned and normalized corporate names and a dictionary of acronyms. Process: load the dictionary of acronyms and expand. Output: a set of cleaned, normalized and acronym-expanded corporate names. Example: “Accenture Australia Holding Proprietary Company Limited” “Oracle Corporation Aust Proprietary Company Limited” 22 / 43
  • 23. The CORFU technique | MTSR 2013 The CORFU technique Step 4: Filter the expanded set of most common words in the dataset Filter Input: a set of cleaned, normalized and acronym-expanded corporate names. Process: extract statistics of “most used words” in the input dataset, expand those words and filter. Output: a set of cleaned, normalized and acronym-expanded corporate names. Example: “Accenture Australia Holding” “Oracle Aust” 23 / 43
  • 24. The CORFU technique | MTSR 2013 The CORFU technique Step 5: Identification of contextual information and filtering Context filtering Input: a set of cleaned, normalized and acronym-expanded corporate names. Output: a set of cleaned, normalized and acronym-expanded corporate names without contextual information. Example: “Accenture Holding” “Oracle” 24 / 43
  • 25. The CORFU technique | MTSR 2013 The CORFU technique Step 6: Spell checking (optional) Spell checking Input: a set of cleaned, normalized and acronym-expanded corporate names without contextual information. Output: the previous set without spelling errors. Example: “Accenture Holding” “Oracle” 25 / 43
  • 26. The CORFU technique | MTSR 2013 The CORFU technique Step 7: Pos-tagging parts of speech according to a grammar and filtering the non-relevant ones Pos-tagging Input: a set of cleaned, normalized and acronym-expanded corporate names without contextual information. Output: a tree according to a pre-defined grammar for corporate names that only contains nouns. Example: “Accenture” “Oracle” 26 / 43
  • 27. The CORFU technique | MTSR 2013 The CORFU technique Step 8: Cluster corporate names Clustering Input: a set of strings derivate from the aforementioned tree. Output: a set of clusters for each extracted corporate name. Example: “Accenture” “Oracle” 27 / 43
  • 28. The CORFU technique | MTSR 2013 The CORFU technique Step 9: Validate and reconcile the generated corporate name via an existing reconcile service (optional) Validate and reconcile Input: a set of clusters for each extracted corporate name. Output: (n string literals → 1 company → 1 URI). Example: (“Accenture”, dbpedia-res:Accenture) (“Oracle”, dbpedia-res:Oracle_Corporation) 28 / 43
  • 29. The CORFU technique | MTSR 2013 The CORFU technique Representing and querying the corporate information in RDF... § :o1 a o r g : O r g anization ; skos :prefL abel ‘‘ Oracle ’ ’; skos:altLabel ‘‘ Oracle Corporation ’ ’ , ‘‘ Oracle ( Corp ) s ko s: c lo s eM atch dbpedia - res: Or a cl e _C o rp o ra t io n ; ... . ¦ § SELECT str (? label ) ( COUNT (? org ) as ? pCount ) WHERE { ? ppn : rewarded - to ? org . ? org rdf : type org : Organization . ? org skos : prefLabel ? label . ... } GROUP BY str (? label ) ORDER BY desc (? pCount ) ¦ ¤ Aust Pty Ltd ’ ’ , ...; ¥ ¤ ¥ 29 / 43
  • 30. The CORFU technique | MTSR 2013 The CORFU technique Use Case: the Public Spending initiative Demo and application... Learn more: http://publicspending.net/ 30 / 43
  • 31. The CORFU technique | MTSR 2013 Evaluation and Discussion The case of Australian supplier names... Step Name Customization 0 Load corporate names 430188 full names and 77526 unique names (period 2004-2012) 1 Normalize raw text and remove duplicates Default 2 Filter the basic set of common stopwords in English Default 3 Filter the expanded set of most common words in the dataset Two stopwords sets: 355 words (manually) and words with more than n = 50 apparitions (automatically) 4 Dictionary-based expansion acronyms and filtering Set of 50 acronyms variations (manually) 5 Identification of contextual information and filtering Use of Geonames REST service 6 Spell checking (optional) Train dataset of 128457 words provided by Peter Norvig’s spell-checker [13]. 7 Pos-tagging parts of speech according to a grammar and filtering the non-relevant ones Default 8 Cluster corporate names Default 9 Validate and reconcile the generated corporate name via an existing reconcile service (optional) Python client and Google Refine of common 31 / 43
  • 32. The CORFU technique | MTSR 2013 Evaluation and Discussion Research Design 1 2 Configure and execute the CORFU technique. Validate (manually) the dump of unified names and calculate: Precision, see Eq. 1, is “the number of supplier names that have been correctly unified under the same name” Recall is, see Eq. 2, “the number of supplier names that have not been correctly classified under a proper name” and F1 score, see Eq. 3, where... . . . tp is “the number of corporate names properly unified” . . . fp is “the number of corporate names wrongly unified” . . . tn is “the number of corporate names properly non-unified” and . . . fn is “the number of corporate names wrongly non-unified”. Precision = tp tp + fp (1) F1 = 2 · Recall = Precision · Recall Precision + Recall tp tp + fn (2) (3) 32 / 43
  • 33. The CORFU technique | MTSR 2013 Evaluation and Discussion Results of applying the CORFU approach to the Australian supplier names. Total number of companies Unique names 430188 77526 430188 299 77526 CORFU unified names Precision Recall F1 score 40277 in % of unified names 48 % 0,762 0,311 0,441 68 100 % 0,926 0,926 0,926 Comments A 48 % (77526 − 40278 = 37248) of supplier names have been unified with a precision of 0,762 and a recall of 0,311 (best values must be close to 1). The first 100 companies in the Forbes list, actually 68 companies were found in the dataset with 299 appearances. Results a , in this second case, show a better performance: precision, 0,926, and recall, 0,926. a Best values must be close to 1. 33 / 43
  • 34. The CORFU technique | MTSR 2013 Evaluation and Discussion Graphical view after applying the CORFU technique... 34 / 43
  • 35. The CORFU technique | MTSR 2013 Evaluation and Discussion Graphical view after applying the CORFU technique to the first 100 Forbes companies... 35 / 43
  • 36. The CORFU technique | MTSR 2013 Evaluation and Discussion ...the first 100 Forbes companies in bubbles... 36 / 43
  • 37. The CORFU technique | MTSR 2013 Evaluation and Discussion Discussion Advantages 1 A custom technique for a particular domain. 2 Unification works pretty nice. 3 It enables the possibility of comparing companies in public procurement. Drawbacks 1 Execution time ( 20’ for Australian corporate names). 2 It is necessary to test with other datasets. 3 It requires the use of more advanced data mining techniques for machine learning. 4 It should be applied to other domains (Bibliometrics?). 37 / 43
  • 38. The CORFU technique | MTSR 2013 Evaluation and Discussion Discussion Advantages 1 A custom technique for a particular domain. 2 Unification works pretty nice. 3 It enables the possibility of comparing companies in public procurement. Drawbacks 1 Execution time ( 20’ for Australian corporate names). 2 It is necessary to test with other datasets. 3 It requires the use of more advanced data mining techniques for machine learning. 4 It should be applied to other domains (Bibliometrics?). 38 / 43
  • 39. The CORFU technique | MTSR 2013 Conclusions and Future Work Conclusions 1 The public e-Procurement sector is seeking for new methods to: ...improve interoperability ...boost transparency ...or increase participation to name a few. 2 The PublicSpending initiative is addressing some of the challenges in the e-Procurement sector. 3 The CORFU technique is a key enabler to ease the comparison of countries, payers, payees, etc. 4 ...a technique that helps to take the most of data. 5 It must be technically improved and extended to cover more datasets and to be “smarter”. 39 / 43
  • 40. The CORFU technique | MTSR 2013 Conclusions and Future Work Future Work 1 Application to new public procurement datasets. 2 Reuse the Opencorporates reconciliation service (it has been updated). 3 Contribute to the e-Procurement sector and the PublicSpending initiative [15, 11]. 4 Add more advanced NLP techniques: n − grams. 5 Add machine learning algorithms to automatically classify new corporate names. 6 Improve the execution time (performance). 7 ... 40 / 43
  • 41. The CORFU technique | MTSR 2013 End of the presentation... 41 / 43
  • 42. The CORFU technique | MTSR 2013 Metadata and Information Acknowledgements... Dr. Michalis Vafopoulos Leader of the Public Spending initiative. E-mail: vafo@me.com WWW: http://vafopoulos.org/ The Public Spending team: Marios Meimaris Giannis Xidias Giorgos Vafeiadis Michalis Klonaras Panagiotis Kranidiotis Prodromos Tsiavos Georgia Lioliou WWW: http://publicspending.org/ http://publicspending.net 42 / 43
  • 43. The CORFU technique | MTSR 2013 Metadata and Information Roster... Dr. Jose María Alvarez-Rodríguez SEERC (until August, 2013) and Carlos III University of Madrid, Spain E-mail: josemaria.alvarez@uc3m.es WWW: http://www.josemalvarez.es Prof. Dr. Patricia Ordoñez De Pablos University of Oviedo, Spain E-mail: patriop@uniovi.es Prof. Dr. Jose Emilio Labra-Gayo University of Oviedo, Spain E-mail: labra@uniovi.es WWW: http://www.di.uniovi.es/~labra 43 / 43
  • 44. The CORFU technique | MTSR 2013 References J. M. Alvarez-Rodríguez, J. E. Labra-Gayo, A. Rodríguez-González, and P. O. D. Pablos. Empowering the access to public procurement opportunities by means of linking controlled vocabularies. A case study of Product Scheme Classifications in the European e-Procurement sector. Computers in Human Behavior, (0):–, 2013. S. Araujo, J. Hidders, D. Schwabe, and A. P. De Vries. SERIMI – Resource Description Similarity , RDF Instance Matching and Interlinking. WebDB 2012, 2011. J. Erickson. TWC RPI’s OrgPedia Technology Demonstrator, May 2013. http://tw.rpi.edu/orgpedia/. C. Galvez and F. Moya-Anegón. The unification of institutional addresses applying parametrized finite-state graphs (P-FSG). Scientometrics, 69(2):323–345, 2006. C. Galvez and F. Moya-Anegón. A Dictionary-Based Approach to Normalizing Gene Names in One Domain of Knowledge from the Biomedical Literature. Journal of Documentation, 68(1):5–30, 2012. R. Isele, A. Jentzsch, and C. Bizer. Silk Server - Adding missing Links while consuming Linked Data. In COLD, 2010. M. Krauthammer and G. Nenadic. Term identification in the biomedical literature. J. of Biomedical Informatics, 37(6):512–526, Dec. 2004. S. N. L. P. Lecture. 44 / 43
  • 45. The CORFU technique | MTSR 2013 References Spelling Correction and the Noisy Channel. The Spelling Correction Task, Mar. 2013. http://www.stanford.edu/class/cs124/lec/spelling.pdf. F. Maali, R. Cyganiak, and V. Peristeras. Re-using Cool URIs: Entity Reconciliation Against LOD Hubs. In C. Bizer, T. Heath, T. Berners-Lee, and M. Hausenblas, editors, LDOW, CEUR Workshop Proceedings. CEUR-WS.org, 2011. P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia spotlight: shedding light on the web of documents. In Proc. of the 7th International Conference on Semantic Systems, I-Semantics ’11, pages 1–8, New York, NY, USA, 2011. ACM. M. M. Michail Vafopoulos and G. X. et al. Publicspending. gr: interconnecting and visualizing Greek public expenditure following Linked Open Data directives, jul 2012. G. Michalec and S. Bender-deMoll. Browser and API for CorpWatch, May 2013. http://croctail.corpwatch.org/. P. Norvig. How to Write a Spelling Corrector, Mar. 2013. http://norvig.com/spell-correct.html. C. Taggart and R. McKinnon. The Open Database Of The Corporate World, May 2013. http://opencorporates.com/. M. Vafopoulos. The Web economy: goods, users, models and policies. Foundations and Trends R in Web Science, volume 1. 45 / 43
  • 46. The CORFU technique | MTSR 2013 References Now Publishers Inc., 2012. Y. Wang. Annotating and recognising named entities in clinical notes. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, ACLstudent ’09, pages 18–26, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. S. Yeates. Automatic Extraction of Acronyms from Text. In University of Waikato, pages 117–124, 1999. 46 / 43
  • 47. The CORFU technique | MTSR 2013 References Towards a stepwise method for unifying and reconciling corporate names in public contracts metadata. The CORFU technique. Michalis Vafopoulos (speaker) and {Jose María Álvarez-Rodríguez, Patricia Ordoñez De Pablos and Jose Emilio Labra-Gayo} MTSR 2013 | 7th Metadata and Semantics Research Conference Track on Metadata and Semantics for Open Repositories, Research Information Systems and Data Infrastructures 47 / 43