BibBase Linked Data Triplification Challenge 2010 Presentation
1. BibBase Triplified http://data.bibbase.org/
Presented by:
Reynold S. Xin (UC Berkeley)
Joint work with:
Oktie Hassanzadeh, Yang Yang, Jiang Du, Minghua Zhao,
Renee J. Miller (University of Toronto)
Christian Fritz (University of Southern California)
2. Outline
Goals and Status
Duplicate detection
Interlinking of data sources
Additional features
Conclusions and future work
3.
4.
5. Goals http://www.bibbase.org
Makes it easy for scientists to maintain publications pages
Scientists maintain a bibtex file; BibBase does the rest
Publishes them in HTML
6. Goals http://data.bibbase.org
Makes it easy for scientists to maintain publications pages
Scientists maintain a bibtex file; BibBase does the rest
Publishes them in HTML
Publishes them in RDF
Links entries to the open linked data cloud
Given this incentive, scientists help us build a
bibliographic database (think DBLP, but automated)
An invaluable data set for benchmarking duplicate
detection and semantic link discovery systems
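As a rough illustration of what publishing an entry in RDF involves, the sketch below turns one parsed BibTeX entry into N-Triples lines. The URI path layout, the `entry_to_ntriples` helper, and the use of Dublin Core terms are assumptions made for this example, not BibBase's actual schema.

```python
# Hypothetical sketch: serialize one parsed BibTeX entry as N-Triples.
# The URI layout and the Dublin Core vocabulary are illustrative assumptions.

BASE = "http://data.bibbase.org/"   # base URL from the slides; the paths below are made up
DC = "http://purl.org/dc/terms/"    # Dublin Core terms namespace


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def entry_to_ntriples(key, entry):
    """Return a list of N-Triples lines for a dict-like BibTeX entry."""
    subject = f"<{BASE}publication/{key}>"
    lines = []
    if "title" in entry:
        lines.append(f'{subject} <{DC}title> "{escape_literal(entry["title"])}" .')
    if "year" in entry:
        lines.append(f'{subject} <{DC}issued> "{escape_literal(entry["year"])}" .')
    # Authors are assumed to be pre-split into a list of name strings.
    for author in entry.get("author", []):
        lines.append(f'{subject} <{DC}creator> "{escape_literal(author)}" .')
    return lines
```

A fuller version would emit author resources with their own URIs rather than plain literals, which is what makes links into the linked data cloud possible.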
7.
8. Some statistics
“Beta” went online in June 2010
As of yesterday (September 1, 2010)
~ 100 active users
4520 publications, 4883 authors, 502 journals, 1881
proceedings, 88 keywords
39201 author links, 2768 publication links, 30 keyword links
Note that this is before we have done any form of “marketing”
9. Duplicate Detection
Examples
Authors: “Renee J. Miller” or “R. J. Miller” or “RJ Miller”
Publication entries
Journals & conferences: “VLDB” or “Very Large Data Base”
Solutions
Local detection (within a single bibtex file)
Global detection (across multiple files)
10. Local Detection
A set of predefined rules to identify duplicates.
E.g. within a single file, it is highly likely that “Renee J Miller” is
the same as “RJ Miller”.
Users can specify a suffix to the name to differentiate
them (DBLP approach).
E.g. “Min Wang” vs “Min Wang2”
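One such rule could look like the sketch below: two names in the same file match when their last names are equal and their given names agree up to initials. The function names and the exact rule are assumptions for illustration; the slides do not list the actual rule set. Note that a numeric suffix such as "Wang2" changes the last-name token, so suffixed names are kept apart.

```python
# Hypothetical local-detection rule: same last name, given names equal up to initials.
# Simplification: any all-uppercase token (e.g. "RJ") is treated as run-together initials,
# so a last name written in all capitals would be mishandled.

def name_tokens(name):
    """Split a personal name into lowercase tokens, expanding run-together initials."""
    tokens = []
    for raw in name.replace(".", " ").split():
        if raw.isupper() and len(raw) > 1:
            tokens.extend(ch.lower() for ch in raw)
        else:
            tokens.append(raw.lower())
    return tokens


def tokens_compatible(a, b):
    """Two given-name tokens agree if equal, or if one is the initial of the other."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def likely_same_author(name1, name2):
    """Apply the rule: identical last token, pairwise-compatible given names."""
    t1, t2 = name_tokens(name1), name_tokens(name2)
    if not t1 or not t2 or t1[-1] != t2[-1]:
        return False
    given1, given2 = t1[:-1], t2[:-1]
    if len(given1) != len(given2):
        return False
    return all(tokens_compatible(a, b) for a, b in zip(given1, given2))
```

For example, `likely_same_author("Renee J. Miller", "RJ Miller")` holds under this rule, while `likely_same_author("Min Wang", "Min Wang2")` does not.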
11. Global Detection
Duplicate detection, also known as entity resolution,
record linkage, or reference reconciliation, is a well-studied
problem and an active research area [Tutorial-VLDB’05,
Tutorial-SIGMOD’06].
We use existing declarative techniques [D.App.σ-SIGMOD’07]
to detect duplicates across multiple files.
Displays a disambiguation page on the HTML interface and an
rdfs:seeAlso property on the RDF interface.
Also lets users provide feedback through BibTeX @string definitions, e.g.
@string{vldb = "Very Large Data Base"}
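The slides do not say which similarity predicate is used across files. As a stand-in, the sketch below uses Jaccard similarity over character q-grams, a common approximate string matching measure. The function names, the padding character, and the 0.5 threshold are assumptions for illustration.

```python
# Hypothetical sketch of cross-file duplicate candidates using q-gram Jaccard similarity.
# The similarity measure and the threshold are stand-ins, not the system's actual predicate.

def qgrams(text, q=3):
    """Return the set of character q-grams of a lowercased, '#'-padded string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b, q=3):
    """Jaccard similarity between the q-gram sets of two strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def candidate_duplicates(names, threshold=0.5, q=3):
    """Return pairs of names whose q-gram Jaccard similarity meets the threshold."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(a, b, q) >= threshold:
                pairs.append((a, b))
    return pairs
```

The pairwise loop is quadratic in the number of names, so it only suits small inputs. Pairs found this way are candidates to show on a disambiguation page rather than confirmed matches.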
12. Interlinking of Data Sources
Leverages both offline dictionaries and online real-time
URL verifications.
Some external data sources
DBLP
DBpedia
RKBExplorer
Semantic Web Dogfood
LOD foaf
13. Additional Features
Storage and publication of provenance information
Dynamic grouping of entities (by year, keyword, etc.)
RSS feed for notification
DBLP scraper to generate bibtex files from DBLP records
Statistics on usage
Enhancement to existing MIT bibtex ontology file
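Dynamic grouping of entities can be pictured as bucketing parsed entries by an arbitrary field at request time. The sketch below assumes entries are plain dictionaries and that multi-valued fields such as keywords are lists. The `group_entries` name and the "unknown" bucket are illustrative choices.

```python
# Hypothetical sketch: group parsed entries by any field (year, keyword, ...).
from collections import defaultdict


def group_entries(entries, field):
    """Return a dict mapping each value of `field` to the entries that carry it."""
    groups = defaultdict(list)
    for entry in entries:
        value = entry.get(field)
        # A list-valued field (e.g. keywords) places the entry in several groups.
        values = value if isinstance(value, list) else [value]
        for v in values:
            groups["unknown" if v is None else v].append(entry)
    return dict(groups)
```

Calling `group_entries(entries, "year")` and `group_entries(entries, "keywords")` on the same list yields two different views without storing either grouping in advance.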
14. Conclusion and Future Work
BibBase
Light-weight publication of bibliographic data
Semantic web technologies as a result of complex
triplification performed inside the system
Invaluable data set
Future Work
More comprehensive duplicate detection
Links to more external data sources
Better engineering and service level agreement (99.99%?)
Broader user base