Linked Data, Free Pictures and
      Markets for Semantic Data




Paul Houle
paul@ontology2.com
Overview

  the new taxonomy
  Freebase and DBpedia
  collecting pictures
  the semantic-social ecosystem
  commonsense knowledge in practice
  the economics of semantic data
  proof and trust
virtuous circle (diagram): Content → People Use Images → Get Links → Traffic → Revenue → more Content
animalphotos.info
Scientific Classification of Animals
Vernacular Taxonomy for Animals (diagram): Mammals (Primates, Rodents, Others), Birds
<http://dbpedia.org/resource/Gear>
automating the process (diagram):

  identify topics (DBpedia) → search for candidates (Flickr) → filter correct images → describe images
  (filtering and describing are done via Amazon Mechanical Turk)
carpictures.cc
EPA fuel-economy database drill-down (slides):

  Model years: 2012, 2011, 2010, 2009, 2008, 2007, …, 1990, 1989, 1988, 1987, 1986, 1985
  Nameplates: Acura, Alfa Romeo, Aston Martin, Audi, Bentley, BMW, …, Scion, Subaru, Suzuki, Toyota, Volkswagen, Volvo
  Volkswagen models: CC, CC 4Motion, Eos, GTI, Jetta, Jetta SportWagen, New Beetle, New Beetle Convertible, Passat, Passat Wagon, Routan FWD, Tiguan 4Motion, Touareg
  Trim levels: 6-speed automatic, 5-speed manual
Constructed Taxonomy (diagram): nameplates (Chevrolet, Honda, Volkswagen) with models beneath (S360, Civic, Accord, Element, FCX)
Good Category…
…Bad Category
Wikipedia Categories
“data wiki” -> better data quality
ny-pictures.com
geospatial selection +   Wikipedia graph
The only way is no way…
The only limits are no limits…
The only taxonomy is no taxonomy…
network “taxonomy” (diagram): people, places, creative works, inventions, life forms
What’s out there?
Type                      Count
Person                    1,035,529
Location                  707,679
Organism Classification   192,632
Organization              177,999
Music Album               118,568
Film                      76,681
Structure                 74,061
Event                     73,992
Written Work              51,937
TV Program                30,094
Fictional Character       29,461
Celestial Object          24,174
Ship                      23,006
ookaboo.com
User contributed content
ookaboo semantic API
ookaboo semantic API (diagram): <http://dbpedia.org/resource/Thailand> → API
Thanks: Andyindia, Echiner1, Rene Eherhardt
social-semantic ecosystem

  linked data
  human contributions
  other online communities
  knowledge engineering
Text Analysis

Car Image CC-BY from http://www.flickr.com/photos/aharden/2618801756/
commonsense logic?
Number of Facts

  Cyc: 3 million           Freebase: 600 million

Number of Concepts

  SUMO: 1,000              DBpedia: 3.9 million
  WordNet: 118,000         Freebase: 23 million


critical mass?
“Any brain, machine or other thing that has a
mind must be composed of smaller things that
cannot think at all”

                       Marvin Minsky
Saturn disambiguation (diagram):

  Saturn1: Rome, Deity, Mythology
  Saturn2: Planet, Rings, Astronomy
autocompletion
ad-hoc SPARQL query
a database of names…
… plus subjective importance
yankees vs. red sox

carbon vs. silicon

aerosmith vs. the ramones

Jeopardy vs. family feud
the airports query
Airports in English
Airports in Japanese (空港)
A cautionary tale (chart): advertising revenue over time, dropping to zero
“I know it when I see it”
           - Supreme Court Justice Potter Stewart
50 offensive categories

1000 offensive topics

1800 offensive images

950,000 good images
99.81% accuracy isn’t good enough!

Hyperprecision!
Publishing Knowledge (diagram): SPARQL Endpoint, Dereferencing, API, RDF Dump
Thanks: andrefontana, Isakkk, laynaaa
Clip art licensed from the Clip Art Gallery on DiscoverySchool.com
Dereferencing

<http://rdf.freebase.com/ns/en.graphene>

        HTTP GET
           ↓
fbase:en.graphene
    a       fbase:common.topic ,
            fbase:award.award_winning_work ,
            fbase:law.invention ;
    fbase:award.award_winning_work.awards_won
            fbase:m.0dg75z8 ;
    fbase:common.topic.article
            fbase:m.03p5rz ;
    fbase:common.topic.image
            fbase:m.089q2k3 , fbase:m.02f5b7f , fbase:m.041wl9z ;
    fbase:law.invention.inventor
            fbase:en.andre_geim ...
Thanks: Thomas Shahan
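(A minimal Python sketch of dereferencing with rdflib, assuming the endpoint serves Turtle; the Freebase RDF service shown above has since been retired, so treat this as an illustration of the pattern rather than a working endpoint.)

# Linked Data dereferencing: HTTP GET a topic URI and parse the RDF
# that comes back. The "turtle" format is an assumption.
import rdflib

uri = "http://rdf.freebase.com/ns/en.graphene"

g = rdflib.Graph()
g.parse(uri, format="turtle")   # performs the HTTP GET and parses the response

# List every fact whose subject is the dereferenced topic.
for s, p, o in g.triples((rdflib.URIRef(uri), None, None)):
    print(p, o)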
Publishing Knowledge (diagram, highlighting API and RDF Dump)
Ookaboo RDF Dump

        Metadata for 950,000 Pictures

        500,000+ topics

        630 MB

        50 million facts
Two Challenges

  Ookaboo needs better tools to build navigation

  Customers need tools to find concepts
Not so “big” …

  :BaseKB is 2.8 GB
  :BaseKB is free under CC-BY
  :BaseKB takes an hour to load on a workstation PC

… but very complex

  :BaseKB has 11,361 types and 102,949 properties
  “A isPartOf B” can be expressed in 139 different ways!

Photo credit: http://commons.wikimedia.org/wiki/User:Evan-Amos
N-Triples is compatible with…

  RDF Database

  awk, sed, grep, …

  Hadoop

  Lucene (SIREn)
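Because N-Triples puts one complete fact per line, ordinary text tools can process a dump without an RDF parser. A minimal Python sketch of the same idea (the file name is hypothetical):

# Count how often each predicate appears in an N-Triples dump.
# Each line is "<subject> <predicate> <object> ." so the predicate
# is always the second whitespace-delimited token.
from collections import Counter

counts = Counter()
with open("basekb.nt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        counts[line.split(None, 2)[1]] += 1

for predicate, n in counts.most_common(10):
    print(n, predicate)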
Data Quality

Quality Perimeter

Repairing Folksonomic Trees
Enterprise Data Warehousing (diagram): Operations → ETL → Data Warehouse → Analytics

Knowledge-Based System (diagram): Linked Data → ETL → Data Warehouse → Operations
“Businesses often spend five to 10 times
more money to correct their data after it is
entered into the system than they would
have if they had headed the problems off at
the source.”

           - Larry P. English, Information Impact International
Data Quality Economics

  Assume 25 consumers

  Consumers clean: 25 × $N = $25N

  Publisher cleans: 1 × $N = $N
Reusable Knowledge Base: effect on schedule (diagram contrasting a “build” timeline, where you develop the knowledge base yourself, with an “adopt” timeline and its earlier decision point)
Build Knowledge Base → Develop Profitable Applications → Get Feedback and Revenue → … (cycle diagram)
Linked Data Business Models

A free shared vocabulary enables interconnection… but the profit motive spurs investment to create quality data.
Trust and Proof (diagram): Publishers and Consumers
A Market in Common Sense
… big and ambitious systems




Paul Houle
paul@ontology2.com

Editor’s Notes

  1. A few years ago I started collecting and organizing free pictures and sharing them with people on the web. At first I made lists of topics by hand, and I’d find pictures myself. One day I got the idea of how to automate this process, and that was the beginning of the journey that has brought me here today. While creating a site called “Ookaboo”, I discovered that there’s a rapidly growing community of web sites and applications that use Linked Data as a shared vocabulary, and because of that I’ll be talking to you today about the knowledge bases behind these applications. I’ll talk about how they’re made, how you can use them, and a bit about the business and economics.
  2. So to gather pictures of things, you need to have a list of things. You probably want to know what those things are. That means developing a taxonomy. Top-down taxonomies, like the Library of Congress Subject Headings, are painstaking and expensive to develop. When you look closely, however, even the best taxonomies show serious flaws. Lately there’s been interest in tagging and folksonomies; tags can be fantastically cheap and abundant. At best, they cover subjects in great detail. At worst, someone uploads 1000 pictures of a wedding and tags them all “rat”. In 2012, we’re developing “a new taxonomy” which combines formal taxonomies and folksonomies with large databases of specific facts about different topics. With this, we can create and organize lists of things, and we’re often able to teach computers to make very specific distinctions between things.
  3. The new taxonomy is made possible by two databases: Freebase and DBpedia. Freebase is currently the biggest “data wiki”, and has about 600 million facts about 24 million things. These are things like people, species, chemical compounds, and cities. DBpedia extracts information about the topics of roughly 4 million Wikipedia pages. These two databases go together like peanut butter and jelly…
  4. Now, people are rapidly developing sites and applications that use this pool of data from Freebase and DBpedia. They can interoperate because they’re using a common language. These applications share building blocks; for instance, they all need to know about the names of things, and you need some way to take an idea in your head or something written down and figure out what topic it is.
  5. This kind of knowledge is a kind of “commonsense knowledge.” We’re not trying to simulate a person, but we can simulate a narrow faculty, such as a sense of what things are important or what things people find offensive. I’ll give you a glimpse of how we construct such faculties…
  6. … and link this to the business and economics of semantic data. If everybody starts from zero when they build this kind of system, it’s going to take a long time for the ecosystem to emerge. Reusable tools and knowledge bases are essential
  7. And then I’ll share my opinion on “proof” and “trust”, two things that are necessary if we’re going to share data. Researchers are trying to find a way to automate it, but today we need to do it on a human level. Suppose you find a data source. How do you know if it’s “good enough to be useful”? What if you think it is accurate in some ways but inaccurate in another? Although there aren’t standards to address this yet, we have to answer this question right now if we want to build systems “good enough” to put in front of the public.
  8. So, where does this start? In 1999 I was living in Germany, and my wife and I were taking pictures of animals, mostly at zoos, throughout Europe. So, we’d filled up a shoebox with color prints and thought about making a web site where people could easily find a picture of any animal that pops up in their head. But I wasn’t really happy with the scanned images, and I got busy with other things…
  9. … and pretty soon Flickr comes along and I see so many photos that are better than mine that I give up. But the idea came back to me a few years ago, when I started to see Flickr not as a competitor, but as a source of photos. This greatly extends what I can do, because now I don’t need to take the photos.
  10. This is all possible because many photos on Flickr are under Creative Commons. The Creative Commons foundation made a number of canned licenses that creators can apply to photos and other creative works. Because the licenses are standardized, they are machine readable, so I can filter a stream of images from a place like Flickr or Wikimedia Commons and accept only Creative Commons images that allow commercial use.
  11. So this was the plan. All Creative Commons licenses require that people who use a photo give credit to the photographer. My sites give that credit, and require users to give that credit also; I also ask that they link to my site to preserve the provenance chain of the image. These links bring me traffic, which I turn into revenue with advertising. This lets me find more photographs, which people use, making links and forming a virtuous circle.
  12. So this is Animal Photos! This is really just a hacked Wordpress blog, but it became pretty popular. I started out building it by hand… actually building a list of animals and finding the pictures myself on Flickr
  13. The way I was doing this wasn’t scientific at all, because scientific classification doesn’t reflect how people think about animals. If I asked you to name an animal, you’d probably think of a mammal. But mammals, birds, fish and reptiles are just a tiny part of the tree of life – they’re all in that circle with a frog in it that’s near the top. If you randomly picked something out of the tree of animals, you’d probably get something that the average person would think of as “squishy” or “crunchy”. It would be a big mistake to start the navigation out with concepts like “Eumetazoa” and “Porifera”…
  14. So I fell back on something so simple it’s embarrassing. I looked through books and Wikipedia and walked through a piece of the scientific tree by hand. I was looking for photos as I built the tree, and I’d cover animals at the species or the genus level depending on the availability of photos and the ability to identify them. I found the list was useful in alphabetical order if I split out primates and rodents – there are just so many of them. I started working on birds when I had a conversation with Waldir Piminir.
  15. He was an editor at Wikipedia, and he told me about DBpedia, which is a database made from Wikipedia. It’s a rich source of data because it covers all the topics in Wikipedia; it contains links between Wikipedia pages, links between pages and categories, and a wealth of specific information about dates of birth, molecular weights, populations and many other attributes things can have.
  16. DBpedia is published in a format called RDF, which is the most basic standard of the semantic web. RDF can be loaded into a special kind of database, called a triple store, and you can write queries against this database in a language called SPARQL, which is a lot like SQL. Because SPARQL is a standard of the W3C, there are many vendors of triple stores competing, just like there are many relational databases in the market. Although SPARQL supports joins, like SQL, you can use RDF in a schemaless way, like a NoSQL database. It may be counterintuitive, but you can actually make a schema after you load your data, and that’s a key capability for the new taxonomy.
  17. I think the most important thing about DBpedia, however, is that it assigns unique identifiers to concepts, and that these unique identifiers are hyperlinks. Since different organizations own domain names, this gives us a universal namespace that we all own a piece of, but we share. Rather than using words, like “gear”, that can mean several things, we can share precise concepts like “the kind of gear used in a machine”.
  18. So how is this connected with pictures? We can process data from DBpedia to identify and organize topics that are relevant for the site we’re building. For each of these topics, we can run queries on Flickr that will find images for it. Sometimes we get a lot of images, sometimes we don’t get any. Almost always we get some unwanted pictures, so we need to look at them to filter out the irrelevant ones. Once we’ve got accurate and categorized images, the last step is to describe the images so we can present good captions.
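A minimal Python sketch of the pipeline just described. The DBpedia endpoint and the dbo:Mammal class are real; flickr_search, human_filter and human_describe are hypothetical stand-ins for the Flickr API and the Mechanical Turk steps.

# Identify topics in DBpedia, then hand each one to the image pipeline.
from SPARQLWrapper import SPARQLWrapper, JSON

def identify_topics(limit=100):
    """Pull topic URIs and English labels (here: mammals) from DBpedia."""
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?topic ?label WHERE {
            ?topic a dbo:Mammal ;
                   rdfs:label ?label .
            FILTER (lang(?label) = "en")
        } LIMIT %d""" % limit)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(r["topic"]["value"], r["label"]["value"]) for r in rows]

# Hypothetical stand-ins for the Flickr search and the human (Turk) steps.
def flickr_search(label):         return []   # search for candidates
def human_filter(images, label):  return []   # filter correct images
def human_describe(images):       return []   # describe images

for uri, label in identify_topics():
    candidates = flickr_search(label)
    good = human_filter(candidates, label)
    captions = human_describe(good)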
  19. I found the last two could be done with Amazon’s Mechanical Turk. Mechanical Turk is a market for work. With it, I can ask people questions like “Is this a picture of a bear?” and “Describe what you see in this picture.” Some people do a really remarkable job, and I can reward them with a bonus. Turk lets you automate things that computers still can’t do, but it’s got the problem that the people involved don’t have specific expertise, so you need to design tasks around what they can do…
  20. carpictures.cc was built from the ground up on this process. The image gathering process went quite well, and I was really happy with how the visible appearance came out. It’s even got a “semantic web” logo. Now, to do this, I needed to build a list of cars that could be identified by the Turks.
  21. Now, the EPA publishes a public domain database about fuel economy. It covers any “normal” car sold in the U.S. since 1985 with excellent quality. My plan was to start with this and add other cars later. If you’ve looked up parts for your car in a parts database, you’ll be familiar with how this database is organized. At the top level, cars are divided by model years.
  22. And then the nameplate under which the car is sold; this is something like “Chevy”, “Toyota” or “Volkswagen”
  23. If we drill down on a nameplate, we get a list of vehicle models. I like nameplates and vehicle models a lot because they’re distinctive. It’s easy to recognize a car like the…
  24. “Volkswagen New Beetle”. That’s not good enough for the government, because cars come in different trim levels with different powertrains, so the fuel economy is different. They also take different parts. It’s not realistic, however, to expect people to look at a photo of a Beetle and know if it has an automatic transmission, any more than it is to tell a 2008 from a 2009.
  25. We squash this down to a taxonomy that is just the nameplate and model name. We lose some things here; for instance, we can’t tell the difference between a 1960’s Chevy Malibu and the very different Malibu for sale today. That’s a big difference, because one was rear wheel drive and the other is front wheel drive. It’s imperfect, but we can classify cars accurately to this standard, and the categories mean something to people.
  26. Now, there are classic cars, race cars, concept cars and exotic cars that aren’t in the EPA database. They look cool, so we’d like to have them. I found them using DBpedia categories. At the bottom of Wikipedia pages, you find this list of categories. They’re not terribly accurate or well organized, but there are a lot of them. One sign of poor organization is that we don’t have broad categories at the top; although this Lamborghini Gallardo is in 10 different categories, it’s not in the category that we really want, “Automotive Model”.
  27. So, I had to construct that category using the categories already there. This data is messy, but there’s a lot of it, so errors can be averaged out. Here’s one category (cars with 10 cylinder engines) that the Lamborghini is in. This is a good category, because everything in it is a car. If we think of a network between categories and topics, we can follow links from this category to discover new automobile models.
  28. Not all categories are this good, however. “Police Vehicles” is a bad category because it contains concepts like “Aerial Roof Markings” and “Emergency Vehicle Lighting”. It’s not useful because knowing a topic is in this category doesn’t tell you if it is a car or not!
  29. Other categories, the red ones in this picture, always point to things that aren’t cars. If we put a little attention into the categories, we can analyze the category-topic network and use them to construct the categories we need. The algorithm is a lot like the “Hubs and Authorities” algorithm that Jon Kleinberg, of Cornell, has applied to the web, and it’s very general. I’ll show another application shortly, but…
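A toy Python sketch of the idea from the last three notes, with small dictionaries standing in for the Wikipedia category graph; one round of reinforcement is shown rather than the full iterative Hubs-and-Authorities-style computation.

# Score each category by the fraction of its members already known to
# be cars, then harvest new car candidates from the good categories.
# The data below is illustrative.
category_members = {
    "Cars with 10-cylinder engines":
        ["Lamborghini Gallardo", "Dodge Viper", "BMW M5 (E60)"],
    "Police vehicles":
        ["Ford Crown Victoria", "Aerial roof markings",
         "Emergency vehicle lighting"],
}
known_cars = {"Lamborghini Gallardo", "Dodge Viper", "Ford Crown Victoria"}

GOOD = 0.5  # a category is "good" if most of its members are known cars

new_cars = set()
for category, members in category_members.items():
    score = sum(t in known_cars for t in members) / len(members)
    if score >= GOOD:
        new_cars |= {t for t in members if t not in known_cars}

print(new_cars)   # {'BMW M5 (E60)'} -- "Police vehicles" is ignored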
  30. After doing carpictures I turned to Freebase because it has better data quality than DBpedia. DBpedia’s got the problem that it reads the “markup” that people use to write Wikipedia pages. Wikipedia is designed to be read by humans, not machines, so this is an error-prone process. If you want to fix an error in DBpedia, you’ve usually got to fix the problem in Wikipedia – you can’t do this automatically because Wikipedia’s markup is complicated. Freebase, on the other hand, is a data wiki, where facts can be edited by people and machines. Freebase loads data from many sources, so you can get EPA fuel economy data, FDA data about drugs, and data on 8 million tracks from MusicBrainz – one great thing is that Freebase topics are linked up with Wikipedia, so it works together with DBpedia.
  31. If you look at Wikipedia and Freebase, you see a lot of locations. Wikipedia has about 700,000 and Freebase about a million. My day job was connected with GIS at the time, so I decided to do something based on locations. I figured that New York City had the highest density of significant things anywhere in the world, so I pointed my targeting system at it.
  32. This involved combining three sources of data. The U.S. Census creates detailed digital maps of the U.S., and they share these freely with the public. Both DBpedia and Freebase have rich information about locations, and I was able to pick and choose the data I wanted out of them – filling in holes in the data.
  33. I used two methods to find things inside New York City. Freebase and DBpedia have coordinates for most locations, so I was able to use U.S. Census data to draw lines around the 5 boroughs. A simple geospatial calculation finds points inside the city. If we had an address for something, we could also use that. Some things don’t have coordinates or addresses in the database, however, or are distributed across the city. The NYPD, for instance, is part of the city’s visual branding, but it’s not all at one place. I used the things with coordinates to seed the discovery of things that share Wikipedia categories with them, and I doubled my number of topics this way.
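A minimal sketch of the geospatial half using shapely; the polygon below is a made-up triangle standing in for a real Census boundary.

# Point-in-polygon test: is a topic's coordinate inside the city?
from shapely.geometry import Point, Polygon

# Stand-in boundary; real ones come from U.S. Census TIGER shapefiles.
borough = Polygon([(-74.05, 40.68), (-73.90, 40.68), (-73.97, 40.80)])

def inside_city(lon, lat):
    return borough.contains(Point(lon, lat))

# Coordinates for topics come from Freebase/DBpedia.
print(inside_city(-73.97, 40.72))   # True: inside the toy boundary
print(inside_city(-73.50, 40.90))   # False: well outside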
  34. This taxonomy construction is efficient, but I was getting more interested in scaling up the process to get photos. Getting pictures of places all over the Earth, for instance, isn’t that different from getting them in New York City; in fact, it’s easier because I don’t need to draw a line around anything. Once I’d let go of that limitation, I found it took little extra work to get pictures of people, fire hydrants, or anything at all…
  35. In lieu of a taxonomy, I can find a bunch of network relationships between things. For instance, the system knows that “Sylvester Stallone” wrote “Rocky” and that he’s got a star on the Hollywood Walk of Fame. It knows that “London” is part of the “United Kingdom” and that cows eat plants. This network can be used on a retail level, to present specific facts, but it’s also useful on a wholesale level, where algorithms can extract useful knowledge from the graph. Rather than classifying stuff up front, we can use our graph to create categories when we need them.
  36. Let’s stop for a second and see what kind of things are in this graph. This is the kind of thing you need to investigate when you start a project of this sort, and a triple store is a good tool for answering this kind of question. I made this table with a SPARQL query against :BaseKB, my own knowledge base that covers the intersection of Freebase and DBpedia. Out of 4 million topics, this is a list of the most common things. What we see is no big surprise. People are the most common topic, followed by locations, and then organisms. Organizations are important too, and so are creative works like books, music, and films. Even going to more obscure topics, we find more than 23,000 ships.
  37. So, with a mostly automated system I was able to gather about a million photos of 500,000 different topics. Ookaboo is more precise than other image sites because it indexes images under precisely defined concepts. The word “Jaguar” might mean an animal, a car, or a handheld video game console, but Ookaboo treats these all as separate things – it even knows the different kinds of cars sold under the Jaguar brand. On its own terms, Ookaboo’s accuracy is very close to perfect. You can use pictures from Ookaboo to illustrate a web site and run very little risk of getting images that are completely irrelevant.
  38. Ookaboo’s automated pipeline mines user contributed content from communities like Wikipedia and Flickr – but it creates a new community, because Ookaboo visitors can add pictures. Ookaboo users can browse photos from Flickr, pick out topics in Ookaboo, and then tag pictures with the topics. The process is a lot like submitting a story to Digg or Reddit. We benefit, thus, from the community standards on Flickr while letting Ookaboo users tag pictures much more quickly than they could take pictures…
  39. Ookaboo also has a semantic API. If you can get a Linked Data id for a topic (if you know the Wikipedia page), you can ask Ookaboo to give you pictures of that topic. You get back a little bit of JSON with all the data you need to use the photos in a web site, blog or app. The accuracy is so high that you can use the pictures without supervision. The one thing you need is the semantic identifier. So, where do you get these? What kind of sites use these identifiers?
  40. Well, I’ll show you a few sites that use these identifiers. These sites use semantic data, but they’re also social because they take human input. This makes them social-semantic. They’ve got a few characteristics.
  41. They depend on Linked Data. They use identifiers from sources like Freebase and Dbpedia to specify things. This makes them compatible with other systems that depend on the pool of linked data. They can use my API. Because of this, they form a definite ecosystem
  42. They accept human contributions; in some cases this is a carefully chosen committee, but it can also be a broad group of Internet users, like the people who edit Wikipedia
  43. They connect closely with other online communities in many ways
  44. And they take a pinch of knowledge engineering to get it all together. I’d like to show you a few examples:
  45. This is Xen.com, a site that’s run by a startup from Los Angeles which has just launched. Last year, I helped Xen incorporate Freebase data into an interest graph. Xen lets people express their preferences for things that exist in Freebase; it’s a bit like Pinterest in some ways. This is a good application for Freebase data, because Freebase has very strong coverage of popular culture.
  46. Xen, Ookaboo and many other sites connect with Facebook. Facebook is a leader in the social-semantic world because, with the Open Graph protocol, they’ve gotten people to add RDFa metadata to millions of web pages. There are Facebook pages on every topic imaginable, so the Facebook graph is a parallel universe that overlaps with Linked Data…
  47. Last month, Google launched the Google Knowledge Graph. Sometimes when you search for a specific topic, you get a box on the side. This is proprietary, but we know that Freebase technology and data are a part of this. It’s fair to say that Google can understand Freebase and DBpedia identifiers, and so should you if you’re interested in having your content correctly indexed…
  48. Ranker, which is also from L.A., is one of the thousand largest sites on the web. Ranker uses Freebase data to help users make “top 10” lists; lists like this are a constant source of fresh content that goes viral on sites like Digg and Reddit – because of this, it gets millions of page views a week.
  49. Seevl is a music recommendation engine that was developed by Alex Passant from D.E.R.I. If there’s a band you like, Seevl can find you a list of similar bands – it’s all based on an algorithm that mines DBpedia and Freebase to find things that share common characteristics. Because this knowledge is explicit, Seevl can tell you why it thinks two bands are similar. Alex developed this with a fraction of the resources that go into something like Last.FM or Pandora, so this is a good example of how semantics can make a big job smaller.
  50. This stuff is going to get mainstream because tools like Drupal, a popular CMS, are starting to support RDF. Like I did with my sites, you can build out a taxonomy with semantic tools, then load it into a CMS to build out the skeleton of your site. This process will be getting easier and easier…
  51. At the other end of the lifecycle, here’s a different kind of application. Governments and other organizations are interested in mining social media to better understand the environment they work in. The folks at Veda are building a social media analysis system for the Indian government, and here we see sentiment analysis about parliamentary elections. For something like this you want a database of the candidates and the issues, and that’s a lot like the database for other sites we’ve seen.
  52. All of these applications benefit from a form of text analysis called “named entity recognition”. This is the abstract of a scientific paper, but you could do this with an article about a football game, the caption of a photo, or any kind of document.A named entity is some specific thing talked about in a text. “Sir John Herschel” is a named entity…
  53. … and so is the planet Saturn, which we’ll focus on, because this case is ambiguous. Let’s suppose a system thinks the word “Saturn” here might be a named entity…
  54. If we look up “Saturn” in a database of names, we get a god who ate his kids, a defunct automotive brand, and the 6th planet from the sun.
  55. Even if you’re not an astronomer, the answer is obvious to you. The capability comes not from grammar or a “language instinct”, but from the fact that you know millions of things about the world you live in.
  56. So how does a computer get this knowledge? An early approach is to create an upper ontology, which describes fundamental concepts using logic. Lenat and Guha wrote a book in 1990 about the Cyc project, which used an extension of first-order logic to model “commonsense” back in the early 1990’s. Cyc had about 3 million facts. Cyc could do some neat things, like “prove a donut can’t talk”, but it didn’t have a broad impact. A later project, SUMO, described in the other book, tries to model our world of experience with just 1000 concepts. This makes it easy to understand SUMO, but it’s just not enough…
  57. Freebase has 600 million facts, about 200 times as many as Cyc. DBpedia knows of 4000 times as many things as SUMO. Both Wikipedia and DBpedia are vastly larger than WordNet, another effort to capture the meaning of words.
  58. I think here we’re reaching a critical mass. Today we’ve got enough facts that we know something about any concept which would be fair to bring up in a game of “20 questions.” The methods we’re using, however, are really different. First-order logic doesn’t scale to 600 million facts, and we’re not trying to formalize exactly what a chemical compound is, or how that relates to a chemical element. What we do have, however, is the chemical formulas for thousands of chemical compounds.
  59. Marvin Minsky wrote a book, “The Society of Mind”, based on a project that started in the 60’s. He concluded that a system that does something “intelligent” has to be composed of smaller parts that do simple, not so intelligent things. If we think this way, we develop “agents” that do very specific things – it doesn’t matter so much how they work or what kind of technology they use, only that they work. Once we’ve got them working, we can snap them together with other agents to build bigger things.
  60. Given that, let’s step back, and imagine the design of an “agent” that could identify the planet Saturn in this text
  61. If we look at the rest of the document, we find words like “astronomy” and “ring” and “planet”
  62. From Wikipedia and Freebase we can derive a network of how concepts are related. We could pull up the record for the God and see that he’s tied to concepts like “Deity”, “Rome” and “Mythology”
  63. But if we pull up the record for the planet Saturn
  64. We find that it’s closely related to other topics in this article
  65. So it’s a good bet that this article is talking about the planet Saturn. There’s more to it, such as using grammatical clues, but I think any system that succeeds at this needs to have this kind of knowledge. Products are rapidly advancing in this space, and in the next year we’ll see things that work this way.
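A toy sketch of the disambiguation step, with a hand-made concept network standing in for what would be mined from Freebase and DBpedia: each candidate sense is scored by how many of its related concepts also appear in the document.

# Pick the sense whose related concepts overlap the document the most.
related = {
    "Saturn (mythology)": {"rome", "deity", "mythology"},
    "Saturn (planet)":    {"planet", "rings", "astronomy", "orbit"},
}

def disambiguate(senses, document_words):
    doc = {w.lower().strip(".,") for w in document_words}
    return max(senses, key=lambda s: len(related[s] & doc))

doc = "the rings of Saturn were photographed as the planet reached opposition"
print(disambiguate(list(related), doc.split()))   # -> Saturn (planet)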
  66. In the case of named entity recognition we’ve got some concept in a text, and we’re trying to find the unique identifier for it.Autocompletion is a common UI trope for social and semantic sites, and it’s very similar to N.E.R. because now there’s a concept in your mind and you’re trying to look it up in the system. It’s important because it works for all kinds of topics with little customization.Here’s the autocompletion box for Ookaboo, which is the primary way to find pictures
  67. Freebase has an autosuggestion box; this one is kind of fancy because when you hover over a concept, it shows details about it in a flyover box.
  68. You can find autosuggestion on many sites, but LinkedIn is a particularly good example because (i) they divide the world up into things like “companies”, “people” and “groups” and (ii) they filter and rank based on your connections, and that’s no small task for a site that has 150 million users!
  69. So I’ll take you into the lab and show you how we develop something like this. If you want to prototype autocompletions, you can start by writing SPARQL queries. Because SPARQL is a lot like SQL, you can think up questions and get answers fast – although programmers can write SPARQL queries, SPARQL is good for business analysts and other people who’d like to interactively play with data. This query, for instance, returns things that have a name in the English language that starts with “John”.
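A reconstruction of that kind of query in Python. The talk ran it against :BaseKB; here DBpedia’s public endpoint and rdfs:label stand in, since :BaseKB’s exact predicates may differ.

# Things with an English name starting with "John".
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?thing ?name WHERE {
        ?thing rdfs:label ?name .
        FILTER (lang(?name) = "en" && STRSTARTS(?name, "John"))
    } LIMIT 25""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"])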
  70. This query works. But the results aren’t so good – I don’t recognize any of the Johns here. :BaseKB knows about 35,000 things that start with ‘John’, and the typical one is somebody you haven’t heard of, like John “Eck” Rose, who ran for Governor of Kentucky in 1995.We need some way to find the John you’re thinking of,
  71. And here’s a first draft of that. In this query, we change two lines to sort by a score that we call “gravity”. Gravity is made from Freebase data, and it’s a measure of subjective importance that’s under development. We can see this is a big improvement: right here I recognize “Johnny Cash” and “John Williams”, “John Coltrane”, “John Denver” and “John Cage” and “John Lennon”… see the pattern here? This system is going to pick out the “John” you have in mind if that “John” is a musician!
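The gravity variant only touches the query body; the ex:gravity predicate below is a hypothetical stand-in for the precomputed score inside :BaseKB.

# Same query, re-ranked by subjective importance ("gravity").
query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/score/>
    SELECT ?thing ?name WHERE {
        ?thing rdfs:label ?name ;
               ex:gravity ?gravity .               # changed line 1
        FILTER (lang(?name) = "en" && STRSTARTS(?name, "John"))
    } ORDER BY DESC(?gravity) LIMIT 25"""          # changed line 2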
  72. If we look at the results on Ookaboo, we see something that’s a lot more balanced: politicians like “John F. Kennedy” and “John McCain” show up, we get the astronaut “John Glenn”, General John Pershing as well as John the Baptist and John Calvin.The difference is that Ookaboo’s rankings are based on Dbpedia. With 8 million music tracks, Freebase has a bias towards music. To improve gravity, we need to counteract this bias, which can be easily done.This is an interesting example though, because it shows that subjective importance is subjective… You can certainly say that one score is better than another, but many specific questions don’t have a correct answer.
  73. For instance, out in the Northeast, people can’t agree about the Yankees vs the Red Sox.
  74. What element is more important, carbon or silicon?
  75. What’s a better band “Aerosmith” or “The Ramones?”
  76. IBM Watson plays Jeopardy, but we’re playing Family Feud, feeling out the judgement of human crowds – it’s a very different game. The discovery of a score that has a bias towards music suggests we can build importance scores that reflect different people’s points of view.
  77. The scores shown so far are calculated by counting or summing over links, but Freebase is full of specific facts that relate to importance. For instance, we’ve got population numbers…
  78. We’ve got information about quantities of money, such as GDP, Revenues, and market capitalizations.
  79. Freebase also has information about Nobel Prizes, Oscars, Grammys, and about 10,000 other kinds of awards that people and things can win.So no matter how you keep score, it’s possible to mine Freebase and create an importance score that reflects what you think is important
  80. Before we move on, I’ll point out that an RDF database isn’t the right choice to put autocompletions into production. This is because autocompletion has to be really fast, so that results pop up under your fingers as you type. RDF databases can answer a wide range of queries, but for something like autocompletion you need a specialized database. LinkedIn has made an open source project called Cleo, which is a specialized database just for autocompletion – what’s great is that this scales from a simple instance that runs on your dev box to something that runs on a cluster and can handle a world-class site.
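Cleo itself is a Java project; as a language-neutral illustration of why a specialized structure beats a general query engine here, a minimal sorted-list prefix index in Python (names and scores are illustrative):

# Prefix search by binary search over a sorted name list, then rank
# the matches by an importance score. This shows the idea behind
# engines like Cleo, not Cleo's actual API.
import bisect

entries = sorted([
    ("john lennon", 98), ("johnny cash", 97), ("john coltrane", 95),
    ("john denver", 90), ("john eck rose", 3),
])
names = [name for name, _ in entries]

def complete(prefix, k=3):
    lo = bisect.bisect_left(names, prefix)
    hi = bisect.bisect_left(names, prefix + "\uffff")
    return sorted(entries[lo:hi], key=lambda e: -e[1])[:k]

print(complete("john "))   # the top-scored names starting with "john "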
  81. A closely related application of importance scores is to maps. Whenever you look at a map in Ookaboo, Ookaboo displays the most important topics on the map. When you pan or zoom the map, it sends an AJAX request to the server, which redraws the points. This uses a geospatial index with a few tricks to make it really fast, which helps the interface feel really magical…
  82. I’d like to show you another feature of the RDF and Linked Data world that’s pretty amazing.I call this the “airports query” and it’s the first test query I wrote for :BaseKB; this query pulls up the three letter codes and names of the top 25 airports in the world
  83. Gravity ranks airports really well. JFK, LAX, and Heathrow are all big. There might be one or two airports in Canada that are ranked too high, but it’s hard to argue with this list.Because :BaseKB is broad-spectrum, you could write a very similar query for asteroids, shipwrecks or anything else that comes to mind…
  84. If we zoom in on the SPARQL query, note that there’s one line that filters the output so we only get names in English. This is important because the database contains names in many languages, and we’ll get them all if we don’t filter.If we just replace the language code for English with another language code, say for Japanese…
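Reconstructed under the same caveat as the earlier queries (ex:iata_code and ex:gravity are hypothetical stand-ins for :BaseKB’s Freebase-derived properties), the airports query in Python; swapping one language tag is the whole internationalization change:

# The "airports query": codes and names of the top 25 airports.
LANG = "en"   # change to "ja" to get the same airports in Japanese

airports_query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/score/>
    SELECT ?code ?name WHERE {
        ?airport ex:iata_code ?code ;
                 ex:gravity ?gravity ;
                 rdfs:label ?name .
        FILTER (lang(?name) = "%s")
    } ORDER BY DESC(?gravity) LIMIT 25""" % LANG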
85. We get the airports in Japanese with just a few seconds of work! This is big, because internationalization can be time-consuming and expensive. With RDF tools and a multilingual database, you get names for things in multiple languages for free. There's a lot more to international support, but you can go amazingly far without knowing anything about the target language. For instance, people can type Japanese, Chinese, and Arabic names into the autocompletion box on Ookaboo and get good results.
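For the mechanics, here's a minimal sketch of running that kind of query from Python with the SPARQLWrapper library. The endpoint URL and the ex: predicates are placeholders rather than :BaseKB's real schema; the line to notice is the FILTER, which is all you change to switch languages:

    # A sketch; the ex: predicates are not :BaseKB's actual property names.
    from SPARQLWrapper import SPARQLWrapper, JSON

    AIRPORTS = """
    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?code ?name WHERE {
      ?airport a ex:Airport ;
               ex:iataCode ?code ;
               ex:gravityScore ?gravity ;
               rdfs:label ?name .
      FILTER(lang(?name) = "en")   # change "en" to "ja" for Japanese names
    }
    ORDER BY DESC(?gravity)
    LIMIT 25
    """

    endpoint = SPARQLWrapper("http://localhost:8890/sparql")  # assumed local endpoint
    endpoint.setQuery(AIRPORTS)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["code"]["value"], row["name"]["value"])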
86. The next example gives a sense of how far we can go in making subjective distinctions. This graph shows one of those moments of terror in the life of a webmaster. One day I noticed that Ookaboo's ad revenue had dropped to zero, and when I checked my email I found a message from my ad network warning me that they'd found a picture of somebody with an unspeakably strange body piercing. So I was wrong when I thought I didn't need to classify topics going into Ookaboo. It didn't matter whether I was getting a picture of a ski lift or of Frank Sinatra, but getting pictures of obscene things was a big problem. There's one thing to do in a case like this, and that's to find the offending images and pluck them out, but that takes some work…
87. I've got a lot of topics and images, so I can't find all the bad ones myself. I had to teach my system to recognize offensive topics, which at first seems a bit tricky, because even a Supreme Court Justice will resist the effort to define them precisely.
88. Well, it turned out not to be so bad. I used Wikipedia categories, which were also useful for the cars and the things in New York. Wikipedia categories are nested into super-categories, and I found that 50 or so categories covered topics relating to sexuality, illegal drugs, and such.
89. Following connections from those categories, I found about 1000 offensive topics.
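As a sketch, that category walk can be a single SPARQL query, run the same way as the airports query above. dct:subject and skos:broader are DBpedia's real predicates; the seed category is just one example of the 50 or so:

    OFFENSIVE_TOPICS = """
    PREFIX dct:  <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT DISTINCT ?topic WHERE {
      ?cat skos:broader* <http://dbpedia.org/resource/Category:Pornography> .
      ?topic dct:subject ?cat .
    }
    """
    # The union of the results over all the seed categories becomes the
    # blacklist of topics whose images get excluded.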
90. These topics were illustrated, in turn, by 1800 offensive images. This represents a huge amount of force multiplication, because by looking at a few dozen categories I can clean out 40 times as many images. Censoring bad topics, instead of bad images, works well for Ookaboo because it draws from sources that have community standards. Even though I can get pictures of people's privates, they're labeled as such; I'm not waging a war against people who tag innocent concepts with bad images.
  91. Now, these 1800 offensive images are in a pool of 950,000 good images, and that’s really remarkable…
92. … because we're trying to pick out something that's very hard to find. If I looked at a random sample of 500 images, I might or might not find a single offensive image. That, combined with the fact that I wasn't looking for that sort of thing, is the reason I never noticed them.
93. … although it's very hard to find and characterize offensive images with statistical sampling, it's very possible to find these things with great speed and accuracy using the network of DBpedia connections…
94. And this is a challenge in evaluating this kind of system. If we do nothing at all about offensive pictures, we're still classifying images with 99.81% accuracy, since only 1800 out of roughly 950,000 are bad; that's better than Ivory soap, but it's not clean enough. It's even more tricky because some questions have ambiguous answers: going back to the importance score, we certainly want a way to know whether one importance score reflects our point of view better than another… but even then, it's not fair to require it to like the Yankees better than the Red Sox.
95. So this puts us in a funny place: we need to mock up subjective sensibilities where there isn't a right answer, but we also need systems that meet a really high standard of performance, systems we can let loose to roam around knowing we'll hardly ever catch them making an embarrassing mistake. It's funny, because people studying NLP systems with conventional methodologies are often happy to get 85% accuracy. But if we focus on what kinds of errors we can and can't accept in an application, we can often get exceptional, nearly perfect precision, in the sense that we don't see markedly false statements that make people lose trust in the system.
96. This story has a happy ending, because I got reinstated in my ad network and the revenue came back. It's a good thing too, because students and teachers use images from Ookaboo in K-12 education. Because I can keep offensive content out, Ookaboo is an environment that's safe for kids, so it's gotten endorsements from conservative sorts of people like the Utah State Board of Education.
97. So we've got a kid-friendly web site, but what can we do to share Ookaboo's precise information with other computers? I'll talk about four ways that semantic knowledge can be published. Two of them, SPARQL endpoints and dereferencing, have a lot of traction in the semantic web community, but I'm skeptical that they're really useful for people trying to build applications. I'll talk about why, and then talk about publishing data through an API and an RDF dump.
  98. We’ve published Ookaboo as a SPARQL endpoint using the Kasabi platform. They’ve got a really slick system where you upload your data, and then people can write SPARQL queries against it. There’s a browsing interface, search, and other great stuff…
99. The Kasabi people get a lot of things right, and one of them is that they encourage you to publish sample queries that show people how to use the database and that also function as tests; it's very reassuring to know you can run these queries and get the right answer. One test query looks for pictures of kendamas…
  100. … which are a particular kind of Japanese toy. Kasabi answers this question easily. You can register at Kasabi and write your own queries against it
101. A trouble with any public endpoint, however, is that it can't accept any query from anybody. This query is small in one sense, because it's very short, but it's big in the sense that it involves a million pictures. It finds the 20 topics that have the most pictures, or at least it tries…
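Here's the shape of that query, as a sketch with a hypothetical ex:depicts predicate rather than Ookaboo's actual schema. It's three lines of pattern, but it touches every picture in the store:

    TOP_TOPICS = """
    PREFIX ex: <http://example.org/>
    SELECT ?topic (COUNT(?pic) AS ?n) WHERE {
      ?pic ex:depicts ?topic .
    }
    GROUP BY ?topic
    ORDER BY DESC(?n)
    LIMIT 20
    """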
102. But this query times out, and because the SPARQL endpoint is a shared service, there's nothing you can do about it short of running your own. Kasabi runs 6 of our 7 sample queries. They could fix this one with more hardware, better algorithms, and a longer timeout, but then I'd just be free to write even bigger queries that time out. These big queries are what make SPARQL fun, and you write a lot of them when you're exploring a database, because you don't yet understand it well enough to write little ones… The challenge for SPARQL endpoints is providing an SLA that keeps everyone happy at an affordable price.
  103. One of the big ideas behind Linked Data is something called Dereferencing
104. Note that a linked data identifier is an IRI, which is basically a URI. Here we've got the Freebase identifier for graphene, which is a special form of carbon. Inside a triple store we typically treat this as a "cookie": a unique identifier that we can use without thinking at all about its structure. As long as we use it consistently, it works as a unique id. Looking more closely, though, this IRI leads a double life, because…
105. … we can do an HTTP request for this IRI and…
106. … we might get back a lump of RDF data about this topic. In this case, Freebase sent me back data in the RDF/XML format, and I reformatted it in a syntax called Turtle that's easier to read. Just looking at the first few facts, we can see that somebody won an award in connection with it, who invented it, and other relevant stuff. If we got curious about the inventor, Andre Geim, we could dereference his IRI and, in a process like crawling the web, explore the universe of facts and concepts around graphene. It's a lot like what you'd do if you researched a topic by consulting web pages, library books, and other sources…
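A minimal dereferencing client fits in a dozen lines of Python with the requests and rdflib libraries; the IRI below is illustrative. Ask for RDF via content negotiation, parse whatever serialization comes back, and print a few facts:

    import requests
    from rdflib import Graph

    iri = "http://rdf.freebase.com/ns/en.graphene"   # illustrative IRI
    resp = requests.get(iri, headers={"Accept": "text/turtle, application/rdf+xml"})

    g = Graph()
    fmt = "turtle" if "turtle" in resp.headers.get("Content-Type", "") else "xml"
    g.parse(data=resp.text, format=fmt)

    for s, p, o in list(g)[:5]:   # print the first few facts
        print(p, o)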
107. Now, if the information is organized exactly how you need it, this is simple. If it isn't, your client gets complex quickly. If you have to dereference thousands of URIs, you're running something with all the complexity of a web crawler. If you've got a DBpedia id, my API can give you pictures for it instantly; but because I don't control DBpedia, you can't dereference a URI at DBpedia and discover that there are pictures of it at Ookaboo. You'd need to crawl Ookaboo looking for them, and to do that you need something that looks like…
108. … Sindice. Sindice was produced by a team from DERI. It's like a web crawler, but it crawls the semantic web. Doing so, it has gathered more than 50 million triples and loaded them into a triple store, and also into a Lucene index that supports full-text search. The Sindice people are business-minded and very interested in selling you your own version of this to use on enterprise data, but they're also academics, so they've written a lot of papers about how it works; these are great to read if you're interested in that kind of system.
109. That leaves two more methods of publishing. In the case of Ookaboo, the API is a simple web service that takes a linked data identifier and gives you JSON with information about the images. You can write a client for this in a few minutes and start using images… it's all very simple. Some people might want to do a deeper kind of analysis, though, and they benefit from having a complete copy of the RDF database: an RDF dump.
110. The first RDF dump I created was for Ookaboo, and it gives some idea of the scale of this world. We've got metadata for almost a million pictures of half a million topics, and it just about fills a CD-ROM. What can you do with it? You can load this file into an RDF database and ask all sorts of questions. For instance, you can report how many images we have of people versus images of bridges, or count how many images carry each kind of Creative Commons license. It's overkill if you just want a few pictures, but it lets you look at the data as a whole. I think people are more likely to download this than actually burn it to a disk, but I like the image of the physical media because it really makes the database look like a product.
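For instance, the license report mentioned above is a short aggregation query; ex:license here is a placeholder for whatever predicate the dump actually uses:

    LICENSE_REPORT = """
    PREFIX ex: <http://example.org/>
    SELECT ?license (COUNT(?pic) AS ?n) WHERE {
      ?pic ex:license ?license .
    }
    GROUP BY ?license
    ORDER BY DESC(?n)
    """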
111. The Ookaboo dump was a prelude to a bigger project that addressed two big problems I found with Ookaboo. One was that the tools I used to do all the work described earlier were really ad hoc. They worked well enough, but they weren't based on any real principle, and every time I extended them to do something new I was inventing ugly hacks I couldn't maintain. I wanted to build better navigation for Ookaboo, and to do that I needed some way to get data out of Freebase that was correct, complete, and systematic. The other was that I kept getting emails from people who wanted a "keyword search" API. This puzzled me at first, because keyword search can never be as accurate as semantic search; but people who might use Ookaboo don't have easy tools for working with this data, so they have no way to find precise Linked Data identifiers. They've got the same data import problems I had.
112. So that was the beginning of :BaseKB, which converts Freebase data into RDF. Although Freebase data is free, it's kept in a proprietary database in a proprietary format and queried with a proprietary language called MQL. :BaseKB lets you load Freebase data into an RDF database, together with data from DBpedia and other sources, and write queries in SPARQL, a standard language which has evolved to become much stronger than MQL.
113. :BaseKB is bigger than Ookaboo, about 2.8 gigabytes compressed… you could fit it on a flash drive. It's not the biggest data set around, but it takes some effort to handle. It takes about an hour to load into a triple store on my big workstation back home, which has 24G of RAM and a special storage array; it could easily take more than a day on a typical laptop. The challenge it poses is that it's complex rather than big. :BaseKB knows about 10,000 different kinds of things, and about 100,000 attributes that those things can have. Because of the way Freebase works, relationships can be expressed in different ways, so there are 139 ways to say "A is a part of B"; it's possible to figure this out because Freebase contains a "metaschema" that organizes these properties into groups.
114. :BaseKB is delivered in a file format called N-Triples. This is the lowest common denominator of RDF, because each fact is on a separate line and the three parts that make it a triple are separated by spaces. This N-Triples file can be used in many ways. If you load it into an RDF database, you can ask questions about it with SPARQL queries.
  115. Because the file has a simple organization, you can do useful things with Unix tools such as awk, sed, grep and wc …
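Because every line is one complete fact, even naive scripts work. Here's a sketch in Python of the same kind of line-oriented counting you'd do with grep and wc; the filename and predicate are assumptions:

    # Equivalent in spirit to: grep '/depicts>' basekb.nt | wc -l
    # A typical N-Triples line (three space-separated parts, then a dot):
    # <http://ex.org/pic1> <http://ex.org/depicts> <http://ex.org/Eiffel_Tower> .
    count = 0
    with open("basekb.nt", encoding="utf-8") as f:   # assumed filename
        for line in f:
            if "/depicts>" in line:                  # crude predicate match
                count += 1
    print(count)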
  116. You can process the data with parallel tools like Hadoop
117. Or load it into a specialized index. The Sindice guys have written a Lucene extension called SIREn that can build a custom full-text index from an N-Triples file much faster than loading it into a triple store.
118. Now, quality is a big issue for any kind of data, but it's particularly important for a data set that is really complex. A lot of people try Linked Data projects and quickly run into trouble because their queries don't work. If you ask DBpedia, for instance, to tell you the 10 largest cities, you'll quite likely get a strange answer. Projects that should take 5 minutes can take 2 weeks, and if you can't fix the problems, they become impossible. The good news is that people in business have been dealing with data quality problems for years, so there are some very good answers…
119. When we package data in an RDF dump, we can establish a quality perimeter, because the dump is a testable object. This is important, because we think data cleaning should be done up front. If we deliver you clean data, you can start writing queries and building your app. We're developing open source test suites for :BaseKB and Ookaboo that prove correct operation not just in our system, but also in your environment, with your tools. Whenever we add a query to the documentation, we add it to the test suite, so that we know the query is correct and returns the right answer…
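The flavor of those suites: every documented query doubles as an assertion. A minimal sketch, with a stubbed query runner and a hypothetical query, runnable under pytest:

    def run_query(sparql):
        # Stub: the real suite sends the query to the triple store under
        # test and returns the result rows.
        return [("http://example.org/pic/kendama1",)]

    def test_kendama_pictures():
        rows = run_query("SELECT ?pic WHERE { ?pic ex:depicts ex:Kendama }")
        assert len(rows) > 0, "expected at least one picture of a kendama"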
120. Now, some data quality problems can be addressed one fact at a time… we know, for instance, that a person can't be 12 feet tall. But other data cleaning tasks involve looking at the data as a whole, and I'll show one to you now. The U.S. Government publishes a database called ITIS, which contains the official tree of living things. Like other government databases, it's free and in the public domain. ITIS is maintained by professionals, so it really is a tree: if you start with some particular organism and work up through the parents, you'll always end up at the same root. This is important because algorithms assume things about the structure of data; if an algorithm expects a tree and it isn't a tree, it may fail or give a wrong answer.
121. Wikipedia also has the tree of life, and its version has advantages over ITIS. For one thing, the Wikipedia tree has Linked Data identifiers that connect to Freebase, Ookaboo, and all these new data sources. Another is that it's maintained by volunteers who keep it up to date with the latest taxonomic information; taxonomists are always changing their minds, and when I looked at 10 cases where ITIS and Wikipedia disagreed, I found that Wikipedia reflected a more modern and correct view 8 times. That's a big strength of folksonomies: lots of people working on them keeps them up to date with broad coverage… but there's a dark side. The tree of life in Wikipedia has 200,000 nodes, and it's edited by amateurs who don't have the right tools. A single error that breaks the relationship between two taxa breaks the whole tree, because we lose all the nodes below the mistake. Once again, 99.99% isn't good enough. This is just one case where it makes sense to look at the data as a whole, maybe compare it with some other data, and fix it so we get good data structures that work the way we need.
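A whole-dataset structural check like that can be very short. Here's a sketch: given child-to-parent links, flag every taxon that can't reach the root because of a dangling parent or a cycle. The root label and the tiny sample data are assumptions:

    def broken_taxa(parent, root="Life"):   # root label is assumed
        bad = []
        for taxon in parent:
            seen, node = set(), taxon
            while node != root:
                if node in seen or node not in parent:
                    bad.append(taxon)        # cycle or dangling parent link
                    break
                seen.add(node)
                node = parent[node]
        return bad

    parent = {"Homo sapiens": "Homo", "Homo": "Hominidae", "Hominidae": "Life"}
    print(broken_taxa(parent))   # [] means every taxon reaches the root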
122. In a lot of businesses there's a process called data warehousing, which is a good analogy for what we do. They take data from the operational systems, like the cash registers, CRM, ERP, and so on, put it through an extract-transform-load process into a data warehouse, and then do analytics. Even in production systems that work day in and day out, you'll often be shocked at how dirty the data is; you might find, for instance, that people don't spell U.S. states correctly in addresses, so instead of 50 states you have 500.
123. This kind of ambiguity is exactly the thing semantic systems are supposed to fix, and that's part of the reason semantic technology is finding a place in the data warehouse. In fact, this is where most of the spend on semantic technology is expected to go. It turns out that RDF, OWL, and other semantic standards are really good for complex cases of data integration. Just as we can merge DBpedia and Freebase, semantic systems can merge multiple databases and data sources and give people a view of the business as a whole… There's a way, for instance, to map a relational database to RDF so that you can query it with SPARQL, and then you can do a federated query that combines several databases. This is exciting…
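Federation is a real part of SPARQL 1.1: the SERVICE clause pulls one part of the pattern from a remote endpoint. A sketch, where the local ex: data stands in for an R2RML-mapped relational database:

    FEDERATED = """
    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?thing ?label WHERE {
      ?thing ex:sku ?sku .                    # local, R2RML-mapped warehouse data
      SERVICE <http://dbpedia.org/sparql> {   # remote public endpoint
        ?thing rdfs:label ?label .
      }
    }
    """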
  124. … and it’s very similar to what we do with semantic-social systems, but the difference is that we’re doing this data cleaning at the beginning of the project, not the end. We gather up information from different sources, clean it, transform it, and then we load it into some production system that we send off on a mission to do something. Even though this talk has centered on public-facing internet apps, there’s a lot in common with what people do behind the firewall.
125. … we face the same data quality problems that other businesses face. This is a quote from an old article, but it's still pretty relevant. If you've got an idea for an application, you really want to start writing it; nobody really wants to clean up data. That's a good reason to use a reusable knowledge base where the cleaning has been done up front.
126. … and so here's a scenario that I made up, but that's still very plausible. Suppose there's some data set that 25 organizations can use. If all of those organizations need to clean it up, that work gets duplicated 25 times; if the data is cleaned at the source, the total cost is 25 times less. In reality the difference could be even bigger, because there could be hundreds or thousands of users. Now, this cleaning could be done as an in-house project by the publisher, or we could run a data wiki that lets consumers fix the problems they find individually. Either way, it's important that the benefits of cleaning be shared by all users. If the publisher can capture just a fraction of the value that cleaning creates, the community as a whole saves money and builds applications faster…
127. … and reusable knowledge bases have an excellent effect on schedules. If you need to develop a knowledge base in house, it will take a lot of time to plan and execute. When you're working with subjective things, there's a tendency to throw things at a wall and see what sticks… you typically have to try a few things that don't work before you find something that does, so there's some risk things will take longer than you think. If you adopt a reusable knowledge base, however, you save development cost plus months of time. This lets you focus on what you're good at and get products to market quicker.
128. So, in the big picture, what drives data quality is a virtuous circle. If we use data to create profitable applications, we put the data into confrontation with reality, which smokes out problems. There's an incentive to improve the applications and an incentive to improve the knowledge base, so things tend, over time, to get continuously better. Big players, like Amazon, Facebook, and Google, have harnessed this circle to make excellent products…
129. … so we've got to find the right balance. If DBpedia and Freebase weren't free, I think there'd be no chance of getting people to use them as a shared vocabulary for understanding our world; it would be like having to pay to use the English language. On the other hand, the profit motive is the best incentive to get people investing in data quality, and that investment is necessary to build things that work.
130. So to wrap up, I want to put this in the context of the "semantic web stack". This slide is an image that's been going around for years in different forms; notably, some of these things are real standards, like OWL, Unicode, and SPARQL, while others are still considered research topics…
131. If we want to build user interfaces and applications in 2012, we can't wait for research into Proof and Trust, because they're bound up with these problems of data quality…
132. Automatic theorem proving has been a topic in A.I. for a long time, and researchers hope that someday computers will use logic to create proofs over semantic data. This is quite interesting. There's also interest in building big social-semantic systems that lots of people contribute data to, such as the FOAF project, which lets everybody add data to a distributed social network; so there's interest in "trust" in the sense of stopping people from putting malicious data in.
133. In the commonsense domain, though, most proof isn't really logical. In mathematics, a single chain of argument is sufficient to prove something; in the real world, you've often got a lot of circumstantial evidence, and there's a difference between what you can convince yourself of and what you can prove in court. I think today "proof" is the burden of a publisher: to demonstrate that a product has certain properties and is suitable for some purpose. This kind of proof is more closely related to evaluation and software testing than to proving theorems…
134. A consumer, on the other hand, is going to look at this "proof" and decide under what conditions it wants to use the data. In :BaseKB, for instance, we mostly trust Freebase, but we don't like the text labels Freebase uses for things, so we have a rulebox that generates something more descriptive; that's something we decided to do to enforce our point of view. Whenever data crosses an institutional boundary, these issues of proof and trust come up, because you can't build a working system from parts that don't work. It's only a partial answer, but :BaseKB addresses this by including a test suite you can run on your own copy to check that it gets the right answers for common queries.
135. So where does this go? The "common" kind of intelligence that people have is common to many different domains and useful in different apps. Without a theory of mind or any plans to pass the Turing test, we can use Linked Data and a big bag of tricks to clone narrow faculties that apply over a broad range of topics. Ookaboo's ability to find pictures of topics would be useful for blogs and social sites that know the topics of a page, so you can think of it as a faculty that's useful in a larger system. If you look closer at Ookaboo, though, it depends on knowledge about language, importance, and what things are offensive. I think the short path to intelligent apps is packaging up this kind of knowledge and making it reusable…
  136. and if we put them together, we can grow systems that are really big and ambitious.