1. The Mysteries of Metadata
Workshop at Content World 2001, Burlingame, CA. May 15, 2001
Amit Sheth
amit@taalee.com
Founder/CEO, Taalee (www.taalee.com)
[Taalee is now Semagix: www.semagix.com ]
Also, Director, Large Scale Distributed Information Systems (LSDIS) Lab, University Of Georgia
(lsdis.cs.uga.edu)
Metadata Extraction is a patented technology of Taalee, Inc.
Semantic Engine and WorldModel are trademarks of Taalee, Inc.
Confidential HP
2. Workshop Agenda
What is Metadata?
Metadata Descriptions and Standards
Metadata Storage/Exchange/Infrastructure
(Automated) Metadata Creation/Extraction/Tagging
Metadata Usage/Applications
3. What is Metadata?
Data about data
Statements, contexts
Recursive – data about “data about data”
Applications
Content management
Cataloguing
Information retrieval, search
…
"A Web content repository without metadata is like a
library without an index," - Jack Jia, IWOV
6. A metadata classification
[Figure: layers of metadata, from the data at the bottom to the user at the top]
Move in this direction (upward) to tackle information overload!!
User
Ontologies, Classifications, Domain Models
Domain Specific Metadata
(area, population (Census); land-cover, relief (GIS); metadata concept descriptions from ontologies)
Domain Independent (structural) Metadata
(C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...)
Direct Content Based Metadata
(inverted lists, document vectors, WAIS, Glimpse, LSI)
Content Dependent Metadata (size, max colors, rows, columns...)
Content Independent Metadata (creation-date, location, type-of-sensor...)
Data (Heterogeneous Types/Media)
7. Types of Metadata for digital media
Media type-specific metadata
e.g., texture of images, font size…
Media processing-specific metadata
e.g., search, retrieval, personalized filtering
Content Specific metadata
e.g., rocket related video and documents
8. Metadata for Digital Data
Metadata | Data Type | Metadata Type
Q-Features [Jain and Hampapur] | Image, Video | Domain Specific
R-Features [Jain and Hampapur] | Image, Video | Domain Independent
Meta-Features [Jain and Hampapur] | Image, Video | Content Independent
Impression Vector [Kiyoki et al.] | Image | Content Descriptive
NDVI, Spatial Registration [Anderson and Stonebraker] | Image | Domain Specific
Speech Feature Index [Glavitsch et al.] | Audio | Direct Content Based
Topic Change Indices [Chen et al.] | Audio | Direct Content Based
Document Vectors [Deerwester et al.] | Text | Direct Content Based
Inverted Indices [Kahle and Medlar] | Text | Direct Content Based
Content Classification Metadata [Bohm and Rakow] | MultiMedia | Domain Specific
Document Composition Metadata [Bohm and Rakow] | MultiMedia | Domain Independent
Metadata Templates [Ordille and Miller] | Media Independent | Domain Specific
Land Cover, Relief [Sheth and Kashyap] | Media Independent | Domain Specific
Parent Child Relationships [Shklar et al.] | Text | Domain Independent
Contexts [Sciore et al., Kashyap and Sheth] | Structured | Domain Specific
Concepts from Cyc [Collet et al.] | Structured | Domain Specific
User’s Data Attributes [Shoens et al.] | Text, Structured | Domain Specific
Domain Specific Ontologies [Mena et al.] | Media Independent | Domain Specific
9. Types of Specs and Standards
(or MetaModels)
Domain Independent: (MCF), RDF, MOF, DublinCore
Media Specific: MPEG4, MPEG7, VoiceXML
Domain/Industry Specific (metamodels): MARC (Library),
FGDC and UDK (Geographic), NewsML (News), PRISM
(Publishing)
Application Specific: ICE (Syndication)
Exchange/Sharing: XCM, XMI
Orthogonal/(Other): RDFS, namespaces, ontologies,
domain models, (DAML, OIL)
10. What can RDF do for metadata?
Designed to impose structural constraints on syntax to
support consistent encoding, exchange and processing
of metadata.
Domain Independent Metadata standard.
11. RDF (Resource Description Framework)
[Figure: Resource --Property--> Value]
• RDF data consists of nodes and attached attribute/value pairs
• Nodes can be any web resources (pages, servers,
basically anything for which you can give a URI), even
other instances of metadata.
• Attributes are named properties of the nodes, and their
values are either atomic (text strings, numbers, etc.) or
other resources or metadata instances.
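The node/property/value model described above can be sketched in a few lines of Python. This is only an illustration, not an RDF library: the tuple layout and the helper names `properties_of` and `is_resource` are assumptions made for this sketch, and the identifiers (URI:TALK, URI:AMIT, dc:, BIB:) are the ones used in the RDF examples of this deck.

```python
# Hypothetical in-memory model: each statement is (resource, property, value).
triples = [
    ("URI:TALK", "dc:title", "Mysteries of Metadata"),
    ("URI:TALK", "dc:creator", "URI:AMIT"),
    ("URI:AMIT", "BIB:Name", "Amit Sheth"),
    ("URI:AMIT", "BIB:Email", "amit@taalee.com"),
    ("URI:AMIT", "BIB:Aff", "URI:LIB"),
]


def properties_of(resource, statements):
    """Collect the attribute/value pairs attached to one node."""
    return {prop: value for subj, prop, value in statements if subj == resource}


def is_resource(value, statements):
    """Treat a value as a node if it is the subject of some statement."""
    return any(subj == value for subj, _, _ in statements)


talk = properties_of("URI:TALK", triples)
# The creator value is itself a resource, so it can be described further.
creator = properties_of(talk["dc:creator"], triples)
```

Here the title is an atomic value, while the creator is another resource carrying its own properties, which is the distinction the last bullet draws.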
12. RDF Example 1
[Figure: URI:TALK --dc:title--> "Mysteries of Metadata"; URI:TALK --dc:creator--> URI:AMIT]
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.0/">
<rdf:Description rdf:about="URI:TALK">
<dc:title>Mysteries of Metadata</dc:title>
<dc:creator rdf:resource="URI:AMIT"/>
</rdf:Description>
</rdf:RDF>
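To show how such a description is processed, here is a minimal sketch that reads the two statements of Example 1 back out of RDF/XML using only Python's standard library. It assumes straight quotes, a lowercase XML declaration, and the RDF namespace from the 1999 W3C Recommendation (the one used in the PRISM example in this deck); the function name `statements` is hypothetical, and a real application would normally use a full RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.0/"  # assumed Dublin Core 1.0 namespace

doc = f"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="URI:TALK">
    <dc:title>Mysteries of Metadata</dc:title>
    <dc:creator rdf:resource="URI:AMIT"/>
  </rdf:Description>
</rdf:RDF>"""


def statements(xml_text):
    """Return (resource, property, value) tuples for simple rdf:Description blocks."""
    root = ET.fromstring(xml_text)
    result = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for child in desc:
            # A value is either a reference to another resource or literal text.
            value = child.get(f"{{{RDF_NS}}}resource") or (child.text or "").strip()
            result.append((subject, child.tag, value))
    return result


result = statements(doc)
```

ElementTree reports each property as `{namespace}localname`, so the namespace URI must end in `/` or `#` for `dc:title` to expand to a sensible property name.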
13. RDF Example 2
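[Figure: RDF graph in which the creator is itself a described resource]
URI:TALK --dc:title--> "Mysteries of Metadata"
URI:TALK --dc:creator--> URI:AMIT
URI:AMIT --BIB:Name--> "Amit Sheth"
URI:AMIT --BIB:Email--> amit@taalee.com
URI:AMIT --BIB:Aff--> URI:LIB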
14. RDFS (RDF Schema)
Enables resource description communities to define
(and share) vocabularies (museum, library, e-commerce…)
Vocabulary (in RDFS) = the meaning, characteristics,
and relationships of a set of properties.
15. RDF Based Web
[Figure: layers of an RDF-based Web: RDF Schemas; RDF/XML Descriptions; Resources (HTML)]
Source: http://www.w3c.rl.ac.uk
16. Dublin Core Metadata Initiative
Simple element set designed for resource description
International, inter-discipline, W3C community
consensus
“Semantic” interface among resource description
communities (very limited form of semantics)
Source: www.desire.org
17. Dublin Core RDF
<xml>
<?namespace href = "http://w3.org/rdf-schema" as = "RDF">
<?namespace href = "http://metadata.net/DC" as = "DC">
<RDF:Abbreviated>
<RDF:Assertion RDF:HREF = "http://www.mysite.com/mydoc.html"
DC:Title = "I've Never Metadata I've Never Liked"
DC:Creator = "Mary Crystal"
DC:Subject = "Metadata, Dublin Core, Stuff"/>
</RDF:Abbreviated>
</xml>
18. MOF (Meta Object Facility) and XMI
MOF models metadata using the subset of UML that is
relevant to modeling metadata (class models: classes,
associations and subtyping), together with a set of rules for
mapping the elements of the MOF Core to CORBA IDL
XML Metadata Interchange (XMI) is an extension of the
MOF into the XML space
19. NewsML
NewsML is a packaging and metadata format for news
content.
NewsML is developed by the International Press
Telecommunications Council (IPTC), a consortium of
news providers, mostly in the print or wire-service
industries.
Since it deals only with packaging and metadata,
NewsML is complementary both to news content
formats like NITF and to syndication protocols like ICE.
20. NewsML…
It can be used by news providers to combine their
pictures, video, text, graphics and audio files in news
output available on web sites, mobile phones, high-end
desktops, interactive television and any other device.
It provides an accurate, objective set of description tools,
which help qualify the information and make searches more
precise.
NewsML allows a range of metadata to be attached to a
multi-media story, including a detailed computer-
readable description of what an item is about.
21. Example of the end-to-end flow -
NewsML
Content provider: The content provider supplies NewsML packaged media content to the operator. The content is categorized as current events, finance, sport, etc. and updated hourly.
Operator: The operator receives NewsML data from the content provider. The content server automatically pushes updated news articles to all news service subscribers.
Consumer: Consumers sign up for the news service directly on the device. When using the news service, the user browses through the categories and reads the news articles. The news articles are presented in a continuous flow (one after the other) without end-user interaction.
Source: http://www.mediabricks.com
22. PRISM
Publishing Requirements for Industry Standard
Metadata
Version: 1.0, April 2001
Authors: IDEAlliance (Adobe, Vignette, Kinecta et al.)
Idea: “a standard for interoperable content
description, interchange, and reuse in both
traditional and electronic publishing contexts”
Web site: http://www.prismstandard.org
23. PRISM Design
Built on existing standards like Dublin Core (DC),
RDF, XML
Designed to be used in a simple, straightforward way
over the Internet
Compatible with NewsML
Integrates easily with ICE (for syndication)
Vocabulary:
Basic: DC
Extensions: “Controlled Vocabularies”, e.g., “North
American Industry Classification System” (NAICS)
24. PRISM Example
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:prism="http://prismstandard.org/1.0#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://wanderlust.com/2000/08/Corfu.jpg">
<dc:identifier rdf:resource="http://wanderlust.com/content/2357845" />
<dc:description>Photograph taken at 6:00 am on Corfu with two models
</dc:description>
<dc:title>Walking on the Beach in Corfu</dc:title>
<dc:creator>John Peterson</dc:creator>
<dc:contributor>Sally Smith, lighting</dc:contributor>
<dc:format>image/jpeg</dc:format>
</rdf:Description>
</rdf:RDF>
(Source: PRISM spec v. 1; http://www.prismstandard.org/techdev/prismspec1.asp)
25. VoiceXML
A language for specifying voice dialogs.
Voice dialogs use audio prompts and text-to-speech
(TTS) for output; touch-tone keys (DTMF) and automatic
speech recognition (ASR) for input.
Goal is to bring the advantages of web-based
development and content delivery to interactive
voice response applications.
High-level voice-specific language simplifies
application development.
Source: http://www.voicexml.org
27. VoiceXML Metadata
Voice Specific metadata
Supports Syntactic interoperability
Text data to voice data
VoiceXML = XML + Voice Metadata
28. VoiceXML – Possible Services
Information retrieval – News, sports, traffic, stock quotes.
e-Transactions (e-commerce, e-tailing, etc.)
Financial: banking, stock trading.
Catalog browsing (generally as an adjunct to paper).
Telephone services
Personal voice dialing, one-number find-me services.
Intranet – Inventory, HR services, corporate portals.
Unification – My Whatever: personal portals, personal
agents, unified messaging.
Source: http://www.voicexml.org
29. MPEG7
A set of description schemes and descriptors to describe
the content of multimedia data.
A language to specify description schemes
A scheme for coding the description
30. Application Examples for MPEG7
A few application examples are:
Digital libraries (image catalog, musical dictionary,...)
Multimedia directory services (e.g. yellow pages)
Broadcast media selection (radio channel, TV
channel,...)
31. Information and Content
Exchange (ICE)
Main Goal: efficient and extensible Content Syndication
protocol for the Internet, using XML syntax
Authors: Adobe, Kinecta, MS, Sun, Vignette et al.
Status: latest spec version 1.1, May 2000; submitted to
W3C for review
Implementations: Vignette Syndication Server, MS
BizTalk, Kinecta Interact, …
Web Site: http://www.icestandard.org
32. What is the ICE Protocol?
Syndication Protocol for communication between
Syndicators and Subscribers
Metadata to define
roles and responsibilities of involved parties: Subscriber vs.
Syndicator, Requestor vs. Responder, Sender vs. Receiver
format and method of content exchange (e.g., sequenced
packages, pull vs. push model)
33. ICE Applications
ICE vocabulary + domain vocabulary = complete
application
ICE
establishes and manages the syndication
delivers data
logs events
=> content-independent metadata
industry-specific vocabulary defines the content =>
domain-specific metadata
Source: http://www.icestandard.org
34. ICE Explained
ICE: Information and Content Exchange protocol
Syndicator: A content aggregator and distributor
Subscriber: A content consumer
Subscription: An agreement between a subscriber and a syndicator
for the delivery of content according to the delivery policy and other
parameters in the agreement
Collection: The current content of a subscription
ICE Package: A delivery of commands to update a collection such
as the addition of content items
ICE Payload: The XML document used by ICE to carry protocol
information. Examples include requests for packages, catalogs of
subscription offers, usage logs and other management information
Sources: InternetWeek; "ICE Cookbook, version 1.0"
http://www.internetweek.com/ebizapps01/ebiz050701-3.htm
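The vocabulary above can be illustrated with a small data-structure sketch: a subscription holds a collection, and each sequenced package carries commands that update it. The class names mirror the glossary terms, but the fields, the "add"/"remove" command names, and the sequencing check are assumptions made for this illustration; they are not the element names or rules of the ICE specification.

```python
from dataclasses import dataclass, field


@dataclass
class Command:
    action: str        # "add" or "remove" (hypothetical command names)
    item_id: str
    content: str = ""


@dataclass
class Package:
    sequence: int      # position in the sequenced delivery
    commands: list


@dataclass
class Subscription:
    subscriber: str
    syndicator: str
    collection: dict = field(default_factory=dict)  # current content
    last_sequence: int = 0

    def apply(self, package):
        """Apply one package of update commands to the collection."""
        if package.sequence != self.last_sequence + 1:
            raise ValueError("package out of sequence")
        for cmd in package.commands:
            if cmd.action == "add":
                self.collection[cmd.item_id] = cmd.content
            elif cmd.action == "remove":
                self.collection.pop(cmd.item_id, None)
            else:
                raise ValueError(f"unknown action: {cmd.action}")
        self.last_sequence = package.sequence


sub = Subscription(subscriber="reader.example", syndicator="news.example")
sub.apply(Package(1, [Command("add", "a1", "Story A"), Command("add", "a2", "Story B")]))
sub.apply(Package(2, [Command("remove", "a1")]))
```

After the two packages, the collection holds only the second story, which matches the glossary's view of a collection as the current content of a subscription.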
36. XCM (eXtended Content Management)
A framework that allows customers to classify content
management offerings according to the business problems
they address. The segments of XCM are:
Content Development - Developing static content and managing the
process of its subsequent approval, versioning, storage, and retrieval.
Application Content Management (Vignette) - Deploying content
dynamically to a Web site and managing that content throughout its
online lifecycle.
Content Delivery - Delivering content through multiple channels to
minimize customer waiting time and improve Web site stability and
scalability.
Source: http://www.vignette.com/CDA/Site/0,2097,1-1-30-1458-1146-1743,00.html
37. XCM
eXtended Content Management
Content Development: Content Authoring; Digital Asset Management; Software Configuration Management; Document Process Management
Application Content Management: Metadata Management; Recombination; Personalization
Content Delivery Management: Edge Network Delivery; Streaming Media Delivery; Caching
Source: http://www.vignette.com/
38. Multiple heterogeneous metadata models with different
tag names for the same data in the same GIS domain
Kansas State
FGDC Metadata Model | UDK Metadata Model
Theme keywords: digital line graph, hydrography, transportation... | Search terms: digital line graph, hydrography, transportation...
Title: Dakota Aquifer | Topic: Dakota Aquifer
Online linkage: http://gisdasc.kgs.ukans.edu/dasc/ | Adress Id: http://gisdasc.kgs.ukans.edu/dasc/
Direct Spatial Reference Method: Vector | Measuring Techniques: Vector
Horizontal Coordinate System Definition: Universal Transverse Mercator | Co-ordinate System: Universal Transverse Mercator
… | …
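One common way to bridge two such models is a tag-name crosswalk. The sketch below is an illustration under stated assumptions, not a method from this workshop: the dictionary `FGDC_TO_UDK` and the function `fgdc_to_udk` are hypothetical names, the tag spellings are copied from the slide as-is, and the Title/Topic pairing is read from a garbled slide, so it is an assumption too.

```python
# Hypothetical crosswalk from FGDC tag names to the UDK tag names on this slide.
FGDC_TO_UDK = {
    "Theme keywords": "Search terms",
    "Title": "Topic",  # assumed pairing; the slide text is garbled here
    "Online linkage": "Adress Id",  # spelling as it appears on the slide
    "Direct Spatial Reference Method": "Measuring Techniques",
    "Horizontal Coordinate System Definition": "Co-ordinate System",
}


def fgdc_to_udk(record):
    """Rename FGDC tags to their UDK counterparts; values are left unchanged."""
    unknown = [tag for tag in record if tag not in FGDC_TO_UDK]
    if unknown:
        raise KeyError(f"no UDK mapping for: {unknown}")
    return {FGDC_TO_UDK[tag]: value for tag, value in record.items()}


fgdc_record = {
    "Title": "Dakota Aquifer",
    "Direct Spatial Reference Method": "Vector",
}
udk_record = fgdc_to_udk(fgdc_record)
```

A flat rename only works when the two tags really mean the same thing; where the models differ in meaning or granularity, a richer mapping would be needed.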
39. Different views of Metadata
Domain Independent Specifications: RDF
Frameworks/Infrastructures: XCM
Application Specific Metadata: ICE
Media Specific Metadata: MPEG7, VoiceXML
Domain Specific Metadata: NewsML, FGDC/UDK
40. Creating and Serving Metadata to
Power the Life-cycle of Content
[Figure: the content life-cycle, built on the Taalee Semantic MetaBase]
Taalee Infrastructure Services: Produce/Aggregate; Catalog/Index; Integrate/Syndicate
Taalee Content Applications: Personalize; Interactive Marketing
Questions answered along the life-cycle:
Where is the content? Whose is it?
What other content is it related to?
What is the right content for this user?
What is the best way to monetize this interaction?
Delivery channels: Broadcast, Wireline, Wireless, Interactive TV
42. Metadata Creation and
Semanticization
• Automatic Content
Classification/Categorization
• Metadata Creation/Extraction:
Types of metadata created
43. Forms/Types/Ingest of Content
Sources: Web Sites, Content Feeds and Private
Repositories
Types: Text, Graphics, Audio, Video, Multimedia
Forms: Unstructured text, Semi-structured text,
Structured text (+Media); Static or Dynamic
Ingest: Feed (push), Web (pull),
Repository/Database (usually pull)
45. Information Extraction for Metadata Creation
[Figure: EXTRACTORS turn heterogeneous sources into METADATA]
Sources: Nexis, UPI, AP, ...; Documents; Digital Videos; Data Stores; Digital Maps; Digital Images; Digital Audios; Global/Enterprise Web Repositories; ...
46. Extracting a Text Document:
Syntactic approach
[Figure: a sample report annotated with syntactic extraction rules; the right edge of the text is cut off in the figure]
Annotations on the figure: LAYOUT; Date => day month int ‘,’ int
INCIDENT MANAGEMENT SITUATION REPORT
Friday August 1, 1997 - 0530 MDT
NATIONAL PREPAREDNESS LEVEL II
CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been
staffed for structure protection.
SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena and McGr
The fire is active on the southern perimeter, which is burning into a continuous stand of black spruce. The
fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is
35% contained, while protection of the historic cabin continues.
CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Wehking) is
assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. The fire is
contained. Major areas of heat have been mopped-up. All crews and overhead will mop-up where the fire
burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend,
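The rule "Date => day month int ',' int" from the figure can be approximated with a regular expression. This is a minimal sketch of the syntactic approach, assuming English day and month names and a four-digit year; the names `DATE_RULE` and `extract_date` are hypothetical and are not taken from any Taalee extractor.

```python
import re
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

DAY_ALT = "|".join(DAYS)
MONTH_ALT = "|".join(MONTHS)

# Date => day month int ',' int
DATE_RULE = re.compile(
    rf"\b(?P<day>{DAY_ALT})\s+(?P<month>{MONTH_ALT})\s+"
    rf"(?P<dom>\d{{1,2}})\s*,\s*(?P<year>\d{{4}})\b"
)


def extract_date(text):
    """Return the first date matching the rule, or None if nothing matches."""
    match = DATE_RULE.search(text)
    if match is None:
        return None
    month_number = MONTHS.index(match.group("month")) + 1
    return date(int(match.group("year")), month_number, int(match.group("dom")))


report_header = "INCIDENT MANAGEMENT SITUATION REPORT\nFriday August 1, 1997 - 0530 MDT"
report_date = extract_date(report_header)
```

The rule is purely syntactic: it does not check that the day name agrees with the calendar date, and it would miss dates written in any other layout.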
47. Traditional Text
Categorization
[Figure: traditional text categorization pipeline]
Inputs: Customer Training Set; Statistical/AI Techniques; Customer Article Feed (Article 4715)
Steps: Classify -> Place in a taxonomy -> Routing/Distribution
Output: Classification of Article 4715, with Standard Metadata only:
Feed Source: iSyndicate
Posted Date: 11/20/2000
48. Taalee’s Categorization & Automatic Metadata Creation
[Figure: Taalee categorization and metadata creation pipeline]
Inputs: Knowledge-base & Statistical/AI Techniques; Taalee Training Set; Customer Training Set; Customer Article Feed (Article 4715)
Steps: Classify -> Place in a taxonomy -> Automated Content Enrichment (ACE) -> Catalog Metadata
Article 4715 Metadata
Standard metadata:
Feed Source: iSyndicate
Posted Date: 11/20/2000
Semantic metadata:
Company Name: France Telecom, Equant
Ticker Symbol: FTE, ENT
Exchange: NYSE
Topic: Company News
Classification of Article 4715 (taxonomy nodes):
FTE: Company Analysis, Conference Calls, Earnings, Stock Analysis
ENT: Company Analysis, Conference Calls, Earnings, Stock Analysis
NYSE: Member Companies, Market News, IPOs
Outputs: Taalee Enterprise Content Manager; Customization Suite; Precise syndication/filtering; Routing/Distribution; Map to another taxonomy
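The step from standard to semantic metadata can be sketched as a lookup against a knowledge base. The code below is a toy illustration, not Taalee's patented extraction technology: the `KNOWLEDGE_BASE` contents are limited to the values shown for Article 4715, the function `enrich` is hypothetical, plain substring matching stands in for real entity recognition, and the article text is a placeholder.

```python
# Toy knowledge base holding only the entities shown for Article 4715.
KNOWLEDGE_BASE = {
    "France Telecom": {"ticker": "FTE", "exchange": "NYSE"},
    "Equant": {"ticker": "ENT", "exchange": "NYSE"},
}


def enrich(standard_metadata, text):
    """Add semantic metadata for every known company mentioned in the text."""
    lowered = text.lower()
    found = [name for name in KNOWLEDGE_BASE if name.lower() in lowered]
    semantic = {
        "Company Name": found,
        "Ticker Symbol": [KNOWLEDGE_BASE[name]["ticker"] for name in found],
        "Exchange": sorted({KNOWLEDGE_BASE[name]["exchange"] for name in found}),
    }
    return {**standard_metadata, **semantic}


standard = {"Feed Source": "iSyndicate", "Posted Date": "11/20/2000"}
# Placeholder text; the real article body is not part of this deck.
article_text = "Placeholder text mentioning France Telecom and Equant."
record = enrich(standard, article_text)
```

The ticker and exchange never appear in the text itself; they come from the knowledge base, which is what separates this kind of enrichment from extraction of what is literally on the page.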
49. Automatic Categorization & Metadata
Tagging (unstructured text/transcript of A/V)
[Figure: a video segment with associated text is run through Auto Categorization and tagged with Semantic Metadata]
Video Segment with Associated Text
Segment Description:
ABSOLUTE CONTROL OF THE SENATE IS STILL IN QUESTION. AS OF TONIGHT, THE REPUBLICANS HAVE 50 SENATE SEATS AND THE DEMOCRATS 49. IN WASHINGTON STATE, THE SENATE RACE REMAINS TOO CLOSE TO CALL. IF THE DEMOCRATIC CHALLENGER UNSEATS THE REPUBLICAN INCUMBENT THE SENATE WILL BE EVENLY DIVIDED. IN MISSOURI, REPUBLICAN SENATOR JOHN ASHCROFT SAYS HE WILL NOT CHALLENGE HIS LOSS TO GOVERNOR MEL CARNAHAN WHO DIED IN A CRASH THREE WEEKS AGO. GOVERNOR CARNAHAN'S WIFE IS EXPECTED TO TAKE HIS PLACE. IN THE HIGHEST PROFILE SENATE EVENT OF THE NIGHT, HILLARY CLINTON WON THE NEW YORK SENATE SEAT. SHE IS THE FIRST FIRST LADY TO RUN MUCH LESS WIN.
50. Automatic Categorization & Metadata
Tagging (Web page)
[Figure: video with editorialized text on the Web, shown with its Auto Categorization and Semantic Metadata]
51. Automatic Categorization & Metadata
Tagging (Feed)
[Figure: text from Bloomberg, shown with its Auto Categorization and Semantic Metadata]
52. Taalee Extraction and Knowledgebase
Enhancement
[Figure: Web Page -> Extraction Agent -> Enhanced Metadata Asset]
53. Basis for Semantics
A. Facts/Concepts/Terms/Entities
Dictionary, Thesaurus, Reference Data,
Vocabulary
B. Facts with Relationships
Taxonomy/(Categories), Ontology
Domain Modeling (e.g., Golf = golfer, tournament name, golf
course, event)
Knowledge Base
54. Basis for Semantics
C. Reasoning/Inference
(Statistical)
(Information Retrieval)
Statistical Learning/AI (Bayesian, Neural
Networks, HMM,…)
Logic Based (Description Logic)
Natural Language/Grammar (part of speech,..)
55. Alternatives for Metadata
Extraction
Alternatives, in order of deeper understanding:
Statistical methods/Cluster Analysis
Learning/AI and Collab. Filtering
Reference data/Concept-terms/Dictionary/Thesaurus (by word or phrase)
Ontologies/Domain Models (by topic/industry/subject/domain)
KnowledgeBase (by entities and relationships)
57. Ontology
Standardize meaning, description,
representation of involved attributes
Capture the semantics involved via domain
characteristics
Allow knowledge sharing and reuse
(Ontological Commitment)
58. Ontology
Description includes
Attributes
Domain Rules
Functional Dependencies
60. Example: Interrelated ontologies
[Figure: three interrelated ontologies connected by "causes" relationships; the exact links are not recoverable from the flattened diagram]
Land use / zoning: RECREATIONAL LAND, MILITARY SITE, LANDFILL (SITE), CULTIVATED AREA, AGRICULTURAL, GREENLAND AREA, COMMERCIAL, LAND BANK, INDUSTRIAL, RESIDENTIAL, RURAL
Waste disposal: SOLID, SEWAGE, HAZARDOUS, RESOURCE REC., LANDFILL, RECYCLING (washing, shredding, magnetic separation, screening)
Natural disaster: STORM, FLOOD, TSUNAMI, FIRE, VOLCANO, AVALANCHE, LANDSLIDE, EARTHQUAKE
61. Large Vocabularies/
Taxonomies/Ontologies
WordNet
The Medical Subject Headings (MeSH): NLM's
controlled vocabulary used for indexing articles, for
cataloging books and other holdings, and for searching
MeSH-indexed databases, including MEDLINE. MeSH
terminology provides a consistent way to retrieve
information that may use different terminology for the
same concepts. Year 2000 MeSH includes more than
19,000 main headings, 110,000 Supplementary Concept
Records (formerly Supplementary Chemical Records),
and an entry vocabulary of over 300,000 terms.
63. Metadata Usage:
Impact on Search & Query processing
traditional queries based on keywords
attribute based queries
content-based queries
64. Oingo.com
Oingo Ontology – ODP based(?), the database of millions
of concepts and relationships that powers Oingo's
semantic technology
Oingo Seek - the database of millions of concepts and
relationships that powers Oingo's semantic technology
Oingo Sense - the knowledge extraction tool that
uncovers the essential meaning of information by sensing
concepts and context
Oingo Lingua - the language of meaning used to state
intent. The basis for intelligent interaction
Assets catalogued are Web sites or Web pages.
66. Metadata is the basis of making
Content Intelligent
Precisely what the user asked for
Closely-related, high-value information beyond what
was requested
Ability to explore any dimension around the immediate
point of interest
Intelligent content helps the user
“think” about and fulfill their information needs with less effort.
Intelligent content can be
more effectively managed, packaged and distributed
67. Metadata and Intelligent Content
Taalee makes content more “intelligent” through automatic analysis of every
individual asset to generate a catalog containing:
• Context of the Content
• Semantic Metadata describing entities (i.e., Company, Industry, etc.), and
• Relationships (semantic associations) among all entities
Based on a “Semantic” or “domain” model describing how the user thinks
about the subject matter, supported by a knowledgebase.
“Normal” Content can only be “found” if the user enters a keyword that exists within it.
Adding related metadata and relationships dramatically increases the ability to automatically access needed content via multiple dimensions.
“Normal” Content + metadata and relationships = Intelligent Content
68. More than metadata
Taalee makes content more “intelligent” through automatic analysis of
every individual content item to create:
Context of the Content
Semantic Metadata describing entities (i.e., Company, Industry,
etc.), and
Relationships (semantic associations) among all entities
Based on a “Semantic” or “domain” model describing how the user
thinks about the subject matter, supported by a knowledgebase.
69. Metadata & Search
Metadata can improve search significantly, but
metadata enables much more than search
Alternatives for improving search: clustering, link
and other analysis (e.g., Google’s Link Flux
analysis), classification as context, ontologies,
metadata, knowledgebases …
71. Keyword Search vs Attribute
Search with Semantic metadata
Keyword search: metadata from a typical Virage cataloging of football assets
Search on "football touchdown" returns:
Brian Griese Interview Part Four
Brian Griese talks about the first touchdown he ever threw.
URL: http://cbs.sportsline...
Jimmy Smith Interview Part Seven
Jimmy Smith explains his philosophy on showboating.
URL: http://cbs.sportsline...
Attribute search: Taalee metadata on football assets (Rich Media Reference Page)
Baltimore 31, Pit 24
http://www.nfl.com
Qadry Ismail and Tony Banks hook up for their third long touchdown, this time on a 76-yarder to extend the Ravens' lead to 31-24 in the third quarter.
League: Professional
Teams: Ravens, Steelers
Score: Bal 31, Pit 24
Players: Qadry Ismail, Tony Banks
Event: Touchdown
Produced by: NFL.com
Posted date: 2/02/2000
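The contrast on this slide can be sketched in code. The example below is an illustration only: the asset records abbreviate the slide's text, and the function names `keyword_search` and `attribute_search` are hypothetical rather than part of any Taalee or Virage product.

```python
assets = [
    {"title": "Brian Griese Interview Part Four",
     "description": "Brian Griese talks about the first touchdown he ever threw.",
     "metadata": {}},
    {"title": "Jimmy Smith Interview Part Seven",
     "description": "Jimmy Smith explains his philosophy on showboating.",
     "metadata": {}},
    {"title": "Baltimore 31, Pit 24",
     "description": "Third long touchdown, a 76-yarder in the third quarter.",
     "metadata": {"League": "Professional", "Teams": ["Ravens", "Steelers"],
                  "Event": "Touchdown"}},
]


def keyword_search(query, items):
    """Match assets whose title or description contains every query word."""
    words = query.lower().split()
    return [a["title"] for a in items
            if all(w in (a["title"] + " " + a["description"]).lower() for w in words)]


def attribute_search(items, **wanted):
    """Match assets whose semantic metadata has every requested attribute value."""
    def matches(meta):
        for attr, value in wanted.items():
            have = meta.get(attr)
            have = have if isinstance(have, list) else [have]
            if value not in have:
                return False
        return True
    return [a["title"] for a in items if matches(a["metadata"])]


by_keyword = keyword_search("touchdown", assets)
by_attribute = attribute_search(assets, Event="Touchdown", Teams="Ravens")
```

The keyword query also returns an interview that merely mentions a touchdown, while the attribute query returns only the asset whose metadata says it is a touchdown event involving the Ravens.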
72. Taalee’s Semantic Search
Highly customizable, precise and freshest A/V search
Delightful, relevant information, exceptional targeting opportunity
Context and Domain Specific Attributes
Uniform Metadata for Content from Multiple Sources; can be sorted by any field
73. What can a context do?
Creating a Web of
related information
74. Taalee Directory
Georgia Bulldogs
System recognizes ENTITY & CATEGORY
77. Metadata Application Example
Semantic Applications for highly relevant
and fresh content:
Personalization and
Targeting/interactive marketing
Please contact Taalee for live demonstrations
78. Personalized Directory
Change
Context
Obtain a whole universe of information (that you may not even
have thought of) about some entities that have always been of
interest to you.
Please enter such semantic keywords below.
79. Personalized Queries & Hot Topics
PERSONALIZATION
Personalized Queries
1. My Stock Portfolio
Microsoft suffers serious hack attack
Cisco Systems Inc
Analyst Safa Rashtchy on Yahoo!
PeopleSoft, Inc
AT&T Corp.
more…
2. My Football Fantasy Team
Gators' Spurrier ready for 'big' game
Tech's Vick looks to become complete QB
Bucs excited about Hamilton
Jasper Sanks rumbles into the end zone…
Edwards explains reasons for leaving BYU
more…
3. Julia Roberts Collection
Movie Trailer: "Notting Hill"
Trailer - Runaway Bride
Patrick
Movie Trailer: "Stepmom"
Conspiracy Theory
more…
4. Pink Floyd Collection
Set the Controls for the Heart of the Sun…
Wish You Were Here
Round And Around
Keep Talking
The Post War Dream
more…
HOT Topics!!!
1. Election 2000
Video: Explaining the electoral map
Race for White House hots up
Seniors Give Gore Florida Edge
more…
2. Middle East Peace Conflict
Israel steps up security
More die as Israel braces for suicide bombs
Pentagon probes Cole's security
more…
3. Napster Controversy
The Brain Behind Napster
Napster Lawsuit
Creative Nomad II
more…
81. Semantic/Interactive Targeting
Buy Al Pacino Videos
Buy Russell Crowe Videos
Buy Christopher Plummer Videos
Buy Diane Venora Videos
Buy Philip Baker Hall Videos
Buy The Insider Video
Precisely targeted through the use of Structured Metadata and integration from multiple sources
82. Web: Extreme Personalization
[Figure: extreme personalization architecture]
Inputs: Realtime Feeds; Web sites and Pages; Time-Shifted Content; Databases
Flow: Content Aggregator -> Semantic Engine™ -> Structured, Hi-Quality Semantic Metabase
Interests and Preferences are applied to deliver Personalized Content
83. Application of Semantic Metadata and
Automatic Content Enrichment
[Screen: MyMedia menu with $ MyStocks, News, Sports, Music]
User has already completed Web Based registration and personalization at Voquette’s Enterprise Customer site.
User’s “Wireless Home page” shows the categories for his interests. There is an alert (new content) for his stock and sports categories.
84. Application of Semantic Metadata and
Automatic Content Enrichment
[Screen: MyMedia menu ($ MyStocks, News, Sports, Music) with the My Stocks list: CSCO, NT, IBM, Market]
Clicking on MyStocks brings down user’s Personal Portfolio list. The user wants to see news items about Cisco (see next slide).
Search at the bottom is a semantic search that understands the financial domain, and the knowledge of user’s portfolio. Typically search can be done by typing one word or selecting from a dynamic, personalized menu.
85. Application of Semantic Metadata and
Automatic Content Enrichment
[Screen: MyMedia menu, My Stocks list (CSCO, NT, IBM, Market), and the CSCO list: Analyst Call, Conf Call, Earnings]
Different types of recent audio content about Cisco are available. The user clicks to see a listing of Analyst Calls on Cisco (next slide).
Icons at the bottom of the screen enable contextually relevant functions: listen, set alert on story, add to playlist.
86. Application of Semantic Metadata and
Automatic Content Enrichment
[Screen: MyMedia menu, My Stocks list, CSCO list, and the CSCO Analysis list]
CSCO Analysis:
11/08 ON24 Payne
11/07 ON24 H&Q
11/06 CBS Langlesis
Clicking on the link for Cisco Analyst Calls displays a listing
sorted by date. Semantic filtering uses just the right metadata to
meet screen and other constraints. E.g., Analyst Call focuses on
the source and analyst name or company. The icons denote
additional metadata, such as “Strong Buy” by H&Q Analyst.
88. Metadata for Automatic Content
Enrichment
Interactive Television
This screen is customizable with interactivity feature using metadata such as whether there is a new Conference Call video on CSCO.
Part of the screen can be automatically customized to show conference call specific information, including transcript, participation, etc., all of which are relevant metadata.
Conference Call itself can have embedded metadata to support personalization and interactivity.
This segment has embedded or referenced metadata that is used by personalization application to show only the stocks that user is interested in.
89. Metadata in Enterprise Apps
Collection Processing Production Support
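[Figure: enterprise content flow through Collection, Processing, Production and Support]
Collection: Sony Network Content; Affiliate Feeds; Public Sources
Processing: Categorize; Catalog; Integrate (into the Rich Data Metabase)
Production and Support: Filter, Search, Consolidate, Personalize, Archive, Licensing, Syndication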
90. Customize: Page Settings | Content | Layout | Color
[Screenshot: personalized news page with a breaking-news list, a video, its metadata and the article text]
-- Breaking News for 11/30/2000 --
Gore Demands That Recount Restart (9:40 PM)
Gore Says Fla. Can't Name Electors (4:50 PM)
Bush Meets Colin Powell at Ranch (1:22 PM)
Market Tumbles on Earnings Warning (9:27 AM)
Barak Outlines His Peace Plan (6:30 AM)
Video: Sixty Die In Nigeria Blast
Produced by: Euronews
Posted Date: 11/30/2000
Event : Election 2000
Location : Tallahassee, Florida, USA
People : Al Gore, George W. Bush
Article text (cut off at the edge of the screenshot):
A leaking gasoline pipeline burst into flames Thursday, killing more than 60 people near Nigeria's commercial capital of Lagos. Many of the dead were fishermen in wooden canoes engulfed in the inferno.
More than a dozen burned bodies lay on a beach at the village of Ebute-Oko facing the central business district of Lagos across a lagoon.
"At least 60 people died in this needless fire," senior local official Karimu Alabi said.
Fire crews from state-run Nigerian National Petroleum Corp (NNPC), which owns the pipeline, were joined by other firemen from construction company Julius Berger in battling the blaze.
Residents said the fire started near Ebute-Oko at daybreak and spread rapidly along the line of the oil leak, ravaging a cluster of huts and log houses.
At about the same time, a second fire razed Makoko shantytown where thousands of fishermen and their families live in wood cabins erected on stilts in the lagoon near Lagos University.
Residents said fishermen from Makoko had been scavenging for gasoline from the leaking pipeline and storing it in cans in the wooden huts for days. Many victims of the Ebute-Oke fire were
• Greatly enhances news-room productivity and time-to-market
• Value-add for production, broadcast & syndication
• Taalee’s semantic metadata enables powerful access to content used by Enterprise’s customers
91. Description
[Screenshot: news page with a breaking-news list, related video clips, and a description panel]
Produced by : CNN
Posted Date : 12/07/2000
Reporter : David Lewis
Event : Election 2000
Location : Tallahassee, Florida, USA
People : Al Gore
-- Breaking News --
Gore Demands That Recount Restart
Gore Says Fla. Can't Name Electors
Bush Meets Colin Powell at Ranch
Market Tumbles on Earnings Warning
Barak Outlines His Peace Plan
Video clips, as (duration) - date - source:
(1.33) - 12/06/00 - ABC
(2.53) - 12/06/00 - CBS
(5.16) - 12/06/00 - ABC
(2.46) - 12/06/00 - FOX
(1.33) - 12/06/00 - NBC
(5.33) - 12/06/00
(1.33) - 12/06/00 - CBS
(1.33) - 12/06/00 - ABC
(3.57) - 12/06/00 - CBS
(2.33) - 12/06/00 - CBS
(4.27) - 12/06/00 - ABC
(3.12) - 12/06/00 - NNS
(3.44) - 12/06/00 - FOX
(0.32) - 12/06/00 - CBS
(7.24) - 12/06/00 - CBS
(1.33) - 12/06/00 - CBS
Article text (cut off at the edge of the screenshot):
TALLAHASSEE, Florida (CNN) – Though the two presidential candidates have until noon Wednesday to file briefs in Al Gore's appeal to the Florida Supreme Court, the outcome of two trials set on the same day in Leon County, Florida, may offer Gore his best hope for the presidency.
Democrats in Seminole County are seeking to have 15,000 absentee ballots thrown out in that heavily Republican jurisdiction -- a move that would give Gore a lead of up to 5,000 votes statewide.
Lawyers for the plaintiff, Harry Jacobs, claim the ballots should be rejected because they say County Elections Supervisor Sandra Goard allowed Republican workers to fill out voter identification numbers on 2,126 incomplete absentee ballot applications sent in by GOP voters, while refusing to allow Democratic workers to do the same thing for Democratic voters.
The GOP says that suit, and one similar to it from Martin County, demonstrates Democratic Party politics at its most desperate. Gore is not a party to either of those lawsuits. On Tuesday, the judge in the
92. Metadata’s role in emerging iTV infrastructure
Pipeline: MPEG Encoder → Video Enhanced Digital Cable (MPEG-2/4/7) → MPEG Decoder → GREAT USER EXPERIENCE ☺☺☺
• Encoder side: Create Scene Description Tree (Node = AVO Object). Channel sales through Video Server Vendors, Video App Servers, and Broadcasters.
• Decoder side: Retrieve Scene Description Track. License metadata decoder and semantic applications to device makers.
• Taalee Semantic Engine: a Scene Description Tree Node (“Cisco Systems”) becomes a Metadata-rich Value-added Node (“Cisco Systems”).
• Object Content Information (OCI)
• Enhanced XML Description:
Produced by: Fox Sports
Creation Date: 12/05/2000
League: NFL
Teams: Seattle Seahawks, Atlanta Falcons
Players: John Kitna
Coaches: Mike Holmgren, Dan Reeves
Location: Atlanta
93. Intelligent Metadata Creation
Usage
Metadata for Intelligent Content
• Content which does contain the words the user asked for (Extractor Agents)
+
• Content which does not contain the words the user asked for, but is about what he asked for (Value-added Metadata)
+
• Content the user did not think to ask for, but which he needs to know (Semantic Associations)
95. Value-added Metadata
Traditional methods rely solely on (syntactic) indexing of keywords to enable
users to access content
• If a keyword is not in the content, it cannot be found.
• The burden is on the user to think of and ask for the “right” keyword.
For example: If a story is about “Roger Clemens” but does not contain the
words “New York Yankees”, that story cannot and will not be found if the user
searches for “New York Yankees” or “Yankees”.
Understanding of the content is needed to create new metadata.
Taalee understands Roger Clemens is a PERSON who Plays a SPORT called
Baseball for a TEAM from New York called the Yankees.
Taalee uses these Semantic Associations (COMPANY participates in INDUSTRY)
to add missing metadata to describe content more completely.
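The Roger Clemens example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables `PLAYS_FOR` and `TEAM_SPORT` and the helpers `enrich` and `matches` are hypothetical names invented here, and they stand in for, rather than reproduce, Taalee's knowledge base and engine.

```python
# Hypothetical mini knowledge base (illustrative assumption, not Taalee's WorldModel).
PLAYS_FOR = {"Roger Clemens": "New York Yankees"}   # PERSON plays for TEAM
TEAM_SPORT = {"New York Yankees": "Baseball"}       # TEAM plays SPORT


def enrich(metadata):
    """Return a copy of the metadata with team and sport added for known players."""
    enriched = {field: list(values) for field, values in metadata.items()}
    for player in metadata.get("players", []):
        team = PLAYS_FOR.get(player)
        if team is None:
            continue  # nothing known about this player, so nothing to add
        for field, value in (("teams", team), ("sports", TEAM_SPORT[team])):
            if value not in enriched.setdefault(field, []):
                enriched[field].append(value)
    return enriched


def matches(metadata, term):
    """Syntactic keyword match: case-insensitive substring over all values."""
    term = term.lower()
    return any(term in value.lower()
               for values in metadata.values() for value in values)


# Metadata extracted from a story that names only the player.
story = {"players": ["Roger Clemens"]}
enriched_story = enrich(story)
```

With the extracted metadata alone, a search for "Yankees" misses the story; after enrichment the same keyword search finds it, because the team name has been added as value-added metadata.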
96. Guided Demo for Value Added Metadata –
Example one
• Go to http://www.mediaanywhere.com/Football.html & search for Player = Jamal Anderson.
• Click on the first result (titled “Week 3 Top10: Anderson TD Run”) and view the metadata
on the following RMR page
• Here is what you see:
Produced by: NFL.com Posted Date: 9/20/2000 League : NFL
Teams : Atlanta Falcons Players : Jamal Anderson
• Now click on the button to play the asset (button marked “REAL”)
• View the source HTML page that has the original story, and locate this story with the
heading “Week 3 top 10: Anderson TD run”
• Verify that neither Team=Atlanta Falcons nor League=NFL is present in the source content.
• Taalee attached this value-added metadata to this asset’s existing metadata so that a user
searching for Atlanta Falcons will find this story on Jamal Anderson, who plays for the
Atlanta Falcons.
97. Guided Demo for Value Added Metadata –
Example Two
• Go to http://www.mediaanywhere.com/Baseball.html & search for Player = Gary Sheffield
• Click on the first result (titled “I want out!”) & view the metadata on the following RMR page
• Here is what you see:
Produced by: ESPN Posted Date: 3/03/2001 League : National League
Teams : Los Angeles Dodgers Players : Gary Sheffield
• Now click on the button to play the asset (button marked “REAL”)
• View the source HTML page that has the original story, and locate this story with the
heading “I want out!”
• Verify that neither Team=Los Angeles Dodgers nor League=National League is present in
the source content.
• Taalee attached this value-added metadata to this asset’s existing metadata so that a user
searching for Los Angeles Dodgers will find this story on Gary Sheffield, who plays for the
Los Angeles Dodgers.
98. Example 1 – Snapshots (“Jamal Anderson”)
• Search for ‘Jamal Anderson’ in ‘Football’
• Click on first result for Jamal Anderson
• View metadata. Note that Team name and League name are also included in the metadata
• View the original source HTML page. Verify that the source page contains no mention of
Team name and League name. They were Taalee’s value-additions to the metadata to
facilitate easier search.
99. Example 2 – Snapshots (“Gary Sheffield”)
• Search for ‘Gary Sheffield’ in ‘Baseball’
• Click on first result for Gary Sheffield
• View metadata. Note that Team name and League name are also included in the metadata
• View the original source HTML page. Verify that the source page contains no mention of
Team name and League name. They were Taalee’s value-additions to the metadata to
facilitate easier search.
100. Intelligent Content – Value-Added Metadata
Some Metadata are obtained explicitly from the asset. Others (not present in the asset) are
added by Taalee using its semantic relationships.
The asset is richly, fully described in the many ways the users chose to interact.
Rich Media Sports Asset:
• Producer Name: Name of content provider that produced the asset
• Posted Date: Date of asset posting – Extracted automatically
• Player Names: Name of players mentioned explicitly in the asset – Extracted automatically
• Team Name: Name of team for which player plays – Not mentioned explicitly in asset –
Value-added using Taalee’s semantic relationships
• League Name: Name of league to which the player’s team belongs – Not mentioned explicitly
in asset – Value-added by Taalee’s processing based on semantic associations.
• Sport: Name of sport
Legend: X → Y means Taalee uses X to add Y as value-added metadata to the asset
102. Semantic Associations
• Traditional search engines rely solely on (syntactic) keywords to find content.
• They do not understand the meaning, context, or relationships of keywords.
For example: a search engine may see that the phrase “Commerce One” occurs,
but it does not know that Commerce One is a COMPANY which Participates in
the Corporate, Professional & Financial Software INDUSTRY and COMPETES
WITH Ariba.
As a result, search engines cannot go beyond returning a list (or directory view)
of what the user has asked for. Their ability to provide associated information is
extremely limited, static, and difficult to scale.
Taalee’s Semantic Content Model goes beyond indexing keywords and classifying assets to
Understand and Associate all content it catalogs.
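As a rough illustration of the Commerce One example, typed relationships can be stored as (subject, relation, object) triples and followed at query time. The triple layout, the relation names, the `NEWS` catalog and its story identifiers are all assumptions made for this sketch; nothing here is taken from Taalee's actual Semantic Content Model.

```python
# Typed relationships from the slide's example, stored as triples.
# The relation names and the triple layout are assumptions of this sketch.
ASSOCIATIONS = [
    ("Commerce One", "is_a", "COMPANY"),
    ("Commerce One", "participates_in",
     "Corporate, Professional & Financial Software"),
    ("Commerce One", "competes_with", "Ariba"),
]

# Hypothetical catalog: company name -> identifiers of news assets about it.
NEWS = {
    "Commerce One": ["story-101"],
    "Ariba": ["story-202"],
}


def related(entity, relation):
    """Entities linked to `entity` by `relation`, in the order they were stored."""
    return [obj for subj, rel, obj in ASSOCIATIONS
            if subj == entity and rel == relation]


def search_with_associations(company):
    """Return what was asked for plus news on the company's competitors."""
    result = {"asked_for": NEWS.get(company, []), "competitor_news": {}}
    for competitor in related(company, "competes_with"):
        result["competitor_news"][competitor] = NEWS.get(competitor, [])
    return result


result = search_with_associations("Commerce One")
```

A keyword index would return only `story-101`; following the `competes_with` relationship also surfaces the Ariba story, which the user did not ask for by name.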
103. Example (test on http://directory.mediaanywhere.com)
• Search for company ‘Commerce One’
• Links to news on companies that compete against Commerce One
• Links to news on companies Commerce One competes against
(To view news on Ariba, click on the link for Ariba)
• Crucial news on Commerce One’s competitors (Ariba) can be accessed easily and
automatically
104. Metadata centric Content Management Architecture (ASP/Enterprise hosted)
Components:
• Internal Source 1 (Research) → Extractor Agent 1
• Internal Source 2 → Extractor Agent 2
• External feeds/Web (e.g. Reuters) → Extractor Agent 3
• World Model, Knowledge Base, Semantic Engine, Semantic Application
• Taalee Metabase
• Third-party Content Mgmt And Syndication (XCM-compliant metadata, XML or other format)
Flow:
1. Story on Cisco: Cisco story from PW Source 1 passed on to add semantic associations
2. Consults Knowledge Base for Cisco’s competition
3. Returns result: Lucent is a competitor of Cisco
4. Story on Lucent: Lucent story from external feeds picked for publishing as “semantically
related” to Cisco story – passed on to Dashboard
105. Semantic Associations supported by Taalee Semantic Engine
Intelligent Content = What You Asked for + What you need to know!
Associations around a COMPANY:
• Related Stock News
• Competition: COMPANIES in INDUSTRY with Competing PRODUCTS
• Industry News: COMPANIES in Same or Related INDUSTRY
• Regulations (EPA, SEC): Impacting INDUSTRY or Filed By COMPANY
• Technology Products: Important to INDUSTRY or COMPANY
106. Semantic Web Application Example: Financial Advisor Research Dashboard
• Automatic Collation of semantically related digital media information from Multiple Sources
• Research Inferred Automatically
• Semantically Related News Not Specifically Asked For
• Semantic Search/Personalization, etc.
107. A vision for future
Semantic Web, Complex Relationships
and Knowledge Discovery,
E.g., InfoQuilt project at LSDIS Lab, Univ. of Georgia
108. Beyond RDF
– one proposal (cf: Ora Lassila)
Structural modeling obviously not enough
we need a “logic layer” on top of RDF
some type of description logic is a possibility
Exposing a wide variety of data sources as RDF is
useful, particularly if we have logic/rules which allow us
to draw inference from this data
RDF + DL = “Frame System for WWW”
Source : www.ontoknowledge.org/oil
109. Semantic Web - next step in Web evolution
“A Web in which machine reasoning will be
ubiquitous and devastatingly powerful.” [Berners-Lee]
“A place where the whim of a human being and the
reasoning of a machine coexist in an ideal, powerful
mixture.” [Berners-Lee]
“A semantic Web would permit more accurate and
efficient Web searches, which are among the most
important Web-based activities.” [Berners-Lee]
A personal definition
Semantic Web: The concept that Web-accessible
content can be organized semantically, rather than
through syntactic and structural methods.
110. What is DAML (DARPA Agent Markup Language)
a proposal to create technologies that will enable
software agents to dynamically identify and
understand information sources, and to provide
interoperability between agents in a semantic
manner.
Based on RDF+XML
Agent readable Tags
www.daml.org
112. Three layered Architecture Of
Semantic Web
Logical Layer
Formal Semantics and Reasoning
Support – OIL, DAML-O
Schema Layer
Definition of Vocabulary
RDF Schema
Data Layer
Simple data model and syntax for
metadata - RDF
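The three layers can be illustrated with a toy model in Python. This is a sketch under assumptions: statements are plain tuples rather than real RDF/XML, the resource and class names (`story42`, `SportsStory`, and so on) are invented, and the "logical layer" is reduced to one subclass rule, which is far weaker than the description-logic reasoning OIL aims for.

```python
# Data layer: RDF-style (subject, predicate, object) statements about an asset.
DATA = {
    ("story42", "rdf:type", "SportsStory"),
    ("story42", "producedBy", "ESPN"),
}

# Schema layer: vocabulary definitions in the spirit of RDF Schema.
SCHEMA = {
    ("SportsStory", "rdfs:subClassOf", "NewsStory"),
    ("NewsStory", "rdfs:subClassOf", "Asset"),
}


def infer_types(data, schema):
    """Logical layer (toy version): add rdf:type statements implied by subclassing."""
    superclasses = {}
    for sub, pred, sup in schema:
        if pred == "rdfs:subClassOf":
            superclasses.setdefault(sub, set()).add(sup)
    inferred = set(data)
    changed = True
    while changed:  # repeat until no new statement can be derived
        changed = False
        for subj, pred, obj in list(inferred):
            if pred != "rdf:type":
                continue
            for sup in superclasses.get(obj, ()):
                statement = (subj, "rdf:type", sup)
                if statement not in inferred:
                    inferred.add(statement)
                    changed = True
    return inferred


closure = infer_types(DATA, SCHEMA)
```

The data layer only says that `story42` is a `SportsStory`; combining it with the schema lets the rule conclude that it is also a `NewsStory` and an `Asset`, which is the kind of inference the upper layers are meant to enable.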
114. DAML and OIL – Evolving
towards Semantic Web
OIL Mission
OIL is a Web-based representation and inference
layer for ontologies, which combines the widely used
modeling primitives from frame-based languages with
the formal semantics and reasoning services provided
by description logics
115. Knowledge Discovery -
Example
Earthquake Sources Nuclear Test Sources
(USGS, NEIC) (Oklahoma Observatory, etc.)
Nuclear Test May Cause Earthquakes
Is it really true?
116. Complex Relationships
A nuclear test could have caused an earthquake
if the earthquake occurred some time after the
nuclear test was conducted and in a nearby region.
NuclearTest Causes Earthquake
<= dateDifference( NuclearTest.eventDate,
Earthquake.eventDate ) < 30
AND distance( NuclearTest.latitude,
NuclearTest.longitude,
Earthquake.latitude,
Earthquake.longitude ) < 10000
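The rule above can be sketched as a Python predicate. Several details are assumptions of this sketch rather than part of the rule: the distance is computed with the haversine great-circle formula, the threshold is read as miles (as the later slide on finding test/earthquake pairs states), the date difference is required to be non-negative so that the earthquake follows the test, and the sample events are made up.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; miles is an assumed unit


@dataclass(frozen=True)
class Event:
    name: str
    event_date: date
    latitude: float   # degrees
    longitude: float  # degrees


def distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(min(1.0, a)))


def date_difference(test_date, quake_date):
    """Days from the test to the earthquake (negative if the quake came first)."""
    return (quake_date - test_date).days


def could_have_caused(test, quake, max_days=30, max_miles=10000):
    """Apply the rule: the quake follows the test within max_days and max_miles."""
    days = date_difference(test.event_date, quake.event_date)
    near = distance(test.latitude, test.longitude,
                    quake.latitude, quake.longitude) < max_miles
    return 0 <= days < max_days and near


# Made-up sample events, for illustration only.
test = Event("test-A", date(1970, 1, 1), 37.0, -116.0)
quake_soon = Event("quake-B", date(1970, 1, 15), 36.0, -117.0)
quake_before = Event("quake-C", date(1969, 12, 20), 36.0, -117.0)
quake_late = Event("quake-D", date(1970, 3, 1), 36.0, -117.0)

pairs = [(test.name, q.name)
         for q in (quake_soon, quake_before, quake_late)
         if could_have_caused(test, q)]
```

Note that no two points on Earth are more than about 12,400 miles apart along a great circle, so a 10000-mile threshold excludes relatively few pairs; in this sketch the time window does most of the filtering.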
117. Knowledge Discovery -
Example
When was the first recorded nuclear test conducted?
1950
Find the total number of earthquakes with a magnitude
5.8 or higher on the Richter scale per year starting from 1900
Increase in number of
earthquakes since 1945
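The per-year count query can be sketched as follows, assuming the earthquake records are available as (year, magnitude) pairs. The function name and the sample records are invented for illustration and are not real seismic data.

```python
from collections import Counter


def quakes_per_year(quakes, min_magnitude=5.8, start_year=1900):
    """Count earthquakes at or above min_magnitude for each year >= start_year."""
    counts = Counter(
        year for year, magnitude in quakes
        if year >= start_year and magnitude >= min_magnitude
    )
    return dict(sorted(counts.items()))


# Made-up (year, magnitude) records, for illustration only.
sample = [(1899, 7.0), (1901, 5.8), (1901, 6.4), (1901, 5.1), (1946, 6.0)]
per_year = quakes_per_year(sample)
```

Years with no qualifying earthquake are simply absent from the result; a real analysis comparing the periods before and after 1945 would need to treat those years as zero counts.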
118. Knowledge Discovery -
Example…
For each group of earthquakes with magnitudes in the ranges
5.8-6, 6-7, 7-8, 8-9, and >9 on the Richter scale per year
starting from 1900, find average number of earthquakes
Number of earthquakes with
magnitude > 7 almost constant.
So nuclear tests probably only
cause earthquakes with
magnitude < 7
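The grouped query can be sketched in the same style. The handling of bin boundaries is an assumption: each range is treated as half-open, so a magnitude of exactly 6.0 falls in "6-7" and exactly 9.0 falls in ">9". The sample records are again made up.

```python
from bisect import bisect_right

BIN_EDGES = [5.8, 6.0, 7.0, 8.0, 9.0]           # lower edge of each range
BIN_LABELS = ["5.8-6", "6-7", "7-8", "8-9", ">9"]


def average_per_year_by_bin(quakes, start_year, end_year):
    """Average number of earthquakes per year in each magnitude range."""
    n_years = end_year - start_year + 1
    totals = dict.fromkeys(BIN_LABELS, 0)
    for year, magnitude in quakes:
        if not (start_year <= year <= end_year) or magnitude < BIN_EDGES[0]:
            continue  # outside the period or below the smallest range
        label = BIN_LABELS[bisect_right(BIN_EDGES, magnitude) - 1]
        totals[label] += 1
    return {label: totals[label] / n_years for label in BIN_LABELS}


# Made-up (year, magnitude) records, for illustration only.
sample = [(1900, 5.9), (1900, 6.5), (1901, 6.2),
          (1901, 7.1), (1899, 8.0), (1901, 5.0)]
averages = average_per_year_by_bin(sample, start_year=1900, end_year=1901)
```

Computing these averages separately for the years before and after the first recorded test is what would support or undermine the slide's observation about magnitudes above 7.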
119. Knowledge Discovery -
Example…
Find pairs of nuclear tests and earthquakes such that the earthquake
occurred within 30 days after the test was conducted and in a radius of
10000 miles from the epicenter of the earthquake
Demo
120. Resources/References
RDF: www.w3.org/TR/REC-rdf-syntax/
ICE: www.icestandard.org
Meta Object Facility (MOF) Specification, Version 1.3, September 27, 1999:
http://cgi.omg.org/cgi-bin/doc?ad/99-09-05
XML Metadata Interchange (XMI) Specification, Version 1.1, October 25, 1999:
http://cgi.omg.org/cgi-bin/doc?ad/9910-02
http://cgi.omg.org/cgi-bin/doc?ad/99-10-03
DAML: www.daml.org
NEWSML: newsshowcase.reuters.com
PRISM: www.prismstandard.org/techdev/prismspec1.asp
XCM: www.vignette.com
OIL: www.ontoknowledge.org/oil
SEMANTICWEB: www.semanticweb.org
VOICEXML: www.voicexml.org
MPEG7: www.darmstadt.gmd.de/mobile/MPEG7/
Taalee: www.taalee.com
Oingo: www.oingo.com
121. Multimedia Data Management: Using
Metadata to Integrate and Apply
Digital Media,
Amit Sheth and Wolfgang Klas, Eds.,
McGraw Hill, ISBN: 0-07-057735-8,
1998.