Hi, my name is Yuliya. I work at Yandex on the Semantic Web project. Today I'd like to discuss the main trends in the use and development of semantic markup.
First I want to talk about the reasons for using semantic markup at Yandex. Then we'll talk a little about the basic terms. Finally, we'll discuss the development of semantic markup in general, using schema.org as an example.
So, why do we need all this stuff?
There is a huge pile of raw data on the Internet. But that is not enough to give our users an answer. To give them a good answer we need knowledge rather than raw data.
We can extract knowledge automatically (using machine learning, language technologies or specialized parsers). Or we can get knowledge about the content of web pages directly from webmasters. Both methods have their advantages and disadvantages.
Mining the data ourselves means we do not depend on webmasters. Furthermore, this method is more technological. But sometimes we need a special parser for each website. An important disadvantage of this method is that webmasters have no opportunity to influence what we know about their site.
On the other hand, receiving data from webmasters also has advantages and disadvantages. It is good that we get information about the contents of pages from the people who really know what is written on them. In addition, it takes less effort to use this knowledge in search. But many people are not as honest as I would wish, and they may try to cheat the system. And, of course, not all webmasters want to make an effort to give us any information.
In view of the above, at the end of 2009 we started to use additional information sent by webmasters in our services.
How can we collect information from webmasters? First of all, by using special tools. Second, by using XML files in special formats, and other files, even Excel. Another option does not involve anything other than the HTML code of the pages: semantic markup is included directly in the page's source code.
Let's talk about semantic markup.
I want to say a few words about syntax and vocabulary, talk about how semantic markup is used, and show some statistics.
Semantic markup consists of syntax and vocabulary. The first is about how we put information into pages. The second is about what information we give.
There are four main syntaxes for semantic markup: RDFa, Microformats, Microdata and the newest, JSON-LD. And then there are several vocabularies that can be used with these syntaxes. The oldest one is Dublin Core. It was originally created in 1995. In Russia there is even a standard describing Dublin Core. It is very simple and contains only 15 elements. Do not be surprised that microformats are listed as a vocabulary too: in microformats, form and meaning are mixed. GoodRelations is a specialized vocabulary that describes goods and services. The Open Graph Protocol is a Facebook initiative. It is a simple way to convey the most important information about the content of a page. Schema.org is the most promising vocabulary, supported by Google, Bing, Yahoo and Yandex.
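To illustrate the split between syntax and vocabulary, here is a minimal Python sketch that takes one description written with schema.org terms and serializes it in two different syntaxes, JSON-LD and Microdata. The recipe data and the helper names `to_json_ld` and `to_microdata` are made up for this illustration; it is not how any search engine actually produces or processes markup.

```python
import json
from html import escape

# Hypothetical item. "Recipe", "name" and "recipeYield" are schema.org terms
# (the vocabulary); the two functions below are two syntaxes for the same data.
ITEM_TYPE = "Recipe"
ITEM_PROPS = {"name": "Borscht", "recipeYield": "4 servings"}


def to_json_ld(item_type, props):
    """Serialize the item with the JSON-LD syntax, wrapped in a script tag."""
    doc = {"@context": "https://schema.org", "@type": item_type}
    doc.update(props)
    body = json.dumps(doc, indent=2, ensure_ascii=False)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


def to_microdata(item_type, props):
    """Serialize the same item with the Microdata syntax (HTML attributes)."""
    lines = ['<div itemscope itemtype="https://schema.org/%s">' % escape(item_type)]
    for key, value in props.items():
        lines.append('  <span itemprop="%s">%s</span>' % (escape(key), escape(value)))
    lines.append("</div>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_json_ld(ITEM_TYPE, ITEM_PROPS))
    print(to_microdata(ITEM_TYPE, ITEM_PROPS))
```

Both outputs carry the same meaning. Only the way the information is put into the page differs.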
Some history. A long, long time ago, in a galaxy far, far away... wait! That's another story. We began using semantic markup in late 2009. We started making rich snippets and services based on semantic markup. The next year W3C announced HTML5 and microdata, and we started using this method in our products. We even wrote a vocabulary for data about encyclopedias. Then Facebook announced the Open Graph Protocol. The following year schema.org was created. And the world changed. We came up with new ways to use this markup, as well as with changes to schema.org. The first Yandex proposal in schema.org was PeopleAudience. It is now accepted and published, but that took a lot of time. From the outside it seems that nothing is easier than adding a few new properties. But you have to predict what people think and what they might think. How will webmasters and consumers use this data? Isn't it too difficult? Do you want to specify the gender of the target audience? Be ready for the possibility that it might offend people who belong to one sex but identify with the other. Do you want to specify the age of the target audience of the content? It might offend adults who love to read children's books. Today we have Actions and the JSON-LD syntax, and we use them in Yandex.Islands.
According to our database, 24% of documents on the Internet contain some semantic markup. Is that a lot or a little? Of course, this is far from 100%, but over the past three years the number has more than doubled.
Here you can see our statistics on the distribution of semantic markup. The most popular vocabulary is the Open Graph Protocol. Next is schema.org. And that small bar is GoodRelations.
How can this data be used? The major consumers are search engines. They use this data to create rich snippets and to receive content from webmasters for some services. For example, Yandex creates rich snippets for recipes, dictionary articles, movies, chords, etc., and uses information extracted from microdata in Video, Auto, Images and other services. But search engines are not the only consumers of semantic markup. Other internet companies can use it too. For example, Pinterest uses OG and schema.org to create Rich Pins. Facebook, Google, Twitter and other social networks can create rich snippets for shared links.
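As a rough idea of what a consumer of this markup does, the following Python sketch pulls Open Graph properties out of a page using only the standard library. The class name `OpenGraphParser`, the function `extract_open_graph` and the sample page are hypothetical; real consumers such as search engines or social networks use their own, far more robust pipelines.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def extract_open_graph(html_text):
    """Return a dict of Open Graph properties found in the HTML text."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    return parser.properties


# Made-up page used only for this illustration.
SAMPLE_PAGE = """<html><head>
<meta property="og:title" content="Borscht recipe">
<meta property="og:type" content="article">
<meta name="description" content="Not an Open Graph tag">
</head><body></body></html>"""

if __name__ == "__main__":
    print(extract_open_graph(SAMPLE_PAGE))
```

A snippet builder could then use `og:title` as the headline of a shared link, for instance.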
Schema.org does not stand still. There are two levels of change: 1) Public feedback and discussion. The most important points from the public discussion go to the working group. 2) The working group consists of delegates from four search engines (Yandex, Google, Bing and Yahoo). They decide whether to make changes or not.
If you have an idea, problem or question, you can send it to Public-vocabs@w3.org. You can also read this mailing list, reply to questions and help solve other people's problems.
If the idea makes sense, it goes through the working group. First of all we explore the idea. What is the idea? Where should we place this change? How common is this use case? What challenges do we face? Then we discuss the idea. When everyone agrees, the formulated idea is sent to Public-vocabs@w3.org. The next step is collecting feedback from the community. If there are significant comments, we need to repeat the cycle. It may seem that no idea will ever be accepted. But that is not true. And here are some recent updates.
Actions - these are like verbs in the vocabulary
GoodRelations - this is about integration between schema.org and GoodRelations
Integration with vocabulary for learning resources metadata
Health and Medical vocabulary - this is about including a health and medical vocabulary in schema.org
JSON-LD - it's about using schema.org in a new syntax.
And there is some future work
Potential actions - how to describe an action that will happen in the future
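To make the last two points more concrete, here is a small Python sketch that builds a JSON-LD description of a movie with a potential action attached. `Movie`, `WatchAction`, `potentialAction` and `target` are schema.org terms; the function name `movie_with_potential_action`, the movie title and the example.com URL are assumptions made up for this illustration.

```python
import json


def movie_with_potential_action(name, watch_url):
    """Build a JSON-LD document: a Movie that can be watched at watch_url."""
    return {
        "@context": "https://schema.org",
        "@type": "Movie",
        "name": name,
        # An action that has not happened yet but can be performed on this item.
        "potentialAction": {
            "@type": "WatchAction",
            "target": watch_url,
        },
    }


if __name__ == "__main__":
    # Hypothetical movie and URL.
    doc = movie_with_potential_action("Example Movie", "https://example.com/watch/1")
    print(json.dumps(doc, indent=2))
```

The printed JSON could be placed in a `<script type="application/ld+json">` element on the page.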