by Uche Ogbuji | Published December 5, 2017
With the rise of artificial intelligence (AI) and cognitive computing, there’s an increasing need for a structured data format that other computers can easily understand. To meet that need, in 2011 a group of search engine companies and large-scale web publishers created an initiative called Schema.org to describe objects that web pages are actually about.
In this four-part series, I introduce you to Schema.org and show you how to use it to create more searchable web pages. In Part 1, I begin by explaining the history of the project.
To begin with, let’s look at some of the benefits of Schema.org. Why add Schema.org markup to your pages? The bottom line is that doing so will make your pages more accessible and easier to find for search engines, AI assistants, and related web applications. You don’t have to learn any new development systems or tools to use the markup, and you can broadly get up to speed in a couple of hours. Other benefits include:
In the first days of the web, everything you wanted to see was on a home page. Those initial web pages were like a personal bulletin pinned on a public board, but with hyperlinks. The goal was to have humans looking at the pages.
Before long, the Mosaic browser made it possible to embed images among the text, which made the web more enticing for users. Embedded media objects opened the door to audio, video, and application objects. Quickly, other industries besides information and communication started to use — and eventually to dominate — the web.
With the explosion of data on the Internet, it quickly became necessary to categorize and tag content so that humans could more easily find the information they were looking for.
Early web inventors wanted to spread organizational tools more broadly on the web. In the 1990s, work on the “web of data” technology began. The initial predictions for data on the web were grand. A May 2001 story in Scientific American, by Sir Tim Berners-Lee and colleagues, entitled “The Semantic Web,” set forth their ambitions for a new technology that would provide a common language for data on the web, making automation easier.
While much of this envisioned automation is now a reality, it’s primarily due to the extraordinary feats of intense data munging by large search engines and tech companies, and not because the common language for data on the web ever took off. As a result, the automation we have now is not as useful as it would be if there were a common language. The web might seem an amazingly innovative place, but we are missing out on many more possibilities.
The advent of Schema.org will bring to life the promise of the semantic web. Through the efforts of the big players, even smaller players can now benefit.
In 2000, I wrote an article for IBM developerWorks, “An introduction to RDF,” that explained the technology that the World Wide Web Consortium (W3C) was advocating to provide a common language for data on the web. The Resource Description Framework (RDF) is a set of specifications for modeling data on the web, to make work easier for autonomous agents and improve search engines and service directories. RDF was originally conceived as a simple model for expressing bits of data on the web.
Unfortunately, the W3C ended up piling so many complicated specifications on top of RDF (including full-blown AI facilities) that they were never really clear on how to boil the semantic web down to something simple enough that a typical web developer could easily learn.
To counteract these complicated specifications, an initiative called “Linked Open Data” began to push for a simplified set of principles. The name shortened to “Linked Data” as it became clear that the principles were useful even for enterprise and in private contexts. Linked Data basically recommends using HTTP URLs to identify things, rather than, say, plain text strings, and using conventions such as simple RDF to provide associated information for the identified things. This information might consist, for example, of labels that make use of plain text strings.
At first this metadata was provided separately from the web page itself, but web developers quickly began advocating for the use of simple HTML conventions to encode metadata right in the web page. These were called microformats.
All these developments crystallized over the course of a decade into Schema.org in 2011. The high-minded semantic web was simplified into Linked Data, while the need for separate file representations was eliminated by using microformat techniques.
So, what does all this mean to today’s web developer? For one thing, it means you have to ask, “What is my content actually about?”
Let’s say you maintain a web site for a book club. What are your pages about? They are probably about books, meetings, and members, and you describe these things with a conventional set of descriptions. For instance:
A person might be a member of the club and also a book’s author. In that case, some elements of a member’s description could be shared with that of an author. With that in mind, you might visualize the data describing your club as similar to the kind of data organization found in object-oriented programming.
Figure 2 shows part of this mental map, in which I’ve made up what I call the Geo Book Club.
So, what are we looking at?
The ovals are web resources (a little bit analogous to object-oriented instances). The most important thing about this mindset is that you think about URLs as much in terms of things they describe as you do the content they offer. http://example.com/geobookclub is the Geo Book Club’s website. In this model, I also consider it a thing, that is, a club. The resource type describes the type of thing that it is, and I use a leading line in capital letters to indicate this in the diagram.
Resource types organize the conventions for properties that are associated with specific things. For example, a person wouldn’t be associated with an ISBN. Resource types place controls over the data patterns, making it more efficient for applications to understand the data.
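As a rough illustration of how resource types can control data patterns, the following Python sketch checks a description against the properties expected for its type. The type names, property names, and the `unexpected_properties` helper are all hypothetical, made up for this example; they are not Schema.org vocabulary.

```python
# Hypothetical property conventions per resource type (illustrative only).
ALLOWED_PROPERTIES = {
    "Person": {"name"},
    "Book": {"title", "author", "isbn", "cover"},
    "Club": {"name", "member", "meeting"},
}


def unexpected_properties(resource_type, description):
    """Return the properties in a description that the type does not expect."""
    allowed = ALLOWED_PROPERTIES.get(resource_type, set())
    return sorted(set(description) - allowed)


# A person described with an ISBN does not fit the conventions for "Person".
person = {"name": "Chinua Achebe", "isbn": "placeholder-value"}
print(unexpected_properties("Person", person))
```

An application that knows the resource type can flag or ignore the stray property instead of guessing what it means.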
The arrows show the relationships or links between objects. It’s important to label every link that you wish to elevate to an explicit relationship. You don’t just say that the book “Things Fall Apart” is related to the person “Chinua Achebe.” Instead, be more specific: The book “Things Fall Apart” is authored by the person “Chinua Achebe.” Because a book could have other related people, such as editors or illustrators, labeling the specific relationships helps web applications accurately process the data.
Sometimes the value of a relationship is just text rather than another web resource. The diagram shows these as rectangles, and they are called literals. Literals can also be numbers, dates, Booleans, and other sorts of fundamental data.
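The pieces described so far (resources, labeled relationships, and literals) can be sketched in Python as simple statements. This is only one way to picture the model, not how Schema.org itself is written. The book and author URLs and the relationship names are assumptions made up for the example; only http://example.com/geobookclub comes from the diagram.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    """A web resource identified by a URL (an oval in the diagram)."""
    url: str


@dataclass(frozen=True)
class Literal:
    """A fundamental data value such as text or a number (a rectangle)."""
    value: Union[str, int, float, bool]


@dataclass(frozen=True)
class Statement:
    """One labeled arrow: subject --relationship--> value."""
    subject: Resource
    relationship: str
    value: Union[Resource, Literal]


# Hypothetical URLs under the club's site.
book = Resource("http://example.com/geobookclub/books/things-fall-apart")
author = Resource("http://example.com/geobookclub/people/chinua-achebe")

statements = [
    Statement(book, "title", Literal("Things Fall Apart")),
    Statement(book, "author", author),  # a specific label, not just "related to"
    Statement(author, "name", Literal("Chinua Achebe")),
]


def values_of(subject, relationship):
    """Collect every value a subject has for one labeled relationship."""
    return [s.value for s in statements
            if s.subject == subject and s.relationship == relationship]
```

Here `values_of(book, "author")` yields another resource, while `values_of(book, "title")` yields a literal.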
The cloud shapes are just convenient markers for detail we don’t need for this tutorial. I used them to show that a club can have multiple meetings, each a separate relationship, but in this series we care only about the details of the second one.
You could imagine a way of modeling this with some sort of container object, say “membership” to hold the members, or “schedule” to hold the events. However, containers get complex quickly. Schema.org emphasizes simplicity, so the more common convention is merely to express multiple instances of a relationship.
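To make the contrast with containers concrete, here is a small Python sketch in which the club simply repeats the same relationship once per member or meeting. The member and meeting URLs, the relationship names, and the `group_values` helper are hypothetical.

```python
from collections import defaultdict

CLUB = "http://example.com/geobookclub"

# Each tuple is (subject, relationship, value). No "membership" or
# "schedule" container: the relationship is just stated more than once.
statements = [
    (CLUB, "member", CLUB + "/members/1"),    # hypothetical URL
    (CLUB, "member", CLUB + "/members/2"),    # hypothetical URL
    (CLUB, "meeting", CLUB + "/meetings/1"),  # hypothetical URL
    (CLUB, "meeting", CLUB + "/meetings/2"),  # hypothetical URL
]


def group_values(statements):
    """Group values by (subject, relationship) so repeats are easy to read."""
    grouped = defaultdict(list)
    for subject, relationship, value in statements:
        grouped[(subject, relationship)].append(value)
    return dict(grouped)


grouped = group_values(statements)
print(len(grouped[(CLUB, "member")]))
```

An application that wants a list can build one on demand, as `group_values` does, without the model having to define a container type.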
The book cover is an interesting special case. For one thing, it is a web URL linking to an image file. Schema.org allows you to include different sorts of web URLs in relationships, including images and other non-text media objects. There is also no resource type specified. In a few cases such as this, you can let the relationship carry the weight, though Schema.org does also provide a more thorough way of expressing such media relationships where needed.
If the model described above makes sense to you, you are close to understanding RDF well enough to start using Schema.org. Keep in mind just two considerations.
Figure 3 shows a subset of the Geo Book Club model illustrating the fully expressed predicates and type/class relationships. You can imagine how cluttered it would be if I carried all that data through the entire diagram.
There is no Schema.org class specifically for a book club, so I used the one for an organization. Incidentally, Schema.org is not meant to provide a comprehensive model of anything everyone might wish to express on the web. However, if enough book club organizers got together and decided to come up with Schema.org extensions to suit their needs, they might eventually get them into the core Schema.org model. Rough consensus and actual use are the most important drivers in the evolution of Schema.org.
The following diagram shows a Schema.org-conforming model. I use two abbreviations to reduce clutter:
Resource type abbreviation: The second abbreviation is to specify the resource type in parentheses underneath the resource identifier itself.
Besides the change to schema:Organization, there is another vocabulary change to match Schema.org. The cover relationship is given as schema:image.
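Names such as schema:Organization and schema:image are compact forms. Assuming the schema: prefix stands for the http://schema.org/ namespace, as is the common convention, a hedged Python sketch of expanding them into full URLs might look like this. The `expand` helper is hypothetical.

```python
# Assumption: the "schema" prefix maps to the Schema.org namespace.
PREFIXES = {"schema": "http://schema.org/"}


def expand(name):
    """Expand a prefixed name to a full URL; leave other strings unchanged."""
    prefix, separator, local_part = name.partition(":")
    if separator and prefix in PREFIXES:
        return PREFIXES[prefix] + local_part
    return name


print(expand("schema:Organization"))
print(expand("http://example.com/geobookclub"))
```

Ordinary URLs pass through untouched because "http" is not a registered prefix in this sketch.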
Schema.org supports a class inheritance capability similar to what you might know from object-oriented programming. It has one ancestral class schema:Thing, from which all the classes derive.
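The inheritance idea can be sketched in Python as a lookup from each class to its parent. The table below covers only a handful of classes relevant to the book club and, as a simplification, gives each class a single parent.

```python
# A small, simplified slice of the class hierarchy (one parent per class).
PARENT = {
    "schema:Book": "schema:CreativeWork",
    "schema:CreativeWork": "schema:Thing",
    "schema:Person": "schema:Thing",
    "schema:Organization": "schema:Thing",
    "schema:Event": "schema:Thing",
}


def ancestors(cls):
    """Walk up the hierarchy, nearest ancestor first."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def is_a(cls, candidate):
    """True if cls is the candidate class or derives from it."""
    return cls == candidate or candidate in ancestors(cls)


print(ancestors("schema:Book"))
```

Every chain in this sketch ends at schema:Thing, which has no parent of its own.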
Even properties are subclasses of schema:Thing, but this is a bit of an arcane detail.
More interestingly, Schema.org makes much use of subproperties, which are analogous to subclasses. For example, the Schema.org model doesn’t directly specify schema:isbn as a recognized property on schema:Book. Rather, it specifies schema:identifier. However, there are several subproperties of schema:identifier, including:
These different sorts of identifiers make sense in specific contexts.
Subproperties follow the Liskov substitution principle, which you might remember from object-oriented programming. In basic terms, that means that you can substitute any subproperty for its parent. So since schema:identifier is recognized on schema:Book, you are free to substitute schema:isbn, as I do in the Geo Book Club example.
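A minimal Python sketch of that substitution rule follows, assuming a tiny hand-written table of subproperties and recognized properties. The tables are illustrative subsets, and the `is_recognized` helper is hypothetical.

```python
# Illustrative subsets only, following the description in this article.
SUBPROPERTY_OF = {"schema:isbn": "schema:identifier"}
RECOGNIZED = {
    "schema:Book": {"schema:identifier", "schema:name",
                    "schema:author", "schema:image"},
}


def is_recognized(cls, prop):
    """A property is accepted if it, or any parent property, is recognized."""
    while prop is not None:
        if prop in RECOGNIZED.get(cls, set()):
            return True
        prop = SUBPROPERTY_OF.get(prop)  # climb to the parent property, if any
    return False


print(is_recognized("schema:Book", "schema:isbn"))
```

Because schema:isbn climbs to schema:identifier, it is accepted on schema:Book even though the table never lists it there directly.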
If you run a web site, you already deal with models and frameworks for how web pages should look and behave. It’s becoming increasingly important to define what the content means, and, in particular, to describe the things discussed in the website. Schema.org provides a framework that’s growing in popularity for expressing such information.
In this part, you learned how to create models that take the first step toward Schema.org. Now that you understand the Schema.org-based diagram described here, you are ready to implement this model in your own HTML web pages. There are several syntax options for doing so, and I get to these options in the next part.