Introduction to the Schema.org information model
Make your data more useful in automated applications
With the rise of artificial intelligence (AI) and cognitive computing, there’s an increasing need for a structured data format that other computers can easily understand. To meet that need, in 2011 a group of search engine companies and large-scale web publishers created an initiative called Schema.org to describe objects that web pages are actually about.
In this four-part series, I introduce you to Schema.org and show you how to use it to create more searchable web pages. In Part 1, I begin by explaining the history of the project.
Benefits of Schema.org
To begin with, let’s look at some of the benefits of Schema.org. Why add Schema.org markup to your pages? The bottom line is that doing so makes your pages more accessible and easier to find for search engines, AI assistants, and related web applications. You don’t have to learn any new development systems or tools to use the markup, and you can get up to speed in a couple of hours. Other benefits include:
- Aid in contextual search. Search engine companies and specialists are increasingly guiding users based on particular interests, rather than through blanket search terms. They try to understand a user’s intent and surface content that answers it. Is the user shopping? Looking for a film to watch? Searching to solve a technical problem? If you use Schema.org markup, you allow search engines to include your sites according to contextual features, even more so when users are searching by voice or on mobile devices.
- Signal updated, quality content. When it comes to increasing your search engine ranking, there’s no replacement for creating great, quality content and cultivating legitimate links to your content. But using Schema.org markup signals to search engines that your content is well updated and of good quality.
- Increase click-through rates. When your Schema.org-enriched sites do show up in search engine results, their listings can carry modern contextual features called rich snippets. Rich snippets stand out from other search results, leading to better click-through rates by users.
- Improve content’s maintainability. When planning a site’s content, many people forget to plan for how to deal with content when it is out-of-date or irrelevant. Having pages that include Schema.org markup makes it easier to identify these pages and implement a plan during times of transition. Adding Schema.org markup makes it much easier to develop tools to work with your existing pages and incorporate them into successive sites and software projects. It also makes it easier for you to collaborate with partners on new, joint projects based on your existing sites.
Home pages for eyeballs
In the first days of the web, everything you wanted to see was on a home page. Those initial web pages were like a personal bulletin pinned on a public board, but with hyperlinks. The goal was to have humans looking at the pages.
Before long, the Mosaic browser made it possible to embed images among the text, which made the web more enticing for users. Embedded media objects opened the door to audio, video, and application objects. Quickly, other industries besides information and communication started to use — and eventually to dominate — the web.
With the explosion of data on the Internet, it quickly became necessary to categorize and tag content so that humans could more easily find the information they were looking for.
Early web inventors wanted to spread organizational tools more broadly on the web. In the 1990s, work on the “web of data” technology began. The initial predictions for data on the web were grand. A May 2001 story in Scientific American, by Sir Tim Berners-Lee and colleagues, entitled “The Semantic Web,” set forth their ambitions for a new technology that would provide a common language for data on the web, making automation easier.
While much of this envisioned automation is now a reality, it’s primarily due to the extraordinary feats of intense data munging by large search engines and tech companies, and not because the common language for data on the web ever took off. As a result, the automation we have now is not as useful as it would be if there were a common language. The web might seem an amazingly innovative place, but we are missing out on many more possibilities.
The advent of Schema.org will bring to life the promise of the semantic web. Through the efforts of the big players, even smaller players can now benefit.
RDF, linked data, microformats, and more
In 2000, I wrote an article for IBM Developer, “An introduction to RDF,” that explained the technology that the World Wide Web Consortium (W3C) was advocating to provide a common language for data on the web. The Resource Description Framework (RDF) is a set of specifications for modeling data on the web, to make work easier for autonomous agents and improve search engines and service directories. RDF was originally conceived as a simple model for expressing bits of data on the web.
Unfortunately, the W3C ended up piling so many complicated specifications on top of RDF (including full-blown AI facilities) that they were never really clear on how to boil the semantic web down to something simple enough that a typical web developer could easily learn.
Semantic web layer cake
To counteract these complicated specifications, an initiative called “Linked Open Data” began to push for a simplified set of principles. The name shortened to “Linked Data” as it became clear that the principles were useful even for enterprise and in private contexts. Linked Data basically recommends using HTTP URLs to identify things, rather than, say, plain text strings, and using conventions such as simple RDF to provide associated information for the identified things. This information might consist, for example, of labels that make use of plain text strings.
At first this metadata was provided separately from the web page itself, but web developers quickly began advocating for the use of simple HTML conventions to encode metadata right in the web page. These were called microformats.
All these developments crystallized over the course of a decade into Schema.org in 2011. The high-minded semantic web was simplified into Linked Data, while the need for separate file representations was eliminated by using microformat techniques.
An information model for your web pages
So, what does all this mean to today’s web developer? For one thing, it means you have to ask, “What is my content actually about?”
Let’s say you maintain a web site for a book club. What are your pages about? They are probably about books, meetings, and members, and you describe these things with a conventional set of descriptions. For instance:
- Books are described in terms of titles, authors, ISBNs, cover images, and so on.
- Meetings are described in terms of times/dates, locations, and attendees.
- Members are described in terms of their names, contact information, and photos.
A person might be a member of the club and also a book’s author. In that case, some elements of a member’s description could be shared with that of an author. With that in mind, you might visualize the data describing your club as similar to the kind of data organization found in object-oriented programming.
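As a rough illustration of that object-oriented analogy, the descriptions listed above could be sketched as Python dataclasses. This is only a mental aid, not Schema.org markup: the class names, field names, and sample values below are hypothetical and chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    # One description serves both roles: club member and book author.
    name: str
    contact: Optional[str] = None
    photo_url: Optional[str] = None


@dataclass
class Book:
    title: str
    isbn: str
    authors: List[Person] = field(default_factory=list)
    cover_url: Optional[str] = None


@dataclass
class Meeting:
    date: str  # kept as plain text for simplicity
    location: str
    attendees: List[Person] = field(default_factory=list)


@dataclass
class Club:
    name: str
    members: List[Person] = field(default_factory=list)
    meetings: List[Meeting] = field(default_factory=list)


# Hypothetical sample data: the same Person object is a member and an author.
ada = Person(name="Ada Example")
book = Book(title="A Hypothetical Title", isbn="placeholder-isbn", authors=[ada])
club = Club(name="Geo Book Club", members=[ada])
```

Because `ada` is a single object referenced from both `book.authors` and `club.members`, the shared description is stored only once, which is the point of the paragraph above.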
Figure 2 shows part of this mental map, in which I’ve made up what I call the Geo Book Club.
Figure 2. Book club raw information model
So, what are we looking at?
The ovals are web resources (a little bit analogous to object-oriented instances). The most important thing about this mindset is that you think about URLs as much in terms of things they describe as you do the content they offer.
http://example.com/geobookclub is the Geo Book Club’s website. In this model, I also consider it a thing, that is, a club. The resource type describes the type of thing that it is, and I use a leading line in capital letters to indicate this in the diagram.
Resource types organize the conventions for properties that are associated with specific things. For example, a person wouldn’t be associated with an ISBN. Resource types place controls over the data patterns, making it more efficient for applications to understand the data.
The arrows show the relationships or links between objects. It’s important to label every link that you wish to elevate to an explicit relationship. You don’t just say that the book “Things Fall Apart” is related to the person “Chinua Achebe.” Instead, be more specific: The book “Things Fall Apart” is authored by the person “Chinua Achebe.” Because a book could have other related people, such as editors or illustrators, labeling the specific relationships helps web applications accurately process the data.
Sometimes the value of a relationship is just text rather than another web resource. The diagram shows these as rectangles, and they are called literals. Literals can also be numbers, dates, Booleans, and other sorts of fundamental data.
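One way to picture the diagram in code is as a list of labeled statements, each connecting a resource to either another resource or a literal. The sketch below assumes this three-part shape; the class names and every URL other than http://example.com/geobookclub are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    """A thing identified by a URL (an oval in the diagram)."""
    url: str


@dataclass(frozen=True)
class Literal:
    """A plain value such as text, a number, or a date (a rectangle)."""
    value: Union[str, int, float, bool]


@dataclass(frozen=True)
class Statement:
    subject: Resource
    relationship: str  # the label on the arrow
    value: Union[Resource, Literal]


# Hypothetical URLs for the book and its author.
book = Resource("http://example.com/geobookclub/books/things-fall-apart")
achebe = Resource("http://example.com/geobookclub/people/chinua-achebe")

statements = [
    Statement(book, "title", Literal("Things Fall Apart")),
    Statement(book, "author", achebe),  # labeled link, not just "related to"
    Statement(achebe, "name", Literal("Chinua Achebe")),
]


def values_of(subject: Resource, relationship: str):
    """Return every value attached to a subject by the given relationship."""
    return [s.value for s in statements
            if s.subject == subject and s.relationship == relationship]
```

Asking for `values_of(book, "author")` returns the author resource, while `values_of(book, "title")` returns a literal, mirroring the oval and rectangle distinction in the figure.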
The cloud shapes are just convenient markers for detail we don’t need for this tutorial. I used them to show that a club can have multiple meetings, each a separate relationship, although in this series we care only about the details of the second one.
You could imagine a way of modeling this with some sort of container object, say “membership” to hold the members, or “schedule” to hold the events. However, containers get complex quickly. Schema.org emphasizes simplicity, so the convention is more often to simply express multiple instances of a relationship.
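A minimal sketch of that convention: instead of a “membership” container, the same relationship is simply stated once per member. The member and meeting URLs here are hypothetical.

```python
CLUB = "http://example.com/geobookclub"

# Each member and each meeting is its own statement; no container object.
statements = [
    (CLUB, "member", "http://example.com/geobookclub/people/alice"),  # hypothetical
    (CLUB, "member", "http://example.com/geobookclub/people/bob"),    # hypothetical
    (CLUB, "event", "http://example.com/geobookclub/meetings/2"),     # hypothetical
]


def values(subject, relationship):
    """Collect all values of a repeated relationship for one subject."""
    return [v for s, r, v in statements if s == subject and r == relationship]


members = values(CLUB, "member")
```

Adding a member means adding one more statement, with no container to create or keep in sync.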
The book cover is an interesting special case. For one thing, it is a web URL linking to an image file. Schema.org allows you to include different sorts of web URLs in relationships, including images and other non-text media objects. There is also no resource type specified. In a few cases such as this, you can let the relationship carry the weight, though Schema.org does also provide a more thorough way of expressing such media relationships where needed.
RDF version of the model
If the model described above makes sense to you, you are close to understanding RDF well enough to start using Schema.org. Keep in mind just two considerations.
- All relationships must be URLs, not just simple strings such as “member” and “author”. These are formally called predicates in RDF, but Schema.org uses the term properties, and provides a web page for each property it defines. That way, a person—or even a machine—can just go to a relationship’s URL and see a readable description.
- Resource types are expressed using a special RDF predicate, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, conventionally abbreviated as rdf:type. The value of this relationship is called an RDF class.
Figure 3 shows a subset of the Geo Book Club model illustrating the fully expressed predicates and type/class relationships. You can imagine how cluttered it would be if I carried all that data through the entire diagram.
Figure 3. Book club information model snippet with full RDF predicates and type info
There is no Schema.org class specifically for a book club, so I used the one for an organization. Incidentally, Schema.org is not meant to provide a comprehensive model of anything everyone might wish to express on the web. However, if enough book club organizers got together and decided to come up with Schema.org extensions to suit their needs, they might eventually get them into the core Schema.org model. Rough consensus and actual use are the most important drivers in the evolution of Schema.org.
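To make the fully expressed form concrete, here is a small sketch of the club written as plain triples with full predicate URLs. It assumes the http://schema.org/ namespace for Schema.org terms, and the member URL is hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "http://schema.org/"  # assumed namespace for Schema.org terms

club = "http://example.com/geobookclub"

# (subject, predicate, value) triples with every predicate spelled out as a URL.
triples = [
    (club, RDF_TYPE, SCHEMA + "Organization"),  # the resource type, as an RDF class
    (club, SCHEMA + "name", "Geo Book Club"),   # a literal value
    (club, SCHEMA + "member", "http://example.com/geobookclub/people/alice"),  # hypothetical
]


def types_of(resource):
    """Return the RDF classes asserted for a resource via rdf:type."""
    return [o for s, p, o in triples if s == resource and p == RDF_TYPE]
```

Here `types_of(club)` yields the Organization class URL, which a person or a program could visit to read its description.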
Fitting the model to Schema.org
The following diagram shows a Schema.org-conforming model. I use two abbreviations to reduce clutter:
- URL abbreviation convention from RDF: A prefix followed by a colon and the tail end of the URL.
- Resource type abbreviation: The resource type is specified in parentheses underneath the resource identifier itself.
Book club Schema.org information model
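The prefix convention is mechanical enough to sketch in a few lines. The prefix table below is an assumption for illustration (the schema prefix is mapped to http://schema.org/), and the helper name `expand` is made up for this example.

```python
# Assumed prefix table; rdf matches the rdf:type URL given earlier.
PREFIXES = {
    "schema": "http://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(abbreviated: str) -> str:
    """Expand 'prefix:tail' into a full URL; leave anything else unchanged."""
    prefix, sep, tail = abbreviated.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + tail
    return abbreviated  # already a full URL, or an unknown prefix


book_class = expand("schema:Book")
type_predicate = expand("rdf:type")
```

A full URL such as http://example.com/geobookclub passes through untouched because `http` is not a registered prefix.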
Besides the change to schema:Organization, there is another vocabulary change to match Schema.org: the cover relationship is likewise given as a Schema.org property.
Schema.org supports a class inheritance capability similar to what you might know from object-oriented programming. It has one ancestral class, schema:Thing, from which all the classes derive. schema:Organization is a subclass of schema:Thing. schema:Book is a subclass of schema:CreativeWork, which is in turn a subclass of schema:Thing. Even properties are subclasses of schema:Thing, but this is a bit of an arcane detail.
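The class hierarchy can be pictured as a simple parent lookup. This sketch covers only the classes used in the Geo Book Club model and assumes one parent per class for simplicity; the helper names are hypothetical.

```python
# Parent of each class used in the example; schema:Thing has no parent.
SUBCLASS_OF = {
    "schema:Organization": "schema:Thing",
    "schema:CreativeWork": "schema:Thing",
    "schema:Book": "schema:CreativeWork",
}


def ancestors(cls: str) -> list:
    """Walk up the hierarchy and return every ancestor, nearest first."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(cls: str, ancestor: str) -> bool:
    """True if cls is the given class or derives from it."""
    return cls == ancestor or ancestor in ancestors(cls)
```

With this table, a book counts as a creative work and as a thing, but not as an organization.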
More interestingly, Schema.org makes much use of subproperties, which are analogous to subclasses. For example, the Schema.org model doesn’t directly specify schema:isbn as a recognized property on schema:Book. Rather, it specifies schema:identifier. However, there are several subproperties of schema:identifier, schema:isbn among them. These different sorts of identifiers make sense in specific contexts.
Subproperties follow the Liskov substitution principle, which you might remember from object-oriented programming. In basic terms, that means that you can substitute any subproperty for its parent. So since schema:identifier is recognized on schema:Book, you are free to substitute schema:isbn, as I do in the Geo Book Club example.
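The substitution rule can be sketched as a check that walks from a property up to its parents. The two tables below are a tiny illustrative subset that follows the description in this article; they are assumptions for the sketch, not the authoritative Schema.org definitions, and the function name is hypothetical.

```python
# Illustrative subset: which property is a subproperty of which.
SUBPROPERTY_OF = {
    "schema:isbn": "schema:identifier",
}

# Illustrative subset: properties recognized on a class, per the text above.
RECOGNIZED = {
    "schema:Book": {"schema:identifier", "schema:name", "schema:author"},
}


def is_usable(prop: str, cls: str) -> bool:
    """True if prop, or any parent it can stand in for, is recognized on cls."""
    allowed = RECOGNIZED.get(cls, set())
    current = prop
    while current is not None:
        if current in allowed:
            return True
        current = SUBPROPERTY_OF.get(current)  # climb to the parent property
    return False
```

Under these assumptions, schema:isbn is accepted on schema:Book because its parent schema:identifier is recognized there, while an unrelated property is rejected.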
If you run a web site, you already deal with models and frameworks for how web pages should look and behave. It’s becoming increasingly important to define what the content means, and, in particular, to describe the things discussed in the website. Schema.org provides a framework that’s growing in popularity for expressing such information.
In this part, you learned how to create models that take the first step toward Schema.org. Now that you understand the Schema.org-based diagram described here, you are ready to implement this model in your own HTML web pages. There are several syntax options for doing so, and I get to these options in the next part.