Taxonomy Icon

Artificial Intelligence

We all know Wikipedia, the multi-lingual, multi-cultural online encyclopedia created and edited by volunteers worldwide. It is the flagship project of the Wikimedia Foundation. But there are other related projects, and one of the more recent is Wikidata, which bills itself as a free knowledge base that anyone can edit.

What an encyclopedia is to people – a cross-referenced collection of information they can use to learn and to verify knowledge – a knowledge base is to machines. Knowledge bases can drive machine learning and other AI techniques, provide core data to feed software projects, and serve as tools for cross-integration and cross-validation of databases, among many other uses.

Wikidata uses the Creative Commons CC0 dedication to place its material in the public domain, so it’s not just a comprehensive source of data, but one unencumbered by intellectual property restrictions as well. If you have structured data that others might find useful, don’t mind dedicating it to the public domain, and don’t mind working on it in collaboration with others, you can contribute to Wikidata, following editorial control processes not unlike those of Wikipedia.

Wikidata in Wikipedia information boxes

You might have noticed the recent page previews feature in Wikipedia. The following image shows what happens when I load the Wikipedia page for Nigeria then hover over the link to its capital city, Abuja.

nigeria-abuja

You can see a window for Abuja overlapping the left margin somewhat. What you see here is basic information about the page: its title, a brief summary or description, and sometimes an image.

This idea of presenting a few bits of content with a regular core structure actually goes way back in Wikipedia, and page previews is merely a particularly flashy manifestation of it. Wikipedia has long had a system of reusable templates for structured summaries of data, known as Infoboxes.

Before I move on, I want to make sure that if you like the page previews feature you can take advantage of it. Go to your Wikipedia preferences page, which should be linked along the top of any Wikipedia page after you’ve logged in. You’ll see the option and you can make sure it’s turned on, as illustrated in the following image.

page-preview-prefs

Data and Infoboxes

A Wikipedia Infobox is a summary description of a person, place, or thing in a structure that allows it to be compared or contrasted with other similar things. The structure is edited by a Wikipedia editor as a set of attributes and values. For example, an Infobox about a book might have a title attribute with the value “Things Fall Apart” and an author attribute with the value “Chinua Achebe.” An editor or bot might then use this to build a handy table of, say, books published in 1958, relying on the comparable structure of attributes to simplify the process.
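
In the wikitext an editor works with, such an Infobox is a template invocation with named parameters. The following is a rough sketch only; actual template and parameter names vary from one Infobox to another.

{{Infobox book
| name   = Things Fall Apart
| author = Chinua Achebe
}}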

Making use of such Infoboxes has become more important as Wikipedia pushes to be an increasingly comprehensive reference for all sorts of things. The broad scope of the project means there needs to be good management of the underlying structures shared between Infoboxes and other parts of Wikipedia, and even other Wikimedia projects. Such managed structure is just as important when Wikipedia references one or more resources from Wikiquote, for example.

It is important to have some sort of repository of knowledge that is structured to be reused within and across such projects, and this is Wikidata’s primary role. It is the central knowledge base for the sister Wikimedia projects, which, as you might be aware, primarily use the MediaWiki software for content management. MediaWiki has a subproject, Wikibase, which is all about managing structured data within MediaWiki.

The main component of Wikibase is Wikibase Repository, a MediaWiki extension that allows a wiki to serve as a structured data repository. There is also Wikibase Client, which lets a MediaWiki installation use structured data from a source Wikibase Repository. Wikidata is implemented using installations of both.

MediaWiki and the Wikibase projects are all open source, and you can use them for private or separate applications if you like, but in this article I focus on how they are used for Wikidata.

Wikidata identifiers and properties

Wikidata is structured as a repository of items. An item is, in common knowledge base terms, a thing, an entity, a concept; in short, whatever can be described. Each item carries a set of descriptive attributes, the most important of which in Wikidata terms are the label and the description. An item might also have one or more aliases, which are alternative labels.

Each item has a unique identifier, always in the form of the letter Q followed by a number. Each property also has an identifier, a P followed by a number. A statement is a triple of an item ID, a property ID, and a value. A value can be another item, in which case the statement is a reference from one item to another, or it can be a plain data value. For example, the statement that the city of Denver is in the state of Colorado forms a relationship between two items. The statement that the state of Colorado was founded in 1876 relates an item to a plain data value.
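
Sketched informally as triples, those two statements look something like the following. Q16554, the ID for Denver, comes up later in this article; the property IDs and the Colorado ID shown here are only illustrations of the pattern, so don’t rely on the specific numbers.

Q16554 (Denver)    P131 (located in)   Q1261 (Colorado)   <- value is another item
Q1261  (Colorado)  P571 (inception)    1876               <- value is a plain data value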

The following diagram, by Wikipedian John Erling Blad, shows an example of some Wikidata statements in the context of the item San Francisco (ID Q62).

linked-data-san-francisco

The values of the mayor and in properties are references to other items. The values of the population and location properties are data values: one a simple value of type Number, the other a compound value of type Geolocation.

Finding and extracting data from Wikidata

The best way to get proficient with Wikidata is to dive in. I’ll start out using a couple of command-line tools: curl for interacting with web-based resources, and, because Wikidata, like many web data interfaces, uses JSON, jq, a very handy command-line utility for processing JSON. Both are available for Linux, Mac, and Windows.

One simple thing to do is to get the information (ID, label, aliases, description, and other statements) about a Wikidata item with a given ID. Try the following command line.

  curl "https://www.wikidata.org/w/api.php?format=json&action=wbgetentities&ids=Q16554" | jq

You will see a great deal of JSON (almost 5000 lines) printed out in nice spacing and color coding, as provided by jq. This is the information about Wikidata entity Q16554, the city of Denver; Wikidata holds a lot of information about it, in many languages.
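
If you just want to confirm that you reached the item you expected, you can filter the response down to its English label. The -s flag merely silences curl’s progress meter.

$ curl -s "https://www.wikidata.org/w/api.php?format=json&action=wbgetentities&ids=Q16554" | \
  jq '.entities.Q16554.labels.en.value'
"Denver"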

I mentioned Wikibase earlier, and one of the things it provides is an API that you can query from the web, which is what we’re using here through the URL. You specify an action as a URL parameter, in this case wbgetentities. This action retrieves the data for one or more Wikibase entities (Wikidata items in this case) according to criteria you provide in other URL parameters.

In particular, the ids=Q16554 parameter is what specifies the items to be retrieved. You can specify multiple items as well, for example ids=Q16554|Q192517, which would provide information on Denver as well as Boulder, Colorado.
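
For example, the following fetches both items in one request and prints just their English labels (this assumes Q192517 is indeed Boulder, as above; the output order might differ).

$ curl -s "https://www.wikidata.org/w/api.php?format=json&action=wbgetentities&ids=Q16554|Q192517" | \
  jq '.entities[].labels.en.value'
"Denver"
"Boulder"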

There is also format=json, specifying JSON, the only output format that is still officially maintained for API-style use from MediaWiki, and so from Wikibase. You won’t want to leave out this parameter, because without it you get an HTML response that would be a lot more work to deal with in an automated fashion.

Lookups by Wikipedia page

What if you don’t know the Wikidata ID, as will often be the case? This is one area where you get value out of the connection between Wikidata and other Wikimedia sites. The same wbgetentities action allows you to look up a Wikidata item by the title of its corresponding page on a given Wikipedia, or other Wikimedia site.

But because these command lines start to get pretty long, let’s use the shell environment for a shortcut. Set up a variable; for example, the following line works on most Linux systems.

export WBGETENTITIES="https://www.wikidata.org/w/api.php?format=json&action=wbgetentities"

Now use it in a curl request like in the following command line.

curl "$WBGETENTITIES&ids=Q16554" | jq

This is equivalent to the first curl request above. You might have to tweak how you set or refer to environment variables in your shell, or you can just change all subsequent examples in this article to replace $WBGETENTITIES with https://www.wikidata.org/w/api.php?format=json&action=wbgetentities.

The following request pulls the Wikidata for an item by its Wikipedia site and page title.

curl "$WBGETENTITIES&sites=enwiki&titles=Denver" | jq

This time there is no ids URL parameter but rather a pair of parameters, titles=Denver and sites=enwiki. Remember that the order of the parameters does not matter in URLs. The first is the title of a page to be looked up. There is a page in the English Wikipedia titled “Denver.” This is what we’re looking for, but we also must specify that site, and we do so by using sites=enwiki.

This restricts the search to the English Wikipedia (en.wikipedia.org). There are multiple Wikipedia sites, primarily separated by language. The basic idea is that each language’s Wikipedia site only has articles written in, or translated into, that language, so if there are 10 times as many contributors working in English as in German, you would expect many more pages on en.wikipedia.org than on de.wikipedia.org.

You can also specify sites such as language versions of Wiktionary, Wikibooks, Wikiquote, and the very handy Wikimedia Commons. Specify multiple sites to be searched by using a pipe separator, for example, sites=enwiki|eswiki. The same goes for titles, though if you specify multiple titles you can use only one site, and vice versa.
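
For instance, the following looks up the pages for Denver and Abuja (the city from the page previews example earlier) in a single request and prints the matching item IDs. One of them will be "Q16554"; the other will be whatever ID the Abuja item carries.

curl "$WBGETENTITIES&sites=enwiki&titles=Denver|Abuja" | jq '.entities[].id'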

More parameters

Another useful parameter is languages, which lets you reduce some parts of the response to only the languages you care about. Output from the following command is a few hundred lines shorter.

curl "$WBGETENTITIES&languages=en&sites=enwiki&titles=Denver" | jq

If you get the capitalization wrong in a site or title search, you won’t get any results, as the following attempt demonstrates.

$ curl "$WBGETENTITIES&languages=en&sites=enwiki&titles=denver" | jq
{
  "entities": {
    "-1": {
      "site": "enwiki",
      "title": "denver",
      "missing": ""
    }
  },
  "success": 1
}

The previous output is what it looks like when you don’t get a matching result. The problem is the lowercase “d” in “denver”: the target page title has the uppercase letter. You can get around this distinction by adding the normalize=yes parameter, which makes the lookup more forgiving.

$ curl "$WBGETENTITIES&languages=en&normalize=yes&sites=enwiki&titles=denver" | jq
{
  "normalized": {
    "n": {
      "from": "denver",
      "to": "Denver"
    }
  },
  "entities": {
    "Q16554": {
      "pageid": 19277,
      "ns": 0,
      "title": "Q16554",
      "lastrevid": 681640180,
      "modified": "2018-05-19T18:31:58Z",
      "type": "item",
      "id": "Q16554",
      "labels": {
        "en": {
          "language": "en",
          "value": "Denver"
        }
      },
      "descriptions": {
        "en": {
          "language": "en",
          "value": "capital city of the state of Colorado, United States; consolidated city and county"
        }
      },
      ...

It shows what was normalized at the top of the response.

Slicing and dicing with jq

These commands produce a large volume of data. Luckily, you can use all of the power of the jq command line to get what you need from the JSON. For example, you can get just a list of the corresponding Wikidata IDs from a sites/titles query as follows.

$ curl "$WBGETENTITIES&languages=en&sites=enwiki&titles=Denver" | jq '.entities[].id'
"Q16554"`

The jq argument selects the top-level entities object, iterates over its values with the [] part, and then selects the id from each.
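
You can also have jq build a small object out of several fields at once, for example:

$ curl -s "$WBGETENTITIES&languages=en&sites=enwiki&titles=Denver" | \
  jq '.entities[] | {id: .id, label: .labels.en.value}'
{
  "id": "Q16554",
  "label": "Denver"
}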

You can see the object keys for each returned entity as follows.

$ curl "$WBGETENTITIES&languages=en&normalize=yes&sites=enwiki&titles=denver" | \
  jq  '.entities[] | keys'
[
  "aliases",
  "claims",
  "descriptions",
  "id",
  "labels",
  "lastrevid",
  "modified",
  "ns",
  "pageid",
  "sitelinks",
  "title",
  "type"
]

This is an example of a chain of two filters in jq. The first gets each object within entities; the pipe then feeds each into keys, a built-in filter that returns just the object’s keys.

I’ve shown id and I’ve discussed labels, aliases, and descriptions. Let’s use jq’s ability to select one or more objects to show these values.

$ curl "$WBGETENTITIES&languages=en&normalize=yes&sites=enwiki&titles=denver" | \
  jq  '.entities[] | .labels, .descriptions, .aliases'
{
  "en": {
    "language": "en",
    "value": "Denver"
  }
}
{
  "en": {
    "language": "en",
    "value": "capital city of the state of Colorado, United States; consolidated city and county"
  }
}
{
  "en": [
    {
      "language": "en",
      "value": "City and County of Denver"
    },
    {
      "language": "en",
      "value": "Denver, Colorado"
    },
    {
      "language": "en",
      "value": "Mile High City"
    },
    {
      "language": "en",
      "value": "Queen City of the Plains"
    },
    {
      "language": "en",
      "value": "Queen City of the West"
    }
  ]
}
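
Chaining one more step flattens the aliases into bare strings.

$ curl -s "$WBGETENTITIES&languages=en&normalize=yes&sites=enwiki&titles=denver" | \
  jq '.entities[].aliases.en[].value'
"City and County of Denver"
"Denver, Colorado"
"Mile High City"
"Queen City of the Plains"
"Queen City of the West"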

Claims and statements

I’ve also discussed statements, which are represented as claims in this API. Claims are the bulk of the typical API response, as each is represented in great detail. Let’s look at one example.

$ curl "$WBGETENTITIES&languages=en&normalize=yes&sites=enwiki&titles=denver" | \
  jq  '.entities[].claims.P571'
[
  {
    "mainsnak": {
      "snaktype": "value",
      "property": "P571",
      "hash": "8c0b8cba2c574ab60396aed21b1792960da405ae",
      "datavalue": {
        "value": {
          "time": "+1858-11-22T00:00:00Z",
          "timezone": 0,
          "before": 0,
          "after": 0,
          "precision": 11,
          "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
        },
        "type": "time"
      },
      "datatype": "time"
    },
    "type": "statement",
    "id": "Q16554$f9c34373-46d8-9a19-741d-fb226780409d",
    "rank": "normal"
  }
]

I’ll focus on a few key bits of this. Most important is the property ID. Is this claim/statement about the population of Denver? The elevation of its highest point? You can’t actually tell from the previous output. You must look up the property ID, P571, using the same pattern as when looking up an item ID, and use jq to narrow down to the information you need.

$ curl "$WBGETENTITIES&languages=en&ids=P571" | \
  jq  '.entities[] | .labels, .descriptions'
{
  "en": {
    "language": "en",
    "value": "inception"
  }
}
{
  "en": {
    "language": "en",
    "value": "date or point in time when the organization/subject was founded/created"
  }
}

So now we know that the property in question is the date of inception of Denver, or in other words, its founding date. The value is nested within datavalue and value.

"time": "+1858-11-22T00:00:00Z"

The date is expressed in the common internet convention based on the ISO 8601 standard.

YYYY-MM-DDThh:mm:ss...

So the year is 1858, the month is November, and the day is 22: Denver was founded on 22 November 1858. The time of day is not specified and is left zeroed out. The Z at the end is the usual default when the time information is not specified; it denotes UTC, essentially Greenwich Mean Time, with the Z coming from the aviation term “Zulu time.”
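
Note the leading + on the value, which in Wikidata’s time format marks a Common Era year (BCE years get a - instead). If you want to hand the date to other tools, one simple way to strip that marker and keep just the calendar date is the following.

$ echo "+1858-11-22T00:00:00Z" | sed 's/^+//' | cut -dT -f1
1858-11-22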

Exploring Wikidata through the sandbox

So far, I’ve covered the nuts and bolts of accessing Wikidata, but after you start really playing around with it and exploring it for your own uses, the command line can become a bit tedious. Luckily, there is a handy API sandbox for Wikidata, at Special:ApiSandbox on wikidata.org.

wikidata-api-sandbox1

The first field is the action, which, to simulate our previous examples, you would set to wbgetentities.

Below the format field there are many technical options that you can explore, but to find the parameters such as ids, sites, and titles that I’ve been using, you must click the action=wbgetentities text that shows up in the left column, under main. After you do this, the page looks like the following image.

wikidata-api-sandbox2

Feel free to try entering IDs or sites/titles combinations and tweaking other parameters to get more comfortable with the Wikidata API. Make sure you press Enter after each value you type into the ids, sites, or titles field, or the web app won’t recognize your entry. This is a little different from typical web form behavior.

The following image shows what the page displays after I specified enwiki in sites and IBM in titles then clicked Make request.

wikidata-api-sandbox3

You can see on the right side the same JSON response you saw on the command line, and you can see that Q37156 is the Wikidata ID for the IBM corporation.

Conclusion

There is an enormous amount of data in Wikidata, a lot of which derives from the contributions to Wikipedia of more than 500,000 humans and bots, in more than 300 languages. It is a rich source of basic definitions and descriptions of data entities, which you can import into your own apps. It is also a source of lists and other structured data you can treat as references that you don’t have to maintain yourself.

I recommend playing around with the sandbox to get used to how the API works, and then experimenting further with scripting on the command line. After you get used to navigating the structure of Wikidata, you can start to look into more tailored access from your own applications, using the API from your favorite programming language or framework. That’s what I’ll discuss in the second part of this series.