In part 1 of this series, I introduced Wikidata and showed you how to explore it using the command line and the API sandbox. In this article, I’ll demonstrate a basic task to delve into some of the practicalities of working with Wikidata in other projects.

A Wikidata snapshot of South America

I’d like to build a simple web profile of the countries in South America by using Wikidata. One reason to use Wikidata for this is to avoid having to maintain my own database of information about the continent. To shift the example to Africa, developers who relied on available authoritative information such as Wikidata would not have had to update a local database when Sudan split politically and the nation of South Sudan was born.

Of course, delegating control of key information has its own risks. For example, vandalism is a well-known problem in Wikipedia, which is usually caught and addressed for more popular resources such as country entries. But there is a small possibility that such vandalism might persist and even find its way into Wikidata. The same thing could happen with unintentional errors.

To balance these benefits and risks, you will want a good sense of your application’s lifecycle and data flow, some of which comes with experience. Keep in mind that you can pursue hybrid approaches: for example, data is pulled from Wikidata into a local cache that the application uses, but any relevant changes detected in Wikidata are flagged for review and confirmation before the local cache is updated.
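
As a minimal sketch of that hybrid pattern, the comparison step might look like the following; the cache storage and review queue around it are hypothetical and not part of any Wikidata API.

#Minimal sketch of the hybrid approach: compare the data your application
#currently uses (the local cache) with a fresh copy from Wikidata, and
#report what changed so a human can confirm before the cache is updated.
def changed_keys(cached, live):
    '''
    cached, live - dictionaries describing the same Wikidata item
    Returns the keys whose values differ between the cache and Wikidata
    '''
    return [ k for k in set(cached) | set(live) if cached.get(k) != live.get(k) ]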

I’ll proceed here with an example that uses a direct feed from Wikidata.

My first step is to search Wikipedia for the continent. The following image shows the page that comes up for a search of “South America.” The search box is highlighted in the upper right.

image of South America

I’ve also highlighted a link, “Wikidata item” near the lower left. You should find this link on all Wikipedia pages. Click it to display the Wikidata page for South America. The following image shows the top of this page. The very first thing to note is the Wikidata Q ID (Q18) specified as part of the title.

Wikidata page for South America

At the bottom of the image, see a very long section headed “Statements.” This is where you find claims. The one you can see in the previous image is the claim, or statement, that South America is a continent. Notice where it says “0 references.” Wikipedia and related sites strongly encourage contributors to cite sources, and this goes for structured data as well. Unfortunately, in this case there is no cited source for the claim that South America is a continent.

You might find 0 references in cases where the relevant sentence in the corresponding page, in say Wikipedia, is missing a citation. Sometimes there is a citation, but the person, bot, or template process that adapted the prose in Wikipedia into structured Wikidata didn’t capture the citation as a reference. Such a reference obviously helps establish the authority of the data, so you must decide whether to use claims without references, and how much to trust whatever references are available.
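
If you decide to require references, one simple filter is to keep only claims that cite at least one reference. The following is a small sketch, assuming a list of claim dictionaries in the Wikidata JSON form shown later in this tutorial.

#Sketch: keep only claims that cite at least one reference.
#Assumes a list of claim dictionaries in the Wikidata JSON form
#shown later in this tutorial.
def referenced_only(claims):
    return [ claim for claim in claims if claim.get('references') ]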

List of countries

If you scroll down the Wikidata page in the statements (that is, claims) section, you’ll see a listing of countries in South America. They are related by the property with the label “contains administrative territorial entity.”

Wikidata page for South America

If you click through that property link, “contains administrative territorial entity,” you’ll find a wealth of information about the property and, of course, a key detail: the ID of the property, which is P150.

Notice the “edit” and “add reference” links on the right side. This is an example of how people can contribute to Wikidata. However, for now scroll down through the relatively brief list of South American countries.

The peril of errors

It might have changed by the time you do so, but when I looked, there was something very interesting. Look at the following image.

Wikidata page for South America, scrolled down to Chao Zhongxun

You’ll notice that “Chao Zhongxun” is listed among the administrative territories. This is clearly an error. Following the link to the Chao Zhongxun entry, it turns out that he is a person.

Wikidata page for Chao Zhongxun

I don’t know how this particular error came about, whether accidentally or through vandalism. The word “Zhongxun” does not appear in the South America Wikipedia page at all, so this problematic statement was probably edited directly into Wikidata. It has no references, of course, but unfortunately, neither do any of the other “contains administrative territorial entity” statements.

There is actually a defined constraint on this property that values must be “administrative territorial entity” items, which would normally be enough to catch this “Chao Zhongxun” issue automatically. But unlike in most databases and much software, Wikidata constraints are not automatically enforced. You can have bots run consistency checks on the data, but the philosophy is that errors are allowed to exist until people correct them. Never forget that the Wikimedia projects are all about human curation, even in the case of Wikidata, the most machine-oriented of them.

Specialized ingest from Wikidata

Now that you have a high-level overview of the Wikidata area I’ll be working with, and even a glimpse of some of the gotchas, it’s time for code. The following snippet is some Python code that gets started by loading the JSON for South America from Wikidata.

import requests #https://pypi.org/project/requests/

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'
SOUTH_AMERICA_ID = 'Q18'

def wbgetentities_q(**kwargs):
    '''
    Returns a web response object for a query to Wikidata using the wbgetentities action

    kwargs - specific parameters for the query, for example `ids`, `sites` and `titles`, etc.
                you can also specify additional parameters such as `languages`
    '''
    #Fixed parameters used for every request
    params = {'action': 'wbgetentities', 'format': 'json'}
    #Merge in parameters passed into the function
    params.update(kwargs)
    #Make the HTTP request and return the response object
    response = requests.get(WIKIDATA_API, params=params)
    return response

resp = wbgetentities_q(ids=SOUTH_AMERICA_ID)
resp_obj = resp.json()

To use this, you’ll need to install the requests library. You can run the following.

pip install requests

The resp_obj at the end is a Python object parsed from the JSON string, the same structure you would have seen in the curl output from the previous tutorial.
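
For a quick sanity check, you can pull the English label of the item straight out of this structure (assuming the usual layout of wbgetentities responses, with entities keyed by Q ID).

>>> resp_obj['entities'][SOUTH_AMERICA_ID]['labels']['en']['value']
'South America'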

Narrowing in on claims

The main thing that I want from the South America entity is a list of the “contains administrative territorial entity” claims. The function that is defined in the following snippet facilitates getting this information.

def get_claims(obj, item_id, prop_id):
    '''
    Returns a list of claims structures matching an item and property

    obj - object structure returned from wbgetentities_q
    item_id - Wikidata id of the item to extract from the object structure
    prop_id - only extract claims with this Wikidata property id
    '''
    return [ claim
                for p, val in obj['entities'][item_id]['claims'].items()
                    for claim in val
                        if p == prop_id ]

It uses the list comprehension technique to process the JSON returned from the Wikidata request. The following is an example interpreter session that uses get_claims.

>>> CONTAINS_TERRITORY_ID = 'P150'
>>> territory_claims = get_claims(resp_obj, SOUTH_AMERICA_ID, CONTAINS_TERRITORY_ID)
>>> import pprint
>>> pprint.pprint(territory_claims[0])
{'id': 'Q18$59a46ad8-4ec5-d46c-77bf-830cf9dddf3d',
 'mainsnak': {'datatype': 'wikibase-item',
              'datavalue': {'type': 'wikibase-entityid',
                            'value': {'entity-type': 'item',
                                      'id': 'Q414',
                                      'numeric-id': 414}},
              'hash': '99986895f190926ed4b41f90f7185432b7c03986',
              'property': 'P150',
              'snaktype': 'value'},
 'rank': 'normal',
 'type': 'statement'}
>>>

The pprint module helps make the JSON-derived object structure clear in the display. Here, one of the matching claims is displayed. The value is an entity of ID Q414, located within the mainsnak substructure. It’s worth explaining this cryptic name.

A “snak” in Wikidata parlance is a typed key/value pair, in other words, a data property. The main snak in a claim is the most important part of the claim being made. There can be qualifier snaks that give additional information clarifying the claim.

In the claim “Harry Potter and the Philosopher’s Stone starred Emma Watson in the role of Hermione Granger,” the main snak is “Harry Potter and the Philosopher’s Stone starred Emma Watson.” It also includes a qualifier snak for the character played, with the value “Hermione Granger.”
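
In the JSON, any qualifiers show up alongside the mainsnak within each claim, keyed by property ID. The following is a small sketch, assuming claim dictionaries shaped like the pprint output above, of how you might read them. (The territory claims in this example happen to have no qualifiers.)

def get_qualifier_values(claim, prop_id):
    '''
    Returns the datavalues of any qualifier snaks on the claim that use
    the given property ID; returns an empty list if there are none
    '''
    return [ snak.get('datavalue')
                for snak in claim.get('qualifiers', {}).get(prop_id, []) ]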

Getting the contained country IDs while filtering out the bogus ones

The following snippet shows how you can extract the list of territory IDs from the territory_claims list.

>>> [ claim['mainsnak']['datavalue']['value']['id'] for claim in territory_claims ]
['Q414', 'Q155', 'Q298', 'Q419', 'Q739', 'Q750', 'Q717', 'Q733', 'Q736', 'Q734', 'Q77', 'Q45377968']

This works, but the structures in Wikidata are intentionally not very rigid, so in general you might want to employ a bit of defensive coding, for example, to handle a claim later added without a proper data value. The following snippet uses the dictionary get method for greater safety.

>>> [ claim['mainsnak'].get('datavalue', {}).get('value', {}).get('id') for claim in territory_claims ]
['Q414', 'Q155', 'Q298', 'Q419', 'Q739', 'Q750', 'Q717', 'Q733', 'Q736', 'Q734', 'Q77', 'Q45377968']

If you want to figure out which country corresponds to each ID in this list, you would query again. Because there is a manageable number of items in this case, you can make a single query to get them all.

>>> ids = [ claim['mainsnak'].get('datavalue', {}).get('value', {}).get('id') for claim in territory_claims ]
>>> resp = wbgetentities_q(ids='|'.join(ids))
>>> territories = resp.json()

The technique '|'.join(ids) creates a pipe-separated string of the IDs, as needed by the web query. The territories variable now holds a JSON-derived structure with the information for all of the values of “contains administrative territorial entity.”
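
One caveat: wbgetentities limits how many IDs you can pass in a single request (50 for ordinary accounts), so for longer lists you would batch the calls. Here is a minimal sketch that reuses the wbgetentities_q function defined earlier.

BATCH_SIZE = 50 #wbgetentities per-request ID limit for ordinary accounts

def fetch_entities(ids):
    '''
    Fetches entity structures for a list of IDs, batching requests to stay
    under the per-request limit. Returns a dict keyed by entity ID.
    '''
    entities = {}
    for start in range(0, len(ids), BATCH_SIZE):
        batch = ids[start:start + BATCH_SIZE]
        resp = wbgetentities_q(ids='|'.join(batch))
        entities.update(resp.json()['entities'])
    return entities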

Getting and using basic Wikidata item data

You can readily extract item names and other information from the territories structure, but I’ll start with a function extracting the basic Wikidata information for a given item.

INSTANCE_OF_ID = 'P31'

def get_basic_info(obj, item_id, language='en'):
    '''
    Returns id, label, description & item types of an item.
    The returned id is always item_id, but is sent back for convenience

    obj - object structure returned from wbgetentities_q
    item_id - Wikidata id of the item to extract from the object structure
    language - language code to use for the label and description (default 'en')
    '''
    label = obj['entities'][item_id]['labels'].get(language, {}).get('value')
    description = obj['entities'][item_id]['descriptions'].get(language, {}).get('value')

    #The type of an item is actually in itself just a claim
    instance_of_claims = get_claims(obj, item_id, INSTANCE_OF_ID)
    item_types = [ claim['mainsnak'].get('datavalue', {}).get('value', {}).get('id')
                        for claim in instance_of_claims ]

    return item_id, label, description, item_types

The following command-line session demonstrates.

>>> get_basic_info(territories, 'Q414')
('Q414', 'Argentina', 'federal republic in South America', ['Q3624078', 'Q6256'])

Q3624078 is the ID of “sovereign state.” Q6256 is “country.” You can use get_basic_info not only to get the labels but also to screen out the bogus item that we know is in the list (Chao Zhongxun).

>>> SOVEREIGN_STATE_ID = 'Q3624078'
>>> territory_basics = [ get_basic_info(territories, item_id) for item_id in ids ]
>>> validated = [ (i, l, d, t) for (i, l, d, t) in territory_basics if SOVEREIGN_STATE_ID in t ]
>>> [ label for (item_id, label, desc, typ) in validated ]
['Argentina', 'Brazil', 'Chile', 'Peru', 'Colombia', 'Bolivia', 'Venezuela', 'Paraguay',
'Ecuador', 'Guyana', 'Uruguay']

Feeding your cognitive app

Now you’ve seen how much varied information Wikidata holds and have some ideas about how to extract it. The following is a full example of how you can use Wikidata to feed a cognitive application, in particular, one that determines clusters of countries based on GDP per capita and geographical area.

I start with a simple list of country Wikidata IDs, which I got by using the Wikidata query sandbox. This list is in the countries.json file, which I’ve placed on GitHub. You’ll also need to install some libraries for the cognitive features, as follows.

pip install requests scipy scikit-learn matplotlib numpy

The following code retrieves data and metadata through the Wikidata API for these countries. It uses many of the snippets I presented previously.

import time
import json

import requests

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'
GDP_PER_CAP_ID = 'P2132' #nominal GDP per capita
AREA_ID = 'P2046' #Area
INSTANCE_OF_ID = 'P31'

def wbgetentities_q(**kwargs):
    '''
    Returns a web response object for a query to Wikidata using the wbgetentities action

    kwargs - specific parameters for the query, for example `ids`, `sites` and `titles`, etc.
                you can also specify additional parameters such as `languages`
    '''
    #Fixed parameters used for every request
    params = {'action': 'wbgetentities', 'format': 'json'}
    #Merge in parameters passed into the function
    params.update(kwargs)
    #Make the HTTP request and return the response object
    response = requests.get(WIKIDATA_API, params=params)
    return response


def get_claims(obj, item_id, prop_id):
    '''
    Returns a list of claims structures matching an item and property

    obj - object structure returned from wbgetentities_q
    item_id - Wikidata id of the item to extract from the object structure
    prop_id - only extract claims with this Wikidata property id
    '''
    return [ claim
                for p, val in obj['entities'][item_id]['claims'].items()
                    for claim in val
                        if p == prop_id ]


def get_basic_info(obj, item_id, language='en'):
    '''
    Returns id, label, description & item types of an item.
    The returned id is always item_id, but is sent back for convenience

    obj - object structure returned from wbgetentities_q
    item_id - Wikidata id of the item to extract from the object structure
    language - language code to use for the label and description (default 'en')
    '''
    label = obj['entities'][item_id]['labels'].get(language, {}).get('value')
    description = obj['entities'][item_id]['descriptions'].get(language, {}).get('value')
    #The type of an item is actually in itself just a claim
    instance_of_claims = get_claims(obj, item_id, INSTANCE_OF_ID)
    item_types = [ claim['mainsnak'].get('datavalue', {}).get('value', {}).get('id')
                        for claim in instance_of_claims ]
    return item_id, label, description, item_types


countries_list_obj = json.load(open('countries.json'))

meta = []
data = []

for country_id in countries_list_obj:
    #For each country, query Wikidata for the target cognitive app data
    #as well as useful metadata
    resp = wbgetentities_q(ids=country_id)
    resp_obj = resp.json()

    #Extract nominal GDP per capita
    gdp_pc_claims = get_claims(resp_obj, country_id, GDP_PER_CAP_ID)
    if not gdp_pc_claims: continue
    gdp_pc = float(gdp_pc_claims[0]['mainsnak']['datavalue']['value']['amount'])

    #Extract area
    area_claims = get_claims(resp_obj, country_id, AREA_ID)
    if not area_claims: continue
    area = float(area_claims[0]['mainsnak']['datavalue']['value']['amount'])

    #Gather up the data
    item_id, label, desc, typ = get_basic_info(resp_obj, country_id)
    meta.append([item_id, label, desc])
    data.append([gdp_pc, area])

    #Be polite. Give a 0.5 second breather between Wikidata API requests
    time.sleep(0.5)


with open('country_data.json', 'w') as fp:
    json.dump([meta, data], fp, indent=2)

I’ve separated the demonstration into the part that retrieves data from Wikidata and the part that implements the cognitive application. At the bottom of the code, the retrieved data and metadata are written to a JSON file, country_data.json.

Classification by clustering

In the following code, the country data gets put to work.

import json

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

with open('country_data.json') as fp:
    [meta, data] = json.load(fp)

#Create numpy matrix of data
X = np.array(data)

#Try KMeans with 4 clusters
clf = KMeans(n_clusters=4)
predictions = clf.fit_predict(X)

#Plot the result (clusters are color coded from predictions)
plt.scatter(X[:, 0], X[:, 1], c=predictions)

#Write the plot to image file
out_png = 'classified.png'
plt.savefig(out_png, dpi=150)

#Print a listing of the country metadata with the predicted cluster for each
import pprint; pprint.pprint(list(zip(meta, predictions)))

This code loads the data and metadata retrieved from Wikidata. It uses the k-means clustering technique to see whether countries can be classified by their Gross Domestic Product (GDP) per capita and geographic area. Think of this as a very rough indicator of how a country converts its land resources into individual wealth for its citizens.

I guessed at a target of four clusters, and you can see the resulting graph in the following image.

K-means classifier plot for GDP per capita versus area

There do seem to be a couple of reasonable clusterings, with the yellow dot at the top (Russia) all by itself, and then a group of large nations with similar GDP per capita. Ultimately, though, I would say that overall the clustering is not well-defined. An analyst might want to try some different instruments, but this is exactly the sort of rapid-fire experimentation made possible by having such an encyclopedic data source.
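
One contributing factor is that GDP per capita and area sit on very different numeric scales, and k-means is sensitive to that. A quick experiment, sketched here as a variation on the clustering step above (not necessarily the best treatment), is to standardize the features first.

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

#Standardize each feature to zero mean and unit variance so that area does
#not dominate GDP per capita simply because of its larger numeric range.
X_scaled = StandardScaler().fit_transform(X)
predictions = KMeans(n_clusters=4).fit_predict(X_scaled)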

This code does gloss over some important details. For example, you should verify that the units for the numbers match across the board. This information is in the claims structure. You should also consider cases where there are multiple claims, especially for GDP per capita. There can be some politicization of that number.
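
As a small sketch of that unit check, each quantity value carries a unit field (a Wikidata entity URI, or "1" for dimensionless quantities), which you can pull out with the same defensive get pattern used earlier and compare across countries before aggregating.

def claim_unit(claim):
    '''
    Returns the unit of a quantity claim as a Wikidata entity URI,
    or None if the claim has no quantity value
    '''
    return claim['mainsnak'].get('datavalue', {}).get('value', {}).get('unit')

For example, you could compare claim_unit(area_claims[0]) across countries, and convert where the units differ, before feeding the numbers to the clustering step.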

Conclusion

I mentioned that I used a Wikidata query sandbox to get the initial list of country IDs. Wikidata supports aggregate queries through the SPARQL Protocol and RDF Query Language (SPARQL). SPARQL is a big topic, and I don’t cover it in this tutorial, but if you do use Wikidata extensively, it’s something that you should probably learn.

Wikidata is an enormous trove of community-curated data. It is not without its problems, but as long as you are aware of the issues, there is a lot of value you can harness in driving new applications, including cognitive applications. Start with the techniques covered in these two articles, and don’t be shy about exploring. You might have heard of or experienced the Wikipedia rabbit hole, where reading one article leads you to click through to another, and then another, and so on, until you find yourself learning all sorts of things that are hardly related to where you started. Wikidata is much the same after you get some practice using it.