By Philip McCarthy | Published May 10, 2005
The Resource Description Framework, or RDF, enables data to be decentralized and distributed. RDF models can be merged together easily, and serialized RDF can be simply exchanged over HTTP. Applications can be loosely coupled to multiple RDF data sources over the Web. At PlanetRDF.com, for example, we syndicate weblogs from authors who provide their content as RDF in an RSS 1.0 feed. The URLs of the authors’ feeds are themselves held in an RDF graph, called bloggers.rdf.
But how can you find and manipulate the data you need within RDF graphs? The SPARQL Protocol And RDF Query Language (SPARQL) is currently under discussion as a W3C Working Draft. SPARQL builds on previous RDF query languages such as rdfDB, RDQL, and SeRQL, and has several valuable new features of its own. In this article, we’ll use the three types of RDF graph that drive PlanetRDF — FOAF graphs describing contributors, their RSS 1.0 feeds, and the bloggers graph — to demonstrate some of the interesting things SPARQL queries can do with your data. Implementations of SPARQL exist for a variety of platforms and languages; this article will focus on the Jena Semantic Web Toolkit for the Java platform.
This article assumes that you have a working knowledge of RDF, and are familiar with RDF vocabularies such as Dublin Core, FOAF, and RSS 1.0. It also assumes you have some experience with the Jena Semantic Web Toolkit. To get up to speed on all of these technologies, check out the links in the Related topics section below.
Let’s start by looking at PlanetRDF’s bloggers.rdf model. It is fairly straightforward, using the FOAF and Dublin Core vocabularies to provide a name, a blog title and URL, and an RSS feed description for each blog contributor. Figure 1 shows the basic graph structure for a single contributor. The full model simply repeats this structure for each blog that we aggregate.
Now let’s look at a very simple SPARQL query over the bloggers model. The query, in English, says, “Find the URL of the blog by the person named Jon Foobar,” and is shown in Listing 1:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
  ?contributor foaf:name "Jon Foobar" .
  ?contributor foaf:weblog ?url .
}
The first line of the query simply defines a PREFIX for the FOAF namespace, so that you don’t have to type it in full each time it is referenced. The SELECT clause specifies what the query should return — in this case, a variable named url. SPARQL variables are prefixed with either ? or $ — the two are interchangeable, but I’ll stick to ? in this article. FROM is an optional clause that provides the URI of the dataset to use. Here, it’s just pointing to a local file, but it could also indicate the URL of a graph somewhere on the Web. Finally, the WHERE clause consists of a series of triple patterns, expressed using Turtle-based syntax. These triples together comprise what is known as a graph pattern.
The query attempts to match the triples of the graph pattern to the model. Each matching binding of the graph pattern’s variables to the model’s nodes becomes a query solution, and the values of the variables named in the SELECT clause become part of the query results.
In this example, the first triple in the WHERE clause’s graph pattern matches a node with a foaf:name property of “Jon Foobar,” and binds it to the variable named contributor. In the bloggers.rdf model, contributor will match the foaf:Agent blank node at the top of Figure 1. The graph pattern’s second triple matches the object of the contributor’s foaf:weblog property. This is bound to the url variable, forming a query solution.
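To make the matching process concrete, here is a minimal Java sketch that evaluates the two triple patterns of Listing 1 against a small in-memory list of triples. It does not show how ARQ works internally; the class name TriplePatternSketch, the node labels, and the second contributor are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a naive evaluation of the graph pattern in Listing 1.
final class TriplePatternSketch {

    // Sample triples as (subject, predicate, object); the data is hypothetical.
    static final List<String[]> TRIPLES = Arrays.asList(
        new String[] {"_:b1", "foaf:name", "Jon Foobar"},
        new String[] {"_:b1", "foaf:weblog", "http://foobar.xx/blog"},
        new String[] {"_:b2", "foaf:name", "Liz Somebody"},
        new String[] {"_:b2", "foaf:weblog", "http://example.org/liz"});

    // Returns every binding of ?url for contributors with the given name.
    static List<String> findWeblogs(String name) {
        List<String> urls = new ArrayList<>();
        for (String[] first : TRIPLES) {
            // Pattern 1: ?contributor foaf:name "<name>"
            if (!first[1].equals("foaf:name") || !first[2].equals(name)) {
                continue;
            }
            String contributor = first[0]; // binding for ?contributor
            for (String[] second : TRIPLES) {
                // Pattern 2: ?contributor foaf:weblog ?url, reusing ?contributor
                if (second[0].equals(contributor) && second[1].equals("foaf:weblog")) {
                    urls.add(second[2]); // binding for ?url
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findWeblogs("Jon Foobar")); // prints [http://foobar.xx/blog]
    }
}
```

The point of the sketch is that a variable shared between two triple patterns acts as a join: only bindings that satisfy both patterns with the same contributor node survive as solutions.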
SPARQL support in Jena is currently available via a module called ARQ. In addition to implementing SPARQL, ARQ’s query engine can also parse queries expressed in RDQL or its own internal query language. ARQ is under active development, and is not yet part of the standard Jena distribution. However, it is available from either Jena’s CVS repository or as a self-contained download.
It’s simple to get ARQ up and running. Just grab the latest ARQ distribution, unpack it, and set the environment variable ARQROOT to point to the ARQ directory. You may also need to restore read and execute permissions to the contents of the ARQ bin directory. It’s convenient to add the bin directory to your execution path, as it contains wrapper scripts to invoke ARQ from the command line. To make sure that you are all set, call sparql from the command line, and make sure you see its usage message. All these steps are illustrated in Listing 2, which assumes that you are working on a UNIX-like platform, or with Cygwin under Windows. (ARQ also ships with .bat scripts to use under Windows, but their usage will vary slightly from the example shown here.)
$ export ARQROOT=~/ARQ-0.9.5
$ chmod +rx $ARQROOT/bin/*
$ export PATH=$PATH:$ARQROOT/bin
$ sparql
Usage: [--data URL] [exprString | --query file]
Now you’re ready to run a SPARQL query (see Listing 3). We’ll use the dataset and query from Listing 1. Because the query uses the FROM keyword to specify the graph to use, it is necessary only to provide the location of the query file to the sparql command. However, the query needs to be run from the directory containing the graph, because it is specified using a relative URL in the query’s FROM clause.
$ sparql --query jon-url.rq
----------------------------
| url                      |
============================
| <http://foobar.xx/blog>  |
----------------------------
Sometimes, it makes sense to omit the FROM clause from a query. This allows a graph to be passed to the query when it is executed. Avoiding binding the dataset to the query is good practice when using SPARQL from application code — it allows the same query to be reused with different graphs, for instance. A graph is specified at runtime on the command line with the sparql --data URL option, where URL is the graph’s location. This URL could either be the location of a local file or the Web address of a remote graph.
While the command-line sparql tool is useful for running standalone queries, Java applications can call on Jena’s SPARQL capabilities directly. SPARQL queries are created and executed with Jena via classes in the com.hp.hpl.jena.query package. Using QueryFactory is the simplest approach. QueryFactory has various create() methods to read a textual query from a file or from a String. These create() methods return a Query object, which encapsulates a parsed query.
The next step is to create an instance of QueryExecution, a class that represents a single execution of a query. To obtain a QueryExecution, call QueryExecutionFactory.create(query, model), passing in the Query to execute and the Model to run it against. Because the data for the query is provided programmatically, the query does not need a FROM clause.
There are several different execute methods on QueryExecution, each performing a different type of query (see the sidebar entitled “Other types of SPARQL queries” for more information). For a simple SELECT query, call execSelect(), which returns a ResultSet. The ResultSet allows you to iterate over each QuerySolution returned by the query, providing access to each bound variable’s value. Alternatively, ResultSetFormatter can be used to output query results in various formats.
Listing 4 shows a simple way to put these steps together. It executes a query against bloggers.rdf and outputs the results to the console.
// Imports needed by this fragment
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// Open the bloggers RDF graph from the filesystem
InputStream in = new FileInputStream(new File("bloggers.rdf"));

// Create an empty in-memory model and populate it from the graph
Model model = ModelFactory.createMemModelMaker().createModel();
model.read(in, null); // null base URI, since model URIs are absolute
in.close();

// Create a new query
String queryString =
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
    "SELECT ?url " +
    "WHERE {" +
    "      ?contributor foaf:name \"Jon Foobar\" . " +
    "      ?contributor foaf:weblog ?url . " +
    "      }";
Query query = QueryFactory.create(queryString);

// Execute the query and obtain results
QueryExecution qe = QueryExecutionFactory.create(query, model);
ResultSet results = qe.execSelect();

// Output query results
ResultSetFormatter.out(System.out, results, query);

// Important - free up resources used running the query
qe.close();
To further refine the results of a query, SPARQL has DISTINCT, LIMIT, OFFSET, and ORDER BY keywords that operate more or less like their SQL counterparts. DISTINCT may be used only with SELECT queries, in the form SELECT DISTINCT. This removes duplicate query solutions from the result set, leaving each remaining solution unique. The other keywords are all placed after the WHERE clause of a query. LIMIT n restricts the number of results returned from a query to n, while OFFSET n omits the first n results. ORDER BY ?var sorts the results by the natural ordering of ?var (lexically if ?var is bound to a string literal, for instance). ASC[?var] and DESC[?var] can be used to specify the direction of the sort.
Of course, it is possible to combine DISTINCT, LIMIT, OFFSET and ORDER BY in a query. For instance, ORDER BY DESC[?date] LIMIT 10 could be used to find the ten most recent entries in an RSS feed.
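As a rough analogy, the effect of these keywords can be mimicked on an ordinary Java list. The sketch below is only an illustration of the semantics, not of any SPARQL engine; the class name SolutionModifierSketch and the sample dates are made up, and the dates are plain ISO strings so that lexical order matches chronological order.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: ORDER BY DESC, DISTINCT, OFFSET, and LIMIT on a plain list.
final class SolutionModifierSketch {

    // Sorts descending, drops duplicates, skips `offset` rows, keeps `limit` rows.
    static List<String> refine(List<String> values, int offset, int limit) {
        return values.stream()
            .sorted(Comparator.reverseOrder()) // ORDER BY DESC[?date]
            .distinct()                        // SELECT DISTINCT
            .skip(offset)                      // OFFSET n
            .limit(limit)                      // LIMIT n
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> dates = Arrays.asList(
            "2005-04-02", "2005-04-10", "2005-04-02", "2005-03-28", "2005-04-07");
        System.out.println(refine(dates, 0, 2)); // prints [2005-04-10, 2005-04-07]
        System.out.println(refine(dates, 1, 2)); // prints [2005-04-07, 2005-04-02]
    }
}
```

Combining an offset with a limit in this way is the usual route to paging through a long result set, provided the ordering is stable between queries.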
So far, you’ve seen two ways to run a simple SPARQL query: using the command-line sparql utility, and using Java code with the Jena API. In this section, I’ll introduce more of SPARQL’s features, and the more complex queries they enable.
RDF is often used to represent semi-structured data. This means that two nodes of the same type in a model may have different sets of properties. For instance, a description of a person in a FOAF model may consist of only an e-mail address; alternatively, it could incorporate a real name, IRC nicknames, URLs of photos depicting the individual, and so on.
Listing 5 shows a very small FOAF graph, expressed in Turtle syntax. It contains descriptions of four fictional people, but the descriptions each have different sets of properties.
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

_:a foaf:name         "Jon Foobar" ;
    foaf:mbox         <mailto:jon@foobar.xx> ;
    foaf:depiction    <http://foobar.xx/2005/04/jon.jpg> .

_:b foaf:name         "A. N. O'Ther" ;
    foaf:mbox         <mailto:a.n.other@example.net> ;
    foaf:depiction    <http://example.net/photos/an-2005.jpg> .

_:c foaf:name         "Liz Somebody" ;
    foaf:mbox_sha1sum "3f01fa9929df769aff173f57dec2fe0c2290aeea" .

_:d foaf:name         "M Benn" ;
    foaf:depiction    <http://mbe.nn/pics/me.jpeg> .
Suppose you want to write a query that returns the name of every person described by the graph in Listing 5, along with a link to a photograph for each, if one’s available. A SELECT query whose graph pattern included foaf:depiction would only find three solutions. Liz Somebody would not form a solution, because she has a foaf:name but no foaf:depiction property, and needs both to match the query.
Help is at hand in the form of SPARQL’s OPTIONAL keyword. Optional blocks define additional graph patterns that do not cause solutions to be rejected if they are not matched, but do bind to the graph when they can be matched. Listing 6 demonstrates a query that finds the foaf:name of each person in the FOAF data in Listing 5, and then optionally finds the accompanying foaf:depiction.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?depiction
WHERE {
  ?person foaf:name ?name .
  OPTIONAL {
    ?person foaf:depiction ?depiction .
  } .
}
Listing 7 shows the results of running the query in Listing 6. While all query solutions contain the person’s name, the optional graph pattern is only bound when a foaf:depiction property exists; otherwise, it is simply omitted from the solution. In this regard, the query is similar to a left outer join in SQL.
------------------------------------------------------------
| name           | depiction                               |
============================================================
| "A. N. O'Ther" | <http://example.net/photos/an-2005.jpg> |
| "Jon Foobar"   | <http://foobar.xx/2005/04/jon.jpg>      |
| "Liz Somebody" |                                         |
| "M Benn"       | <http://mbe.nn/pics/me.jpeg>            |
------------------------------------------------------------
An optional block can contain any graph pattern, not just a single triple as shown in Listing 6. The whole query pattern in an optional block must be matched in order for the optional pattern to form part of a query solution. If a query has multiple optional blocks, they act independently of one another — each one may be omitted from, or present in, a solution. Optional blocks can also be nested, in which case the inner optional block is only considered when the outer optional block’s pattern matches the graph.
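The left-outer-join behavior of an optional block can be sketched in plain Java. The example below is an analogy rather than an account of how a SPARQL engine evaluates OPTIONAL: it assumes each person has a unique name and at most one depiction, and the class name OptionalSketch and the node labels are hypothetical. The data mirrors Listing 5.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: OPTIONAL behaves like a left outer join on ?person.
final class OptionalSketch {

    // Every name yields a solution; the depiction is null (unbound) when absent.
    static Map<String, String> join(Map<String, String> names,
                                    Map<String, String> depictions) {
        Map<String, String> solutions = new LinkedHashMap<>();
        for (Map.Entry<String, String> person : names.entrySet()) {
            solutions.put(person.getValue(), depictions.get(person.getKey()));
        }
        return solutions;
    }

    // Builds the two property maps from the data in Listing 5 and joins them.
    static Map<String, String> listing5Solutions() {
        Map<String, String> names = new LinkedHashMap<>();
        names.put("_:a", "Jon Foobar");
        names.put("_:b", "A. N. O'Ther");
        names.put("_:c", "Liz Somebody");
        names.put("_:d", "M Benn");

        Map<String, String> depictions = new LinkedHashMap<>();
        depictions.put("_:a", "http://foobar.xx/2005/04/jon.jpg");
        depictions.put("_:b", "http://example.net/photos/an-2005.jpg");
        depictions.put("_:d", "http://mbe.nn/pics/me.jpeg");

        return join(names, depictions);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> s : listing5Solutions().entrySet()) {
            String depiction = s.getValue() == null ? "(unbound)" : s.getValue();
            System.out.println(s.getKey() + " | " + depiction);
        }
    }
}
```

Liz Somebody still appears in the output, with an unbound depiction, which is the difference between an optional block and an ordinary triple pattern.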
FOAF graphs use people’s e-mail addresses to uniquely identify them. In the interests of privacy, some people prefer to use hashcodes of e-mail addresses. Plain text e-mail addresses are expressed using the foaf:mbox property, while hashcodes of e-mail addresses are expressed using the foaf:mbox_sha1sum property; the two are usually mutually exclusive in a FOAF description of a person. In situations like this, you can use SPARQL’s alternative matches feature to write queries that return whichever of the properties is available.
Alternative matches are defined by stating multiple alternative graph patterns, with the UNION keyword between them. The query shown in Listing 8 finds the name of each person in the FOAF graph of Listing 5, along with either their foaf:mbox or their foaf:mbox_sha1sum. M Benn is not a query solution, because he has neither a foaf:mbox nor a foaf:mbox_sha1sum property. In contrast with OPTIONAL graph patterns, at least one of the alternatives must match for a solution to be produced; if both branches of the UNION match, two solutions are generated.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?name ?mbox
WHERE {
  ?person foaf:name ?name .
  {
    { ?person foaf:mbox ?mbox }
    UNION
    { ?person foaf:mbox_sha1sum ?mbox }
  }
}

---------------------------------------------------------------
| name           | mbox                                       |
===============================================================
| "Jon Foobar"   | <mailto:jon@foobar.xx>                     |
| "A. N. O'Ther" | <mailto:a.n.other@example.net>             |
| "Liz Somebody" | "3f01fa9929df769aff173f57dec2fe0c2290aeea" |
---------------------------------------------------------------
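The way UNION produces solutions can be approximated with a short Java sketch: each branch is tried independently for every named person, and every match adds its own solution. This is an illustration under simplifying assumptions, not ARQ's algorithm; the class name UnionSketch and the node labels are hypothetical, and the triples are taken from Listing 5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: each UNION branch contributes its own solutions.
final class UnionSketch {

    // The triples from Listing 5 that matter here, as (subject, predicate, object).
    static List<String[]> listing5() {
        return Arrays.asList(
            new String[] {"_:a", "foaf:name", "Jon Foobar"},
            new String[] {"_:a", "foaf:mbox", "mailto:jon@foobar.xx"},
            new String[] {"_:b", "foaf:name", "A. N. O'Ther"},
            new String[] {"_:b", "foaf:mbox", "mailto:a.n.other@example.net"},
            new String[] {"_:c", "foaf:name", "Liz Somebody"},
            new String[] {"_:c", "foaf:mbox_sha1sum",
                          "3f01fa9929df769aff173f57dec2fe0c2290aeea"},
            new String[] {"_:d", "foaf:name", "M Benn"});
    }

    // Returns "name | mbox" rows; a person matching both branches appears twice.
    static List<String> nameAndMbox(List<String[]> triples) {
        List<String> solutions = new ArrayList<>();
        for (String[] nameTriple : triples) {
            if (!nameTriple[1].equals("foaf:name")) {
                continue;
            }
            // Try each alternative of the UNION in turn for this person.
            for (String branch : Arrays.asList("foaf:mbox", "foaf:mbox_sha1sum")) {
                for (String[] t : triples) {
                    if (t[0].equals(nameTriple[0]) && t[1].equals(branch)) {
                        solutions.add(nameTriple[2] + " | " + t[2]);
                    }
                }
            }
        }
        return solutions;
    }

    public static void main(String[] args) {
        for (String row : nameAndMbox(listing5())) {
            System.out.println(row);
        }
    }
}
```

M Benn produces no row because neither branch matches, while a person carrying both properties would produce two rows.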
The FILTER keyword in SPARQL restricts the results of a query by imposing constraints on values of bound variables. These value constraints are logical expressions that evaluate to boolean values, and may be combined with logical && and || operators. For instance, a query that returns a list of names could be modified with a filter to return only names that match a given regular expression. Or, as shown in Listing 9, items in an RSS feed published between two dates can be found with a filter that places bounds on items’ publication dates. Listing 9 also shows how SPARQL’s XPath-style casting feature is used (here to cast the date variable into an XML Schema dateTime value), and how you can specify the same datatype on literal date strings with ^^xsd:dateTime. This ensures that the date comparison is used in the query, rather than standard string comparison.
PREFIX rss: <http://purl.org/rss/1.0/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?item_title ?pub_date
WHERE {
  ?item rss:title ?item_title .
  ?item dc:date ?pub_date .
  FILTER xsd:dateTime(?pub_date) >= "2005-04-01T00:00:00Z"^^xsd:dateTime &&
         xsd:dateTime(?pub_date) < "2005-05-01T00:00:00Z"^^xsd:dateTime
}
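To see why the cast to a typed dateTime matters, consider the Java sketch below, which applies the same date range as Listing 9 in two ways: once on parsed timestamps and once on raw strings. It uses java.time rather than any SPARQL machinery, and the class name DateFilterSketch and the sample publication dates are assumptions for illustration.

```java
import java.time.Instant;
import java.time.OffsetDateTime;

// Illustrative only: typed date comparison versus plain string comparison.
final class DateFilterSketch {

    static final String START_TEXT = "2005-04-01T00:00:00Z";
    static final String END_TEXT = "2005-05-01T00:00:00Z";
    static final Instant START = OffsetDateTime.parse(START_TEXT).toInstant();
    static final Instant END = OffsetDateTime.parse(END_TEXT).toInstant();

    // Compares points in time, so time zone offsets are taken into account.
    static boolean inRangeTyped(String pubDate) {
        Instant t = OffsetDateTime.parse(pubDate).toInstant();
        return !t.isBefore(START) && t.isBefore(END);
    }

    // Compares characters only, which can give the wrong answer.
    static boolean inRangeLexical(String pubDate) {
        return pubDate.compareTo(START_TEXT) >= 0 && pubDate.compareTo(END_TEXT) < 0;
    }

    public static void main(String[] args) {
        // 23:00 at UTC-5 on April 30 is 04:00 UTC on May 1, outside the range.
        String late = "2005-04-30T23:00:00-05:00";
        System.out.println(inRangeTyped(late));   // prints false
        System.out.println(inRangeLexical(late)); // prints true
    }
}
```

The two methods agree for dates written in UTC, but they diverge as soon as a feed publishes dates with another offset, which is the situation the typed comparison guards against.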
So far, all of the queries I’ve demonstrated have involved datasets consisting of a single RDF graph. In SPARQL terminology, these queries have run against the background graph. The background graph is what is specified by a query’s FROM clause, by the sparql command’s --data switch, or by passing a model to QueryExecutionFactory.create() when using Jena’s API.
In addition to the SELECT queries used in this article, SPARQL supports three other query types. ASK simply returns “yes” if the query’s graph pattern has any matches in the dataset, and “no” if it does not. DESCRIBE returns a graph containing information related to the nodes matched in the graph pattern. For instance, DESCRIBE ?person WHERE { ?person foaf:name "Jon Foobar" } would return a graph containing triples from the model about Jon Foobar. Finally, CONSTRUCT is used to output a graph pattern for each query solution. This allows a new RDF graph to be created directly from the results of the query. You can think of a CONSTRUCT query on an RDF graph as somewhat analogous to an XSL transformation of XML data.
In addition to the background graph, SPARQL can query any number of named graphs. These additional graphs are identified by their URIs, and are each distinct within a query. Before examining the ways that named graphs can be used, I’ll explain how to provide them to a query. As with the background graph, named graphs can be specified within the query itself, using FROM NAMED <URI>, where URI specifies the graph. Alternatively, named graphs can be provided to the sparql command with --named URL, with the URL giving the location of the graph. Finally, Jena’s DataSetFactory class can be used to specify named graphs to be queried programmatically.
Named graphs are used within a SPARQL query with the GRAPH keyword, in conjunction with either a graph URI or a variable name. This keyword is followed by a graph pattern to match against the graph in question.
When the GRAPH keyword is used with a graph’s URI (or a variable already bound to a graph’s URI), the graph pattern is applied to whichever graph is identified by that URI. If matches are found in the specified graph, they form part of a query solution. In Listing 10, two named FOAF graphs are passed to the query. Query solutions are those people’s names that are found in both of the graphs. Note that the nodes representing people in each of the two FOAF graphs are blank nodes, and have scope only within the graph that contains them. This means that the same person node cannot exist in both named graphs in the query, and so different variables (x and y) must be used to represent them.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?name
FROM NAMED <jon-foaf.rdf>
FROM NAMED <liz-foaf.rdf>
WHERE {
  GRAPH <jon-foaf.rdf> {
    ?x rdf:type foaf:Person .
    ?x foaf:name ?name .
  } .
  GRAPH <liz-foaf.rdf> {
    ?y rdf:type foaf:Person .
    ?y foaf:name ?name .
  } .
}
The other way to use GRAPH is to follow it with an unbound variable. In such a case, the graph pattern is applied to each of the named graphs available to the query. If the pattern matches against one of the named graphs, then that graph’s URI is bound to the GRAPH‘s variable. Listing 11’s GRAPH clause matches on each person node in the named FOAF graphs given to the query. The matched person’s name is bound to the name variable, and the graph_uri variable binds to the URI of the graph that matched the pattern. The results of the query are also shown. One name, A. N. O’Ther, is matched twice, because that person is described in both jon-foaf.rdf and liz-foaf.rdf.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?name ?graph_uri
FROM NAMED <jon-foaf.rdf>
FROM NAMED <liz-foaf.rdf>
WHERE {
  GRAPH ?graph_uri {
    ?x rdf:type foaf:Person .
    ?x foaf:name ?name .
  }
}

----------------------------------------------
| name           | graph_uri                 |
==============================================
| "Liz Somebody" | <file://.../jon-foaf.rdf> |
| "A. N. O'Ther" | <file://.../jon-foaf.rdf> |
| "Jon Foobar"   | <file://.../liz-foaf.rdf> |
| "A. N. O'Ther" | <file://.../liz-foaf.rdf> |
----------------------------------------------
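The two uses of GRAPH can be pictured with ordinary Java collections, treating each named graph as a list of the person names it contains. The sketch below is a simplification for illustration only: the class name NamedGraphSketch is hypothetical, and the contents of the two graphs are inferred from the results shown in Listing 11.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: named graphs modeled as a map from graph URI to names.
final class NamedGraphSketch {

    // Graph contents inferred from the results in Listing 11.
    static Map<String, List<String>> sampleGraphs() {
        Map<String, List<String>> graphs = new LinkedHashMap<>();
        graphs.put("jon-foaf.rdf", Arrays.asList("Liz Somebody", "A. N. O'Ther"));
        graphs.put("liz-foaf.rdf", Arrays.asList("Jon Foobar", "A. N. O'Ther"));
        return graphs;
    }

    // GRAPH ?graph_uri { ... }: try the pattern on every named graph in turn,
    // binding ?graph_uri to the graph that produced each match.
    static List<String> namesPerGraph(Map<String, List<String>> graphs) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> graph : graphs.entrySet()) {
            for (String name : graph.getValue()) {
                rows.add(name + " | " + graph.getKey());
            }
        }
        return rows;
    }

    // GRAPH <a> { ... } GRAPH <b> { ... } sharing ?name: keep names found in both.
    static Set<String> namesInBoth(List<String> first, List<String> second) {
        Set<String> shared = new LinkedHashSet<>(first);
        shared.retainAll(second);
        return shared;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graphs = sampleGraphs();
        for (String row : namesPerGraph(graphs)) {
            System.out.println(row);
        }
        System.out.println(namesInBoth(graphs.get("jon-foaf.rdf"),
                                       graphs.get("liz-foaf.rdf")));
    }
}
```

With a variable, every graph is visited and the name A. N. O'Ther is reported once per graph; with two fixed URIs and a shared name variable, only that one name survives.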
SPARQL allows query results to be returned as XML, in a simple format known as the SPARQL Variable Binding Results XML Format. This schema-defined format acts as a bridge between RDF queries and XML tools and libraries.
There are a number of potential uses for this capability. You could transform the results of a SPARQL query into a Web page or RSS feed via XSLT, access the results via XPath, or return the result document to a SOAP or AJAX client. To output query results as XML, use the ResultSetFormatter.outputAsXML() method, or specify --results rs/xml on the command line.
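As one example of consuming XML results, the following Java sketch pulls URI bindings out of a result document with XPath, using only the JDK's built-in XML APIs. The sample document is an assumption: it follows the layout of the SPARQL Query Results XML Format as later standardized by the W3C, and output from a 2005-era ARQ release may differ in namespace or element names. The class name XmlResultsSketch is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: reading URI bindings from an XML result document.
final class XmlResultsSketch {

    // Assumed sample output for the query in Listing 1.
    static final String SAMPLE =
        "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">"
        + "<head><variable name=\"url\"/></head>"
        + "<results><result><binding name=\"url\">"
        + "<uri>http://foobar.xx/blog</uri>"
        + "</binding></result></results></sparql>";

    // Returns the text of every <uri> bound to the given variable name.
    static List<String> uriBindings(String xml, String variable) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            // local-name() sidesteps namespace prefixes in the expression.
            XPath xpath = XPathFactory.newInstance().newXPath();
            String expr = "//*[local-name()='binding'][@name='" + variable
                + "']/*[local-name()='uri']";
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read result document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(uriBindings(SAMPLE, "url"));
    }
}
```

Matching on local names keeps the XPath expression short; a stricter version would register the results namespace with the XPath object and match qualified names instead.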
Queries may also use background data in conjunction with named graphs. Listing 12 uses the live aggregated RSS feed from PlanetRDF.com as background data, and queries it in conjunction with a named graph containing my own FOAF profile. The idea is to create a personalized feed by finding the most recent ten posts by bloggers that I know. The first part of the query finds the node that represents me in the FOAF file, and then finds the names of the people that the graph says I know. The second part looks for items in the RSS feed that are created by those people. Finally, the result set is sorted by the items’ creation date, and limited to return only ten results.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rss: <http://purl.org/rss/1.0/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title ?known_name ?link
FROM <http://planetrdf.com/index.rdf>
FROM NAMED <phil-foaf.rdf>
WHERE {
  GRAPH <phil-foaf.rdf> {
    ?me foaf:name "Phil McCarthy" .
    ?me foaf:knows ?known_person .
    ?known_person foaf:name ?known_name .
  } .
  ?item dc:creator ?known_name .
  ?item rss:title ?title .
  ?item rss:link ?link .
  ?item dc:date ?date .
}
ORDER BY DESC[?date] LIMIT 10
While this query simply returns a list of titles, names, and URLs, a more complex query could extract all of the data present in the matched items. Using SPARQL’s XML results format (see the sidebar titled “Query results in XML format”) in conjunction with an XSL stylesheet, it would be possible to create a personalized HTML view of the blog items, or even produce another RSS feed.
The examples in this article should help you to understand the fundamental features and syntax of SPARQL, along with the benefits it brings to RDF applications. You’ve seen how SPARQL allows you to approach the semi-structured nature of real-world RDF graphs, with the aid of optional and alternative matches. Examples using named graphs have shown you how combining multiple graphs in SPARQL opens up your querying options. You’ve also seen how simple it is to run SPARQL queries from Java code using Jena’s API.
There’s a lot more to SPARQL than it is possible to cover here, so do use the Resources to find out more about SPARQL’s features. Look at the SPARQL spec to learn in detail about SPARQL’s built-in functions, operators, query forms, and syntax, or to see many more example SPARQL queries.
Of course, the best way to learn about SPARQL is by writing some queries of your own. Just grab some RDF data from the Web, download the Jena ARQ module, and start experimenting!