Hi. I'm Mark Wallace. In my role as an Ontologist and Software Architect, I am continually working with new and fun semantic technologies. Be it RDF/OWL, Triple-stores, Semantic Wikis, or Text Extraction, I am learning more all the time and want to share my experiences in hopes of helping others along with these technologies. I hope to post a new article every month or two, so check back in every so often to see what’s cooking!

Wednesday, December 9, 2015

Exploratory RDF SPARQL Queries

RDF is a schema-less technology for modeling data.  That is, no schema need be defined before you start asserting RDF data ("facts").

That's not to say it is truly schema-less in practice.  Even though you don't have to pre-declare a schema (such as a table definition in a SQL database), you generally follow some schema (a consistent set of properties and "types") as you define data.  (Without this, the resulting RDF base would be nearly unusable.)
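
For example, the following made-up Turtle triples follow a consistent implicit schema (a Person type and a worksFor property) even though nothing declares that schema anywhere:

```turtle
@prefix ex: <http://example.org/ex#> .

ex:alice a ex:Person ;
    ex:worksFor ex:Acme .
ex:bob a ex:Person ;
    ex:worksFor ex:Initech .
```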

When I find myself handed a new RDF base, I find it helpful to do some exploratory queries to find out what schema (formally declared or de facto) is being used.   If the creator of the RDF base is really nice, she will add schema information directly into the triple store, e.g. using OWL class or property declarations (a.k.a. the Ontology or Tbox).  If this is the case, I can just query for what classes and properties are defined, e.g.:

  ## Find declared classes
  prefix owl: <http://www.w3.org/2002/07/owl#>
  select distinct ?class
  where {
   ?class a owl:Class 
  }
  limit 200  # optional limit

and

  ## Find declared properties 
  prefix owl: <http://www.w3.org/2002/07/owl#>
  select distinct ?prop
  where {
   { ?prop a owl:DatatypeProperty }
   UNION
   { ?prop a owl:ObjectProperty }
  }
  limit 200  # optional limit

However, often the RDF base is not that friendly and such queries return nothing.  The next approach is to find everything used as a type or property by brute force, e.g.:

  # list types used
  select distinct ?type {
   ?s a ?type .
  }
  order by ?type

and

  # list properties used
  select distinct ?prop {
   ?s ?prop ?o .
  }
  order by ?prop

This does work.  However, it runs into problems if the number of triples is large (e.g. tens of millions or more), because it does a full scan through every triple in the store!  That puts a huge load on the store, and the query may never return (at least not before killing your store).

So what if the triple store is large, and does not contain the Tbox declarations that make it easy to find the properties and classes used?  Read on...

Large Scale Triple Store Exploration using Samples

Well, here's an approach inspired by the MongoDB Compass tool.  Compass uses a sample of an overall document database to provide insight into its schema.

The SPARQL queries below take the same approach for an RDF store:  they seek to get a feel for the full de facto schema of a large RDF data set by approximating the schema using only a small sample of the overall triples in the store.

The queries are:

  # Count type instances in a sample (for large TS)
  select distinct ?type (count (?type) as ?count)
  { ?s a ?type .
    {
      select *
      {?s ?p ?type.}
      limit 10000
    }
  } 
  group by ?type
  order by ?type

and

  # Count property instances in a sample (for large TS)
  select distinct ?prop (count (?prop) as ?count)
  { ?s ?prop ?o .
    {
      select *
      {?s ?prop ?o.}
      limit 10000
    }
  } 
  group by ?prop
  order by ?prop

These queries do a SPARQL subquery first, to get only a limited number of triples.  This limits the table scan to only a very small subset of the overall triple store.  Then they analyze only that small sample for type or property usage.

In these queries, we get only the first 10,000 triples the triple store wants to give us.  We then extract the de facto types (1st query) or properties (2nd query), with a count of how many times each is used in the sample.  (These counts could be used to approximate relative usage of each of the types / properties.)
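
A small variation on the same idea can also surface example values for each property, which helps when a property's name alone doesn't tell you much. The sketch below (untested, but using only standard SPARQL 1.1 aggregates) adds a SAMPLE to pull one representative object per property from the same kind of sample:

```sparql
  # Show one example value per property in a sample (for large TS)
  select ?prop (sample(?o) as ?example) (count(?prop) as ?count)
  { ?s ?prop ?o .
    {
      select *
      {?s ?prop ?o.}
      limit 10000
    }
  }
  group by ?prop
  order by ?prop
```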

Running the properties query against the LUBM-100 data set returns each property used, and the accompanying counts give us a quick feel for the relative level of usage of each property.

Yes, this is only an approximation of the actual schema, and there is no guarantee how much of the schema this approach will actually discover.  But in tests I ran against a fairly large triple store (LUBM-100), it did a pretty good--and very fast!--job when using 10,000 triples as the sample size.  Your ideal sample size may vary, depending on how performant your store is.

Happy exploring!

Thursday, May 31, 2012

SPARQL query from JavaScript

There are JavaScript libraries out there for SPARQL, but it's actually quite simple to query SPARQL from JavaScript without using any special library.  Here is an example of making a SPARQL query directly from a web page using JavaScript.

<html> 
  <head> 
    <title> SPARQL JavaScript </title>
    <script>
    /**
     * Author: Mark Wallace
     *
     * This function asynchronously issues a SPARQL query to a
     * SPARQL endpoint, and invokes the callback function with the JSON 
     * Format [1] results.
     *
     * Refs:
     * [1] http://www.w3.org/TR/sparql11-results-json/
     */
    function sparqlQueryJson(queryStr, endpoint, callback, isDebug) {
      // encodeURIComponent handles special characters; escape() is deprecated.
      var querypart = "query=" + encodeURIComponent(queryStr);

      // Get our HTTP request object.
      var xmlhttp = null;
      if(window.XMLHttpRequest) {
        xmlhttp = new XMLHttpRequest();
      } else if(window.ActiveXObject) {
        // Code for older versions of IE, like IE6 and before.
        xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
      } else {
        alert('Perhaps your browser does not support XMLHttpRequests?');
      }
    
      // Set up a POST with JSON result format.
      xmlhttp.open('POST', endpoint, true); // GET can have caching probs, so POST
      xmlhttp.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
      xmlhttp.setRequestHeader("Accept", "application/sparql-results+json");

      // Set up callback to get the response asynchronously.
      xmlhttp.onreadystatechange = function() {
        if(xmlhttp.readyState == 4) {
          if(xmlhttp.status == 200) {
            // Do something with the results
            if(isDebug) alert(xmlhttp.responseText);
            callback(xmlhttp.responseText);
          } else {
            // Some kind of error occurred.
            alert("Sparql query error: " + xmlhttp.status + " "
                + xmlhttp.responseText);
          }
        }
      };

      // Send the query to the endpoint.
      xmlhttp.send(querypart);

      // Done; now just wait for the callback to be called.
    }
    </script>
  </head>

  <body>
    <script>
      var endpoint = "http://dbpedia.org/sparql";
      var query = "select * {?s ?p ?o} limit 5" ;

      // Define a callback function to receive the SPARQL JSON result.
      function myCallback(str) {
        // Parse the result string as JSON (safer than eval).
        var jsonObj = JSON.parse(str);

        // Build up a table of results.
        var result = "<table border='2' cellpadding='9'>";
        for(var i = 0; i < jsonObj.results.bindings.length; i++) {
          result += "<tr><td>" + jsonObj.results.bindings[i].s.value;
          result += "</td><td>" + jsonObj.results.bindings[i].p.value;
          result += "</td><td>" + jsonObj.results.bindings[i].o.value;
          result += "</td></tr>";
        }
        result += "</table>";
        document.getElementById("results").innerHTML = result;
      }
      
     // Make the query.
     sparqlQueryJson(query, endpoint, myCallback, true);
      
    </script>

    <div id="results">
      It may take a few moments for the info to be displayed here...
      <br/><br/>
      Run me in Internet Explorer or I get Cross Domain HTTP Request errors!
    </div>
  
  </body>
</html>

 
In the head section, the code defines a function, sparqlQueryJson(), that takes a SPARQL query string, a SPARQL endpoint URL, and a function to call when the result is ready.  (The optional fourth parameter will show you the raw JSON SPARQL results in an alert window if you set it to true.)  In the body section, the code specifies the query string, endpoint, and callback function, and then calls sparqlQueryJson() to issue the request.

Put the above code in a file called sparql.htm, and give it a try in your browser!

A few things to note:
  1. Most browsers won't let you run this code because it makes a cross-domain request (calls a service on a different host than the HTML was served from).  Use IE and if/when prompted, allow the content. 
  2. I use an asynchronous XMLHttpRequest to perform the query to the SPARQL endpoint.
  3. It would be best to put the sparqlQueryJson() function in a separate file to make it reusable from multiple pages.  I put everything in one file here just to simplify the example slightly. 
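
As an aside, the JSON handling itself can be factored into a small standalone helper that is easy to test without a browser or a network connection. The sketch below is my own illustration (the function name sparqlJsonToRows and the sample data are made up); it flattens a SPARQL 1.1 JSON results object into an array of plain row objects keyed by variable name:

```javascript
// Convert a SPARQL 1.1 JSON results object into an array of plain
// row objects keyed by variable name, e.g. [{s: "...", p: "...", o: "..."}].
function sparqlJsonToRows(resultObj) {
  var vars = resultObj.head.vars;
  return resultObj.results.bindings.map(function (binding) {
    var row = {};
    for (var i = 0; i < vars.length; i++) {
      var v = vars[i];
      // A variable may be unbound in a given solution; use null for those.
      row[v] = binding[v] ? binding[v].value : null;
    }
    return row;
  });
}

// A tiny made-up result object, shaped like what a SPARQL endpoint returns.
var sample = {
  head: { vars: ["s", "p", "o"] },
  results: { bindings: [
    { s: { type: "uri", value: "http://example.org/a" },
      p: { type: "uri", value: "http://example.org/b" },
      o: { type: "literal", value: "c" } }
  ] }
};

var rows = sparqlJsonToRows(sample);
```

The myCallback function above could then loop over these rows instead of digging into the bindings structure directly.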



Friday, February 3, 2012

What Makes a Wiki Semantic?

A wiki is a web site that allows users to create and edit pages in an easy-to-format way (not HTML). They can easily create links on those pages to other pages in the wiki (and to pages on other web sites). The wiki usually keeps history of page edits, and allows rollback of pages to previous versions. New users can usually create accounts for themselves, and therefore page edits can be tracked based on user. Wikipedia is undoubtedly the most famous example of a wiki.

But what is a semantic wiki? I believe that there are four basic features that, when taken together, transform a wiki into a semantic wiki. While there can certainly be more features than just these four in a semantic wiki, I think that it is these four that must minimally be there to make a wiki semantic.

The first feature is that pages can be typed. That is, they can be marked as representing a certain "type" of thing, e.g. a book or a person or a city or an event. (Another word for type could be "category" or "class".) This type can simply be a word that has meaning to the wiki users, e.g. "Person", "City", etc. Different wiki technologies can differ on how this type is associated with a page (e.g., it can simply be another markup element that can be added to the wiki text of a page).

The second is that page links are assigned meaning. That is, hyperlinks from one page to another can be assigned more meaning than just "this is a link to a page"; the link can be assigned a "type". E.g. a link from a page about a book to a page about the author of that book might be assigned a type "authored-by", or "has-author". Again, this link type can simply be a word that has meaning to the wiki users, e.g. "authoredBy", "located-in", etc.

The third is that data values within a page can be assigned a meaning, e.g. the number 200,000 in a page about a city could be assigned the meaning of "population". Once again, "meaning", at its simplest level, is just associating a word from some vocabulary with the value. This can be thought of as an "attribute name" that goes with the value.

Finally, all of this semantic information can be used in dynamic queries that build tables (or other content) on the page, on the fly. E.g., a table listing the ten most populous cities in a particular country could be created by embedding a query into the page. This is certainly preferable to having a person keep such a summary up to date by hand, periodically reviewing the wiki for new city pages, determining whether the top ten most populous cities has changed, and editing the changes into a summary table. In contrast, a table built on a dynamic query is always accurate and instantly up to date (given that the semantics on the city pages are accurate), even as new city pages are added or population numbers change over time!

In summary, the four key elements that I believe make a wiki semantic are:
  1. The ability to type pages
  2. The ability to assign meaning to links between pages
  3. The ability to assign meaning to data values within a page, and
  4. The ability to query this knowledge to dynamically generate content
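
To make the fourth feature concrete, here is roughly what such an embedded query looks like in Semantic MediaWiki's #ask syntax (the category and property names here are hypothetical, chosen to match the city example above):

```
{{#ask: [[Category:City]] [[located-in::Germany]]
 |?population
 |sort=population
 |order=descending
 |limit=10
}}
```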

Wednesday, December 22, 2010

Custom Rules for Jena Reasoner

Here is an example of creating a custom RDFS++ reasoner using Jena 2.6.2. By RDFS++, I mean the following key RDFS constructs:

rdfs:range, rdfs:domain, rdfs:subClassOf, rdfs:subPropertyOf

plus these lightweight but useful OWL constructs:

owl:inverseOf, owl:TransitiveProperty, owl:sameAs

The code uses Jena's GenericRuleReasoner.

Here is the code to the generic inference (ginfer) program:

$ type ginfer.java

import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.reasoner.*;
import com.hp.hpl.jena.vocabulary.*;
import com.hp.hpl.jena.reasoner.rulesys.*;

/** Read RDF/XML from standard in; infer and write to standard out. */
class ginfer {
  public static void main(String args[]) {

    // Create an empty model.
    Model model = ModelFactory.createDefaultModel();

    // Read the RDF/XML on standard in.
    model.read(System.in, null);

    // Create a simple RDFS++ reasoner from a rule string.
    StringBuilder sb = new StringBuilder();
    sb.append("[rdfs2: (?x ?p ?y), (?p rdfs:domain ?c) -> (?x rdf:type ?c)] ");
    sb.append("[rdfs3: (?x ?p ?y), (?p rdfs:range ?c) -> (?y rdf:type ?c)] ");

    sb.append("[rdfs6: (?a ?p ?b), (?p rdfs:subPropertyOf ?q) -> (?a ?q ?b)] ");
    sb.append("[rdfs5: (?x rdfs:subPropertyOf ?y), (?y rdfs:subPropertyOf ?z) -> (?x rdfs:subPropertyOf ?z)] ");

    sb.append("[rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)] ");
    sb.append("[rdfs11: (?x rdfs:subClassOf ?y), (?y rdfs:subClassOf ?z) -> (?x rdfs:subClassOf ?z)] ");

    sb.append("[owlinv: (?x ?p ?y), (?p owl:inverseOf ?q) -> (?y ?q ?x)] ");
    sb.append("[owlinv2: (?p owl:inverseOf ?q) -> (?q owl:inverseOf ?p)] ");

    sb.append("[owltra: (?x ?p ?y), (?y ?p ?z), (?p rdf:type owl:TransitiveProperty) -> (?x ?p ?z)] ");

    sb.append("[owlsam: (?x ?p ?y), (?x owl:sameAs ?z) -> (?z ?p ?y)] ");
    sb.append("[owlsam2: (?x owl:sameAs ?y) -> (?y owl:sameAs ?x)] ");

    Reasoner reasoner = new GenericRuleReasoner(Rule.parseRules(sb.toString()));

    // Create inferred model using the reasoner and write it out.
    InfModel inf = ModelFactory.createInfModel(reasoner, model);
    inf.write(System.out);
  }
}


Here is some data for demonstration.

$ type data.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix demo: <http://example.com/demo#> .

demo:Person a owl:Class.
demo:hasAncestor rdfs:range demo:Person ; rdfs:domain demo:Person .
demo:parentOf rdfs:subPropertyOf demo:ancestorOf ; owl:inverseOf demo:childOf .

demo:ancestorOf owl:inverseOf demo:hasAncestor ; a owl:TransitiveProperty .
demo:Trilby demo:parentOf demo:MarkB .
demo:Mark demo:parentOf demo:Elizabeth .
demo:MarkB owl:sameAs demo:Mark .


Here we use the jena.rdfcat program to convert the before and after reasoning data to a sorted N-Triples format so we can compare the two.


$ java jena.rdfcat -out ntriples data.ttl | sort >before.nt

$ java jena.rdfcat data.ttl | java ginfer | java jena.rdfcat -out ntriples -x - | sort >after.nt

And here is the comparison. Everything shown is a triple that was not in the original data, but was inferred by executing the rules.

$ diff before.nt after.nt
2a3,9
> <http://example.com/demo#childOf> <http://www.w3.org/2002/07/owl#inverseOf> <http://example.com/demo#parentOf> .
> <http://example.com/demo#Elizabeth> <http://example.com/demo#childOf> <http://example.com/demo#Mark> .
> <http://example.com/demo#Elizabeth> <http://example.com/demo#childOf> <http://example.com/demo#MarkB> .
> <http://example.com/demo#Elizabeth> <http://example.com/demo#hasAncestor> <http://example.com/demo#Mark> .
> <http://example.com/demo#Elizabeth> <http://example.com/demo#hasAncestor> <http://example.com/demo#MarkB> .
> <http://example.com/demo#Elizabeth> <http://example.com/demo#hasAncestor> <http://example.com/demo#Trilby> .
> <http://example.com/demo#Elizabeth> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/demo#Person> .
4a12,15
> <http://example.com/demo#hasAncestor> <http://www.w3.org/2002/07/owl#inverseOf> <http://example.com/demo#ancestorOf> .
> <http://example.com/demo#Mark> <http://example.com/demo#ancestorOf> <http://example.com/demo#Elizabeth> .
> <http://example.com/demo#Mark> <http://example.com/demo#childOf> <http://example.com/demo#Trilby> .
> <http://example.com/demo#Mark> <http://example.com/demo#hasAncestor> <http://example.com/demo#Trilby> .
5a17,24
> <http://example.com/demo#Mark> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/demo#Person> .
> <http://example.com/demo#Mark> <http://www.w3.org/2002/07/owl#sameAs> <http://example.com/demo#Mark> .
> <http://example.com/demo#Mark> <http://www.w3.org/2002/07/owl#sameAs> <http://example.com/demo#MarkB> .
> <http://example.com/demo#MarkB> <http://example.com/demo#ancestorOf> <http://example.com/demo#Elizabeth> .
> <http://example.com/demo#MarkB> <http://example.com/demo#childOf> <http://example.com/demo#Trilby> .
> <http://example.com/demo#MarkB> <http://example.com/demo#hasAncestor> <http://example.com/demo#Trilby> .
> <http://example.com/demo#MarkB> <http://example.com/demo#parentOf> <http://example.com/demo#Elizabeth> .
> <http://example.com/demo#MarkB> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/demo#Person> .
6a26
> <http://example.com/demo#MarkB> <http://www.w3.org/2002/07/owl#sameAs> <http://example.com/demo#MarkB> .
9a30,33
> <http://example.com/demo#Trilby> <http://example.com/demo#ancestorOf> <http://example.com/demo#Elizabeth> .
> <http://example.com/demo#Trilby> <http://example.com/demo#ancestorOf> <http://example.com/demo#Mark> .
> <http://example.com/demo#Trilby> <http://example.com/demo#ancestorOf> <http://example.com/demo#MarkB> .
> <http://example.com/demo#Trilby> <http://example.com/demo#parentOf> <http://example.com/demo#Mark> .
10a35
> <http://example.com/demo#Trilby> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/demo#Person> .

$

Sunday, May 30, 2010

Speaking at SemTech 2010

I'll be speaking this year at SemTech 2010. The full list of presentations is available online.

My one-hour talk is Tuesday, June 22, at 5pm. It's part of the Ontology Design and Engineering track, in the Technical-Advanced category, and is entitled "Rapid Prototyping with the Jena Command Line Utilities".

It should be a fun talk that includes demonstrations of how to use the utilities for file conversion (see previous post), RDF file merging, SPARQL queries, and filtering triples. There is also some leave-behind code in the slides so you can write your own Jena inference utility.

I hope to see you there!

Monday, May 24, 2010

Using Jena to convert RDF/OWL file formats

As a way of getting started with this blog, let me show how to use the Jena Utilities to convert your RDF/OWL files from one format to another.

On the Web, most OWL and RDF content is formatted as RDF/XML. This format works great for the web, and most OWL/RDF tools support it as their native format. However, it is less human-readable than other formats such as Turtle.

For me, Turtle is the easiest form to read, and the quickest format for generating test data when I am prototyping. But not all tools support Turtle format.

But that's OK! If you want to easily convert from RDF/XML to Turtle and back, you can use the freely available Jena utilities to do this. These are command line utilities, so you run them from a command window.

First: Download Jena. The latest version at the time of this writing is version 2.6.2. Unpack it anywhere - though I usually make sure the directory path has no spaces in it. E.g., let's say you unpack it on a Windows system under C:\Programs, making the Jena top level folder the C:\Programs\Jena-2.6.2 folder.

Second: Set up your Java classpath, e.g.:


set CLASSPATH=
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\arq-extra.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\arq.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\icu4j_3_4.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\iri.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\jena.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\jenatest.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\json.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\junit-4.5.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\log4j-1.2.12.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\lucene-core-2.3.1.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\slf4j-api-1.5.6.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\slf4j-log4j12-1.5.6.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\stax-api-1.0.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\wstx-asl-3.0.0.jar
set CLASSPATH=%CLASSPATH%;C:\Programs\Jena-2.6.2\lib\xercesImpl.jar
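
As an aside, if you happen to be running Java 6 or later, you can likely collapse the whole list with a classpath wildcard (supported since Java 6; I haven't tested this with Jena myself):

```
set CLASSPATH=C:\Programs\Jena-2.6.2\lib\*
```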


Third: Convert stuff! E.g. I have the file vehicle.owl, with these contents:


<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:e="http://example.org/ex#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdfs:Class rdf:about="http://example.org/ex#Vehicle"/>
<e:Civic rdf:about="http://example.org/ex#Civic_8717383">
<e:hasVin>1H0243098430987</e:hasVin>
</e:Civic>
<rdfs:Property rdf:about="http://example.org/ex#hasVin"/>
<rdf:Description rdf:about="http://example.org/ex#HondaCar">
<rdfs:subClassOf>
<rdf:Description rdf:about="http://example.org/ex#Car">
<rdfs:subClassOf rdf:resource="http://example.org/ex#Vehicle"/>
</rdf:Description>
</rdfs:subClassOf>
</rdf:Description>
<rdf:Description rdf:about="http://example.org/ex#Civic">
<rdfs:subClassOf rdf:resource="http://example.org/ex#HondaCar"/>
</rdf:Description>
</rdf:RDF>


To convert it, I use the Jena rdfcat program, as follows:

java jena.rdfcat -out ttl vehicle.owl


And it prints this out:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix e: <http://example.org/ex#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

e:Civic_8717383
a e:Civic ;
e:hasVin "1H0243098430987" .

e:Vehicle
a rdfs:Class .

e:HondaCar
rdfs:subClassOf e:Car .

e:hasVin
a rdfs:Property .

e:Car
rdfs:subClassOf e:Vehicle .

e:Civic
rdfs:subClassOf e:HondaCar .


I can redirect this to a file for safe keeping:

java jena.rdfcat -out ttl vehicle.owl > vehicle.ttl


Now I have my Turtle file! I can view and edit it as I see fit. Then, if I need to publish back to RDF/XML format, I simply use the rdfcat utility again, but with no "-out" option:

java jena.rdfcat vehicle.ttl > vehicle-new.owl


That's it. If this was new to you, post a comment and let me know if it helped.

Wednesday, April 28, 2010

Hello, World!

Hello all you semantics-focused technology buffs out there!

I'm Mark Wallace, and I plan to use this blog to write about different Semantic Applications I am working on.

In my role as Principal Engineer, Semantic Applications at Modus Operandi, I am continually working with new and fun semantic technologies. Be it RDF/OWL, Triple-stores, Semantic Wikis, or Text Extraction, I am learning more all the time and want to share my experiences in hopes of helping others along with these technologies.

I hope to be posting a new article about every week or two, so check back every so often and see what's cooking!
