Hi. I'm Mark Wallace. In my role as an Ontologist and Software Architect, I am continually working with new and fun semantic technologies. Be it RDF/OWL, Triple-stores, Semantic Wikis, or Text Extraction, I am learning more all the time and want to share my experiences in hopes of helping others along with these technologies. I hope to post a new article every month or two, so check back in every so often to see what’s cooking!

Wednesday, December 9, 2015

Exploratory RDF SPARQL Queries

RDF is a schema-less technology for modeling data.  That is, no schema need be defined before you start asserting RDF data ("facts").

Not that RDF data is ever really schema-less in practice.  That is, even though you don't have to pre-declare a schema (such as a table definition in a SQL database), you generally follow some schema (a set of properties and "types") as you assert data.  (Without one, the resulting RDF base might be nearly unusable.)
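For example (a minimal sketch, using a made-up ex: namespace), you might assert facts like these with SPARQL Update.  No schema is declared anywhere, yet a de facto schema emerges: Person instances that have a name and work for an organization.

  ## Assert some facts (SPARQL Update) -- no schema declared,
  ## but a de facto one is being followed
  prefix ex: <http://example.org/>
  insert data {
    ex:alice a ex:Person ;
             ex:name "Alice" ;
             ex:worksFor ex:acme .
    ex:bob   a ex:Person ;
             ex:name "Bob" ;
             ex:worksFor ex:acme .
  }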

When I find myself handed a new RDF base, I find it helpful to do some exploratory queries to find out what schema (formally declared or de facto) is being used.  If the creator of the RDF base is really nice, she will have added schema information directly into the triple store, e.g. using OWL class or property declarations (a.k.a. the ontology or TBox).  If so, I can just query for which classes and properties are declared, e.g.:

  ## Find declared classes
  prefix owl: <http://www.w3.org/2002/07/owl#>
  select distinct ?class
  where {
   ?class a owl:Class 
  }
  limit 200  # optional limit

and

  ## Find declared properties 
  prefix owl: <http://www.w3.org/2002/07/owl#>
  select distinct ?prop
  where {
   { ?prop a owl:DatatypeProperty }
   UNION
   { ?prop a owl:ObjectProperty }
  }
  limit 200  # optional limit
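If the ontology happens to be written with plain RDFS vocabulary rather than OWL, the same idea applies; here is a sketch of the analogous query:

  ## Find RDFS-style declarations (if the ontology
  ## uses RDFS vocabulary rather than OWL)
  prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  select distinct ?term
  where {
   { ?term a rdfs:Class }
   UNION
   { ?term a rdf:Property }
  }
  limit 200  # optional limit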

However, often the RDF base is not that friendly and such queries return nothing.  The next approach is to find everything used as a type or property by brute force, e.g.:

  # list types used
  select distinct ?type {
   ?s a ?type .
  }
  order by ?type

and

  # list properties used
  select distinct ?prop {
   ?s ?prop ?o .
  }
  order by ?prop

This does work.  However, it runs into problems if the number of triples is large (e.g. tens of millions or more), because it forces a full scan through every triple in the store!  Not good for large stores--it puts a huge load on the store, and the query may never return (at least not before killing your store).
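By the way, one quick sanity check before attempting such a scan is to ask the store how many triples it holds.  Many stores can answer this count straight from their indexes, though fair warning: on some engines even this amounts to a full scan.

  # How many triples are we dealing with?
  select (count(*) as ?numTriples)
  where {
   ?s ?p ?o .
  }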

So what if the triple store is large, and does not contain the TBox declarations that make it easy to find the properties and classes used?  Read on...

Large Scale Triple Store Exploration using Samples

Well, here's an approach inspired by the MongoDB Compass tool.  Compass examines just a sample of a document database to provide insight into its schema.

The SPARQL queries below take the same approach for an RDF store: they approximate the de facto schema of a large RDF data set by examining only a small sample of the overall triples in the store.

The queries are:

  # Count type instances in a sample (for large TS)
  select ?type (count(?type) as ?count)
  { ?s a ?type .
    {
      select *
      {?s ?p ?type.}
      limit 10000
    }
  } 
  group by ?type
  order by ?type

and

  # Count property instances in a sample (for large TS)
  select ?prop (count(?prop) as ?count)
  { ?s ?prop ?o .
    {
      select *
      {?s ?prop ?o.}
      limit 10000
    }
  } 
  group by ?prop
  order by ?prop

These queries run a SPARQL subquery first, to grab only a limited number of triples.  This confines the table scan to a very small subset of the overall triple store.  The outer query then analyzes just that small sample for type or property usage.

In these queries, we get only the first 10,000 triples the triple store wants to give us.  We then extract the de facto types (1st query) or properties (2nd query), with a count of how many times each is used in the sample.  (These counts could be used to approximate relative usage of each of the types / properties.)
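If you want that relative usage made explicit, SPARQL can do the arithmetic for you.  Here is a sketch for the property query; note that the 10000 divisor must match the LIMIT used in the subquery:

  # Approximate each property's share of the sample
  select ?prop (count(?prop) as ?count)
         (count(?prop) * 100.0 / 10000 as ?pct)  # divisor = sample size
  { ?s ?prop ?o .
    {
      select *
      {?s ?prop ?o.}
      limit 10000
    }
  }
  group by ?prop
  order by desc(?count)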

For example, running the property query against the LUBM-100 data set returns each property in use; the counts in the second column give a quick feel for the relative level of usage of each property.
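From there, the natural next step is to drill into whatever catches your eye.  For example, to peek at a few actual uses of one discovered property (this sketch assumes the usual LUBM univ-bench namespace and its takesCourse property; substitute whatever property shows up in your own results):

  # Sample a few uses of one discovered property
  prefix ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
  select ?s ?o
  where {
   ?s ub:takesCourse ?o .  # example property from LUBM
  }
  limit 10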

Yes, this is only an approximation of the actual schema, and there is no guarantee how much of the schema this approach will actually discover.  But in tests I ran against a fairly large triple store (LUBM-100), it did a pretty good--and very fast!--job when using 10,000 triples as the sample size.  Your ideal sample size may vary, depending on how performant your store is.

Happy exploring!
