Library to query the Web of Data with evolutionary algorithms
The Web of Data is growing at an amazing rate as more and more data sources are made available online in RDF, and linked. At the same time, specialised triple stores such as Virtuoso, OWLIM or 4store have matured into powerful engines that can efficiently answer queries for a given schema over static data sets of billions of curated RDF triples. However, in many cases the schema is not known, nor is the precise nature of the search query, and the query engine has to deal with imperfect data. The multitude of SPARQL end points also makes querying them in a federated way a challenge. eRDF is a novel query engine developed at the Vrije Universiteit Amsterdam and Data Archiving and Networked Services (DANS). It uses the robustness of evolutionary algorithms to answer complex SPARQL queries over many end points.
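To give a flavour of the approach, here is a toy sketch of such an evolutionary loop: candidate variable bindings are scored by how many triple pattern constraints they satisfy, the fittest half survives and is mutated. This is only an illustration of the general technique, not eRDF's actual code; the constraints, resource pool and parameters are all made up.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Toy illustration of the evolutionary idea (not eRDF's actual code):
 * candidate variable bindings are scored by the number of triple pattern
 * constraints they satisfy; the fittest half survives and is mutated.
 */
public class EvolutionarySketch {

    /** A constraint checks one candidate binding, e.g. by probing an end point. */
    interface Constraint {
        boolean isSatisfiedBy(Map<String, String> binding);
    }

    static final Random RND = new Random();

    public static void main(String[] args) {
        // Made-up constraints; eRDF derives them from the query's graph patterns.
        List<Constraint> constraints = List.of(
                b -> "http://example.org/Alice".equals(b.get("person")),
                b -> "http://example.org/Amsterdam".equals(b.get("capital")));
        // Made-up pool of candidate resources to draw new values from.
        List<String> pool = List.of("http://example.org/Alice",
                "http://example.org/Bob", "http://example.org/Amsterdam");
        List<String> vars = List.of("person", "capital");

        Comparator<Map<String, String>> byFitness = Comparator.comparingInt(
                (Map<String, String> b) -> fitness(b, constraints)).reversed();

        // Random initial population, then a few generations of select + mutate.
        List<Map<String, String>> population = new ArrayList<>();
        for (int i = 0; i < 20; i++)
            population.add(randomBinding(vars, pool));
        for (int generation = 0; generation < 50; generation++) {
            population.sort(byFitness);
            List<Map<String, String>> next =
                    new ArrayList<>(population.subList(0, population.size() / 2));
            while (next.size() < population.size())
                next.add(mutate(next.get(RND.nextInt(next.size())), vars, pool));
            population = next;
        }
        population.sort(byFitness);
        System.out.println("Best binding found: " + population.get(0));
    }

    static int fitness(Map<String, String> binding, List<Constraint> constraints) {
        int score = 0;
        for (Constraint constraint : constraints)
            if (constraint.isSatisfiedBy(binding))
                score++;
        return score;
    }

    static Map<String, String> randomBinding(List<String> vars, List<String> pool) {
        Map<String, String> binding = new HashMap<>();
        for (String var : vars)
            binding.put(var, pool.get(RND.nextInt(pool.size())));
        return binding;
    }

    static Map<String, String> mutate(Map<String, String> parent,
                                      List<String> vars, List<String> pool) {
        Map<String, String> child = new HashMap<>(parent);
        child.put(vars.get(RND.nextInt(vars.size())),
                pool.get(RND.nextInt(pool.size())));
        return child;
    }
}

In eRDF, the constraints are derived from the query itself and checked against the live end points, which is where the robustness to imperfect data comes from.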
Let's suppose you want to find some people and the capital of the country they live in:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX db: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?person ?first ?last ?home ?capital WHERE {
  ?person rdf:type foaf:Person.
  ?person foaf:firstName ?first.
  ?person foaf:family_name ?last.
  OPTIONAL {
    ?person foaf:homepage ?home.
  }
  ?person foaf:based_near ?country.
  ?country rdf:type db:Country.
  ?country db:capital ?capital.
  ?capital rdf:type db:Place.
}
ORDER BY ?first
Such a query can be answered by combining data from the Semantic Web Dog Food server and DBpedia. Other data sets may also contain lists of people, but let's focus on researchers as a start. We'll have to indicate to eRDF which end points to query; this is done with a simple CSV listing:
DBpedia;http://dbpedia.org/sparql
Semantic Web Dog Food;http://data.semanticweb.org/sparql
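Each line simply gives the name of an end point and its URL, separated by a semicolon. If you want to manipulate such listings yourself, a minimal reading loop (a hypothetical helper, not part of eRDF) looks like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/** Hypothetical helper (not part of eRDF) reading "Name;URL" listings. */
public class ListEndPoints {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("endpoints.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(";", 2); // name ; SPARQL end point URL
                if (fields.length == 2)
                    System.out.println(fields[0].trim() + " -> " + fields[1].trim());
            }
        }
    }
}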
Assuming the query is saved into a "people.sparql" file and the end points list into an "endpoints.csv" file, the query engine is called like this:
java -cp nl.erdf.datalayer-sparql-0.1-SNAPSHOT.jar nl.erdf.main.SPARQLEngine -q people.sparql -s endpoints.csv -t 5
The query is first scanned for its basic graph patterns, which are grouped and sent to the eRDF optimiser as a set of constraints to solve. eRDF then looks for solutions matching as many of these constraints as possible and pushes all the relevant triples found back into an RDF model. After some time (set with the -t parameter), eRDF is stopped and Jena is used to issue the query over the model that was just populated. The answers are then displayed, along with a list of the data sources that contributed to finding them.
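That last step can be reproduced with plain Jena. The sketch below assumes the harvested triples were saved to a hypothetical "harvested-triples.ttl" file and re-issues the "people.sparql" query over the resulting local model (package names are those of the pre-Apache Jena releases):

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.FileManager;

/** Issues a SPARQL query over a local model, as done once eRDF has populated it. */
public class QueryLocalModel {
    public static void main(String[] args) {
        // Stand-in for the model eRDF fills with the triples it discovered.
        Model model = ModelFactory.createDefaultModel();
        FileManager.get().readModel(model, "harvested-triples.ttl");

        // Read the query from disk and run it against the local model.
        Query query = QueryFactory.read("people.sparql");
        QueryExecution execution = QueryExecutionFactory.create(query, model);
        try {
            ResultSet results = execution.execSelect();
            ResultSetFormatter.out(System.out, results, query);
        } finally {
            execution.close();
        }
    }
}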
If you don't know which end points are likely to contribute to the answers, you can just query all of the WoD and see what happens... ;-) The package comes with a tool that fetches a list of SPARQL end points from CKAN, tests them and creates a configuration file. It is called like this:
java -cp nl.erdf.datalayer-sparql-0.1-SNAPSHOT.jar nl.erdf.main.GetEndPointsFromCKAN
After a few minutes, you will get a "ckan-endpoints.csv" allowing you to query the WoD from your laptop.
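The testing part boils down to a liveness check. A rough sketch of such a check (not the tool's actual code) is to send a trivial ASK query to each end point over HTTP and keep those that answer:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

/** Sketch of a SPARQL end point liveness test (hypothetical, not the tool's code). */
public class TestEndPoint {
    public static boolean isAlive(String endpoint) {
        try {
            // Trivial ASK query sent via the standard SPARQL protocol over HTTP.
            String query = URLEncoder.encode("ASK {?s ?p ?o}", "UTF-8");
            URL url = new URL(endpoint + "?query=" + query);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            connection.setRequestProperty("Accept", "application/sparql-results+xml");
            return connection.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (IOException e) {
            return false; // unreachable or misbehaving end point
        }
    }

    public static void main(String[] args) {
        System.out.println(isAlive("http://dbpedia.org/sparql"));
    }
}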
The publications related to this work are hosted, and available as PDF, on Mendeley.
eRDF is licensed under the revised BSD license and is available on GitHub.
Having trouble with eRDF? Contact christophe.gueret@dans.knaw.nl or @cgueret and we'll help you sort it out.