What is Link Traversal Query Processing?
-
Traditional query execution: index-based
For example: SPARQL query execution over the DBpedia SPARQL endpoint.
➕ Once index is built, queries are mostly fast.
➖ Combining with data outside of the index is harder.
➖ Index can become stale.
-
Link traversal query processing (LTQP): explorative
Based on the follow-your-nose principle of Linked Data.
➕ Implicit combinations of different sources.
➕ Query execution always happens over live data.
➖ Querying is mostly slower.
➖ Harder to achieve (and define) completeness.
Finding the names of my friends
SELECT ?name WHERE {
<https://www.rubensworks.net/#me> foaf:knows ?person.
?person foaf:name ?name.
}
- Determine seed URLs:
https://www.rubensworks.net/#me
if query-driven
- Dereference
https://www.rubensworks.net/#me
- Dereference all discovered URLs in
https://www.rubensworks.net/#me
- Repeat dereference process recursively...
- Execute query over all discovered RDF triples
Live example
🕰️ Brief History of LTQP
Pioneered by Olaf Hartig
-
Executing SPARQL queries over the web of linked data. (ISWC 2009)
Introduction of the LTQP concept
A non-blocking iterator-based pipeline for querying
-
Zero-Knowledge Query Planning for an Iterator Implementation of Link Traversal Based Query Execution. (ESWC 2011)
Formalization of LTQP
Query plan selection rules
🕰️ Brief History of LTQP (2)
-
Foundations of Traversal Based Query Execution over Linked Data. (HT 2011)
More formalizations
Reachability semantics: what links should be followed
-
LDQL: A query language for the Web of Linked Data. (JWS 2016)
Query language with ability to express navigation paths
-
Walking without a Map: Optimizing Response Times of Traversal-Based Linked Data Queries. (ISWC 2016)
Prioritize certain URLs to achieve most relevant results faster
🧐 Open Problems with LTQP
-
Slow query execution compared to index-based
Mainly caused by number of lookups
-
No guarantee of termination
Web is unbounded
-
No guarantee of completeness
No links may exist between source and target
🤝 Content Policies Reduce Search Space
-
Reachability semantics: all triples from followed links are considered applicable
Carol defines an incorrect name for Bob.
-
Content policies: user-defined "trust" preferences
Different people trust different sources
Alice queries over her contact information, but only wants to obtain friend-defined names.
-
Hypothesis: Reduction of search space → faster querying
🌲 Exploiting Linking Structure
-
Document linking structure can be declaratively exposed
Footprints, Shapetrees, TREE, ...
To enable auto-discovery, should be fully self-descriptive
-
Query execution can prune to only follow links that would give results
Alice stores her pictures by year, so querying for pictures in May 2020 only requires looking into one directory.
-
Hypothesis: More fine-grained linking structures will increase performance
🧬 Hybrid Query Execution
-
LTQP optimizations will only go so far
May never be sufficient for performance-critical cases
-
A hybrid of link-traversal-based and index-based query processing
Combines advantages of both approaches
-
Hypothesis: At the cost of preprocessing and partially stale data, LTQP can be made faster
🏋 How to evaluate and test all of this?
-
The Web itself is not suited
Too large and unpredictable to execute reproducible experiments
-
Creating a (closed) simulated Web of data
Based on a social network use case
Different fragmentation strategies for testing different LTQP approaches