Introduction to Link Traversal Query Processing

Ruben Taelman

Ghent University – imec – IDLab, Belgium

What is Link Traversal Query Processing?

Traditional query execution: index-based
For example: SPARQL query execution over the DBpedia SPARQL endpoint.
➕ Once index is built, queries are mostly fast.
➖ Combining with data outside of the index is harder.
➖ Index can become stale.
Link traversal query processing (LTQP): explorative
Based on the follow-your-nose principle of Linked Data.
➕ Implicit combinations of different sources.
➕ Query execution always happens over live data.
➖ Querying is mostly slower.
➖ Harder to achieve (and define) completeness.

Finding the names of my friends

SELECT ?name WHERE {
    <https://www.rubensworks.net/#me> foaf:knows ?person.
    ?person foaf:name ?name.
}

Determine seed URLs: https://www.rubensworks.net/#me if query-driven
Dereference https://www.rubensworks.net/#me
Dereference all discovered URLs in https://www.rubensworks.net/#me
Repeat dereference process recursively...
Execute query over all discovered RDF triples

Live example

Existing approaches
Next steps

Existing approaches
Next steps

🕰️ Brief History of LTQP

Pioneered by Olaf Hartig

Executing SPARQL queries over the web of linked data. (ISWC 2009)
Introduction of the LTQP concept
A non-blocking iterator-based pipeline for querying
Zero-Knowledge Query Planning for an Iterator Implementation of Link Traversal Based Query Execution. (ESWC 2011)
Formalization of LTQP
Query plan selection rules

🕰️ Brief History of LTQP (2)

Foundations of Traversal Based Query Execution over Linked Data. (HT 2011)
More formalizations
Reachability semantics: what links should be followed
LDQL: A query language for the Web of Linked Data. (JWS 2016)
Query language with ability to express navigation paths
Walking without a Map: Optimizing Response Times of Traversal-Based Linked Data Queries. (ISWC 2016)
Prioritize certain URLs to achieve most relevant results faster

🧐 Open Problems with LTQP

Slow query execution compared to index-based
Mainly caused by number of lookups
No guarantee of termination
Web is unbounded
No guarantee of completeness
No links may exist between source and target

Existing approaches
Next steps

🤝 Content Policies Reduce Search Space

Reachability semantics: all triples from followed links are considered applicable
Carol defines an incorrect name for Bob.
Content policies: user-defined "trust" preferences
Different people trust different sources
Alice queries over her contact information, but only wants to obtain friend-defined names.
Hypothesis: Reduction of search space → faster querying

🌲 Exploiting Linking Structure

Document linking structure can be declaratively exposed
Footprints, Shapetrees, TREE, ...
To enable auto-discovery, should be fully self-descriptive
Query execution can prune to only follow links that would give results
Alice stores her pictures by year, so querying for pictures in May 2020 only requires looking into one directory.
Hypothesis: More fine-grained linking structures will increase performance

🧬 Hybrid Query Execution

LTQP optimizations will only go so far
May never be sufficient for performance-critical cases
A hybrid of link-traversal-based and index-based query processing
Combines advantages of both approaches
Hypothesis: At the cost of preprocessing and partially stale data, LTQP can be made faster

🏋 How to evaluate and test all of this?

The Web itself is not suited
Too large and unpredictable to execute reproducible experiments
Creating a (closed) simulated Web of data
Based on a social network use case
Different fragmentation strategies for testing different LTQP approaches

👨‍💻 Current status

Implementation of existing LTQP strategies
✅ Separate modules in Comunica
Creation of an LTQP Benchmark
Based on the well-established LDBC Social Network Benchmark
⏳ Additional layer that applies fragmentations: WIP
Incorporating Content policies
Experimental implementation in Comunica
⏳Performance tests: WIP

Introduction to Link Traversal Query Processing

What is Link Traversal Query Processing?

Traditional query execution: index-based

Link traversal query processing (LTQP): explorative

Finding the names of my friends

🕰️ Brief History of LTQP

Executing SPARQL queries over the web of linked data. (ISWC 2009)

Zero-Knowledge Query Planning for an Iterator Implementation of Link Traversal Based Query Execution. (ESWC 2011)

🕰️ Brief History of LTQP (2)

Foundations of Traversal Based Query Execution over Linked Data. (HT 2011)

LDQL: A query language for the Web of Linked Data. (JWS 2016)

Walking without a Map: Optimizing Response Times of Traversal-Based Linked Data Queries. (ISWC 2016)

🧐 Open Problems with LTQP

Slow query execution compared to index-based

No guarantee of termination

No guarantee of completeness

🤝 Content Policies Reduce Search Space

Reachability semantics: all triples from followed links are considered applicable

Content policies: user-defined "trust" preferences

Hypothesis: Reduction of search space → faster querying

🌲 Exploiting Linking Structure

Document linking structure can be declaratively exposed

Query execution can prune to only follow links that would give results

Hypothesis: More fine-grained linking structures will increase performance

🧬 Hybrid Query Execution

LTQP optimizations will only go so far

A hybrid of link-traversal-based and index-based query processing

Hypothesis: At the cost of preprocessing and partially stale data, LTQP can be made faster

🏋 How to evaluate and test all of this?

The Web itself is not suited

Creating a (closed) simulated Web of data

👨‍💻 Current status

Implementation of existing LTQP strategies

Creation of an LTQP Benchmark

Incorporating Content policies