Link Traversal-based Query Processing

Link Traversal-based
Query Processing

Query execution by following links between documents

Ruben Taelman

Ghent University – imec – IDLab, Belgium

SPARQL processing over centralized data

Dataset is collocated with query engine
All data is known beforehand
Single dataset
Combining multiple datasets is hard

How to query over decentralized data?

Data and query engine are not collocated
Query engine runs on a separate machine
Not just one datasets
Data is spread over the Web into multiple documents

Exploit interlinking of documents

Linked Data documents are linked to each other
Following the Linked Data principles
Query engine can follow links
Start from one document, and discover new documents on the fly

Link Traversal-based Query Processing

= Querying by following links between documents

Query process
Optimizations
Open problems

Query process
Optimizations
Open problems

Example: decentralized address book

Example: Find Alice's contact names

SELECT ?name WHERE {
    <https://alice.pods.org/profile#me>
        foaf:knows ?person.
    ?person foaf:name ?name.
}

Query process:

Start from Alice's profile
Follow links to profiles of Bob and Carol
Query over union of all profiles
Find query results: [ { "name": "Bob" }, { "name": "Carol" } ]

The Link Traversal Query Process

Link queue: stores links to documents that need to be fetched

1. Initialize link queue with seed URLs
Seed URLs can be user-defined, or derived from query
2. Iterate and append link queue
Iteratively pop head off queue, and follow link to document

Add all URLs in document to link queue
3. Execute query
Over union of all discovered RDF triples

Example: Evolution of the Link Queue (1)

Initialize link queue with seed URL(s)

https://alice.pods.org/profile

Example: Evolution of the Link Queue (2)

Follow link to head of queue

https://alice.pods.org/profile

Example: Evolution of the Link Queue (3)

Push URLs in `https://alice.pods.org/profile` to queue

https://alice.pods.org/profile
https://bob.pods.org/profile
https://carol.org/

</profile#me> :knows
  <https://bob.pods.org/profile#me>,
  <https://carol.org/#i>.

No need to re-add link to https://alice.pods.org/profile

Example: Evolution of the Link Queue (4)

Remove handled link from queue

https://bob.pods.org/profile

https://carol.org/

Example: Evolution of the Link Queue (5)

Follow link to head of queue

https://bob.pods.org/profile
https://carol.org/

Example: Evolution of the Link Queue (6)

No more URLs found in `https://bob.pods.org/profile`

https://bob.pods.org/profile
https://carol.org/

</profile#me> :name "Bob";
  :email <mailto:bob@mail.org>;
  :telephone "0499 12 34 56".

mailto:bob@mail.org is not an HTTP(S) link that can be followed

Example: Evolution of the Link Queue (7)

Remove handled link from queue

https://carol.org/

Example: Evolution of the Link Queue (8)

Follow link to head of queue

https://carol.org/

Example: Evolution of the Link Queue (9)

No more URLs found in `https://carol.org/`

https://carol.org/

</#i> :name "Carol";
  :email <mailto:i@carol.org>;
  :telephone "0499 11 22 33".

Example: Evolution of the Link Queue (10)

Remove handled link from queue

Link queue is empty

Execute query over union of all discovered RDF triples

Query process
Optimizations
Open problems

Link Traversal has practical problems

But solutions exist!

Link queue rarely becomes empty
It grows in size due to massive size of the Web

Query process never has a chance to start!
Too many links are followed
Each document typically contains many links → exponential growth
Existing query planning algorithms don't work
There is no a priori knowledge about cardinalities and other statistics

Process query during link queue handling

→ Cope with infinitely growing link queues

Extension of pipelining approach
Link queue is a component in pipeline

May produce an infinite data stream

Pre-filter links before they enter queue

→ Cope with large number of links

Different filtering strategies exist

Follow all
No filtering, follow all possible links
Follow none
Follow no links
Follow if triple matches query
Follow links in a triple if they match at least one triple pattern in the query

Follow if triple matches query: example

SELECT ?name WHERE {
    <https://alice.pods.org/profile#me> foaf:knows ?person.
    ?person foaf:name ?name.
}

Contents of https://alice.pods.org/profile:

✅ </profile#me> :knows <https://bob.pods.org/profile#me>.
✅ </profile#me> :knows <https://carol.org/#i>.
❌ </profile#me> :likes <https://bob.pods.org/posts/hello-world>.

Filtering links and query semantics

SPARQL queries have universal meaning
Queries can be executed over all (RDF) datasets
Query results depend on datasets
Executing same query over different datasets may produce different results
Different link filtering strategies leads to different range of documents
Query results may differ for different strategies

Results also depend on the seed URLs

Zero-knowledge query planning

→ Cope with existing query planning algorithms not working

Score-based heuristics for determining triple pattern order in BGPs

Selectivity
Triple patterns with fewer variables are prioritized
Seed-based
Triple patterns that contain a seed URL are prioritized
No ontologies
Triple patterns for well-known ontologies (e.g. rdf:type) are deprioritized
These often produce many results

Query process
Optimizations
Open problems

Slow query execution

Large number of links to follow
Fetching all documents on the Web is not practical
Following links is expensive
Setting up many TCP connections for doing HTTP requests takes time
No indexing
Query planning is based on heuristics

Query never terminates

Large number of links to follow
New content is produced faster than you can follow links
Attempts to solve this by prioritizing links in queue
For example, depth-first search, breadth-first search, page rank, ...
Only reduces time to first result, not total execution time
Better solutions are needed
Data publishers could point to query-relevant documents

Potential security issues

People can make unauthoritative statements
Web is open → anyone can say anything

Example: Carol gives a faulty name to Bob.
System hogging
Malicious publishers could link to documents that are expensive to process

Example: huge files, documents with syntax errors, ...
Session Hijacking
Sessions for authentication may be leaked to other domains

Query process
Optimizations
Open problems

Conclusions

Link Traversal is suitable for querying decentralized environments
Relatively new research area → many open problems
Main bottleneck is number of links
Future research should focus on guidance towards query-relevant links
You can become a pioneer in this domain!
IDLab is doing research on Link Traversal

We offer student jobs, master theses, PhD jobs

Link Traversal-basedQuery Processing

Query execution by following links between documents

SPARQL processing over centralized data

Dataset is collocated with query engine

Single dataset

How to query over decentralized data?

Data and query engine are not collocated

Not just one datasets

Exploit interlinking of documents

Linked Data documents are linked to each other

Query engine can follow links

Link Traversal-based Query Processing

= Querying by following links between documents

Example: decentralized address book

Example: Find Alice's contact names

Query process:

The Link Traversal Query Process

1. Initialize link queue with seed URLs

2. Iterate and append link queue

3. Execute query

Example: Evolution of the Link Queue (1)

Initialize link queue with seed URL(s)

Example: Evolution of the Link Queue (2)

Follow link to head of queue

Example: Evolution of the Link Queue (3)

Push URLs in https://alice.pods.org/profile to queue

Example: Evolution of the Link Queue (4)

Remove handled link from queue

Example: Evolution of the Link Queue (5)

Follow link to head of queue

Example: Evolution of the Link Queue (6)

No more URLs found in https://bob.pods.org/profile

Example: Evolution of the Link Queue (7)

Remove handled link from queue

Example: Evolution of the Link Queue (8)

Follow link to head of queue

Example: Evolution of the Link Queue (9)

No more URLs found in https://carol.org/

Example: Evolution of the Link Queue (10)

Remove handled link from queue

Execute query over union of all discovered RDF triples

Link Traversal has practical problems

Link queue rarely becomes empty

Too many links are followed

Existing query planning algorithms don't work

Process query during link queue handling

→ Cope with infinitely growing link queues

Extension of pipelining approach

Pre-filter links before they enter queue

→ Cope with large number of links

Follow all

Follow none

Follow if triple matches query

Follow if triple matches query: example

Filtering links and query semantics

SPARQL queries have universal meaning

Query results depend on datasets

Different link filtering strategies leads to different range of documents

Zero-knowledge query planning

→ Cope with existing query planning algorithms not working

Score-based heuristics for determining triple pattern order in BGPs

Selectivity

Seed-based

No ontologies

Slow query execution

Large number of links to follow

Following links is expensive

No indexing

Query never terminates

Large number of links to follow

Attempts to solve this by prioritizing links in queue

Better solutions are needed

Potential security issues

People can make unauthoritative statements

System hogging

Session Hijacking

Conclusions

Link Traversal is suitable for querying decentralized environments

Main bottleneck is number of links

You can become a pioneer in this domain!

Link Traversal-based
Query Processing

Push URLs in `https://alice.pods.org/profile` to queue

No more URLs found in `https://bob.pods.org/profile`

No more URLs found in `https://carol.org/`