Link Traversal Query Processing

Link Traversal
Query Processing

Query execution by following links between documents

Ruben Taelman

Ghent University – imec – IDLab, Belgium

SPARQL processing over centralized data

Dataset is collocated with query engine
All data is known beforehand
Single dataset
Combining multiple datasets is hard

Centralization not always possible

Private data (Solid)
Technical and legal reasons
Evolving data
Requires continuous re-indexing
Web scale data
Indexing the whole Web is infeasible for non-tech-giants

How to query over decentralized data?

Data and query engine are not collocated
Query engine runs on a separate machine
Not just one datasets
Data is spread over the Web into multiple documents

Approaches for querying over decentralized data

Federated Query Processing
Distributing query execution across known sources
Link Traversal Query Processing
Local query execution over sources that are discovered by following links

Federated Querying
Link Traversal
Optimizations
Open problems

Federated Querying
Link Traversal
Optimizations
Open problems

Client distributes query over query APIs

Clients do limited effort
Split up the query, distribute it (source selection), and combine results
Servers perform most of the effort
They actually execute the queries, over potentially huge datasets

Federation over SPARQL endpoints

Servers are SPARQL endpoints (most common)
They accept any valid SPARQL query
Client-side source selection
Rewrite query in terms of SERVICE clauses

SELECT ?drug ?title WHERE {
  ?drug db:drugCategory dbc:micronutrient.
  ?drug db:casRegistryNumber ?id.
  ?keggDrug rdf:type kegg :Drug.
  ?keggDrug bio2rdf:xRef ?id.
  ?keggDrug purl:title ?title.
}

SELECT ?drug ?title WHERE {
  SERVICE <http://example.com/drb> {
    ?drug db:drugCategory dbc:micronutrient.
    ?drug db:casRegistryNumber ?id.
  }
  SERVICE <http://example.com/kegg> {
    ?keggDrug rdf:type kegg :Drug.
    ?keggDrug bio2rdf:xRef ?id.
    ?keggDrug purl:title ?title.
  }
}

Federation over heterogeneous sources

Servers are not only SPARQL endpoints
Other types of Linked Data Fragments: TPF, WiseKG, brTPF, ...
Different levels of server expressivity
Clients may have to take up more effort
Executing parts of queries client-side
Trade-off between server and client effort
Low-cost publishing and preventing server availability issues

Limitations of federated querying

All federation members must be known before execution starts
Source selection distributes query across list of sources
No discovery of new sources
Limited scalability in terms of number of endpoints
Current federation techniques scale to the order of 10 sources

Federated Querying
Link Traversal
Optimizations
Open problems

Exploit interlinking of documents

Linked Data documents are linked to each other
Following the Linked Data principles
Query engine can follow links
Start from one document, and discover new documents on the fly

Link Traversal-based Query Processing

= Querying by following links between documents

Example: decentralized address book

Example: Find Alice's contact names

SELECT ?name WHERE {
    <https://alice.pods.org/profile#me>
        foaf:knows ?person.
    ?person foaf:name ?name.
}

Query process:

Start from Alice's address book
Follow links to profiles of Bob and Carol
Query over union of all profiles
Find query results: [ { "name": "Bob" }, { "name": "Carol" } ]

The Link Traversal Query Process

Link queue: stores links to documents that need to be fetched

1. Initialize link queue with seed URLs
Seed URLs can be user-defined, or derived from query
2. Iterate and append link queue
Iteratively pop head off queue, and follow link to document

Add all URLs in document to link queue
3. Execute query
Over union of all discovered RDF triples

Example: Evolution of the Link Queue (1)

Initialize link queue with seed URL(s)

https://alice.pods.org/profile

Example: Evolution of the Link Queue (2)

Follow link to head of queue

https://alice.pods.org/profile

Example: Evolution of the Link Queue (3)

Push URLs in `https://alice.pods.org/profile` to queue

https://alice.pods.org/profile
https://bob.pods.org/profile
https://carol.org/

</profile#me> :knows
  <https://bob.pods.org/profile#me>,
  <https://carol.org/#i>.

No need to re-add link to https://alice.pods.org/profile

Example: Evolution of the Link Queue (4)

Remove handled link from queue

https://bob.pods.org/profile

https://carol.org/

Example: Evolution of the Link Queue (5)

Follow link to head of queue

https://bob.pods.org/profile
https://carol.org/

Example: Evolution of the Link Queue (6)

No more URLs found in `https://bob.pods.org/profile`

https://bob.pods.org/profile
https://carol.org/

</profile#me> :name "Bob";
  :email <mailto:bob@mail.org>;
  :telephone "0499 12 34 56".

mailto:bob@mail.org is not an HTTP(S) link that can be followed

Example: Evolution of the Link Queue (7)

Remove handled link from queue

https://carol.org/

Example: Evolution of the Link Queue (8)

Follow link to head of queue

https://carol.org/

Example: Evolution of the Link Queue (9)

No more URLs found in `https://carol.org/`

https://carol.org/

</#i> :name "Carol";
  :email <mailto:i@carol.org>;
  :telephone "0499 11 22 33".

Example: Evolution of the Link Queue (10)

Remove handled link from queue

Link queue is empty

Execute query over union of all discovered RDF triples

Federated Querying
Link Traversal
Optimizations
Open problems

Link Traversal has practical problems

But solutions exist!

Link queue rarely becomes empty
It grows in size due to massive size of the Web

Query process never has a chance to start!
Too many links are followed
Each document typically contains many links → exponential growth
Existing query planning algorithms don't work
There is no a priori knowledge about cardinalities and other statistics

Process query during link queue handling

→ Cope with infinitely growing link queues

Extension of pipelining approach
Link queue is a component in pipeline

May produce an infinite data stream

Pre-filter links before they enter queue

→ Cope with large number of links

Different filtering strategies exist

Follow all
No filtering, follow all possible links
Follow none
Follow no links
Follow if triple matches query
Follow links in a triple if they match at least one triple pattern in the query

Follow if triple matches query: example

SELECT ?name WHERE {
    <https://alice.pods.org/profile#me> foaf:knows ?person.
    ?person foaf:name ?name.
}

Contents of https://alice.pods.org/profile:

✅ </profile#me> :knows <https://bob.pods.org/profile#me>.
✅ </profile#me> :knows <https://carol.org/#i>.
❌ </profile#me> :likes <https://bob.pods.org/posts/hello-world>.

Filtering links and query semantics

SPARQL queries have universal meaning
Queries can be executed over all (RDF) datasets
Query results depend on datasets
Executing same query over different datasets may produce different results
Different link filtering strategies leads to different range of documents
Query results may differ for different strategies

Results also depend on the seed URLs

Zero-knowledge query planning

→ Cope with existing query planning algorithms not working

Score-based heuristics for determining triple pattern order in BGPs

Selectivity
Triple patterns with fewer variables are prioritized
Seed-based
Triple patterns that contain a seed URL are prioritized
No ontologies
Triple patterns for well-known ontologies (e.g. rdf:type) are deprioritized
These often produce many results

Federated Querying
Link Traversal
Optimizations
Open problems

Slow query execution

Large number of links to follow
Fetching all documents on the Web is not practical
Following links is expensive
Setting up many TCP connections for doing HTTP requests takes time
No indexing
Query planning is based on heuristics

Query never terminates

Large number of links to follow
New content is produced faster than you can follow links
Attempts to solve this by prioritizing links in queue
For example, depth-first search, breadth-first search, page rank, ...
Only reduces time to first result, not total execution time
Better solutions are needed
Data publishers could point to query-relevant documents

Potential security issues

People can make unauthoritative statements
Web is open → anyone can say anything

Example: Carol gives a faulty name to Bob.
System hogging
Malicious publishers could link to documents that are expensive to process

Example: huge files, documents with syntax errors, ...

Federated Querying
Query process
Optimizations
Open problems

Conclusions

Link Traversal is suitable for querying decentralized environments
Copes with limitations of federated querying
Relatively new research area → many open problems
Main bottleneck is number of links
Future research should focus on guidance towards query-relevant links
You can become a pioneer in this domain!
IDLab is doing research on Link Traversal
We offer student jobs, master theses, PhD jobs

Link TraversalQuery Processing

Query execution by following links between documents

SPARQL processing over centralized data

Dataset is collocated with query engine

Single dataset

Centralization not always possible

Private data (Solid)

Evolving data

Web scale data

How to query over decentralized data?

Data and query engine are not collocated

Not just one datasets

Approaches for querying over decentralized data

Federated Query Processing

Link Traversal Query Processing

Client distributes query over query APIs

Clients do limited effort

Servers perform most of the effort

Federation over SPARQL endpoints

Servers are SPARQL endpoints (most common)

Client-side source selection

Federation over heterogeneous sources

Servers are not only SPARQL endpoints

Clients may have to take up more effort

Trade-off between server and client effort

Limitations of federated querying

All federation members must be known before execution starts

Limited scalability in terms of number of endpoints

Exploit interlinking of documents

Linked Data documents are linked to each other

Query engine can follow links

Link Traversal-based Query Processing

= Querying by following links between documents

Example: decentralized address book

Example: Find Alice's contact names

Query process:

The Link Traversal Query Process

1. Initialize link queue with seed URLs

2. Iterate and append link queue

3. Execute query

Example: Evolution of the Link Queue (1)

Initialize link queue with seed URL(s)

Example: Evolution of the Link Queue (2)

Follow link to head of queue

Example: Evolution of the Link Queue (3)

Push URLs in https://alice.pods.org/profile to queue

Example: Evolution of the Link Queue (4)

Remove handled link from queue

Example: Evolution of the Link Queue (5)

Follow link to head of queue

Example: Evolution of the Link Queue (6)

No more URLs found in https://bob.pods.org/profile

Example: Evolution of the Link Queue (7)

Remove handled link from queue

Example: Evolution of the Link Queue (8)

Follow link to head of queue

Example: Evolution of the Link Queue (9)

No more URLs found in https://carol.org/

Example: Evolution of the Link Queue (10)

Remove handled link from queue

Execute query over union of all discovered RDF triples

Link Traversal has practical problems

Link queue rarely becomes empty

Too many links are followed

Existing query planning algorithms don't work

Process query during link queue handling

→ Cope with infinitely growing link queues

Extension of pipelining approach

Pre-filter links before they enter queue

→ Cope with large number of links

Follow all

Follow none

Follow if triple matches query

Follow if triple matches query: example

Filtering links and query semantics

SPARQL queries have universal meaning

Query results depend on datasets

Different link filtering strategies leads to different range of documents

Zero-knowledge query planning

→ Cope with existing query planning algorithms not working

Link Traversal
Query Processing

Push URLs in `https://alice.pods.org/profile` to queue

No more URLs found in `https://bob.pods.org/profile`

No more URLs found in `https://carol.org/`