SPARQL processing over centralized data
-
Dataset is collocated with query engine
All data is known beforehand
-
Single dataset
Combining multiple datasets is hard
Centralization not always possible
-
Private data (Solid)
Technical and legal reasons
-
Evolving data
Requires continuous re-indexing
-
Web scale data
Indexing the whole Web is infeasible for non-tech-giants
How to query over decentralized data?
-
Data and query engine are not collocated
Query engine runs on a separate machine
-
Not just one datasets
Data is spread over the Web into multiple documents
Approaches for querying over decentralized data
-
Federated Query Processing
Distributing query execution across known sources
-
Link Traversal Query Processing
Local query execution over sources that are discovered by following links
Client distributes query over query APIs
-
Clients do limited effort
Split up the query, distribute it (source selection), and combine results
-
Servers perform most of the effort
They actually execute the queries, over potentially huge datasets
Federation over SPARQL endpoints
-
Servers are SPARQL endpoints (most common)
They accept any valid SPARQL query
-
Client-side source selection
Rewrite query in terms of SERVICE clauses
SELECT ?drug ?title WHERE {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
?keggDrug rdf:type kegg :Drug.
?keggDrug bio2rdf:xRef ?id.
?keggDrug purl:title ?title.
}
SELECT ?drug ?title WHERE {
SERVICE <http://example.com/drb> {
?drug db:drugCategory dbc:micronutrient.
?drug db:casRegistryNumber ?id.
}
SERVICE <http://example.com/kegg> {
?keggDrug rdf:type kegg :Drug.
?keggDrug bio2rdf:xRef ?id.
?keggDrug purl:title ?title.
}
}
Federation over heterogeneous sources
-
Servers are not only SPARQL endpoints
Other types of Linked Data Fragments: TPF, WiseKG, brTPF, ...
Different levels of server expressivity
-
Clients may have to take up more effort
Executing parts of queries client-side
-
Trade-off between server and client effort
Low-cost publishing and preventing server availability issues
Limitations of federated querying
-
All federation members must be known before execution starts
Source selection distributes query across list of sources
No discovery of new sources
-
Limited scalability in terms of number of endpoints
Current federation techniques scale to the order of 10 sources
Exploit interlinking of documents
-
Linked Data documents are linked to each other
Following the Linked Data principles
-
Query engine can follow links
Start from one document, and discover new documents on the fly
Link Traversal-based Query Processing
= Querying by following links between documents
Example: decentralized address book
Example: Find Alice's contact names
SELECT ?name WHERE {
<https://alice.pods.org/profile#me>
foaf:knows ?person.
?person foaf:name ?name.
}
Query process:
- Start from Alice's address book
- Follow links to profiles of Bob and Carol
- Query over union of all profiles
- Find query results:
[ { "name": "Bob" }, { "name": "Carol" } ]
The Link Traversal Query Process
Link queue: stores links to documents that need to be fetched
-
1. Initialize link queue with seed URLs
Seed URLs can be user-defined, or derived from query
-
2. Iterate and append link queue
Iteratively pop head off queue, and follow link to document
Add all URLs in document to link queue
-
3. Execute query
Over union of all discovered RDF triples
Example: Evolution of the Link Queue (1)
Initialize link queue with seed URL(s)
https://alice.pods.org/profile |
Example: Evolution of the Link Queue (2)
Follow link to head of queue
https://alice.pods.org/profile |
Example: Evolution of the Link Queue (3)
Push URLs in https://alice.pods.org/profile
to queue
https://alice.pods.org/profile |
https://bob.pods.org/profile |
https://carol.org/ |
</profile#me> :knows
<https://bob.pods.org/profile#me>,
<https://carol.org/#i>.
No need to re-add link to https://alice.pods.org/profile
Example: Evolution of the Link Queue (4)
Remove handled link from queue
https://bob.pods.org/profile |
https://carol.org/ |
Example: Evolution of the Link Queue (5)
Follow link to head of queue
https://bob.pods.org/profile |
https://carol.org/ |
Example: Evolution of the Link Queue (6)
No more URLs found in https://bob.pods.org/profile
https://bob.pods.org/profile |
https://carol.org/ |
</profile#me> :name "Bob";
:email <mailto:bob@mail.org>;
:telephone "0499 12 34 56".
mailto:bob@mail.org is not an HTTP(S) link that can be followed
Example: Evolution of the Link Queue (7)
Remove handled link from queue
Example: Evolution of the Link Queue (8)
Follow link to head of queue
Example: Evolution of the Link Queue (9)
No more URLs found in https://carol.org/
</#i> :name "Carol";
:email <mailto:i@carol.org>;
:telephone "0499 11 22 33".
Example: Evolution of the Link Queue (10)
Remove handled link from queue
Link queue is empty
Execute query over union of all discovered RDF triples
Link Traversal has practical problems
But solutions exist!
-
Link queue rarely becomes empty
It grows in size due to massive size of the Web
Query process never has a chance to start!
-
Too many links are followed
Each document typically contains many links → exponential growth
-
Existing query planning algorithms don't work
There is no a priori knowledge about cardinalities and other statistics
Process query during link queue handling
→ Cope with infinitely growing link queues
-
Extension of pipelining approach
Link queue is a component in pipeline
May produce an infinite data stream
Pre-filter links before they enter queue
→ Cope with large number of links
Different filtering strategies exist
-
Follow all
No filtering, follow all possible links
-
Follow none
Follow no links
-
Follow if triple matches query
Follow links in a triple if they match at least one triple pattern in the query
Follow if triple matches query: example
SELECT ?name WHERE {
<https://alice.pods.org/profile#me> foaf:knows ?person.
?person foaf:name ?name.
}
Contents of https://alice.pods.org/profile
:
✅ </profile#me> :knows <https://bob.pods.org/profile#me>.
✅ </profile#me> :knows <https://carol.org/#i>.
❌ </profile#me> :likes <https://bob.pods.org/posts/hello-world>.
Filtering links and query semantics
-
SPARQL queries have universal meaning
Queries can be executed over all (RDF) datasets
-
Query results depend on datasets
Executing same query over different datasets may produce different results
-
Different link filtering strategies leads to different range of documents
Query results may differ for different strategies
Results also depend on the seed URLs
Zero-knowledge query planning
→ Cope with existing query planning algorithms not working
Score-based heuristics for determining triple pattern order in BGPs
-
Selectivity
Triple patterns with fewer variables are prioritized
-
Seed-based
Triple patterns that contain a seed URL are prioritized
-
No ontologies
Triple patterns for well-known ontologies (e.g. rdf:type
) are deprioritized
These often produce many results
Slow query execution
-
Large number of links to follow
Fetching all documents on the Web is not practical
-
Following links is expensive
Setting up many TCP connections for doing HTTP requests takes time
-
No indexing
Query planning is based on heuristics
Query never terminates
-
Large number of links to follow
New content is produced faster than you can follow links
-
Attempts to solve this by prioritizing links in queue
For example, depth-first search, breadth-first search, page rank, ...
Only reduces time to first result, not total execution time
-
Better solutions are needed
Data publishers could point to query-relevant documents
Potential security issues
-
People can make unauthoritative statements
Web is open → anyone can say anything
Example: Carol gives a faulty name to Bob.
-
System hogging
Malicious publishers could link to documents that are expensive to process
Example: huge files, documents with syntax errors, ...
Conclusions
-
Link Traversal is suitable for querying decentralized environments
Copes with limitations of federated querying
Relatively new research area → many open problems
-
Main bottleneck is number of links
Future research should focus on guidance towards query-relevant links
-
You can become a pioneer in this domain!
IDLab is doing research on Link Traversal
We offer student jobs, master theses, PhD jobs