SPARQL querying over RDF datasets
 SPARQL: a language for reading and updating data in RDF datasets via declarative queries.

Different query forms:
SELECT
: selecting values in tabular form → focus of this presentation
CONSTRUCT
: construct new triples
ASK
: check if data exists
DESCRIBE
: describe a given resource
INSERT
: insert new triples
DELETE
: delete existing triples
Find all artists born in York
name  deathDate 
Albert Joseph Moore  
Charles Francis Hansom  1888 
David Reed (comedian)  
Dustin Gee  
E Ridsdale Tate  1922 
How do query engines process a query?
RDF dataset + SPARQL query
↓
...
↓
query results
How do query engines process a query?
RDF dataset + SPARQL query
↓
SPARQL query processing
↓
query results
Basic Graph Patterns enable graph pattern matching
Query results representation

Solution Mapping
Mapping from a set of variable labels to a set of RDF terms.

Solution Sequence
A list of solution mappings.
1 solution sequence with 3 solution mappings:
name  birthplace 
Bob Brockmann  http://dbpedia.org/resource/Louisiana 
Bennie Nawahi  http://dbpedia.org/resource/Honolulu 
Weird Al Yankovic  http://dbpedia.org/resource/Downey,_California 
Steps in SPARQL query processing

1. Parsing
Transform a SPARQL query string into an algebra expression

2. Optimization
Transform algebra expression into a query plan

3. Evaluation
Executes query plan to obtain query results
Parsing: transform query string into SPARQL Algebra

SPARQL Algebra = Abstract Syntax Tree (AST)
Algebraic datastructure that represents the query structure
Tree where each node is a separate operator: join, BGP, filter, ...
Easier for engines to work with than strings
Enables algebraic query optimizations
A.k.a query plan → basis for query planning
SPARQL Algebra example
Input:
SELECT ?x ?y ?z WHERE {
?x ?y ?z
}
Output:
{
"type": "project",
"input": {
"type": "bgp",
"patterns": [
{
"type": "pattern",
"subject": {
"termType": "Variable",
"value": "x"
},
"predicate": {
"termType": "Variable",
"value": "y"
},
"object": {
"termType": "Variable",
"value": "z"
}
}
]
},
"variables": [
{ "termType": "Variable",
"value": "x" },
{ "termType": "Variable",
"value": "y" },
{ "termType": "Variable",
"value": "z" }
]
}
Why is optimization needed?

Queries are declarative
They say what needs to be done, not how.
Transformation into executable code can happen in many different ways.

Factors that influence performance
Dataset
Order of operators in AST
Available operator algorithm implementations
Available data indexes
Different levels of optimization

Algebraic optimization
Independent of underlying data
Based on formally provable equivalence of operations

Lowlevel optimization
Based on the underlying data
Select and configure specific algorithms of algebra operators
Algebraic optimization

Dataindependent
Independent of the underlying dataset
Based on heuristics

Preserves evaluation semantics
Optimized algebra produces same results as executing unoptimized algebra

Tradeoff between optimization and performance
Algebraic optimization can take a lot of time,
sometimes even more than execution the suboptimal algebra.
Example: ORDER BY
+ DISTINCT

ORDER BY
: Sort solution mappings by the given variable

DISTINCT
: Remove duplicate solution mappings

SPARQL requires
DISTINCT
to happen after ORDER BY

Performance issues when sorting many nondistinct solution mappings

Optimization: Apply
DISTINCT
before ORDER BY
Input:
{
"type": "distinct",
"input": {
"type": "orderby",
"variable": "x",
"input": { ... }
}
}
Output:
{
"type": "orderby",
"variable": "x",
"input": {
"type": "distinct",
"input": { ... }
}
}
Lowlevel optimization

Datadependent
Based on the underlying dataset, statistics, and derived indexes.

Minimize computational complexity
Reduce the number of intermediary results.

Select and configure specific algorithms of algebra operators
Different algorithms behave better in certain situations.
Example: joining in BGPs
How, and in what order should triple patterns be evaluated?
1. ?person a dbpediaowl:Artist.
2. ?person rdfs:label ?name.
3. ?person dbpediaowl:birthPlace ?birthPlace.
Different factors influence join order

Cardinality of each pattern
Can be exact, or an estimate

Complexity of join algorithms
Based on cardinalities of patterns

Precomputing steps of join algorithms
Precomputation may be too expensive for low cardinalities

Restrictions of join algorithms
Some algorithms only work in certain cases
Obtain cardinality of each pattern
1. ?person a dbpediaowl:Artist.
200
2. ?person rdfs:label ?name.
10.000
3. ?person dbpediaowl:birthPlace ?birthPlace.
1.000
Can be precomputed, or estimated at runtime.
Select appropriate join algorithms
Usually happens in pairs of two operators (n
, m
)

Nestedloop join
Double forloop → O(n*m)
Works well for small sequences due to the lack of preprocessing.

Hash join
Precomputes hash tables for all mappings in one solution sequence.
Iterates over other solution sequence with lookups to hash table → O(n+m)
Not possible with OPTIONAL

Sortmerge join
Sort solution sequences.
Iterates over both solution sequences in order → O(n+m)
Sorting can become expensive with larger solution sequences
Join orders and algorithms influence worstcase complexity

All nestedloop join
((1 ⋈ 2) ⋈ 3) = ((200 * 10.000) * 1.000) = 2.000.000.000

All hash join
((1 ⋈ 2) ⋈ 3) = ((200 + 10.000) + 1.000) = 11.200

Nestedloop join and hash join
((1 ⋈ 3) ⋈ 2) = ((200 * 1.000) + 10.000) = 210.000
→ Restrictions and preprocessing steps (sorting, hashing, ...) not yet taken into account
→ Tradeoff between different factors that influence join order
Evaluating the optimized query plan

Bottomup
Start with leaves in query plan, and work up to the root.

Parallelisation
Evaluate independent branches in parallel on different threads.

Pipelining
Reduce time to first result.

Adaptive processing
Advanced: adaptively modify the query plan at runtime.
Evaluation iteratively handles leaf nodes
Based on operator type
{
"type": "project",
"variables": [ ... ]
"input": {
"type": "bgp",
"patterns": [
{
"type": "pattern",
"subject": { ... },
"predicate": { ... },
"object": { ... }
},
{
"type": "pattern",
"subject": { ... },
"predicate": { ... },
"object": { ... }
}
]
}
}
Evaluation iteratively handles leaf nodes
{
"type": "project",
"variables": [ ... ]
"input": {
"type": "bgp",
"patterns": [
{
"type": "pattern",
"subject": { ... },
"predicate": { ... },
"object": { ... }
},
{
"type": "pattern",
"subject": { ... },
"predicate": { ... },
"object": { ... }
}
]
}
}
Evaluation iteratively handles leaf nodes
{
"type": "project",
"variables": [ ... ]
"input": {
"type": "bgp",
"patterns": [
{
...solution sequence...
},
{
"type": "pattern",
"subject": { ... },
"predicate": { ... },
"object": { ... }
}
]
}
}
Evaluation iteratively handles leaf nodes
{
"type": "project",
"variables": [ ... ]
"input": {
"type": "bgp",
"patterns": [
{
...solution sequence...
},
{
"type": "pattern",
"subject": { ... },
"predicate": { ... },
"object": { ... }
}
]
}
}
Evaluation iteratively handles leaf nodes
{
"type": "project",
"variables": [ ... ]
"input": {
"type": "bgp",
"patterns": [
{
...solution sequence...
},
{
...solution sequence...
}
]
}
}
Evaluation iteratively handles leaf nodes
{
"type": "project",
"variables": [ ... ]
"input": {
"type": "bgp",
"patterns": [
{
...solution sequence...
},
{
...solution sequence...
}
]
}
}
Evaluation iteratively handles leaf nodes
{
"type": "project",
"variables": [ ... ]
"input": {
...solution sequence...
}
}
Evaluation iteratively handles leaf nodes
{
"type": "project",
"variables": [ ... ]
"input": {
...solution sequence...
}
}
Evaluation iteratively handles leaf nodes
...solution sequence...
Some operations can be parallelised

Operators with multiple independent inputs
UNION branches are fully independent
Triple patterns in BGPs can be handled independently based on the algorithm
Some joins can be partially handled independently (e.g. sorting step in sortmerge join)

Local threads vs. distributed machines
Communication cost may be high → parallelising not always beneficial

I/O vs. CPU cost
Depending on how data is stored, I/O may be the bottleneck
E.g. remotely stored data → HTTP is bottleneck instead of CPU
Pipelining produces results as soon as possible

Evaluating operators is suboptimal
Requires intermediary results to be kept in memory
Results only become available once everything is done processing

Pipelining enables intermediary results to flow through operators
In general, intermediary results do not have to be kept in memory
Query results are produced as a stream, as soon as they are available
A pipeline of operators
Adaptive processing responds to changing conditions

Query optimization as preprocessing is restrictive
Assumes that all information is known beforehand
Can not cope with volatile distributed environments

Adaptive query processing: optimization during evaluation
Modify query plan based on network or data changes
Algorithms are typically more complex
Conclusions

Different steps in SPARQL query processing
Parsing, optimization, evaluation

Query optimization is crucial for performance
Based on heuristics and statistics
Most research happens in this area

Learn more from realworld SPARQL query engines
Comunica (TypeScript/JavaScript) and Jena (Java) are opensource
http://query.linkeddatafragments.org/