SPARQL querying over RDF datasets
- SPARQL: a language for reading and updating data in RDF datasets via declarative queries.
-
Different query forms:
SELECT
: select values in tabular form → focus of this presentation
CONSTRUCT
: construct new triples
ASK
: check whether matching data exists
DESCRIBE
: describe a given resource
INSERT
: insert new triples (part of SPARQL Update)
DELETE
: delete existing triples (part of SPARQL Update)
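For example, an ASK query that checks whether any artist was born in York (an illustrative sketch; dbpedia-owl: and dbpedia: prefix declarations are assumed):
ASK {
  ?person a dbpedia-owl:Artist ;
          dbpedia-owl:birthPlace dbpedia:York .
}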
Find all artists born in York
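A SELECT query along the following lines could produce the results below (a sketch; the DBpedia vocabulary is assumed, and prefix declarations are omitted). The OPTIONAL clause explains the empty deathDate cells:
SELECT ?name ?deathDate WHERE {
  ?person a dbpedia-owl:Artist ;
          rdfs:label ?name ;
          dbpedia-owl:birthPlace dbpedia:York .
  OPTIONAL { ?person dbpedia-owl:deathDate ?deathDate . }
}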
name                   | deathDate
Albert Joseph Moore    |
Charles Francis Hansom | 1888
David Reed (comedian)  |
Dustin Gee             |
E Ridsdale Tate        | 1922
How do query engines process a query?
RDF dataset + SPARQL query
↓
SPARQL query processing
↓
query results
Basic Graph Patterns enable graph pattern matching
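For example (illustrative, reusing the DBpedia vocabulary of later slides), a BGP consisting of two triple patterns; a solution is any binding of ?person and ?name for which both patterns match the dataset:
?person a dbpedia-owl:Artist .
?person rdfs:label ?name .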
Query results representation
-
Solution Mapping
A partial mapping from variable labels to RDF terms.
-
Solution Sequence
A list of solution mappings.
1 solution sequence with 3 solution mappings:
name              | birthplace
Bob Brockmann     | http://dbpedia.org/resource/Louisiana
Bennie Nawahi     | http://dbpedia.org/resource/Honolulu
Weird Al Yankovic | http://dbpedia.org/resource/Downey,_California
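As a minimal sketch, these two concepts can be expressed as TypeScript types (names are illustrative; the termType/value shape follows the RDF/JS data model also used in the algebra examples later on):
// An RDF term, in the style of the RDF/JS data model.
type RDFTerm = { termType: 'NamedNode' | 'BlankNode' | 'Literal'; value: string };
// A solution mapping: variable labels mapped to RDF terms.
type SolutionMapping = Map<string, RDFTerm>;
// A solution sequence: an ordered list of solution mappings.
type SolutionSequence = SolutionMapping[];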
Steps in SPARQL query processing
-
1. Parsing
Transform a SPARQL query string into an algebra expression
-
2. Optimization
Transform algebra expression into a query plan
-
3. Evaluation
Execute the query plan to obtain query results
Parsing: transform query string into SPARQL Algebra
-
SPARQL Algebra = Abstract Syntax Tree (AST)
Algebraic data structure that represents the query structure
Tree where each node is a separate operator: join, BGP, filter, ...
Easier for engines to work with than strings
Enables algebraic query optimizations
A.k.a. query plan → basis for query planning
SPARQL Algebra example
Input:
SELECT ?x ?y ?z WHERE {
  ?x ?y ?z
}
Output:
{
  "type": "project",
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject":   { "termType": "Variable", "value": "x" },
        "predicate": { "termType": "Variable", "value": "y" },
        "object":    { "termType": "Variable", "value": "z" }
      }
    ]
  },
  "variables": [
    { "termType": "Variable", "value": "x" },
    { "termType": "Variable", "value": "y" },
    { "termType": "Variable", "value": "z" }
  ]
}
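For reference, this output matches what a SPARQL algebra library can produce; a sketch using the sparqlalgebrajs library (used by Comunica), assuming a Node.js environment:
import { translate } from 'sparqlalgebrajs';

// Translate a SPARQL query string into a SPARQL algebra expression.
const algebra = translate('SELECT ?x ?y ?z WHERE { ?x ?y ?z }');
console.log(JSON.stringify(algebra, null, 2));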
Why is optimization needed?
-
Queries are declarative
They say what needs to be done, not how.
Transformation into executable code can happen in many different ways.
-
Factors that influence performance
Dataset
Order of operators in AST
Available operator algorithm implementations
Available data indexes
Different levels of optimization
-
Algebraic optimization
Independent of underlying data
Based on formally provable equivalence of operations
-
Low-level optimization
Based on the underlying data
Select and configure specific algorithms of algebra operators
Algebraic optimization
-
Data-independent
Independent of the underlying dataset
Based on heuristics
-
Preserves evaluation semantics
Executing the optimized algebra produces the same results as executing the unoptimized algebra
-
Trade-off between optimization and performance
Algebraic optimization can take a lot of time, sometimes even more than executing the sub-optimal algebra.
Example: ORDER BY + DISTINCT
-
ORDER BY
: Sort solution mappings by the given variable
-
DISTINCT
: Remove duplicate solution mappings
-
SPARQL requires DISTINCT to happen after ORDER BY
-
Performance issues when sorting many non-distinct solution mappings
-
Optimization: Apply DISTINCT before ORDER BY
Input:
{
  "type": "distinct",
  "input": {
    "type": "orderby",
    "variable": "x",
    "input": { ... }
  }
}
Output:
{
  "type": "orderby",
  "variable": "x",
  "input": {
    "type": "distinct",
    "input": { ... }
  }
}
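As a minimal sketch (TypeScript; the Op shape is illustrative), this rewrite rule can be implemented as a single pattern match on the algebra tree:
interface Op { type: string; input?: Op; [key: string]: unknown; }

// Rewrite distinct(orderby(X)) into orderby(distinct(X)).
function pushDistinctBelowOrderBy(op: Op): Op {
  if (op.type === 'distinct' && op.input?.type === 'orderby') {
    const orderby = op.input;
    return { ...orderby, input: { type: 'distinct', input: orderby.input } };
  }
  return op;
}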
Low-level optimization
-
Data-dependent
Based on the underlying dataset, statistics, and derived indexes.
-
Minimize computational complexity
Reduce the number of intermediary results.
-
Select and configure specific algorithms of algebra operators
Different algorithms behave better in certain situations.
Example: joining in BGPs
How, and in what order, should triple patterns be evaluated?
1. ?person a dbpedia-owl:Artist.
2. ?person rdfs:label ?name.
3. ?person dbpedia-owl:birthPlace ?birthPlace.
Different factors influence join order
-
Cardinality of each pattern
Can be exact, or an estimate
-
Complexity of join algorithms
Based on cardinalities of patterns
-
Precomputing steps of join algorithms
Precomputation may be too expensive for low cardinalities
-
Restrictions of join algorithms
Some algorithms only work in certain cases
Obtain cardinality of each pattern
1. ?person a dbpedia-owl:Artist. → 200
2. ?person rdfs:label ?name. → 10,000
3. ?person dbpedia-owl:birthPlace ?birthPlace. → 1,000
Can be pre-computed, or estimated at runtime.
Select appropriate join algorithms
Usually happens on pairs of two operators, with input cardinalities (n, m)
-
Nested-loop join
Double for-loop → O(n*m)
Works well for small sequences due to the lack of preprocessing.
-
Hash join
Precomputes a hash table for all mappings in one solution sequence.
Iterates over the other solution sequence with lookups to the hash table → O(n+m)
Not possible with OPTIONAL
-
Sort-merge join
Sorts both solution sequences.
Iterates over both solution sequences in order → O(n+m)
Sorting can become expensive with larger solution sequences
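A minimal sketch of a hash join over two solution sequences (TypeScript; simplified to a single join variable, whereas real engines join on all shared variables):
type Term = string;
type Mapping = Map<string, Term>;

function hashJoin(build: Mapping[], probe: Mapping[], joinVar: string): Mapping[] {
  // Build phase: index one sequence by the join variable → O(n).
  const index = new Map<Term, Mapping[]>();
  for (const m of build) {
    const key = m.get(joinVar)!;
    if (!index.has(key)) index.set(key, []);
    index.get(key)!.push(m);
  }
  // Probe phase: stream over the other sequence with O(1) lookups → O(m).
  const results: Mapping[] = [];
  for (const m of probe) {
    for (const match of index.get(m.get(joinVar)!) ?? []) {
      results.push(new Map([...match, ...m]));
    }
  }
  return results;
}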
Join orders and algorithms influence worst-case complexity
-
All nested-loop join
((1 ⋈ 2) ⋈ 3) = ((200 * 10,000) * 1,000) = 2,000,000,000
-
All hash join
((1 ⋈ 2) ⋈ 3) = ((200 + 10,000) + 1,000) = 11,200
-
Nested-loop join and hash join
((1 ⋈ 3) ⋈ 2) = ((200 * 1,000) + 10,000) = 210,000
→ Restrictions and preprocessing steps (sorting, hashing, ...) not yet taken into account
→ Trade-off between different factors that influence join order
Evaluating the optimized query plan
-
Bottom-up
Start with leaves in query plan, and work up to the root.
-
Parallelisation
Evaluate independent branches in parallel on different threads.
-
Pipelining
Reduce time to first result.
-
Adaptive processing
Advanced: adaptively modify the query plan at runtime.
Evaluation iteratively handles leaf nodes
Based on operator type
{
  "type": "project",
  "variables": [ ... ],
  "input": {
    "type": "bgp",
    "patterns": [
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}
Evaluation iteratively handles leaf nodes
{
  "type": "project",
  "variables": [ ... ],
  "input": {
    "type": "bgp",
    "patterns": [
      { ...solution sequence... },
      {
        "type": "pattern",
        "subject": { ... },
        "predicate": { ... },
        "object": { ... }
      }
    ]
  }
}
Evaluation iteratively handles leaf nodes
{
  "type": "project",
  "variables": [ ... ],
  "input": {
    "type": "bgp",
    "patterns": [
      { ...solution sequence... },
      { ...solution sequence... }
    ]
  }
}
Evaluation iteratively handles leaf nodes
{
  "type": "project",
  "variables": [ ... ],
  "input": { ...solution sequence... }
}
Evaluation iteratively handles leaf nodes
...solution sequence...
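The whole procedure can be summarised as a recursive, bottom-up evaluator; a minimal self-contained sketch (TypeScript) over a tiny algebra with only bgp and project operators, where all names and structures are illustrative:
type Term = string;                 // '?x' marks a variable; anything else is a constant
type Mapping = Map<string, Term>;
type Triple = [Term, Term, Term];

type Op =
  | { type: 'bgp'; patterns: Triple[] }
  | { type: 'project'; variables: string[]; input: Op };

const isVar = (t: Term) => t.startsWith('?');

// Leaf evaluation: match one triple pattern against the dataset
// (ignores repeated variables within a single pattern, for brevity).
function matchPattern(p: Triple, data: Triple[]): Mapping[] {
  const out: Mapping[] = [];
  for (const t of data) {
    const m: Mapping = new Map();
    const matches = p.every((term, i) =>
      isVar(term) ? (m.set(term, t[i]), true) : term === t[i]);
    if (matches) out.push(m);
  }
  return out;
}

// Nested-loop join of two solution sequences on their shared variables.
function join(a: Mapping[], b: Mapping[]): Mapping[] {
  const out: Mapping[] = [];
  for (const x of a) for (const y of b) {
    if ([...x].every(([v, t]) => !y.has(v) || y.get(v) === t)) {
      out.push(new Map([...x, ...y]));
    }
  }
  return out;
}

// Bottom-up: evaluate the leaves first, then work up to the root.
function evaluate(op: Op, data: Triple[]): Mapping[] {
  switch (op.type) {
    case 'bgp':
      return op.patterns.map(p => matchPattern(p, data)).reduce(join);
    case 'project':
      return evaluate(op.input, data)
        .map(m => new Map([...m].filter(([v]) => op.variables.includes(v))));
  }
}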
Some operations can be parallelised
-
Operators with multiple independent inputs
UNION branches are fully independent
Triple patterns in BGPs can be handled independently, depending on the join algorithm
Some joins can be partially handled independently (e.g. sorting step in sort-merge join)
-
Local threads vs. distributed machines
Communication cost may be high → parallelising not always beneficial
-
I/O vs. CPU cost
Depending on how data is stored, I/O may be the bottleneck
E.g. remotely stored data → HTTP is bottleneck instead of CPU
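A minimal sketch of evaluating two independent branches concurrently (TypeScript; evaluateBranch is a hypothetical stand-in for evaluating one algebra subtree, and Promise-based concurrency mainly pays off when I/O such as HTTP is the bottleneck):
type Mapping = Map<string, string>;

// Stand-in for evaluating one independent subtree (e.g. a UNION branch).
async function evaluateBranch(name: string): Promise<Mapping[]> {
  return [new Map([['branch', name]])];
}

async function evaluateUnion(): Promise<Mapping[]> {
  // Both branches are independent, so they can be evaluated concurrently.
  const [left, right] = await Promise.all([
    evaluateBranch('left'),
    evaluateBranch('right'),
  ]);
  return [...left, ...right];
}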
Pipelining produces results as soon as possible
-
Evaluating operators one at a time, to completion, is sub-optimal
Requires intermediary results to be kept in memory
Results only become available once everything is done processing
-
Pipelining enables intermediary results to flow through operators
In general, intermediary results do not have to be kept in memory
Query results are produced as a stream, as soon as they are available
(Figure: a pipeline of operators)
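A minimal sketch of one pipelined operator (TypeScript): a generator-based DISTINCT that yields each mapping as soon as it is known to be unique, instead of materializing its full input first (the key construction is illustrative):
type Mapping = Map<string, string>;

function* distinct(input: Iterable<Mapping>): Generator<Mapping> {
  const seen = new Set<string>();
  for (const m of input) {
    const key = JSON.stringify([...m.entries()].sort());
    if (!seen.has(key)) {
      seen.add(key);
      yield m; // emitted as soon as it is known to be unique
    }
  }
}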
Adaptive processing responds to changing conditions
-
Query optimization as preprocessing is restrictive
Assumes that all information is known beforehand
Cannot cope with volatile distributed environments
-
Adaptive query processing: optimization during evaluation
Modify query plan based on network or data changes
Algorithms are typically more complex
Conclusions
-
Different steps in SPARQL query processing
Parsing, optimization, evaluation
-
Query optimization is crucial for performance
Based on heuristics and statistics
Most research happens in this area
-
Learn more from real-world SPARQL query engines
Comunica (TypeScript/JavaScript) and Jena (Java) are open-source
http://query.linkeddatafragments.org/