Versioning: variants of datasets exist in parallel
- DBpedia 2014
- DBpedia 2015
- DBpedia 2016
- ...
How to store multiple RDF dataset versions to enable efficient versioned queries?
Fernández et al. 2018
RDF versioning
RDF version i
e.g., DBpedia 2015, DBpedia 2016, ...
A set of triples annotated with version label i
Query atoms
- Version Materialization (VM):
Which cities existed in (DBpedia) 2016?
Query a single version
- Delta Materialization (DM):
Which cities were created or removed between 2015 and 2016?
Query the differences between two versions
- Version Query (VQ):
In which years has Brussels existed?
Query all versions; results are annotated with their versions
Fernández et al. 2018
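To make the three query atoms concrete, here is a minimal in-memory C++ sketch (not OSTRICH's implementation; ToyArchive and all other names are illustrative) that evaluates VM, DM, and VQ over an archive stored as independent copies:

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Triple = std::string;   // a triple serialized as one string, for brevity
using Version = int;
using Dataset = std::set<Triple>;

struct ToyArchive {
    std::map<Version, Dataset> versions;  // every version stored in full

    // VM: materialize a single version.
    const Dataset& vm(Version v) const { return versions.at(v); }

    // DM: additions ('+') and deletions ('-') between two versions.
    std::vector<std::pair<char, Triple>> dm(Version from, Version to) const {
        std::vector<std::pair<char, Triple>> delta;
        for (const Triple& t : versions.at(to))
            if (!versions.at(from).count(t)) delta.push_back({'+', t});
        for (const Triple& t : versions.at(from))
            if (!versions.at(to).count(t)) delta.push_back({'-', t});
        return delta;
    }

    // VQ: every triple annotated with the versions in which it occurs.
    std::map<Triple, std::vector<Version>> vq() const {
        std::map<Triple, std::vector<Version>> result;
        for (const auto& [v, ds] : versions)
            for (const Triple& t : ds) result[t].push_back(v);
        return result;
    }
};

int main() {
    ToyArchive a;
    a.versions[2015] = {"Brussels a City", "Ghent a City"};
    a.versions[2016] = {"Brussels a City", "Antwerp a City"};

    for (const auto& [op, t] : a.dm(2015, 2016))
        std::cout << op << " " << t << "\n";   // "+ Antwerp a City", "- Ghent a City"
    for (const auto& [t, vs] : a.vq())
        std::cout << t << " occurs in " << vs.size() << " version(s)\n";
}
```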
Versioning storage strategies
Independent Copies (IC)
Stores every version completely.
+ Fast VM.
- Slow DM and VQ; requires a lot of storage space.
Change-Based (CB)
Stores differences between consecutive versions.
+ Fast DM, less storage space for small changes.
- Slow VM and VQ.
Timestamp-Based (TB)
Stores temporal validity of each triple.
+ Fast VQ, less storage space for overlapping versions.
- Slow VM and DM.
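As a sketch of the CB trade-off (illustrative names again, not OSTRICH code): materializing version v means replaying v deltas on top of the snapshot, so VM cost grows with the length of the delta chain:

```cpp
#include <cstddef>
#include <iostream>
#include <set>
#include <string>
#include <vector>

using Triple = std::string;
struct Delta { std::set<Triple> added, removed; };

struct ChangeBasedArchive {
    std::set<Triple> snapshot;   // full copy of version 0
    std::vector<Delta> deltas;   // deltas[i]: changes from version i to i + 1

    // VM for version v: replay the delta chain on top of the snapshot.
    // Cost grows with v, which is exactly the long-delta-chain problem.
    std::set<Triple> materialize(std::size_t v) const {
        std::set<Triple> result = snapshot;
        for (std::size_t i = 0; i < v && i < deltas.size(); ++i) {
            for (const Triple& t : deltas[i].removed) result.erase(t);
            for (const Triple& t : deltas[i].added) result.insert(t);
        }
        return result;
    }
};

int main() {
    ChangeBasedArchive a;
    a.snapshot = {"Brussels a City", "Ghent a City"};
    a.deltas.push_back({{"Antwerp a City"}, {"Ghent a City"}});     // v0 -> v1
    std::cout << a.materialize(1).count("Antwerp a City") << "\n";  // 1: present in v1
}
```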
Hybrid storage has been done before
Hybrid of Independent Copies (IC) and Change-Based (CB) storage
Problem: for long delta chains, version materialization becomes increasingly expensive.
→ We make deltas relative to the snapshot, which makes version materialization fast for any version.
Problem: changes accumulate across deltas, which can lead to a lot of redundancy.
→ We combine all deltas in a single timestamp-based store, which compresses redundant triples and annotates them with their versions.
t1 ∈ {1, 2}, t2 ∈ {3, 5}, t3 ∈ {1, 2, 4}, ...
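A minimal sketch of this hybrid layout, assuming a single snapshot at version 0 (HybridArchive and its members are illustrative, not OSTRICH's actual data structures): each changed triple is stored once, with per-version addition/deletion annotations relative to the snapshot, so a membership check for any version needs no delta-chain replay:

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

using Triple = std::string;
using Version = int;

struct HybridArchive {
    std::set<Triple> snapshot;  // full copy of version 0
    // Aggregated timestamp-based delta store: each changed triple appears once,
    // annotated per version with whether it was added (true) or deleted (false).
    // All deltas are relative to the snapshot, not to the previous version.
    std::map<Triple, std::map<Version, bool>> changes;

    // VM membership test for any version v: one snapshot lookup plus one
    // annotation lookup; cost is independent of how far v is from the snapshot.
    bool contains(const Triple& t, Version v) const {
        bool present = snapshot.count(t) > 0;
        auto it = changes.find(t);
        if (it == changes.end()) return present;
        for (const auto& [ver, added] : it->second)  // apply latest change <= v
            if (ver <= v) present = added;
        return present;
    }
};

int main() {
    HybridArchive a;
    a.snapshot = {"Ghent a City"};
    a.changes["Antwerp a City"] = {{2, true}};            // added in version 2
    a.changes["Ghent a City"] = {{3, false}, {5, true}};  // deleted in v3, re-added in v5
    std::cout << a.contains("Ghent a City", 4) << "\n";   // 0: deleted as of version 3
    std::cout << a.contains("Ghent a City", 5) << "\n";   // 1: re-added in version 5
}
```

In OSTRICH itself the snapshot is an HDT file and the annotated deltas live in B+Tree indexes, but the idea of version lookups whose cost does not depend on the distance to the snapshot is the same.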
Streaming, offset-enabled querying
- Triple pattern queries: the foundational elements of SPARQL engines.
- Streaming: triple-by-triple processing of results is more memory-efficient.
- Offsets: reduce processing time if only a subset of the results is needed (e.g., pagination).
- Algorithms for each query atom (VM, DM, VQ): explained in the paper; a sketch of the streaming iterator model follows below.
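A minimal sketch of streaming, offset-enabled iteration (illustrative; a real store like OSTRICH resolves offsets inside its indexes rather than over an in-memory vector): results are produced one by one through next(), and the offset is applied before iteration starts, so skipped results are never materialized:

```cpp
#include <cstddef>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

using Triple = std::string;

// Streaming iterator over triple pattern results: triples are produced one by
// one instead of as a full in-memory list, and an initial offset is skipped.
class TripleIterator {
    const std::vector<Triple>& source_;  // stands in for an index scan
    std::size_t pos_;
public:
    TripleIterator(const std::vector<Triple>& src, std::size_t offset)
        : source_(src), pos_(offset) {}  // a real store would seek in the index
    std::optional<Triple> next() {
        if (pos_ >= source_.size()) return std::nullopt;
        return source_[pos_++];
    }
};

int main() {
    std::vector<Triple> results = {"t1", "t2", "t3", "t4"};
    TripleIterator it(results, /*offset=*/2);            // e.g., page 2 with page size 2
    while (auto t = it.next()) std::cout << *t << "\n";  // prints t3, t4
}
```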
Implementation: OSTRICH
- Offset-enabled STore for TRIple CHangesets
- Implemented in C/C++
- Open-source: https://github.com/rdfostrich/
- Uses HDT for snapshots, and Kyoto Cabinet (B+Tree) for deltas
Fernández et al. 2018
Evaluation using BEAR
- BEnchmark for RDF ARchives
- Based on three real-world datasets from different domains
- Triple pattern queries for the different query atoms
- Baseline implementations of different storage strategies using HDT and Jena
Experiments
We highlight some results; more can be found in the paper.
Storage size and time scale linearly for a few large versions
BEAR-A: 10 versions, 30M–66M triples per version.
Storage size and time mostly scale linearly for many small versions
BEAR-B-hourly: 1299 versions, ~48K triples per version.
→ The delta chain becomes too long after version 1100.
Version Materialization times are constant with respect to version
Median BEAR-B-hourly VM query evaluation times for all triple patterns, over all versions.
Delta Materialization times are constant with respect to version
Median BEAR-B-hourly DM query evaluation times for all triple patterns, over all versions.
Version Queries are always fast
Median BEAR-B-hourly VQ query evaluation times for all triple patterns.
Conclusions
- Trade-off between different storage strategies: triple pattern query times of 1 ms or less for all query atoms
- Offset-enabled query algorithms: enable efficient pagination of results
- OSTRICH as an open-source implementation: reusable in C/C++, including JavaScript bindings for Node.js