Storing and Querying Evolving Knowledge Graphs on the Web

Ruben Taelman

IDLab, Ghent University — imec, 9 January 2020

Storing and Querying Evolving Knowledge Graphs on the Web

Advisors: prof. dr. ir. Ruben Verborgh, dr. Miel Vander Sande

Ghent University – imec – IDLab, Belgium

A need for Evolving Knowledge Graphs

How to store and query evolving knowledge graphs on the Web?

I focus on four challenges

Generating Synthetic Evolving Data

Contributions:

Storing Evolving Data

Contributions:

Querying a heterogeneous Web

Contributions:

Querying Evolving Data

Contributions:

Overview

How to store and query evolving knowledge graphs on the Web?

Publishing and querying of evolving knowledge on the Web.

Backup slides

0.1. Issues with quality of manuscript

Typesetting, references, font spacing, positioning of titles, code listings cut off, ...

ScholarMarkdown

0.2. Relationship to other papers

Chronological overview of my main publications

0.3. Open challenges

0.4. Research Answer

How to store and query evolving knowledge graphs on the Web?

→ Decentralized publishing and querying of evolving knowledge at a low cost.

1.1. Mention of sensors in motivation may be misleading

In hindsight, I agree.
However, it is still relevant because:

2.1. Can the approach be generalized?

Yes, thanks to the modularity of the generator phases (region, stops, edges, routes, trips).

Generalization is possible on two levels:

2.2. Is the generator deterministic?

Yes, by using a fixed random seed.
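As an illustration only (not PoDiGG's actual code), a minimal TypeScript sketch of the idea: all randomness is drawn from one seeded pseudo-random number generator, so re-running the generator with the same seed reproduces exactly the same data. The makeRandom and generateStops names are assumptions for this sketch.

// Minimal sketch: deterministic generation through a single seeded PRNG (an LCG).
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    // Numerical Recipes LCG constants; 32-bit arithmetic via Math.imul.
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Hypothetical generator phase: place `count` stops in a unit region.
function generateStops(random: () => number, count: number): { x: number; y: number }[] {
  const stops: { x: number; y: number }[] = [];
  for (let i = 0; i < count; i++) {
    stops.push({ x: random(), y: random() });
  }
  return stops;
}

// Same seed → identical output, so generated benchmarks are reproducible.
const runA = generateStops(makeRandom(42), 5);
const runB = generateStops(makeRandom(42), 5);
console.log(JSON.stringify(runA) === JSON.stringify(runB)); // true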

2.3. Value of coherence evaluation

Distance functions are indeed more valuable, but coherence is still important.
Most existing benchmarks are too structured:

Duan et al. "Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets"
My goal: proving that PoDiGG does not suffer from this problem.

2.4.1. Equation 5: Normalization is surprising


Equation 5 (page 48): generic function for calculating the distance between two sets of elements.

2.4.2. Equation 5: Side-effects of closest element


Equation 4 (page 48): function to determine the closest element (used in Eq. 5).
d(red, black) < d(green, black), but green is intuitively more similar to black.
Possible solution: normalize points relative to a "central point" (sketched below).
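The thesis' exact equations are not reproduced here; the following TypeScript sketch is an assumption that only illustrates the discussed idea, not Equation 4/5 themselves: the distance between two sets is built from each element's distance to its closest element in the other set, and normalizing both sets to their "central point" (centroid) before comparing removes the translation effect in the red/green/black example above.

type Point = { x: number; y: number };

const euclidean = (a: Point, b: Point): number =>
  Math.hypot(a.x - b.x, a.y - b.y);

// Closest element of `set` to `p` (rough analogue of the Eq. 4 idea).
function closest(p: Point, set: Point[]): Point {
  return set.reduce((best, cur) =>
    euclidean(p, cur) < euclidean(p, best) ? cur : best);
}

// Distance between two sets via closest elements (rough analogue of the Eq. 5 idea),
// averaged over the elements of the first set.
function setDistance(a: Point[], b: Point[]): number {
  const total = a.reduce((sum, p) => sum + euclidean(p, closest(p, b)), 0);
  return total / a.length;
}

// Possible fix from the slide: translate each set so its centroid is the origin,
// so a shifted but similarly shaped set is no longer penalised.
function centerOnCentroid(set: Point[]): Point[] {
  const cx = set.reduce((s, p) => s + p.x, 0) / set.length;
  const cy = set.reduce((s, p) => s + p.y, 0) / set.length;
  return set.map((p) => ({ x: p.x - cx, y: p.y - cy }));
}

// Usage: setDistance(centerOnCentroid(generated), centerOnCentroid(reference))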

2.7. Better than random not sufficient to claim realism

3.2. What is stored?

3.3. Why use OSP index for S?O queries?

No filtering is required in the OSP index, in contrast to the SPO index.

Query: S2 ? O2 (subject S2, any predicate, object O2)

OSP index (matches form one contiguous range):
O1 S1 P1
O1 S2 P2
O2 S2 P1  ← match
O2 S2 P2  ← match
O3 S1 P1
O3 S3 P3

SPO index (the whole S2 range must additionally be filtered on O2):
S1 P1 O1
S1 P1 O3
S2 P1 O2  ← match
S2 P2 O1
S2 P2 O2  ← match
S3 P3 O3
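A minimal TypeScript sketch (not OSTRICH's actual B+tree code) of why the OSP order helps here: with triples sorted in OSP order, all matches of an S?O pattern share the (O, S) prefix and form one contiguous range, whereas in SPO order only the S prefix is shared and every triple in that range needs an extra check on O. The array-based indexes and linear findIndex stand in for B+tree prefix seeks.

type Triple = { s: string; p: string; o: string };

// The example triples above, sorted by (O, S, P) and by (S, P, O).
const ospIndex: Triple[] = [
  { o: 'O1', s: 'S1', p: 'P1' }, { o: 'O1', s: 'S2', p: 'P2' },
  { o: 'O2', s: 'S2', p: 'P1' }, { o: 'O2', s: 'S2', p: 'P2' },
  { o: 'O3', s: 'S1', p: 'P1' }, { o: 'O3', s: 'S3', p: 'P3' },
];
const spoIndex: Triple[] = [
  { s: 'S1', p: 'P1', o: 'O1' }, { s: 'S1', p: 'P1', o: 'O3' },
  { s: 'S2', p: 'P1', o: 'O2' }, { s: 'S2', p: 'P2', o: 'O1' },
  { s: 'S2', p: 'P2', o: 'O2' }, { s: 'S3', p: 'P3', o: 'O3' },
];

// OSP: the matches for "S2 ? O2" are one contiguous (O2, S2, *) range; no filtering.
function lookupOsp(s: string, o: string): Triple[] {
  const start = ospIndex.findIndex((t) => t.o === o && t.s === s); // B+tree seek in reality
  if (start < 0) return [];
  const result: Triple[] = [];
  for (let i = start; i < ospIndex.length && ospIndex[i].o === o && ospIndex[i].s === s; i++) {
    result.push(ospIndex[i]);
  }
  return result;
}

// SPO: only S is a usable prefix; each triple in the S2 range must still be filtered on O.
function lookupSpo(s: string, o: string): Triple[] {
  const start = spoIndex.findIndex((t) => t.s === s); // B+tree seek in reality
  if (start < 0) return [];
  const result: Triple[] = [];
  for (let i = start; i < spoIndex.length && spoIndex[i].s === s; i++) {
    if (spoIndex[i].o === o) result.push(spoIndex[i]); // extra filter step
  }
  return result;
}

console.log(lookupOsp('S2', 'O2')); // both matches, read as one range
console.log(lookupSpo('S2', 'O2')); // same matches, but after filtering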

3.4. Relative position definition

Assumption: triples are comparable (totally ordered) in the context of an index (SPO, POS, OSP).
T: the set of all possible triples
t: a triple (t ∈ T)
D: a delta (D ⊆ T)
p: a triple pattern (a function T → {true, false})

relative_position(t, D, p) = |{ t' | t' ∈ D ∧ p(t') ∧ t' < t }|
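A direct TypeScript transcription of this definition (a sketch; in the storage approach these positions are precomputed during ingestion rather than counted at query time): the relative position of t is the number of delta triples that match the pattern and sort before t in the chosen index order.

type Triple = { s: string; p: string; o: string };
type TriplePattern = (t: Triple) => boolean;

// Total order on triples for one index order; SPO is assumed here for the sketch.
function compareSpo(a: Triple, b: Triple): number {
  return a.s.localeCompare(b.s) || a.p.localeCompare(b.p) || a.o.localeCompare(b.o);
}

// relative_position(t, D, p) = |{ t' | t' ∈ D ∧ p(t') ∧ t' < t }|
function relativePosition(t: Triple, delta: Triple[], pattern: TriplePattern): number {
  return delta.filter((other) => pattern(other) && compareSpo(other, t) < 0).length;
}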
      

3.5. Why are local changes needed?

Required for VM queries and during ingestion.
View on one triple in a dataset:
Version   Type       Local change?
0         Snapshot   /
1         -          F
2         +          T
3         -          F

Queries:

3.6. Local change definition

The definition on page 27 was incomplete (the example in Table 9, page 76, is correct).

Incomplete (page 27): triples that revert a previous addition or deletion within the same delta chain are called local changes.
Corrected: triples that revert a previous non-local-change addition or deletion within the same delta chain are called local changes.
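A small TypeScript sketch of the corrected definition (an illustration, not OSTRICH's ingestion code; only the +/- chain of a single triple is considered): a change is a local change iff it reverts the most recent change in the chain that was itself not a local change.

// '+' = addition, '-' = deletion of one specific triple in consecutive deltas.
type Change = '+' | '-';

// Returns, per change, whether it is a local change under the corrected definition.
function markLocalChanges(changes: Change[]): boolean[] {
  const flags: boolean[] = [];
  let lastNonLocal: Change | undefined;
  for (const change of changes) {
    // A change is local iff it reverts the last non-local change.
    const isLocal = lastNonLocal !== undefined && change !== lastNonLocal;
    flags.push(isLocal);
    if (!isLocal) {
      lastNonLocal = change;
    }
  }
  return flags;
}

// The example from section 3.5: versions 1..3 after the snapshot are -, +, -.
console.log(markLocalChanges(['-', '+', '-'])); // [false, true, false]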

Implementation

3.8. Proof refers to functions, not defined

Lines 13, 14, 15, 16, 2, 3, 11, 13, 20, 21, 22 (in order of mentions)

3.9. Complexity of streaming ingestion

S: last snapshot
A: additions to insert
D: deletions to insert

O(|S| + |A| + |D|)

Pseudocode is given in the appendix (a rough sketch follows below).
The increase in Fig. 19 (page 89) is caused by the size of the versions (~30M triples per version).
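A rough TypeScript sketch of why the cost is linear (an illustration of the idea, not the actual OSTRICH ingestion algorithm or its delta-based storage layout): if the last snapshot S and the sorted additions A and deletions D are streamed in the same triple order, one merge pass over the three streams suffices, giving O(|S| + |A| + |D|).

type Triple = { s: string; p: string; o: string };

const compare = (a: Triple, b: Triple): number =>
  a.s.localeCompare(b.s) || a.p.localeCompare(b.p) || a.o.localeCompare(b.o);

// One merge pass over three sorted streams: O(|S| + |A| + |D|).
// Yields the triples of the new version (snapshot + additions - deletions).
function* mergeIngest(
  snapshot: Iterable<Triple>, // S: last snapshot, sorted
  additions: Triple[],        // A: additions to insert, sorted
  deletions: Triple[],        // D: deletions to insert, sorted
): Generator<Triple> {
  let ai = 0;
  let di = 0;
  for (const t of snapshot) {
    // Emit additions that sort before the current snapshot triple.
    while (ai < additions.length && compare(additions[ai], t) < 0) {
      yield additions[ai++];
    }
    // Skip deletions that (unexpectedly) do not occur in the snapshot.
    while (di < deletions.length && compare(deletions[di], t) < 0) {
      di++;
    }
    // Drop the snapshot triple if it is deleted in this delta.
    if (di < deletions.length && compare(deletions[di], t) === 0) {
      di++;
      continue;
    }
    yield t;
  }
  // Emit any additions remaining after the end of the snapshot.
  while (ai < additions.length) {
    yield additions[ai++];
  }
}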

3.10. Results: median over how many queries?

Two levels of result summarisation within each query atom (VM, DM, VQ):
  1. Within each triple pattern category (S??, S?O, ...), with ~50 queries each:
     normal distribution → average (as done by BEAR);
     produces the separate figures in the appendix.
  2. Across the triple pattern categories:
     non-normal distribution → median;
     produces the 3 overview figures.
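As a small illustration of the two levels (a sketch; the concrete categories, query counts, and figures come from the BEAR setup): first average the ~50 query times within each triple pattern category, then take the median across the per-category averages.

// durations per triple pattern category, e.g. { 'S??': [...], 'S?O': [...] } in ms.
type ResultsPerCategory = Record<string, number[]>;

const average = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

const median = (xs: number[]): number => {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
};

// Level 1: average within each category; level 2: median across the categories.
function summarise(results: ResultsPerCategory): number {
  const perCategory = Object.values(results).map(average);
  return median(perCategory);
}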

3.12. Statistical test for hypothesis 3

3.13. Hypothesis 3: assumptions

3.14. Offset for DM not explained

3.15. How is Table 9 stored in a B+tree?

3.16. Why only a single delta chain?

4.2. Comunica configs

4.4. Federation

4.6. Comunica adoption

5.1.1. TPF-QS Formalization

"On the Semantics of TPF-QS towards Publishing and Querying RDF Streams at Web-scale"

5.1.2. TPF-QS Formalization: Comparison

5.7. Assumptions and use cases

Assumptions: Use cases: Out of scope: