An Introduction to Querying over Knowledge Graphs on the Web

Ruben Taelman

Ghent University – imec – IDLab, Belgium

Storing data as Knowledge Graphs

Graph-based data model
When data does not fit into a fixed relational model
Based on Semantic Web and Linked Data technologies
Interlinking data across multiple data sources

Knowledge Graphs
Decentralized Querying
Examples

1989, CERN Switzerland

Tim Berners-Lee
inventor of the World Wide Web

The Web's foundational ideas

Data is linked to each other
Across heterogeneous systems (decentralized)
Browser programs
Visualise data and help users navigate the Web
Everyone can read data
Using Web browsers
Open standards
Anyone can implement tools (browsers, ...) on top of it

1990-... World-wide adoption

Not just for researchers anymore

The Web is a global information space

a.k.a. The World Wide Web (WWW)

Mostly used by humans through Web browsers

Web is focused on humans

Web pages show information

Visualized using Web browsers
Clicking on links

To discover new information
Search information

Using search engines such as Google, Bing, ...

Achieving tasks requires manual effort

Will it rain next week?
1. Find a weather prediction website
2. Select your location
3. Navigate to next week
Book a trip for a group of people
- Comparing agendas
- Comparing interests: nature, museums, ...
- Regional temperatures
- Geopolitical status
- ...

What if intelligent agents could do these tasks for us?

I Have a Dream

A Web where intelligent agents are able to achieve our day-to-day tasks.
2001

The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.

The dream: Getting there step by step

2015: Agents are being introduced that can perform simple tasks

Google Assistant, Alexa, Siri, ...

How do these agents work?

→ Execute Queries over Knowledge Graphs

Query = Structured Question
Knowledge Graph = Structured information

→ Structured questions over structured information

A Knowledge Graph is a collection of structured information

Semantic Web technologies

RDF
Standard for representing and exchanging graph data
Knowledge Graphs are collections of such graph data
SPARQL
Standard for querying over RDF data

Generative AI models

2022: LLMs offer human-like conversations based on crawled Web data

Differences from Knowledge Graphs:

Adaptive understanding of unstructured data
Inconsistent answers and hallucinations
Black box

RDF enables data exchange on the Web

RDF
Resource Description Framework
RDF 1.0
Recommended by W3C since 1999
RDF 1.1
Recommended by W3C since 2014
RDF 1.2
Upcoming in 2025

Facts are represented as RDF triples

Multiple RDF triples form RDF datasets

:Alice :knows :Bob .
:Alice :knows :Carol .
:Alice :name "Alice" .
:Bob :name "Bob" .
:Bob :knows :Carol .
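As a rough illustration, the five triples above can be held in memory as a set of (subject, predicate, object) tuples. The Python representation and the `objects` helper are assumptions made for this sketch; they are not part of RDF itself.

```python
# A minimal sketch: an RDF graph as a set of (subject, predicate, object) tuples.
# IRIs are written as prefixed strings; literals keep their quotes to tell them apart.
triples = {
    (":Alice", ":knows", ":Bob"),
    (":Alice", ":knows", ":Carol"),
    (":Alice", ":name", '"Alice"'),
    (":Bob", ":name", '"Bob"'),
    (":Bob", ":knows", ":Carol"),
}


def objects(graph, subject, predicate):
    """Return, sorted, all objects of triples with the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# Who does Alice know?
print(objects(triples, ":Alice", ":knows"))
```

Because the graph is a set, adding the same triple twice has no effect, which mirrors how an RDF graph is defined as a set of triples.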

RDF Resources are identified by IRIs

https://example.org/Alice

IRI
Internationalized Resource Identifier
Global identifiers
Can be used by different data sources → interlinking!
IRIs can be URLs
A URL is an IRI that can be dereferenced (e.g. looked up in Web browsers).
Allows additional information to be looked up for a resource.

Multiple syntaxes exist for RDF

Turtle

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

dbr:Alice foaf:knows dbr:Bob .
dbr:Alice foaf:name "Alice" .

JSON-LD

{
    "@context": {
        "dbr": "http://dbpedia.org/resource/",
        "foaf": "http://xmlns.com/foaf/0.1/"
    },
    "@id": "dbr:Alice",
    "foaf:knows": { "@id": "dbr:Bob" },
    "foaf:name": "Alice"
}
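Both syntaxes describe the same triples once prefixes are expanded to full IRIs. The sketch below shows that expansion for the JSON-LD document using only the Python standard library. It is a toy converter written for this example, not a conforming JSON-LD processor: it handles one flat node, string literals, and links written as `{"@id": ...}` objects (the form used here so that the link to Bob is read as an IRI rather than a string). The helper names `expand` and `to_triples` are hypothetical.

```python
import json

doc = json.loads("""
{
    "@context": {
        "dbr": "http://dbpedia.org/resource/",
        "foaf": "http://xmlns.com/foaf/0.1/"
    },
    "@id": "dbr:Alice",
    "foaf:knows": { "@id": "dbr:Bob" },
    "foaf:name": "Alice"
}
""")


def expand(term, context):
    """Expand a compact IRI such as 'foaf:name' using the prefix map in context."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term  # unknown prefix or already a full IRI: leave untouched


def to_triples(node):
    """Toy conversion of one flat JSON-LD node into (subject, predicate, object) tuples."""
    context = node["@context"]
    subject = expand(node["@id"], context)
    result = []
    for key, value in node.items():
        if key.startswith("@"):
            continue  # skip keywords such as @context and @id
        if isinstance(value, dict) and "@id" in value:
            obj = expand(value["@id"], context)  # reference to another resource
        else:
            obj = '"%s"' % value  # plain string literal
        result.append((subject, expand(key, context), obj))
    return result


for triple in to_triples(doc):
    print(triple)
```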

SPARQL querying over RDF datasets

SPARQL: language to read and update RDF datasets via declarative queries.
Different query forms and update operations:
- SELECT: selecting values in tabular form → focus of this presentation
- CONSTRUCT: construct new triples
- ASK: check if data exists
- DESCRIBE: describe a given resource
- INSERT: insert new triples (update operation)
- DELETE: delete existing triples (update operation)

Specification: https://www.w3.org/TR/sparql-query/

Find all artists born in York

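PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?name ?deathDate WHERE {
  ?person a dbpedia-owl:Artist;
          rdfs:label ?name;
          dbpedia-owl:birthPlace [ rdfs:label "York"@en ].
  FILTER LANGMATCHES(LANG(?name), "EN")
  OPTIONAL { ?person dbpprop:dateOfDeath ?deathDate. }
}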

name                     deathDate
Albert Joseph Moore
Charles Francis Hansom   1888
David Reed (comedian)
Dustin Gee
E Ridsdale Tate          1922

How do query engines process a query?

RDF dataset + SPARQL query
↓
SPARQL query processing
↓
query results

Basic Graph Patterns enable graph pattern matching

Basic Graph Pattern (BGP)
A collection of triple patterns.

    ?person a dbpedia-owl:Artist.
    ?person rdfs:label ?name.
    ?person dbpedia-owl:birthPlace ?birthPlace.

Triple Pattern
A triple in which any component may be a variable.
Variables start with ?, followed by a label. (e.g. ?name)
More complex operators are possible
OPTIONAL, UNION, FILTER, ...
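To make graph pattern matching concrete, here is a minimal sketch of BGP evaluation over an in-memory list of triples. It is a naive nested-loop matcher written for illustration, not how any particular engine implements it. The `ex:` resources are made-up sample data, and `rdf:type` is spelled out where SPARQL allows the shorthand `a`.

```python
def match_pattern(pattern, triple, binding):
    """Try to extend binding so that pattern matches triple; return None on mismatch."""
    binding = dict(binding)  # copy, so a failed match leaves the input untouched
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it agrees with an earlier binding.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None  # a constant that does not match
    return binding


def evaluate_bgp(patterns, graph):
    """Evaluate triple patterns one by one, joining on shared variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


# Hypothetical sample data.
graph = [
    ("ex:Moore", "rdf:type", "dbpedia-owl:Artist"),
    ("ex:Moore", "rdfs:label", '"Albert Joseph Moore"'),
    ("ex:Moore", "dbpedia-owl:birthPlace", "ex:York"),
    ("ex:York", "rdfs:label", '"York"'),
]
bgp = [
    ("?person", "rdf:type", "dbpedia-owl:Artist"),
    ("?person", "rdfs:label", "?name"),
    ("?person", "dbpedia-owl:birthPlace", "?birthPlace"),
]
print(evaluate_bgp(bgp, graph))
```

Each solution is a mapping from variable names to values; the label of `ex:York` is not returned as a `?name` because `?person` is already bound to `ex:Moore` by the first pattern.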

Steps in SPARQL query processing

1. Parsing
Transform a SPARQL query string into an algebra expression
2. Optimization
Transform algebra expression into a query plan
3. Evaluation
Execute the query plan to obtain query results
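The three steps can be sketched as a toy pipeline. Everything below is a simplification assumed for illustration: the parser accepts only a `SELECT ... WHERE { ... }` string whose triple patterns are whitespace-separated terms ending in ` .`, the optimizer uses a crude heuristic (patterns with fewer variables first), and the evaluator is a nested-loop join. Real engines use a full grammar, a richer algebra, and cost-based planning.

```python
import re


def parse(query):
    """1. Parsing: turn a tiny SELECT query string into an algebra-like dict."""
    m = re.fullmatch(r"\s*SELECT\s+(.*?)\s+WHERE\s*\{(.*)\}\s*", query, re.S)
    variables = m.group(1).split()
    patterns = [tuple(p.split()) for p in m.group(2).split(" .") if p.strip()]
    return {"project": variables, "bgp": patterns}


def optimize(algebra):
    """2. Optimization: produce a plan by ordering patterns, fewest variables first."""
    count_vars = lambda p: sum(term.startswith("?") for term in p)
    return {**algebra, "bgp": sorted(algebra["bgp"], key=count_vars)}


def evaluate(plan, graph):
    """3. Evaluation: nested-loop join over the ordered patterns, then projection."""
    bindings = [{}]
    for pattern in plan["bgp"]:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                candidate = dict(binding)
                if all(
                    candidate.setdefault(t, v) == v if t.startswith("?") else t == v
                    for t, v in zip(pattern, triple)
                ):
                    next_bindings.append(candidate)
        bindings = next_bindings
    return [{v: b[v] for v in plan["project"]} for b in bindings]


graph = [
    (":Alice", ":knows", ":Bob"),
    (":Alice", ":knows", ":Carol"),
    (":Alice", ":name", '"Alice"'),
    (":Bob", ":name", '"Bob"'),
    (":Bob", ":knows", ":Carol"),
]
query = """
SELECT ?name WHERE {
  ?person :name ?name .
  :Alice :knows ?person .
}
"""
plan = optimize(parse(query))
print(plan["bgp"])
print(evaluate(plan, graph))
```

The optimizer moves `:Alice :knows ?person` to the front, so evaluation starts from Alice's two contacts instead of from every `:name` triple. Only Bob appears in the result because the sample data contains no name for Carol.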

Publishing Knowledge Graphs as SPARQL Endpoints

SPARQL endpoint: API that accepts SPARQL queries, and replies with results.

Most popular way to publish Knowledge Graphs
Alternatives are data dumps and Linked Data Documents
Very powerful
Very complex queries can be formulated with SPARQL
Power comes with a cost
SPARQL endpoints can require very powerful servers
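From a client's point of view, talking to a SPARQL endpoint is an HTTP request that carries the query and a response in a results format. The sketch below builds such a request with the Python standard library and flattens a response in the SPARQL JSON results format. The endpoint URL is a placeholder, the sample response is made up, and the actual network call is left commented out so the example runs offline.

```python
import json
import urllib.parse
import urllib.request


def build_request(endpoint, query):
    """Build an HTTP GET request that passes the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_results(body):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    data = json.loads(body)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in data["results"]["bindings"]
    ]


# Placeholder endpoint; replace with a real one before sending.
request = build_request("https://example.org/sparql", "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
print(request.full_url)

# Sending is left out on purpose:
# with urllib.request.urlopen(request) as response:
#     rows = parse_results(response.read())

# Made-up response body in the SPARQL JSON results format.
sample = '{"head": {"vars": ["name"]}, "results": {"bindings": [{"name": {"type": "literal", "value": "Bob"}}]}}'
print(parse_results(sample))
```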

SPARQL processing over centralized data

Dataset is collocated with query engine
All data is known beforehand
Single dataset
Combining multiple datasets is hard

Centralization not always possible

Private data
Technical and legal reasons
Evolving data
Requires continuous re-indexing
Web scale data
Indexing the whole Web is infeasible (for non-tech-giants)

How to query over decentralized data?

Data and query engine are not collocated
Query engine runs on a separate machine
Not just one dataset
Data is spread over the Web into multiple documents

Approaches for querying over decentralized data

Federated Query Processing
Distributing query execution across known sources
Link Traversal Query Processing
Local query execution over sources that are discovered by following links

Client distributes query over query APIs

Clients perform limited work
Split up the query, distribute it (source selection), and combine results
Servers perform most of the effort
They actually execute the queries, over potentially huge datasets

Federation over SPARQL endpoints

Servers are SPARQL endpoints (most common)
They accept any valid SPARQL query
Client-side source selection
Rewrite query in terms of SERVICE clauses

SELECT ?drug ?title WHERE {
  ?drug db:drugCategory dbc:micronutrient.
  ?drug db:casRegistryNumber ?id.
  ?keggDrug rdf:type kegg:Drug.
  ?keggDrug bio2rdf:xRef ?id.
  ?keggDrug purl:title ?title.
}

SELECT ?drug ?title WHERE {
  SERVICE <http://example.com/drb> {
    ?drug db:drugCategory dbc:micronutrient.
    ?drug db:casRegistryNumber ?id.
  }
  SERVICE <http://example.com/kegg> {
    ?keggDrug rdf:type kegg:Drug.
    ?keggDrug bio2rdf:xRef ?id.
    ?keggDrug purl:title ?title.
  }
}
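The rewriting step can be sketched as follows. Source selection is simulated here with a hypothetical table that lists which vocabulary prefixes each endpoint covers; real federation engines typically decide this by probing endpoints or consulting metadata about them. The function names and the `SOURCES` table are assumptions made for this example.

```python
# Hypothetical description of which vocabulary prefixes each endpoint can answer.
SOURCES = {
    "http://example.com/drb": {"db", "dbc"},
    "http://example.com/kegg": {"kegg", "bio2rdf", "purl"},
}


def select_source(pattern):
    """Pick the first endpoint whose vocabulary covers a constant term of the pattern."""
    prefixes = {term.split(":")[0] for term in pattern if not term.startswith("?")}
    for endpoint, vocabulary in SOURCES.items():
        if prefixes & vocabulary:
            return endpoint
    raise ValueError(f"no source for pattern {pattern}")


def rewrite(variables, patterns):
    """Group patterns per selected endpoint and wrap each group in a SERVICE clause."""
    groups = {}
    for pattern in patterns:
        groups.setdefault(select_source(pattern), []).append(pattern)
    lines = [f"SELECT {' '.join(variables)} WHERE {{"]
    for endpoint, group in groups.items():
        lines.append(f"  SERVICE <{endpoint}> {{")
        lines += [f"    {' '.join(p)}." for p in group]
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


patterns = [
    ("?drug", "db:drugCategory", "dbc:micronutrient"),
    ("?drug", "db:casRegistryNumber", "?id"),
    ("?keggDrug", "rdf:type", "kegg:Drug"),
    ("?keggDrug", "bio2rdf:xRef", "?id"),
    ("?keggDrug", "purl:title", "?title"),
]
print(rewrite(["?drug", "?title"], patterns))
```

The shared variable `?id` appears in both SERVICE clauses, so the engine that executes the rewritten query still has to join the two partial results.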

Federation over heterogeneous sources

Servers are not only SPARQL endpoints
Other types of Linked Data Fragments: TPF, WiseKG, brTPF, ...
Different levels of server expressivity
Clients may have to take up more effort
Executing parts of queries client-side
Trade-off between server and client effort
Low-cost publishing and preventing server availability issues

Limitations of federated querying

All federation members must be known before execution starts
Source selection distributes query across list of sources
No discovery of new sources
Limited scalability in terms of number of endpoints
Current federation techniques scale to the order of 10 sources

Exploit interlinking of documents

Linked Data documents are linked to each other
Following the Linked Data principles
Query engine can follow links
Start from one document, and discover new documents on the fly

Link Traversal-based Query Processing

= Querying by following links between documents

Example: decentralized address book

Example: Find Alice's contact names

SELECT ?name WHERE {
    <https://alice.pods.org/profile#me>
        foaf:knows ?person.
    ?person foaf:name ?name.
}

Query process:

Start from Alice's address book
Follow links to profiles of Bob and Carol
Query over union of all profiles
Find query results: [ { "name": "Bob" }, { "name": "Carol" } ]
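The query process above can be simulated without any network access. In this sketch the Web is a Python dictionary from document URLs to the triples they contain; the URLs for Bob and Carol are made up, and `dereference` stands in for an HTTP GET. The traversal follows every IRI it encounters, which is the simplest possible strategy and is assumed here only for illustration.

```python
from collections import deque

# Simulated Web: document URL -> triples found in that document.
WEB = {
    "https://alice.pods.org/profile": [
        ("https://alice.pods.org/profile#me", "foaf:knows", "https://bob.pods.org/profile#me"),
        ("https://alice.pods.org/profile#me", "foaf:knows", "https://carol.pods.org/profile#me"),
    ],
    "https://bob.pods.org/profile": [
        ("https://bob.pods.org/profile#me", "foaf:name", '"Bob"'),
    ],
    "https://carol.pods.org/profile": [
        ("https://carol.pods.org/profile#me", "foaf:name", '"Carol"'),
    ],
}


def dereference(url):
    """Stand-in for an HTTP GET: return the triples of a document (empty if unknown)."""
    return WEB.get(url, [])


def traverse(seed):
    """Follow every IRI found in fetched documents, breadth-first, collecting all triples."""
    queue, seen, triples = deque([seed]), set(), []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        for triple in dereference(url):
            triples.append(triple)
            for term in triple:
                if term.startswith("http"):
                    queue.append(term.split("#")[0])  # drop the fragment to get the document
    return triples


def contact_names(me, triples):
    """Evaluate: <me> foaf:knows ?person . ?person foaf:name ?name ."""
    persons = [o for s, p, o in triples if s == me and p == "foaf:knows"]
    return [o.strip('"') for s, p, o in triples if s in persons and p == "foaf:name"]


union = traverse("https://alice.pods.org/profile")
print(contact_names("https://alice.pods.org/profile#me", union))
```

Following every link is only workable on a tiny example like this one; on the real Web the number of links grows quickly, so the choice of which links to follow matters a great deal.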

Decentralized querying has open problems

Link Traversal is suitable for querying decentralized environments
Copes with limitations of federated querying
Relatively new research area → many open problems
Main bottleneck is number of links
Future research should focus on guidance towards query-relevant links

Notable SPARQL engines

Amazon Neptune
SPARQL endpoint as a service
Apache Jena
Open-source engine in Java
QLever
Open-source engine in C++
Comunica
Open-source engine in TypeScript, which focuses on decentralized querying

Popular SPARQL endpoints

Uniprot: Protein sequences (210+ billion triples)
https://sparql.uniprot.org/
Wikidata: structured knowledge base of the Wikimedia projects, available as RDF (16+ billion triples)
https://query.wikidata.org/

Example Federated queries

Requires prior knowledge of sources.

Find all proteins linked to arachidonate (CHEBI:32395)
Joins Uniprot and Rhea
Retrieve human enzymes that metabolize sphingolipids and are annotated in ChEMBL
Joins Uniprot, Rhea, and ChEMBL

Example Link Traversal queries

Discover sources on the fly.

Comunica Link Traversal
Queries over real-world and synthetic environments

Conclusions

Knowledge Graphs use (Semantic) Web technologies
RDF, SPARQL, …
Knowledge Graphs can be interlinked
Thanks to RDF's global identifiers
Federated Querying and Link Traversal
Combining data across multiple Knowledge Graphs
Complex queries can be slow
