Biological Knowledge Graphs

Ruben Taelman

Ghent University – imec – IDLab, Belgium

Storing data as Knowledge Graphs

Graph-based data model
When data does not fit into a fixed relational model
Based on Semantic Web and Linked Data technologies
Interlinking data across multiple data sources

World Wide Web
Semantic Web
RDF
SPARQL
Examples

1989, CERN, Switzerland

Tim Berners-Lee
inventor of the World Wide Web

The Web's foundational ideas

Data is interlinked
Across heterogeneous systems (decentralized)
Browser programs
Visualise data and help users navigate the Web
Everyone can read data
Using Web browsers
Open standards
Anyone can implement tools (browsers, ...) on top of it

1990-... World-wide adoption

Not just for researchers anymore

The Web is a global information space

a.k.a. The World Wide Web (WWW)

Mostly used by humans through Web browsers

Web is focused on humans

Web pages show information

Visualized using Web browsers
Clicking on links

To discover new information
Search information

Using search engines such as Google, Bing, ...

Achieving tasks requires manual effort

Will it rain next week?
1. Find a weather prediction website
2. Select your location
3. Navigate to next week
Book a trip for a group of people
- Comparing agendas
- Comparing interests: nature, museums, ...
- Regional temperatures
- Geopolitical status
- ...

What if intelligent agents
could do these tasks for us?

I Have a Dream

A Web where intelligent
agents are able to achieve
our day-to-day tasks.
2001

The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.

The dream: Getting there step by step

2015: Agents are being introduced that can perform simple tasks

Google Assistant, Alexa, Siri, ...

How do these agents work?

→ Execute Queries over Knowledge Graphs

Query = Structured Question
Knowledge Graph = Structured information

→ Structured questions over structured information

A Knowledge Graph is
a collection of structured information

Semantic Web technologies

RDF
Standard for representing and exchanging graph data
Knowledge Graphs are collections of such graph data
SPARQL
Standard for querying over RDF data

Generative AI models

2022: LLMs offer human-like conversations based on crawled Web data

Differences from Knowledge Graphs:

Adaptive understanding of unstructured data
Inconsistent answers and hallucinations
Black box

RDF enables data exchange on the Web

RDF
Resource Description Framework
RDF 1.0
Recommended by W3C since 1999
RDF 1.1
Recommended by W3C since 2014
RDF 1.2
Upcoming in 2025

Facts are represented as RDF triples

Multiple RDF triples form RDF datasets

:Alice :knows :Bob .
:Alice :knows :Carol .
:Alice :name "Alice" .
:Bob :name "Bob" .
:Bob :knows :Carol .
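
As a rough illustration of the triples above, the following Python sketch stores each triple as a (subject, predicate, object) tuple and looks up whom Alice knows. The `objects` helper is a hypothetical name introduced only for this example, and prefixed names are kept as plain strings rather than expanded to full IRIs.

```python
# Each RDF triple is modelled as a (subject, predicate, object) tuple.
# Prefixed names such as ":Alice" are kept as plain strings for brevity.
triples = {
    (":Alice", ":knows", ":Bob"),
    (":Alice", ":knows", ":Carol"),
    (":Alice", ":name", '"Alice"'),
    (":Bob", ":name", '"Bob"'),
    (":Bob", ":knows", ":Carol"),
}


def objects(graph, subject, predicate):
    """Return the sorted objects of all triples with this subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(objects(triples, ":Alice", ":knows"))
```

Because the collection is a set, adding the same triple twice has no effect, which mirrors the set semantics of an RDF graph.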

RDF Resources are identified by IRIs

https://example.org/Alice

IRI
Internationalized Resource Identifier
Global identifiers
Can be used by different data sources → interlinking!
IRIs can be URLs
A URL is an IRI that can be dereferenced (e.g. looked up in Web browsers).
Allows additional information to be looked up for a resource.

Multiple syntaxes exist for RDF

Turtle

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

dbr:Alice foaf:knows dbr:Bob .
dbr:Alice foaf:name "Alice" .

JSON-LD

{
    "@context": {
        "dbr": "http://dbpedia.org/resource/",
        "foaf": "http://xmlns.com/foaf/0.1/"
    },
    "@id": "dbr:Alice",
    "foaf:knows": { "@id": "dbr:Bob" },
    "foaf:name": "Alice"
}
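
Both syntaxes above abbreviate IRIs with prefixes. The sketch below shows, under simplifying assumptions, how a prefixed name such as dbr:Alice could be expanded into a full IRI. The `expand` function is a hypothetical helper for illustration, not part of any RDF library, and it ignores details that real parsers handle (relative IRIs, escaping, the default prefix).

```python
# Prefix map taken from the Turtle and JSON-LD examples.
PREFIXES = {
    "dbr": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dbr:Alice' into a full IRI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + local


# The first Turtle triple, with every term expanded.
triple = tuple(expand(term) for term in ("dbr:Alice", "foaf:knows", "dbr:Bob"))
print(triple)
```

After expansion, the triple is independent of the syntax it was written in, which is what lets different serializations describe the same graph.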

SPARQL querying over RDF datasets

SPARQL: language to read and update RDF datasets via declarative queries.
Different query forms:
- SELECT: selecting values in tabular form → focus of this presentation
- CONSTRUCT: construct new triples
- ASK: check if data exists
- DESCRIBE: describe a given resource
Update operations (defined separately in SPARQL Update):
- INSERT: insert new triples
- DELETE: delete existing triples

Specification: https://www.w3.org/TR/sparql-query/

Find all artists born in York

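PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?name ?deathDate WHERE {
  ?person a dbpedia-owl:Artist;
          rdfs:label ?name;
          dbpedia-owl:birthPlace [ rdfs:label "York"@en ].
  FILTER LANGMATCHES(LANG(?name), "EN")
  OPTIONAL { ?person dbpprop:dateOfDeath ?deathDate. }
}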

name	deathDate
Albert Joseph Moore
Charles Francis Hansom	1888
David Reed (comedian)
Dustin Gee
E Ridsdale Tate	1922

How do query engines process a query?

RDF dataset + SPARQL query
↓
SPARQL query processing
↓
query results

Basic Graph Patterns enable graph pattern matching

Basic Graph Pattern (BGP)
A collection of triple patterns.

    ?person a dbpedia-owl:Artist.
    ?person rdfs:label ?name.
    ?person dbpedia-owl:birthPlace ?birthPlace.

Triple Pattern
A triple in which any component may be a variable.
Variables start with ?, followed by a label. (e.g. ?name)
More complex operators are possible
OPTIONAL, UNION, FILTER, ...
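
To make graph pattern matching more concrete, here is a minimal Python sketch that matches triple patterns against a small in-memory graph and combines them into a BGP. The function names and the sample data (taken from the Alice/Bob triples shown earlier in the slides) are assumptions for illustration. Real engines use indexes rather than scanning every triple.

```python
def is_variable(term):
    """Variables start with '?', as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, mapping):
    """Extend `mapping` so that `pattern` matches `triple`, or return None."""
    extended = dict(mapping)
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            # Bind the variable, or check it against an earlier binding.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def evaluate_bgp(patterns, graph):
    """Match all triple patterns, carrying variable bindings across patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for mapping in solutions:
            for triple in graph:
                extended = match_pattern(pattern, triple, mapping)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


graph = [
    (":Alice", ":knows", ":Bob"),
    (":Alice", ":knows", ":Carol"),
    (":Alice", ":name", '"Alice"'),
    (":Bob", ":name", '"Bob"'),
    (":Bob", ":knows", ":Carol"),
]
bgp = [("?person", ":knows", "?friend"), ("?friend", ":name", "?name")]
print(evaluate_bgp(bgp, graph))
```

Only friends that have a name produce a result here, because every triple pattern in a BGP must match. OPTIONAL relaxes exactly this requirement.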

Query results representation

Solution Mapping
Mapping from a set of variable labels to a set of RDF terms.
Solution Sequence
A list of solution mappings.

1 solution sequence with 3 solution mappings:

name	birthplace
Bob Brockmann	http://dbpedia.org/resource/Louisiana
Bennie Nawahi	http://dbpedia.org/resource/Honolulu
Weird Al Yankovic	http://dbpedia.org/resource/Downey,_California
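
One possible in-memory representation, sketched in Python: each solution mapping is a dictionary from variable name to RDF term, and a solution sequence is a list of such dictionaries. The `to_table` helper is a hypothetical name used only here. It renders the sequence in the tabular form shown above and leaves cells empty for unbound variables, as can happen with OPTIONAL.

```python
# A solution sequence: a list of solution mappings (variable -> RDF term).
solution_sequence = [
    {"name": "Bob Brockmann", "birthplace": "http://dbpedia.org/resource/Louisiana"},
    {"name": "Bennie Nawahi", "birthplace": "http://dbpedia.org/resource/Honolulu"},
    {"name": "Weird Al Yankovic", "birthplace": "http://dbpedia.org/resource/Downey,_California"},
]


def to_table(solutions, variables):
    """Render solution mappings as tab-separated rows, one row per mapping."""
    rows = ["\t".join(variables)]
    for mapping in solutions:
        # Unbound variables become empty cells.
        rows.append("\t".join(mapping.get(v, "") for v in variables))
    return "\n".join(rows)


print(to_table(solution_sequence, ["name", "birthplace"]))
```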

Steps in SPARQL query processing

1. Parsing
Transform a SPARQL query string into an algebra expression
2. Optimization
Transform algebra expression into a query plan
3. Evaluation
Execute the query plan to obtain query results
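
The three steps can be sketched as a toy pipeline in Python. Everything here is a simplifying assumption: the "query language" is just triple patterns separated by " . " rather than real SPARQL, the algebra expression is a tuple, and the optimizer uses one naive heuristic (patterns with fewer variables first). Actual engines are far more elaborate.

```python
def parse(query):
    """Step 1: turn a toy query string into an algebra expression."""
    patterns = [tuple(part.split()) for part in query.split(" . ") if part.strip()]
    for pattern in patterns:
        if len(pattern) != 3:
            raise ValueError(f"not a triple pattern: {pattern!r}")
    return ("bgp", patterns)


def optimize(algebra):
    """Step 2: build a plan, evaluating patterns with fewer variables first."""
    _, patterns = algebra
    return sorted(patterns, key=lambda p: sum(t.startswith("?") for t in p))


def evaluate(plan, graph):
    """Step 3: execute the plan with nested-loop joins over the graph."""
    solutions = [{}]
    for pattern in plan:
        next_solutions = []
        for mapping in solutions:
            for triple in graph:
                candidate = dict(mapping)
                # Variables must bind consistently; constants must be equal.
                matches = all(
                    candidate.setdefault(p, t) == t if p.startswith("?") else p == t
                    for p, t in zip(pattern, triple)
                )
                if matches:
                    next_solutions.append(candidate)
        solutions = next_solutions
    return solutions


graph = [
    (":Alice", ":knows", ":Bob"),
    (":Alice", ":knows", ":Carol"),
    (":Bob", ":name", '"Bob"'),
]
plan = optimize(parse('?person :knows ?friend . ?friend :name "Bob"'))
print(plan)
print(evaluate(plan, graph))
```

Reordering does not change the results, only the amount of intermediate work: starting from the more selective pattern keeps the set of partial solutions small.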

Publishing Knowledge Graphs as SPARQL Endpoints

SPARQL endpoint: API that accepts SPARQL queries, and replies with results.

Most popular way to publish Knowledge Graphs
Alternatives are data dumps and Linked Data Documents
Very powerful
Very complex queries can be formulated with SPARQL
Power comes with a cost
SPARQL endpoints can require very powerful servers
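
As a sketch of what talking to such an API can look like, the Python code below builds an HTTP GET request that carries the query in a `query` URL parameter and asks for results in the SPARQL JSON results format, then flattens a response into dictionaries. The function names are illustrative, and the endpoint URL in the usage comment is an assumption. Check each endpoint's documentation for its exact address and limits.

```python
import json
import urllib.parse
import urllib.request


def build_request(endpoint, query):
    """Build a GET request with the query passed in the `query` URL parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_results(body):
    """Turn a SPARQL JSON results document into a list of {variable: value}."""
    document = json.loads(body)
    return [
        {var: term["value"] for var, term in binding.items()}
        for binding in document["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query over HTTP and return the parsed solution mappings."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return parse_results(response.read().decode("utf-8"))


# Example call (needs network access; the endpoint URL is an assumption):
# rows = run_query("https://sparql.uniprot.org/sparql",
#                  "SELECT ?s WHERE { ?s ?p ?o } LIMIT 3")
```

Adding a LIMIT while exploring keeps the load on public endpoints low, which matters given the server cost noted above.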

Popular biological SPARQL endpoints

UniProt: Protein sequences (210+ billion triples)
https://sparql.uniprot.org/
Rhea: Chemical and transport reactions (5+ million triples)
https://sparql.rhea-db.org/

Federation over multiple SPARQL endpoints

Data across multiple data sources can be joined in a single query

Find all proteins linked to arachidonate (CHEBI:32395)
Joins UniProt and Rhea
Retrieve human enzymes that metabolize sphingolipids and are annotated in ChEMBL
Joins UniProt, Rhea, and ChEMBL
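
Conceptually, a federated join combines solution mappings obtained from different endpoints on the variables they share. The Python sketch below illustrates this with a simple hash join. The identifiers are made-up placeholders, not actual Rhea or UniProt records, and the sketch assumes that all mappings within one sequence bind the same variables.

```python
def join(left, right):
    """Hash join two solution sequences on the variables they have in common."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    # Index the right-hand side by its values for the shared variables.
    index = {}
    for mapping in right:
        index.setdefault(tuple(mapping[v] for v in shared), []).append(mapping)
    joined = []
    for mapping in left:
        for match in index.get(tuple(mapping[v] for v in shared), []):
            joined.append({**mapping, **match})
    return joined


# Placeholder results, as if returned by two different endpoints.
reactions_with_compound = [
    {"reaction": "ex:reaction1"},
    {"reaction": "ex:reaction2"},
]
proteins_with_reaction = [
    {"protein": "ex:proteinA", "reaction": "ex:reaction1"},
    {"protein": "ex:proteinB", "reaction": "ex:reaction3"},
]
print(join(reactions_with_compound, proteins_with_reaction))
```

The join only works because both sources use the same identifier for the same reaction, which is the practical benefit of global IRIs.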

Conclusions

Biological Knowledge Graphs use (Semantic) Web technologies
RDF, SPARQL, …
Knowledge Graphs can be interlinked
Thanks to RDF's global identifiers
Federated querying
Combining data across multiple Knowledge Graphs
Complex queries can be slow
