Ruben Taelman

Hi!
I'm Ruben Taelman, a Web postdoctoral researcher at IDLab,
with a focus on decentralization, Linked Data publishing, and querying.

My goal is to make data accessible for everyone by providing
intelligent infrastructure and algorithms for data publication and retrieval.

To support my research, I develop various open source JavaScript libraries such as streaming RDF parsers and the Comunica engine to query Linked Data on the Web.
As this website itself contains Linked Data, you can query it live with Comunica.
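
As a rough illustration of what such a live query could look like, here is a minimal sketch using Comunica's `QueryEngine` from the `@comunica/query-sparql` package. The helper names (`buildNameQuery`, `queryNames`), the choice of `foaf:name` as the property to look up, and the source URL passed by the caller are assumptions made for this example, not a description of how this website is actually modelled.

```typescript
// Hypothetical helper (not part of Comunica): builds a SPARQL query that
// lists resources with a foaf:name. The choice of foaf:name is an assumption.
function buildNameQuery(limit: number): string {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new RangeError('limit must be a positive integer');
  }
  return [
    'PREFIX foaf: <http://xmlns.com/foaf/0.1/>',
    'SELECT ?person ?name WHERE {',
    '  ?person foaf:name ?name.',
    '}',
    `LIMIT ${limit}`,
  ].join('\n');
}

// Hypothetical helper: runs the query above against a single Web document.
// Requires `npm install @comunica/query-sparql` and network access.
async function queryNames(sourceUrl: string, limit = 10): Promise<string[]> {
  // Loaded lazily so that buildNameQuery stays usable without the dependency.
  const packageName = '@comunica/query-sparql';
  const { QueryEngine } = await import(packageName);
  const engine = new QueryEngine();

  // queryBindings resolves to a stream of solution mappings (bindings).
  const bindingsStream = await engine.queryBindings(buildNameQuery(limit), {
    sources: [sourceUrl],
  });
  const bindings = await bindingsStream.toArray();

  // Each binding maps variable names to RDF terms; keep the literal values.
  return bindings.map((binding: any) => binding.get('name')?.value ?? '');
}
```

Calling `queryNames` with the URL of a page that embeds Linked Data (a placeholder the caller has to supply) would return the matching names as plain strings. The query is kept deliberately small; any SPARQL SELECT query can be passed to `queryBindings` in the same way.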

Have a look at my publications or projects
and contact me if any of those topics interest you.

Latest blog posts

The cost of modularity in SPARQL
How much do modularity and decentralization conflict with centralized speed?
22 April 2025

The JavaScript-based Comunica SPARQL query engine is designed for querying over decentralized environments, e.g. through federation and link traversal. In addition, it makes use of a modular architecture to achieve high flexibility for developers. To determine the impact of these design decisions, and to be able to put the base level performance of Comunica in perspective, I compared its performance to state-of-the-art centralized SPARQL engines in terms of querying over centralized Knowledge Graphs. Results show that Comunica can closely match the performance of state-of-the-art SPARQL query engines, despite having vastly different optimization criteria.
Querying a Decentralized Web
The road towards effective query execution of Decentralized Knowledge Graphs.
21 January 2022

Most of today’s applications are built around the assumption that data is centralized. However, with recent decentralization efforts such as Solid quickly gaining popularity, we may be evolving towards a future where data is massively decentralized. In order to enable applications over decentralized data, there is a need for new querying techniques that can effectively execute over it. This post discusses the impact of decentralization on query execution, and the problems that need to be solved before we can use it effectively in a decentralized Web.

Highlighted publications

Conference Link Traversal Query Processing over Decentralized Environments with Structural Assumptions
1. Ruben Taelman
2. Ruben Verborgh
In Proceedings of the 22nd International Semantic Web Conference, 2023
To counter societal and economic problems caused by data silos on the Web, efforts such as Solid strive to reclaim private data by storing it in permissioned documents over a large number of personal vaults across the Web. Building applications on top of such a decentralized Knowledge Graph involves significant technical challenges: centralized aggregation prior to query processing is excluded for legal reasons, and current federated querying techniques cannot handle this large scale of distribution at the expected performance. We propose an extension to Link Traversal Query Processing (LTQP) that incorporates structural properties within decentralized environments to tackle their unprecedented scale. In this article, we analyze the structural properties of the Solid decentralization ecosystem that are relevant for query execution, we introduce novel LTQP algorithms leveraging these structural properties, and evaluate their effectiveness. Our experiments indicate that these new algorithms obtain accurate results in the order of seconds, which existing algorithms cannot achieve. This work reveals that a traversal-based querying method using structural assumptions can be effective for large-scale decentralization, but that advances are needed in the area of query planning for LTQP to handle more complex queries. These insights open the door to query-driven decentralized applications, in which declarative queries shield developers from the inherent complexity of a decentralized landscape.
Conference Comunica: a Modular SPARQL Query Engine for the Web
In Proceedings of the 17th International Semantic Web Conference, 2018
Query evaluation over Linked Data sources has become a complex story, given the multitude of algorithms and techniques for single- and multi-source querying, as well as the heterogeneity of Web interfaces through which data is published online. Today’s query processors are insufficiently adaptable to test multiple query engine aspects in combination, such as evaluating the performance of a certain join algorithm over a federation of heterogeneous interfaces. The Semantic Web research community is in need of a flexible query engine that allows plugging in new components such as different algorithms, new or experimental SPARQL features, and support for new Web interfaces. We designed and developed a Web-friendly and modular meta query engine called Comunica that meets these specifications. In this article, we introduce this query engine and explain the architectural choices behind its design. We show how its modular nature makes it an ideal research platform for investigating new kinds of Linked Data interfaces and querying algorithms. Comunica facilitates the development, testing, and evaluation of new query processing capabilities, both in isolation and in combination with others.
Journal Triple Storage for Random-Access Versioned Querying of RDF Archives
In Journal of Web Semantics, 2018
When publishing Linked Open Datasets on the Web, most attention is typically directed to their latest version. Nevertheless, useful information is present in or between previous versions. In order to exploit this historical information in dataset analysis, we can maintain history in RDF archives. Existing approaches either require much storage space, or they expose an insufficiently expressive or efficient interface with respect to querying demands. In this article, we introduce an RDF archive indexing technique that is able to store datasets with a low storage overhead, by compressing consecutive versions and adding metadata for reducing lookup times. We introduce algorithms based on this technique for efficiently evaluating queries at a certain version, between any two versions, and for versions. Using the BEAR RDF archiving benchmark, we evaluate our implementation, called OSTRICH. Results show that OSTRICH introduces a new trade-off regarding storage space, ingestion time, and querying efficiency. By processing and storing more metadata during ingestion time, it significantly lowers the average lookup time for versioning queries. OSTRICH performs better for many smaller dataset versions than for few larger dataset versions. Furthermore, it enables efficient offsets in query result streams, which facilitates random access in results. Our storage technique reduces query evaluation time for versioned queries through a preprocessing step during ingestion, which only in some cases increases storage space when compared to other approaches. This allows data owners to store and query multiple versions of their dataset efficiently, lowering the barrier to historical dataset publication and analysis.

More publications

Latest publications

Conference Link Traversal over Decentralised Environments using Restart-Based Query Planning
1. Jonni Hanski
2. Ruben Taelman
3. Ruben Verborgh
In Proceedings of the 25th International Conference on Web Engineering, 2025
With the emergence of decentralisation initiatives to address various issues around regulatory compliance and barriers of entry to data-driven markets, data access abstraction layers in the form of query engines are needed to assist in developing services on top of such environments. Prior work, however, has demonstrated significant network overhead during data retrieval in traversal-based query execution over decentralised Linked Data sources, dwarfing the relative impact of local processing and query optimisations. Certain decentralisation initiatives, however, offer an environment with seemingly sufficient structure to address this, allowing client-side query engines to attain measurable performance improvements through local optimisations. One example is the Solid initiative, offering distributed well-defined user data stores, helping traversal-based query execution approaches in efficiently locating and accessing query-relevant data. Within this work, we demonstrate the impact of client-side adaptive query planning optimisations within structured distributed environments, using the Solid ecosystem as an example, to highlight the potential for tangible improvements in traversal-based execution. Through the implementation of a restart-based query planning technique, we achieve average query execution time reductions of up to 36% compared to a baseline of unchanged query plan execution. Conversely, we also demonstrate how such techniques, when applied without robust cost-benefit estimation, can effectively double the query execution time. This illustrates the importance and potential of client-side techniques even in such distributed environments, and highlights the importance of further investigation in the direction of these techniques.
Journal Algebraic Mapping Operators for Knowledge Graph Generation
1. Sitt Min Oo
2. Ben De Meester
3. Ruben Taelman
4. Pieter Colpaert
In Semantic Web Journal, 2025
Recent advancements in declarative knowledge graph generation have led to the development of multiple mapping languages, their various versions, and different mapping engines that can interpret these languages and execute the mapping process. The field has progressed to the extent that current studies are now more focused on optimizing the knowledge graph generation process. Although different mapping engines share the common functionality of generating knowledge graphs from heterogeneous data sources, sharing the various optimization techniques and features of these engines remains challenging due to the lack of formal operational semantics for the general mapping processes. A set of algebraic mapping operators can provide the necessary operational semantics for general mapping processes, establish a theoretical foundation for mapping languages, and facilitate the introduction and evaluation of a compliant implementation that is capable of interpreting and executing multiple mapping languages. In this paper, we propose such an algebra based on the SPARQL algebra. This allows us to maximally reuse established definitions, and further bridge the world of knowledge graph generation with query engines. To evaluate that our work is not limited to a single specific mapping language, we translated mapping languages ShExML and RML to our mapping plan composed of algebraic mapping operators. The results of our completeness evaluation show that our algebraic operators cover the operational semantics of RML and partially for ShExML. To fully cover ShExML, further analysis into ShExML’s concise operational semantics is needed (e.g. for joining data from two input sources). For performance evaluation, our proof-of-concept algebraic mapping engine has a consistent and low memory usage across the different workloads, and achieved second place in the Knowledge Graph Construction Workshop’s performance challenge.
Algebraic mapping operators decouple mapping engines from the mapping languages, enabling multilingual mapping engines. Furthermore, the mapping plan can incorporate optimization techniques as a separate process from the mapping itself, allowing us to benefit from state-of-the-art mapping process optimizations. The proposed set of algebraic mapping operators will lay the foundation for future studies on the theoretical analysis of complexity and expressiveness of mapping languages, and will provide consistency in the execution semantics of mapping engines. Furthermore, the alignment of our algebra with SPARQL will enable further research into advanced methods such as virtualization, enabling heterogeneous data querying.
Conference Incremunica: Web-based Incremental View Maintenance for SPARQL
In Proceedings of the 22nd Extended Semantic Web Conference, 2025
The dynamic nature of Linked Data from IoT devices, social media, and the financial sector requires efficient mechanisms to keep SPARQL query results up to date, as traditional reevaluation methods are computationally expensive and impractical. Incremental view maintenance (IVM) offers a more efficient alternative by updating query results incrementally. However, existing engines lack support for federated querying, dynamically adding and removing sources during query execution, SPARQL Query Language support, multiple IVM techniques, and client-side execution. In this paper, we present Incremunica, an incremental query engine that addresses these gaps. Incremunica uniquely integrates multiple state-of-the-art incremental operators, allowing it to adapt to different queries and data for optimal performance. In this article, we provide 1) a requirements analysis comparing Incremunica to related work, 2) an explanation of Incremunica’s architecture and features, 3) a performance evaluation showing improvements over reevaluation, and 4) a demonstration of its benefits through a social media watch party application.
Demo SGF: SPARQL Updates over Decentralized Knowledge Graphs without Access Path Dependencies
1. Jitse De Smet
2. Ruben Taelman
In Proceedings of the 22nd Extended Semantic Web Conference: Posters and Demos, 2025
Decentralized data ecosystems, such as the Solid project, empower users to control their data but introduce complexities in data storage and retrieval. Current solutions provide mechanisms for describing data structures but lack sufficient guidance for determining where to create or update resources. To address this challenge, we propose the Storage Guiding Framework (SGF), a framework that enables clients to manage RDF resource storage within Solid pods. This paper introduces SGF, detailing the describing structure and how SGF allows clients to treat Solid pods as RDF collections rather than a collection of unstructured HTTP documents. Our findings show that SGF enhances data accessibility by eliminating the access path data-dependency and providing clear storage strategies. This improvement simplifies client-side data management while maintaining flexibility in data organization.
Poster Optimizing Traversal Queries of Sensor Data Using a Rule-Based Reachability Approach
In Proceedings of the 23rd International Semantic Web Conference: Posters and Demos, 2024
Link Traversal queries face challenges in completeness and long execution time due to the size of the web. Reachability criteria define completeness by restricting the links followed by engines. However, the number of links to dereference remains the bottleneck of the approach. Web environments often have structures exploitable by query engines to prune irrelevant sources. Current criteria rely on using information from the query definition and predefined predicates. However, it is difficult to use them to traverse environments where logical expressions indicate the location of resources. We propose to use a rule-based reachability criterion that captures logical statements expressed in hypermedia descriptions within linked data documents to prune irrelevant sources. In this poster paper, we show how the Comunica link traversal engine is modified to take hints from a hypermedia control vocabulary, to prune irrelevant sources. Our preliminary findings show that by using this strategy, the query engine can significantly reduce the number of HTTP requests and the query execution time without sacrificing the completeness of results. Our work shows that the investigation of hypermedia controls in link pruning of traversal queries is a worthy effort for optimizing web queries of unindexed decentralized databases.

More publications