Hi!
I'm Ruben Taelman, a postdoctoral researcher at IDLab,
with a focus on Web decentralization, Linked Data publishing, and querying.

My goal is to make data accessible for everyone by providing
intelligent infrastructure and algorithms for data publication and retrieval.

To support my research, I develop various open-source JavaScript libraries, such as streaming RDF parsers and the Comunica engine for querying Linked Data on the Web.
As this website itself contains Linked Data, you can query it live with Comunica.

Have a look at my publications or projects
and contact me if any of those topics interest you.

Latest blog posts

  • Querying a Decentralized Web
    The road towards effective query execution of Decentralized Knowledge Graphs.

    Most of today’s applications are built around the assumption that data is centralized. However, with recent decentralization efforts such as Solid quickly gaining popularity, we may be evolving towards a future where data is massively decentralized. To enable applications over decentralized data, we need new querying techniques that can execute effectively over it. This post discusses the impact of decentralization on query execution, and the problems that need to be solved before we can query effectively over a decentralized Web.

  • 5 rules for open source maintenance
    Guidelines for publishing and maintaining open source projects.

    Thanks to continuing innovation in software development tools and services, it has never been easier to start a software project and publish it under an open license. While this has led to a huge arsenal of open-source software projects, the number of high-quality projects worth reusing is of a significantly smaller order of magnitude. Based on personal experience, I provide five guidelines in this post that will help you publish and maintain high-quality open-source software.

More blog posts

Highlighted publications

  1. Conference Link Traversal Query Processing over Decentralized Environments with Structural Assumptions
    In Proceedings of the 22nd International Semantic Web Conference, 2023.
    To counter societal and economic problems caused by data silos on the Web, efforts such as Solid strive to reclaim private data by storing it in permissioned documents over a large number of personal vaults across the Web. Building applications on top of such a decentralized Knowledge Graph involves significant technical challenges: centralized aggregation prior to query processing is excluded for legal reasons, and current federated querying techniques cannot handle this large scale of distribution at the expected performance. We propose an extension to Link Traversal Query Processing (LTQP) that incorporates structural properties within decentralized environments to tackle their unprecedented scale. In this article, we analyze the structural properties of the Solid decentralization ecosystem that are relevant for query execution, we introduce novel LTQP algorithms leveraging these structural properties, and evaluate their effectiveness. Our experiments indicate that these new algorithms obtain accurate results in the order of seconds, which existing algorithms cannot achieve. This work reveals that a traversal-based querying method using structural assumptions can be effective for large-scale decentralization, but that advances are needed in the area of query planning for LTQP to handle more complex queries. These insights open the door to query-driven decentralized applications, in which declarative queries shield developers from the inherent complexity of a decentralized landscape.
  2. Conference Comunica: a Modular SPARQL Query Engine for the Web
    In Proceedings of the 17th International Semantic Web Conference, 2018.
    Query evaluation over Linked Data sources has become a complex story, given the multitude of algorithms and techniques for single- and multi-source querying, as well as the heterogeneity of Web interfaces through which data is published online. Today’s query processors are insufficiently adaptable to test multiple query engine aspects in combination, such as evaluating the performance of a certain join algorithm over a federation of heterogeneous interfaces. The Semantic Web research community is in need of a flexible query engine that allows plugging in new components such as different algorithms, new or experimental SPARQL features, and support for new Web interfaces. We designed and developed a Web-friendly and modular meta query engine called Comunica that meets these specifications. In this article, we introduce this query engine and explain the architectural choices behind its design. We show how its modular nature makes it an ideal research platform for investigating new kinds of Linked Data interfaces and querying algorithms. Comunica facilitates the development, testing, and evaluation of new query processing capabilities, both in isolation and in combination with others.
  3. Journal Triple Storage for Random-Access Versioned Querying of RDF Archives
    In Journal of Web Semantics, 2018.
    When publishing Linked Open Datasets on the Web, most attention is typically directed to their latest version. Nevertheless, useful information is present in or between previous versions. In order to exploit this historical information in dataset analysis, we can maintain history in RDF archives. Existing approaches either require much storage space, or they expose an insufficiently expressive or efficient interface with respect to querying demands. In this article, we introduce an RDF archive indexing technique that is able to store datasets with a low storage overhead, by compressing consecutive versions and adding metadata for reducing lookup times. We introduce algorithms based on this technique for efficiently evaluating queries at a certain version, between any two versions, and for versions. Using the BEAR RDF archiving benchmark, we evaluate our implementation, called OSTRICH. Results show that OSTRICH introduces a new trade-off regarding storage space, ingestion time, and querying efficiency. By processing and storing more metadata during ingestion time, it significantly lowers the average lookup time for versioning queries. OSTRICH performs better for many smaller dataset versions than for few larger dataset versions. Furthermore, it enables efficient offsets in query result streams, which facilitates random access in results. Our storage technique reduces query evaluation time for versioned queries through a preprocessing step during ingestion, which only in some cases increases storage space when compared to other approaches. This allows data owners to store and query multiple versions of their dataset efficiently, lowering the barrier to historical dataset publication and analysis.
More publications

Latest publications

  1. Poster Observations on Bloom Filters for Traversal-Based Query Execution over Solid Pods
    In Proceedings of the 21st Extended Semantic Web Conference: Posters and Demos, 2024.
    Traversal-based query execution enables the resolution of queries over Linked Data documents, using a follow-your-nose approach that locates query-relevant data by following series of links through documents. This traversal, however, incurs an unavoidable overhead in the form of data access costs. By only following links known to be relevant for answering a given query, this overhead could be minimized. Prior work exists in the form of reachability conditions to determine the links to dereference; however, this does not take into consideration the contents behind a given link. Within this work, we have explored the possibility of using Bloom filters to prune query-irrelevant links based on the triple patterns contained within a given query, when performing traversal-based query execution over Solid pods containing simulated social network data as an example use case. Our findings show that, with relatively uniform data across an entire benchmark dataset, this approach fails to effectively filter links, especially when the queries contain triple patterns with low selectivity. Thus, future work should consider the query plan beyond individual patterns, or the structure of the data beyond individual triples, to allow for more effective pruning of links.
  2. Workshop DFDP: A Declarative Form Description Pipeline for Decentralizing Web Forms
    In Proceedings of the 2nd International Workshop on Semantics in Dataspaces, 2024.
    Forms are key to bidirectional communication on the Web: without them, end-users would be unable to place online orders or file support tickets. Organizations often need multiple, highly similar forms, which currently require multiple implementations. Moreover, the data is tightly coupled to the application, restricting the end-user from reusing it with other applications, or storing the data somewhere else. Organizations and end-users have a need for a technique to create forms that are more controllable, reusable, and decentralized. To address this problem, we introduce the Declarative Form Description Pipeline (DFDP) that meets these requirements. DFDP achieves controllability through end-users’ editable declarative form descriptions. Reusability for organizations is ensured through descriptions of the form fields and associated actions. Finally, by leveraging a decentralized environment like Solid, the application is decoupled from the storage, preserving end-user control over their data. In this paper, we introduce and explain how such a declarative form description can be created and used without assumptions about the viewing environment or data storage. We show how separate applications can interoperate and be interchanged by using a description that contains details for form rendering and data submission decisions using a form, policy, and rule ontology. Furthermore, we prove how this approach solves the shortcomings of traditional Web forms. Our proposed pipeline enables organizations to save time by building similar forms without starting from scratch. Similarly, end-users can save time by letting machines prefill the form with existing data. Additionally, DFDP empowers end-users to be in control of the application they use to manage their data in a data store. User study results provide insights to further improve usability by providing automatic suggestions based on field labels entered.
  3. Workshop Requirements and Challenges for Query Execution across Decentralized Environments
    In Companion Proceedings of the ACM Web Conference 2024.
    Due to the economic and societal problems being caused by the Web’s growing centralization, there is an increasing interest in decentralizing data on the Web. This decentralization does however cause a number of technical challenges. If we want to give users in decentralized environments the same level of user experience as they are used to with centralized applications, we need solutions to these challenges. We discuss how query engines can act as a layer between applications on the one hand and decentralized environments on the other. Query engines therefore act as an abstraction layer that hides the complexities of decentralized data management from application developers. In this article, we outline the requirements for query engines over decentralized environments. Furthermore, we show how existing approaches meet these requirements, and which challenges remain. As such, this article offers a high-level overview of a roadmap in the query and decentralization research domains.
  4. Demo Demonstration of Link Traversal SPARQL Query Processing over the Decentralized Solid Environment
    In Proceedings of the 27th International Conference on Extending Database Technology (EDBT), 2024.
    To tackle economic and societal problems originating from the centralization of Knowledge Graphs on the Web, there has been an increasing interest towards decentralizing Knowledge Graphs across a large number of small authoritative sources. In order to effectively build user-facing applications, there is a need for efficient query engines that abstract the complexities around accessing such massively Decentralized Knowledge Graphs (DKGs). As such, we have designed and implemented novel Link Traversal Query Processing algorithms into the Comunica query engine framework that are capable of efficiently evaluating SPARQL queries across DKGs provided by the Solid decentralization initiative. In this article, we demonstrate this query engine through a Web-based interface over which SPARQL queries can be executed over simulated and real-world Solid environments. Our demonstration shows the benefits of a traversal-based approach towards querying DKGs, and uncovers opportunities for further optimizations in future work in terms of both query execution and discovery algorithms.
  5. Poster Personalized Medicine Through Personal Data Pods
    In Proceedings of the 15th International SWAT4HCLS Conference, 2024.
    Medical care is in the process of becoming increasingly personalized through the use of patient genetic information. Concurrently, privacy concerns regarding collection and storage of sensitive personal genome sequence data have necessitated public debate and legal regulation. Here we identify two fundamental challenges associated with privacy and shareability of genomic data storage and propose the use of Solid pods to address these challenges. We establish that personal data pods using Solid specifications can enable decentralized storage, increased patient control over their data, and support of Linked Data formats, which when combined, could offer solutions to challenges currently restricting personalized medicine in practice.
More publications