Chapter 1: Introduction

Research Outline and Questions

In this thesis I investigate Linked Data approaches to implementing FAIR Research Objects and sharing reproducible Computational Workflows.

Following this topic, this section elaborates Research Questions (RQ) on these interlinked ideas:

  1. Realization of the FAIR Digital Object concept using Web technologies.
  2. Implementing FAIR Research Objects with a pragmatic use of Linked Data practices.
  3. Unifying a FAIR Digital Object approach for computational workflows.

Aims for FAIR Digital Objects on the Web (RQ1)

FAIR Digital Object (FDO) has been proposed as a machine-actionable ecosystem of scholarly outputs [Schultes 2019], in theory realizing the FAIR principles [Wilkinson 2016] for a programmable mesh of strongly typed objects that go beyond the open data publication practices that the FAIR guidelines have popularised [Jacobsen 2020].

The FDO specifications [Ivonne 2023] are conceptual in nature; however, most existing Digital Object implementations [Kahn 2006] rely on the DOIP protocol [Reilly 2009] and the Handle system [Sun 2003a], neither of which is particularly familiar to software developers.

The Web, on the other hand, is ubiquitous in modern software engineering [Taivalsaari 2021], used for everything from user interfaces, mobile applications and controlling devices to enterprise cross-platform integrations, backend data processing and microservices, frequently utilising cloud computing, which itself is controlled using Web technologies [Marinescu 2023].

A relevant research question therefore is:

RQ1:

Can the promising FDO concept be realised using existing Web technology, taking into account the lessons learnt from the early Semantic Web developments and more recent Linked Data practices?

I address RQ1 in Chapter 2 and Chapter 3.

Aims for FAIR Research Objects (RQ2)

Research Objects (RO) [Bechhofer 2013] have been proposed as a mechanism to capture a range of diverse scholarly outputs in a single archivable item with detailed metadata. The RO concept was first realised using Semantic Web ontologies [myExperiment 2009, Belhajjame 2015]; these approaches primarily targeted long-term preservation of scientific workflows, utilised by RO as a mechanism to capture computational methods.

The principles of Research Objects extend far beyond workflows; however, early RO implementations mainly focused on capturing software [Goble 2018]. To some extent the lack of wider adoption of ontology-based ROs can also be explained by developers of researcher-facing software and platforms (e.g. repositories, data management systems) being unfamiliar with Semantic Web technology, or worse, having tried it and then struggled [Carriero 2010, Tudorache 2020].

Following the lessons learnt on early RO implementations and the emerging FAIR principles [Wilkinson 2016, Jacobsen 2020], after engagement with the digital libraries community it was agreed to formulate a lightweight approach to Research Objects [Sefton 2018, Ó Carragáin 2019b] for the purpose of data packaging. The updated aims of FAIR Research Objects can be summarised as:

Following from these, the second research question is:

RQ2:

Can a more pragmatic use of Linked Data practices better implement Research Objects for a wider developer audience, by using familiar Web technologies and giving lightweight recommendations?

RQ2 is primarily addressed by Chapter 4.

Aims for FAIR Computational Workflows (RQ3)

A growing part, if not the majority, of scientific analysis is now conducted using software and computational models. The concept of Research Software Engineering [Cohen 2020] has been established along with the new professions of Research Software Engineer [Baxter 2012] and Data Scientist [van der Aalst 2014]: researchers are not just using off-the-shelf software, but also combining computational tools (e.g. pipelines) and writing their own analytical source code (e.g. R scripts) and simulations.

From this observation emerges the need to treat software as FAIR artefacts [Lamprecht 2019], following best practices for documentation [Lee 2018], open development [Prlić 2012] and ensuring research software is robust [Taschuk 2017] so it can be reused and cited as scholarly outputs [Smith 2016]. With this motivation the principles of FAIR Research Software [Katz 2021b] have been established by the RDA FAIR for Research Software (FAIR4RS) Working Group [Barker 2022] and are gradually gaining traction, particularly in the life sciences. A remaining challenge is how citations of research software can be practically propagated following their execution.

While sharing of research software helps distribute the computational methods, the way software is used for a particular analysis requires additional measures to make it reproducible [Stodden 2016, Sandve 2013].

Computational Workflows (or Scientific Workflows) are used to structure and automate data analysis pipelines so they can be scalable, portable and explainable [Atkinson 2017], and as a side-effect of these features can significantly improve reproducibility [Cohen-Boulakia 2017]. There exists, however, a plethora of workflow systems and languages [Leipzig 2021, Amstutz 2021], although recent efforts have created the Common Workflow Language [Crusoe 2022] as a standard representation that is executable by multiple engines.

Notably, workflow definitions themselves can be considered FAIR scholarly outputs [Goble 2020] and are published in repositories including Dockstore [Yuen 2021] and WorkflowHub [Goble 2021]. One could consider computational workflows as a kind of FAIR Research Software [de Visser 2023], but by their nature workflows also encourage the FAIR principles (e.g. preparing a computational tool for a workflow system [Brack 2022a] may include publishing it in a container registry). Workflow systems are also useful for creating and consuming FAIR Digital Objects [Wittenburg 2022a], and in addition workflow systems commonly provide explicit provenance logs of their executions.

Approaches to describing workflow provenance in a machine-readable format were initially diverse [Cruz 2009], and later converged on the use of ontologies [Missier 2010], most notably using W3C PROV-O [Lebo 2013a] but with various specializations [Garijo 2011, Garijo 2012, Missier 2013, Belhajjame 2015, Cuevas-Vicenttín 2016].

The tendency for workflow provenance models to diverge may be down to differences in the execution semantics of different workflow systems, which, if accurately reflected in provenance, mean further differences at this level. This in turn leads to incompatibility of provenance traces and a lack of common tooling. In addition, execution details may obscure the link between the computational processes and the final workflow data outputs, which researchers ultimately care more about than the intricacies of the workflow engine.

The third research question from these considerations is therefore:

RQ3:

Can a FAIR Digital Object approach for computational workflows unify machine-readable descriptions of research software, data and provenance, which can be consistently implemented by developers of different workflow management systems?

The multiple aspects of RQ3, as highlighted in this section, are addressed by Chapter 5.

Main Contributions

The contributions from this PhD include:

These contributions have not evolved in isolation, but in co-development with multiple international collaborations (see Appendix A).

Thesis Overview

Chapter 2 gives the background of the concepts FAIR Digital Object (FDO) and Linked Data, including a brief history of the Semantic Web, followed by a critical analysis of these technologies and their use.

Chapter 3 targets RQ1 and contributes a framework-based evaluation of Linked Data and FDO as possible architectures for implementing a distributed object system for the purpose of FAIR data publishing. The discussion in this chapter considers how the two approaches can benefit from each other’s strengths.

Chapter 4 addresses RQ2 by introducing the contribution of RO-Crate – a pragmatic data packaging mechanism using Linked Data standards to implement FDO and be extensible for domain-specific metadata.

Chapter 5 considers RQ3 by exploring the relationship between Computational Workflows and FAIR practices using RO-Crate and FDO, with use cases from molecular dynamics and specimen digitization. The contribution of the Workflow Run Crate profiles is presented as an interoperable way to capture and publish workflow execution provenance.

Chapter 6 summarises and discusses the contributions from this thesis, reflects on later third-party developments and concludes by evaluating the research questions.

Origins

Chapter 2 and section 3.1 are based on the journal article Evaluating FAIR Digital Object and Linked Data as distributed object systems [Soiland-Reyes 2023c] (see appendices A.4.1 and B.1.1). I am the main author of this manuscript.

Section 3.2 is based on Updating Linked Data practices for FAIR Digital Object principles [Soiland-Reyes 2022d] (see appendices A.4.2 and B.1.2). I am the main author of this manuscript.

Section 4.1 and section 4.3 are based on the publication Packaging research artefacts with RO-Crate [Soiland-Reyes 2022a] (see appendices A.4.3, B.1.3 and B.1.5). I am the main author of this manuscript.

Section 4.2 is based on the publication Creating lightweight FAIR digital objects with RO-Crate [Soiland-Reyes 2022c] (see appendices A.4.4 and B.1.4). I am the main author of this manuscript.

Section 5.1 is based on the publication Making Canonical Workflow Building Blocks interoperable across workflow languages [Soiland-Reyes 2022b] (see appendices A.4.5 and B.1.6). I am the main author of this manuscript.

Section 5.2 is based on the publication The Specimen Data Refinery: A canonical workflow framework and FAIR Digital Object approach to speeding up digital mobilisation of natural history collections [Hardisty 2022] (see appendices A.4.6 and B.1.7). I mainly contributed to sections 5.2.2.2, 5.2.2.3, 5.2.4.1, 5.2.7 in this manuscript.

Section 5.3 is based on the publication Incrementally building FAIR Digital Objects with Specimen Data Refinery workflows [Woolland 2022] (see appendices A.4.7 and B.1.8). I am the main author of this manuscript.

Section 5.4 is based on the preprint Recording provenance of workflow runs with RO-Crate [Leo 2023b] (see appendices A.4.8 and B.1.9). I am the last author of this manuscript, and have mainly contributed to sections 5.4.1, 5.4.5, 5.4.5.3, 5.4.5.4.

References

See chapter references.