Chapter 1: Introduction

Science is increasingly dependent on digital means, with computational methods used in almost all aspects of research, ranging from digitising plant specimens in herbaria [Thiers 2016], to molecular simulations of protein binding for pharmaceutical drug design [Śledź 2018].

Academics, government agencies and industry are now commonly making data publicly available under open licenses, feeding a broadening democratisation of science [Kitchin 2021] across socio-economic borders1, and expanding the potential for new multidisciplinary fields, commercialisation, citizen engagement and wider societal benefits [Bisol 2014].

Cloud-based computational infrastructures for “big data” are readily available for use with a wide range of open source software, enabling large scale secondary data analysis and detailed visualisations of research outputs [Hashem 2015].

However, in this accelerated ecosystem of Open Science, concerns have been raised about replicability of research findings [Ioannidis 2005], flagged as a “reproducibility crisis” [Baker 2016]. It is perhaps then ironic that the increased use of computers—with their inherently repeatable execution mechanisms—can negatively contribute to this crisis, as research publications do not commonly provide sufficient computational details such as code, data formats or software versions [Stodden 2016].

The increased focus on reusability of digital data and computational methods has attracted the attention of funders and research communities. This led to the development of the FAIR principles for making data and their metadata Findable, Accessible, Interoperable and Reusable, i.e. retrievable and understandable for programmatic use [Wilkinson 2016].

One technological measure for achieving FAIR is using Linked Data (LD), a set of practices for publishing and relating data on the Web using controlled vocabularies [Berners-Lee 2006], serialised using formats of the Resource Description Framework (RDF) [Schreiber 2014] and organised using the Web Ontology Language (OWL) [W3C 2012]. However, the combined complexity of these underlying Semantic Web technologies can hamper adoption by developers [Klímek 2019] and by researchers who want to make their data available.
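To make this concrete, a minimal Linked Data description can be sketched in JSON-LD (one of the RDF serialisations) using schema.org terms. All identifiers here are hypothetical, and Python's standard json module stands in for a proper RDF library:

```python
import json

# A hypothetical Linked Data description of a dataset, as JSON-LD.
# The @context maps short terms to schema.org IRIs; @id gives the
# resource a global identifier that other descriptions can link to.
dataset = {
    "@context": "https://schema.org/",
    "@id": "https://example.org/dataset/42",
    "@type": "Dataset",
    "name": "Example herbarium specimen scans",
    "license": {"@id": "https://spdx.org/licenses/CC-BY-4.0"},
    "creator": {"@id": "https://orcid.org/0000-0000-0000-0000"},
}

print(json.dumps(dataset, indent=2))
```

The same structure can be interpreted as RDF triples by any JSON-LD processor, which is precisely what makes it "linked": the license and creator references resolve to descriptions published elsewhere on the Web.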

Computational workflows have been developed as a way to structure the execution of software tools, for instance for scientific data analysis, so that, by using a Workflow Management System (WfMS), tool execution is reproducible, scalable and documented. For these purposes, workflow systems have been widely adopted by some research fields such as the life sciences; however, the workflow definitions themselves are not yet commonly shared as part of scholarly outputs, and are only gradually being recognised as a form of FAIR Research Software [Katz 2021b].

Research Object (RO) is a concept proposed for sharing composites of research artefacts, together with their history and related resources such as software, workflows and external references [Bechhofer 2013]. The initial implementations of RO made heavy use of ontologies and required tight integration with workflow management systems, but the concept has great potential for FAIR publication of any scholarly output.

The FAIR principles are widely referenced in Open Science literature, and nominally adopted by many research data repositories and funder policies—but how can they be better translated into practice by typical researchers and software developers, who may be using workflow systems but not know any Linked Data technologies?

This is the focus for this thesis, where I investigate Linked Data approaches to implementing FAIR Research Objects and sharing reproducible Computational Workflows.

Motivation – achieving FAIR research outputs

This section gives the motivation for the thesis, together with a brief background to inform the research questions in Section 1.2. Further details of existing work are provided in Chapter 2.

FAIR Principles

The FAIR Principles [Wilkinson 2016] were introduced to improve sharing and digital reuse of research outputs ("data") as part of emerging open research practices. The main goals of FAIR are to support Findability, Accessibility, Interoperability and Reusability, through machine-readable metadata and standardised publication methods for data, as quoted in Table 1.

In order to be Findable:
F1 (Meta)data are assigned a globally unique and persistent identifier.
F2 Data are described with rich metadata (defined by R1 below).
F3 Metadata clearly and explicitly include the identifier of the data it describes.
F4 (Meta)data are registered or indexed in a searchable resource.
In order to be Accessible:
A1 (Meta)data are retrievable by their identifier using a standardized communications protocol.
A1.1 The protocol is open, free, and universally implementable.
A1.2 The protocol allows for an authentication and authorization procedure, where necessary.
A2 Metadata are accessible, even when the data are no longer available.
In order to be Interoperable:
I1 (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2 (Meta)data use vocabularies that follow FAIR principles.
I3 (Meta)data include qualified references to other (meta)data.
In order to be Reusable:
R1 Meta(data) are richly described with a plurality of accurate and relevant attributes.
R1.1 (Meta)data are released with a clear and accessible data usage license.
R1.2 (Meta)data are associated with detailed provenance.
R1.3 (Meta)data meet domain-relevant community standards.

Table 1: FAIR Guiding Principles. Adapted from [Wilkinson 2016], emphasis added in italics.

Although these guidelines are quite specific, they do not prescribe any particular technology or repository [Mons 2017]. Further formalizations of the FAIR principles include RDA’s FAIR Data Maturity Model [FAIR Maturity 2020, Bahui 2020]. FAIR has also been expanded beyond data, e.g. to cover software [Katz 2021b], computational workflows [Goble 2020], training materials [Garcia 2020a], machine learning models [Duarte 2023] and digital twins [Schultes 2022].

The FAIR principles have become highly influential for open research stakeholders [Jacobsen 2020], particularly in large research infrastructure initiatives such as the European Open Science Cloud (EOSC) [Schouppe 2018], with increasing awareness of and support for the principles in national Open Science policies and by funders [Davidson 2019, Davidson 2022]. Implementation of the principles by platform developers and researchers has, however, raised many questions and practical challenges [Mons 2020, Riungu-Kalliosaari 2022].

For instance, in order to evaluate a given resource’s FAIRness, additional technical constraints need to be assumed, such as use of particular formal vocabularies. FAIR metrics [Wilkinson 2018, Devaraju 2021] have recently become an area of active research, as different FAIR assessment tools may give a range of results for the same data resource, primarily based on which technical assumptions are made [Wilkinson 2022a, Verburg 2023].

Recently there has been an increased emphasis on training and awareness of the FAIR principles [Shanahan 2021, Rocca-Serra 2023], and on registries of standards and vocabularies [Sansone 2019]. However—with a general lack of skills in data management planning, inadequate (opaque) data formats, and not enough time invested in providing rich metadata—research data, even when shared through repositories, can become effectively “un-findable” or near impossible to reuse [Carballo-Garcia 2022].

From this current situation we can identify several challenges with regard to finding practical ways for developers of Research Software (RS) to generate and consume FAIR data.

Existing approaches to implementing FAIR

The vision of the Semantic Web [Berners-Lee 1999] was proposed as a way to publish structured data on the Web. This evolved into a Linked Data (LD) stack that uses logic-based ontologies, Web deployment of individually described resources, and cross-references between these resources using URI identifiers. The Semantic Web can be considered the ecosystem of such Linked Data resources, which can be queried, traversed and reasoned about.

Linked Data was seen early on as a possible means of implementing the FAIR principles, and a large focus of initiatives like GO FAIR, the Research Data Alliance and the wider FAIR community has been to find ways to FAIRify existing data sources, for example by developing domain-specific vocabularies and mappings, along with training and tooling to support these processes. FAIR publishing of datasets using the Data Catalog Vocabulary (DCAT) [Albertoni 2023] is encouraged, e.g. by the European Commission’s Semantic Interoperability Community Europe (SEMIC) and the larger Interoperable Europe initiative.

There are now a large number of choices for Semantic Web technologies, serialisation formats, vocabularies, deployments and identifiers—motivating the proposal of FAIR Implementation Profiles [Schultes 2020] to document and guide technology decisions.

The field of Life Sciences was an early adopter of Linked Data, establishing training portals like FAIR Cookbook [Rocca-Serra 2023], developing biomedical ontologies as indexed in BioPortal [Whetzel 2011] (over 1300 as of 2024-05-18), and sharing practices at conferences like Semantic Web Applications for Health Care and Life Sciences (SWAT4HCLS), active since 2008. The life science research infrastructure ELIXIR Europe has over 170 training materials for FAIR listed in its training portal TeSS (as of 2024-04-28), while the ELIXIR service FAIRsharing [Sansone 2019] has over 1700 standards, 2100 databases and 250 policies (as of 2024-04-28) for FAIR sharing of research data2.

A challenge for consumption of FAIR services in such a diverse landscape is thus how to support reliable machine actionability—making the data generally interpretable and typed sufficiently to allow invocation of pre-defined operations.

FAIR Digital Objects (FDO)

FAIR Digital Object (FDO) has been proposed as a machine-actionable ecosystem of scholarly outputs [Schultes 2019], and has now become a major initiative for realising the FAIR principles in a different way than the initial Semantic Web approach. FDO proponents envision a programmable mesh of strongly typed objects, which goes beyond the open data publication practices that the FAIR guidelines have popularised. For this, FDO aims to provide concrete constraints for systems, which lead to predictable machine actions.

The FDO guidelines3 [Anders 2023] and the more detailed FDO specifications [Ivonne 2023] are largely conceptual in nature, with several demonstrated implementations [Wittenburg 2022a, Lannom 2022a] which in theory can operate side-by-side. Many of these, however, rely on novel or older network protocols [Reilly 2009, Sun 2003a] which are not particularly familiar to software developers, and not commonly supported by software libraries or frameworks.

This divergence from the more Web-centric “FAIR majority view”, while sound from a technical perspective and promising with regards to predictable computational consumption, raises organisational challenges for wider adoption of FDOs, e.g. within EOSC and research infrastructures, and might be introducing a steeper learning curve than already exists for FAIR, particularly for developers of RS who are primarily interested in solving scientific challenges.

Clearly, existing adoptions of Linked Data as-is do not present a coherent ecosystem for FDO machine-actionability, but it is worth examining which aspects of the Web can benefit FDO development.

Research Software and Computational Workflows

A growing (if not majority) part of scientific analysis is now conducted using software and computational models. The concept of Research Software Engineering [Cohen 2020] has been established, along with the new professions of Research Software Engineer [Baxter 2012] and Data Scientist [van der Aalst 2014]—researchers are not just using off-the-shelf software, but also combining multiple computational tools (e.g. in pipelines) and writing their own analytical source code (e.g. statistical R scripts) and simulations.

From this observation emerges the need to treat software as FAIR artefacts [Lamprecht 2019], following best practices for documentation [Lee 2018] and open development [Prlić 2012], and ensuring Research Software (RS) is robust [Taschuk 2017] so it can be reused and cited as a scholarly output [Smith 2016]. With this motivation, the principles of FAIR Research Software [Katz 2021b] have been established by the Research Data Alliance (RDA) working group FAIR for Research Software (FAIR4RS) [Barker 2022] and are gradually gaining traction, particularly in the life sciences. An example of a remaining challenge is how citations of Research Software can practically be propagated following their execution.

Sharing of Research Software according to these principles helps communicate the computational methods, expanding tremendously the potential for consumption, analysis and production of scientific data across organisations and their application to a broadening scope of research problems.

However, the way software is used for a particular analysis to reach a given scientific goal requires additional measures to make it reproducible [Stodden 2016, Sandve 2013]. Computational Workflows (or scientific workflows) can structure and automate data analysis pipelines so they are scalable, portable and explainable [Atkinson 2017], and as a side-effect of these features can significantly improve reproducibility [Cohen-Boulakia 2017].
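The core idea behind these properties can be sketched in a few lines of Python (a deliberate simplification, not modelled on any particular WfMS): a workflow is a graph of steps with explicit data dependencies, and the engine derives the execution order from that graph rather than from the order of a script.

```python
# Minimal sketch of workflow scheduling: each step lists the steps it
# depends on, and the engine computes a valid execution order from the
# graph. Step names here are hypothetical.
from graphlib import TopologicalSorter

steps = {
    "digitise": set(),                    # no dependencies, runs first
    "ocr": {"digitise"},                  # needs the digitised images
    "georeference": {"digitise"},         # independent of "ocr"
    "publish": {"ocr", "georeference"},   # joins both branches
}

# static_order() yields the steps in a dependency-respecting order.
order = list(TopologicalSorter(steps).static_order())
print(order)
```

Because the dependencies are explicit, the engine can schedule independent steps (here "ocr" and "georeference") in parallel, re-run only affected steps after a change, and log provenance for each execution—which is largely why the scalability and reproducibility properties come for free.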

Several challenges emerge when considering the sharing of workflows as FAIR digital objects. For instance, a workflow composes multiple tools that themselves need to be shared. Data used by a workflow have their own attribution and licenses. The execution of a workflow produces many intermediate data, but understanding how those data were created from the workflow definition alone requires deep knowledge of the particular Workflow Management System (WfMS).

Gathering scholarly outputs in Research Objects

The identified need for communicating computational methods through Research Software and workflows highlights that science must go beyond sharing just data and metadata in order to achieve the FAIR principles. For a third-party researcher to fully take advantage of software and data, and to avoid descending further into the reproducibility crisis, the full set of contextual digital resources should be grouped and communicated as a single scholarly unit.

Research Objects (RO) [Bechhofer 2013] have been proposed as a mechanism to capture a range of diverse scholarly outputs in a single archivable item with detailed metadata. The RO concept was first realised using Semantic Web ontologies [myExperiment 2009, Belhajjame 2015]—these approaches primarily targeted long-term preservation of scientific workflows, utilised by RO as a mechanism to capture computational methods, augmented by the workflow inputs, outputs, workflow engine configuration and a human-readable explanation of each step.

The principles of Research Objects extend far beyond workflows—however, early RO implementations mainly focused on capturing software [Goble 2018]. To some extent, the lack of wider adoption of ontology-based ROs can also be explained by Research Software Engineers (e.g. developers of molecular dynamics simulations) and platforms (e.g. repositories, data management systems) lacking familiarity with workflow systems or Semantic Web technology—or worse, having tried these technologies and struggled [Carriero 2010, Tudorache 2020].

From this, a challenge is to make Linked Data technology approachable for the developers who are best placed to implement the FAIR principles, in platforms that are effectively making Research Objects.


Research Outline and Questions

Following the motivation in Section 1.1, this section elaborates my Research Questions (RQ) on three interlinked ideas:

  1. Realisation of the FAIR Digital Object concept using Web technologies.
  2. Implementing FAIR Research Objects with a pragmatic use of Linked Data practices.
  3. Unifying a FAIR Digital Object approach for computational workflows.

Aims for FAIR Digital Objects on the Web (RQ1)

The Web is ubiquitous in modern software engineering [Taivalsaari 2021], used for everything from user interfaces, mobile applications and controlling devices, to enterprise cross-platform integrations, backend data processing and microservices, frequently utilising cloud computing which itself is controlled using Web technologies [Marinescu 2023].

The principles of FDO seem important for achieving machine-actionable scholarly outputs, but several of these goals overlap with the motivations for the Semantic Web and Linked Data—yet it is not clear whether changing from the Web stack to a different set of network protocols is necessary to achieve the FDO benefits.

A relevant research question therefore is:

RQ1:

Can the promising FDO concept be realised using existing Web technology, taking into account the lessons learnt from the early Semantic Web developments and more recent Linked Data practices?

I address RQ1 in Chapter 2 and Chapter 3.

Aims for FAIR Research Objects (RQ2)

Following the lessons learnt from early Research Object (RO) implementations and the emerging FAIR principles, a new engagement between the RO and digital libraries communities started in 2018, where it was agreed to formulate a lightweight approach to Research Objects [Sefton 2018, Ó Carragáin 2019b] for the purpose of data packaging. From this initiative, the updated aims of FAIR Research Objects can be summarised as:

Following from these aims, the second research question is:

RQ2:

Can a more pragmatic use of Linked Data practices better implement Research Objects for a wider developer audience, by using familiar Web technologies and giving lightweight recommendations?

RQ2 is primarily addressed by Chapter 4.
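As a foretaste of the pragmatic approach examined in Chapter 4, the sketch below shows a lightweight Research Object described as plain JSON-LD following the RO-Crate 1.1 conventions. The file names and metadata values are illustrative only:

```python
import json

# Sketch of an RO-Crate-style metadata file (ro-crate-metadata.json,
# following RO-Crate 1.1 conventions). The crate is plain JSON-LD: a
# flat @graph of entities cross-referenced by @id, with no ontology
# tooling required to read or write it.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # The metadata file itself, pointing at the root dataset "./"
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # The root dataset: the packaged research object
            "@id": "./",
            "@type": "Dataset",
            "name": "Example analysis",
            "hasPart": [{"@id": "results.csv"}],
        },
        {"@id": "results.csv", "@type": "File", "name": "Analysis results"},
    ],
}

print(json.dumps(crate, indent=2))
```

Any developer comfortable with JSON can produce or consume such a file, while Linked Data-aware tools can still treat it as RDF through the @context.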

Aims for FAIR Computational Workflows (RQ3)

There exists a plethora of workflow systems and languages [Leipzig 2021, Amstutz 2021], with recent efforts creating the Common Workflow Language [Crusoe 2022] as a standard representation with FAIR metadata capabilities that is executable by multiple engines.

Notably, workflow definitions themselves can be considered FAIR scholarly outputs [Goble 2020]—FAIR Computational Workflows which are published in repositories like Dockstore [Yuen 2021] and WorkflowHub [Goble 2021]. One could consider computational workflows as a kind of FAIR Research Software [de Visser 2023], but by their nature workflows also encourage the FAIR principles (e.g. preparing a computational tool for a workflow system [Brack 2022a] may include publishing it in a container registry). Workflow systems are also useful for creating and consuming FAIR Digital Objects [Wittenburg 2022b], and in addition workflow systems commonly provide explicit provenance logs of their executions.

Approaches to describing workflow provenance in a machine-readable format were initially diverse [Cruz 2009], and later converged on the use of ontologies [Missier 2010], most notably using W3C PROV-O [Lebo 2013a] but with various specializations [Garijo 2011, Garijo 2012, Missier 2013, Belhajjame 2015, Cuevas-Vicenttín 2016].
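As a minimal illustration of the common core that these ontologies specialise, a single workflow run can be described with PROV-O terms in JSON-LD. The identifiers below are hypothetical:

```python
import json

# Sketch of workflow-run provenance using W3C PROV-O terms: an Activity
# (the run) used an input Entity, an output Entity was generated by the
# run, and a SoftwareAgent (the engine) was associated with it.
prov = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "@graph": [
        {"@id": "#run-1", "@type": "prov:Activity",
         "prov:used": {"@id": "input.fasta"},
         "prov:wasAssociatedWith": {"@id": "#workflow-engine"}},
        {"@id": "output.csv", "@type": "prov:Entity",
         "prov:wasGeneratedBy": {"@id": "#run-1"}},
        {"@id": "#workflow-engine", "@type": "prov:SoftwareAgent"},
    ],
}

print(json.dumps(prov, indent=2))
```

The specialisations cited above extend this core with engine-specific detail (sub-processes, retries, scheduling), which is exactly where the traces start to diverge between workflow systems.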

The tendency for workflow provenance models to diverge may be down to differences in the execution semantics of the different workflow systems—differences which, if accurately reflected in the provenance, reappear at that level. This in turn leads to incompatible provenance traces and a lack of common tooling. In addition, execution details may obscure the link between the computational processes and the final workflow data outputs, which researchers ultimately care more about than the intricacies of the workflow engine.

The third research question from these considerations is therefore:

RQ3:

Can a FAIR Digital Object approach for computational workflows unify machine-readable descriptions of Research Software, data and provenance, which can be consistently implemented by developers of different workflow management systems?

The multiple aspects of RQ3, as highlighted in this section, are addressed by Chapter 5.

Main Contributions

The contributions from this PhD include:

These contributions have not evolved in isolation, but in co-development with multiple international collaborations (see Appendix A) across scientific disciplines.

Thesis Overview

Chapter 2 gives the background of the concepts FAIR Digital Object (FDO) and Linked Data, including a brief history of the Semantic Web, followed by a critical analysis of these technologies and their use.

Chapter 3 targets RQ1 and contributes a framework-based evaluation of Linked Data and FDO as possible architectures for implementing a distributed object system for the purpose of FAIR data publishing. The discussion in this chapter considers how the two approaches can benefit from each other’s strengths.

Chapter 4 addresses RQ2 by introducing the contribution of RO-Crate – a pragmatic data packaging mechanism using Linked Data standards to implement FDO and be extensible for domain-specific metadata.

Chapter 5 considers RQ3 by exploring the relationship between Computational Workflows and FAIR practices using RO-Crate and FDO, with use cases from molecular dynamics and specimen digitization. The contribution of the Workflow Run Crate profiles is presented as an interoperable way to capture and publish workflow execution provenance.

Chapter 6 summarises and discusses the contributions from this thesis, reflects on later third-party developments and concludes by evaluating the research questions.

Origins

Chapter 2 and Section 3.1 are based on the journal article [Soiland-Reyes 2023c] (see appendices A.4.1 and B.1.1). I am the main author of this manuscript.

Stian Soiland-Reyes, Carole Goble, Paul Groth (2024):
Evaluating FAIR Digital Object and Linked Data as distributed object systems.
PeerJ Computer Science 10:e1781
https://doi.org/10.7717/peerj-cs.1781

Section 3.2 is based on [Soiland-Reyes 2022d] (see appendices A.4.2 and B.1.2). I am the main author of this manuscript.

Stian Soiland-Reyes, Leyla Jael Castro, Daniel Garijo, Marc Portier, Carole Goble, Paul Groth (2022):
Updating Linked Data practices for FAIR Digital Object principles.
Research Ideas and Outcomes 8:e94501
https://doi.org/10.3897/rio.8.e94501

Section 4.1 and Section 4.3 are based on the publication [Soiland-Reyes 2022a] (see appendices A.4.3, B.1.3 and B.1.5). I am the main author of this manuscript.

Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble (2022):
Packaging research artefacts with RO-Crate.
Data Science 5(2)
https://doi.org/10.3233/DS-210053

Section 4.2 is based on the publication [Soiland-Reyes 2022c] (see appendices A.4.4 and B.1.4). I am the main author of this manuscript.

Stian Soiland-Reyes, Peter Sefton, Leyla Jael Castro, Frederik Coppens, Daniel Garijo, Simone Leo, Marc Portier, Paul Groth (2022):
Creating lightweight FAIR digital objects with RO-Crate.
Research Ideas and Outcomes 8:e93937
https://doi.org/10.3897/rio.8.e93937

Section 5.1 is based on the publication [Soiland-Reyes 2022b] (see appendices A.4.5 and B.1.6). I am the main author of this manuscript.

Stian Soiland-Reyes, Genís Bayarri, Pau Andrio, Robin Long, Douglas Lowe, Ania Niewielska, Adam Hospital, Paul Groth (2022):
Making Canonical Workflow Building Blocks interoperable across workflow languages.
Data Intelligence 4(2)
https://doi.org/10.1162/dint_a_00135

Section 5.2 is based on the publication [Hardisty 2022] (see appendices A.4.6 and B.1.7). I mainly contributed to sections 5.2.2.2, 5.2.2.3, 5.2.4.1, 5.2.7 in this manuscript.

Alex Hardisty, Paul Brack, Carole Goble, Laurence Livermore, Ben Scott, Quentin Groom, Stuart Owen, Stian Soiland-Reyes (2022):
The Specimen Data Refinery: A canonical workflow framework and FAIR Digital Object approach to speeding up digital mobilisation of natural history collections.
Data Intelligence 4(2)
https://doi.org/10.1162/dint_a_00134

Section 5.3 is based on the publication [Woolland 2022] (see appendices A.4.7 and B.1.8). I am the main author of this manuscript.

Oliver Woolland, Paul Brack, Stian Soiland-Reyes, Ben Scott, Laurence Livermore (2022):
Incrementally building FAIR Digital Objects with Specimen Data Refinery workflows.
Research Ideas and Outcomes 8:e94349
https://doi.org/10.3897/rio.8.e94349

Section 5.3 is based on the preprint [Leo 2023b] (see appendices A.4.8 and B.1.9). I am the last author of this manuscript, and have mainly contributed to sections 5.4.1, 5.4.5, 5.4.5.3, 5.4.5.4.

Simone Leo, Michael R. Crusoe, Laura Rodríguez-Navas, Raül Sirvent, Alexander Kanitz, Paul De Geest, Rudolf Wittner, Luca Pireddu, Daniel Garijo, José M. Fernández, Iacopo Colonnelli, Matej Gallo, Tazro Ohta, Hirotaka Suetake, Salvador Capella-Gutierrez, Renske de Wit, Bruno de Paula Kinoshita, Stian Soiland-Reyes (2023):
Recording provenance of workflow runs with RO-Crate.
arXiv 2312.07852v1 [cs.DL] https://doi.org/10.48550/arXiv.2312.07852

This thesis also cites background material where I have contributed as co-author, provided as supplements on the Web, see Appendix B.3.

References

See chapter references.


  1. Although current open data practices do not benefit the Global South equally [Serwadda 2018]. ↩︎

  2. It is worth noting that not all of these databases and standards are based on Linked Data methods, and may be supporting FAIR principles in a looser sense. ↩︎

  3. See Section 2.1.1 ↩︎