Packaging research artefacts with RO-Crate

Cite as

Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble (2022):
Packaging research artefacts with RO-Crate.
Data Science 5(2)


Stian Soiland-Reyesa,b, Peter Seftonc, Mercè Crosasd, Leyla Jael Castroe, Frederik Coppensf, José M. Fernándezg, Daniel Garijoh, Björn Grüningi, Marco La Rosaj, Simone Leok, Eoghan Ó Carragáinl, Marc Portierm, Ana Trisovicd, RO-Crate Communityn, Paul Grothb, Carole Goblea

a Department of Computer Science, The University of Manchester, Manchester, UK
b Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
c Faculty of Science, University of Technology Sydney, Australia
d Institute for Quantitative Social Science, Harvard University, Cambridge, MA, USA.
e ZB MED Information Centre for Life Sciences, Cologne, Germany.
f VIB-UGent Center for Plant Systems Biology, Gent, Belgium.
g Barcelona Supercomputing Center, Barcelona, Spain.
h Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain.
i Bioinformatics Group, Department of Computer Science, Albert-Ludwigs-University Freiburg, Freiburg, Germany.
j PARADISEC, Melbourne, Australia.
k Center for Advanced Studies, Research, and Development in Sardinia (CRS4), Pula (CA), Italy.
l University College Cork, Ireland.
m Vlaams Instituut voor de Zee, Oostende, Belgium.
n (see Appendix B)


An increasing number of researchers support reproducibility by including pointers to and descriptions of datasets, software and methods in their publications. However, scientific articles may be ambiguous, incomplete and difficult to process by automated systems. In this paper we introduce RO-Crate, an open, community-driven, and lightweight approach to packaging research artefacts along with their metadata in a machine-readable manner. RO-Crate is based on schema.org annotations in JSON-LD, aiming to establish best practices to formally describe metadata in an accessible and practical way for their use in a wide variety of situations.

An RO-Crate is a structured archive of all the items that contributed to a research outcome, including their identifiers, provenance, relations and annotations. As a general purpose packaging approach for data and their metadata, RO-Crate is used across multiple areas, including bioinformatics, digital humanities and regulatory sciences. By applying “just enough” Linked Data standards, RO-Crate simplifies the process of making research outputs FAIR while also enhancing research reproducibility.

An RO-Crate for this article is archived at


The move towards Open Science has increased the need and demand for the publication of artefacts of the research process [1]. This is particularly apparent in domains that rely on computational experiments; for example, the publication of software, datasets and records of the dependencies that such experiments rely on [113].

It is often argued that the publication of these assets, and specifically software [80], workflows [55] and data, should follow the FAIR principles [123]; namely, that they are Findable, Accessible, Interoperable and Reusable. These principles are agnostic to the implementation strategy needed to comply with them. Hence, there has been an increasing amount of work in the development of platforms and specifications that aim to fulfil these goals [91].

Important examples include data publication with rich metadata (e.g. Zenodo [40]), domain-specific data deposition (e.g. PDB [16]) and following practices for reproducible research software [101] (e.g. use of containers). While these platforms are useful, experience has shown that it is important to put greater emphasis on the interconnection of the multiple artefacts that make up the research process [71].

The notion of Research Objects [12] (RO) was introduced to address this connectivity, providing semantically rich aggregations of (potentially distributed) resources with a layer of structure over a research study; this is then to be delivered in a machine-readable format.

A Research Object bundles together multiple types of artefacts, such as spreadsheets, code, examples, and figures. The RO is augmented with annotations and relationships that describe the artefacts' context (e.g. a CSV being used by a script, or a figure being the result of a workflow).

This notion of ROs provides a compelling vision as an approach for implementing FAIR data. However, existing Research Object implementations require a large technology stack [14], are typically tailored to a particular platform and are also not easily usable by end-users.

To address this gap, a new community came together [23] to develop RO-Crate — an approach to package and aggregate research artefacts with their metadata and relationships. The aim of this paper is to introduce RO-Crate and assess it as a strategy for making multiple types of research artefacts FAIR. Specifically, the contributions of this paper are as follows:

  1. An introduction to RO-Crate, its purpose and context;
  2. A guide to the RO-Crate community and tooling;
  3. Examples of RO-Crate usage, demonstrating its value as connective tissue for different artefacts from different communities.

The rest of this article is organised as follows. We first describe RO-Crate through its development methodology that formed the RO-Crate concept, showing its foundations in Linked Data and emerging principles. We then define RO-Crate technically, before we introduce the community and tooling. We move to analyse RO-Crate with respect to usage in a diverse set of domains. Finally, we present related work and conclude with some remarks including RO-Crate highlights and future work. The appendix adds a formal definition of RO-Crate using First-Order logic.


RO-Crate aims to provide an approach to packaging research artefacts with their metadata that can be easily adopted. To illustrate this, let us imagine a research paper reporting on the sequence analysis of proteins obtained from an experiment on mice. The sequence output files, sequence analysis code, resulting data and reports summarising statistical measures are all important and inter-related research artefacts, and consequently would ideally all be co-located in a directory and accompanied by their corresponding metadata. In reality, some of the artefacts (e.g. data or software) will be recorded as external references to repositories that do not necessarily follow the FAIR principles. This conceptual directory, along with the relationships between its constituent digital artefacts, is what the RO-Crate model aims to represent, linking together all the elements of an experiment that are required for the experiment’s reproducibility and reusability.

The question then arises as to how the directory with all this material should be packaged in a manner that is accessible and usable by others. This means it should be both programmatically accessible to machines and readable by humans. A de facto approach to sharing collections of resources is through compressed archives (e.g. a ZIP file). This solves the problem of “packaging”, but it does not guarantee downstream access to all artefacts in a programmatic fashion, nor describe the role of each file in that particular research. Both features, the ability to automatically access and to reason about an object, are crucial, and lead to the need for explicit metadata about the contents of the folder, describing each file and linking them together.

Examples of metadata descriptions across a wide range of domains abound within the literature, both in research data management [6] [46] [75] and within library and information systems [24] [127]. However, many of these approaches require knowledge of metadata schemas, particular annotation systems, or the use of complex software stacks. Indeed, particularly within research, these requirements have led to a lack of adoption and growing frustration with current tooling and specifications [94] [119] [102].

RO-Crate seeks to address this complexity by:

  1. being conceptually simple and easy to understand for developers;
  2. providing strong, easy tooling for integration into community projects;
  3. providing a strong and opinionated guide regarding current best practices;
  4. adopting de-facto standards that are widely used on the Web.

In the following sections we demonstrate how the RO-Crate specification and ecosystem achieve these goals.

Development Methodology

It is a good question as to what base level we assume for ‘conceptually simple’. We take simplicity to apply at two levels: for the developers who produce the platforms and for the data practitioners and users of those platforms.

For our development methodology we followed the mantra of working closely with a small group to really get a deep understanding of requirements and ensure rapid feedback loops. We created a pool of early adopter projects from a range of disciplines and groups, primarily addressing developers of platforms. Thus the base level for simplicity was developer friendliness.

We assumed a developer familiar with making Web applications with JSON data (who would then learn how to make RO-Crate JSON-LD), which informed core design choices for our JSON-level documentation approach and RO-Crate serialization (section on implementation). Our group of early adopters, growing as the community evolved, drove the RO-Crate requirements and provided feedback through our multiple communication channels including bi-monthly meetings, which we describe in section on community along with the established norms.

The simplicity of understanding and engaging with RO-Crate for data practitioners is addressed through the platforms, for example with interactive tools (section RO-Crate tooling) like Describo [78] and Jupyter notebooks [70], and through close discussions with domain scientists on how to appropriately capture what they determine to be relevant metadata. This ultimately requires a new type of awareness and learning material, separate from developer specifications, focusing on the simplicity of extensibility to serve user needs, along with user-driven development of new RO-Crate Profiles specific to those needs (section on in use).

Conceptual Definition

A key premise of RO-Crate is the existence of a wide variety of resources on the Web that can help describe research. As such, RO-Crate relies on the Linked Data principles [63]. Figure 1 shows the main conceptual elements involved in an RO-Crate: The RO-Crate Metadata File (top) describes the Research Object using structured metadata including external references, coupled with the contained artefacts (bottom) bundled and described by the RO-Crate.

The conceptual notion of a Research Object [12] is thus realised with the RO-Crate model and serialised using Linked Data constructs within the RO-Crate metadata file.

A Persistent Identifier (PID) [86] points to a Research Object (RO), which may be archived using different packaging approaches like BagIt [74], OCFL [96], git or ZIP. The RO is described within a RO-Crate Metadata File, providing identifiers for authors using ORCID, organisations using Research Organization Registry (ROR) [79] and licences such as Creative Commons using SPDX identifiers. The RO-Crate content is further described with additional metadata following a Linked Data approach. Data can be embedded files and directories, as well as links to external Web resources, PIDs and nested RO-Crates.

Figure 1: Conceptual RO-Crate Overview


Linked Data as a foundation

The Linked Data principles [18] (use of IRIs to identify resources (i.e. artefacts), resolvable via HTTP, enriched with metadata and linked to each other) are core to RO-Crate; therefore IRIs are used to identify an RO-Crate, its constituent parts and metadata descriptions, and the properties and classes used in the metadata.

RO-Crates are self-described and follow the Linked Data principles to describe all of their resources in both human and machine readable manner. Hence, resources are identified using global identifiers (absolute IRIs) where possible; and relationships between two resources are defined with links.

The foundation of Linked Data and shared vocabularies also means that multiple RO-Crates and other Linked Data resources can be indexed, combined, queried, validated or transformed using existing Semantic Web technologies such as SPARQL and SHACL, and well-established knowledge graph triple stores like Apache Jena and Ontotext GraphDB.

The possibility of consuming RO-Crate metadata with such powerful tools gives another strong reason for using Linked Data as a foundation. This use of mature Web technologies also means its developers and consumers are not restricted to the Research Object aspects that have already been specified by the RO-Crate community, but can extend and integrate RO-Crate in multiple standardised ways.

RO-Crate is a self-described container

An RO-Crate is defined as a self-described Root Data Entity that describes and contains data entities, which are further described by referencing contextual entities. A data entity is either a file (i.e. a byte sequence stored on disk somewhere) or a directory (i.e. a set of named files and other directories). A file does not need to be stored inside the RO-Crate root; it can be referenced via a PID/IRI. A contextual entity exists outside the information system (e.g. a Person, a workflow language) and is represented solely by its metadata. The representation of a data entity as a byte sequence makes it possible to store a variety of research artefacts, including not only data but also, for instance, software and text.

The Root Data Entity is a directory, the RO-Crate Root, identified by the presence of the RO-Crate Metadata File ro-crate-metadata.json (top of Figure 1). This file describes the RO-Crate, its content and related metadata using Linked Data in JSON-LD format [112]. This is a W3C standard RDF serialisation that has become popular; it is easy to read by humans while also offering some advantages for data exchange on the Internet. JSON-LD, a subset of the widely supported and well-known JSON format, has tooling available for many programming languages.

The minimal requirements for the root data entity metadata are name, description and datePublished, as well as a contextual entity identifying its license — additional metadata are commonly added to entities depending on the purpose of the particular RO-Crate.
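
As a sketch of how these minimal requirements might be checked programmatically, the following Python snippet (function and example names are illustrative, not from any official RO-Crate library) looks up the root data entity via the metadata descriptor and reports which required keys are missing:

```python
import json

# Minimal metadata required on the root data entity, per the text above:
# name, description, datePublished, and a license reference.
REQUIRED_ROOT_KEYS = {"name", "description", "datePublished", "license"}

def missing_root_metadata(metadata_json: str) -> set:
    """Return the set of required keys absent from the root data entity."""
    graph = json.loads(metadata_json)["@graph"]
    entities = {e["@id"]: e for e in graph}
    # The metadata descriptor's "about" points at the root data entity.
    root_id = entities["ro-crate-metadata.json"]["about"]["@id"]
    return REQUIRED_ROOT_KEYS - entities[root_id].keys()

example = json.dumps({"@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "name": "Example crate",
     "datePublished": "2022-01-01"},
]})
print(missing_root_metadata(example))  # reports description and license
```

Real validators would go further (checking the license contextual entity, date formats, and profile-specific requirements), but the flat @graph keeps even this minimal check trivial.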

RO-Crates can be stored, transferred or published in multiple ways, e.g. BagIt [74], Oxford Common File Layout [96] (OCFL), downloadable ZIP archives in Zenodo or through dedicated online repositories, as well as published directly on the Web, e.g. using GitHub Pages. Combined with Linked Data identifiers, this caters for a diverse set of storage and access requirements across different scientific domains, from metagenomics workflows producing hundreds of gigabytes of genome data to cultural heritage records with access restrictions for personally identifiable data. Specific RO-Crate profiles (section on extensibility) may constrain serialization and publication expectations, and require additional contextual types and properties.
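
As an illustration of one of these options, the following Python sketch (with hypothetical file names) packages a crate directory as a ZIP archive, keeping ro-crate-metadata.json at the top of the archive:

```python
import json
import tempfile
import zipfile
from pathlib import Path

def zip_crate(crate_root: Path, target: Path) -> None:
    """Package an RO-Crate root directory as a ZIP archive."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(crate_root.rglob("*")):
            if path.is_file():
                # Store paths relative to the crate root, so that
                # ro-crate-metadata.json sits at the top of the archive.
                zf.write(path, path.relative_to(crate_root))

# Build a tiny example crate in a temporary directory and package it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my-crate"
    root.mkdir()
    (root / "ro-crate-metadata.json").write_text(json.dumps(
        {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": []}))
    (root / "data.csv").write_text("a,b\n1,2\n")
    target = Path(tmp) / "my-crate.zip"
    zip_crate(root, target)
    with zipfile.ZipFile(target) as zf:
        print(zf.namelist())  # ['data.csv', 'ro-crate-metadata.json']
```

The same directory could equally be transferred as a BagIt bag or published directly on the Web; the metadata file, not the container, is what defines the crate.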

Data Entities are described using Contextual Entities

RO-Crate distinguishes between data and contextual entities in a similar way to HTTP terminology’s early attempt to separate information (data) and non-information (contextual) resources [120]. Data entities are usually files and directories located by relative IRI references within the RO-Crate Root, but they can also be Web resources or restricted data identified with absolute IRIs, including Persistent Identifiers (PIDs) [86].

As both types of entities are identified by IRIs, their distinction is allowed to be blurry; data entities can be located anywhere and be complex, while contextual entities can have a Web presence beyond their description inside the RO-Crate. For instance, an ORCID IRI is primarily an identifier for a person, but secondarily it is also a Web page and a way to refer to their academic work.

A particular IRI may appear as a contextual entity in one RO-Crate and as a data entity in another; the distinction lies in the fact that data entities can be considered to be contained or captured by that RO-Crate (RO Content in Figure 1), while contextual entities mainly explain an RO-Crate or its content (although this distinction is not a formal requirement).

In RO-Crate, a referenced contextual entity (e.g. a person identified by ORCID) should always be described within the RO-Crate Metadata File with at least a type and name, even where their PID might resolve to further Linked Data. This is so that clients are not required to follow every link for presentation purposes, for instance HTML rendering. Similarly, any imported extension terms should themselves also have a human-readable description in cases where their PID does not directly resolve to human-readable documentation.
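
A consumer-side sketch of this recommendation, using only Python's standard library (the helper name is illustrative): flag referenced entities that lack the minimal @type and name, since an HTML renderer would otherwise have nothing to display for them:

```python
import json

def underdescribed_entities(metadata_json: str) -> list:
    """List referenced entities missing the minimal @type and name."""
    graph = json.loads(metadata_json)["@graph"]
    by_id = {e["@id"]: e for e in graph}
    problems = []
    for entity in graph:
        for value in entity.values():
            refs = value if isinstance(value, list) else [value]
            for ref in refs:
                if isinstance(ref, dict) and "@id" in ref:
                    target = by_id.get(ref["@id"])
                    if target and not ("@type" in target and "name" in target):
                        problems.append(ref["@id"])
    return sorted(set(problems))

crate = json.dumps({"@graph": [
    {"@id": "./", "@type": "Dataset", "name": "Crate",
     "author": {"@id": "#alice"}},
    {"@id": "#alice", "@type": "Person"},  # no name, so it gets flagged
]})
print(underdescribed_entities(crate))  # ['#alice']
```

Because every entity is described locally, a renderer can present the crate without dereferencing any external PID.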

Figure 2 shows a simplified class diagram of RO-Crate, highlighting the different types of data entities and contextual entities that can be aggregated and related. While an RO-Crate would usually contain one or more data entities (hasPart), it may also be a pure aggregation of contextual entities (mentions).

The RO-Crate Metadata File conforms to a version of the specification, and contains a JSON-LD graph [112] that describes the entities that make up the RO-Crate. The RO-Crate Root Data Entity represents the Research Object as a dataset. The RO-Crate aggregates data entities (hasPart), which are further described using contextual entities (which may include aggregated and non-aggregated data entities). Multiple types and relations from schema.org allow annotations to be more specific, including figures, nested datasets, computational workflows, people, organisations, instruments and places. Contextual entities not otherwise cross-referenced from other entities' properties (describes) can be grouped under the root entity (mentions).

Figure 2: Simplified class diagram of RO-Crate


Guide through Recommended Practices

RO-Crate as a specification aims to build a set of recommended practices on how to practically apply existing standards in a common way to describe research outputs and their provenance, without having to learn each of the underlying technologies in detail.

As such, the RO-Crate 1.1 specification [106] can be seen as an opinionated and example-driven guide to writing [62] metadata as JSON-LD [112] (see section on implementation), which leaves it open for implementers to include additional metadata using other types and properties, or even additional Linked Data vocabularies/ontologies or their own ad-hoc terms.

However, the primary purpose of the RO-Crate specification is to assist developers in leveraging Linked Data principles for describing Research Objects in a structured language, while reducing the steep learning curve otherwise associated with Semantic Web adoption, such as the development of ontologies, identifiers, namespaces, and RDF serialization choices.

Ensuring Simplicity

One aim of RO-Crate is to be conceptually simple. This simplicity has been repeatedly checked and confirmed through an informal community review process. For instance, in the discussion on supporting ad-hoc vocabularies in RO-Crate, the community explored potential Linked Data solutions. The conventional wisdom in RDF best practices is to establish a vocabulary with a new IRI namespace, formalised using RDF Schema or OWL ontologies. However, this can present an excessive learning curve for non-experts in semantic knowledge representation, and the RO-Crate community instead agreed on a dual lightweight approach: (i) document how projects with their own Web presence can make a pure HTML-based vocabulary, and (ii) provide a community-wide PID namespace under w3id.org that redirects to simple CSV files maintained in GitHub.

To further verify this idea of simplicity, we have formalised the RO-Crate definition (see Appendix on Formal Definition). An important result of this exercise is that the underlying data structure of RO-Crate, although conceptually a graph, is represented as a depth-limited tree. This formalisation also emphasises the boundedness of the structure; namely, the fact that elements are specifically identified as being either semantically contained by the RO-Crate as Data Entities (hasPart) or mainly referenced (mentions) and typed as external to the Research Object as Contextual Entities. It is worth pointing out that this semantic containment can extend beyond the physical containment of files residing within the RO-Crate Root directory on a given storage system, as the RO-Crate data entities may include any data resource globally identifiable using IRIs.

Extensibility and RO-Crate profiles

The RO-Crate specification provides a core set of conventions to describe research outputs using types and properties applicable across scientific domains. However, we have found that domain-specific use of RO-Crate will, implicitly or explicitly, form a specialised profile of RO-Crate; i.e., a set of conventions, types and properties that are minimally required and that one can expect to be present in that subset of RO-Crates. For instance, RO-Crates used for the exchange of workflows will have to contain a data entity of type ComputationalWorkflow, while cultural heritage records should have a contentLocation.

Making such profiles explicit allows further reliable programmatic consumption and generation of RO-Crates beyond the core types defined in the RO-Crate specification. Following the RO-Crate mantra of guidance over strictness, profiles are mainly duck-typing rather than strict syntactic or semantic types, but may also have corresponding machine-readable schemas at multiple levels (file formats, JSON, RDF shapes, RDFS/OWL semantics).
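
As an illustration, a duck-typing profile check can be as simple as inspecting entity types. The following Python sketch assumes only the expectation stated above, that a workflow crate aggregates a data entity of type ComputationalWorkflow; the function name and example workflow file are hypothetical, not part of any official tooling:

```python
import json

def looks_like_workflow_crate(metadata_json: str) -> bool:
    """Duck-typing check: does any entity carry the ComputationalWorkflow type?"""
    graph = json.loads(metadata_json)["@graph"]
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):  # @type may be a string or a list
            types = [types]
        if "ComputationalWorkflow" in types:
            return True
    return False

crate = json.dumps({"@graph": [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "blast.cwl"}]},
    {"@id": "blast.cwl",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"]},
]})
print(looks_like_workflow_crate(crate))  # True
```

A stricter profile would layer further checks (schemas, RDF shapes) on top of this kind of lightweight type inspection.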

The next version of the RO-Crate specification (1.2) will define a formalization for publishing and declaring conformance to RO-Crate profiles. Such a profile is primarily a human-readable document of the aforementioned expectations and conventions, but may also define a machine-readable profile as a Profile Crate: another RO-Crate that describes the profile and can additionally list schemas for validation, compatible software, applicable repositories, serialization/packaging formats, extension vocabularies, custom JSON-LD contexts and examples (see for example the Workflow RO-Crate profile).

In addition, there are sometimes existing domain-specific metadata formats, but they are either not RDF-based (and thus time-consuming to construct terms for in JSON-LD) or are at a different granularity level that might become overwhelming if represented directly in the RO-Crate Metadata file (e.g. W3C PROV bundle detailing every step execution of a workflow run [68]). RO-Crate allows such alternative metadata files to co-exist, and be described as data entities with references to the standards and vocabularies they conform to. This simplifies further programmatic consumption even where no filename or file extension conventions have emerged for those metadata formats.

Section on in use examines the observed specializations of RO-Crate use in several domains and their emerging profiles.

Technical implementation of the RO-Crate model

The RO-Crate conceptual model has been realised using JSON-LD and in a prescriptive form as discussed in section on conceptual definition. These technical choices were made to cater for simplicity from a developer perspective (as introduced in section on methodology).

JSON-LD [112] provides a way to express Linked Data as a JSON structure, where a context provides mapping to RDF properties and classes. While JSON-LD cannot map arbitrary JSON structures to RDF, we found that it does lower the barrier compared to other RDF syntaxes, as the JSON syntax nowadays is a common and popular format for data exchange on the Web.

However, JSON-LD alone has too many degrees of freedom and hidden complexities for software developers to reliably produce and consume without specialised expertise or large RDF software frameworks. A large part of the RO-Crate specification is therefore dedicated to describing the acceptable subset of JSON structures.


RO-Crate mandates the use of flattened, compacted JSON-LD in the RO-Crate Metadata file ro-crate-metadata.json, where a single @graph array contains all the data and contextual entities in a flat list. An example can be seen in the JSON-LD snippet in Listing 1 below, describing a simple RO-Crate containing data entities described using contextual entities:

{ "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
      { "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"}
      },
      { "@id": "./",
        "@type": "Dataset",
        "name": "A simplified RO-Crate",
        "author": {"@id": "#alice"},
        "license": {"@id": "https://spdx.org/licenses/CC-BY-4.0"},
        "datePublished": "2021-11-02T16:04:43Z",
        "hasPart": [
          {"@id": "survey-responses-2019.csv"},
          {"@id": "pics/sepia_fence.jpg"}
        ]
      },
      { "@id": "survey-responses-2019.csv",
        "@type": "File",
        "about": {"@id": "pics/sepia_fence.jpg"},
        "author": {"@id": "#alice"}
      },
      { "@id": "pics/sepia_fence.jpg",
        "@type": ["File", "ImageObject"],
        "contentLocation": {"@id": "http://sws.geonames.org/8152662/"},
        "author": {"@id": "https://orcid.org/0000-0002-1825-0097"}
      },
      { "@id": "#alice",
        "@type": "Person",
        "name": "Alice"
      },
      { "@id": "https://orcid.org/0000-0002-1825-0097",
        "@type": "Person",
        "name": "Josiah Carberry"
      },
      { "@id": "http://sws.geonames.org/8152662/",
        "@type": "Place",
        "name": "Catalina Park"
      },
      { "@id": "https://spdx.org/licenses/CC-BY-4.0",
        "@type": "CreativeWork",
        "name": "Creative Commons Attribution 4.0"
      }
  ]
}

Listing 1: Simplified RO-Crate metadata file showing the flattened, compacted JSON-LD @graph array containing the data entities and contextual entities, cross-referenced using @id. The ro-crate-metadata.json entity self-declares conformance with the RO-Crate specification using a versioned persistent identifier; further RO-Crate descriptions are on the root data entity ./ or any of the referenced data or contextual entities. This is exemplified by the ImageObject data entity referencing contextual entities for contentLocation and author that differ from those of the overall RO-Crate. In this crate, the about property of the CSV data entity references the ImageObject, which then takes the roles of both a data entity and a contextual entity. While Person entities are ideally identified with ORCID PIDs, as for Josiah, #alice is here in contrast an RO-Crate local identifier, highlighting the pragmatic “just enough” Linked Data approach.

In this flattened profile of JSON-LD, each {entity} is directly under @graph and represents the RDF triples with a common subject (@id), mapped properties like hasPart, and objects, given as either literal "string" values, referenced {objects} (whose properties are listed in their own entity), or a JSON [list] of these. If processed as JSON-LD, this forms an RDF graph by matching the @id IRIs and applying the @context mapping to terms.

Flattened JSON-LD

When JSON-LD 1.0 [112] was proposed, one of its motivations was to seamlessly apply an RDF nature on top of the regular JSON frequently used by Web APIs. JSON objects in APIs are frequently nested at multiple levels, and perhaps the most common form of JSON-LD is the compacted form, which follows this expectation (JSON-LD 1.1 further expands these capabilities, e.g. allowing nested @context definitions).

While this feature of JSON-LD can be seen as a way to “hide” its RDF nature, we found that the use of nested trees (e.g. a Person entity appearing as author of a File which nests under a Dataset with hasPart) counter-intuitively forces consumers to consider the JSON-LD as an RDF Graph, since an identified Person entity can appear at multiple and repeated points of the tree (e.g. author of multiple files), necessitating node merging or duplication, which can become complicated as this approach also invites the use of blank nodes (entities missing @id).

By comparison, a single flat @graph array approach, as required by RO-Crate, means that applications can choose to process and edit each entity as pure JSON by a simple lookup based on @id. At the same time, lifting all entities to the same level reflects the Research Object principles [12] in that describing the context and provenance is just as important as describing the data, and the requirement of @id of every entity forces RO-Crate generators to consciously consider existing IRIs and identifiers.
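
The lookup-based consumption pattern described above can be sketched in plain Python, without any RDF library; the example crate here is a hypothetical fragment:

```python
import json

# Because the @graph is flat, an application can index all entities by @id
# once and then follow any cross-reference with a dictionary lookup.
metadata = json.loads("""{
  "@graph": [
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "survey-responses-2019.csv"}]},
    {"@id": "survey-responses-2019.csv", "@type": "File",
     "author": {"@id": "#alice"}},
    {"@id": "#alice", "@type": "Person", "name": "Alice"}
  ]
}""")

entities = {e["@id"]: e for e in metadata["@graph"]}  # single lookup table

for part in entities["./"]["hasPart"]:
    data_entity = entities[part["@id"]]            # follow reference by @id
    author = entities[data_entity["author"]["@id"]]
    print(data_entity["@id"], "by", author["name"])
# survey-responses-2019.csv by Alice
```

No node merging or blank-node handling is needed: each entity appears exactly once in the @graph, and every reference is an explicit @id.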

JSON-LD context

In JSON-LD, the @context is a reference to another JSON-LD document that provides mapping from JSON keys to Linked Data term IRIs, and can enable various JSON-LD directives to cater for customised JSON structures for translating to RDF.

RO-Crate reuses vocabulary terms and IRIs from schema.org, but provides its own versioned JSON-LD context, which consists of a flat list mapping JSON-LD keys to their IRI equivalents (e.g. the key "author" maps to the property http://schema.org/author).

The rationale behind this decision is to support JSON-based RO-Crate applications that are largely unaware of JSON-LD, but still may want to process the @context to find or add Linked Data definitions of otherwise unknown properties and types. Not reusing the official schema.org context also means RO-Crate is able to map in additional vocabularies where needed, namely the Portland Common Data Model (PCDM) [31] for repositories and Bioschemas [58] for describing computational workflows. RO-Crate profiles may extend the @context to re-use additional domain-specific ontologies.

Similarly, while the context currently has "@type": "@id" annotations for implicit object properties, RO-Crate JSON-LD distinguishes explicitly between references to other entities ({"@id": "#alice"}) and string values ("Alice"), meaning RO-Crate applications can find references to corresponding entities and IRIs without parsing the @context to understand a particular property. Notably this is exploited by the ro-crate-html-js [95] tool to provide reliable HTML rendering for otherwise unknown properties and types.
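
This convention can be exploited with a one-line helper; the sketch below is illustrative and not taken from ro-crate-html-js:

```python
# In RO-Crate JSON-LD an entity reference is always a {"@id": ...} object,
# so applications can tell references from literal strings without
# consulting the @context at all.
def is_reference(value) -> bool:
    return isinstance(value, dict) and "@id" in value

entity = {"@id": "survey-responses-2019.csv",
          "author": {"@id": "#alice"},   # reference to another entity
          "name": "Survey responses"}    # plain literal string

print([k for k, v in entity.items() if k != "@id" and is_reference(v)])
# ['author']
```

A renderer can therefore hyperlink every value for which is_reference returns True, and print all other values as text, even for properties it has never seen before.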

RO-Crate Community

The RO-Crate conceptual model, implementation and best practices are developed by a growing community of researchers, developers and publishers. RO-Crate’s community is a key aspect of its effectiveness in making research artefacts FAIR. Fundamentally, the community provides the overall context of the implementation and model and ensures its interoperability.

The RO-Crate community consists of:

  1. A diverse set of people representing a variety of stakeholders.
  2. A set of collective norms.
  3. An open platform that facilitates communication (GitHub, Google Docs, monthly teleconferences).


The initial concept of RO-Crate was formed at the first Workshop on Research Objects (RO2018), held as part of the IEEE conference on eScience. This workshop followed up on considerations made at a Research Data Alliance (RDA) meeting on Research Data Packaging that found similar goals across multiple data packaging efforts [23]: simplicity, structured metadata and the use of JSON-LD.

An important outcome of discussions that took place at RO2018 was the conclusion that the original Wf4Ever Research Object ontologies [14], in principle sufficient for packaging research artefacts with rich descriptions, were, in practice, considered inaccessible for regular programmers (e.g., Web developers) and in danger of being incomprehensible for domain scientists due to their reliance on Semantic Web technologies and other ontologies.

DataCrate [103] was presented at RO2018 as a promising lightweight alternative approach, and an agreement was made by a group of volunteers to attempt building what was initially called “RO Lite” as a combination of DataCrate’s implementation and Research Object’s principles.

This group, originally made up of library and Semantic Web experts, has subsequently grown to include domain scientists, developers, publishers and more. This multiplicity of perspectives has led to the specification being used in a variety of domains, from bioinformatics and regulatory submissions to humanities and cultural heritage preservation.

The RO-Crate community is strongly engaged with the European-wide biology/bioinformatics collaborative e-Infrastructure ELIXIR [34], along with European Open Science Cloud (EOSC) projects including EOSC-Life, FAIRplus, CS3MESH4EOSC and BY-COVID. RO-Crate has also established collaborations with Bioschemas [58], GA4GH [99], OpenAIRE [100] and multiple H2020 projects.

A key set of stakeholders are developers: the RO-Crate community has made a point of attracting developers who can implement the specifications but, importantly, keeps “developer user experience” in mind. This means keeping the specifications straightforward to implement, so that they do not require expertise in technologies that are not widely deployed.

This notion of catering to “developer user experience” is an example of the set of norms that have developed and now define the community.


The RO-Crate community is driven by informal conventions and notions that are prevalent but not necessarily written down. Here, we distil what we as authors believe are the critical set of norms that have facilitated the development of RO-Crate and contributed to the ability of RO-Crate research packages to be FAIR. This is not to say that there are no other norms within the community, nor that everyone in the community holds these uniformly. Instead, we emphasise that these norms are helpful and are themselves shaped by community practices:

  1. Simplicity.
  2. Developer friendliness.
  3. Focus on examples and best practices rather than rigorous specification.
  4. Reuse “just enough” Web standards.

A core norm of RO-Crate is that of simplicity, which sets the scene for how we guide developers to structure metadata with RO-Crate. We focus mainly on documenting simple approaches to the most common use cases, such as authors having an affiliation. This norm also influences our take on developer friendliness; for instance, we are using the Web-native JSON format, allowing only a few of JSON-LD’s flexible Linked Data features. Moreover, the RO-Crate documentation is largely built up by examples showcasing best practices, rather than rigorous specifications. We build on existing Web standards that themselves are defined rigorously, which we utilise just enough in order to benefit from the advantages of Linked Data (e.g., extensions by namespaced vocabularies), without imposing too many developer choices or uncertainties (e.g., having to choose between the many RDF syntaxes).

While the above norms alone could easily lead to the creation of “yet another” JSON format, we keep the goal of FAIR interoperability of the captured metadata in view, and therefore closely follow FAIR best practices and current developments such as data citations, PIDs, open repositories and recommendations for sharing research outputs and software.

Open Platforms

The critical infrastructure that enables the community around RO-Crate is its use of open development platforms. This underlines the importance of open community access in supporting FAIR: it is difficult to build and consume FAIR research artefacts without being able to access the specifications, understand how they are developed, know about any potential implementation issues, and discuss usage to evolve best practices.

The development of RO-Crate was driven by capturing documentation of real-life examples and best practices rather than creating a rigorous specification. At the same time, we agreed to be opinionated on the syntactic form to reduce the jungle of implementation choices; we wanted to keep the important aspects of Linked Data to adhere to the FAIR principles while retaining the option of combining and extending the structured metadata using the existing Semantic Web stack, not just build a standalone JSON format.

Further work during 2019 started adapting the DataCrate documentation through a more collaborative and exploratory RO Lite phase, initially using Google Docs for review and discussion, then moving to GitHub as a collaboration space for developing what is now the RO-Crate specification, maintained as Markdown in GitHub Pages and published through Zenodo.

In addition to the typical Open Source-style development with GitHub issues and pull requests, the RO-Crate Community has, at the time of writing, two regular monthly calls, a Slack channel and a mailing list for coordinating the project; many of its participants also collaborate on RO-Crate at conferences and coding events such as the ELIXIR BioHackathon. The community jointly develops the RO-Crate specification and Open Source tools, as well as providing support and considering new use cases. The RO-Crate Community is open for anyone to join and participate equally under a code of conduct, and as of October 2021 has more than 50 members (see Appendix RO-Crate Community).

RO-Crate Tooling

The work of the community has led to the development of a number of tools for creating and using RO-Crates. Table 1 shows the current set of implementations6. Reviewing this list, one can see support for commonly used programming languages, including Python, JavaScript, and Ruby. Additionally, the tools can be integrated into commonly used research environments, in particular, the command line tool ro-crate-html-js [95] for creating a human-readable preview of an RO-Crate as a sidecar HTML file. Furthermore, there are tools that cater to end-users (Describo [78], WorkflowHub [124]), in order to simplify creating and managing RO-Crate. For example, Describo was developed to help researchers of the Australian Criminal Characters project to annotate historical prisoner records for greater insight into the history of Australia [97].

While the development of these tools is promising, our analysis of their maturity shows that the majority are at the Beta stage. This is partly because the RO-Crate specification itself only recently reached 1.0 status, in November 2019 [105]. With version 1.1 (October 2020) [107], RO-Crate has stabilised based on feedback from application development, providing a fixed point of reference; we are now seeing a further increase in the maturity of these tools, along with the creation of new ones.

Given the stage of the specification, these tools have been primarily targeting developers, essentially providing them with the core libraries for working with RO-Crate. Another target has been that of research data managers who need to manage and curate large amounts of data.

| Tool Name | Targets | Language/Platform | Status | Brief Description |
| --- | --- | --- | --- | --- |
| Describo [78] | Research Data Managers | NodeJS (Desktop) | RC | Interactive desktop application to create, update and export RO-Crates for different profiles |
| Describo Online [77] | Platform developers | NodeJS (Web) | Alpha | Web-based application to create RO-Crates using cloud storage |
| ro-crate-excel [84] | Data managers | JavaScript | Beta | Command-line tool to create/edit RO-Crates with spreadsheets |
| ro-crate-html-js [95] | Developers | JavaScript | Beta | HTML rendering of RO-Crate |
| ro-crate-js [49] | Research Data Managers | JavaScript | Alpha | Library for creating/manipulating crates; basic validation code |
| ro-crate-ruby [9] | Developers | Ruby | Beta | Ruby library for reading/writing RO-Crate, with workflow support |
| ro-crate-py [41] | Developers | Python | Alpha | Object-oriented Python library for reading/writing RO-Crate, usable from Jupyter Notebook |
| WorkflowHub [124] | Workflow users | Ruby | Beta | Workflow repository; imports and exports Workflow RO-Crate |
| LifeMonitor [35] | Workflow developers | Python | Alpha | Workflow testing and monitoring service; Workflow Testing profile of RO-Crate |
| SCHeMa [118] | Workflow users | PHP | Alpha | Workflow execution using RO-Crate as exchange mechanism [10.5281/zenodo.4671709] |
| galaxy2cwl [50] | Workflow developers | Python | Alpha | Wraps Galaxy workflow as Workflow RO-Crate |
| Modern PARADISEC [51] | Repository managers | Platform | Beta | Cultural Heritage portal based on OCFL and RO-Crate |
| ONI express [115] | Repository managers | Platform | Beta | Platform for publishing data and documents stored in an OCFL repository via a Web interface |
| ocfl-tools [52] | Developers | JavaScript (CLI) | Beta | Tools for managing RO-Crates in an OCFL repository |
| RO Composer [8] | Repository developers | Java | Alpha | REST API for gradually building ROs for a given profile |
| RDA maDMP Mapper [7] | Data Management Plan users | Python | Beta | Mapping between machine-actionable data management plans (maDMP) and RO-Crate [87] |
| RO-Crate_2_ma-DMP [20] | Data Management Plan users | Python | Beta | Convert between machine-actionable data management plans (maDMP) and RO-Crate |
| CheckMyCrate [13] | Developers | Python (CLI) | Alpha | Validation according to the Workflow RO-Crate profile |
| RO-Crates-and-Excel [126] | Data Managers | Java (CLI) | Alpha | Describe column/data details of spreadsheets as RO-Crate using the DataCube vocabulary |

Table 1: Applications and libraries implementing RO-Crate, targeting different types of users across multiple programming languages. Status is indicative as assessed by this work (Alpha < Beta < Release Candidate (RC) < Release).

Profiles of RO-Crate in use

RO-Crate fundamentally forms part of an infrastructure to help build FAIR research artefacts. In other words, the key question is whether RO-Crate can be used to share and (re)use research artefacts. Here we look at three research domains where RO-Crate is being applied: Bioinformatics, Regulatory Science and Cultural Heritage. In addition, we note how RO-Crate may have an important role as part of machine-actionable data management plans and institutional repositories.

From these varied uses of RO-Crate we observe natural differences in their level of detail and in the type of entities described by the RO-Crate. For instance, on submission of an RO-Crate to a workflow repository, it is reasonable to expect the RO-Crate to contain at least one workflow, ideally with a declared licence and workflow language. Specific additional recommendations, such as on identifiers, are also needed to meet the emerging requirements of FAIR Digital Objects. Work has now begun7 to formalise these different profiles of RO-Crates, which may impose additional constraints based on the needs of a specific domain or use case.

Bioinformatics workflows

WorkflowHub [124] is a European cross-domain registry of computational workflows, supported by European Open Science Cloud projects, e.g. EOSC-Life, and research infrastructures including the pan-European bioinformatics network ELIXIR [34]. As part of promoting workflows as reusable tools, WorkflowHub includes documentation and high-level rendering of the workflow structure independent of its native workflow definition format. The rationale is that a domain scientist can browse all relevant workflows for their domain, before narrowing down their workflow engine requirements. As such, WorkflowHub is intended largely as a registry of workflows already deposited in repositories specific to particular workflow languages and domains, such as [10] and Nextflow nf-core [45].

We here describe three different RO-Crate profiles developed for use with WorkflowHub.

Profile for describing workflows

Being cross-domain, WorkflowHub has to cater for many different workflow systems. Many of these, for instance Nextflow [39] and Snakemake [73], by virtue of their script-like nature, reference multiple neighbouring files typically maintained in a GitHub repository. This calls for a data exchange method that allows keeping related files together. WorkflowHub has tackled this problem by adopting RO-Crate as the packaging mechanism [17], typing and annotating the constituent files of a workflow and — crucially — marking up the workflow language, as many workflow engines use common file extensions like *.xml and *.json. Workflows are further described with authors, license, diagram previews and a listing of their inputs and outputs. RO-Crates can thus be used for interoperable deposition of workflows to WorkflowHub, but are also used as an archive for downloading workflows, embedding metadata registered with the WorkflowHub entry and translated workflow files such as abstract Common Workflow Language (CWL) [36] definitions and diagrams [56].
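A minimal, illustrative check in the spirit of the Workflow RO-Crate profile might verify that the crate's root dataset points at a typed workflow with a declared language; the entity layout below is a simplified assumption, not the normative profile:

```python
# Simplified sketch of a Workflow RO-Crate graph: the root dataset's
# mainEntity is a workflow file, typed as a ComputationalWorkflow and
# marked up with its workflow language (entity values are illustrative).
graph = [
    {"@id": "./", "@type": "Dataset", "mainEntity": {"@id": "workflow.ga"}},
    {"@id": "workflow.ga",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
     "programmingLanguage": {"@id": "#galaxy"}},
    {"@id": "#galaxy", "@type": "ComputerLanguage", "name": "Galaxy"},
]
entities = {e["@id"]: e for e in graph}

# A profile-style check: there is a main workflow, and its language is
# declared rather than inferred from an ambiguous file extension.
main = entities[entities["./"]["mainEntity"]["@id"]]
assert "ComputationalWorkflow" in main["@type"]
assert "programmingLanguage" in main
print(entities[main["programmingLanguage"]["@id"]]["name"])  # Galaxy
```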

RO-Crate therefore acts as an interoperability layer between registries, repositories and users in WorkflowHub. The iterative development between WorkflowHub developers and the RO-Crate community heavily informed the creation of the Bioschemas [58] profile for Computational Workflows, which in turn informed the RO-Crate 1.1 specification on workflows and led to the RO-Crate Python library [41], as well as WorkflowHub’s Workflow RO-Crate profile, which, in a similar fashion to RO-Crate itself, recommends which workflow resources and descriptions are required. This co-development across project boundaries exemplifies the drive for simplicity and for establishing best practices.

Profile for recording workflow runs

RO-Crates in WorkflowHub have so far focused on workflows that are ready to be run; the WorkflowHub developers are now creating a Workflow Run RO-Crate profile8 for the purposes of benchmarking, testing and executing workflows. As such, RO-Crate serves as a container of both a workflow definition that may be executed and a particular workflow execution with test results.

This workflow run profile is a continuation of our previous work on capturing workflow provenance in a Research Object in CWLProv [68] and TavernaPROV [110]. In both cases, we used the PROV Ontology [81], including details of every task execution with all the intermediate data, which required significant workflow engine integration.9

Simplifying from the CWLProv approach, the planned Workflow Run RO-Crate profile will use high-level provenance for the input/output boundary of the overall workflow execution. This Level 1 workflow provenance [68] can be expressed generally across workflow languages with minimal workflow engine changes, with the option of more detailed provenance traces included as separate PROV artefacts in the RO-Crate as data entities. In the current development of the Specimen Data Refinery [122], these RO-Crates will document the text recognition workflow runs of digitised biological specimens, exposed as FAIR Digital Objects [38].

WorkflowHub has recently enabled minting of Digital Object Identifiers (DOIs), a PID commonly used for scholarly artefacts, for registered workflows, e.g. 10.48546/workflowhub.workflow.56.1 [83], lowering the barrier for citing workflows as computational methods along with their FAIR metadata – captured within an RO-Crate. While it is not an aim for WorkflowHub to be a repository of workflow runs and their data, RO-Crates of exemplar workflow runs serve as useful workflow documentation, as well as being an exchange mechanism that preserves FAIR metadata in a diverse workflow execution environment.

Profile for testing workflows

The value of computational workflows, however, is potentially undermined by the “collapse” over time of the software and services they depend upon: for instance, software dependencies can change in a non-backwards-compatible manner, or active maintenance may cease; an external resource, such as a reference index or a database query service, could shift to a different URL or modify its access protocol; or the workflow itself may develop hard-to-find bugs as it is updated. This workflow decay can take a big toll on the workflow’s reusability and on the reproducibility of any processes it involves [125].

For this reason, WorkflowHub is complemented by a monitoring and testing service called LifeMonitor [35], also supported by EOSC-Life. LifeMonitor’s main goal is to assist in the creation, periodic execution and monitoring of workflow tests, enabling the early detection of software collapse in order to minimise its detrimental effects. The communication of metadata related to workflow testing is achieved through the adoption of a Workflow Testing RO-Crate profile stacked on top of the Workflow RO-Crate profile. This further specialisation of Workflow RO-Crate allows specifying additional testing-related entities (test suites, instances, services, etc.), leveraging RO-Crate’s extension mechanism through the addition of terms from custom namespaces.

In addition to showcasing RO-Crate’s extensibility, the testing profile is an example of the format’s flexibility and adaptability to the different needs of the research community. Though ultimately related to a computational workflow, most of the testing-specific entities describe a protocol for interacting with a monitoring service rather than a set of research outputs and their associated metadata. Indeed, one of LifeMonitor’s main functionalities is monitoring and reporting on test suites running on existing Continuous Integration (CI) services, described in the testing profile in terms of service URLs and job identifiers. In principle, in this context, the data could disappear altogether, leading to an RO-Crate consisting entirely of contextual entities. Such an RO-Crate acts more as an exchange format for communication between services (WorkflowHub and LifeMonitor) than as an aggregator for research data and metadata, providing a good example of the format’s versatility.
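The shape of such a "data-free" crate can be sketched as follows; the term namespace and property names here are placeholders for illustration, not the normative Workflow Testing profile:

```python
# Illustrative contextual entities in the spirit of the Workflow Testing
# profile: a test suite and the CI instance that runs it. The namespace
# below is a placeholder assumption, not the profile's real vocabulary.
TEST_NS = "https://example.org/terms/test#"

graph = [
    {"@id": "#suite-1", "@type": TEST_NS + "TestSuite",
     "mainEntity": {"@id": "workflow.ga"},
     "instance": [{"@id": "#instance-1"}]},
    {"@id": "#instance-1", "@type": TEST_NS + "TestInstance",
     "url": "https://ci.example.org",   # CI service URL being monitored
     "resource": "jobs/42"},            # job identifier on that service
]

# Note: no data entities at all -- an RO-Crate made purely of contextual
# entities, acting as an exchange format between two services.
instance = graph[1]
print(instance["url"])  # https://ci.example.org
```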

Regulatory Sciences

BioCompute Objects (BCO) [5] is a community-led effort to standardise submissions of computational workflows to biomedical regulators. For instance, a genomics sequencing pipeline, as part of a personalised cancer treatment study, can be submitted to the US Food and Drug Administration (FDA) for approval. BCOs are formalised in the standard IEEE 2791-2020 [64] as a combination of JSON Schemas that define the structure of JSON metadata files describing exemplar workflow runs in detail, covering aspects such as the usability and error domain of the workflow, its runtime requirements, the reference datasets used and representative output data produced.

BCOs provide a structured view over a particular workflow, informing regulators about its workings independently of the underlying workflow definition language. However, BCOs have only limited support for additional metadata.10 For instance, while the BCO itself can indicate authors and contributors, and in particular regulators and their review decisions, it cannot describe the provenance of individual data files or workflow definitions.

As a custom JSON format, BCOs cannot be extended with Linked Data concepts, except by adding an additional top-level JSON object formalised in another JSON Schema. A BCO and workflow submitted by upload to a regulator will also frequently consist of multiple cross-related files. Crucially, there is no way to tell whether a given *.json file is a BCO file, except by reading its content and checking for its spec_version.
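This content-sniffing requirement can be sketched as follows (the spec_version value used in the demo is illustrative):

```python
import json
import tempfile

def looks_like_bco(path):
    """Sketch of the only available heuristic: open the JSON file and
    check for a spec_version field; the file name alone is not enough."""
    try:
        with open(path) as f:
            doc = json.load(f)
    except (OSError, ValueError):
        return False
    return isinstance(doc, dict) and "spec_version" in doc

# Demo with an invented BCO fragment written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"spec_version": "https://example.org/2791object.json"}, f)
    bco_path = f.name

print(looks_like_bco(bco_path))  # True
```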

We can then consider how a BCO and its referenced artefacts can be packaged and transferred following FAIR principles. BCO RO-Crate [109], part of the BioCompute Object user guides, defines a set of best practices for wrapping a BCO with a workflow, together with its exemplar outputs in an RO-Crate, which then provides typing and additional provenance metadata of the individual files, workflow definition, referenced data and the BCO metadata itself.

Here the BCO is responsible for describing the purpose of a workflow and its run at an abstraction level suitable for a domain scientist, while the more open-ended RO-Crate describes the surroundings of the workflow, classifying and relating its resources and providing provenance of their existence beyond the BCO. This emerging separation of concerns is shown in Figure 3, and highlights how RO-Crate is used side by side with existing standards and tooling, even where there are apparent partial overlaps.

A similar separation of concerns can be found when considering the RO-Crate as a set of files, where transport-level metadata, such as file checksums, are delegated to separate BagIt manifests, a standard focusing on the preservation challenges of digital libraries [74]. As such, RO-Crate metadata files are not required to iterate over all the files in their folder hierarchy, only those that benefit from being described.
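The BagIt side of this division of labour is plain text: each payload file gets a "checksum, two spaces, path" line in a manifest such as manifest-sha256.txt (RFC 8493). A minimal sketch, with an invented payload:

```python
import hashlib

# Transport-level checksums live in the BagIt manifest, leaving the
# RO-Crate metadata free to describe only the files worth describing.
# The payload content here is an invented example.
payload = {"data/ro-crate-metadata.json": b'{"@graph": []}'}

manifest_lines = [
    "{}  {}".format(hashlib.sha256(body).hexdigest(), path)
    for path, body in payload.items()
]
print(manifest_lines[0].split("  ")[1])  # data/ro-crate-metadata.json
```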

Specifically, a BCO description alone is insufficient for reliable re-execution of a workflow, which would need a compatible workflow engine depending on the original workflow definition language, so IEEE 2791 recommends using Common Workflow Language (CWL) [36] for interoperable pipeline execution. CWL itself relies on tool packaging in software containers using Docker or Conda. Thus, we can consider BCO RO-Crate as a stack: transport-level manifests of files (BagIt), provenance, typing and context of those files (RO-Crate), workflow overview and purpose (BCO), interoperable workflow definition (CWL) and tool distribution (Docker).

BioCompute Object (IEEE2791) is a JSON file that structurally explains the purpose and implementation of a computational workflow, for instance implemented in Common Workflow Language (CWL), that installs the workflow’s software dependencies as Docker containers or BioConda packages. An example execution of the workflow shows the different kinds of result outputs, which may be external, using GitHub LFS [85] to support larger data. RO-Crate gathers all these local and external resources, relating them and giving individual descriptions, for instance permanent DOI identifiers for reused datasets accessed from Zenodo, but also adding external identifiers to attribute authors using ORCID or to identify which licences apply to individual resources. The RO-Crate and its local files are captured in a BagIt whose checksum ensures completeness, combined with Big Data Bag [25] features to “complete” the bag with large external files such as the workflow outputs.

Figure 3: Separation of Concerns in BCO RO-Crate


Digital Humanities: Cultural Heritage

The Pacific And Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) [114] maintains a repository of more than 500,000 files documenting endangered languages across more than 16,000 items, collected and digitised over many years by researchers interviewing and recording native speakers across the region.

The Modern PARADISEC demonstrator has been proposed as an update to the 18-year-old infrastructure, also helping the long-term preservation of these artefacts in their digital form. The demonstrator uses RO-Crate to describe the overall structure and to capture the metadata of each item. The existing PARADISEC data collection has been ported and captured as RO-Crates. A Web portal then exposes the repository and its entries by indexing the RO-Crate metadata files, presenting a domain-specific view of the items: the RO-Crate is “hidden” and does not change the user interface.

The PARADISEC use case takes advantage of several RO-Crate features and principles. Firstly, the transcribed metadata are now independent of the PARADISEC platform and can be archived, preserved and processed in their own right, using schema.org as the base vocabulary, extended with PARADISEC-specific terms.

In this approach, RO-Crate is the holder of itemised metadata, stored in regular files that are organised using the Oxford Common File Layout (OCFL) [96], which ensures file integrity and versioning on a regular shared file system. This lightweight infrastructure also gives flexibility for future development and maintenance. For example, a consumer can use Linked Data software such as a graph database to query the whole corpus with SPARQL triple patterns across multiple RO-Crates. For long-term digital preservation, beyond the lifetime of the PARADISEC portals, a “last resort” fallback is storing the generic RO-Crate HTML preview [95]. Such human-readable rendering of RO-Crates can be hosted as static files by any Web server, in line with the approach taken by the Endings Project.11

Machine-actionable Data Management Plans

Machine-actionable Data Management Plans (maDMPs) have been proposed as an improvement to automate FAIR data management tasks in research [88]; maDMPs use PIDs and controlled vocabularies to describe what happens to data over the research life cycle [22]. The Research Data Alliance’s DMP Common Standard for maDMPs [121] is one such formalisation for expressing maDMPs, which can be expressed as Linked Data using the DMP Common Standard Ontology [21], a specialisation of the W3C Data Catalog Vocabulary (DCAT) [3]. RDA maDMPs are usually expressed using regular JSON, conforming to the DMP JSON Schema.

A mapping has been produced between Research Object Crates and machine-actionable Data Management Plans [87], implemented by the RO-Crate RDA maDMP Mapper [7]. A similar mapping has been implemented by RO-Crate_2_ma-DMP [20]. In both cases, a maDMP can be converted to an RO-Crate, or vice versa. In [87] this functionality caters for two use cases:

  1. Start a skeleton data management plan based on an existing RO-Crate dataset, e.g. an RO-Crate from WorkflowHub.
  2. Instantiate an RO-Crate based on a data management plan.
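A much-simplified sketch of use case 2, turning a skeleton maDMP into RO-Crate entities (field names follow the DMP Common Standard in spirit; the mapping itself is illustrative, not the implementation of [7] or [20]):

```python
# Invented, minimal maDMP fragment: a plan with one planned dataset.
madmp = {
    "dmp": {
        "title": "Example plan",
        "dataset": [
            {"title": "Survey results", "description": "Raw responses"},
        ],
    }
}

# Instantiate an RO-Crate graph from the plan: the root dataset takes the
# plan's title; each planned dataset becomes a contextual entity, since
# the data itself does not exist yet at planning time.
graph = [{"@id": "./", "@type": "Dataset",
          "name": madmp["dmp"]["title"], "hasPart": []}]
for i, ds in enumerate(madmp["dmp"]["dataset"]):
    entity_id = f"#dataset-{i}"   # local identifier: no file yet
    graph.append({"@id": entity_id, "@type": "Dataset",
                  "name": ds["title"], "description": ds["description"]})
    graph[0]["hasPart"].append({"@id": entity_id})

print(graph[0]["hasPart"])  # [{'@id': '#dataset-0'}]
```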

An important nuance here is that data management plans are (ideally) written in advance of data production, while RO-Crates are typically created to describe data after it has been generated. What is significant to note in this approach is the importance of templating in order to make both tasks automatable and achievable, and how RO-Crate can fit into earlier stages of the research life cycle.

Institutional data repositories – Harvard Data Commons

The concept of a Data Commons for research collaboration was originally defined as “cyber-infrastructure that co-locates data, storage, and computing infrastructure with commonly used tools for analysing and sharing data to create an interoperable resource for the research community” [59]. More recently, Data Commons has come to mean the integration of active data-intensive research with data management and archival best practices, along with a supporting computational infrastructure. Furthermore, the Commons features tools and services, such as computation clusters and storage for scalability, data repositories for disseminating and preserving regular, but also large or sensitive, datasets, and other research assets. Multiple initiatives have been undertaken to create Data Commons at national, research-community, and institutional levels. For example, the Australian Research Data Commons (ARDC) [11] is a national initiative that enables local researchers and industries to access computing infrastructure, training, and curated datasets for data-intensive research. The NCI Genomic Data Commons (GDC) [65] provides the cancer research community with access to a vast volume of genomic and clinical data. Initiatives such as the Research Data Alliance (RDA) Global Open Research Commons propose standards for the implementation of Data Commons, preventing them from becoming “data silos” and enabling interoperability from one Data Commons to another.

Harvard Data Commons [33] aims to address the challenges of data access and cross-disciplinary research within a research institution. It brings together multiple institutional schools, libraries, computing centres and the Harvard Dataverse data repository. Dataverse [32] is a free and open-source software platform to archive, share and cite research data. The Harvard Dataverse repository is the largest of 70 Dataverse installations worldwide, containing over 120K datasets with about 1.3M data files (as of 2021-11-16). Working toward the goal of facilitating collaboration and data discoverability and management within the university, Harvard Data Commons has the following primary objectives:

  1. The integration of Harvard Research Computing with Harvard Dataverse by leveraging Globus endpoints [27]; this will allow an automatic transfer of large datasets to the repository. In some cases, only the metadata will be transferred while the data stays stored in remote storage.
  2. Support for advanced research workflows and providing packaging options for assets such as code and workflows in the Harvard Dataverse repository to enable reproducibility and reuse.
  3. Integrating repositories supported by Harvard, which include DASH, the open access institutional repository, the Digital Repository Services (DRS) for preserving digital asset collections, and the Harvard Dataverse.

Particularly relevant to this article is the second objective of the Harvard Data Commons, which aims to support the deposit of research artefacts to Harvard Dataverse with sufficient information in the metadata to allow their future reuse (Figure 4). To support the incorporation of data, code, and other artefacts from various institutional infrastructures, Harvard Data Commons is currently working on RO-Crate adoption. The RO-Crate metadata provides the necessary structure to make all research artefacts FAIR. The Dataverse software already has extensive support for metadata standards, including the Data Documentation Initiative (DDI), Dublin Core, DataCite, and schema.org. Incorporating RO-Crate, which has the flexibility to describe a wide range of research resources, will facilitate their seamless transition from one infrastructure to another within the Harvard Data Commons.

Even though the Harvard Data Commons is specific to Harvard University, the overall vision and the three objectives can be abstracted and applied to other universities or research organisations. The Commons will be designed and implemented using standards and commonly-used approaches to make it interoperable and reusable by others.

Figure 4: One aspect of Harvard Data Commons: automatic encapsulation and deposit of artefacts from data management tools used during active research at the Harvard Dataverse repository.

Related Work

With the increasing digitisation of research processes, there has been a significant call for the wider adoption of interoperable sharing of data and its associated metadata. We refer to [72] for a comprehensive overview and recommendations, in particular for data; notably, that review highlights the wide variety of metadata and documentation that the literature prescribes for enabling data reuse. Likewise, we point to [82], which covers the importance of metadata standards in reproducible computational research.

Here we focus on approaches for bundling research artefacts along with their metadata. This notion of publishing compound objects for scholarly communication has a long history behind it [29] [117], but recent approaches have followed three main strands: 1) publishing to centralised repositories; 2) packaging approaches similar to RO-Crate; and 3) bundling the computational workflow around a scientific experiment.

Bundling and Packaging Digital Research Artefacts

Early work making the case for publishing compound scholarly communication units [117] led to the development of the Open Archives Initiative Object Reuse and Exchange model (OAI-ORE), providing a structured resource map of the digital artefacts that together support a scholarly output.

The challenge of describing computational workflows was one of the main motivations for the early proposal of Research Objects (RO) [12] as first-class citizens for sharing and publishing. The RO approach involves bundling datasets, workflows, scripts and results along with traditional dissemination materials like journal articles and presentations, forming a single package. Crucially, these resources are not just gathered, but also individually typed, described and related to each other using semantic vocabularies. As pointed out in [12], an open-ended Linked Data approach is not sufficient for scholarly communication: a common data model is also needed, in addition to shared best practices for managing and annotating lifecycle, ownership, versioning and attribution.

Considering the FAIR principles [123], we can say with hindsight that the initial RO approaches strongly targeted Interoperability, with a particular focus on the reproducibility of in-silico experiments involving computational workflows and the reuse of existing RDF vocabularies.

The first implementation of Research Objects for sharing workflows in myExperiment [57] was based on RDF ontologies [93], building on Dublin Core, FOAF, SIOC, Creative Commons and OAI-ORE to form myExperiment ontologies for describing social networking, attribution and credit, annotations, aggregation packs, experiments, view statistics, contributions, and workflow components [92].

This initially workflow-centric approach was further formalised as the Wf4Ever Research Object Model [14], which is a general-purpose research artefact description framework. This model is based on existing ontologies (FOAF, Dublin Core Terms, OAI-ORE and AO/OAC precursors to the W3C Web Annotation Model [28]) and adds specializations for workflow models and executions using W3C PROV-O [81]. The Research Object statements are saved in a manifest (the OAI-ORE resource map), with additional annotation resources containing user-provided details such as title and description.

We now claim that one barrier to wider adoption of the Wf4Ever Research Object model for general packaging of digital research artefacts was exactly this re-use of multiple existing vocabularies (FAIR principle I2: Metadata use vocabularies that follow FAIR principles), which in itself is recognised as a challenge [67]. Adopters of the Wf4Ever RO model would have to navigate documentation of multiple overlapping ontologies, in addition to facing the usual Semantic Web development choices for RDF serialization formats, identifier minting and publishing resources on the Web.

Several developments for Research Objects improved on this situation, such as ROHub, used in the Earth Sciences [48], which provides a user interface for making Research Objects, and Research Object Bundle (RO Bundle) [111], a ZIP archive embedding data files and a JSON-LD serialization of the manifest with mappings for a limited set of terms. RO Bundle was also used for storing detailed workflow run provenance (TavernaPROV [110]).

RO Bundle later evolved into Research Object BagIt archives, a variant packaged as a BagIt archive [74], used by Big Data Bags [25], CWLProv [68] and WholeTale [76] [26].

FAIR Digital Objects

FAIR Digital Objects (FDO) [38] have been proposed as a conceptual framework for making digital resources available in a Digital Objects (DO) architecture which encourages active use of the objects and their metadata. In particular, an FDO has five parts: (i) The FDO content, bit sequences stored in an accessible repository; (ii) a Persistent Identifier (PID) such as a DOI that identifies the FDO and can resolve these same parts; (iii) Associated rich metadata, as separate FDOs; (iv) Type definitions, also separate FDOs; (v) Associated operations for the given types. A Digital Object typed as a Collection aggregates other DOs by reference.
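The five FDO parts enumerated above can be sketched as a simple data structure (a conceptual illustration of the model only, not an implementation of the DO architecture; all handle-style identifiers are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List

# Conceptual sketch of the five FDO parts; identifiers are made up.
@dataclass
class FDORecord:
    pid: str                      # (ii) persistent identifier, e.g. a DOI
    content_location: str         # (i) where the bit sequences are stored
    metadata_pids: List[str]      # (iii) rich metadata, as separate FDOs
    type_pids: List[str]          # (iv) type definitions, as separate FDOs
    operations: List[str] = field(default_factory=list)  # (v) typed operations

# A Collection-typed DO aggregates other DOs by reference (by PID):
collection = FDORecord(
    pid="hdl:21.example/collection-1",
    content_location="https://repo.example.org/collection-1",
    metadata_pids=["hdl:21.example/meta-1"],
    type_pids=["hdl:21.example/type/Collection"],
)
members = ["hdl:21.example/do-a", "hdl:21.example/do-b"]
```

The key point is that each part (metadata, types, members) is itself referenced by PID rather than embedded, which is what distinguishes the FDO model from a self-contained package.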

The Digital Object Interface Protocol [47] can be considered an “abstract protocol” of requirements; DOs could be implemented in multiple ways. One suggested implementation is the FAIR Digital Object Framework, based on HTTP and the Linked Data Principles. While there is agreement on using PIDs based on DOIs, consensus on how to represent common metadata, core types and collections as FDOs has not yet been reached. We argue that RO-Crate can play an important role for FDOs:

  1. By providing a predictable and extensible serialisation of structured metadata.
  2. By formalising how to aggregate digital objects as collections (and adding their context).
  3. By providing a natural Metadata FDO in the form of the RO-Crate Metadata File.
  4. By being based on Linked Data and vocabulary, meaning that PIDs already exist for common types and properties.

At the same time, it is clear that the goal of FDO is broader than that of RO-Crate; namely, FDOs are active objects with distributed operations, and add further constraints such as PIDs for every element. These features improve the FAIRness of digital objects and would also be useful for RO-Crate, but they place significant demands on the infrastructure that must be implemented and maintained in order for FDOs to remain accessible. RO-Crate, on the other hand, is more flexible: it can minimally be used within any file system structure, or ideally exposed through a range of Web-based scenarios. A FAIR profile of RO-Crate (e.g. one enforcing PID usage) would fit well within a FAIR Digital Object ecosystem.

Packaging Workflows

The use of computational workflows, typically combining a chain of tools in an analytical pipeline, has gained prominence, in particular in the life sciences. Workflows might be used primarily to improve computational scalability, but also to assist in making computed data results FAIR [55], for instance by improving reproducibility [30], and because programmatic data usage helps propagate metadata and provenance [69]. At the same time, workflows raise additional FAIR challenges, since they can be considered important research artefacts themselves. This viewpoint poses the problem of capturing and explaining the computational methods of a pipeline in sufficient machine-readable detail [80].

Even when researchers follow current best practices for workflow reproducibility [60] [30], the communication of computational outcomes through traditional academic publishing routes effectively adds barriers, as authors are forced to rely on a textual manuscript representation. This hinders reproducibility and FAIR use of the knowledge previously captured in the workflow.

As a real-life example, let us look at a metagenomics article [4] that describes a computational pipeline. Here the authors have gone to extraordinary efforts to document the individual tools that were reused, including their citations, versions, settings, parameters and combinations. The Methods section is two pages in tight double columns with twenty-four additional references, supported by the availability of data on an FTP server (60 GB) [43] and of open source code on GitHub (Finn-Lab/MGS-gut [44]), including the pipeline as shell scripts and associated analysis scripts in R and Python.

This attention to reporting detail for computational workflows is unfortunately not yet the norm, and although bioinformatics journals have strong data availability requirements, they frequently do not require authors to include or cite software, scripts and pipelines used for analysing and producing results [108] [archived 2021-05-04]. Indeed, in the absence of a specific requirement and an editorial policy to back it up – such as eliminating the reference limit – authors are effectively discouraged from properly and comprehensively citing software [53].

However detailed this additional information might be, another researcher who wants to reuse a particular computational method may first want to assess if the described tool or workflow is Re-runnable (executable at all), Repeatable (same results for original inputs on same platform), Reproducible (same results for original inputs with different platform or newer tools) and ultimately Reusable (similar results for different input data), Repurposable (reusing parts of the method for making a new method) or Replicable (rewriting the workflow following the method description) [15] [54].

Following the textual description alone, researchers would be forced to jump straight to evaluating “Replicable” by rewriting the pipeline from scratch. This can be expensive and error-prone. They would first need to install all the software dependencies and download reference datasets; this can be a daunting task, which may have to be repeated multiple times, as workflows are typically developed at small scale on desktop computers, scaled up to local clusters, and potentially put into production on cloud instances, each of which has different requirements for software installation.

In recent years the situation has been greatly improved by software packaging and container technologies like Docker and Conda. These technologies have been increasingly adopted in the life sciences [90], thanks to collaborative efforts such as BioConda [61] and BioContainers [37], and to support by Linux distributions (e.g. Debian Med [89]). As of November 2021, more than 9,000 software packages are available in BioConda alone, and more than 10,000 containers in BioContainers.

Docker and Conda have been integrated into workflow systems such as Snakemake [73], Galaxy [1] and Nextflow [39], meaning a downloaded workflow definition can now be executed on a “blank” machine (except for the workflow engine), with the underlying analytical tools installed on demand. Even with containers, reproducibility challenges remain: for instance, Docker Hub’s retention policy will expire container images after six months, and failing to record the versions of transitive Conda dependencies can cause incompatibilities if those packages are subsequently updated.

These container and package systems capture only small amounts of metadata. In particular, they do not capture any of the semantic relationships between their contents. Understanding these relationships is made harder by the opaque wrapping of arbitrary tools with unclear functionality, licenses and attributions.

From this we see that computational workflows are themselves complex digital objects that need to be recorded not just as files, but in the context of their execution environment, dependencies and analytical purpose in research – as well as other metadata (e.g. version, license, attribution and identifiers).

It is important to note that having all these computational details in order to represent them in an RO-Crate is an ideal scenario – in practice there will always be gaps of knowledge, and exposing all provenance details automatically would require improvements to the data sources, workflow, workflow engine and its dependencies. RO-Crate can be seen as a flexible annotation mechanism for augmenting automatic workflow provenance. Additional metadata can be added manually, e.g. for sensitive clinical data that cannot be publicly exposed, or to cite software that lacks persistent identifiers. This inline FAIRifying allows researchers to achieve “just enough FAIR” to explain their computational experiments.
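Such manual augmentation can be sketched as follows, using only Python's standard json module rather than any particular RO-Crate library; the crate content, tool name and version are hypothetical. A contextual entity describing a software dependency is appended to the crate's flat @graph and referenced from the root dataset, so the tool is cited even though it lacks a PID:

```python
import json

# A minimal, hypothetical crate metadata document to augment.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "./", "@type": "Dataset", "name": "Workflow results"},
    ],
}

def add_software(crate, entity_id, name, version):
    """Append a contextual SoftwareApplication entity and reference it
    from the root dataset entity."""
    graph = crate["@graph"]
    graph.append({
        "@id": entity_id,
        "@type": "SoftwareApplication",
        "name": name,
        "softwareVersion": version,
    })
    root = next(e for e in graph if e["@id"] == "./")
    root.setdefault("mentions", []).append({"@id": entity_id})
    return crate

add_software(crate, "#assembler-tool", "ExampleAssembler", "2.3.1")
print(json.dumps(crate, indent=2))
```

Because the annotation is plain JSON-LD, it degrades gracefully: consumers that do not understand SoftwareApplication still see a valid crate.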


Discussion

RO-Crate has been established as an approach to packaging digital research artefacts with structured metadata. This approach assists developers and researchers in producing and consuming FAIR archives of their research.

RO-Crate is formed by a set of best practice recommendations, developed by an open and broad community. These guidelines show how to use “just enough” standards in a consistent way. The use of structured metadata with a rich base vocabulary can cover general-purpose contextual relations, with a Linked Data foundation that ensures extensibility to domain- and application-specific uses. We can therefore consider an RO-Crate not just as a structured data archive, but as a multimodal scholarly knowledge graph that can help “FAIRify” and combine metadata of existing resources.
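Viewing the crate as a knowledge graph is straightforward in practice: the flat @graph of the metadata file can be indexed by @id and traversed by following entity references. A minimal sketch (standard json module only; the crate content is hypothetical):

```python
import json

# A small, hypothetical ro-crate-metadata.json document.
metadata = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "./", "@type": "Dataset", "author": {"@id": "#alice"}},
    {"@id": "#alice", "@type": "Person", "name": "Alice Example"}
  ]
}
"""

crate = json.loads(metadata)

# Index the flattened JSON-LD entities by their identifier.
by_id = {e["@id"]: e for e in crate["@graph"]}

# Follow the reference from the root dataset to its author entity.
root = by_id["./"]
author = by_id[root["author"]["@id"]]
print(author["name"])  # -> Alice Example
```

The same index-and-follow pattern scales to queries across multiple crates, since every entity reference is just another @id lookup.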

The adoption of simple Web technologies in the RO-Crate specification has helped a rapid development of a wide variety of supporting open source tools and libraries. RO-Crate fits into the larger landscape of open scholarly communication and FAIR Digital Object infrastructure, and can be integrated into data repository platforms. RO-Crate can be applied as a data/metadata exchange mechanism, assist in long-term archival preservation of metadata and data, or simply used at a small scale by individual researchers. Thanks to its strong community support, new and improved profiles and tools are being continuously added to the RO-Crate landscape, making it easier for adopters to find examples and support for their own use case.

Strictness vs flexibility

There is always a tradeoff between flexibility and strictness [116] when deciding on the semantics of metadata models. Strict requirements make it easier for users and code to consume and populate a model, by reducing choices and having mandated “slots” to fill in. But such rigidity can also restrict the richness and applicability of the model, as it in turn enforces the initial assumptions about what can be described.

RO-Crate attempts to strike a balance between these tensions, and provides a common metadata framework that encourages extensions. However, just as the RO-Crate specification can be thought of as a core profile of schema.org in JSON-LD, we cannot overstate the importance of also establishing domain-specific RO-Crate profiles and conventions, as explored in the sections on extensibility and profiles in use. Specialization comes hand-in-hand with the principle of graceful degradation: RO-Crate applications and users are free to choose the semantic detail level at which they participate, as long as they follow the common syntactic requirements.
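In practice, graceful degradation can mean that a consumer inspects the profiles declared via conformsTo on the metadata descriptor and falls back to generic handling when it does not recognise them. A minimal sketch (the second profile URI is hypothetical):

```python
import json

def declared_profiles(crate):
    """Return the profile IDs declared on the crate's metadata descriptor."""
    desc = next(
        (e for e in crate["@graph"] if e["@id"] == "ro-crate-metadata.json"),
        None,
    )
    if desc is None:
        return []
    conforms = desc.get("conformsTo", [])
    if isinstance(conforms, dict):  # single reference rather than a list
        conforms = [conforms]
    return [c["@id"] for c in conforms]

crate = json.loads("""
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "conformsTo": [{"@id": "https://w3id.org/ro/crate/1.1"},
                    {"@id": "https://example.org/profiles/workflow-run"}],
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset"}
  ]
}
""")

profiles = declared_profiles(crate)
if "https://example.org/profiles/workflow-run" in profiles:
    print("specialised handling")
else:
    print("generic Dataset handling")  # fallback for unknown profiles
```

A consumer that only understands the base specification simply takes the fallback branch; nothing in the crate becomes unreadable.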

Future Work

The direction of future RO-Crate work is determined by the community around it as a collaborative effort. We currently plan on further outreach, building training material (including a comprehensive entry-level tutorial) and maturing the reference implementation libraries. We will also collect and build examples of RO-Crate consumption, e.g. Jupyter Notebooks that query multiple crates using knowledge graphs. In addition, we are exploring ways to support some entity types requested by users, e.g. detailed workflow runs or container provenance, which do not have a good match in schema.org. Such support could be added, for instance, by integrating other vocabularies or by having separate (but linked) metadata files.

Furthermore, we want to better understand how the community uses RO-Crate in practice and how it contrasts with other related efforts; this will help us to improve our specification and tools. By discovering commonalities in emerging usage (e.g. additional types), the community helps to reduce divergence that could otherwise occur with proliferation of further RO-Crate profiles. We plan to gather feedback via user studies, with the Linked Open Data community or as part of EOSC Bring-your-own-Data training events.

We operate in an open community where future and potential users of RO-Crate are actively welcomed to participate and contribute feedback and requirements. In addition, we are targeting a wider audience through extensive outreach activities and by initiating new connections. Recent contacts include the American Geophysical Union (AGU) on the Data Citation Reliquary [2], the National Institute of Standards and Technology (NIST) on materials science, and InvenioRDM, used by the Zenodo data repository. New Horizon Europe projects adopting RO-Crate include BY-COVID, which aims to improve FAIR access to data on COVID-19 and other infectious diseases.

The main addition in the upcoming 1.2 release of the RO-Crate specifications will be the formalization of profiles for different categories of crates. Additional entity types have been requested by users, e.g. workflow runs, business workflows, containers and software packages, tabular data structures; these are not always matched well with existing types but may benefit from other vocabularies or even separate metadata files, e.g. from Frictionless Data. We will be further aligning and collaborating with related research artefact description efforts like CodeMeta for software metadata, [66] for datasets, FAIR Digital Objects [38] and activities in EOSC task forces including the EOSC Interoperability Framework [75].


Acknowledgements

This work has received funding from the European Commission’s Horizon 2020 research and innovation programme for projects BioExcel-2 (H2020-INFRAEDI-2018-1 823830), IBISBA 1.0 (H2020-INFRAIA-2017-1-two-stage 730976), PREP-IBISBA (H2020-INFRADEV-2019-2 871118), EOSC-Life (H2020-INFRAEOSC-2018-2 824087), SyntheSys+ (H2020-INFRAIA-2018-1 823827). From the Horizon Europe Framework Programme this work has received funding for BY-COVID (HORIZON-INFRA-2021-EMERGENCY-01 101046203).

Björn Grüning is supported by DataPLANT (NFDI 7/1 – 42077441), part of the German National Research Data Infrastructure (NFDI), funded by the Deutsche Forschungsgemeinschaft (DFG).

Ana Trisovic is funded by the Alfred P. Sloan Foundation (grant number P-2020-13988). Harvard Data Commons is supported by an award from Harvard University Information Technology (HUIT).


Author contributions to this article and the RO-Crate project according to the Contributor Roles Taxonomy CASRAI CRediT [19]:

Stian Soiland-Reyes
Conceptualization, Data curation, Formal Analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing
Peter Sefton
Conceptualization, Investigation, Methodology, Project administration, Resources, Software, Writing – review & editing
Mercè Crosas
Writing – review & editing
Leyla Jael Castro
Methodology, Writing – review & editing
Frederik Coppens
Writing – review & editing
José M. Fernández
Methodology, Software, Writing – review & editing
Daniel Garijo
Methodology, Writing – review & editing
Björn Grüning
Writing – review & editing
Marco La Rosa
Software, Methodology, Writing – review & editing
Simone Leo
Software, Methodology, Writing – review & editing
Eoghan Ó Carragáin
Investigation, Methodology, Project administration, Writing – review & editing
Marc Portier
Methodology, Writing – review & editing
Ana Trisovic
Software, Writing – review & editing
RO-Crate Community
Investigation, Software, Validation, Writing – review & editing
Paul Groth
Methodology, Supervision, Writing – original draft, Writing – review & editing
Carole Goble
Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Visualization, Writing – review & editing

We would also like to acknowledge contributions from:

Finn Bacall
Software, Methodology
Herbert Van de Sompel
Writing – review & editing
Ignacio Eguinoa
Software, Methodology
Nick Juty
Writing – review & editing
Oscar Corcho
Writing – review & editing
Stuart Owen
Writing – review & editing
Laura Rodríguez-Navas
Software, Visualization, Writing – review & editing
Alan R. Williams
Writing – review & editing

Appendix A: Formalizing RO-Crate in First Order Logic

Appendix A is a formalization of the concept of RO-Crate as a set of relations using First Order Logic:

Appendix B: RO-Crate Community

As of 2021-10-04, the RO-Crate Community members are:


References

[1] Enis Afgan, Dannon Baker, Bérénice Batut, Marius van den Beek, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Björn A Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Saskia Hiltemann, Vahid Jalili, Helena Rasche, Nicola Soranzo, Jeremy Goecks, James Taylor, Anton Nekrutenko, Daniel Blankenberg (2018):
The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update.
Nucleic Acids Research 46(W1) W537–W544

[2] Deborah Agarwal, Carole Goble, Stian Soiland-Reyes, Ugis Sarkans, Daniel Noesgaard, Uwe Schindler, Martin Fenner, Paolo Manghi, Shelley Stall, Caroline Coward, Chris Erdmann (2021):
Data Citation Community of Practice – 8 June 2021 Workshop.

[3] Riccardo Albertoni, David Browning, Simon Cox, Alejandra Gonzalez Beltran, Andrea Perego, Peter Winstanley, Dataset Exchange Working Group (2020):
Data Catalog Vocabulary (DCAT) – Version 2.
W3C Recommendation (2020)

[4] Alexandre Almeida, Alex L. Mitchell, Miguel Boland, Samuel C. Forster, Gregory B. Gloor, Aleksandra Tarkowska, Trevor D. Lawley, Robert D. Finn (2019):
A new genomic blueprint of the human gut microbiota.
Nature 568(7753) 499–504.

[5] Gil Alterovitz, Dennis A Dean II, Carole Goble, Michael R Crusoe, Stian Soiland-Reyes, Amanda Bell, Anais Hayes, Anita Suresh, Charles Hadley S King IV, Dan Taylor, KanakaDurga Addepalli, Elaine Johanson, Elaine E Thompson, Eric Donaldson, Hiroki Morizono, Hsinyi Tsang, Jeet K Vora, Jeremy Goecks, Jianchao Yao, Jonas S Almeida, Jonathon Keeney, KanakaDurga Addepalli, Konstantinos Krampis, Krista Smith, Lydia Guo, Mark Walderhaug, Marco Schito, Matthew Ezewudo, Nuria Guimera, Paul Walsh, Robel Kahsay, Srikanth Gottipati, Timothy C Rodwell, Toby Bloom, Yuching Lai, Vahan Simonyan, Raja Mazumder (2018):
Enabling precision medicine via standard communication of HTS provenance, analysis, and results.
PLOS Biology 16(12):e3000099

[6] Ricardo Carvalho Amorim, João Aguiar Castro, João Rocha da Silva, Cristina Ribeiro (2016):
A comparison of research data management platforms: Architecture, flexible metadata and interoperability.
Universal Access in the Information Society 16 pp 851–862.

[7] Ghaith Arfaoui, Maroua Jaoua (2020):
RO-Crate RDA maDMP Mapper.

[8] Finn Bacall, Stian Soiland-Reyes, Marina Soares e Silva (2019):
eScienceLab: RO-Composer.

[9] Finn Bacall, Martyn Whitwell (2022):
GitHub – ResearchObject/ro-crate-ruby: A Ruby gem for creating, manipulating and reading RO-Crates.

[10] Dannon Baker, Marius van den Beek, Daniel Blankenberg, Dave Bouvier, John Chilton, Nate Coraor, Frederik Coppens, Ignacio Eguinoa, Simon Gladman, Björn Grüning, Nicholas Keener, Delphine Larivière, Andrew Lonie, Sergei Kosakovsky Pond, Wolfgang Maier, Anton Nekrutenko, James Taylor, Steven Weaver (2020):
No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics.
PLOS Pathogens 16(8):e1008643.

[11] Michelle Barker, Ross Wilkinson, Andrew Treloar (2019):
The Australian Research Data Commons.
Data Science Journal 18.

[12] Sean Bechhofer, Iain Buchan, David De Roure, Paolo Missier, John Ainsworth, Jiten Bhagat, Phillip Couch, Don Cruickshank, Mark Delderfield, Ian Dunlop, Matthew Gamble, Danius Michaelides, Stuart Owen, David Newman, Shoaib Sufi, Carole Goble (2013):
Why Linked Data is not enough for scientists.
Future Generation Computer Systems 29(2), pp. 599–611.

[13] Kostadin Belchev (2021):
KockataEPich/CheckMyCrate: A command line application for validating a RO-Crate object against a JSON profile.

[14] Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina Hettne, Raul Palma, Eleni Mina, Oscar Corcho, José Manuel Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole Goble (2015):
Using a suite of ontologies for preserving workflow-centric research objects.
Web Semantics: Science, Services and Agents on the World Wide Web 32 pp. 16–42.

[15] Fabien C. Y. Benureau, Nicolas P. Rougier (2017):
Re-run, repeat, reproduce, reuse, replicate: Transforming code into scientific contributions.
Frontiers in Neuroinformatics 11:69.

[16] Helen Berman, Kim Henrick, Haruki Nakamura, John L Markley (2007):
The worldwide Protein Data Bank (wwPDB): Ensuring a single, uniform archive of PDB data.
Nucleic Acids Research 35(Database issue), D301–D303.

[17] Florence Bietrix, José Maria Carazo, Salvador Capella-Gutierrez, Frederik Coppens, Maria Luisa Chiusano, Romain David, Jose Maria Fernandez, Maddalena Fratelli, Jean-Karim Heriche, Carole Goble, Philip Gribbon, Petr Holub, Robbie Joosten, Simone Leo, Stuart Owen, Helen Parkinson, Roland Pieruschka, Luca Pireddu, Luca Porcu, Michael Raess, Laura Rodriguez-Navas, Andreas Scherer, Stian Soiland-Reyes, Jing Tang (2021):
EOSC-life methodology framework to enhance reproducibility within EOSC-life.

[18] Christian Bizer, Tom Heath, Tim Berners-Lee (2011):
Linked data: The story so far.
In Semantic Services, Interoperability and Web Applications: Emerging Concepts, Amit Sheth (ed.) ISBN 9781609605933

[19] Amy Brand, Liz Allen, Micah Altman, Marjorie Hlava, Jo Scott (2015):
Beyond authorship: Attribution, contribution, collaboration, and credit.
Learned Publishing 28(2) pp. 151–155.

[20] Gabriel Brenner (2020):
BrennerG/Ro-Crate_2_ma-DMP: v1.0.0.

[21] J. Cardoso, L.J. Garcia Castro, F. Ekaputra, M.-C. Jacquemot-Perbal, T. Miksa and J. Borbinha (2020):
Towards Semantic Representation of Machine-Actionable Data Management Plans.

[22] João Cardoso, Diogo Proença, José Borbinha (2020):
Machine-actionable data management plans: A knowledge retrieval approach to automate the assessment of funders’ requirements.
ECIR 2020: Advances in Information Retrieval
ISBN 978-3-030-45442-5.

[23] Eoghan Ó Carragáin, Carole Goble, Peter Sefton, Stian Soiland-Reyes (2019):
A lightweight approach to research object data packaging.
Bioinformatics Open Source Conference (BOSC2019), 2019-07-24/2019-07-25, Basel, Switzerland.

[24] Lois Mai Chan (1995):
Library of Congress Subject Headings: Principles and Application, 3rd edn, p. 556.
ISBN 9781563081910.

[25] Kyle Chard, Mike D’ Arcy, Ben Heavner, Ian Foster, Carl Kesselman, Ravi Madduri, Alexis Rodriguez, Stian Soiland-Reyes, Carole Goble, Kristi Clark, Eric W. Deutsch, Ivo Dinov, Nathan Price, Arthur Toga (2016):
I’ll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets.
2016 IEEE International Conference on Big Data (Big Data), IEEE, pp. 319–328.
ISBN 978-1-4673-9005-7.

[26] Kyle Chard, Niall Gaffney, Matthew B. Jones, Kacper Kowalik, Bertram Ludascher, Timothy McPhillips, Jarek Nabrzyski, Victoria Stodden, Ian Taylor, Thomas Thelen, Matthew J. Turk, Craig Willis (2019):
Application of BagIt-serialized research object bundles for packaging and re-execution of computational analyses.
15th International Conference on eScience (eScience 2019), IEEE, pp. 514–521.
ISBN 978-1-7281-2451-3.

[27] Kyle Chard, Steven Tuecke and Ian Foster (2014):
Efficient and secure transfer, synchronization, and sharing of big data.
IEEE Cloud Computing 1(3) pp. 46–55.

[28] Paolo Ciccarese, Robert Sanderson, Benjamin Young (2017):
Web Annotation Data Model.
W3C Recommendation 23 February 2017.

[29] Jon F. Claerbout, Martin Karrenbach (1992):
Electronic documents give reproducible research a new meaning.
SEG Technical Program Expanded Abstracts 1992, Society of Exploration Geophysicists, pp. 601–604.

[30] Sarah Cohen-Boulakia, Khalid Belhajjame, Olivier Collin, Jérôme Chopard, Christine Froidevaux, Alban Gaignard, Konrad Hinsen, Pierre Larmande, Yvan Le Bras, Frédéric Lemoine, Fabien Mareuil, Hervé Ménager, Christophe Pradal, Christophe Blanchet (2017):
Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities.
Future Generation Computer Systems 75 pp. 284–298.

[31] Stefano Cossu, Esmé Cowles, Karen Estlund, Christina Harlow, Tom Johnson, Mark Matienzo, Danny Lamb, Lynette Rayle, Rob Sanderson, Jon Stroop, Andrew Woods (2018):
Portland Common Data Model.
GitHub duraspace/pcdm Wiki (2018-06-15)

[32] Mercè Crosas (2011):
The DataVerse Network: An open-source application for sharing, discovering and preserving data.
D-Lib Magazine 17(1/2).

[33] Mercè Crosas (2020):
Harvard Data Commons.
European Dataverse Workshop 2020, Tromsø, Norway. ISSN 2387-3086.

[34] Lindsey C Crosswell, Janet M Thornton (2012):
ELIXIR: A distributed infrastructure for European biological data.
Trends in Biotechnology 30(5) pp. 241–242.

[35] CRS4 (2022):
LifeMonitor, a testing and monitoring service for scientific workflows.

[36] Michael R. Crusoe, Sanne Abeln, Alexandru Iosup, Peter Amstutz, John Chilton, Nebojša Tijanić, Hervé Ménager, Stian Soiland-Reyes, Carole Goble (2022):
Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language.
Communications of the ACM, accepted.

[37] Felipe da Veiga Leprevost, Björn A Grüning, Saulo Alves Aflitos, Hannes L Röst, Julian Uszkoreit, Harald Barsnes, Marc Vaudel, Pablo Moreno, Laurent Gatto, Jonas Weber, Mingze Bai, Rafael C Jimenez, Timo Sachsenberg, Julianus Pfeuffer, Roberto Vera Alvarez, Johannes Griss, Alexey I Nesvizhskii, Yasset Perez-Riverol (2017):
BioContainers: An open-source and community-driven framework for software standardization.
Bioinformatics 33(16) pp. 2580–2582.

[38] Koenraad De Smedt, Dimitris Koureas, Peter Wittenburg (2020):
FAIR digital objects for science: From data pieces to actionable knowledge units.
Publications 8(2):21

[39] Paolo Di Tommaso, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, Cedric Notredame (2017):
Nextflow enables reproducible computational workflows.
Nature Biotechnology 35(4) (2017), 316–319.

[40] Mathias Dillen, Quentin Groom, Donat Agosti, Lars Nielsen (2019):
Zenodo, an archive and publishing repository: A tale of two herbarium specimen pilot projects.
Biodiversity Information Science and Standards 3:e37080 (2019).

[41] Bert Droesbeke, Ignacio Eguinoa, Alban Gaignard, Leo Simone, Luca Pireddu, Laura Rodríguez-Navas, Stian Soiland-Reyes (2022):
GitHub – ResearchObject/ro-crate-py: Python library for RO-Crate.

[42] M. Duerst and M. Suignard (2005):
Internationalized resource identifiers (IRIs).
RFC 3987, Internet Requests for Comments, RFC Editor, (2005).

[43] EMBL-EBI Microbiome Informatics Team (2019):
FTP index of /pub/databases/metagenomics/umgs_analyses/.

[44] EMBL-EBI Microbiome Informatics Team (2020):
GitHub – Finn-Lab/MGS-gut: Analysing Metagenomic Species (MGS).

[45] Philip A Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso, Sven Nahnsen (2020):
The nf-core framework for community-curated bioinformatics pipelines.
Nature Biotechnology 38(3), 276–278.

[46] Sharon Farnel, Ali Shiri (2014):
Metadata for research data: Current practices and trends.
2014 Proceedings of the International Conference on Dublin Core and Metadata Applications, W. Moen and A. Rushing, eds, Dublin Core Metadata Initiative, ISSN 1939-1366.

[47] DONA Foundation (2018):
Digital Object Interface Protocol Specification, version 2.0.
Technical Report.

[48] Andres Garcia-Silva, Jose Manuel Gomez-Perez, Raul Palma, Marcin Krystek, Simone Mantovani, Federica Foglini, Valentina Grande, Francesco De Leo, Stefano Salvi, Elisa Trasatti, Vito Romaniello, Mirko Albani, Cristiano Silvagni, Rosemarie Leone, Fulvio Marelli, Sergio Albani, Michele Lazzarini, Hazel J. Napier, Helen M. Glaves, Timothy Aldridge, Charles Meertens, Fran Boler, Henry W. Loescher, Christine Laney, Melissa A. Genazzio, Daniel Crawl, Ilkay Altintas (2019):
Enabling FAIR research in Earth science through research objects.
Future Generation Computer Systems 98 pp. 550–564.

[49] Peter Sefton, Mike Lynch, Stian Soiland-Reyes (2021):
GitHub – UTS-eResearch/ro-crate-js: Research Object Crate (RO-Crate) utilities.

[50] Ignacio Eguinoa, Stian Soiland-Reyes, Bert Droesbeke, Michael R. Crusoe (2020):
GitHub – workflowhub-eu/galaxy2cwl: Standalone tool to get CWL descriptions (initially an abstract CWL interface) of Galaxy workflows and Galaxy workflow executions.

[51] Marco La Rosa (2021):
GitHub – CoEDL/modpdsc.

[52] Marco La Rosa (2021):
GitHub – CoEDL/ocfl-tools: Tools to process and manipulate an OCFL tree.

[53] Giving software its due.
Nature Methods 16(3) (2019) p. 207.

[54] Carole Goble (2016):
What Is Reproducibility? The R* Brouhaha.
SciRepro Workshop, TPDL, Hannover, Germany, 2016.

[55] Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, Daniel Schober (2019):
FAIR Computational Workflows.
Data Intelligence 2(1–2) pp. 108–121.

[56] Carole Goble, Stian Soiland-Reyes, Finn Bacall, Stuart Owen, Alan Williams, Ignacio Eguinoa, Bert Droesbeke, Simone Leo, Luca Pireddu, Laura Rodriguez-Navas, José Mª Fernández, Salvador Capella-Gutierrez, Hervé Ménager, Björn Grüning, Beatriz Serrano-Solano, Philip Ewels, Frederik Coppens (2021):
Implementing FAIR digital objects in the EOSC-life workflow collaboratory.

[57] Carole A Goble, Jiten Bhagat, Sergejs Aleksejevs, Don Cruickshank, Danius Michaelides, David Newman, Mark Borkum, Sean Bechhofer, Marco Roos, Peter Li, David De Roure (2010):
myExperiment: A repository and social network for the sharing of bioinformatics workflows.
Nucleic Acids Research 38(Web Server issue) W677–W682.

[58] Alasdair Gray, Carole Goble, Rafael Jimenez, Bioschemas Community (2017):
Bioschemas: From Potato Salad to Protein Annotation.
ISWC, Vienna, Austria.

[59] Robert L Grossman, Allison Heath, Mark Murphy, Maria Patterson, Walt Wells (2016):
A case for data commons: Toward data science as a service.
Computing in Science & Engineering 18(5) pp. 10–20.

[60] Björn Grüning, John Chilton, Johannes Köster, Ryan Dale, Nicola Soranzo, Marius van den Beek, Jeremy Goecks, Rolf Backofen, Anton Nekrutenko, James Taylor (2018):
Practical computational reproducibility in the life sciences.
Cell Systems 6(6) pp. 631–635.

[61] Björn Grüning, Ryan Dale, Andreas Sjödin, Brad A Chapman, Jillian Rowe, Christopher H Tomkins-Tinch, Renan Valieris, Johannes Köster, Bioconda Team (2018):
Bioconda: Sustainable and comprehensive software distribution for the life sciences.
Nature Methods 15(7) pp. 475–476.

[62] Ramanathan V Guha, Dan Brickley, Steve Macbeth (2015):
Evolution of Structured Data on the Web: Big data makes common schemas even more necessary.
Queue 13(9) pp. 10–37.

[63] Tom Heath, Christian Bizer (2011):
Linked Data: Evolving the Web into a Global Data Space.
Synthesis Lectures on the Semantic Web: Theory and Technology 1 pp. 1–136, ISSN 2160-4711. ISBN 9781608454310 / ISBN 9781608454303.

[64] IEEE Standard for Bioinformatics Analyses Generated by High-Throughput Sequencing (HTS) to Facilitate Communication (2020).
IEEE Std 2791-2020.
ISBN 978-1-5044-6466-6.

[65] Mark A Jensen, Vincent Ferretti, Robert L Grossman, Louis M Staudt (2017):
The NCI Genomic Data Commons as an engine for precision medicine.
Blood 130(4) pp. 453–459.

[66] Matthew B. Jones, Stephen Richard, Dave Vieglais, Adam Shepherd, Ruth Duerr, Doug Fils, Lewis McGibbney (2021):
Science-on-Schema.org v1.2.0.

[67] Megan Katsumi, Michael Grüninger (2016):
What is ontology reuse?
In: Formal Ontology in Information Systems, R. Ferrario and W. Kuhn, eds,
Frontiers in Artificial Intelligence and Applications 283
ISBN 978-1-61499-660-6.

[68] Farah Zaib Khan, Stian Soiland-Reyes, Richard O. Sinnott, Andrew Lonie, Carole Goble, Michael R. Crusoe (2019):
Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv.
GigaScience 8(11).

[69] Jihie Kim, Ewa Deelman, Yolanda Gil, Gaurang Mehta, Varun Ratnakar (2008):
Provenance trails in the Wings/Pegasus system.
Concurrency and Computation: Practice and Experience 20(5) pp. 587–597.

[70] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing, Jupyter Development Team (2016):
Jupyter Notebooks – a publishing format for reproducible computational workflows.
in: Positioning and Power in Academic Publishing: Players, Agents and Agendas,
Proceedings of the 20th International Conference on Electronic Publishing, pp. 87–90, ISBN 978-1-61499-649-1.

[71] Laura Koesten, Kathleen Gregory, Paul Groth, Elena Simperl (2021):
Talking datasets – understanding data sensemaking behaviours.
International Journal of Human-Computer Studies 146:102562.

[72] Laura Koesten, Pavlos Vougiouklis, Elena Simperl, Paul Groth (2020):
Dataset reuse: Toward translating principles to practice.
Patterns 1(8):100136.

[73] Johannes Köster, Sven Rahmann (2012):
Snakemake – a scalable bioinformatics workflow engine.
Bioinformatics 28(19) pp. 2520–2522.

[74] J. Kunze, J. Littman, E. Madden, J. Scancella, C. Adams (2018):
The BagIt File Packaging Format (V1.0).
RFC 8493, Internet Requests for Comments, RFC Editor.

[75] Krzysztof Kurowski, Oscar Corcho, Christine Choirat, Magnus Eriksson, Frederik Coppens, Mark van de Sanden, Milan Ojsteršek (2021):
EOSC Interoperability Framework.
Publications Office of the EU, Technical Report.

[76] Kyle Chard, Niall Gaffney, Mihael Hategan, Kacper Kowalik, Bertram Ludäscher, Timothy McPhillips, Jarek Nabrzyski, Victoria Stodden, Ian Taylor, Thomas Thelen, Matthew J. Turk, Craig Willis (2020):
Toward enabling reproducibility for data-intensive research using the Whole Tale platform.
Advances in Parallel Computing 36 pp. 766–778.

[77] Marco La Rosa (2021):
Arkisto Platform: Describo Online.

[78] Marco La Rosa and Peter Sefton (2021):
Arkisto Platform: Describo.

[79] R. Lammey (2020):
Solutions for identification problems: A look at the research organization registry.
Science Editing 7(1) pp. 65–69.

[80] Anna-Lena Lamprecht, Leyla Garcia, Mateusz Kuzak, Carlos Martinez, Ricardo Arcila, Eva Martin Del Pico, Victoria Dominguez Del Angel, Stephanie Van De Sandt, Jon Ison, Paula Andrea Martinez, Peter Mcquilton, Alfonso Valencia, Jennifer Harrow, Fotis Psomopoulos, Josep Ll. Gelpi, Neil Chue Hong, Carole Goble, Salvador Capella-Gutierrez (2019):
Towards FAIR principles for research software.
Data Science 3(1) pp. 1–23.

[81] T. Lebo, S. Sahoo, D. McGuinness, K. Belhajjame, J. Cheney, D. Corsar, D. Garijo, Stian Soiland-Reyes, S. Zednik and J. Zhao (2013):
PROV-O: The PROV Ontology.
W3C Recommendation 30 April 2013.

[82] J. Leipzig, D. Nüst, C.T. Hoyt, K. Ram and J. Greenberg (2021):
The role of metadata in reproducible computational research.
Patterns 2(9):100322.

[83] D. Lowe and G. Bayarri (2021):
Protein Ligand Complex MD Setup tutorial using BioExcel Building Blocks (biobb) (jupyter notebook).

[84] M. Lynch and Peter Sefton (2022):
npm: ro-crate-excel.

[85] GitHub (2021):
Managing large files – GitHub Docs.

[86] Julie A McMurry, Nick Juty, Niklas Blomberg, Tony Burdett, Tom Conlin, Nathalie Conte, Mélanie Courtot, John Deck, Michel Dumontier, Donal K Fellows, Alejandra Gonzalez-Beltran, Philipp Gormanns, Jeffrey Grethe, Janna Hastings, Jean-Karim Hériché, Henning Hermjakob, Jon C Ison, Rafael C Jimenez, Simon Jupp, John Kunze, Camille Laibe, Nicolas Le Novère, James Malone, Maria Jesus Martin, Johanna R McEntyre, Chris Morris, Juha Muilu, Wolfgang Müller, Philippe Rocca-Serra, Susanna-Assunta Sansone, Murat Sariyar, Jacky L Snoep, Stian Soiland-Reyes, Natalie J Stanford, Neil Swainston, Nicole Washington, Alan R Williams, Sarala M Wimalaratne, Lilly M Winfree, Katherine Wolstencroft, Carole Goble, Christopher J Mungall, Melissa A Haendel, Helen Parkinson (2017):
Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data.
PLOS Biology 15(6):e2001414.

[87] T. Miksa, M. Jaoua and G. Arfaoui (2020):
Research object crates and machine-actionable data management plans.
1st Workshop on Research Data Management for Linked Open Science.

[88] T. Miksa, S. Simms, D. Mietchen and S. Jones (2019):
Ten principles for machine-actionable data management plans.
PLOS Computational Biology 15(3): e1006750.

[89] Steffen Möller, Hajo Nils Krabbenhöft, Andreas Tille, David Paleino, Alan Williams, Katy Wolstencroft, Carole Goble, Richard Holland, Dominique Belhachemi, Charles Plessy (2010):
Community-driven computational biology with Debian Linux.
BMC Bioinformatics 11(Suppl 12):S5.

[90] Steffen Möller, Stuart W. Prescott, Lars Wirzenius, Petter Reinholdtsen, Brad Chapman, Pjotr Prins, Stian Soiland-Reyes, Fabian Klötzl, Andrea Bagnacani, Matúš Kalaš, Andreas Tille, Michael R. Crusoe (2017):
Robust cross-platform workflows: How technical and scientific communities collaborate to develop, test and share best practices for data analysis.
Data Science and Engineering 2(3) pp. 232–244.

[91] Barend Mons (2018):
Data Stewardship for Open Science, 1st edn. Taylor & Francis, p. 240. ISBN 9781315351148.

[92] myExperiment (2009):
myExperiment Ontology Modules.
myExperiment / Internet Archive.

[93] D. Newman, S. Bechhofer and D. De Roure (2009):
myExperiment: An ontology for e-Research.
in: Proceedings of the Workshop on Semantic Web Applications in Scientific Discourse (SWASD 2009), T. Clark, J.S. Luciano, M.S. Marshall, E. Prud’Hommeaux and S. Stephens, eds,
CEUR Workshop Proceedings 523. ISSN 1613-0073.

[94] Cameron Neylon (2017):
As a researcher … I’m a bit bloody fed up with Data Management.
Science in the Open (blog).

[95] npm:

[96] OCFL (2020):
Oxford Common File Layout Specification.
Recommendation.

[97] A. Piper (2020):
Digital crowdsourcing and public understandings of the past: Citizen historians meet criminal characters.
History Australia 17(3) pp. 525–541.

[98] RDF Working Group (2014):
RDF 1.1 Concepts and Abstract Syntax.
W3C Recommendation 25 Feb 2014.

[99] Heidi L. Rehm, Angela J.H. Page, Lindsay Smith, Jeremy B. Adams, Gil Alterovitz, Lawrence J. Babb, Maxmillian P. Barkley, Michael Baudis, Michael J.S. Beauvais, Tim Beck, Jacques S. Beckmann, Sergi Beltran, David Bernick, Alexander Bernier, James K. Bonfield, Tiffany F. Boughtwood, Guillaume Bourque, Sarion R. Bowers, Anthony J. Brookes, Michael Brudno, Matthew H. Brush, David Bujold, Tony Burdett, Orion J. Buske, Moran N. Cabili, Daniel L. Cameron, Robert J. Carroll, Esmeralda Casas-Silva, Debyani Chakravarty, Bimal P. Chaudhari, Shu Hui Chen, J. Michael Cherry, Justina Chung, Melissa Cline, Hayley L. Clissold, Robert M. Cook-Deegan, Mélanie Courtot, Fiona Cunningham, Miro Cupak, Robert M. Davies, Danielle Denisko, Megan J. Doerr, Lena I. Dolman, Edward S. Dove, L. Jonathan Dursi, Stephanie O.M. Dyke, James A. Eddy, Karen Eilbeck, Kyle P. Ellrott, Susan Fairley, Khalid A. Fakhro, Helen V. Firth, Michael S. Fitzsimons, Marc Fiume, Paul Flicek, Ian M. Fore, Mallory A. Freeberg, Robert R. Freimuth, Lauren A. Fromont, Jonathan Fuerth, Clara L. Gaff, Weiniu Gan, Elena M. Ghanaim, David Glazer, Robert C. Green, Malachi Griffith, Obi L. Griffith, Robert L. Grossman, Tudor Groza, Jaime M. Guidry Auvil, Roderic Guigó, Dipayan Gupta, Melissa A. Haendel, Ada Hamosh, David P. Hansen, Reece K. Hart, Dean Mitchell Hartley, David Haussler, Rachele M. Hendricks-Sturrup, Calvin W.L. Ho, Ashley E. Hobb, Michael M. Hoffman, Oliver M. Hofmann, Petr Holub, Jacob Shujui Hsu, Jean-Pierre Hubaux, Sarah E. Hunt, Ammar Husami, Julius O. Jacobsen, Saumya S. Jamuar, Elizabeth L. Janes, Francis Jeanson, Aina Jené, Amber L. Johns, Yann Joly, Steven J.M. Jones, Alexander Kanitz, Kazuto Kato, Thomas M. Keane, Kristina Kekesi-Lafrance, Jerome Kelleher, Giselle Kerry, Seik-Soon Khor, Bartha M. Knoppers, Melissa A. Konopko, Kenjiro Kosaki, Martin Kuba, Jonathan Lawson, Rasko Leinonen, Stephanie Li, Michael F. Lin, Mikael Linden, Xianglin Liu, Isuru Udara Liyanage, Javier Lopez, Anneke M. Lucassen, Michael Lukowski, Alice L. Mann, John Marshall, Michele Mattioni, Alejandro Metke-Jimenez, Anna Middleton, Richard J. Milne, Fruzsina Molnár-Gábor, Nicola Mulder, Monica C. Munoz-Torres, Rishi Nag, Hidewaki Nakagawa, Jamal Nasir, Arcadi Navarro, Tristan H. Nelson, Ania Niewielska, Amy Nisselle, Jeffrey Niu, Tommi H. Nyrönen, Brian D. O’Connor, Sabine Oesterle, Soichi Ogishima, Vivian Ota Wang, Laura A.D. Paglione, Emilio Palumbo, Helen E. Parkinson, Anthony A. Philippakis, Angel D. Pizarro, Andreas Prlic, Jordi Rambla, Augusto Rendon, Renee A. Rider, Peter N. Robinson, Kurt W. Rodarmer, Laura Lyman Rodriguez, Alan F. Rubin, Manuel Rueda, Gregory A. Rushton, Rosalyn S. Ryan, Gary I. Saunders, Helen Schuilenburg, Torsten Schwede, Serena Scollen, Alexander Senf, Nathan C. Sheffield, Neerjah Skantharajah, Albert V. Smith, Heidi J. Sofia, Dylan Spalding, Amanda B. Spurdle, Zornitza Stark, Lincoln D. Stein, Makoto Suematsu, Patrick Tan, Jonathan A. Tedds, Alastair A. Thomson, Adrian Thorogood, Timothy L. Tickle, Katsushi Tokunaga, Juha Törnroos, David Torrents, Sean Upchurch, Alfonso Valencia, Roman Valls Guimera, Jessica Vamathevan, Susheel Varma, Danya F. Vears, Coby Viner, Craig Voisin, Alex H. Wagner, Susan E. Wallace, Brian P. Walsh, Marc S. Williams, Eva C. Winkler, Barbara J. Wold, Grant M. Wood, J. Patrick Woolley, Chisato Yamasaki, Andrew D. Yates, Christina K. Yung, Lyndon J. Zass, Ksenia Zaytseva, Junjun Zhang, Peter Goodhand, Kathryn North, Ewan Birney (2021):
GA4GH: International policies and standards for data sharing across genomic research and healthcare.
Cell Genomics 1(2):100029.

[100] N. Rettberg and B. Schmidt (2015):
OpenAIRE: Supporting a European open access mandate.
College & Research Libraries News 76(6) pp. 306–310.

[101] G.K. Sandve, A. Nekrutenko, J. Taylor and E. Hovig (2013):
Ten simple rules for reproducible computational research.
PLOS Computational Biology 9(10):e1003285.

[102] Lynn M. Schriml, Maria Chuvochina, Neil Davies, Emiley A. Eloe-Fadrosh, Robert D. Finn, Philip Hugenholtz, Christopher I. Hunter, Bonnie L. Hurwitz, Nikos C. Kyrpides, Folker Meyer, Ilene Karsch Mizrachi, Susanna-Assunta Sansone, Granger Sutton, Scott Tighe, Ramona Walls (2020):
COVID-19 pandemic reveals the peril of ignoring metadata standards.
Scientific Data 7(1):188.

[103] Peter Sefton, G. Devine, C. Evenhuis, M. Lynch, S. Wise, M. Lake and D. Loxton (2018):
DataCrate: a method of packaging, distributing, displaying and archiving Research Objects.
in: Workshop on Research Objects (RO 2018), 29 Oct 2018 at IEEE eScience 2018, Amsterdam, Netherlands. Zenodo.

[104] Peter Sefton (2021):
FAIR Data Management; It’s a lifestyle not a lifecycle.

[105] Peter Sefton, Eoghan Ó Carragáin, Stian Soiland-Reyes, Oscar Corcho, Daniel Garijo, Raul Palma, Frederik Coppens, Carole Goble, José María Fernández, Kyle Chard, Jose Manuel Gomez-Perez, Michael R Crusoe, Ignacio Eguinoa, Nick Juty, Kristi Holmes, Jason A. Clark, Salvador Capella-Gutierrez, Alasdair J. G. Gray, Stuart Owen, Alan R. Williams, Giacomo Tartari, Finn Bacall, Thomas Thelen (2019):
RO-Crate Metadata Specification 1.0.

[106] Peter Sefton, Eoghan Ó Carragáin, Stian Soiland-Reyes, Oscar Corcho, Daniel Garijo, Raul Palma, Frederik Coppens, Carole Goble, José María Fernández, Kyle Chard, Jose Manuel Gomez-Perez, Michael R Crusoe, Ignacio Eguinoa, Nick Juty, Kristi Holmes, Jason A. Clark, Salvador Capella-Gutierrez, Alasdair J. G. Gray, Stuart Owen, Alan R. Williams, Giacomo Tartari, Finn Bacall, Thomas Thelen, Hervé Ménager, Laura Rodríguez-Navas, Paul Walk, brandon whitehead, Mark Wilkinson, Paul Groth, Erich Bremer, LJ Garcia Castro, Karl Sebby, Alexander Kanitz, Ana Trisovic, Gavin Kennedy, Mark Graves, Jasper Koehorst, Simone Leo, Marc Portier (2021):
RO-Crate Metadata Specification 1.1.1.

[107] Peter Sefton, Eoghan Ó Carragáin, Stian Soiland-Reyes, Oscar Corcho, Daniel Garijo, Raul Palma, Frederik Coppens, Carole Goble, José María Fernández, Kyle Chard, Jose Manuel Gomez-Perez, Michael R Crusoe, Ignacio Eguinoa, Nick Juty, Kristi Holmes, Jason A. Clark, Salvador Capella-Gutierrez, Alasdair J. G. Gray, Stuart Owen, Alan R. Williams, Giacomo Tartari, Finn Bacall, Thomas Thelen, Hervé Ménager, Laura Rodríguez-Navas, Paul Walk, brandon whitehead, Mark Wilkinson, Paul Groth, Erich Bremer, LJ Garcia Castro, Karl Sebby, Alexander Kanitz, Ana Trisovic, Gavin Kennedy, Mark Graves, Jasper Koehorst, Simone Leo (2020):
RO-Crate Metadata Specification 1.1.

[108] Stian Soiland-Reyes (2020):
I am looking for which bioinformatics journals encourage authors to submit their code/pipeline/workflow supporting data analysis.
[archived 2021-05-04]

[109] Stian Soiland-Reyes (2021):
Describing and packaging workflows using RO-Crate and BioCompute Objects.
Zenodo, Webinar for U.S. Food and Drug Administration (FDA), 2021-05-12.

[110] Stian Soiland-Reyes, P. Alper and Carole Goble (2016):
Tracking Workflow Execution With TavernaPROV.
ProvenanceWeek 2016, session “PROV: Three Years Later”.

[111] Stian Soiland-Reyes, M. Gamble and R. Haines (2014):
Research Object Bundle 1.0.

[112] M. Sporny, D. Longley, G. Kellogg, M. Lanthaler and N. Lindström (2014):
JSON-LD 1.0.
W3C Recommendation.

[113] V. Stodden, M. McNutt, D.H. Bailey, E. Deelman, Y. Gil, B. Hanson, M.A. Heroux, J.P.A. Ioannidis and M. Taufer (2016):
Enhancing reproducibility for computational methods.
Science 354(6317) pp. 1240–1241.

[114] N. Thieberger and L. Barwick (2012):
Keeping records of language diversity in Melanesia: The Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC).
in: Melanesian Languages on the Edge of Asia: Challenges for the 21st Century, N. Evans and M. Klamer, eds,
Language Documentation & Conservation Special Publication SP05, University of Hawai’i Press, pp. 239–253. ISBN 978-0-9856211-2-4.

[115] Tools: Data Portal & Discovery.

[116] R. Troncy, W. Bailer, M. Höffernig and M. Hausenblas (2010):
VAMP: A service for validating MPEG-7 descriptions w.r.t. to formal profile definitions.
Multimedia Tools and Applications 46(2–3) pp. 307–329.

[117] Herbert Van de Sompel, Carl Lagoze (2007):
Interoperability for the discovery, use, and re-use of units of scholarly communication.
CTWatch Quarterly 3(3).

[118] T. Vergoulis, K. Zagganas, L. Kavouras, M. Reczko, S. Sartzetakis and T. Dalamagas (2021):
SCHeMa: Scheduling Scientific Containers on a Cluster of Heterogeneous Machines.

[119] C.J. Volk, Y. Lucero and K. Barnas (2014):
Why is data sharing in collaborative natural resource efforts so hard and what can we do to improve it?.
Environmental Management 53(5) pp. 883–893.

[120] W3C Technical Architecture Group (2007):
Dereferencing HTTP URIs.
Draft Tag Finding.

[121] P. Walk, T. Miksa and P. Neish (2019):
RDA DMP Common Standard for Machine-Actionable Data Management Plans.
Research Data Alliance.

[122] Stephanie Walton, Laurence Livermore, Olaf Bánki, Robert W. N. Cubey, Robyn Drinkwater, Markus Englund, Carole Goble, Quentin Groom, Christopher Kermorvant, Isabel Rey, Celia M Santos, Ben Scott, Alan R. Williams, Zhengzhe Wu (2020):
Landscape analysis for the specimen data refinery.
Research Ideas and Outcomes 6.

[123] Mark D. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J.G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A.C ’t Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, Barend Mons (2016):
The FAIR guiding principles for scientific data management and stewardship.
Scientific Data 3:160018.

[124] WorkflowHub project: Project pages for developing and running the WorkflowHub, a registry of scientific workflows.

[125] Jun Zhao, Jose Manuel Gomez-Perezy, Khalid Belhajjame, Graham Klyne, Esteban Garcia-Cuestay, Aleix Garridoy, Kristina Hettne, Marco Roos, David De Roure, Carole Goble (2012):
Why workflows break – understanding and combating decay in Taverna workflows.
2012 IEEE 8th International Conference on e-Science, IEEE. ISBN 978-1-4673-4466-1.

[126] F. Zoubek and M. Winkler (2021):
RO Crates and Excel.

[127] M. Žumer (2009):
National Bibliographies in the Digital Age: Guidance and New Directions.
IFLA Series on Bibliographic Control, IFLA Working Group on Guidelines for National Bibliographies, Walter de Gruyter – K. G. Saur, ISSN 1868-8438. ISBN 9783598441844.

  1. IRIs [42] are a generalisation of URIs (which include the well-known http/https URLs), permitting international Unicode characters without percent encoding, as commonly used in the browser address bar and in HTML5. ↩︎

  2. Some consideration is needed when processing RO-Crates as knowledge graphs, e.g. establishing absolute IRIs for files inside a ZIP archive, as detailed in the RO-Crate specification. ↩︎

  3. Note that an RO-Crate is not required to be published on the Web; see the section on being self-described. ↩︎

  4. The avid reader may spot that the RO-Crate Metadata file uses the extension .json instead of .jsonld; this is to emphasise developer expectations of a JSON format, while the file’s JSON-LD nature is secondary. See ResearchObject/ro-crate#82. ↩︎

  5. Recommended properties for types shown in Listing 1 also include affiliation, citation, contactPoint, description, encodingFormat, funder, geo, identifier, keywords, publisher; these properties and corresponding contextual entities are excluded here for brevity. See the complete example. ↩︎

  6. Several new implementations have appeared since the publication of this article; see chapter 6. ↩︎

  7. This was implemented after publication of this article – see chapter 6. ↩︎

  8. See Section 5.4. ↩︎

  9. CWLProv and TavernaProv predate RO-Crate, but use RO-Bundle [111], a similar Research Object packaging method with JSON-LD metadata. ↩︎

  10. IEEE 2791-2020 does permit user extensions in the extension domain by referencing additional JSON Schemas. ↩︎

  11. The Endings Project is a five-year project funded by the Social Sciences and Humanities Research Council (SSHRC) that is creating tools, principles, policies and recommendations for digital scholarship practitioners to create accessible, stable, long-lasting resources in the humanities. ↩︎

  12. Docker and Conda can use build recipes, a set of commands that construct the container image by downloading and installing its requirements. However, these recipes are effectively another piece of software code, which may itself decay and become difficult to rerun. ↩︎

  13. FAIR principle A2: Metadata are accessible, even when the data are no longer available. [123] ↩︎

Formalizing RO-Crate in First Order Logic
Appendix from the journal article published in Data Science.