Chapter 2: Background

In the following chapter, we discuss the related work with respect to FAIR Digital Objects and Linked Data. We do so by looking through the lens of development of these technologies over time, including future directions. This primarily motivates RQ1 addressed by chapter 3, but in addition both technologies are foundational for the implementations in chapter 4 and chapter 5.

FAIR Digital Object

The concept of FAIR Digital Objects [Schultes 2019] has been introduced as way to expose research data as active objects that conform to the FAIR principles [Wilkinson 2016]. This builds on the Digital Object (DO) concept [Kahn 2006], first introduced by [Kahn 1995] as a system of repositories containing digital objects identified by handles and described by metadata which may have references to other handles. DO was the inspiration for the ITU-T X.1255 framework which introduced an abstract Digital Entity Interface Protocol for managing such objects programmatically, first realised by the Digital Object Interface Protocol (DOIP) [Reilly 2009].

In brief, the structure of a FAIR Digital Object (FDO) is to, given a persistent identifier (PID) such as a DOI, resolve to a PID Record that gives the object a type along with a mechanism to retrieve its bit sequences, metadata and references to further programmatic operations (Figure 1). The type of an FDO (itself an FDO) defines attributes to semantically describe and relate such FDOs to other concepts (typically other FDOs referenced by PIDs). The premise of systematically building an ecosystem of such digital objects is to give researchers a way to organise complex digital entities, associated with identifiers, metadata, and supporting automated processing [Wittenburg 2019].

Idealised overview of a FAIR Digital Object

The persistent identifier (PID), (e.g. a Handle, DOI or permalink), refers to an FDO through a PID Record, which may reference downloadble bytes, and optionally additional metadata in another FDO. A series of operations are accessible from an FDO (for instance retrieving the bytes). Similar to in object-oriented programming, the FDO Type indicates which operations and attributes are applicable to an FDO. FDOs can be cross-related using the PIDs, a Collection is then another such FDO which aggregates other FDOs by reference.

The configuration shown here is just one of many possible [Lannom 2022a], along with the choice of PID system, nature of the PID Record and metadata vocabularies, which are identified through an FDO Profile. In practice, some compromises from this idealised picture are taken depending on the implementation, for instance attribute keys may be simple strings rather than PIDs, and default operations are not explicitly declared.

FDOs have been recognised by the European Open Science Cloud (EOSC) as a suggested part of its Interoperability Framework [Corcho 2021], in particular for deploying active and interoperable FAIR resources that are machine actionable¹. Development of the FDO concept continued within Research Data Alliance (RDA) groups and EOSC projects like GO-FAIR, concluding with a set of guidelines for implementing FDO [Bonino 2019]. The FAIR Digital Objects Forum has since taken over the maturing of FDO through focused working groups which have currently drafted several more detailed specification documents (see Next steps for FDO).

FDO approaches

FDO is an evolving concept. A set of FDO Demonstrators [Wittenburg 2022b] highlights how current adapters are approaching implementations of FDO from different angles:

Building on the Digital Object concept, using the simplified [DONA 2018] specification, which detail how to exchange JSON objects through a text-based protocol² (usually TCP/IP over TLS). The main DOIP operations are retrieving, creating and updating digital objects. These are mostly realised using the reference implementation Cordra [Tupelo-Schneck 2022]. FDO types are registered in the local Cordra instance, where they are specified using JSON Schema [Wright 2022] and PIDs are assigned using the Handle system. Several type registries have been established.
Following a Linked Data approach, but using the DOIP protocol, e.g. using JSON-LD and schema.org within DOIP in Materal Sciences archives [Riccardi 2022].
Approaching the FDO principles from existing Linked Data practices on the Web, e.g. WorkflowHub use of RO-Crate and schema.org [Soiland-Reyes 2022a].

From this it becomes apparent that there is a potentially large overlap between the goals and approaches of FAIR Digital Objects and Linked Data, which we will cover in section 2.2.

An overview of upcoming FDO specifications

The FAIR Digital Object Forum working groups have prepared detailed requirement documents [FDO 2022] setting out the path for realising FDOs, named FDO Recommendations. As of 2023-06-17, most of these documents are open for public review, while some are still in draft stages for internal review. As these documents clarify the future aims and focus of FAIR Digital Objects [Lannom 2022b], we provide a brief summary of each:

FAIR Digital Object Overview and Specifications [Anders 2023] is a comprehensive overview of FAIR Digital Object specifications listed below. It serves as a primer that introduces FDO concepts and the remaining documents. It is accompanied by an FDO Glossary [Broeder 2022a].

The FDO Forum Document Standards [Weiland 2022a] documents the recommendation process within the forum, starting at Working Draft (WD) status within the closed working group and later within the open forum, then Proposed Recommendation (PR) published for public review, finalised as FDO Forum Recommendation (REC) following any revisions. In addition, the forum may choose to endorse existing third-party notes and specifications.

The FDO Requirement Specifications [Anders 2023] is an update of [Bonino 2019] as the foundational definition of FDO. This sets the criteria for classifying an digital entity as a FAIR Digital Object, allowing for multiple implementations. The requirements shown in Table 3 are largely equivalent, but in this specification clarified with references to other FDO documents.

Machine actionability [Weiland 2022b] sets out to define what is meant by machine actionability for FDOs. Machine readable is defined as elements of bit-sequences defined by structural specification, machine interpretable elements that can be identified and related with semantic artefacts, while machine actionable are elements with a type with operations in a symbolic grammar. The document largely describes requirements for resolving an FDO to metadata, and how types should be related to possible operations.

Configuration Types [Lannom 2022a] classifies different granularities for organising FDOs in terms of PIDs, PID Records, Metadata and bit sequences, e.g. as a single FDO or several daisy-chained FDOs. Different patterns used by current DOIP deployments are considered, as well as FAIR Signposting [Van de Sompel 2015, Van de Sompel 2022].

PID Profiles & Attributes [Anders 2022] specifies that PIDs must be formally associated with a PID Profile, a separate FDO that defines attributes required and recommended by FDOs following said profile. This forms the kernel attributes, building on recommendations from RDA’s PID Information Types working group [Broeder 2022b]. This document makes a clear distinction between a minimal set of attributes needed for PID resolution and FDO navigation, which needs to be part of the PID Record [Islam 2023], compared with a richer set of more specific attributes as part of the metadata for an FDO, possibly represented as a separate FDO.

Kernel Attributes & Metadata [Broeder 2022b] elaborates on categories of FDO Mandatory, FDO Optional and Community Attributes, recommending kernel attributes like dateCreated, ScientificDomain, PersistencePolicy, digitalObjectMutability, etc. This document expands on RDA Recommendation on PID Kernel Information [Weigel 2018]. It is worth noting that both documents are relatively abstract and do not establish PIDs or namespaces for the kernel attributes.

Granularity, Versioning, Mutability [Hellström 2022] considers how granularity decisions for forming FDOs must be agreed by different communities depending on their pragmatic usage requirements. The affect on versioning, mutability and changes to PIDs are considered, based on use cases and existing PID practices.

DOIP Endorsement Request [Schwardmann 2022a] is an endorsement of the DOIP v2.0 [DONA 2018] specification as a potential FDO implementation, as it has been applied by several institutions [Wittenburg 2022b]. The document proposes that DOIP shall be assessed for completeness against FDO – in this initial draft this is justified as “we can state that DOIP is compliant with the FDO specification documents in process” (the documents listed above).

Upload of FDO [Blanchi 2022a] illustrates the operations for uploading an FDO to a repository, what checks it should do (for instance conformance with the PID Profile, if PIDs resolve). ResourceSync [ANSI 2017] is suggested as one type of service to list FDOs. This document highlights potential practices by repositories and their clients, without adding any particular requirements.

Typing FAIR Digital Objects [Lannom 2022a] defines what type means for FDOs, primarily to enable machine actionability and to define an FDO’s purpose. This document lays out requirements for how FDO Types should themselves be specified as FDOs, and how an FDO Type Framework allows organising and locating types. Operations applicable to an FDO is not predefined for a type, however operations naturally will require certain FDO types to work. How to define such FDO operations is not specified.

Implementation of Attributes, Types, Profiles and Registries [Blanchi 2022b] details how to establish FDO registries for types and FDO profiles, with their association with PID systems. This document suggest policies and governance structures, together with guidelines for implementations, but without mandating any explicit technology choices. Differences in use of attributes are examplified using FDO PIDs for scientific instruments, and the proto-FDO approach of DARIAH-DE [Schwardmann 2022b].

It is worth pointing out that, except for the DOIP endorsement, all of these documents are conceptual, in the sense that they permit any technical implementation of FDO, if used according to the recommendations. Existing FDO implementations [Wittenburg 2022b] are thus not fully consolidated in choices such as protocols, type systems and serialisations – this divergence and corresponding additional technical requirements mean that FDOs are not yet in a single ecosystem.

From the Semantic Web to Linked Data

In order to describe Linked Data as it is used today, we’ll start with an (opinionated) description of the evolution of its foundation, the Semantic Web.

A brief history of the Semantic Web

The Semantic Web was developed as a vision by Tim Berners-Lee [Berners-Lee 1999], at a time that the Web had already become widely established for information exchange, being a global set of hypermedia documents which are cross-related using universal links in the form of URLs. The foundations of the Web (e.g. URLs, HTTP, SSL/TLS, HTML, CSS, ECMAScript/JavaScript, media types) were standardised by W3C, Ecma, IETF and later WHATWG. The goal of Semantic Web was to further develop the machine-readable aspects of the Web, in particular adding meaning (or semantics) to not just the link relations, but also to the resources that the URLs identified, and for machines thus being able to meaningfully navigate across such resources, e.g. to answer a particular query.

Through W3C, the Semantic Web was realised with the Resource Description Framework (RDF) [Schreiber 2014] that used triples of subject-predicate-object statements, with its initial serialisation format [Lassila 1999] being RDF/XML (XML was at the time seen as a natural data-focused evolution from the document-centric SGML and HTML).

While triple-based knowledge representations were not new [Stanczyk 1987], the main innovation of RDF was the use of global identifiers in the form of URIs³ as the primary identifier of the subject (what the statement is about), predicate (relation/attribute of the subject) and object (what is pointed to – see Figure 2). By using URIs not just for documents⁴, the Semantic Web builds a self-described system of types and properties, where the meaning of a relation can be resolved by following its hyperlink to the definition within a vocabulary. By applying these principles as well to any kind of resource that could be described at a URL, this then forms a global distributed Semantic Web.

Each resource in an RDF graph has an identifier, here shown as absolute URIs, a type and a series of properties. A property value can either be a literal (e.g. “Josiah Carberry”) or another resource (e.g. https://ror.org/03f0f6041). A graph is formed by such cross-references across resources. In the idealised Semantic Web, every URI would resolve to a description of its resource in RDF. In practice there can be misalignments of identifiers, vocabularies, resolution mechanisms, or simply lack of RDF adoption. Therefore, any RDF graph can describe any Web resource identified by its URI, and these descriptions, using an open world assumption [Drummond 2006], can be merged with other graphs describing the same resource. For brevity and comparison from later chapters this figure uses the newer RDF format JSON-LD cite{w3-json-ld}, which can be expanded with context http://schema.org/ (not shown) to anchor types and properties as absolute URIs and generate corresponding RDF triples (Listing 1). — Example of linked RDF resources
Each *resource* in an RDF graph has an identifier, here shown as absolute URIs, a type and a series of properties. A property value can either be a *literal* (e.g. `“Josiah Carberry”`) or another resource (e.g. `https://ror.org/03f0f6041`). A graph is formed by such cross-references across resources. In the idealised Semantic Web, every URI would resolve to a description of its resource in RDF. In practice there can be misalignments of identifiers, vocabularies, resolution mechanisms, or simply lack of RDF adoption.
Therefore, any RDF graph can describe any Web resource identified by its URI, and these descriptions, using an *open world assumption* [Drummond 2006], can be merged with other graphs describing the same resource. For brevity and comparison from later chapters this figure uses the newer RDF format JSON-LD cite{w3-json-ld}, which can be expanded with context `http://schema.org/` (not shown) to anchor types and properties as absolute URIs and generate corresponding RDF triples (Listing 1).

<http://example.com/figure.png> a <http://schema.org/ImageObject> .
<http://example.com/figure.png> <http://schema.org/name> "XXL-CT-scan of an XXL Tyrannosaurus rex skull" .
<http://example.com/figure.png> <http://schema.org/author> <https://orcid.org/0000-0002-1825-0097> .
<http://example.com/figure.png> <http://schema.org/encodingFormat> "image/png" .

<https://orcid.org/0000-0002-1825-0097> a <http://schema.org/Person> .
<https://orcid.org/0000-0002-1825-0097> <http://schema.org/name> "Josiah Carberry" .
<https://orcid.org/0000-0002-1825-0097> <http://schema.org/affiliation> <https://ror.org/03f0f6041> .

<https://ror.org/03f0f6041> a <http://schema.org/Organization> .
<https://ror.org/03f0f6041> <http://schema.org/name> "University of Technology Sydney" .
<https://ror.org/03f0f6041> <http://schema.org/url> "https://www.uts.edu.au/" .

Listing 1: Example of RDF triples. These triples correspond to figure 2 after expansion with a JSON-LD context. In this example the properties and types are all using the same vocabulary http://schema.org/, in the traditional Semantic Web it is common to mix vocabularies. This uses the RDF syntax N-Triples where each line indicates subject, predicate and object. Notable here is the syntactical difference between an URI reference that is part of the graph https://ror.org/03f0f6041 and a string literal "https://www.uts.edu.au/" which just happens to be a URI.

The early days of the Semantic Web saw fairly lightweight approaches with the establishment of vocabularies such as FOAF (to describe people and their affiliations) and Dublin Core (for bibliographic data). Vocabularies themselves were formalised using RDFS or simply as human-readable HTML web pages defining each term. The main approach of this Web of Data was that a URI identified a resource (e.g. an author) with a HTML representation for human readers, along with a RDF representation for machine-readable data of the same resource. By using content negotiation in HTTP, the same identifier could be used in both views, avoiding index.html vs index.rdf exposure in the URLs. The concept of namespaces gave a way to give a group of RDF resources with the same URI base from a Semantic Web-aware service a common prefix, avoiding repeated long URLs.

The mid-2000s saw large academic interest and growth of the Semantic Web, with the development of more formal representation system for ontologies, such as OWL [W3C 2012], allowing complex class hierarchies and logic inference rules following open world paradigm. More human-readable syntaxes for RDF such as Turtle evolved at this time, and conferences such as ISWC [Horrocks 2002] gained traction, with a large interest in knowledge representation and logic systems based on Semantic Web technologies evolving at the same time.

Established Semantic Web services and standards include: SPARQL [W3C 2013] (pattern-based triple queries), named graphs [Wood 2014] (triples expanded to quads to indicate statement source or represent conflicting views), triple/quad stores (graph databases such as OpenLink Virtuoso, GraphDB, 4Store), mature RDF libraries (including Redland RDF, Apache Jena, Eclipse RDF4J, RDFLib, RDF.rb, rdflib.js), and graph visualisation.

RDF is one way to implement knowledge graphs, a system of named edges and nodes⁵ [Nurdiati 2008], which when used to represent a sufficiently detailed model of the world, can then be queried and processed to answer detailed research questions. The creation of RDF-based knowledge graphs grew particularly in fields like bioinformatics, e.g. for describing genomes and proteins [Goble 2008, Williams 2012]. In theory, the use of RDF by the life sciences would enable interoperability between the many data repositories and support combined views of the many aspects of bio-entities – however in practice most institutions ended up making their own ontologies and identifiers, for what to the untrained eye would mean roughly the same. One can argue that the toll of adding the semantic logic system of rich ontologies meant that small, but fundamental, differences in opinion (e.g. should a gene identifier signify just the particular DNA sequence letters, or those letters as they appear in a particular position on a human chromosome?) led to large differences in representational granularity, and thus needed different identifiers.

Facing these challenges, thanks to the use of universal identifiers in the form of URIs, mappings could retrospectively be developed not just between resources, but also across vocabularies. Such mappings can themselves be expressed using lightweight and flexible RDF vocabularies such as SKOS [Isaac 2009] (e.g. dct:title skos:closeMatch schema:name to indicate near equivalence of two properties). Exemplifying the need for such cross-references, automated ontology mappings have identified large potential overlaps like 372 definitions of Person [Hu 2011].

The move towards Open Science data sharing practices did from the late 2000s encourage knowledge providers to distribute collections of RDF descriptions as downloadable datasets,⁶ so that their clients can avoid thousands of HTTP requests for individual resources. This enabled local processing, mapping and data integration across datasets (e.g. Open PHACTS [Groth 2014]), rather than relying on the providers’ RDF and SPARQL endpoints (which could become overloaded when handling many concurrent, complex queries).

With these trends, an emerging problem was that adopters of the Semantic Web primarily utillised it as a set of graph technologies, with little consideration to existing Web resources. This meant that links stayed mainly within a single information system, with little URI reuse even with large term overlaps [Kamdar 2017]. Just like link rot affect regular Web pages and their citations from scholarly communication [Klein 2014], for a majority of described RDF resources in the Linked Open Data (LOD) Cloud’s gathering of more than thousand datasets, unfortunately do not actually link to (still) downloadable (dereferenceable) Linked Data [Polleres 2020]. Another challenge facing potential adopters is the plethora of choices, not just to navigate, understand and select to reuse the many possible vocabularies and ontologies [Carriero 2020], but also technological choices on RDF serialisation (at least 7 formats), type system (RDFS [Guha 2014], OWL [W3C 2012], OBO [Tirmizi 2011], SKOS [Isaac 2009]), and deployment challenges [Sauermann 2008] (e.g. hash vs slash in namespaces, HTTP status codes and PID redirection strategies).

Linked Data: Rebuilding the Web of Data

The Linked Data (LD) concept [Bizer 2009] was kickstarted as a set of best practices [Berners-Lee 2006] to bring the Web aspect of the Semantic Web back into focus. Crucial to Linked Data is the reuse of existing URIs, rather than making new identifiers. This means a loosening of the semantic restrictions previously applied, and an emphasis on building navigable data resources, rather than elaborate graph representations.

Vocabularies like schema.org evolved not long after, intended for lightweight semantic markup of existing Web pages, primarily to improve search engines’ understanding of types and embedded data. In addition to several such embedded microformats [OGP, WHATWG 2023, Sporny 2015], we find JSON-LD [Sporny 2020] as a Web-focused RDF serialisation that aims for improved programmatic generation and consumption, including from Web applications. JSON-LD is as of 2023-05-18 used⁷ by 45% of the top 10 million websites [W3Tech 2023].

Recently there has been a renewed emphasis to improve the Developer Experience [Verborgh 2018] for consumption of Linked Data, for instance RDF Shapes – expressed in SHACL [Kontokostas 2017] or ShEx [Baker 2019] – can be used to validate RDF Data [Gayo 2017, Thornton 2019] before consuming it programmatically, or reshaping data to fit other models. While a varied set of tools for Linked Data consumptions have been identified, most of them still require developers to gain significant knowledge of the underlying Semantic Web technologies, which hampers adaption by non-LD experts [Klímek 2019], which then tend to prefer non-semantic two-dimensional formats such as CSV files.

A valid concern is that the Semantic Web research community has still not fully embraced the Web, and that the “final 20%” engineering effort is frequently overlooked in favour of chasing new trends such as Big Data and AI, rather than making powerful Linked Data technologies available to the wider groups of Web developers [Verborgh 2020]. One bridging gap here by the Linked Data movement has been “Linked Data by stealth” approaches such as structured data entry spreadsheets powered by ontologies [Wolstencroft 2011], the use of Linked Data as part of REST Web APIs [Page 2011], and as shown by the big uptake by publishers to annotate the Web using schema.org [Bernstein 2016], with vocabulary use patterns documented by copy-pastable JSON-LD examples, rather than by formalised ontologies or developer requirements to understand the full Semantic Web stack.

Linked Data provides technologies that have evolved over time to satisfy its primary purpose of data interoperability. The needs to embrace the Web and developer experience have been central lessons learned. In contrast, FDO is a new approach with many different potential paths forward, and having a partial overlap with the aims of Linked Data.

This chapter is an extract from the preprint Evaluating FAIR Digital Object and Linked Data as distributed object systems, authored by Stian Soiland-Reyes, Carole Goble, Paul Groth. Figures added here are not part of the preprint.

References

See chapter references.

The concept of “machine actionable” is extended by FDO beyond the FAIR principles' premise of accessible data/metadata with retrievable vocabularies, in that programmatic invocation of operations on FAIR Digital Objects can be reliably coded in advance based on the information provided by the objects themselves (see section 2.1.2). The implications of considering FDOs as a distributed object system is explored further in section 3.1. ↩︎
For a brief introduction to DOIP 2.0, see [CNRI 2023a] ↩︎
URIs [Berners-Lee 2005] are generalised forms of URLs that include locator-less identifiers such as ISBN book numbers (URNs). The distinction between locator-full and locator-less identifiers have weakened in recent years [OCLC 2010], for instance DOI identifiers now are commonly expressed with the prefix https://doi.org/ rather than as URNs with info:doi: given that the URL/URN gap has been bridged by HTTP resolvers and the use of Persistent Identifiers (PIDs) [Juty 2011]. RDF 1.1 formats use Unicode to support IRIs [Dürst 2005], which extends URIs to include international characters and domain names. ↩︎
URIs can also identify non-information resources for any kind of physical object (e.g. people), such identifiers can resolve with 303 See Other redirections to a corresponding information resources [Sauermann 2008]. ↩︎
In RDF, each triple represent an edge that is named using its property URI, and the nodes are subject/object as URIs, blank nodes or (for objects) typed literal values [Schreiber 2014]. ↩︎
Datasets that distribute RDF graphs should not be confused with RDF Datasets used for partitioning named graphs. ↩︎
Presumably this large uptake of JSON-LD is mainly for the purpose of Search Engine Optimisation (SEO), with typically small amounts of metadata which may not constitute Linked Data as introduced above, however this deployment nevertheless constitute machine-actionable structured data. ↩︎