Formalizing RO-Crate in First Order Logic

Below is a formalization of the concept of RO-Crate as a set of relations using First Order Logic:

Language

Definition of language 𝕃𝖗𝖔𝖈𝖗𝖆𝖙𝖊:

𝕃𝖗𝖔𝖈𝖗𝖆𝖙𝖊 = { Property(p), Class(c), Value(x), ℝ, 𝕊 }
     𝔻 =  𝕀𝕣𝕚
    𝕀𝕣𝕚 ≡  { IRIs as defined in RFC3987 }
     ℝ ≡  { real or integer numbers }
     𝕊 ≡  { literal strings }

The domain of discourse is the set of 𝕀𝕣𝕚 identifiers [42] (notation <http://example.com/>)¹, with additional descriptions using numbers ℝ (notation 13.37) and literal strings 𝕊 (notation “Hello”).

From this formalised language 𝕃𝖗𝖔𝖈𝖗𝖆𝖙𝖊 we can interpret an RO-Crate in any representation that can gather these descriptions, their properties, classes, and literal attributes.
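As a minimal sketch of how the three kinds of terms could be told apart in a program, the Python below wraps identifiers in a hypothetical `IRI` class and treats numbers and strings as values. The names `IRI` and `is_value` are assumptions made for illustration; the IRI syntax is not validated against RFC3987.

```python
from dataclasses import dataclass
from numbers import Real


@dataclass(frozen=True)
class IRI:
    """An identifier from the domain of discourse (syntax is not checked here)."""
    value: str


def is_value(x) -> bool:
    """Value(x): x is a number (ℝ) or a literal string (𝕊), never an IRI."""
    if isinstance(x, bool):
        # bool is a subclass of int in Python; excluded here by assumption
        return False
    return isinstance(x, (Real, str))
```

Keeping `IRI` as a separate type, rather than a plain string, avoids confusing an identifier such as <http://example.com/> with the literal string “http://example.com/”.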

Minimal RO-Crate

Below we use 𝕃𝖗𝖔𝖈𝖗𝖆𝖙𝖊 to define a minimal² RO-Crate:

                ROCrate(R) ⊨  Root(R) ∧ Mentions(R, R) ∧ hasPart(R, d) ∧
                              Mentions(R, d) ∧ DataEntity(d) ∧
                              Mentions(R, c) ∧ ContextualEntity(c)
                ∀r Root(r) ⇒  Dataset(r) ∧ name(r, n) ∧
                              description(r, d) ∧
                              datePublished(r, date) ∧
                              license(r, l)
           ∀e∀n name(e, n) ⇒  Value(n)
    ∀e∀s description(e, s) ⇒  Value(s)
  ∀e∀d datePublished(e, d) ⇒  Value(d)
        ∀e∀l license(e, l) ⇒  ContextualEntity(l)
             DataEntity(e) ≡  File(e) ⊕ Dataset(e)
                 Entity(e) ≡  DataEntity(e) ∨ ContextualEntity(e)
              ∀e Entity(e) ⇒  type(e, c) ∧ Class(c)
    ∀e ContextualEntity(e) ⇒  name(e, n)
            Mentions(R, s) ⊨  Relation(s, p, e)  ⊕  Attribute(s, p, l)
         Relation(s, p, o) ⊨  Entity(s) ∧ Property(p) ∧ Entity(o)
        Attribute(s, p, x) ⊨  Entity(s) ∧ Property(p) ∧ Value(x)
                  Value(x) ≡  x ∈ ℝ  ⊕  x ∈ 𝕊

An ROCrate(R) is defined as a self-described Root Data Entity, which describes and contains parts (data entities), which are further described in contextual entities. These terms align with their use in the RO-Crate 1.1 terminology.

The Root(r) is a type of Dataset(r), and must have as metadata at least the attributes name, description and datePublished, as well as a contextual entity that identifies its license. These predicates correspond to the RO-Crate 1.1 minimal requirements for the root data entity.

The concept of an Entity(e) is introduced as being either a DataEntity(e), a ContextualEntity(e), or both. Any Entity(e) must be typed with at least one Class(c), and every ContextualEntity(e) must also have a name(e,n); this corresponds to the expectations for any referenced contextual entity (see section on contextual entities).

For simplicity in this formalization (and to assist production rules below) R is a constant representing a single RO-Crate, typically written to independent RO-Crate Metadata files. R is used by Mentions(R, e) to indicate that e is an Entity described by the RO-Crate and therefore its metadata (a set of Relation and Attribute predicates) forms part of the RO-Crate serialization. Relation(s, p, o) and Attribute(s, p, x) are defined as a subject-predicate-object triple pattern from an Entity(s) using a Property(p) to either another Entity(o) or a Value(x).
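The Root(r) requirements can be checked mechanically once the predicates are written down as facts. The sketch below assumes a hypothetical encoding in which unary predicates are 2-tuples such as `("Dataset", r)` and binary predicates are 3-tuples such as `("name", r, n)`; the function name `missing_root_requirements` is likewise an illustration, not part of RO-Crate.

```python
def missing_root_requirements(facts, root):
    """List the Root(r) requirements that the given fact set does not satisfy."""

    def has(prop, subject):
        # True if some binary fact prop(subject, _) is present
        return any(len(f) == 3 and f[0] == prop and f[1] == subject for f in facts)

    missing = []
    if ("Dataset", root) not in facts:
        missing.append("Dataset")
    for prop in ("name", "description", "datePublished", "license"):
        if not has(prop, root):
            missing.append(prop)
    # license(e, l) ⇒ ContextualEntity(l); every ContextualEntity(e) needs name(e, n)
    for f in sorted(f for f in facts if len(f) == 3):
        if f[0] == "license" and f[1] == root:
            if ("ContextualEntity", f[2]) not in facts:
                missing.append("ContextualEntity(license)")
            if not has("name", f[2]):
                missing.append("name(license)")
    return missing


# Example facts (abbreviated, hypothetical literal values)
crate = "http://example.com/ro/123/"
cc_by = "https://spdx.org/licenses/CC-BY-4.0"
facts = {
    ("Dataset", crate),
    ("name", crate, "Data files"),
    ("description", crate, "Palliative care planning"),
    ("datePublished", crate, "2017"),
    ("license", crate, cc_by),
    ("ContextualEntity", cc_by),
    ("name", cc_by, "Creative Commons Attribution 4.0"),
}
```

This covers only the root requirements; the hasPart and Mentions conditions of ROCrate(R) would need similar checks.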

Example of formalised RO-Crate

Below is an example RO-Crate represented using the above formalization, assuming a base IRI of http://example.com/ro/123/:

ROCrate(<http://example.com/ro/123/>)
name(<http://example.com/ro/123/>, 
    “Data files associated with the manuscript:Effects of …”)
description(<http://example.com/ro/123/>, 
    “Palliative care planning for nursing home residents …”)
license(<http://example.com/ro/123/>, 
    <https://spdx.org/licenses/CC-BY-4.0>)
datePublished(<http://example.com/ro/123/>, “2017”)
hasPart(<http://example.com/ro/123/>, <http://example.com/ro/123/survey.csv>)
hasPart(<http://example.com/ro/123/>, <http://example.com/ro/123/interviews/>)

ContextualEntity(<https://spdx.org/licenses/CC-BY-4.0>)
name(<https://spdx.org/licenses/CC-BY-4.0>, 
    “Creative Commons Attribution 4.0”)

ContextualEntity(<https://spdx.org/licenses/CC-BY-NC-4.0>)
name(<https://spdx.org/licenses/CC-BY-NC-4.0>, 
    “Creative Commons Attribution Non Commercial 4.0”)

File(<http://example.com/ro/123/survey.csv>)
name(<http://example.com/ro/123/survey.csv>, “Survey of care providers”)

Dataset(<http://example.com/ro/123/interviews/>)
name(<http://example.com/ro/123/interviews/>, 
    “Audio recordings of care provider interviews”)
license(<http://example.com/ro/123/interviews/>, 
    <https://spdx.org/licenses/CC-BY-NC-4.0>)

Notable from this triple-like formalization is that an RO-Crate R is fully represented as a tree at depth 2, helped by the use of 𝕀𝕣𝕚 nodes. For instance, the aggregation from the root entity hasPart(…interviews/>) is at the same level as the data entity’s property license(…CC-BY-NC-4.0>) and that contextual entity’s attribute name(…Non Commercial 4.0”). As shown in section RO-Crate JSON-LD, the RO-Crate Metadata File serialization is an equivalent shallow tree, although at depth 3 to cater for the JSON-LD preamble of "@context" and "@graph".
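The shallow shape can be made concrete by grouping binary facts by their subject. The following sketch assumes facts are given as hypothetical `(property, subject, object)` tuples with abbreviated identifiers; nested entities never appear, because objects are referenced by identifier only.

```python
from collections import defaultdict


def group_by_subject(facts):
    """Group (property, subject, object) facts into subject -> property -> objects."""
    tree = defaultdict(lambda: defaultdict(list))
    for prop, subject, obj in facts:
        tree[subject][prop].append(obj)
    # Convert to plain dicts so the depth-2 shape is easy to inspect
    return {subject: dict(props) for subject, props in tree.items()}


# Abbreviated, illustrative identifiers
example = [
    ("hasPart", "ro/123/", "ro/123/interviews/"),
    ("license", "ro/123/interviews/", "CC-BY-NC-4.0"),
    ("name", "CC-BY-NC-4.0", "Creative Commons Attribution Non Commercial 4.0"),
]
tree = group_by_subject(example)
```

All three subjects end up as sibling keys of `tree`, regardless of how the entities relate to each other.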

In reality many additional attributes and contextual types from Schema.org, like http://schema.org/affiliation and http://schema.org/Organization, would be used to further describe the RO-Crate and its entities, but as these are optional (SHOULD requirements) they do not form part of this formalization.

Mapping to RDF with Schema.org

A formalised RO-Crate can be mapped to different serializations. Assume a simplified³ language 𝕃𝖗𝖉𝖋 based on the RDF abstract syntax [98]:

                𝕃𝖗𝖉𝖋 = { Triple(s,p,o), IRI(i), BlankNode(b), Literal(s),
                         𝕀𝕣𝕚, ℝ, 𝕊 }
                𝔻𝖗𝖉𝖋 = 𝕊
           ∀i IRI(i) ⇒ i ∈ 𝕀𝕣𝕚
∀s∀p∀o Triple(s,p,o) ⇒ ( IRI(s) ∨ BlankNode(s) ) ∧
                        IRI(p) ∧
                      ( IRI(o) ∨ BlankNode(o) ∨ Literal(o) )
          Literal(v) ⊨ Value(v) ∧ Datatype(v,t) ∧ IRI(t)
         ∀v Value(v) ⇒ v ∈ 𝕊
    LanguageTag(v,l) ≡ Datatype(v,
          <http://www.w3.org/1999/02/22-rdf-syntax-ns#langString>)

Below follows a mapping from 𝕃𝖗𝖔𝖈𝖗𝖆𝖙𝖊 to 𝕃𝖗𝖉𝖋 using Schema.org.

        Property(p) ⇒ type(p,
             <http://www.w3.org/2000/01/rdf-schema#Property>)
           Class(c) ⇒ type(c,
             <http://www.w3.org/2000/01/rdf-schema#Class>)
         Dataset(d) ⇒ type(d, <http://schema.org/Dataset>)
            File(f) ⇒ type(f, <http://schema.org/MediaObject>)
ContextualEntity(e) ⇒ type(e, <http://schema.org/Thing>)
    CreativeWork(e) ⇒ ContextualEntity(e) ∧
                        type(e, <http://schema.org/CreativeWork>)
      hasPart(e, t) ⇒ Relation(e, <http://schema.org/hasPart>, t)
         name(e, n) ⇒ Attribute(e, <http://schema.org/name>, n)
  description(e, s) ⇒ Attribute(e, <http://schema.org/description>, s)
datePublished(e, d) ⇒ Attribute(e, <http://schema.org/datePublished>, d)
      license(e, l) ⇒ Relation(e, <http://schema.org/license>, l) ∧
                      CreativeWork(l)
         type(e, t) ⇒ Relation(e,
             <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, t) ∧
                      Class(t)
          String(s) ≡ Value(s) ∧  s ∈ 𝕊
          String(s) ⇒ Datatype(s, 
             <http://www.w3.org/2001/XMLSchema#string>)
         Decimal(d) ≡ Value(d) ∧  d ∈ ℝ
         Decimal(d) ⇒ Datatype(d,
             <http://www.w3.org/2001/XMLSchema#decimal>)
    Relation(s,p,o) ⇒ Triple(s,p,o) ∧ IRI(s) ∧ IRI(o)
   Attribute(s,p,o) ⇒ Triple(s,p,o) ∧ IRI(s) ∧ Literal(o)

Note that in the JSON-LD serialization of RO-Crate the expression of Class and Property is typically indirect: The JSON-LD @context maps to Schema.org IRIs, which, when resolved as Linked Data, embed their formal definitions as RDFa. Extensions may however include such term definitions directly in the RO-Crate.
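A rough sketch of this mapping in Python follows. It assumes the same hypothetical fact encoding as tuples (`("Dataset", d)`, `("name", e, n)`), represents a literal as a `(lexical form, datatype IRI)` pair, and simplifies the CreativeWork(l) rule by emitting only the CreativeWork type for license targets. The function names are illustrative.

```python
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"

CLASSES = {"Dataset": SCHEMA + "Dataset",
           "File": SCHEMA + "MediaObject",
           "ContextualEntity": SCHEMA + "Thing"}
RELATIONS = {"hasPart", "license"}
ATTRIBUTES = {"name", "description", "datePublished"}


def literal(value):
    """String(s) ⇒ xsd:string; Decimal(d) ⇒ xsd:decimal."""
    if isinstance(value, str):
        return (value, XSD + "string")
    return (str(value), XSD + "decimal")


def to_triples(facts):
    """Map tuple-encoded predicates to a set of (s, p, o) triples."""
    triples = set()
    for fact in facts:
        predicate, args = fact[0], fact[1:]
        if predicate in CLASSES:
            triples.add((args[0], RDF_TYPE, CLASSES[predicate]))
        elif predicate in RELATIONS:
            subject, obj = args
            triples.add((subject, SCHEMA + predicate, obj))
            if predicate == "license":
                # license(e, l) also implies CreativeWork(l)
                triples.add((obj, RDF_TYPE, SCHEMA + "CreativeWork"))
        elif predicate in ATTRIBUTES:
            subject, value = args
            triples.add((subject, SCHEMA + predicate, literal(value)))
    return triples
```

Predicates outside the three tables are silently ignored here; a fuller implementation would probably want to report them instead.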

RO-Crate 1.1 Metadata File Descriptor

An important RO-Crate principle is that of being self-described. Therefore the serialization of the RO-Crate into a file should also describe itself in a Metadata File Descriptor, indicating it is about (describing) the RO-Crate root data entity, and that it conformsTo a particular version of the RO-Crate specification:

               about(s,o) ⇒  Relation(s, <http://schema.org/about>, o)
          conformsTo(s,o) ⇒  Relation(s, 
                               <http://purl.org/dc/terms/conformsTo>, o)
MetadataFileDescriptor(m) ⇒ ( CreativeWork(m) ∧ about(m,R) ∧ ROCrate(R) ∧ 
                             conformsTo(m,
                               <https://w3id.org/ro/crate/1.1>) )

Note that although the metadata file necessarily is an information resource written to disk or served over the network (as JSON-LD), it is not considered to be a contained part of the RO-Crate in the form of a data entity; rather, it is described only as a contextual entity.

In the conceptual model the RO-Crate Metadata File can be seen as the top-level node that describes the RO-Crate Root; however, in the formal model (and the JSON-LD format) the metadata file descriptor is an additional contextual entity that does not affect the depth limit of the RO-Crate.
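To illustrate, the descriptor can be written as a single JSON object with unshortened property IRIs. This is a sketch under assumptions: the function name is hypothetical, and the default identifiers `./` and `ro-crate-metadata.json` are the relative IRIs used for R and its descriptor in the production rules of this appendix.

```python
import json


def metadata_descriptor(root_id="./", descriptor_id="ro-crate-metadata.json"):
    """Build the metadata file descriptor entity as a JSON-LD style dict."""
    return {
        "@id": descriptor_id,
        "@type": "http://schema.org/CreativeWork",
        # about(m, R): the descriptor describes the root data entity
        "http://schema.org/about": {"@id": root_id},
        # conformsTo(m, <https://w3id.org/ro/crate/1.1>)
        "http://purl.org/dc/terms/conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
    }


descriptor_json = json.dumps(metadata_descriptor(), indent=2)
```

The descriptor is just another entry at the same level as every other entity, which is why it does not add depth.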

Forward-chained Production Rules for JSON-LD

Combining the above predicates and Schema.org mapping with rudimentary JSON templates, these forward-chaining production rules can output JSON-LD according to the RO-Crate 1.1 specification⁴:

 Mentions(R, s) ∧ Relation(s, p, o) ⇒  Mentions(R, o)
                             IRI(i) ⇒ "i"
                         Decimal(d) ⇒  d
                          String(s) ⇒ "s"
                     ∀e∀t type(e,t) ⇒  { "@id": e,
                                         "@type": t
                                       }
             ∀s∀p∀o Relation(s,p,o) ⇒  { "@id": s,
                                         p: { "@id": o }
                                       }
            ∀s∀p∀v Attribute(s,p,v) ⇒  { "@id": s,
                                         p: v
                                       }
                   ∀r∀c  ROCrate(r) ⇒  { "@graph": [
                                           Mentions(r, c)*
                                         ]
                                       }
                                  R ⊨  <./>
                                  R ⇒  MetadataFileDescriptor(
                                         <ro-crate-metadata.json>)

This exposes the first order logic domain of discourse of IRIs, with rational numbers and strings as their corresponding JSON-LD representation. These production rules first grow the graph of R with a transitive rule: if an entity mentioned in R is related to o, then o is also considered mentioned by the RO-Crate R. For simplicity this rule is one-way; in theory the JSON-LD graph can also contain free-standing contextual entities that have outgoing relations to data- and contextual entities, but these are proposed to be bound to the root data entity with Schema.org relation http://schema.org/mentions.
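The forward-chaining step and the JSON templates can be sketched in Python as below. The sketch assumes relations and attributes are given as hypothetical `(subject, property, object)` tuples with already expanded property IRIs, and it merges all facts about one subject into a single JSON object, whereas the rules above emit one object per fact.

```python
def mentioned(root, relations):
    """Forward-chain Mentions(R, s) ∧ Relation(s, p, o) ⇒ Mentions(R, o)."""
    seen = {root}
    frontier = [root]
    while frontier:
        current = frontier.pop()
        for subject, _prop, obj in relations:
            if subject == current and obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen


def to_jsonld(root, relations, attributes):
    """Emit {"@graph": [...]} with one object per mentioned entity."""
    entities = {e: {"@id": e} for e in mentioned(root, relations)}
    for subject, prop, obj in relations:
        if subject in entities:
            entities[subject].setdefault(prop, []).append({"@id": obj})
    for subject, prop, value in attributes:
        if subject in entities:
            entities[subject].setdefault(prop, []).append(value)
    # Sort by identifier so the output is deterministic
    return {"@graph": [entities[e] for e in sorted(entities)]}
```

An entity that only has outgoing relations towards the crate, and is not reachable from the root, is left out of the graph; this mirrors the one-way limitation described above.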


This is an appendix to the paper Packaging research artefacts with RO-Crate by Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble.


  1. For simplicity, blank nodes are not included in this formalization, as RO-Crate recommends the use of IRI identifiers.

  2. The full list of types, relations and attribute properties from the RO-Crate specification is not included. Examples shown include datePublished, CreativeWork and name.

  3. This simplification and mapping does not cover the extensive list of literal datatypes built into RDF 1.1, only strings and decimal real numbers. Likewise, LanguageTag is deliberately not utilised below.

  4. Limitations: Contextual entities not related from the RO-Crate (e.g. using inverse relations to a data entity) would not be covered by the single direction Mentions(R, s) production rule; see issue 122. The datePublished(e, d) rule does not include syntax checks for the ISO 8601 datetime format. Compared with RO-Crate examples, this generated JSON-LD does not use a @context as the IRIs are produced unshortened; a post-step could do JSON-LD Flattening with a versioned RO-Crate context. The @type expansion is included for clarity, even though this is also implied by the type(e, t) expansion to Relation(e, rdf:type).