The Archive and Package (arcp) URI scheme
- Authors
- Stian Soiland-Reyes <https://orcid.org/0000-0001-9842-9718>,
- Marcos Cáceres <https://marcosc.com/>, Mozilla Corporation
- Abstract
- The arcp URI scheme is introduced for location-independent identifiers to consume or reference hypermedia and linked data resources bundled inside a file archive, as well as to resolve archived resources within programmatic frameworks for Research Objects.
Research Object: http://s11.no/2018/arcp.html#ro
Background
Archive formats like BagIt [1] have been recognized as important for preserving and transferring datasets and other digital resources [2]. More specific examples include COMBINE archives [3] for systems biology, CDF [4] for astronomy data, as well as the more general HDF5 [5], which is also used for meteorological data. For the purpose of this article, an archive is a collection of data files with related metadata, typically packaged as a compressed file like .zip or .tar.gz.
One challenge with regard to embedding Linked Data in such archives is how to reliably generate and resolve internal URLs. For instance, <dataset13.zip> may contain an RDF Turtle file <metadata/description.ttl> to describe the CSV file <data/survey.csv>, but to reference that file correctly it must use either a relative path <../data/survey.csv> or some pre-existing Web URL like <http://example.com/dataset13/survey.csv>.
The Research Object Bundle [6] format suggested re-using the app URL scheme for minting absolute URIs from relative paths of resources within a ZIP file. The app URL scheme [7] was originally intended for packaged web applications, where each application would get its own namespace like <app://c6179148-3cde-4435-8e66-304453f89d59/> with paths resolved from the corresponding application package ZIP file. However, the app URL scheme did not progress further on the W3C Recommendation track, and this approach was abandoned in favour of the combination of Web App Manifest [8] and Service Workers [9]. Together these technologies reuse the http/https origin URL of a downloaded application manifest together with relative links, while also allowing a web application to work offline.
The Archive and Package (arcp) URI scheme
Inspired by the app URL scheme we defined the Archive and Package (arcp) URI scheme [10], an IETF Internet-Draft which specifies how to mint URIs to reference resources within any archive or package, independent of archive format or location.
The primary use case for arcp is for consuming applications, which may receive an archive in various ways, like file upload from a web browser or by reference to a dataset in a repository like Zenodo or FigShare. In order to parse Linked Data resources (say, to expose them for SPARQL queries), such applications need to generate a base URL for the root of the archive.
It should be clear that local file URIs [11] for extracted archives, like <file:///tmp/tmp.cUK6ERfdBe/>, do not serve well for this purpose, as they are not universally unique, are difficult to create consistently, and may introduce security risks from attacks like <../../etc/passwd>. Similarly, it may be inappropriate to mint new web-based URIs like <http://repo.example.com/cUK6ERfdBe/>, as web presence should not be a requirement to process a linked data archive, in particular as processing may occur on a laptop or a cloud node with no public IP address.
Identifier structure
By definition an arcp identifier is a URI [12] with three parts: a prefix, an identifier for the archive, and a path, in the form arcp://prefix,identifier/path.
The arcp Internet-Draft specifies three initial prefix values: uuid, ni and name, each of which defines how to identify a particular archive by a corresponding namespace. These namespaces are not intended to be directly resolvable without prior knowledge of the corresponding archive.
The path is the folder and file path within the archive, represented as a URI path [12], e.g. /file.txt or /my%20project/about/intro.doc, using percent-escaping where needed. The root folder / represents the archive itself.
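As a minimal sketch of the structure described above, the following uses only Python's standard urllib.parse module to split an arcp URI into prefix, identifier and path. The helper name split_arcp and the ArcpParts tuple are illustrative assumptions, not part of the Internet-Draft or of any arcp library, and the sketch simply rejects authorities that lack a prefix.

```python
from typing import NamedTuple
from urllib.parse import unquote, urlsplit


class ArcpParts(NamedTuple):
    prefix: str      # "uuid", "ni" or "name"
    identifier: str  # identifies the archive within that namespace
    path: str        # percent-decoded path within the archive


def split_arcp(uri: str) -> ArcpParts:
    """Split an arcp URI into prefix, identifier and path (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.scheme != "arcp":
        raise ValueError(f"not an arcp URI: {uri!r}")
    # The authority component has the form "prefix,identifier".
    prefix, sep, identifier = parts.netloc.partition(",")
    if not sep or not identifier:
        raise ValueError(f"missing prefix or identifier: {uri!r}")
    # Assumption of this sketch: an empty path is treated as the root folder "/".
    return ArcpParts(prefix, identifier, unquote(parts.path) or "/")


print(split_arcp(
    "arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/my%20project/about/intro.doc"
))
```

The path is decoded only after the URI has been split, so an escaped character cannot change where the authority ends.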
UUID-based identifiers
The simplest case for temporary sandbox processing of an archive with arcp is to generate a new random UUIDv4 [13], e.g. c6179148-3cde-4435-8e66-304453f89d59. The corresponding base URI is then <arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/>, so that a resource like <arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/metadata/description.ttl> can reference <arcp://uuid,c6179148-3cde-4435-8e66-304453f89d59/data/survey.csv>. The application can then translate arcp URIs to local paths by using a URI parsing library to select the URI path and appending it to the locally extracted path. Such arcp identifiers are temporary in nature, but the application can maintain a mapping from the UUID to the archive and perform extraction on demand, or the archive can self-declare its UUID, for instance in the External-Identifier header in BagIt [1].
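The translation from arcp to local paths might look roughly like the sketch below, which uses only the Python standard library. The function names new_arcp_base and to_local_path and the extraction directory are hypothetical; the sketch also refuses any ".." segment, as one possible guard against the path traversal risk mentioned earlier.

```python
import uuid
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def new_arcp_base() -> str:
    """Mint a temporary arcp base URI from a random UUIDv4."""
    return f"arcp://uuid,{uuid.uuid4()}/"


def to_local_path(arcp_uri: str, base_uri: str, extracted_root: Path) -> Path:
    """Map an arcp URI onto a file below extracted_root (hypothetical helper)."""
    uri, base = urlsplit(arcp_uri), urlsplit(base_uri)
    if uri.scheme != "arcp" or uri.netloc.lower() != base.netloc.lower():
        raise ValueError("URI does not belong to this archive")
    # Decode percent-escapes, then drop the leading "/" of the URI path.
    segments = PurePosixPath(unquote(uri.path)).parts[1:]
    # Reject anything that could climb out of the extraction folder.
    if ".." in segments:
        raise ValueError("path traversal is not allowed")
    return extracted_root.joinpath(*segments)


base = new_arcp_base()
root = Path("/tmp/extracted-archive")  # assumed location of the unpacked archive
print(to_local_path(base + "data/survey.csv", base, root))
```

An application keeping several archives open could hold a dictionary from each base URI to its extraction folder and call to_local_path with the matching pair.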
arcp also suggests how a UUID can be reliably created from the URL location of an archive. If the application is processing <http://example.com/download/archive13.zip>, it can create a name-based UUIDv5 [13] by SHA-1 hashing the URL string to mint <arcp://uuid,d9f0b57d-0504-5e9a-abae-f5f2b8c49b94/>. With this method anyone processing that archive URL will always get the same arcp base URI; however, the application will still need to maintain a mapping to find the original archive URL. Location-based arcp identifiers may also not be ideal for preservation purposes, as the archive might change upstream or move to a different location.
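A location-based identifier can be sketched with Python's uuid module. The function name arcp_from_location is illustrative, and the use of the predefined URL namespace from RFC 4122 (uuid.NAMESPACE_URL) is an assumption of this sketch rather than something stated in this article.

```python
import uuid


def arcp_from_location(url: str) -> str:
    """Derive a stable arcp base URI from an archive's URL (sketch)."""
    # Assumption: the URL string is hashed within the RFC 4122 URL namespace.
    name_based = uuid.uuid5(uuid.NAMESPACE_URL, url)
    return f"arcp://uuid,{name_based}/"


print(arcp_from_location("http://example.com/download/archive13.zip"))
```

Because UUIDv5 is deterministic, two applications calling this function with the same URL string obtain the same base URI, while any difference in the string (such as a trailing slash) yields a different one.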
Hash-based identifiers
For such cases arcp defines a hash-based method, where the bytes of the archive file are used to compute a checksum-based identifier following the Naming Things with Hashes (ni) URI scheme [14]. For instance, if the sha-256 checksum of a zip file is in hexadecimal 7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069, then the ni URI would be <ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk>, using the URL-safe base64 encoding of the checksum. The corresponding arcp base URI for resources within the archive is then <arcp://ni,sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk/>. With this method, anyone processing a byte-wise equal archive (using the same hash method) will get the same identifier.
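The hash-based derivation can be sketched with Python's hashlib and base64 modules. The function names are illustrative; the input b"Hello World!" is the example text from RFC 6920, which has the sha-256 checksum quoted above and stands in here for the bytes of an archive file.

```python
import base64
import hashlib


def ni_digest(data: bytes) -> str:
    """Return the "sha-256;<value>" part shared by the ni and arcp URIs."""
    digest = hashlib.sha256(data).digest()
    # RFC 6920 uses the URL-safe base64 alphabet without "=" padding.
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha-256;{value}"


def ni_uri(data: bytes) -> str:
    """ni URI for the given bytes, with an empty authority."""
    return f"ni:///{ni_digest(data)}"


def arcp_from_bytes(data: bytes) -> str:
    """Hash-based arcp base URI for the given archive bytes."""
    return f"arcp://ni,{ni_digest(data)}/"


archive_bytes = b"Hello World!"  # stand-in for the content of a real archive
print(ni_uri(archive_bytes))
print(arcp_from_bytes(archive_bytes))
```

For large archives the file would normally be read in chunks and fed to the hash object with update() rather than loaded into memory at once.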
Another advantage is that hash-identified archives can be retrieved from a NI resolver [14] using well-known paths [15], e.g. <http://repo.example.com/.well-known/ni/sha-256/f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk>. Clients can verify the checksum of the downloaded archive, so any resolver endpoint can be used.
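A client-side view of this might be sketched as follows, again using only the Python standard library. The function names well_known_ni_url and matches_checksum are hypothetical, the resolver address is the example host from the text, and the actual download is left out so that the sketch stays offline.

```python
import base64
import hashlib
from urllib.parse import urlsplit


def well_known_ni_url(arcp_uri: str, resolver: str) -> str:
    """Build the .well-known/ni URL for a hash-based arcp URI (sketch)."""
    prefix, _, identifier = urlsplit(arcp_uri).netloc.partition(",")
    if prefix != "ni":
        raise ValueError("only hash-based (ni) arcp URIs can be resolved this way")
    algorithm, _, value = identifier.partition(";")
    return f"{resolver.rstrip('/')}/.well-known/ni/{algorithm}/{value}"


def matches_checksum(data: bytes, arcp_uri: str) -> bool:
    """Check downloaded bytes against the sha-256 value in the arcp URI."""
    identifier = urlsplit(arcp_uri).netloc.partition(",")[2]
    algorithm, _, expected = identifier.partition(";")
    if algorithm != "sha-256":
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    digest = hashlib.sha256(data).digest()
    actual = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return actual == expected


uri = "arcp://ni,sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk/"
print(well_known_ni_url(uri, "http://repo.example.com"))
```

Since matches_checksum depends only on the bytes and the URI, the same check applies no matter which resolver served the download.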
Name-based identifiers
Finally, paying homage to its origin in app URLs, arcp can use a system-based app name. This is a suggested mechanism for resolving resources of an application package installed in a runtime system that assigns application identifiers, such as an Android applicationId or a Java package name. Such an identifier can be directly reused in arcp for URIs within that runtime system, e.g. the URI <arcp://name,com.example.myapp/styles/resource1.css> references the resource styles/resource1.css within the installed package com.example.myapp.
As application package contents do not necessarily correspond to archive file listings, it is open-ended how name-based arcp identifiers can be resolved. Indeed, package contents may vary per operating system, device type or application version, so name-based arcp identifiers should be treated as system-local identifiers similar to file:/// URIs [11], but within a particular programming framework.
arcp implementations
The arcp Python library [23] was developed to help create, parse and validate arcp URIs. In particular it can generate arcp URIs based on random UUIDs, URL locations, names and hashes of archive bytes. The arcp parser recognizes the arcp prefix and can extract UUIDs or hashes, and can generate the corresponding .well-known/ni URI for retrieving the archive. This library is meant to complement Python’s urlparse library, and so it is deemed out of scope for it to do any kind of resolution of arcp based on archive or network access.
The Research Object Bundle library, part of Apache Taverna (incubating), is adding support for arcp URIs in its opening and creation of RO bundles, initially using the arcp UUID format as a replacement for app URIs, with planned support also for hash-based identifiers and opening RO Bundles from a .well-known/ni endpoint.
The CWLProv [24] approach for capturing provenance of Common Workflow Language executions uses arcp in its BagIt External-Identifier to identify its research object.
For CWLProv the use of arcp is crucial, as it assigns global identifiers for use across resources in the RO bag, including the RO manifest itself and in W3C PROV file formats like PROV-N and N-Triples, neither of which supports relative URIs.
In this approach the UUID of the RO identifier <arcp://uuid,82dee268-2411-45a2-83a9-3be14f84b754/> also appears in the identifier <urn:uuid:82dee268-2411-45a2-83a9-3be14f84b754> of the top-level workflow run (the PROV Activity). This showcases how an RO that is the primary representation of a non-information resource (e.g. a process) can be identified using a directly derived arcp URI. While this could in theory also have been achieved with an arcp UUIDv5 derived from the URL “location” of the activity <urn:uuid:82dee268-2411-45a2-83a9-3be14f84b754>, that could be a confusing hack, as such URNs are not resolvable URLs. UUIDv5 hashing can however be appropriate for non-information resources that have a resolvable http/https permalink.
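The direct derivation of an arcp base URI from a urn:uuid identifier, and its use to make paths inside the RO absolute, can be sketched as below. The function names and the file path in the usage line are hypothetical and are not taken from CWLProv itself.

```python
import uuid
from urllib.parse import quote


def arcp_from_urn(urn: str) -> str:
    """Reuse the UUID of a urn:uuid identifier as an arcp base URI."""
    # uuid.UUID accepts the "urn:uuid:" form and validates the value.
    return f"arcp://uuid,{uuid.UUID(urn)}/"


def absolute_uri(base: str, relative_path: str) -> str:
    """Turn a path inside the RO into an absolute arcp URI (hypothetical helper)."""
    # Percent-escape characters such as spaces, keeping "/" as the separator.
    return base + quote(relative_path.lstrip("/"))


base = arcp_from_urn("urn:uuid:82dee268-2411-45a2-83a9-3be14f84b754")
print(absolute_uri(base, "metadata/provenance.nt"))  # hypothetical file in the RO
```

The sketch concatenates strings rather than calling urllib.parse.urljoin, because urljoin only resolves relative references for the schemes it recognises and would return the relative path unchanged for an arcp base.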
Conclusion
This article proposes the arcp identifier scheme for resources within archives using formats like ZIP, tar and BagIt, and suggests that arcp is useful for identifying standalone Research Objects and for processing Linked Data embedded in archives. The Internet-Draft draft-soilandreyes-arcp [10] is under consideration by IETF’s Applications and Real-Time Area to progress towards Informational RFC status.
References
[1] J.A. Kunze, J. Littman, L. Madden, J. Scancella, C. Adams (2018): The BagIt File Packaging Format (V1.0), Internet Engineering Task Force. https://datatracker.ietf.org/doc/html/draft-kunze-bagit-16
[2] Research Data Repository Interoperability WG (2018): Research Data Repository Interoperability WG Final Recommendations, Research Data Alliance. https://doi.org/10.15497/RDA00025
[3] F.T. Bergmann, R. Adams, S. Moodie, J. Cooper, M. Glont, M. Golebiewski, et al. (2014): COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project, BMC Bioinformatics 15:369. https://doi.org/10.1186/s12859-014-0369-z
[4] Space Physics Data Facility (2016): CDF Internal Format Description, 3.6, NASA / Goddard Space Flight Center. https://spdf.gsfc.nasa.gov/pub/software/cdf/doc/cdf364/cdf36ifd.pdf
[5] The HDF Group (2016): HDF5 File Format Specification Version 3.0, The HDF Group. https://support.hdfgroup.org/HDF5/doc/H5.format.html
[6] S. Soiland-Reyes, M. Gamble, R. Haines (2014): Research Object Bundle 1.0, researchobject.org Recommendation, Zenodo. https://w3id.org/bundle/2014-11-05/ https://doi.org/10.5281/zenodo.12586
[7] System Applications Working Group (2015): The app: URL Scheme, W3C Working Group Note 23 July 2015, World Wide Web Consortium. https://www.w3.org/TR/2015/NOTE-app-uri-20150723/
[8] M. Cáceres, K.R. Christiansen, M. Lamouri, A. Kostiainen, R. Dolin, M. Giuca (eds.) (2018): Web App Manifest, W3C Working Draft 04 July 2018, World Wide Web Consortium. https://www.w3.org/TR/2018/WD-appmanifest-20180704/
[9] A. Russell, J. Song, J. Archibald, M. Kruisselbrink (2017): Service Workers 1, W3C Working Draft 2 November 2017, World Wide Web Consortium. https://www.w3.org/TR/2017/WD-service-workers-1-20171102/
[10] S. Soiland-Reyes, M. Cáceres (2018): The Archive and Package (arcp) URI scheme Internet-Draft draft-soilandreyes-arcp, Internet Engineering Task Force. https://tools.ietf.org/html/draft-soilandreyes-arcp-03
[11] M. Kerwin (2017): The "file" URI scheme, RFC Editor. RFC 8089 https://doi.org/10.17487/RFC8089
[12] T. Berners-Lee, R. Fielding, L. Masinter (2005): Uniform Resource Identifier (URI): Generic Syntax, RFC Editor. RFC 3986 https://doi.org/10.17487/rfc3986
[13] P. Leach, M. Mealling, R. Salz (2005): A universally unique identifier (UUID) URN namespace, RFC Editor. RFC 4122 https://doi.org/10.17487/rfc4122
[14] S. Farrell, D. Kutscher, C. Dannewitz, B. Ohlman, A. Keranen, P. Hallam-Baker (2013): Naming Things with Hashes, RFC Editor. RFC 6920 https://doi.org/10.17487/rfc6920
[15] M. Nottingham, E. Hammer-Lahav (2010): Defining Well-Known Uniform Resource Identifiers (URIs), RFC Editor. RFC 5785 https://doi.org/10.17487/rfc5785
[16] C. Lynch, S. Parastatidis, N. Jacobs, H. Van de Sompel, C. Lagoze (2007): The OAI-ORE effort: Progress, challenges, synergies, Proceedings of the 2007 Conference on Digital Libraries - JCDL ’07. https://doi.org/10.1145/1255175.1255190
[17] N. Ferro, G. Silvello (2013): Modeling Archives by Means of OAI-ORE, IRCDL 2012: Digital Libraries and Archives, pp. 216–227. https://doi.org/10.1007/978-3-642-35834-0_22
[18] Shaopeng He, Jianhui Li, Zhihong Shen (2013): F2R: Publishing file systems as Linked Data, 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 767–772. https://doi.org/10.1109/FSKD.2013.6816297
[19] Ansgar Bernardi, Gunnar Aastrand Grimnes, Tudor Groza, Simon Scerri (2011): The NEPOMUK Semantic Desktop, Context and Semantics for Knowledge Management, pp. 255–273. https://doi.org/10.1007/978-3-642-19510-5_13
[20] P. Ciccarese, S. Soiland-Reyes, K. Belhajjame, A.J. Gray, C. Goble, T. Clark (2013): PAV ontology: provenance, authoring and versioning Journal of Biomedical Semantics 4:37. https://doi.org/10.1186/2041-1480-4-37
[21] James Pritchett, Markus Gylling (eds): EPUB Open Container Format (OCF) 3.1. W3C Member Submission 25 jan 2017. World Wide Web Consortium. https://www.w3.org/Submission/2017/SUBM-epub-ocf-20170125/
[22] EPUB Canonical Fragment Identifiers 1.1, Recommended Specification 5 January 2017. International Digital Publishing Forum. http://www.idpf.org/epub/linking/cfi/epub-cfi-20170105.html
[23] S. Soiland-Reyes (2018): stain/arcp-py: arcp 0.2.0, Zenodo software http://arcp.readthedocs.io/en/0.2.0/ https://doi.org/10.5281/zenodo.1165986
[24] F.Z. Khan, S. Soiland-Reyes, M.R. Crusoe, A. Lonie, R. Sinnott (2018): CWLProv - Interoperable Retrospective Provenance capture and its challenges, Zenodo preprint. https://doi.org/10.5281/zenodo.1215611
Acknowledgements
This work has been done as part of the BioExcel CoE, a project funded by the European Union contract H2020-EINFRA-2015-1-675728.