Web standards for describing datasets and profiles

6 minute read

Published: February 14, 2019

This blog post was published on the Software Sustainability Institute’s website.

Image of Database Land, by chrisdlugosz.

Supported by the UK Software Sustainability Institute Fellowship, I attended the fourth Face-to-Face (F2F) meeting of the W3C Dataset eXchange Working Group (DXWG) in Lyon, France, at the end of October 2018. Here, I report on how W3C WGs work and the specific outputs we’ve been producing at DXWG.

W3C stands for World Wide Web Consortium, a membership-based organisation that develops web standards through an international community. Web standards are documents describing web technologies (for example HTML, CSS and XML) that have met the criteria of consensus, fairness, public accountability and quality. W3C follows a process designed to bring a document to the status where it can be recommended for use, and this work is done through Working Groups. Web standards are fundamental to achieving interoperability.

Every year, W3C members come together over a week for the Technical Plenary and Advisory Committee (TPAC) meeting. This meeting brings together all the Technical Groups, the Advisory Board, the Technical Architecture Group and the Advisory Committee, as well as Working Groups and Community Groups.

As a member of the W3C Dataset eXchange Working Group, I went to the F2F meeting, where we continued our work on the outcomes set out in the group’s charter:

A revised version of the Data Catalog vocabulary (DCAT): The current DCAT Recommendation provides a vocabulary to publish datasets and data catalogs on the web. The working group is working on an update and expansion of the vocabulary, considering multiple new use cases and requirements.
Guidance on how to publish profiles of other standards: This document is meant to include a definition of what is meant by profiles and an explanation of one or more methods for publishing and sharing them.
Content negotiation by Application Profile: An explanation of how to implement the expected Internet Engineering Task Force (IETF) Internet-Draft and suitable fallback mechanisms as discussed at the Smart Descriptions & Smarter Vocabularies (SDSVoc) workshop.

Image of AGB at TPAC

To produce this work, we meet weekly in teleconferences, communicate via a mailing list and collaborate through GitHub, where we keep all of the output documents (see the repository: https://github.com/w3c/dxwg). We also meet in person from time to time, as focusing on specific issues in intensive two-day meetings is a great way of progressing the work. As not everyone is able to attend in person, a remote connection is always available and thorough minutes are taken via an Internet Relay Chat (IRC) system. The e-mails and minutes are archived and made public so that anyone can follow the work of the Working Group, and public feedback is sought and welcome at all times.

The whole W3C process is extensively documented, and it involves developing a series of technical reports until a document reaches Recommendation status. Working Groups publish a First Public Working Draft for each document, then zero or more revised Working Drafts, a Candidate Recommendation, a Proposed Recommendation and finally a W3C Recommendation (which can later be edited or amended, if needed). You can see all the W3C standards and drafts on the W3C website.

The first step to produce these documents is to gather use cases and derive requirements from them. So, in DXWG, we first produced the Use Cases and Requirements document. The current status of the other documents is as follows.

Revised DCAT. Up until now, we have produced two working drafts of the revised version of DCAT. The latest working draft is available on the W3C website. The latest additions are always in the Editors’ Draft. As I mentioned before, the original DCAT vocabulary provided terms to publish datasets and data catalogues on the web. For this revised version, we have been addressing new requirements around cataloguing data services and data distribution services, recording the provenance of resources, adding recommendations for catalogues whose datasets are loose collections of files rather than being explicitly split between a dataset and its different representations in different formats, giving guidance on how to describe the quality of a dataset, supporting data citation, and many other relevant aspects of cataloguing resources.
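To make the vocabulary more concrete, here is a minimal sketch in Python of a dataset description that uses DCAT terms and is serialised as JSON-LD. The dataset, its title and all example.org URLs are invented for illustration, as is the helper name describe_dataset; only the vocabulary terms (dcat:Dataset, dcat:distribution, dcat:Distribution, dcat:downloadURL, dcat:mediaType and dct:title) come from DCAT and Dublin Core. Giving the media type as a plain string is a simplification.

```python
import json

# Namespace prefixes for DCAT and Dublin Core Terms.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_dataset(dataset_uri, title, distributions):
    """Build a JSON-LD description of one dataset and its distributions.

    `distributions` is a list of (download_url, media_type) pairs. Each pair
    becomes a dcat:Distribution, i.e. one representation of the same dataset.
    """
    return {
        "@context": CONTEXT,
        "@id": dataset_uri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": url},
                # Simplification: a plain string instead of a media type URI.
                "dcat:mediaType": media_type,
            }
            for url, media_type in distributions
        ],
    }


# Hypothetical dataset that is available both as CSV and as JSON.
example = describe_dataset(
    "https://example.org/dataset/rainfall-2018",
    "Rainfall measurements 2018",
    [
        ("https://example.org/files/rainfall-2018.csv", "text/csv"),
        ("https://example.org/files/rainfall-2018.json", "application/json"),
    ],
)

print(json.dumps(example, indent=2))
```

The point of the sketch is the split between the dataset itself and its distributions: one dcat:Dataset entry in a catalogue can point to several files in different formats.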

The Profiles Ontology. A profile is a set of constraints over one or more standards of information resources. Thus, a profile may extend a standard by identifying new entities, or may add more restrictions, options, parameters or semantic interpretations to a given standard. In the context of data catalogues, there have been several extensions or profiles of the DCAT vocabulary, but up until now there was no way to formalise profiles and to relate them to the standards they are based on, or to other profiles. In DXWG, we have produced a first public working draft of an ontology for describing profiles, including references to a test suite and to an implementation report. The latest version of the document is the Editors’ Draft.
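As an illustration of what relating a profile to its base standard could look like, the following Python sketch writes out a small Turtle statement. It assumes the ontology namespace http://www.w3.org/ns/dx/prof/ and the terms prof:Profile and prof:isProfileOf; term names may differ between drafts, so please check them against the Editors’ Draft. The profile URI and the helper name profile_to_turtle are hypothetical.

```python
# Assumed namespace of the Profiles Ontology; verify against the current draft.
PROF_NS = "http://www.w3.org/ns/dx/prof/"


def profile_to_turtle(profile_uri, base_uris):
    """Return Turtle stating that `profile_uri` is a profile of each base URI."""
    if not base_uris:
        raise ValueError("a profile must refer to at least one base specification")
    # Several base specifications become a comma-separated object list.
    targets = " ,\n        ".join(f"<{uri}>" for uri in base_uris)
    lines = [
        f"@prefix prof: <{PROF_NS}> .",
        "",
        f"<{profile_uri}> a prof:Profile ;",
        f"    prof:isProfileOf {targets} .",
    ]
    return "\n".join(lines)


# Hypothetical profile of DCAT, identified by an invented example.org URI.
turtle = profile_to_turtle(
    "https://example.org/profiles/my-dcat-profile",
    ["https://www.w3.org/TR/vocab-dcat/"],
)

print(turtle)
```

Because a profile can itself be the base of another profile, the same statement pattern can be chained to describe a hierarchy of profiles.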

Content Negotiation by Profile. Content negotiation is a mechanism by which a server may hold different representations of a resource (e.g. different versions of the same document) at the same location (or URI), and users or agents can specify which representation to retrieve. However, up until now, there was no way for a client to negotiate for content based on what standard it complies with or what profile it conforms to. DXWG produced the first public working draft of a specification for negotiating content by profile. As with the other documents, the latest specification is always the Editors’ Draft.
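The server side of such a negotiation can be sketched in a few lines of Python. This is not the specification’s algorithm, only an illustration under stated assumptions: the client sends an Accept-Profile request header listing profile URIs in angle brackets with optional q weights (the header name and syntax have varied between drafts), URIs contain no commas or semicolons, and all profile URIs, file names and function names below are hypothetical.

```python
def parse_accept_profile(header):
    """Parse e.g. '<urn:a>;q=0.8, <urn:b>' into (uri, q) pairs, best first."""
    preferences = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        uri_part, _, params = item.partition(";")
        uri = uri_part.strip().strip("<>")
        q = 1.0  # default weight when no q parameter is given
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unreadable weight as "not acceptable"
        preferences.append((uri, q))
    # sorted() is stable, so equal weights keep the order the client used.
    return sorted(preferences, key=lambda pref: pref[1], reverse=True)


def choose_representation(header, available, default_profile):
    """Pick the representation for the most preferred profile the server has."""
    for uri, q in parse_accept_profile(header):
        if q > 0 and uri in available:
            return uri, available[uri]
    # Fall back to the default when nothing the client asked for is available.
    return default_profile, available[default_profile]


# Hypothetical server with two representations of the same resource.
AVAILABLE = {
    "https://example.org/profiles/full": "dataset-full.ttl",
    "https://example.org/profiles/summary": "dataset-summary.ttl",
}

REQUEST_HEADER = (
    "<https://example.org/profiles/other>, "
    "<https://example.org/profiles/summary>;q=0.9, "
    "<https://example.org/profiles/full>;q=0.5"
)

profile, filename = choose_representation(
    REQUEST_HEADER, AVAILABLE, default_profile="https://example.org/profiles/full"
)
print(profile, filename)
```

Here the client’s first choice is unknown to the server, so the summary profile is selected as the best match that is actually available.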

If you have comments about these documents, we very much welcome them. You can send feedback via GitHub issues and/or by email at public-dxwg-comments@w3.org. We are especially looking for feedback on the direction we have taken in the different documents and on what we should prioritise next (you can check the list of open issues we have in GitHub to see the project roadmap). There are many DXWG participants who worked on these specifications and we look forward to hearing your views about them.

Through these specifications, we hope to enable high standards for describing datasets and profiles, which will contribute to data sharing and research reproducibility.


Best Practices for Software Registries and Repositories

4 minute read

Published: August 04, 2021

(This post is cross-posted on the SciCodes website, the SSI blog, the ASCL blog, the FORCE11 blog, and the Better Scientific Software (BSSW) website.)

Evidence for the importance of research software

9 minute read

Published: June 08, 2020

(This post is cross-posted on the URSSI blog, the SSI blog and the Netherlands eScience Center blog, and is archived in Zenodo.)

The Research Software Alliance (ReSA) and the Community Landscape

5 minute read

Published: March 11, 2020

(This post is cross-posted on the UK Software Sustainability Institute blog, the Netherlands eScience Center blog and the US Research Software Sustainability Institute blog.)

ReSA’s mission is to bring research software communities together to collaborate on the advancement of research software. Its vision is to have research software recognized and valued as a fundamental and vital component of research worldwide.

Given our mission, there are multiple reasons why it’s important for us to understand the landscape of communities that are involved with software, in aspects such as preservation, citation, career paths, productivity, and sustainability. One of these reasons is that ReSA seeks to be a link between these communities, which requires identifying and understanding them. We want to be sure that there aren’t significant community organizations that we don’t know about and should involve in our work. Also, identifying where there are gaps will help us create opportunities and communities of practice as required.

When thinking about these communities, it’s clear that in addition to those that focus on software, there are others for which software is just a small part of their interest. Some examples are communities that focus on open science, reproducibility, roles and careers for people who are less visible in research, publishing and review, and other types of scholarly products and digital objects. ReSA also wants to define how we fit into and interact with that broader scholarly landscape.

How was this work undertaken?

In September 2019, a ReSA taskforce, consisting of the authors of this blog post, came together to map the software community landscape. This group distributed a survey to ReSA Google group members to identify other groups interested in software. Other useful sources included:

Netherlands eScience Center: Awesome-research-software-registries by Jurriaan Spaaks
eResearch-meeting-list by James Hetherington
International RSE groups by the Research Software Engineering (RSE) Association
Open Science Grassroots Community Networks, a consortium of 120 networks
In which journals should I publish my software? by Neil Chue Hong

The taskforce then met to consider the results and how to analyze them. The ReSA list of research software communities is now publicly available as a living community resource, with the version of this list used by the ReSA taskforce in February 2020 and a copy of this post archived in Zenodo. Suggested additions or corrections are welcome and can be made by commenting on the list. Some of the issues we’ve had in assembling this list are:

How much interest in software does an organization need to have to be listed?
When is an organization sufficiently research focused to be included?
What momentum/scale does an organization need to have so that we consider it relevant in the global picture?

On the other hand, once we started adding entries to the list, for many we found that we immediately thought of other similar organizations that should be added. For example, some organizations have a geographic aspect, and this led us to think of other similar organizations with different geographic aspects, such as all the national and regional RSE associations.

What did we learn?

The analysis produced a range of interesting outcomes:

There are many, many communities that support research software, emphasizing the need for a coordinating organization such as ReSA. The importance of community development is captured in articles such as Community Organizations: Changing the Culture in Which Research Software is Developed and Sustained by Daniel S. Katz et al., which provides an overview of key groups and discusses opportunities to leverage their synergistic activities.
There is an increasing (and wide) range of community initiatives. For example, the Open Science Grassroots Community Networks list has evolved into the Community of Open Scholarship Grassroots Networks (COSGN), whose networks communicate and coordinate on topics of common interest. COSGN has submitted an NSF proposal to formalize governance and coordination of the networks to maximize impact and establish standard practices for sustainability.
The increasing focus on open software makes it hard to separate research and non-research initiatives. As per the points above, it is very hard to define which initiatives are part of the research software community, and which aren’t.
Some organizations that were originally data-centric now include a software focus. For example, the Research Data Alliance now includes the Software Source Code Interest Group, which provides a forum to discuss issues on management, sharing, discovery, archiving, and provenance of software source code.

What are the next steps?

We invite readers to continue to add to or correct the ReSA list of research software communities by making comments in the list, which will continue to be curated by ReSA. We are also interested to hear from community members who would like to engage with us in writing a landscape paper based on further analysis and work. This could address questions such as: what are the axes that create the space, where do the currently known organizations fit in the space, and are there gaps where no organization is currently working? We also invite readers to consider involvement in other ReSA activities, including Taskforces.

Conclusion

The ever-growing number of constituents of the research software community both reflects and demonstrates the increasing recognition of research software. The research software community is now a complex ecosystem composed of a wide variety of organizations and initiatives, some of which are community networks themselves. Collaboration and coordination across these initiatives are important, to enable the broader community to work together to achieve bigger goals. ReSA aims to coordinate across these efforts to leverage investments, to achieve the shared long-term goal of research software being valued as a fundamental and vital component of research worldwide. Join the ReSA Google group to stay up-to-date on our activities.

Interact to Interoperate

less than 1 minute read

Published: December 06, 2018

This blog post was published on the Software Sustainability Institute’s website.

Alejandra Gonzalez-Beltran