Proposal for 3.2 - Extra metadata with API feeds

I’d like to submit this proposal for consideration by the committee. This would be included in version 3.2 of HSDS.

Extra metadata with API feeds

For background on the discussions leading to this proposal see the original forum post

This looks good to me. :pray:

We’ve asked the committee to vote on this using a Google Form. @skyleryoung made an interesting comment, which I want to share here for responses.

I’m voting yes, but I do have questions about implementation details. This augments the API, but not the specification, is that correct? In other words, how would this get expressed in the tabular version of the schema definition? I’m not sure maintaining that is the highest priority, but if we’re diverging the two versions of the specification, we should acknowledge and discuss it.

Here are my initial reckons; very happy to discuss further.

This augments the API, but not the specification, is that correct?

Yes, that’s correct.

In other words, how would this get expressed in the tabular version of the schema definition?

Based on the current proposal, it wouldn’t. If we wanted to, here are some initial thoughts.

We have the metadata object already, but that is currently for record level information. Maybe it could be expanded and adapted for this purpose.

OCDS has a release package schema which is the most direct tabular comparison I can think of to this. Introducing something like this would be a major overhaul.

If we did either of the above we’d probably only want to include publisher information, not developer? Something to think about.

if we’re diverging the two versions of the specification we should acknowledge and discuss it

We already have some data that is returned by GET / that, as far as I know, isn’t included in the tabular version.

I think this is still important to consider though.

I think clarity could be improved by providing some definitions.

First a definition of terms:

  1. What do we mean by “publisher”?
  2. What do we mean by “developer”?
  3. What do we mean by “provenance”?

Second, where exactly in the JSON schema will these fields be located? Are they sent once at the top-level for each API? Or are they sent with each record?

I acknowledge that I’m more accustomed to thinking from a specification standpoint as opposed to an API standpoint, so that may be where I got lost.

Thanks @skyleryoung, here are the definitions. I can incorporate these into the proposal doc if needed.

| term | definition |
| --- | --- |
| publisher | The organisation responsible for publishing the entire data set/API. This could be an organisation that is publishing data about its own services, or it could be an organisation combining data from various sources. |
| developer | The organisation developing the API. This could be the same as the publisher if they have the in-house capacity to do this, or it could be an organisation that has been contracted by the publisher. |
| provenance | In this case, refers to the overall responsibility for a whole data set, rather than the original source of each record, which could vary across records. We aren’t providing information about how the data has potentially been collected/transformed to get to the current state. |

As @MikeThacker has found an alternative way to feed this info into the UK dashboard I would be interested to hear thoughts about whether the developer information is still useful to include.

It could be (e.g. if someone noticed common issues across APIs made by a particular developer) but perhaps the publisher info is the most important thing.

Where exactly in the JSON schema will these fields be located? Are they sent once at the top-level for each API? Or are they sent with each record?

They would be sent once, at the top level of GET /, alongside the fields version, profile and openapi_url.

I don’t love the term developer here, but I don’t have a suggested alternative yet.

I mostly just want to chime in to suggest steward as the designated holder of responsibility for a given record or set of records.

I mostly just want to chime in to suggest steward as the designated holder of responsibility for a given record or set of records.

Would this be in regard to publisher/source metadata at a record level, or as an alternative to publisher here?

Yeah, I mean these should be subject to further consideration, but I think there’s the publisher of the aggregated dataset, the steward of individual records, and then there should be a term for a representative of the organization that the record is about (maybe that’s source, though in other contexts I use registrant or representative).

Thanks @bloom I’m going to copy your comments over to the metadata at a record level thread talking about future developments where we can discuss more.

The overlapping responsibilities can be difficult to understand, let alone track. I look at the roles and responsibilities in the States something like this:

I’m not sure how accurate that is, and would welcome criticism.

Also, what would a swim-lane of roles and responsibilities for directory data pipelines look like in the UK?

Edit: It took me about 60 seconds after posting to decide I need to recreate the graphic above, so yes, it’s a somewhat fluid set of relationships even in my own head. Still, I would say that’s how I’ve come to see the space I work in over time.

This is a nice start. Can you clarify the difference in your mind between Assurer and Steward?

In my mind, Assurance is the responsibility of the Steward. The steward assures.

In what contexts might assurance be performed by a non-steward? If a steward doesn’t necessarily assure, what does stewardship necessarily entail?

Hey Greg, I can think of two approaches that differentiate between assurer and steward…

A- Here in Whatcom we have two partners through the collaborative (both from the County Library system) who help us re-verify, i.e. assure, records on a monthly basis. From a broad perspective, my role is as Data Steward and I supervise those assurers, but they are separate from the steward.

B- From the perspective of data federation, the assurer is whoever verified the record most recently, whereas the steward is whoever decides to include the record in their database. Those are most certainly different roles in a fully fleshed out system. In an ideal world, for example, WA211 is a data steward, whereas WRIC is an assurer that feeds data to their system.

Also Skyler, I dig that chart.

Thanks, @dreww.

I see what you mean in A. In other words, a steward might be accountable for the accuracy of a record (i.e. the bottom-liner) but the assurer may hold the responsibility to actually do the verification (i.e. the implementation of the task). So the assurer is a subsidiary role under the steward.

For B, I’m a little less sure, but I think it would make sense if we essentially consider these roles to be fractal, i.e. in the B scenario, the 211 is like a ‘super steward’ in that they are ultimately accountable for the quality of all data, even though the responsibility for data quality is held by WRIC, which in turn might assign out assurance tasks to the County Library staff.

Is that right?

Hmm, with B I’d more typify it as “super assurer” than the “super steward”. From a federative pov, orgs can have multiple roles, so WRIC may be an assurer for WA211 or WithinReach while being a steward from the perspective of a local service providing org. You’re right that it’s fractal (or hierarchical) tho.

As in Skyler’s chart, I see the role of Steward being “Yes, I want x resource included in the database” whereas the role of Assurer is “Yes, x resource is up to date”. Those roles can be done by the same person/org, or they can be separated.

Thanks all, this is an interesting discussion. Tagging @MikeThacker as it would be interesting to hear his thoughts on how this translates to the UK context.

This has been a great thread to read and I love seeing the discussion of how data publishing involves a number of different roles, and that aggregated feeds exist.

I think we should still include a minimal publisher object as part of the response to GET /, and that more nuanced information can be contained in something like a Data Guide document which a publisher produces and links to.

Publisher object in the API

This should contain some minimal contact and organisation information. We can keep this minimal while re-using components of the existing schema:

{
  "publisher": {
    "name": "Open Data Services",
    "identifier": {
      "id": "e00e811a-16a4-11f0-b8c5-3f95a3bad144",
      "identifier_scheme": "GB-COH",
      "identifier_type": "UK Companies House",
      "identifier": "09506232"
    },
    "contact": {
      "id": "6faa9b06-16a5-11f0-af81-4be513178e24",
      "email": "matt.marshall@opendataservices.coop"
    },
    "url": "https://opendataservices.coop"
  }
}

I think that the only things that are really out of place here are the UUIDs, which are required by the schema. They’re not really relevant here as they’re not being linked to services at all. If we’re bothered by them, we can drop the re-use of the existing schema and re-implement a minimal version of these components for the publisher object:

{
  "publisher": {
    "name": "Open Data Services",
    "identifier": {
      "scheme": "GB-COH",
      "id": "09506232"
    },
    "contact": {
      "email": "matt.marshall@opendataservices.coop"
    },
    "url": "https://opendataservices.coop"
  }
}
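To make the difference between the two shapes concrete, here is a rough Python sketch that reduces the schema-style publisher object to the minimal one. The helper name `to_minimal_publisher` is hypothetical and not part of HSDS; the field names are simply the ones used in the two examples above, and dropping `identifier_type` along with the UUIDs is an assumption based on the minimal example.

```python
def to_minimal_publisher(full: dict) -> dict:
    """Reduce a schema-style publisher object to the minimal form.

    Hypothetical helper: it assumes the field names shown in the two
    JSON examples and is not defined anywhere in HSDS itself.
    """
    identifier = full.get("identifier", {})
    contact = full.get("contact", {})
    return {
        "name": full["name"],
        "identifier": {
            # identifier_scheme/identifier become scheme/id; UUIDs are dropped
            "scheme": identifier.get("identifier_scheme"),
            "id": identifier.get("identifier"),
        },
        "contact": {"email": contact.get("email")},
        "url": full.get("url"),
    }


full_publisher = {
    "name": "Open Data Services",
    "identifier": {
        "id": "e00e811a-16a4-11f0-b8c5-3f95a3bad144",
        "identifier_scheme": "GB-COH",
        "identifier_type": "UK Companies House",
        "identifier": "09506232",
    },
    "contact": {
        "id": "6faa9b06-16a5-11f0-af81-4be513178e24",
        "email": "matt.marshall@opendataservices.coop",
    },
    "url": "https://opendataservices.coop",
}

minimal_publisher = to_minimal_publisher(full_publisher)
print(minimal_publisher)
```

The only information lost in the conversion is the two UUIDs and the human-readable identifier type, which is the trade-off being weighed here.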

I think this is good enough as a “first port of call” for someone looking to get in touch about a feed. The publisher can then redirect the request to a Developer, Steward, etc. In the case of aggregate feeds, I think it’s also reasonable to state that there is an organisation who is responsible for publishing the feed.

Data Guide

A data guide is a human-readable guide to the dataset, usually published as a web page. This document serves to inform data users about nuances of the dataset, its scope and limitations etc. Essentially “how this feed came to be”.

Things that a data guide might include:

  • Who is responsible, and in what way (Developers, Publishers, Partners, Data Sources, etc.), and contact points for them
  • How the data is collected and prepared for publication
  • License information for data-reuse (pertinent for open data, maybe not for every HSDS publisher)

Open Contracting recommend data guides to their OCDS publishers, and a fair few of them actually have them. Open Contracting call it a Publication Policy though, which imho is a silly name and we should stick with something a bit more obvious like Data Guide. They also provide a template for people to fill out to make it easier to write:

(Full transparency, I wrote this template back in 2018 and would probably do it differently now)

For example in an OCDS data guide the publisher would say something like “Procurements over the value of 100,000 USD are subject to $local_transparency_law. This means that procurements under the value of 100,000 USD are not included in this dataset”.

For HSDS data, our concerns are different. It might be that addresses are excluded for particular types of service, or that this feed is an aggregate feed of other feeds. Data sources would be listed alongside any additional processing done.

Putting it together

If we did it this way, we might end up with something like this as a result for GET /:

{
  "version": "3.1",
  "profile": "https://github.com/OpenReferralUK/uk-profile",
  "openapi_url": "https://raw.githubusercontent.com/OpenReferralUK/uk-profile/refs/heads/main/schema/openapi.json",
  "data_guide": "https://southtyneside.gov.uk/hsds/data-guide.html",
  "publisher": {
    "name": "Open Data Services",
    "identifier": {
      "scheme": "GB-COH",
      "id": "09506232"
    },
    "contact": {
      "email": "matt.marshall@opendataservices.coop"
    },
    "url": "https://opendataservices.coop"
  }
}
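As a rough illustration of how a data user might consume this, the Python sketch below checks a GET / response body for the fields shown in the example. The function name `check_root_response` is hypothetical. Treating `version`, `profile` and `openapi_url` as mandatory, and expecting a publisher name and contact email, are assumptions made for the sketch; the proposal has not settled which of the new fields would be required.

```python
import json

# Assumption for this sketch: these existing top-level fields are always present.
EXPECTED_TOP_LEVEL = ("version", "profile", "openapi_url")


def check_root_response(body: str) -> list[str]:
    """Return a list of human-readable problems found in a GET / body."""
    data = json.loads(body)
    problems = [
        f"missing field: {name}" for name in EXPECTED_TOP_LEVEL if name not in data
    ]
    publisher = data.get("publisher")
    if publisher is None:
        problems.append("missing field: publisher")
    else:
        if not publisher.get("name"):
            problems.append("publisher.name is empty")
        if not publisher.get("contact", {}).get("email"):
            problems.append("publisher.contact.email is empty")
    return problems


sample_body = json.dumps(
    {
        "version": "3.1",
        "profile": "https://github.com/OpenReferralUK/uk-profile",
        "openapi_url": "https://raw.githubusercontent.com/OpenReferralUK/uk-profile/refs/heads/main/schema/openapi.json",
        "data_guide": "https://southtyneside.gov.uk/hsds/data-guide.html",
        "publisher": {
            "name": "Open Data Services",
            "identifier": {"scheme": "GB-COH", "id": "09506232"},
            "contact": {"email": "matt.marshall@opendataservices.coop"},
            "url": "https://opendataservices.coop",
        },
    }
)

problems_found = check_root_response(sample_body)
print(problems_found)
```

Because the publisher information sits once at the top level, a client only needs this single check per feed rather than one per record.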

To be super clear, I think we should be discussing the issues separately:

  1. Do we want a minimum object to create a space for some publisher/contact information in the API response?
  2. Do we want to provide a way to capture more nuanced information about the dataset, and if so is a Data Guide the best way to do it?

I love the idea of a Data Guide template.

So this would be an optional / best practice complement to the spec itself? Where would it live? (When we were using datapackages, I could see easily how it fits into the package. But if a data publisher has an endpoint, is the data guide something they just share in the documentation of the endpoint? If I get a JSON file, does the publisher send me the data guide md file along with it or something?)

So this would be an optional / best practice complement to the spec itself?

Yes, it would be a document that (ideally) each publisher would link to in their dataset. Our job at OR would be to provide guidance on how to produce such a document effectively.

Where would it live?

Each publisher’s data guide would live behind a publicly accessible URL which is linked in a field in the API response, e.g. data_guide. Ideally these would exist as an HTML web page, but could also be a PDF or other document format.

The guidance/template we produce to assist publishers producing these could live in the guidance section of the HSDS docs. The template could exist as a Google Doc, markdown file, etc. which we can link off to on said guidance.

If I get a JSON file, does the publisher send me the data guide md file along with it or something?

It’s linked as part of the API response. I think the best place for it is as a single URI-formatted field as part of the GET / endpoint, since we’re using that to describe aspects of the API feed anyway i.e. which version/profile of HSDS the feed is using etc.
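A minimal sketch of what "URI-formatted" could mean in practice for a consumer, using Python's standard `urllib.parse`. The helper name `is_absolute_http_url` is hypothetical, and restricting the field to absolute http(s) URLs is an assumption for this sketch; the proposal only says the data guide should be publicly accessible.

```python
from urllib.parse import urlparse


def is_absolute_http_url(value: str) -> bool:
    """True if value has an http(s) scheme and a host part.

    This is a shallow syntactic check only; it does not fetch the URL.
    """
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


data_guide = "https://southtyneside.gov.uk/hsds/data-guide.html"
data_guide_ok = is_absolute_http_url(data_guide)
relative_ok = is_absolute_http_url("data-guide.html")
print(data_guide_ok, relative_ok)
```

A relative path such as `data-guide.html` fails the check, which matters because the link has to be usable by someone who only has the JSON response in hand.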

Sounds great. Let me know what you suggest for outlining a template, producing a sample… if we can pull something together quickly, we could review and get feedback at the next Committee meeting.