Extra metadata with API feeds

MikeThacker · October 30, 2024, 11:55am

I’d like to propose extra optional fields to be returned by the root/stub API endpoint.

Specifically to allow the Dashboard (new version is under development) to be documented entirely from a feed under the publisher’s control, we could add:

Organization name
Organization URL
Developer name
Developer URL
Summary text

There may be extra data (like administrator email address) which should not be in an open feed and should still be communicated manually.
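
As a rough illustration of the proposal, the root endpoint response might carry the extra fields like this. The key names (`organization_name`, `organization_url`, `developer_name`, `developer_url`, `summary`) and all values are assumptions for the sake of the sketch, not agreed names; `version` and `profile` stand in for what the endpoint already returns.

```python
import json

# Hypothetical body for GET / ; key names and values are illustrative only.
root_response = {
    "version": "3.0",
    "profile": "https://example.org/profile",
    # Proposed optional additions:
    "organization_name": "Example Council",
    "organization_url": "https://example.org",
    "developer_name": "Example Software Ltd",
    "developer_url": "https://software.example.org",
    "summary": "Services for adults in the Example area.",
}

OPTIONAL_METADATA = (
    "organization_name",
    "organization_url",
    "developer_name",
    "developer_url",
    "summary",
)

def describe_feed(response: dict) -> dict:
    """Return only the optional metadata fields that are present."""
    return {key: response[key] for key in OPTIONAL_METADATA if key in response}

print(json.dumps(describe_feed(root_response), indent=2))
```

Because every new field is optional, a consumer such as the Dashboard would need to cope with any of them being absent, which is why the helper filters rather than indexes directly.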

IlesdTPX · October 31, 2024, 4:05pm

Jumping in here for the first time to piggyback on this: I’d also like to propose that the version field returned from the same GET / endpoint be a constant with a value of HSDS-UK-3.0 or something similar.
To me it becomes a much more useful field if it can be relied upon to describe the version in a constant format, particularly going forward if each new version of the standard used a constant field here as well.

As for Mike’s original proposal, the extra fields would be useful for our current development and could be useful for others trying to consume this API as well.

MikeThacker · November 28, 2024, 12:34pm

As agreed at the last technical standing committee meeting, I’ve drafted MBT10038 - Use cases for extra metadata with API feeds.

I’ve opened the link to comments by anyone and will accept all sensible suggested changes, or raise them with the committee if controversial.

@klambacher I think you said you’d review and add to this. Feel free to change my text as you think fit.

kathrynods · December 11, 2024, 5:04pm

Thanks @MikeThacker for pointing me to this. Just noting that having this field as a constant wouldn’t work for minor or patch releases due to backwards compatibility requirements (e.g. data that conforms to 3.0 also conforms to 3.1).

However, we could still have a fixed list of acceptable versions or a pattern.
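
One way to read the "fixed list or pattern" idea is a regular expression that pins the standard name and major version while letting minor and patch numbers vary. The `HSDS-UK-` prefix comes from the suggestion earlier in the thread; the exact pattern is an assumption, not an agreed format.

```python
import re

# Assumed format: "HSDS-UK-<major>.<minor>" with an optional ".<patch>".
VERSION_PATTERN = re.compile(r"HSDS-UK-(\d+)\.(\d+)(?:\.(\d+))?")

def is_compatible(version: str, major: int = 3) -> bool:
    """True if the string matches the pattern and has the expected major version."""
    match = VERSION_PATTERN.fullmatch(version)
    return match is not None and int(match.group(1)) == major

for candidate in ("HSDS-UK-3.0", "HSDS-UK-3.1", "HSDS-UK-4.0", "3.0"):
    print(candidate, is_compatible(candidate))
```

A fixed list of accepted strings would work too; a pattern just avoids having to update consumers for each minor release.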

klambacher · December 18, 2024, 7:02pm

As we discussed, I wanted to chime in with some of the metadata that we have needed in the past and why; some is tied to the file / feed and some is per-record. I’ll break it down by who “needs” the extra information and why.

System / software level:
We have different protocols for import based on the source system, even with the same incoming format, and our external ID storage is actually system code + external ID (the system code identifies the software the data came from, though theoretically we could accept system codes that cover multiple sources running the same software). This allows distinct handling based on source (e.g. taxonomy or coverage area name transformations). For non-GUID identifier systems (not an issue with HSDS but an issue for some systems we manage) it also allows a unique ID to be created via system code + external ID, even where a record comes in with the same ID but has been forked / duplicated for various reasons. This is also distinct from what we would call the source database name/URL, since we can have the same software or version but multiple sources / feeds.
Funder / attribution:
In some cases we have data from multiple data managing organizations or multiple public source sites coming out of the same software system in the same file / feed. This means attribution at the system level is not sufficient, and needs to be available per-record, for funder verification of contributions + data quality analysis, and public record attribution in websites etc.
Ability to request changes and report issues:

Need both an overall data source / provenance AND where possible the method to use to request corrections to specific data; this is key for trustworthiness and reliability for users.

In sum…

At the file/feed level we would have:

Source system code (consistent system-internal identifier for the source software system, which is used for unique ID formation + specialized import handling)
Source system name (“user-friendly” database source name in all applicable languages for display purposes)
System URL (all applicable languages, this is to the website of the data owners / public database source NOT the software vendor)
Source system version
Schema version

At the record level we would have:

Record owner agency information (we have a unique code + name for attribution purposes and funder use)
URL for submission of changes to the record (to account for the public and/or other end user wanting to suggest possible corrections to the data when they do not have direct editorial permissions)
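
To make the "system code + external ID" idea concrete, here is a minimal sketch of how a consumer might build unique keys. The field names (`source_system_code`, `schema_version`) and the sample values are made up for illustration and are not agreed HSDS fields.

```python
# Hypothetical feed-level metadata; key names are illustrative, not agreed fields.
feed_a = {"source_system_code": "sys-a", "schema_version": "3.0"}
feed_b = {"source_system_code": "sys-b", "schema_version": "3.0"}

def unique_key(feed: dict, external_id: str) -> tuple:
    """Combine the feed's system code with a record's external ID."""
    return (feed["source_system_code"], external_id)

# The same external ID arriving from two systems stays distinct.
index = {
    unique_key(feed_a, "1001"): {"name": "Food bank"},
    unique_key(feed_b, "1001"): {"name": "Housing advice"},
}

print(len(index))
```

The system code can also be used to select source-specific import handling, such as taxonomy or coverage area name transformations.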

MikeThacker · December 19, 2024, 1:29pm

Thanks Kate. I’ll leave others to assess how much these requirements apply to data publishers in general. Just a couple of comments:

We already have Schema version (as well as profile)
A URL (as well as possible email address) for submitting changes would be useful for the feed as a whole, even if some use record level data instead or use the record-level value as an override
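
The override behaviour described here could be sketched as a simple fallback. `change_request_url` is a hypothetical field name used at both levels; nothing in the thread fixes the name.

```python
from typing import Optional

def change_request_url(feed: dict, record: dict) -> Optional[str]:
    """Prefer the record-level URL; fall back to the feed-level one."""
    return record.get("change_request_url") or feed.get("change_request_url")

# Illustrative data only.
feed = {"change_request_url": "https://example.org/corrections"}
record_with_override = {"id": "r1", "change_request_url": "https://agency.example.org/edit/r1"}
record_without = {"id": "r2"}

print(change_request_url(feed, record_with_override))
print(change_request_url(feed, record_without))
```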

kathrynods · January 31, 2025, 10:53am

I’ve started writing this up in the proposal template combining Mike and Kate’s suggestions.

Kate, you’ve mentioned having record level/feed level metadata.

Mike - with your original proposal, if a data republisher made a new feed combining several feeds, would the “organization_name” for the mega feed be the name of the republisher? In which case, does this meet the user need regarding provenance of the data?

If we want to capture original source publishers then we might need to look at record level metadata for this aspect.

MikeThacker · January 31, 2025, 12:15pm

I would expect the publisher to be the organization combining the feeds, and so taking some responsibility for the resultant content.

For simplicity I’d suggest that the summary text could include details of the included feeds if the publisher wants to put that there.

In our verbal conversation, we also mentioned an optional contact email address for the publisher. This could be used to direct queries on content which the publisher could pass to the original publisher or just put the two parties in touch with one another.

I’d suggest waiting to see if there’s demand for that. Note that a republisher could be republishing another combined feed, so the ultimate solution would be something recursive!

kathrynods · February 4, 2025, 12:08pm

I have written this up in a proposal, Extra metadata with API feeds, which will be added to the agenda for the next committee meeting.

This doesn’t include:

record level metadata, which is something we may want to consider in the future
standardising/restricting the format of the schema field that is already present. This is something we could look into separately as a PATCH level proposal

kathrynods · February 20, 2025, 12:45pm

I discussed this proposal with @mrshll last week and I have spent some more time thinking about it based on his feedback. In response, I have made some small changes to the proposal.

I have spent more time researching how publisher metadata is presented in other data standards that ODSC has worked on.

BODS: there is a “publisher” object in each statement. It contains “name” and “URL.” One of these must be included.
OCDS: there is a “publisher” object in the release package schema. It contains “name” (required), “scheme”, “uid” and “uri”
360 Giving: there is a “publisher” object in the package schema. It contains “name” (required), “identifier”, “website” and “logo”

Matt also suggested considering whether we could reuse objects already in the standard to represent this data.

This might look like:

publisher - organisation object (id*, name*, description*, website, email)
developer - organisation object (id*, name*, description*, website, email)
feedback - contact object (id*, email*)
description - string
schema - string

* required fields of these objects

My initial feeling is that this would be more complicated than the original proposal and I am not sure whether fields like “id” add value. I would be interested to hear the committee’s thoughts on this.

The other standards I reviewed did not reuse objects used elsewhere in the standard for this purpose.

MikeThacker · February 21, 2025, 2:32pm

My initial feeling is that this would be more complicated than the original proposal and I am not sure whether fields like “id” add value.

I tend to agree.

For better or worse, we tend not to incorporate data structures from elsewhere in HSDS (except perhaps for iCal)
Referencing another organization record that is elsewhere in the data might be a purist’s approach but makes retrieving the details hard
adding required fields would remove backwards compatibility so new fields should at least be optional for now

I wouldn’t object to also having an optional publisher id field which references the UUID of the publisher in the data (or conceivably in someone else’s data).

kathrynods · February 21, 2025, 2:53pm

Thanks Mike

adding required fields would remove backwards compatibility so new fields should at least be optional for now

I think as long as the publisher object itself is not required this would not break backward compatibility as users could still provide no publisher data at all. However, I agree it isn’t useful to require specific fields within publisher unless they are needed.
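
A small sketch of this point: validation only applies the inner required fields when a publisher object is actually supplied. The `publisher` key and the required fields (id, name, description) follow the object-reuse option above; the function itself is an illustration, not part of any HSDS tooling.

```python
# Fields marked as required in the object-reuse option discussed above.
REQUIRED_IF_PRESENT = ("id", "name", "description")

def publisher_errors(root: dict) -> list:
    """Return validation messages for the optional publisher object."""
    publisher = root.get("publisher")
    if publisher is None:
        return []  # no publisher data at all is still valid
    return [
        f"publisher.{field} is required"
        for field in REQUIRED_IF_PRESENT
        if field not in publisher
    ]

# An existing feed with no publisher still passes.
print(publisher_errors({"version": "3.0"}))
# A partial publisher object is flagged.
print(publisher_errors({"version": "3.0", "publisher": {"name": "Example Council"}}))
```

Existing feeds that omit the object keep validating, which is the backward-compatibility argument; the cost is that a publisher who only knows a name must still supply the other required fields.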

I wouldn’t object to also having an optional publisher id field which references the UUID of the publisher in the data (or conceivably in someone else’s data).

@klambacher is this a field you could use for “source system code”? Do you think this would be useful?

MikeThacker · February 21, 2025, 3:37pm

Ahh yes, I get your point on the publisher object not being required

I personally haven’t come across a need for that field. Others may have.

kathrynods · February 26, 2025, 3:44pm

Proposal for 3.2 - Extra metadata with API feeds is the forum post for approving this proposal
