Should we add `$id` properties to the HSDS Schemas?

Having $id properties is generally considered good practice for JSON Schemas, and it becomes particularly important once we consider validators, a single source of truth (SSOT) for each version of the schema files, and people caching the schemas.

I’ve raised a GitHub issue to discuss this:

In general, I think that this is the direction we should be headed in. This has already been raised with me as a problem by Jeff Cumpsty, who is maintaining the ORUK Validator at the moment. Between the canonical HSDS Schemas, the UK Profile Schemas, and the copies of the UK Profile Schemas inside the ORUK Validator tool, there are three sets of schemas with similar names at different locations. Having $id properties cuts through this ambiguity: it allows validators to cache schemas properly, and it allows us to track the version of each schema as part of its identifier.

Theoretically, adding these is as simple as adding the $id properties in with an appropriate URL, and updating the references appropriately. However, this might create a burden on Profile maintainers due to how the Profiles mechanism works in practice. The Profiles tooling could of course be adapted, but this would take a bit of time.
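To make the mechanics a bit more concrete, here is a minimal Python sketch of what "adding $id and updating the references" could look like. The base URL, the file names, and the `add_id` / `resolve_ref` helpers are all hypothetical and are not the published HSDS locations or tooling. The only behaviour it leans on is that JSON Schema resolves a relative `$ref` against the enclosing schema's `$id`, which `urllib.parse.urljoin` imitates here.

```python
from urllib.parse import urljoin

# Hypothetical base URL, for illustration only.
BASE_URL = "https://example.org/hsds/3.2/schema/"


def add_id(schema: dict, filename: str, base_url: str = BASE_URL) -> dict:
    """Return a copy of `schema` with an absolute `$id` as its first key."""
    rest = {key: value for key, value in schema.items() if key != "$id"}
    return {"$id": urljoin(base_url, filename), **rest}


def resolve_ref(schema: dict, ref: str) -> str:
    """Resolve a (possibly relative) `$ref` against the schema's `$id`."""
    return urljoin(schema["$id"], ref)


service = add_id(
    {
        "type": "object",
        "properties": {"organization": {"$ref": "organization.json"}},
    },
    "service.json",
)

print(service["$id"])
print(resolve_ref(service, service["properties"]["organization"]["$ref"]))
```

With an `$id` in place, a relative reference such as `organization.json` no longer depends on where a particular copy of the file happens to live.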

It’d be appreciated if people could weigh in here or on the GitHub thread with their thoughts on this issue.


Hi everyone,

I’m suggesting we add $id properties to the HSDS schemas. This idea came to me while I was looking through the changelog for the 3.1.1 Profile and noticed that there were several copies of the JSON schema files without a clear way to identify the latest or standard versions.

I believe this is a crucial step for two main reasons.

Managing Schema Versions

Adding a unique ID to each schema would solve the versioning problem by removing ambiguity. Generating and including a new unique ID for each modification would also help to track changes.

I also suggest that profile tools could be modified to generate unique IDs when compiling a profile. This could include information about the profile owner, which would make it even easier to track the origin of a schema.
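As a rough illustration of what that could mean in practice, the sketch below stamps each compiled schema with an $id that encodes a profile owner and a version. Everything in it is an assumption made for the example: the `build_schema_id` and `stamp_profile` helpers, the `example.org` base URL, the `oruk` owner slug, and the URL layout. It is not how the existing profile tooling behaves.

```python
from urllib.parse import quote


def build_schema_id(base_url: str, owner: str, version: str, name: str) -> str:
    """Compose an `$id` from a base URL, profile owner, version, and schema name."""
    parts = [quote(part.strip("/"), safe="") for part in (owner, version)]
    return f"{base_url.rstrip('/')}/{'/'.join(parts)}/{quote(name, safe='')}.json"


def stamp_profile(schemas: dict, base_url: str, owner: str, version: str) -> dict:
    """Return copies of the compiled schemas, each carrying its own `$id`."""
    return {
        name: {**schema, "$id": build_schema_id(base_url, owner, version, name)}
        for name, schema in schemas.items()
    }


# Hypothetical compiled schemas, reduced to the bare minimum.
compiled = {"service": {"type": "object"}, "organization": {"type": "object"}}
stamped = stamp_profile(compiled, "https://example.org/profiles", "oruk", "3.1.1")

print(stamped["service"]["$id"])
```

Because the version is part of the identifier, a modified schema published under a new version gets a different $id from the one it replaces.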

Centralized Schema Repository

The unique IDs would also make it easier to set up a central repository for generated schemas. Instead of sharing entire files, we could share a URI (the $id property) that points directly to the individual schema.

From my perspective, this would eliminate the need for us to store schema files locally. It would also reduce duplication and the potential for errors.

I think these two points, especially the logistics of implementation, are worth further discussion.


I guess this topic has stagnated for a while.

My specific reason for wanting a unique $id value assigned to each generated schema is to simplify the validation process.

An $id for a schema is a URI which uniquely identifies the schema. If there is a central repository for the generated schemas, the URI becomes a URL, which permits validation logic to access the schema remotely and avoids duplication, without the need to store JSON schemas locally. You also remove the opportunity for variances in copies of copies of copies of the schema files. I appreciate that this could include some very bespoke schemas diverging from a base HSDS profile.

(Anyone, correct me if I’m spouting rubbish.)

The validation engine, which I envision (I could be way off base), would potentially only require 2 URL values…the data feed and a schema version that it must satisfy.

We can talk about using LLMs and CLMs later. :slight_smile:


Realise that this thread is getting a little stale, but I’ve finally been able to sit down and finish my replies to it.

Headline: I am in full agreement that we should add $id properties to the HSDS Schemas. Where I’m hesitant is that I don’t (yet) understand the implications for Profile Maintainers, given how the HSDS Schema Tools generate Profile schemas via JSON Merge Patches. However, new and existing Profiles can still explicitly add their own $id properties in the interim. I plan to do some experiments when I have time, and report back with a recommendation for the way forward.

> The validation engine, which I envision (I could be way off base), would potentially only require 2 URL values…the data feed and a schema version that it must satisfy.

I know we’ve had discussions to this effect over technical meetings, calls, and the odd email exchange but I think I’ll write it down here so people scouring the forum can find it!

With the way that HSDS’ API Spec and Schemas are currently defined, validators should only really need a single URL to do validation. The API Spec defines that GET / should return an object which contains the following:

```json
{
  "version": "3.2",
  "profile": "https://url-to-your-profile-here",
  "openapi_url": "https://url-to-openapi-url-here"
}
```

From here, you can just retrieve the openapi.json file used to describe the API. This will then define the endpoints etc. and importantly it defines the response schemas by way of URL to the compiled HSDS Schemas.

That means for every path defined in the openapi.json file, a validator can then automatically know what the response should look like by retrieving the schema associated with it. This is the fundamental “wiring up” of the API spec and the data models defined by the schemas.

But what about Profiles? Well, the HSDS Schema Tools command used to generate a Profile requires a URL that identifies your profile. When it generates that Profile’s openapi.json file, it uses that URL as the basis for the URLs of your Profile’s versions of the schemas. The only manual wiring you’ll need to do is when you’re defining your own endpoints, I think.

As long as the Profile’s schemas are available at the URLs defined in the openapi.json file (if you supply a GitHub or GitLab repo URL as the profile URL, this should happen automatically), all a validator needs is a “root URL”. From there it can retrieve, in a step-wise manner, all of the declarative information it needs to evaluate the results of an API call against the Schemas.
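Here is a rough Python sketch of that step-wise retrieval, so the chain is easier to picture. It is a sketch under assumptions rather than how any existing validator works: the `discover_response_schemas` function is hypothetical, all URLs are made up, HTTP is replaced by an injected `fetch` callable, and it assumes each GET response schema is referenced by an absolute URL in `$ref` under the usual OpenAPI 3 `responses -> 200 -> content -> application/json -> schema` path.

```python
from typing import Callable, Dict

Fetch = Callable[[str], dict]  # takes a URL, returns the parsed JSON document


def discover_response_schemas(root_url: str, fetch: Fetch) -> Dict[str, str]:
    """Map each GET path in the API's OpenAPI file to its response schema URL."""
    root = fetch(root_url)                # GET / -> version, profile, openapi_url
    openapi = fetch(root["openapi_url"])  # the openapi.json describing the API
    schemas: Dict[str, str] = {}
    for path, operations in openapi.get("paths", {}).items():
        response = operations.get("get", {}).get("responses", {}).get("200", {})
        schema = response.get("content", {}).get("application/json", {}).get("schema", {})
        # Assumption: the schema is referenced by absolute URL via "$ref".
        if "$ref" in schema:
            schemas[path] = schema["$ref"]
    return schemas


# Stand-in documents so the sketch runs without network access.
FAKE_DOCS = {
    "https://api.example.org/": {
        "version": "3.2",
        "profile": "https://example.org/profile",
        "openapi_url": "https://example.org/profile/openapi.json",
    },
    "https://example.org/profile/openapi.json": {
        "paths": {
            "/services/{id}": {"get": {"responses": {"200": {"content": {
                "application/json": {"schema": {
                    "$ref": "https://example.org/profile/schema/service.json"
                }}
            }}}}},
            # A path without a JSON schema reference is simply skipped.
            "/": {"get": {"responses": {"200": {"description": "API metadata"}}}},
        }
    },
}

result = discover_response_schemas("https://api.example.org/", FAKE_DOCS.__getitem__)
print(result)
```

In a real validator the `fetch` argument would be an HTTP GET that parses JSON; keeping it injectable makes the walk itself easy to test.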

Where I think we still need $id properties in the schemas is that these canonical URLs are currently only implicit, whereas declaring them in the $id of the schemas themselves allows tools such as validators to cache them nicely.
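To illustrate the caching point, here is a minimal sketch of a cache keyed on the declared $id rather than on wherever a copy was downloaded from. The `SchemaCache` class, the mirror URL, and the fake fetcher are all hypothetical; a real validator would plug in an HTTP client and would probably want expiry rules as well.

```python
from typing import Callable, Dict, List


class SchemaCache:
    """Cache schemas under their declared `$id`, fetching each URL at most once."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch
        self._by_id: Dict[str, dict] = {}
        self._alias: Dict[str, str] = {}  # retrieval URL -> canonical $id

    def get(self, url: str) -> dict:
        key = self._alias.get(url, url)
        if key in self._by_id:
            return self._by_id[key]
        schema = self._fetch(url)
        canonical = schema.get("$id", url)  # fall back to the retrieval URL
        self._alias[url] = canonical
        # If another copy with the same $id is already cached, reuse it.
        return self._by_id.setdefault(canonical, schema)


calls: List[str] = []
SCHEMA = {"$id": "https://example.org/hsds/3.2/schema/service.json", "type": "object"}


def fake_fetch(url: str) -> dict:
    """Pretend every location serves a copy of the same schema."""
    calls.append(url)
    return dict(SCHEMA)


cache = SchemaCache(fake_fetch)
a = cache.get("https://example.org/hsds/3.2/schema/service.json")
b = cache.get("https://mirror.example.net/copies/service.json")
c = cache.get("https://mirror.example.net/copies/service.json")

print(a is b, b is c, len(calls))
```

Without an $id, the two locations would have to be treated as two unrelated schemas; with it, the copy from the mirror is recognised as the same schema and the repeat request is served from the cache.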