Should we add `$id` properties to the HSDS Schemas?

Having $id properties is generally considered good practice for JSON Schemas, and it becomes particularly important when we consider validators, a single source of truth (SSOT) for each version of the schema files, and people caching them.

I’ve raised a GitHub issue to discuss this:

In general, I think that this is the direction we should be headed in. This has already been raised as a problem to me by Jeff Cumpsty, who is maintaining the ORUK Validator at the moment. Between the canonical HSDS Schemas, the UK Profile Schemas, and the copies of the UK Profile Schemas inside the ORUK Validator tool, there are three sets of schemas with similar names at different locations. Having $id properties cuts through this ambiguity: it allows validators to cache schemas properly and lets us track the version of each schema as part of its identifier.

Theoretically, adding these is as simple as adding the $id properties in with an appropriate URL, and updating the references appropriately. However, this might create a burden on Profile maintainers due to how the Profiles mechanism works in practice. The Profiles tooling could of course be adapted, but this would take a bit of time.
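To make the “simple” case concrete, here is a minimal Python sketch of stamping a top-level $id onto a schema. The base URL, the `add_id` helper and the cut-down `service` schema are all made up for illustration; none of this is the actual HSDS tooling. Note that if the relative $ref values already match the layout under the base URL, they would resolve against the new $id without being rewritten.

```python
import json
from urllib.parse import urljoin

# Hypothetical base URL: wherever the canonical schemas for a version are published.
BASE_URL = "https://example.org/hsds/schemas/3.2/"


def add_id(schema: dict, base_url: str, filename: str) -> dict:
    """Return a copy of `schema` with a top-level $id as its first key."""
    if not base_url.endswith("/"):
        base_url += "/"
    schema_id = urljoin(base_url, filename)
    # Drop any existing $id so the new one is the only one, and put it first.
    rest = {key: value for key, value in schema.items() if key != "$id"}
    return {"$id": schema_id, **rest}


# A cut-down, made-up schema that refers to a sibling file by relative $ref.
service = {
    "type": "object",
    "properties": {"organization": {"$ref": "organization.json"}},
}

with_id = add_id(service, BASE_URL, "service.json")
print(json.dumps(with_id, indent=2))
```

The harder part, as noted above, is how this interacts with Profile generation, which this sketch does not attempt to cover.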

It’d be appreciated if people could weigh in here or on the GitHub thread with their thoughts on this issue.


Hi everyone,

I’m suggesting we add ID properties to the HSDS schemas. This idea came to me while I was looking through the changelog for the 3.1.1 Profile and noticed that there were several copies of the JSON schema files without a clear way to identify the latest or standard versions.

I believe this is a crucial step for two main reasons.

Managing Schema Versions

Adding a unique ID to each schema would solve the versioning problem by removing ambiguity. Generating and including a new unique ID for each modification would also help to track changes.

I also suggest that profile tools could be modified to generate unique IDs when compiling a profile. This could include information about the profile owner, which would make it even easier to track the origin of a schema.

Centralized Schema Repository

The unique IDs would also make it easier to set up a central repository for generated schemas. Instead of sharing entire files, we could share a URI (the $id property) that points directly to the individual schema.

From my perspective, this would eliminate the need for us to store schema files locally. It would also reduce duplication and the potential for errors.

I think these two points, especially the logistics of implementation, are worth further discussion.


I guess this topic has stagnated for a while.

My specific reason for wanting a unique $id value assigned to each generated schema is to simplify the validation process.

An $id for a schema is a URI which uniquely identifies the schema. If there is a central repository for the generated schemas, the URI becomes a URL, which lets validation logic access the schema remotely and avoids duplication, with no need to store JSON schemas locally. It also removes the opportunity for variance between copies of copies of copies of the schema files. I appreciate that this could include some very bespoke schemas that diverge from a base HSDS profile.

(Anyone, correct me if I’m spouting rubbish.)

The validation engine, which I envision (I could be way off base), would potentially only require 2 URL values…the data feed and a schema version that it must satisfy.

We can talk about using LLMs and CLMs later. :slight_smile:


Realise that this thread is getting a little stale, but I’ve finally been able to sit down and finish my replies to it.

Headline: I am in full agreement that we should add $id properties to the HSDS Schemas. Where I’m hesitant is that I don’t (yet) understand the implications for Profile maintainers, given how the HSDS Schema Tools generate Profile schemas via JSON Merge Patches. However, new and existing Profiles can still explicitly add their own $id properties in the interim. I plan to do some experiments when I have time, and report back with a recommendation for the way forward.

The validation engine, which I envision (I could be way off base), would potentially only require 2 URL values…the data feed and a schema version that it must satisfy.

I know we’ve had discussions to this effect over technical meetings, calls, and the odd email exchange but I think I’ll write it down here so people scouring the forum can find it!

With the way that HSDS’ API Spec and Schemas are currently defined, validators should only really need a single URL to do validation. The API Spec defines that GET / should return an object which contains the following:

{
  "version": "3.2",
  "profile": "https://url-to-your-profile-here",
  "openapi_url": "https://url-to-openapi-url-here"
}

From here, you can just retrieve the openapi.json file used to describe the API. This will then define the endpoints etc. and importantly it defines the response schemas by way of URL to the compiled HSDS Schemas.

That means for every path defined in the openapi.json file, a validator can then automatically know what the response should look like by retrieving the schema associated with it. This is the fundamental “wiring up” of the API spec and the data models defined by the schemas.

But what about Profiles? Well, the HSDS Schema Tools command used to generate a Profile requires a URL that identifies your Profile. When it generates that Profile’s openapi.json file, it uses that URL as the basis for the URLs of your Profile’s versions of the schemas. The only manual wiring you’ll need to do is when you’re defining your own endpoints, I think.

As long as the Profile’s schemas are available at the URLs defined in the openapi.json file (if you supply a GitHub or GitLab repo URL as the profile URL, this should happen automatically), all a validator needs is a “root URL”. From there it can retrieve, step by step, all of the declarative information it needs to evaluate the results of an API call against the Schemas.
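To illustrate that step-wise retrieval, here is a rough Python sketch. It assumes the openapi.json file points at each response schema with an absolute URL in a `$ref` under the `200` / `application/json` response of each GET operation; the real file may be shaped differently. The `discover_schema_urls` helper, the example URLs and the `FAKE_WEB` stand-in documents are all hypothetical, and the fetcher is injected so the example runs without network access.

```python
from typing import Callable, Dict


def discover_schema_urls(root_url: str, fetch: Callable[[str], dict]) -> Dict[str, str]:
    """Follow root URL -> openapi_url -> response schema URL for each GET path."""
    root = fetch(root_url)                # the object returned by GET /
    openapi = fetch(root["openapi_url"])  # the OpenAPI description of the API
    schema_urls = {}
    for path, operations in openapi.get("paths", {}).items():
        responses = operations.get("get", {}).get("responses", {})
        content = responses.get("200", {}).get("content", {})
        schema = content.get("application/json", {}).get("schema", {})
        if "$ref" in schema:
            schema_urls[path] = schema["$ref"]
    return schema_urls


# Stand-in documents keyed by URL (assumed shapes, not real HSDS files).
FAKE_WEB = {
    "https://api.example.org/": {
        "version": "3.2",
        "profile": "https://example.org/profile",
        "openapi_url": "https://example.org/profile/openapi.json",
    },
    "https://example.org/profile/openapi.json": {
        "paths": {
            "/services/{id}": {
                "get": {"responses": {"200": {"content": {"application/json": {
                    "schema": {"$ref": "https://example.org/profile/schemas/service.json"}
                }}}}}
            }
        }
    },
}

schema_urls = discover_schema_urls("https://api.example.org/", FAKE_WEB.__getitem__)
print(schema_urls)
```

In a real validator the `fetch` argument would be an HTTP GET that parses JSON; the point is only that a single root URL is enough to reach every schema.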

Where I think we still need $id properties in the schemas is that these canonical URLs are currently only implicit, whereas declaring them in the $id of the schemas themselves allows tools such as validators to cache them reliably.

I’ve cooled on this idea a bit. As you mentioned, Matt, the schema URL currently functions as the $id for both validation and caching. The primary advantage of a distinct $id—specifically one that doesn’t necessarily match the URL—is typically for preloading batches of known schemas.

In my current validation logic, I use the URL as the cache key. This assumes that schemas with external references will define the $ref as a resolvable URL rather than an abstract ID.
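For anyone following along, a URL-keyed cache of this kind might look roughly like the Python sketch below. This is not the validator’s actual code; the `SchemaCache` class, the `fetch_json` helper and the example URL are hypothetical, and the fetcher is swappable so the demo runs offline.

```python
import json
from typing import Callable
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """Download and parse a JSON document (the default, network-backed fetcher)."""
    with urlopen(url) as response:
        return json.load(response)


class SchemaCache:
    """Cache schemas by the URL they were retrieved from."""

    def __init__(self, fetch: Callable[[str], dict] = fetch_json):
        self._fetch = fetch
        self._schemas = {}

    def get(self, url: str) -> dict:
        # Strip any fragment so "x.json#/properties/id" shares an entry with "x.json".
        key, _, _ = url.partition("#")
        if key not in self._schemas:
            self._schemas[key] = self._fetch(key)
        return self._schemas[key]


# Offline demo with a stub fetcher that records which URLs were requested.
requested = []

def stub_fetch(url: str) -> dict:
    requested.append(url)
    return {"type": "object"}

cache = SchemaCache(fetch=stub_fetch)
cache.get("https://example.org/schemas/service.json")
cache.get("https://example.org/schemas/service.json#/properties/id")
print(len(requested))  # the second lookup is served from the cache
```

If the schemas later declare an $id equal to their URL, the same key works unchanged.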

If we were to add $id now, wouldn’t it just mirror the URL anyway? If so, it might not provide much additional value for our current setup. (or have you already done it?)

If we were to add $id now, wouldn’t it just mirror the URL anyway? If so, it might not provide much additional value for our current setup. (or have you already done it?)

Yes it would, but it’s considered good practice in JSON Schema to have $id properties for the schemas. I think this is an uncontroversial thing to do for HSDS, as long as it doesn’t break our toolchain/build processes for the compiled schemas.

From Section 8.2.1 in the JSON Schema 2020-12 Core Spec:

The root schema of a JSON Schema document SHOULD contain an “$id” keyword with an absolute-URI [RFC3986] (containing a scheme, but no fragment).

So I think we’re ok to be moving towards adding them.

The primary advantage of a distinct $id—specifically one that doesn’t necessarily match the URL—is typically for preloading batches of known schemas.

Agreed, but I think any $id fields for HSDS would simply be the URL to the file on the GitHub repo.

In my current validation logic, I use the URL as the cache key. This assumes that schemas with external references will define the $ref as a resolvable URL rather than an abstract ID.

This is good and exactly how it should work, I think.

The top-level $id defines a base URI for the schema to support resolving $refs, and all the HSDS schemas use relative references as the value of $ref so they should be resolvable easily.
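As a small illustration of that resolution, Python’s `urllib.parse.urljoin` resolves a relative reference against a base URI. The sketch below walks a made-up schema and resolves every $ref against the top-level $id. The $id value and the `resolve_refs` helper are assumptions for the example, and it deliberately ignores nested $id keywords, which would change the base URI for their subschemas.

```python
from urllib.parse import urljoin


def resolve_refs(node, base_uri, found=None):
    """Collect every $ref in `node`, resolved against `base_uri`."""
    if found is None:
        found = []
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str):
            found.append(urljoin(base_uri, ref))
        for value in node.values():
            resolve_refs(value, base_uri, found)
    elif isinstance(node, list):
        for item in node:
            resolve_refs(item, base_uri, found)
    return found


# Made-up schema with a hypothetical $id and relative references.
schema = {
    "$id": "https://example.org/hsds/schemas/3.2/service.json",
    "type": "object",
    "properties": {
        "organization": {"$ref": "organization.json"},
        "phones": {"type": "array", "items": {"$ref": "phone.json"}},
    },
}

for uri in resolve_refs(schema, schema["$id"]):
    print(uri)
```

Without a declared $id, the base URI falls back to wherever the document was retrieved from, which is why the retrieval URL works as the identifier today.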

From Section 8.2.3.1 in the JSON Schema 2020-12 Core Spec:

The value of the “$ref” keyword MUST be a string which is a URI-Reference. Resolved against the current URI base, it produces the URI of the schema to apply. This resolution is safe to perform on schema load, as the process of evaluating an instance cannot change how the reference resolves.

Overall, I think we can make progress on adding top-level $id properties to the HSDS Schemas. We’ve got alternative Profiles tooling now which will work around it, so I think the only real blocker is the existing HSDS Schema compilation process, which results in some schemas which will need different $ids e.g. the service_list.json schema: