Thanks for this Jeff.
My initial read of this is that we should be clear what we mean by compiling Profiles. When we discuss compilation for HSDS and Profiles I think what we’re referencing is two things:
- Some of the canonical HSDS schemas are the result of “compilation” steps, resulting in slightly different schemas.
- The Profile mechanism is effectively defined as: “you produce a bunch of patches atop the HSDS base schemas to define your Profile”.
I think your critiques are mostly focused on the second of those two — the actual Profile mechanism and its tooling — rather than the first, so I’ll focus on that in this reply.
Do we need a canonical, central repo dedicated to schema storage?
Yes, I think that arguably exists though. It’s the openreferral/specification repository.
Do we need to pre-compile schemas?
I’m going to take this as “do we need to generate Profile schemas via patches?”. As noted, there is a whole can of worms ready to be opened re “compiled schemas” in HSDS!
I think the dichotomy you highlight is the following:
- Profiles currently exist as a series of JSON Merge Patches defined atop a branch of the HSDS Base Schemas. To get a fully working set of Profile schemas for validation, you need to apply the Merge Patches to produce the schemas. This is currently done “offline”, usually on a maintainer’s copy of the repo, and committed back.
- There is an alternative model using “forks” of the canonical HSDS Specification repo, where changes are maintained on the set of schema files in each fork (which are the full schemas, not just patches).
The headline of my take is that, tooling issues aside, I am broadly a fan of having Profiles defined as a series of patches, and don’t think we should move to a model where Profiles are forks of repos.
Generating a profile that differs from the standard HSDS offering involves creating a set of JSON files which are then merged into the base schema; references are resolved and the output is a set of schemas meeting your requirements. Personally, a partial schema file just doesn’t sit well with me; it keeps me up at night.
The JSON files for a Profile are JSON patches as defined by RFC 7396.
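To make the mechanism concrete, here is a minimal sketch of RFC 7396 semantics in Python. The `merge_patch` helper and the example field names (`fax`, `uk_region`) are invented for illustration; they are not the actual HSDS tooling or schema fields.

```python
# Minimal sketch of RFC 7396 JSON Merge Patch semantics.
# Hypothetical helper; the real Profile tooling may differ.
def merge_patch(target, patch):
    """Apply an RFC 7396 merge patch to a target document."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return patch
    if not isinstance(target, dict):
        target = {}
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            # A null value removes the member from the target.
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result

# Example: a Profile patch that drops one field and adds another.
base = {"properties": {"id": {"type": "string"},
                       "fax": {"type": "string"}}}
patch = {"properties": {"fax": None,
                        "uk_region": {"type": "string"}}}
profile = merge_patch(base, patch)
# profile == {"properties": {"id": {"type": "string"},
#                            "uk_region": {"type": "string"}}}
```

The whole thing fits in a dozen lines, which is part of the appeal: two independent implementations can be checked against each other by hand.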
I think the benefits of these are:
- The process is defined in an RFC, atop a well-defined and well-understood file format (JSON: RFC 7159), meaning anyone can implement it (we have two implementations already) and anyone can check those implementations.
- The JSON Merge Patches can be diffed easily, in case we ever want to generate summaries of differences between profiles. This is also true of comparing branches across repos but…
- It decouples the definition of the Profile from a particular version control ecosystem. Yes, git won’t be going anywhere for a while. But svn wasn’t going anywhere either! jj is on the rise, and technology trends aside, having Profiles exist simply as “patches which can be mechanically used to produce schema files following a process” allows people to maintain their own systems and infrastructure, rather than be tied to Github (and, to a lesser extent, the rest of the git ecosystem). This plays nice with future-proofing, fits in with HSDS’ declarative model, and prevents lock-in. “Forks” as defined by Github, Gitlab, etc. are all vendor-specific constructs; if I wanted to fork the openreferral/specification repo and self-host it on my own instance of Forĝejo or SourceHut, I’d need to do manual wiring to maintain contact with upstream. If everyone was on Github and I needed to run my own infrastructure, I’d be excluded from the network of Github repos unless I did a lot of wiring with a mirror. If everyone was scattered across different forges, we’d be emailing patches to each other using git send-email… which is not a bad thing to do, but is more work than maintaining your own repo which just has patches on it used to generate your schemas.
To build a set of schema using the current tooling the developer still has to go through all of the schema files, decide what to include, what to omit, and where to extend…and produce a set of (almost) json schema files.
I think this is by design. If one is producing a Profile, you have to be clear about what your aims are. Who is your audience, and how do their needs differ from those addressed by HSDS itself? What additional fields do you need to support, and which parts would over-complicate things?
Having to produce a patch for each schema, even if just to null it out, forces that reflection point. Or at least I think the hope is that it does!
My thoughts would benefit from a canonical location for schema storage. For example, if there were a repository on Github dedicated to the set of JSON schemas, without additional source code, it could be forked, branched, and referenced accordingly; all neat, tidy, and succinct!
In theory, this is what the HSDS Specification repo is. The one key deviation here is that the HSDS Repo has canonical documentation as well, which it makes sense to co-locate with the specification because it should be versioned alongside it.
In practice, I think v3.0 of the HSDS repo was very ambitious about also including automations that produced serializations. I think we’re making progress in paring it back to get it closer to how you describe (plus the docs ofc).
If the new feed opted not to extend or modify a particular schema, you could use $ref to directly reference the base file, or one in any other published repo. Branches and forks can be merged, rebased, compared, etc. fairly easily. One repo could even reference a particular file in another repo at a particular point in time by branch, or even down to the individual commit level.
You highlight an important thing here: if a Profile doesn’t touch an HSDS schema, it effectively wants to keep it as-is and “do nothing”; would it not be beneficial to have the Profile’s schemas instead $ref back to the original source schema?
In an ideal world of Linked Data and the Semantic Web, yes. This would be amazing. It would reduce duplication, promote interlinking and sharing of schemas, etc.
There are drawbacks and caveats to this. It exposes $ref resolution to infrastructure outages from other Profiles, and you’d need to make sure that you either tied a specific $ref to a specific commit in a specific repo or risked getting updates that you don’t want. This starts to resemble the dependency hell that arises in software package management ecosystems such as pip and npm, which have well-known and deep-rooted problems.
As a general rule all changes require a new branch, hence a new version, perhaps adding “b” (for beta) until the branch is agreed and published and will not be changed (unless the minor version number is increased).
In theory yes; however, if one is e.g. using a schema from another Profile, you trust that they have practices similar to your own. There might exist a future in which some HSDS Profiles are the work of a single developer pushing to main! You could point the URL at a specific commit, but that gets into the same discussion as above.
Mind you, this is true regardless of how we define a Profile. If we use a $ref to point to any external schema, we run that risk regardless of whether we’re generating Profile schemas via JSON Merge Patch or whether we’re maintaining forks.
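To make the pinning trade-off concrete, here is what the two flavours of external $ref look like side by side. The file path and the `<commit-sha>` placeholder below are illustrative only, not real references into the openreferral/specification repo:

```python
# Illustrative only: the file path and "<commit-sha>" are placeholders,
# not real references into the openreferral/specification repo.

# An unpinned $ref tracks whatever is currently on the branch, so
# upstream edits flow into your Profile whether you want them or not:
unpinned = {"$ref": "https://raw.githubusercontent.com/openreferral/"
                    "specification/3.0/schema/organization.json"}

# Pinning to a commit freezes the referenced schema, at the cost of
# manual version bumps: the dependency-pinning treadmill familiar
# from pip and npm.
pinned = {"$ref": "https://raw.githubusercontent.com/openreferral/"
                  "specification/<commit-sha>/schema/organization.json"}
```

Neither option is free: unpinned refs risk silent upstream changes, pinned refs need ongoing maintenance.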
One caveat that I discovered is that Github will rate-limit you if you attempt to reference its raw files in somewhat rapid succession. I do not know what these limits are, or whether there is a way of increasing them. To be fair, I was hammering the validation logic; I was running a few dozen asynchronous validation requests to see what would happen. Github gave me 404s. One option might be to cache JSON refs.
Aha, yes. Github famously started rate-limiting aggressively in the last few years. If I recall correctly, it was a response to the increased amount of AI/bot traffic. Regardless of one’s personal feelings on LLMs/AI, the irony is delicious.
The practical side is that we either need good caching in our tools or need to do something else.
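As a rough sketch of what “good caching” could mean for a validator: resolve each remote $ref URL at most once per run. The `RefCache` class and the stub fetcher below are hypothetical; a real resolver would also respect HTTP cache headers and handle errors.

```python
import json

# A minimal in-process cache for remote $ref targets, to soften
# rate limits. Sketch only: `fetcher` stands in for an HTTP GET.
class RefCache:
    def __init__(self, fetcher):
        self._fetcher = fetcher
        self._cache = {}
        self.misses = 0

    def resolve(self, url: str) -> dict:
        if url not in self._cache:
            self.misses += 1  # only the first request per URL "hits the network"
            self._cache[url] = json.loads(self._fetcher(url))
        return self._cache[url]

# Usage with a stubbed fetcher standing in for urllib/requests:
cache = RefCache(lambda url: '{"type": "object"}')
a = cache.resolve("https://example.org/schema/service.json")
b = cache.resolve("https://example.org/schema/service.json")
assert a == b and cache.misses == 1  # the second lookup never refetched
```

Even something this simple would have turned my few dozen asynchronous requests into a single fetch per distinct schema.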
Be nice…This is one of my first open contributions to the forum. I very well may have everything completely wrong.
These are valid discussions and the community needs to have them. While I personally may be in favour of the “patches rather than forks” approach, it’s not sacred by any means and warrants careful reflection. Questions and thought processes such as these are the embodiment of that kind of productive critique, and I think it’s safe to say that ORUK is currently the one forging a path for what it means to maintain a Profile of HSDS.
I am thinking this would result in smaller, and fewer, schema files that are interdependent but controlled. The schemas would use $ref to other schemas where appropriate and be more normalised than compiling all dependencies.
There is definitely something in the idea that a Profile might want to base itself off of another Profile in the future.
At the moment, we have a sort of “star” or a “hub and spoke” model, where we have HSDS and its various branches at the centre (the hub), and Profiles exist as separate variants of it (the spokes). It would be interesting to think through the implications of inter-dependency models wherein Profiles can either be based off others, or draw from multiple Profiles.
Therefore, and I know it will end up being Matt answering this (better be using Vim)…
Lol. See attached screenshot for evidence! (also note the sneak preview of my upcoming issue about Compiled Schemas…)