Approaches to extending the HSDS data structure

At the initial working group meeting for the next version of the Open Referral data specification (HSDS) there was some discussion of approaches to extending HSDS for different use cases. We don’t want this to delay getting an upgrade of HSDS out, but ultimately we need to understand and agree on an approach.

Two approaches are:

  • Filter a wider (extended) core data structure in different ways for different applications
  • Extend a core HSDS in different ways for different applications

In the UK we took the second approach when developing Open Referral UK. HSDS 2 adopted some of our extensions but in a slightly different way, so UK and US/international structures have incompatibilities.

Work so far on aligning (see US and UK Alignment and version control from February 2022) takes the first approach of having an “extended” HSDS with filters for different application profiles.

I’m keen on the “filter an extended HSDS” approach because it allows for greater re-use of tools developed and shared, such as the Open Referral UK validator. It avoids the same properties being described differently by different publishers.

However, it does not have to be an either/or approach. Individual publishers can add their own properties, which the validator ignores as long as they don’t conflict with the core specification. Open API feeds can be examined so we can assess whether added properties would be worth incorporating into the HSDS standard in future versions.
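As a rough sketch of this “open” validation behaviour, here is a minimal checker over an illustrative (not actual) subset of core service fields: core fields are verified, while unknown publisher extensions pass through untouched.

```python
# Sketch of "open" validation, assuming a simplified core schema:
# core fields are checked, unknown extension properties are ignored.
CORE_SERVICE_FIELDS = {"id": str, "name": str, "status": str}

def validate_service(record):
    """Return a list of problems; extra publisher properties are ignored
    unless they reuse a core field name with the wrong type."""
    problems = []
    for field, ftype in CORE_SERVICE_FIELDS.items():
        if field not in record:
            problems.append("missing core field: " + field)
        elif not isinstance(record[field], ftype):
            problems.append("core field has wrong type: " + field)
    return problems

record = {
    "id": "svc-001",
    "name": "Community Counselling",
    "status": "active",
    "x_local_priority": 3,  # publisher extension: ignored, not rejected
}
print(validate_service(record))  # []
```

The extension property never causes a rejection; it only becomes a problem if it collides with a core field name.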

The HSDS “service attributes” and “other attributes” also allow for extension by referencing taxonomy terms that describe attributes. For future versions, we’ve suggested that we consider adding an optional “value” field to attributes. Again, added attributes can be reviewed to see whether anything is worth promoting to a property in its own right in future versions of HSDS.
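To illustrate the suggestion, here is a hypothetical shape for an attribute that references a taxonomy term and carries the proposed optional “value” field; the field names and term are illustrative, not drawn from the published schema.

```python
# Hypothetical attribute shape with the proposed optional "value" field.
attribute = {
    "link_type": "service",
    "link_id": "svc-001",
    "taxonomy_term": {"code": "wheelchair-access", "term": "Wheelchair access"},
    "value": "partial",  # proposed: qualifies the term, not just tags it
}

def attribute_value(attr, default=None):
    # Tolerate attributes published without the optional field.
    return attr.get("value", default)

print(attribute_value(attribute))  # partial
```

Consumers that don’t understand the optional field simply fall back to treating the attribute as a plain taxonomy tag.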

Thanks for this, Mike.

I think what makes this hard is that HSDS is used for interoperability inside contexts where certain aspects have to work in a particular way, and that particular way might be different between contexts: it would be great to have the UK validator be re-usable by other OR contexts, but we should also recognise that some features (I can imagine UPRN checks or AIRS taxonomy term lookups, for example) will always be context-specific.

I think that we broadly align on the question of how far we expect interoperability: the conversation so far has been much more about tools than data, so I think that we can focus on that.

Unsurprisingly, I guess it then comes down to a question of governance: how will we decide what gets in? How will we decide which of several, potentially incompatible, approaches to modelling certain concepts we use?

I don’t think that either a filter or an extension approach alone solves this; I think we just have to work through the consequences of our choice.

the validator ignores as long as these properties don’t conflict with the core specification.

This is good behaviour to have, IMO, but it is different from the behaviour that someone would experience taking an extended HSDS datapackage (which is how we currently ship HSDS) and running it through the OKFN datapackage validator. In practice, I don’t think anyone does this, or wants this “closed” behaviour - but we should be mindful of making sure that our tools match up with our expectations of the standard.

Thanks Rob. I think our views are broadly aligned.

some features (I can imagine UPRN checks or AIRS taxonomy term lookups, for example) will always be context-specific

Yes. It’s a question of what goes in the standard, what goes in an application profile and what is left to local interpretation.

Regarding your specific examples:

  • Por20349 - HSDS - US and UK Row 88 proposes an “external_identifier” which UK guidance or a UK public sector application profile might require to be a UPRN (as mandated by government)
  • I think we all see Open Referral as taxonomy agnostic but again an application profile might mandate a specific taxonomy like AIRS or the LGA’s service types in specific cases. Certainly, UK users who want to combine feeds from multiple sources want consistent use of taxonomies across those feeds.

After a recent upgrade working group call, I wanted to write up here something that I mentioned in passing that I think is relevant. @skyleryoung may be particularly interested in this as well.

Fundamentally, all of what we’re talking about here is making sure that a particular bit of data (a file, API response, etc) is structured in a way that means that a particular application is able to use it. And, by “structure”, I mean field names, properties of contents, API methods, and more - anything that can be considered a container for the information that needs to be exchanged between the systems involved.

Standards like HSDS prescribe the ways in which certain concepts have to be modelled: a service has an id and a name and a status, and (even though it’s not in the schema), the standard says that the email field is for an email address. Someone putting a postal address in the email field wouldn’t be following the standard.
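A schema check alone can’t catch a postal address in the email field; a semantic check can. The pattern below is a deliberately loose sketch of such a check, not a normative HSDS rule.

```python
import re

# Loose plausibility check: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def plausible_email(value):
    return bool(EMAIL_RE.match(value))

print(plausible_email("help@example.org"))       # True
print(plausible_email("12 High Street, Leeds"))  # False
```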

So far, so straightforward. But what about times when not everyone has the same needs of the data? That’s what we’re talking about in this thread: making it so that people who do have the same needs of the data are able to work together, without making the standard burdensome (or irrelevant) for everyone.

One approach is to put together a set of additional prescriptions: I might say that I need to know the fees for a service, and so I can only work with data where that field is filled in. If someone else has the same need, then we can agree to always use that field. We might even go further, and say “We offer HSDS-compliant data, always with fees”. Obviously, in practice, it’s likely to be a set of requirements that we can bundle up and call an “extension” or a “profile”, or something. This approach is great when (and this list isn’t exhaustive):

  • there’s a clearly defined need, and agreement around what’s being described
  • there’s some level of coordination, and the opportunity for the creation of a new artefact
  • a particular approach is required for a particular context - as in the priorities of OR-UK
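The bundled-prescriptions idea can be sketched as a profile that layers extra requirements on top of the core; the “always with fees” profile here is an illustrative example, not a real artefact.

```python
# Sketch of a profile as a bundle of extra prescriptions over the core.
def meets_profile(record, required_fields):
    """True if every profile-required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in required_fields)

FEES_PROFILE = ("id", "name", "fees")

ok = {"id": "svc-001", "name": "Advice", "fees": "Free"}
missing = {"id": "svc-002", "name": "Advice"}

print(meets_profile(ok, FEES_PROFILE))       # True
print(meets_profile(missing, FEES_PROFILE))  # False
```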

Another approach is to devise a framework by which data can be described. If I need data that has fees included, then I can check a bunch of data sources, and see if they include fees. The same holds if I need data that uses AIRS, or both AIRS and fees. I might even publish the results of my checking, so that anyone else in my situation can understand what data they can use.

We recently made a quality dashboard for 360Giving which is based on the idea that we can describe the qualities of data, rather than judge that data is “good quality” or “bad quality”. We’ve picked ~10 features of data that we think are useful to know about (such as grant duration, location codes, recent publication) and run a daily test of all the data in order to provide a report of what data has what qualities. This is intended to set out our idea of what’s useful in data (we chose qualities based on research) and provide a basis for potential data users to discover data that might be useful for them.
This approach is useful when (again, not exhaustive):

  • there’s no one clearly defined need, but several ideas of what’s useful in particular contexts
  • not all qualities are relevant to all data sources (e.g. an historic publication can never get a “recent publication” badge, and that’s ok, but means your data isn’t useful for a “this month in grantmaking” newsletter)
  • there’s little use for a new artefact
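The descriptive approach can be sketched as a set of named checks that report what share of records exhibit each quality, rather than a pass/fail verdict; the two checks here are illustrative stand-ins for the ~10 features mentioned above.

```python
# Sketch of describing qualities rather than judging "quality".
QUALITY_CHECKS = {
    "has_fees": lambda r: bool(r.get("fees")),
    "has_location_code": lambda r: bool(r.get("postcode")),
}

def describe_qualities(records):
    """Proportion of records exhibiting each named quality."""
    return {
        name: sum(1 for r in records if check(r)) / len(records)
        for name, check in QUALITY_CHECKS.items()
    }

records = [
    {"fees": "Free", "postcode": "LS1 4AP"},
    {"fees": "Free"},
]
print(describe_qualities(records))
# {'has_fees': 1.0, 'has_location_code': 0.5}
```

A data user who needs location codes can see at a glance that this source only half-qualifies, without anyone having declared it “bad”.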

These approaches are, of course, not mutually exclusive - but I think could help inform our discussion on how we proceed with extending/constraining/profiling HSDS.

In practical terms, the approaches do converge, so we don’t have to bake in a choice, right now and for all time.

Thanks for the thoughtful reply @robredpath. I always learn a lot listening to you (so to speak).

One of the principles I’ve been operating by is that the broadest and most useful interoperability exists in the core specification, which should never be violated by extensions (or filters or what have you).

Doubtless, my opinion is informed largely by my experience.

For example:

My clients, the 211s, are targeting two objectives relevant to this topic:

  1. breaking down data silos to cooperatively maintain data with other organizations (who don’t necessarily use the AIRS Taxonomy), and
  2. adding additional fields to the taxonomy_term table so that they can take advantage of the full feature set inside AIRS Taxonomy.

My observation is that they can accomplish both.

For the first case, the taxonomy_term fields of code and term in the core spec are more than adequate for handling the basic function of AIRS Taxonomy (Connect 211’s whole taxonomy search paradigm runs off just those two fields at the moment). But, importantly, cooperating organizations are more concerned about sharing contact information, addresses, services, and other nuts-and-bolts data than they are about using the AIRS Taxonomy.
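A search paradigm running off just those two fields can be sketched as a simple substring match over code and term; the codes and terms below are made up for illustration.

```python
# Sketch: taxonomy search using only the two core taxonomy_term fields.
TAXONOMY = [
    {"code": "X-100", "term": "Food Pantries"},
    {"code": "X-200", "term": "Housing Advice"},
]

def search_terms(query):
    """Case-insensitive substring match against code or term."""
    q = query.lower()
    return [t for t in TAXONOMY
            if q in t["code"].lower() or q in t["term"].lower()]

print(search_terms("food"))  # [{'code': 'X-100', 'term': 'Food Pantries'}]
```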

For the second case, a 211 Application Profile can add all of the additional AIRS Taxonomy fields that open up a world of possibilities for 211 apps. As long as the extension doesn’t violate core, they can have the best of both worlds.

I recognize that this all gets more complicated if we start down the path of sharing, or enforcing, standards around eligibility criteria, etc. However, at least in the USA, we are far from consensus on what standardized eligibility criteria should be, which is why I’ve been advocating not for standards around eligibility criteria, but rather standards around how to structure eligibility criteria, if possible. I certainly cannot enforce which age ranges should be used universally, but I can enforce how any given set of age ranges is stored in HSDS. Hopefully.
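The distinction can be made concrete: standardize the *structure* of an eligibility rule (here, an age range with optional bounds) without standardizing the values anyone must use. The field names are assumptions for illustration, not part of the spec.

```python
# Sketch: a shared structure for age-range eligibility, with locally
# chosen values. Either bound may be omitted.
def is_age_eligible(age, rule):
    lo = rule.get("minimum_age")
    hi = rule.get("maximum_age")
    return (lo is None or age >= lo) and (hi is None or age <= hi)

youth_rule = {"minimum_age": 13, "maximum_age": 17}

print(is_age_eligible(15, youth_rule))  # True
print(is_age_eligible(18, youth_rule))  # False
print(is_age_eligible(99, {}))          # True: no bounds, no restriction
```

Two publishers can disagree about where “youth” ends and still produce rules any consumer can evaluate.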

I’ll stop there lest I start rambling. I do want to note that I’m very intrigued by your model for assessing qualities vs “quality”. I think that’s brilliant. We’re potentially diving into that topic at an upcoming hackathon for Washington State data collaboration.

I’ll close with a question: are we in this group sharing the assumption that extensions should not alter the core spec in any way?

Thanks @robredpath. Interestingly, in the UK meeting to discuss use cases, the desire to see which records met a quality threshold came up.

I think the profiles we have in mind can address both the prescriptions and the descriptions you describe, by applying tooling differently. Assuming a validator checks compliance with the core HSDS (as reflected in API query responses), then you use the profile either to validate records (accepting or rejecting them) or to assess the quality of data, and maybe define a threshold for using records, e.g. in an aggregator.
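One profile driving both tooling modes can be sketched as follows: the same field list feeds a strict accept/reject check and a descriptive quality score; the profile contents are illustrative.

```python
# Sketch: one profile, two modes of applying it.
def missing_fields(record, profile):
    return [f for f in profile if not record.get(f)]

def accepts(record, profile):
    """Strict mode: reject any record missing a profile field."""
    return not missing_fields(record, profile)

def quality_score(record, profile):
    """Descriptive mode: score by share of profile fields present."""
    return 1 - len(missing_fields(record, profile)) / len(profile)

PROFILE = ("id", "name", "fees", "url")
record = {"id": "svc-001", "name": "Advice", "fees": "Free"}

print(accepts(record, PROFILE))        # False: strict mode rejects it
print(quality_score(record, PROFILE))  # 0.75: descriptive mode scores it
```

An aggregator could then set its own threshold (say, accept anything scoring 0.75 or above) without changing the profile itself.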

Yes. I think we’re saying that application profiles constrain the standard and extensions extend it, but neither violates it.