The use of UUIDs in the UK

This is a technical request from a logical person so please go easy with technical answers, I am looking for the benefits rather than simply meeting the standard.

I understand that UUIDs are necessary for the prime entities of Service, Location, Organisation and Schedule but there are a lot of UUIDs everywhere else in ORUK.

We have chosen to add the UUID of the related prime entity rather than generate a new UUID for a minor entity.

For example Contacts has its own UUID and then Phones is under contacts with a UUID.

Are we ok to set the phone UUID to be the service UUID?

I guess it is easy to say that the standard wants to generate a new UUID but can someone explain why that might be so we know if we are missing something.

Thank you

It’s a good question @iansingo and I hope others such as @mrshll will chip in.

The main reason for using UUIDs rather than simple identifiers generated within a publishing organisation is so we can combine records from multiple publishers without identifiers conflicting. Another benefit of UUIDs is that they provide a unique reference that can be used with other data, for example in referral systems and systems that analyse the effectiveness of services.
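To make the collision point concrete, here is a minimal sketch in Python. The two publishers, their records and the `uuid` key are all made up for illustration; they are not taken from the ORUK or HSDS schemas.

```python
import uuid

# Two hypothetical publishers that each number their services locally.
publisher_a = [{"id": "1", "name": "Food bank"}]
publisher_b = [{"id": "1", "name": "Debt advice"}]

# Merging on the local identifier: the second record overwrites the first.
merged_local = {rec["id"]: rec for rec in publisher_a + publisher_b}

# Merging on UUIDs: each record gets a random UUID, so both survive.
for rec in publisher_a + publisher_b:
    rec["uuid"] = str(uuid.uuid4())
merged_uuid = {rec["uuid"]: rec for rec in publisher_a + publisher_b}

print(len(merged_local))  # 1
print(len(merged_uuid))   # 2
```

The same UUIDs can then be quoted by a referral or analysis system without it needing to know which publisher a record came from.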

Those points are particularly important to the organization and service entities. I would argue that they are less important to “child” records where, for example, a contact has a parent service, so effectively the identifier of the contact is the UUID of the service + the identifier assigned to the contact.

I would not expect our Validator to pick up the re-use of UUIDs across different entities.

Aren’t there potentially multiple phone numbers that might be associated with a given service?

Also, can you help clarify the tradeoff – what is the cost associated with generating UUIDs for various entities?

Thanks @MikeThacker and @bloom

I understand the point of identifiers not being duplicated although ironically a bigger problem is de-duplicating the services and venues and organisations with different UUIDs from multiple sources but that’s a different story.

I think the conclusion is that if we want to hold one phone number then we can get away with using the service UUID. However, if we want to hold numerous phone numbers then they will need their own UUID.

Is that correct?

Well, each record needs its own identifier which theoretically can be the same for a contact as one of the phone numbers. So that’s not an issue if you only have one phone number.
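A small sketch of why the shortcut stops working with a second phone number. The record shape and the numbers are illustrative only, and the UUID is just an example value.

```python
# Hypothetical service UUID reused as the id of every phone record.
service_id = "ac148810-d857-441c-9679-408f346de14b"

phones = [
    {"id": service_id, "number": "01234 567890"},
    {"id": service_id, "number": "01234 567891"},
]

# A consumer that indexes phone records by id can only keep one of them.
phones_by_id = {}
for phone in phones:
    phones_by_id[phone["id"]] = phone

print(len(phones))        # 2 records published
print(len(phones_by_id))  # 1 record left after indexing by id
```

With a single phone per service nothing is lost, which is why the reuse only becomes a problem once there are several.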

The broader issue of whether a different UUID is needed for everything is something that, ideally, I’d like @mrshll to comment on. From my understanding it’s most important that services and organizations have different UUIDs. I think we understand one another. :slightly_smiling_face:

Hey @iansingo,

Sorry for the late reply to this @mikethacker; I was typing my reply yesterday when I realised I had to run to the Technical Meeting!

Practical stuff first:

Are we ok to set the phone UUID to be the service UUID?

This would pass validation according to the rules of the Standard, because all a validator would see is that phone.id is a UUID. Schema-based validators don’t do any cross-checking of identifiers, as that’s a Data Quality issue.
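To illustrate the difference between the two kinds of check, here is a rough sketch. It is not the ORUK Validator: `is_uuid` and `reused_ids` are hypothetical helpers, the dataset layout is an assumption, and the format check uses Python's standard `uuid` module, which is more lenient than a strict schema pattern.

```python
import uuid
from collections import defaultdict

def is_uuid(value):
    """Loose format check: True if the standard library can parse the value as a UUID."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True

def reused_ids(dataset):
    """Data-quality check: map each id used by more than one entity type to those types."""
    seen = defaultdict(set)
    for entity_type, records in dataset.items():
        for record in records:
            seen[record["id"]].add(entity_type)
    return {rid: types for rid, types in seen.items() if len(types) > 1}

SERVICE_ID = "ac148810-d857-441c-9679-408f346de14b"
dataset = {
    "service": [{"id": SERVICE_ID, "name": "Debt advice"}],
    "phone": [{"id": SERVICE_ID, "number": "01234 567890"}],
}

# Every id passes the format check, so a schema-style validation succeeds...
format_ok = all(is_uuid(r["id"]) for records in dataset.values() for r in records)
# ...but the cross-check shows the same UUID on two different entity types.
shared = reused_ids(dataset)

print(format_ok)
print(shared)
```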

However, it’s not good practice for UUIDs. @mikethacker encapsulates the main reason for using UUIDs quite well; it’s so that records from multiple sources can be ingested and uniquely identified safely without collisions, and you can then use the UUID in other systems which want to do stuff with the data.

Ideally, you should be creating UUIDs for every unique entity you have, as this is how UUID is supposed to work. How you do this will depend on your systems. Generating them initially should be straightforward, but then you’d need to ensure that this UUID was actually stored with each record. Depending on how your systems work, this might be tough, e.g. if your database doesn’t have a concept of a “Phone” and instead groups that information in another record type.
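As a rough sketch of the two situations, assuming Python and made-up record shapes: `ensure_id` covers the case where the UUID can be stored with the record, and `derived_phone_id` is one possible workaround (my suggestion, not something the Standard prescribes) for systems with no separate phone record, using a name-based UUID so the same inputs always give the same identifier.

```python
import uuid

def ensure_id(record):
    """Assign a random UUID once; later calls leave a stored id untouched."""
    if not record.get("id"):
        record["id"] = str(uuid.uuid4())
    return record

def derived_phone_id(service_id, number):
    """Derive a repeatable version 5 UUID from the parent service UUID and the number."""
    return str(uuid.uuid5(uuid.UUID(service_id), number))

SERVICE_ID = "ac148810-d857-441c-9679-408f346de14b"

phone = ensure_id({"number": "01234 567890"})
stored_id = phone["id"]
ensure_id(phone)  # the id does not change on a second pass

derived_a = derived_phone_id(SERVICE_ID, "01234 567890")
derived_b = derived_phone_id(SERVICE_ID, "01234 567891")
print(derived_a != derived_b)  # True
```

The derived identifier changes whenever the phone number changes, so it identifies "this number on this service" rather than a long-lived phone record. Whether that is acceptable depends on how consumers use the id.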

If you share a little bit about your systems, I could suggest a few things.

The background/motivation of UUIDs in HSDS:

The long and short of it is that UUIDs are the most straightforward and easiest-to-implement solution for uniquely identifying different types of record in a global dataset, given the design philosophy and history of HSDS as a Standard.

I wasn’t around in the early days of HSDS so I am keen to be corrected, but I understand that HSDS comes from a place where people have been storing and exchanging sets of normalized data, where everything has an identifier in the system. Previously, it was published as a “Tabular Data Package”, with CSV files containing identifiers to tables of data in other CSV files.

A lot of that history is reflected in the current design of the HSDS. The different objects in HSDS preserve the ability to reference other ‘tables’ of objects by identifier, whereas a more contemporary approach is to totally abstract implementation details (i.e. normalized databases) away from the data model used for exchange. In other Standards used for Open Data, globally unique identifiers are usually only necessary for the top-level object e.g. a Grant or a Contracting Process. Everything under that just needs a local identifier to support parsing and querying arrays. It’s very rare that someone analysing e.g. some OCDS data will need to uniquely identify a particular document in a merged dataset of documents. Instead, they’re looking for particular contracting processes and then can drill down to find the document they need.

With HSDS, I think it’s a bit different. Since phone numbers can be attached to services, organizations, locations etc., HSDS has modelled that. Although I imagine there will always be overlap in global datasets, having a UUID for the concept of a “phone” record means that systems have the ability to ingest data about various different entities and then update or check that records match appropriately. For example, they might have consumed information about Service A from somewhere, and created the entry for that service’s phone number in their systems. If they later ingest information about Organization A from somewhere else, they might spot that this is supposed to be the same phone number if it has the same UUID. This allows the system to then: update the record if appropriate (last updated date maybe), throw a warning out, or check that Organization A is somehow associated with Service A and therefore that the shared number is appropriate.
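The ingest scenario could look something like this sketch. The in-memory `store`, the `ingest_phone` helper and the record shape are all hypothetical; a real system would also have to decide what "associated" means for Service A and Organization A.

```python
def ingest_phone(store, phone, source_entity):
    """Record a phone by its UUID and report whether it was new, matched or conflicting."""
    existing = store.get(phone["id"])
    if existing is None:
        store[phone["id"]] = {"number": phone["number"], "seen_on": {source_entity}}
        return "created"
    existing["seen_on"].add(source_entity)
    if existing["number"] != phone["number"]:
        return "warning: same UUID, different number"
    return "matched"

store = {}
PHONE_ID = "1a2b3c4d-0000-4000-8000-000000000001"  # made-up UUID

first = ingest_phone(store, {"id": PHONE_ID, "number": "01234 567890"}, "Service A")
second = ingest_phone(store, {"id": PHONE_ID, "number": "01234 567890"}, "Organization A")

print(first)   # created
print(second)  # matched
```

If the phone record had simply reused the service UUID, the second ingest would have no reliable way to tell that both sources meant the same phone.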

Hopefully that has clarified a few things!

The broader issue of whether a different UUID is needed for everything is something that, ideally, I’d like @mrshll to comment on. From my understanding it’s most important that services and organizations have different UUIDs. I think we understand one another. :slightly_smiling_face:

In general, this is correct. Service and Organization are the two “main” models in HSDS, so identifying them uniquely is important.

An important caveat is that the UUID is really the identifier of the record, especially in the context of Organizations. Organizations also have access to the separate organization_identifier schema, which is a way of identifying an organization globally based on national/international registers of identifiers (e.g. Companies House numbers) for cross-dataset analysis e.g. tracking organizations across procurements and service delivery etc.

In practical terms: the schema requires that each entity has an id property, and that property must be a string conforming to the UUID format. Whether they’re unique or not is a data quality issue. It’s not ideal and might cause problems for data users, but it’s “valid”.

Ok thanks @mrshll, your points are understood.

As you say we have aligned the phone number with the service or organisation or venue and the contact who owns the phone number so any change is in context but I recognise that we can still do this and use UUIDs for those that might want to maintain the phone numbers independently.

However it does make it more difficult to use ORUK without any real benefits. I guess that is the balance decision that ORUK need to make.

Cheers

Ian

I agree with this, and in future developments of HSDS (and ideally ORUK, if it comes along) I’d love to see a move away from the need for UUIDs. I think it adds extra complexity and I agree with you 100% that it creates a burden of finding some way of de-duplicating records if two records have been published about e.g. an organization but have different UUIDs!

For Organizations, this problem is addressed (never solved!) by having good Organization Identifiers and there has been a lot of work in getting various governments and entities to open their registers of identifiers for use with open data.

I’m unaware of similar registries or identifiers for Services, and whether a single globally unique identifier for a Service could even be possible. At some point, Open Referral will need to figure out whether it wants to have a method or guidance for creating globally unique service identifiers based on some other fields, or whether this is something for the wider industry to address.

@iansingo it sounds like if you’re working with a system for which there are no foreseeable plans to exchange data with other systems, and in which there are simple records with only one phone, URL, etc, you could take the shortcut as proposed and still be technically compliant. Of course if there is a need for exchanging data among systems then the benefit of the UUID method is that it makes federation much more feasible.

We actually do have some tooling for deduplicating organizations and services from multiple sources, which might be helpful for addressing the bigger problem you allude to. Perhaps @skyleryoung can point to that.

Meanwhile, @mrshll at some point I’d like to hear more hypothesizing about how this could possibly be done without UUIDs at all.

I can’t help but comment on the de-duplication aspect as it is the crux of the real benefits of ORUK going forward. It is a burden but one that we must go through to get to the efficiency and accuracy benefits. UUIDs on the prime entities does help here but not so much on all entities.

The de-duplication issue is obviously bigger than good identifiers although this does mean we can maintain information from a particular source. The problem is working out which source has the accurate information. If there was definitive sources then we’d be fine. e.g. if local government owned all location info through the UPRN then we’d only have one source but it is not a complete data set and accessibility data comes from various places. Same for organisations, great if it all came from companies house or charity commission but it doesn’t. Services are even more difficult as there are many similar services and the data is collected by numerous sources and all with a slightly different take and none can really be relied on. UUIDs help us know the source information but don’t help identify one definitive source.

Our take on defining the source for services is to ask the organisation (service provider) to name their preferred assurer. This could be themselves or a third party but then we know the definitive Service UUID. From there we can ‘encourage’ other collectors not to waste taxpayers’ money and not to collect data, which in turn minimises the need for de-duplication.

We want people to focus on consuming the data from a single federated repository (as they will focus resources on accurate data) and then present it in numerous contexts so that it is appropriate for the particular context.

We are looking to exchange information with lots of frontends that will present the data in the context that they are focussed on e.g. adult social care, family information, SEND, youth and issues such as loneliness, suicide prevention, healthy living etc.

However a lot of the information e.g. phone number will remain in the context and not shared independently so we do need UUIDs but maybe not as many as we have.

The whole point of ORUK though is that those who are consuming the data know exactly what the data format is so that their software will work in Lancashire, Dorset, London or wherever. If the data collectors/aggregators/assurers all start interpreting ORUK in their own way then we will have simply created lots of proprietary standards. We probably need a stronger ORUK telling everyone that ‘this’ is the interpretation as that will get the most benefits.

And don’t get me onto taxonomies as that is crucial for the data consumers to be able to find appropriate services for their users. And we don’t even have one standard taxonomy. (I know we can map one to another). That is another story.

Meanwhile, @mrshll at some point I’d like to hear more hypothesizing about how this could possibly be done without UUIDs at all.

Don’t get me wrong, at some technical level you need unique identifiers to help identify specific records in systems. My take on this is that if HSDS is being used as an exchange medium, the system’s identifiers are not important. UUID is exceptionally good at identifying records, not necessarily specific entities which exist in the real world (such as services and organisations), which usually have multiple records describing them.

When exchanging data, UUID solves the issue of “how can I be sure that this record from this system is describing the same thing as it was before”. It doesn’t solve the wider issue of how to uniquely identify this service/this organisation in the world so that different systems modelling the same organisation can identify it as the same organisation.

The way we usually address this is via unique identifiers that exist in use elsewhere, such as registers of organisations etc.

I’m not sure how services are identified within or between jurisdictions. So UUIDs representing records of services might be the best mechanism we have there.

However, HSDS actually does cater for uniquely identifying organisations across datasets, via the organization_identifier schema.

This model provides a flexible-but-structurally-consistent way of applying known external identifiers to an organization in a HSDS dataset. The actual identifiers can be drawn from different places, and it’s known that lots of organisations have multiple identifiers, e.g. Open Data Services has a UK Companies House number and a US Tax Identification Number.

Multiple identifiers can be associated with a specific record, and a system ingesting information about the organisation can check its associated identifiers and cross-reference them with other known ones to uniquely identify the organisation.

One could use this not only to de-duplicate information about organisations within HSDS datasets, but also do cross-analysis with other datasets doing the same thing e.g. tracking service procurement from OCDS and then monitoring the service delivery/availability information in HSDS from the provider who won the procurement.
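A minimal sketch of that cross-referencing, in Python. The field names (`organization_identifiers`, `identifier_scheme`, `identifier`) are my assumption of how the organization_identifier data would appear on a record, the scheme codes are written in the org-id.guide style, and every identifier value and UUID below is made up.

```python
def identifier_keys(org):
    """Collect (scheme, identifier) pairs from an organization record."""
    return {
        (item["identifier_scheme"], item["identifier"])
        for item in org.get("organization_identifiers", [])
    }

def same_organization(a, b):
    """Treat two records as the same organisation if they share any external identifier."""
    return bool(identifier_keys(a) & identifier_keys(b))

record_a = {
    "id": "0b1c2d3e-0000-4000-8000-00000000000a",
    "name": "Example Advice Ltd",
    "organization_identifiers": [
        {"identifier_scheme": "GB-COH", "identifier": "01234567"},
        {"identifier_scheme": "US-EIN", "identifier": "00-0000000"},
    ],
}
record_b = {
    "id": "0b1c2d3e-0000-4000-8000-00000000000b",
    "name": "Example Advice Limited",
    "organization_identifiers": [
        {"identifier_scheme": "GB-COH", "identifier": "01234567"},
    ],
}
record_c = {
    "id": "0b1c2d3e-0000-4000-8000-00000000000c",
    "name": "Another Charity",
    "organization_identifiers": [
        {"identifier_scheme": "GB-CHC", "identifier": "7654321"},
    ],
}

print(same_organization(record_a, record_b))  # True
print(same_organization(record_a, record_c))  # False
```

The record UUIDs differ in all three cases. It is the shared external identifier that links the first two.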

We maintain a list of known lists of organisation identifiers at org-id.guide.

So, in a theoretical future, HSDS has lots of options to support this use-case. We could adjust the schemas in a few different ways:

  • Loosen the requirement so that organization.id doesn’t need to be a UUID, and encourage the use of org-id style identifiers for the id property.
  • Keep the UUID requirement for organization.id but ensure that semantically this only identifies the record from the particular system and then enforce that an organization object MUST have e.g. organization.identifier which is an instance of organization_identifier, as well as an optional list of additional identifiers to support cross-referencing in other systems.

For now, one could develop a Profile (or adjust the UK Profile) to add the enforcement of adding organization_identifiers, and write guidance about preferred identifiers (the UK Profile could also enshrine use of e.g. GB-COH numbers or GB-CHC numbers).
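As an illustration of what such a Profile rule might check, here is a hedged sketch. `profile_problems` is a hypothetical helper, the `organization_identifiers` and `identifier_scheme` field names are assumptions, and the preferred schemes are simply the two mentioned above.

```python
# Schemes a hypothetical UK Profile might prefer (taken from the examples above).
PREFERRED_SCHEMES = {"GB-COH", "GB-CHC"}

def profile_problems(org):
    """Return a list of problems with an organization record's external identifiers."""
    problems = []
    identifiers = org.get("organization_identifiers", [])
    if not identifiers:
        problems.append("no organization_identifiers supplied")
    elif not any(i.get("identifier_scheme") in PREFERRED_SCHEMES for i in identifiers):
        problems.append("no identifier from a preferred scheme")
    return problems

with_identifier = {
    "name": "Example Advice Ltd",
    "organization_identifiers": [
        {"identifier_scheme": "GB-COH", "identifier": "01234567"},  # made-up number
    ],
}
without_identifier = {"name": "Unregistered Group"}

print(profile_problems(with_identifier))     # []
print(profile_problems(without_identifier))  # ['no organization_identifiers supplied']
```

Whether a missing identifier should be an error or only a warning is a policy choice for the Profile, given the gaps mentioned elsewhere in this thread (sole traders and so on).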

the problem is working out which source has the accurate information. If there was definitive sources then we’d be fine. e.g. if local government owned all location info through the UPRN then we’d only have one source but it is not a complete data set and accessibility data comes from various places. Same for organisations, great if it all came from companies house or charity commission but it doesn’t.

100% agree; this is a challenge faced by lots of transparency and open data projects. I wish I could tell you that there’s a solution in the UK, but there simply isn’t.

The best we can do at the moment is label organisations using external identifiers as-best-as-we-can, and be clear which register that identifier is drawn from when it’s known.

In the UK, the single sign-on for doing procurement should help identifiers for organisations become more widely available, although there will always be gaps (sole traders, organisations which don’t need to register with the Charity Commission etc.).

Our take on defining the source for services is to ask the organisation (service provider) to name their preferred assurer. This could be themselves or a third party but then we know the definitive Service UUID.

This is interesting to me, as it starts to look similar to building a unique identifier out of component parts. If you had a record of the assurer’s organisation identifier (which is a big if, I know) you could use this as part of the identifier for the service.

I’m wondering though, how do you handle cases where the assurer either changes with the service provider, or the assurer assures multiple different services across a dataset?

HSDS 3 has a field to identify the assurer via their email address. We match this to the registration of the service provider where they state who their assurer’s email is. So if there are multiple assurers of the same service we could automatically tell the unregistered ones to stop wasting taxpayers’ money and only update according to the registered assurer. The provider has the ability to change their registered assurer, which will impact on the automated assurance. Obviously this only works for our API. If someone is aggregating from several APIs then they may encounter duplication but wouldn’t have access to our registration system, so would not know the definitive assurer.
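Roughly, the matching described here could be sketched like this. It is not our production code: the registry dictionary, the `assurer_email` and `organization_id` field names and the email addresses are all assumptions for illustration.

```python
def is_registered_assurer(service, registry):
    """True if the service's assurer email matches the one its provider registered."""
    registered = registry.get(service["organization_id"])
    claimed = service.get("assurer_email")
    if registered is None or claimed is None:
        return False
    return claimed.strip().lower() == registered.strip().lower()

PROVIDER_ID = "0b1c2d3e-0000-4000-8000-00000000000a"  # made-up provider UUID

# Hypothetical registration data: provider UUID -> email of their preferred assurer.
registered_assurers = {PROVIDER_ID: "assurer@example.org"}

update_from_registered = {
    "organization_id": PROVIDER_ID,
    "name": "Debt advice",
    "assurer_email": "Assurer@example.org",
}
update_from_other = {
    "organization_id": PROVIDER_ID,
    "name": "Debt advice service",
    "assurer_email": "someone.else@example.com",
}

accepted = [
    update for update in (update_from_registered, update_from_other)
    if is_registered_assurer(update, registered_assurers)
]
print(len(accepted))  # 1
```

If the provider changes their registered assurer, only the registry entry changes and later updates are judged against the new email.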

However, at this stage there are not many people aggregating local sources never mind at a higher level so one for the future.
