Handling confidential addresses in HSDS 3.0

While working on a couple of transformation projects these past weeks, I was reminded of a fairly urgent need in the spec: the ability to denote when an address is confidential, or should otherwise be hidden from public view.

For example, I may want to store the address for a domestic violence shelter on the backend for call-centers to reference, but in many cases I will not want to display that address publicly. I need a way to communicate that detail to consuming applications.

I can think of two methods for solving this off the top of my head:

  1. Add a fourth enum to address.type with value of “confidential” or “hidden”. This takes advantage of an existing convention, which is good, but may be too ambiguous in defining what type of address is hidden. Is it a physical, mailing, or virtual address that is being hidden? In practice, I have only encountered a use case of physical addresses being hidden, but I’m not sure if that can be generalized as an implicit rule or not.

  2. Add a boolean “confidential” or “hidden” field to the address table. I don’t love adding another field unless we have to, but it does disambiguate what type of field is being hidden, if that matters.

@devin @MikeThacker

I’m very nervous about such a change because I only consider a publisher to be compliant with Open Referral if they have an open public API feed complying with the minimum API specification. So we can’t have an open feed with all data and a flag to show what’s confidential.

I’m also wondering if there are other fields or whole records in addition to address that should not be public.

Might we suggest that a private feed could add a field such as you suggest and include extra information but this is not to detract from the fact that non-confidential live data should be kept open?

We handle confidential contact information as well. Or rather: contact information that is only available to a subset of logged in users.

I understand Mike’s point about public APIs and think it’s a good one.

Maybe better for us to address confidential info in a section of the Guidance document?

Another issue similar to this, for us, is publishing/verification. Our users are creating these directories and they want to be able to save drafts of information, put it up for review, and then publish. HSDS doesn’t address that type of workflow but it does offer a format for the final product.

So, in summary, HSDS is strictly to be used as a standard for exchange of public data. When used for back-end or administrative purposes, it will need to be extended, and that’s a matter of implementation choices.

I can almost live with that, but I feel like this particular use case has come up quite a bit, and will be a necessary component of all federation attempts. Organizations will want to swap confidential addresses (and probably contacts, to Devin’s point), but keep that from being publicly viewable.

I think we should say that HSDS may also be suitable for separate interchange of non-public information, but this should not be seen as an alternative to making open all data that is not sensitive.

Perhaps we can define a Profile for exchange and collaboration of data between enterprise organizations. In such a case, the goal is still the exchange of data (and a positive move towards more openness), but they will need to collaborate on private information. I would prefer that such things are still standardized as much as possible since the goal remains interoperability.

I just ran into the need for this again today. I really think there is a necessary use case here. Consider this:

  1. A domestic violence shelter is public information that my clients want to share. The service, organization, and even general location info is all meant to be public and shareable.
  2. The only part of this record that should be confidential is basically the street address and geo-coords.
  3. It’s important that service area is still publicly accessible.

Having encountered this edge case several more times, I’m actually advocating strongly for adding an enum of confidential to address.address_type. I think this is still well within the spirit of open data, as everything else that’s meaningful about these records is public; it’s only the address itself that needs to be confidential, for the security and safety of the population these resources are intended to serve.

@MikeThacker @devin @bloom

Is an enum sufficient to address the risk of exposure? What could go wrong?

Thanks for the question @bloom . The main thing that can go wrong is the thing we are trying to solve: sensitive address information can get leaked.

My recommendation for data that’s intended for public consumption is to nullify address fields if the is_confidential field is true. That way no leak is possible.

At Connect 211, all of our clients’ data (so far) have had some sort of “is private” field on their address data, and we nullify the street address field when this field is true. If we had a “is private” field in the spec, we would set that directly and nullify all of the address fields where we also have service area data present on the record.
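To make the nullify-on-publish idea concrete, here is a minimal sketch in Python. It assumes an address record held as a plain dict with a boolean `is_confidential` field (the field proposed in this thread, not part of the current spec). The `STREET_FIELDS` tuple and the `redact_for_publication` function are names made up for illustration, and which fields to blank is a choice each publisher would make.

```python
# Fields blanked on a confidential address. This list is an assumption for
# illustration; a publisher might keep or blank a different set.
STREET_FIELDS = ("attention", "address_1", "address_2", "postal_code")


def redact_for_publication(address: dict) -> dict:
    """Return a copy of the address that is safe for a public feed.

    The input record is left untouched so the back-end copy keeps its data.
    """
    public = dict(address)
    if public.get("is_confidential"):
        for field in STREET_FIELDS:
            if field in public:
                public[field] = None
    return public


internal = {
    "id": "addr-1",
    "address_1": "10 Main Street",
    "city": "Springfield",
    "address_type": "physical",
    "is_confidential": True,
}
public = redact_for_publication(internal)
```

The flag stays on the published record, so a consuming application can tell a withheld address apart from one that was never collected.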

Last side note: as I’ve been thinking this through some more, and as presented above, I have concluded that a boolean address.is_confidential field would be better than adding an enum to address.address_type. This way we can denote both the type (physical, postal, or virtual) and confidential status of the address. Most use cases that I know of are for physical addresses, which is why I initially conflated the two, but it’s more descriptive and flexible to break confidential status out to its own field.

This is a very interesting question.

HSDS is primarily intended for the public interchange of data; the work on use cases that has developed over time is underpinned by an assumption that data will be public and the current docs state that:

The primary use case served by HSDS is the provision of human service directory information as “open data,” to be consumed by any third-party information system.

It is convenient that HSDS can also be used as the basis for an internal database structure, or as the means by which information is privately exchanged, but it’s not a design intention. That said, greater understanding of and use of HSDS throughout the data supply chain can only help reduce the costs associated with public data supply, so I certainly don’t think that it’s something that we should actively exclude.

So, my immediate reaction is to say that anyone who has confidential data should make sure that it doesn’t go into their HSDS publication, and that it’s up to them to work out how to do that in their systems. If two organisations do want to exchange information that’s confidential, then it’s up to them to work out what “confidential” means for them, and to put policies in place to ensure that the data is handled appropriately.

I think this potential ambiguity around what “confidential” means is what concerns me: access control to sensitive information can be complex. I could imagine, for example, an organisation wanting to put different controls around the specific locations of DV shelters (where the risk of a leak puts vulnerable people at immediate risk of harm) than services provided on a “pop-up” basis (where the risk of a leak is that they might be inundated by inappropriate requests).

I can think of two potential ways forward.

One might be to use the attribute mechanism to classify addresses (or, indeed, any other object!) according to a suitable access taxonomy. A profile might be used to then specify what that taxonomy is, and what the terms mean.

Another might be to add a field to describe access control measures: this could be free-text in the core standard but overridden to an enum by a profile.

I think we can say two things at the same time:

  1. HSDS data (primarily service data) should all be publicly available.
  2. Certain aspects of information about a service may be sensitive and situationally unsuitable for public display.

The primary issue I’m dealing with, which I don’t think I have adequately isolated previously, is the need to understand in the data why the address is missing on one record and not most of the others. We need to detect and communicate that reason to application users in order to avoid confusion. In our system, for example, when we detect a “confidential” address, we give users this tooltip:

This location may be confidential or not open to the public. Please choose another provided contact method for this service.

All of that said, I think your idea about using attributes is quite workable.

The question before us is this:
Is this edge case common and important enough to warrant coercing a consistent solution in the specification (ie, adding an “is_confidential” field), or should we simply write some suggested guidelines around using attributes and leave it at that?

PS
In the latter case, I’d be willing to draft that article since it is top of mind for me.

The primary issue I’m dealing with, which I don’t think I have adequately isolated previously, is the need to understand in the data why the address is missing on one record and not most of the others. We need to detect and communicate that reason to application users in order to avoid confusion.

Ah ha! Now that, right there, is a penny-dropping moment. Wonderful stuff.

I wonder if the word that we’re looking for here is, in fact, “redacted”. A redaction marker is, I think, quite unambiguous: it means that in the context that you’re viewing the data, you don’t have permission to see this information, but you know it exists. It also has the crucial property that it replaces the data point in question: no-one will ever receive the redaction along with the data point and have to work out what that means. I like that.

In the case of public data, it means the location isn’t public. In the case of HSDS being used for internal interchange (e.g. the access control case I discussed above), it means that your current access level isn’t sufficient. HSDS doesn’t have to care about why it’s not sufficient - it’s a response format, in that case.

We could use attributes to effect this, but I wonder if a more HSDS-ey way would be to use the “type” fields that we already have. location, for example, already has location_type which is constrained to “physical, postal, or virtual”. We could add “redacted” to that list, and have a rule that if it’s “redacted”, then no other fields can be specified.
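A rough sketch of how that rule could be checked, in Python. The “redacted” value for `location_type` is only the proposal above, not part of the spec, and the function name `redaction_errors` is hypothetical. Allowing `id` to remain alongside `location_type` is also an assumption made here so that other records can still link to the location.

```python
# Fields allowed to carry a value on a redacted location (an assumption:
# the id is kept so that services can still reference the record).
ALLOWED_WHEN_REDACTED = {"id", "location_type"}


def redaction_errors(location: dict) -> list:
    """List the fields that break the 'redacted means nothing else' rule."""
    if location.get("location_type") != "redacted":
        return []
    extra = sorted(
        name
        for name, value in location.items()
        if name not in ALLOWED_WHEN_REDACTED and value not in (None, "")
    )
    return [
        f"'{name}' must be empty when location_type is 'redacted'"
        for name in extra
    ]
```

A publisher could run a check like this just before export, so that a stray latitude or description never travels with the redaction marker.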

As a side note, we should consider whether location uuids should be re-used in this case: if two things happen at the same redacted location, should they have the same uuid? What happens if one of them accidentally leaks information? e.g. if “DV Shelter” and “Yoga at Number 10 Main Street” both link to the same location, then it’s pretty obvious where the shelter is.

I think the fields most likely to require redaction already have “type” fields (location, phone, address), but it would be a MINOR update to add more.

Although part of me would like to move towards using attribute more, as I think it’s neater in principle, this does feel like it fits in with the broad design principle that we have continued in HSDS 3.0 of moving key attributes to be more proximate to the data itself.

(also, to be clear: I would expect the business logic that determines whether or not an address is specified or redacted to be entirely the concern of the publishing system: that could be as simple as a redact_when_publishing flag, or a full-featured IAM system. HSDS is just the container that moves the data around once that decision has been made)

What a delightful perspective shift. “Redacted” is a perfect fit.

For the record, I agree that “redacted” is the correct terminology for this concept, connoting that data exists, but should be hidden in a particular context, as you said.

This still leaves the question of implementation. I’ll recap the three obvious methods briefly:

  1. Add a redacted option to address.address_type. I do think we’ll need this on address instead of location so that a location can have both a visible mailing address and a redacted physical address, the most common use case.
  2. Add an is_redacted boolean field to the address table. This is necessary if it’s useful to know both the type of address (physical, mailing, or virtual) and whether it’s redacted.
  3. Add redacted as an attribute to the relevant records.

I see pros and cons for each.

address.address_type
PROS:

  • Establishes unambiguous standard for redacted data.

CONS:

  • Lacks descriptive power around what type of address is redacted.

address.is_redacted
PROS:

  • Provides the most descriptive power.
  • Establishes unambiguous standard for redacted data.

CONS:

  • I have a near visceral dislike of adding a boolean field. ¯_(ツ)_/¯

Use attribute
PROS:

  • Elegant, and it will work.
  • In the spirit of the spec.

CONS:

  • It’s more ambiguous; adoption depends on users reading guidelines or best practices.

What follows is now simply my opinion based on the above facts.

The necessity to describe what type of address is redacted eliminates option 1.

The degree to which we need to enforce uniform adoption of a solution for this problem decides which of the remaining two options we choose. The most intuitive and self documenting approach will be address.is_redacted. Whereas using attribute is more elegant, but will have to be specified, not in the spec, but in supplemental guidelines. I would assume this means lower levels of uniform adoption.

I’m open to either solution on balance. What are your thoughts?

Is there also the option of publishing a list of common attributes that are actually part of the spec?

There is one more option, which our move to JSON Schema opens up: an alternative definition for the address object, which would be a redacted address.

We could use the oneOf schema composition keyword, and provide one definition that includes all of the fields that describe an address (as per the current object), and a different definition that includes some indication that the data is redacted, along with any information about the address that we want to include alongside the redaction notice. That could be, for example, the address_type.

We would need some field which only existed in the redacted definition: that could be a string that can only have the value “redacted”, or a boolean that can only be true, or something like that. Just something to differentiate the redacted definition.

In tabular form, this would be represented as additional columns - it would fail validation if someone tried to supply fields from both definitions.
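For anyone trying to picture the oneOf approach, here is a heavily trimmed sketch, written as a Python dict so it can be dumped to JSON. The field list is cut down for brevity, and the `redacted` marker field is a hypothetical name; as noted above, it could equally be a string constant. None of this is the actual HSDS schema.

```python
import json

ADDRESS_TYPES = ["physical", "postal", "virtual"]

# Definition 1: an ordinary address (only a few fields shown).
FULL_ADDRESS = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "address_type": {"enum": ADDRESS_TYPES},
        "address_1": {"type": "string"},
        "city": {"type": "string"},
    },
    "required": ["id", "address_type", "address_1", "city"],
    "additionalProperties": False,
}

# Definition 2: a redacted address. The 'redacted' marker (hypothetical
# name) exists only here and can only ever be true.
REDACTED_ADDRESS = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "address_type": {"enum": ADDRESS_TYPES},
        "redacted": {"const": True},
    },
    "required": ["id", "address_type", "redacted"],
    "additionalProperties": False,
}

# An address must match exactly one of the two definitions.
ADDRESS_SCHEMA = {"oneOf": [FULL_ADDRESS, REDACTED_ADDRESS]}

print(json.dumps(ADDRESS_SCHEMA, indent=2))
```

Because both definitions set `additionalProperties` to false, a record carrying both `redacted` and `address_1` matches neither branch and so fails validation, which is the behaviour described above.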

I summon @davidraznick for his reckons.

That is an interesting idea. I hadn’t considered the additional utility that JSON Schema brings in this case.

I’m very interested to get @davidraznick’s feedback on this.

There are issues with using the oneOf schema keyword to choose different object structures depending on certain criteria.

  • Generating the datapackage will be more difficult and require custom logic.
  • If the various objects have the same properties but different types it makes it impossible to generate the datapackage. We would need to make a check to make sure this did not happen.
  • Validation errors from oneOf and anyOf are very hard to read unless customised.

However, there is another option. We could add is_redacted and potentially other fields to the address object and we can add validation logic that would be something like:

IF `is_redacted` is true THEN `redaction_notice` is required AND `address_1` should be empty AND `address_2` should be empty

In other words, I have found it to be better practice to separate the data structure from the validation of that structure. The data structure itself is better fixed (making generating tabular schemas easier). However, we can make the validation rules change depending on different criteria. This means we can add validation rules about what subset of the structure is required (or not allowed) under various conditions. We could also add logic to say things like:
IF `address_type` is `physical` or `postal` AND `is_redacted` is false THEN `address_1` is required
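As a minimal sketch, the two rules above could look like this in Python, with the structure fixed and only the validation varying. `is_redacted` and `redaction_notice` are the fields proposed in this thread rather than anything in the current spec, and `validate_address` is a hypothetical helper name.

```python
def _blank(value) -> bool:
    """Treat missing, None and empty string as 'empty'."""
    return value is None or value == ""


def validate_address(address: dict) -> list:
    """Apply the conditional rules; return a list of readable errors."""
    errors = []
    if address.get("is_redacted"):
        # Rule 1: a redacted address needs a notice and no street lines.
        if _blank(address.get("redaction_notice")):
            errors.append("redaction_notice is required when is_redacted is true")
        for field in ("address_1", "address_2"):
            if not _blank(address.get(field)):
                errors.append(f"{field} should be empty when is_redacted is true")
    elif address.get("address_type") in ("physical", "postal"):
        # Rule 2: an unredacted physical or postal address needs a street line.
        if _blank(address.get("address_1")):
            errors.append("address_1 is required for a physical or postal address")
    return errors
```

Since the messages are written by hand, they stay readable, which is one of the advantages over raw oneOf errors mentioned above.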

Thanks David, that is very interesting.

I think we have narrowed our options down to address.is_redacted or using the attribute table then.

What are our abilities for checking or enforcing rules with attributes? Can we check across joins like that?

In other words, would your example still work like this?
IF address has attribute of 'redacted' THEN 'address_1' is required

(Forgive my tragic pseudo-code).
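To show what a check across that join might involve, here is a rough Python sketch. It assumes attribute rows carry `link_entity`, `link_id` and `taxonomy_term_id` fields, which should be verified against the spec, and it uses a hypothetical taxonomy term id of "redacted". For the rule itself it borrows the “`address_1` should be empty” condition from the is_redacted example earlier in the thread.

```python
# Hypothetical id of a taxonomy term meaning "this record is redacted".
REDACTED_TERM_ID = "redacted"


def redacted_address_ids(attributes: list) -> set:
    """Collect ids of addresses that have the redacted attribute attached."""
    return {
        row["link_id"]
        for row in attributes
        if row.get("link_entity") == "address"
        and row.get("taxonomy_term_id") == REDACTED_TERM_ID
    }


def check_redacted_addresses(addresses: list, attributes: list) -> list:
    """Flag addresses marked redacted that still carry a street line."""
    flagged = redacted_address_ids(attributes)
    return [
        f"address {addr['id']}: address_1 should be empty when redacted"
        for addr in addresses
        if addr["id"] in flagged and addr.get("address_1")
    ]
```

The check works, but it needs both tables in hand at once, so it is an application-level rule rather than something a per-object schema can express on its own.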

Maybe this starts to open up a different topic, but I’m inevitably wondering about our ability to enforce, or at least document, specific usages of attribute. The use case we are currently discussing has the goal of facilitating a better end user experience, and I could see it potentially applying to other data like phone numbers or emails. If that’s true, using attribute would be much more elegant than spamming is_redacted all over the place.

Perhaps the settling questions would be:

  1. To what extent do we need to enforce one method for indicating redacted records? I’m willing to hear that my use case only applies to myself, and we should not worry the spec with it. If that were the case, I would use attribute internally. I’m particularly interested in feedback from @robredpath and @bloom on this.
  2. If needed, to what extent can attribute be checked and enforced mechanically? @davidraznick, I think this is your domain :slight_smile:

Feels like an issue we might want to consider in the context of examples of how resource directory managers handle this stuff now.

Skyler, do you have examples of clients that manage non-public records? Can we learn more about how they do it? I can also ask some of our other partners.

I know that we did have one early adopter org that dealt largely with refugee and trafficking services, and once upon a time used HSDS for an entirely closed resource data sharing system. I’ve lost touch with the org since, but could reach back out.
