What is good vocabulary for describing Data Quality?

In our Validator and HSDS validation conversations, we’ve been dancing around different words and concepts relating to the quality of data coming through APIs. Reasons for doing so include:

  • As @oughnic pointed out most recently, we need to know who to contact to “fix” problems. Is it a technology problem? Contact the technology developers. Is it a source data problem? Contact the source data stewards.
  • So that we can intelligently talk about assessing and improving data processes and tooling.

Based on my recent experiences working on various data orchestration and collaboration projects, I’ll make the bold move of proposing some standardized vocabulary and definitions for data quality. I’ll be speaking from the perspective of Open Referral, assuming that someone is using HSDA/HSDS, using the validator, and seeing some problems. How should they describe the types of problems they are encountering?

  • Validation
    • Example of Good: “I can plug into this data source and all of my HSDA tooling just works automatically.”
    • Example of Bad: “This data I’m getting isn’t even in the correct schema; I’ll have to retool for it.”
    • Description: Whether or not the shape of data is valid HSDA/HSDS.
    • Blame: Technology Vendors
  • Field Level Interoperability
    • Example of Good: “As much of this data as possible is machine readable, making it easier to parse, present, and share with users.”
    • Example of Bad: “Argh! All hours of operation are plain text… I can’t tell users basic facts like ‘what’s open now’.”
    • Description: The degree to which field-level data like schedules, service areas, and language codes are structured, standardized, and generally machine readable.
    • Blame: Technology Vendors or Data Stewards - Some technology doesn’t support creating structured data, some Stewards don’t use the features they already have
  • Completeness
    • Example of Good: “All critical objects and fields seem to be here, including contact info, Last Assured dates, addresses, basic descriptions, etc.”
    • Example of Bad: “Many of these records are missing phone numbers and Application Process notes; users won’t even know how to contact services.”
    • Description: The degree to which key data elements are present; missing elements degrade usability.
    • Blame: Technology Vendors or Data Stewards - data may have been accidentally omitted by Vendors, or not provided to them by Stewards in the first place
  • Freshness
    • Example of Good: “Most of these records have been Assured within the past year; I feel like I can trust the accuracy of this data.”
    • Example of Bad: “These service records haven’t been updated in over five years; I’m not sure I can trust that this is still accurate.”
    • Description: How recently individual records have been Assured (AKA verified) for accuracy.
    • Blame: Data Stewards
  • Accuracy
    • Example of Good: “Users report that all the phone numbers they called are working and connect to the expected service.”
    • Example of Bad: “My users are showing up to services with the wrong required documents; our descriptions are listing the wrong ones.”
    • Description: Whether assertions made in data and descriptions are true in reality.
    • Blame: Data Stewards
  • Richness
    • Example of Good: “It feels like everything I need is here: descriptions, how to apply, application requirements, hours and contact info; as a user I know exactly what to do next.”
    • Example of Bad: “These descriptions are too terse; I feel like important information is missing.”
    • Description: Rich data improves the user experience by going beyond merely required information to include helpful information, like “Tips for Applying”, schedule notes on the best time of day to call, extra eligibility criteria or target groups, which required documents are preferred, supported languages and interpretation services, payments accepted, accessibility notes, etc.
    • Blame: Data Stewards
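
To make a couple of these dimensions concrete, here’s a minimal sketch of what “completeness” and “freshness” checks might look like. The field names (`assured_date`, `phones`, `description`) and the one-year freshness window are my own illustrative assumptions loosely inspired by HSDS, not the exact spec:

```python
from datetime import date, timedelta

# Hypothetical record shape; real HSDS field names may differ.
REQUIRED_FIELDS = ["name", "description", "phones", "assured_date"]
FRESHNESS_WINDOW = timedelta(days=365)  # "Assured within the past year"

def completeness_issues(record: dict) -> list[str]:
    """Return the required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

def is_fresh(record: dict, today: date) -> bool:
    """True if the record was Assured within the freshness window."""
    assured = record.get("assured_date")
    return assured is not None and (today - assured) <= FRESHNESS_WINDOW

record = {
    "name": "Example Food Pantry",
    "description": "",                 # empty -> completeness issue
    "phones": ["555-0100"],
    "assured_date": date(2019, 1, 1),  # stale -> freshness issue
}

print(completeness_issues(record))          # ['description']
print(is_fresh(record, date(2025, 1, 1)))   # False
```

The point of splitting these into separate functions is the “Blame” column above: a completeness failure and a freshness failure route to different people, so a validator would want to report them separately.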

I’ll bet there are more dimensions to consider. Forgive me if I’m retreading ground that’s already documented somewhere.

The most subjective item in that list is “Richness”, but as a user experience designer I certainly feel the difference between “rich” vs “anemic” data. It’s easy to tell when the end-user experience was front-of-mind for Data Stewards.

Freshness and Accuracy are closely related, but distinct. Inaccuracies can occur even in the midst of fresh data: user error or misunderstanding, faulty sources, etc. Perhaps this is an academic and not very useful distinction, but there it is. I’m just brainstorming at this point.

Inform USA has some useful elements to consider in Section 2 of their Standards: Standards - Inform USA (formerly AIRS, the Alliance of Information and Referral Systems)

@mrshll @bloom @oughnic @HannahN what would you add, remove, or change about this list? Do you already operate with a rubric for overall data quality? If so, what is it?

I was wondering where Taxonomy/keywords would fit, but they could almost be their own category, especially when you look at rules around correctly applying the hierarchy, not coding secondary services, and coding programs consistently across agencies. However, taxonomy granularity is similar to richness (i.e., do you use “Food”, or “Food Pantries”, “Formula/Baby Food”, “Soup Kitchens”, and “WIC”?), so maybe it all fits there?

Thanks @skyleryoung - good topic.
My definition of “good data quality” is about whether a datum or collection of data is good enough to be used. Consequently, one person’s good data may be useless for someone else’s use case. The other natural consequence of this is that data are often recorded for one purpose and then “reused” for another purpose. Good data may not be fit for other purposes.

Specificity is another challenge. For example, a service description could be very precise or quite vague. Welfare support, food bank, vegetarian food bank, vegan food bank, kosher food bank, and halal food bank could all be used as labels for a service. The correct level of granularity depends on the needs of the user. Someone who is hungry and has no dietary preference may have very different needs to a halal eater in New York, but the kosher food bank may well match their needs for a service.

A framework used in England is to mark each datum on a five-classification system:

  • Valid - the datum is fit for the new purpose.
  • Other - the datum is technically correct but not specific enough to meet the new purpose.
  • Default - the datum is using a default value (e.g. 1 January for a date of birth), so there is no confidence in the value.
  • Invalid - the datum cannot be understood because it doesn’t meet the rules for the field.
  • Missing - no datum!

… and I’ll leave it to the reader to decide whether 13/03/2026 is a valid date. It probably depends on your context.

Skyler, can you say more about your perceived distinction between the following pairs:

  • Validation and field-level interoperability
  • Completeness and richness

These strike me from a distance as roughly synonymous. Are there differences between them?

I consider freshness to be a signifier for accuracy. It is true that they are different criteria, but the former is perhaps only relevant to inform assumptions about the latter?