In our Validator and HSDS validation conversations, we’ve been dancing around different words and concepts relating to the quality of data coming through APIs. Reasons for pinning down this vocabulary include:
- As @oughnic pointed out most recently, knowing who to contact to “fix” problems. Is it a technology problem? Contact the technology developers. Is it a source data problem? Contact the source data stewards.
- So that we can intelligently talk about assessing and improving data processes and tooling.
Based on my recent experiences working in various data orchestration and collaboration projects, I’ll take the bold move of proposing some standardized vocabulary and definitions for data quality. I’ll be speaking from the perspective of Open Referral, assuming that someone is using HSDA/HSDS, using the validator, and seeing some problems. How should they describe the types of problems they are encountering?
- Validation
- Example of Good: “I can plug into this data source and all of my HSDA tooling just works automatically.”
- Example of Bad: “This data I’m getting isn’t even in the correct schema; I’ll have to retool for it.”
- Description: Whether or not the shape of data is valid HSDA/HSDS.
- Blame: Technology Vendors
- Field Level Interoperability
- Example of Good: “As much of this data as possible is machine readable, making it easier to parse, present, and share with users.”
- Example of Bad: “Argh! All hours of operation are plain text… I can’t tell users basic facts like ‘what’s open now’.”
- Description: The degree to which field level data like schedules, service areas, and language codes are structured, standardized, and generally machine readable.
- Blame: Technology Vendors or Data Stewards - Some technology doesn’t support creating structured data, some Stewards don’t use the features they already have
- Completeness
- Example of Good: “All critical objects and fields seem to be here, including contact info, Last Assured dates, addresses, basic descriptions, etc.”
- Example of Bad: “Many of these records are missing phone numbers and Application Process notes; users won’t even know how to contact services.”
- Description: The degree to which key data elements are present; missing elements degrade usability.
- Blame: Technology Vendors or Data Stewards - data may have been accidentally omitted by Vendors, or not provided to them by Stewards in the first place
- Freshness
- Example of Good: “Most of these records have been Assured within the past year; I feel like I can trust the accuracy of this data.”
- Example of Bad: “These service records haven’t been updated in over five years; I’m not sure I can trust that this is still accurate.”
- Description: How recently individual records have been Assured (AKA verified) for accuracy.
- Blame: Data Stewards
- Accuracy
- Example of Good: “Users report that all the phone numbers they called are working and connect to the expected service.”
- Example of Bad: “My users are showing up to services with the wrong required documents; our descriptions are listing the wrong documents.”
- Description: Whether assertions made in data and descriptions are true in reality.
- Blame: Data Stewards
- Richness
- Example of Good: “It feels like everything I need is here: descriptions, how to apply, application requirements, hours and contact info; as a user I know exactly what to do next.”
- Example of Bad: “These descriptions are too terse; I feel like important information is missing.”
- Description: Rich data improves the user experience by going beyond merely required information to include helpful information, like “Tips for Applying”, schedule notes on the best time of day to call, extra eligibility criteria or target groups, which required documents are preferred, supported languages and interpretation services, payments accepted, accessibility notes, etc.
- Blame: Data Stewards
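To make a couple of these dimensions concrete, here’s a minimal sketch of how “Completeness” and “Freshness” could be scored over service records. The field names (`last_assured`, `phones`, etc.) are illustrative assumptions, not canonical HSDS property names, and the one-year threshold is just a placeholder policy:

```python
from datetime import date, timedelta

# Hypothetical key fields; real HSDS/HSDA property names may differ.
REQUIRED_FIELDS = ["name", "description", "phones", "address", "last_assured"]

def completeness(record):
    """Fraction of key fields that are present and non-empty (0.0 to 1.0)."""
    present = sum(1 for field in REQUIRED_FIELDS if record.get(field))
    return present / len(REQUIRED_FIELDS)

def is_fresh(record, max_age_days=365, today=None):
    """True if the record was Assured within the given window."""
    today = today or date.today()
    assured = record.get("last_assured")
    if assured is None:
        return False
    return (today - assured) <= timedelta(days=max_age_days)

record = {
    "name": "Food Pantry",
    "description": "Weekly groceries for families in need.",
    "phones": ["555-0100"],
    "last_assured": date(2024, 6, 1),
}
print(completeness(record))                     # 0.8 (address is missing)
print(is_fresh(record, today=date(2025, 1, 1)))  # True (assured ~7 months ago)
```

A real rubric would presumably weight fields by importance (a missing phone number hurts more than a missing schedule note), but even a naive score like this makes the dimensions comparable across data sources.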
I’ll bet there are more dimensions to consider. Forgive me if I’m retreading ground that’s already documented somewhere.
The most subjective item in that list is “Richness”, but as a user experience designer I certainly feel the difference between “rich” vs “anemic” data. It’s easy to tell when the end-user experience was front-of-mind for Data Stewards.
Freshness and Accuracy are closely related, but distinct. Inaccuracies can occur even in the midst of fresh data: user error or misunderstanding, faulty sources, etc. Perhaps this is an academic and not very useful distinction, but there it is. I’m just brainstorming at this point.
Inform USA has some useful elements to consider in Section 2 of their Standards: Standards - Inform USA (formerly AIRS, the Alliance of Information and Referral Systems)
@mrshll @bloom @oughnic @HannahN what would you add, remove, or change about this list? Do you already operate with a rubric for overall data quality? If so, what is it?