As we discussed, I wanted to chime in with some of the metadata that we have needed in the past and why; some is tied to the file / feed and some is per-record. I’ll break it down by who “needs” the extra information and why.
System / software level:
We have different import protocols depending on the source system, even when the incoming format is the same. Our external ID storage is actually system code + external ID, where the system code specifically identifies the software the record came from (theoretically we could accept system codes that cover multiple sources running the same software). This allows distinct handling based on source (e.g. taxonomy or coverage area name transformations). For non-GUID identifier systems (not an issue with HSDS, but an issue for some systems we manage), it also allows a unique ID to be created from system code + external ID, even when a record comes in with the same ID but has been forked / duplicated for various reasons. This is also distinct from what we would call the source database name/URL, since we can have the same software or version but multiple sources / feeds.
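To make the system code + external ID idea concrete, here is a minimal sketch in Python. The names `SourceKey`, `make_source_key`, `AREA_TRANSFORMS`, and the codes `sys_a` / `sys_b` are hypothetical and only for illustration; this is not how our importer is actually written.

```python
from typing import Callable, Dict, NamedTuple


class SourceKey(NamedTuple):
    """Composite identifier: source system code plus the ID the source supplied."""
    system_code: str   # hypothetical code for the source software system
    external_id: str   # ID exactly as it arrives in the file / feed


def make_source_key(system_code: str, external_id: str) -> SourceKey:
    """Build the composite key; both parts are required."""
    system_code = system_code.strip()
    external_id = external_id.strip()
    if not system_code or not external_id:
        raise ValueError("both system_code and external_id are required")
    return SourceKey(system_code, external_id)


# Per-source handling (assumed example): each system code can register its own
# coverage area name transformation; unknown systems just get whitespace trimmed.
AREA_TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "sys_a": lambda name: name.strip().title(),
}


def normalize_area(system_code: str, area: str) -> str:
    transform = AREA_TRANSFORMS.get(system_code, str.strip)
    return transform(area)


# The same non-GUID external ID from two systems stays distinct.
key_a = make_source_key("sys_a", "1042")
key_b = make_source_key("sys_b", "1042")
records = {key_a: "record from system A", key_b: "record from system B"}
```

The tuple works as a dictionary key, so two records that both arrive as `1042` do not collide as long as their system codes differ.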
Funder / attribution:
In some cases, data from multiple data-managing organizations or multiple public source sites comes out of the same software system in the same file / feed. Attribution at the system level is therefore not sufficient; it needs to be available per-record, for funder verification of contributions + data quality analysis, and for public record attribution on websites etc.
Ability to request changes and report issues:
- We need both an overall data source / provenance AND, where possible, the method to use to request corrections to specific data; this is key for trustworthiness and reliability for users.
In sum…
At the file/feed level we would have:
- Source system code (consistent system-internal identifier for the source software system, which is used for unique ID formation + specialized import handling)
- Source system name (“user-friendly” database source name in all applicable languages for display purposes)
- System URL (in all applicable languages; this points to the website of the data owners / public database source, NOT the software vendor)
- Source system version
- Schema version
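As a rough sketch of how the file/feed-level fields above might be carried in code (the field names, the language-tag keys, and the example.org values are all assumptions for illustration, not a proposed schema):

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FeedMetadata:
    """File/feed-level metadata; all field names here are hypothetical."""
    source_system_code: str              # stable internal identifier
    source_system_name: Dict[str, str]   # language tag -> display name
    system_url: Dict[str, str]           # language tag -> data owner site URL
    source_system_version: str
    schema_version: str

    def display_name(self, lang: str, fallback: str = "en") -> str:
        """Pick the display name for a language, falling back if it is missing."""
        names = self.source_system_name
        return names.get(lang) or names.get(fallback) or self.source_system_code


# Placeholder values only; none of these describe a real feed.
feed = FeedMetadata(
    source_system_code="sys_a",
    source_system_name={"en": "Example Community Directory",
                        "fr": "Repertoire communautaire exemple"},
    system_url={"en": "https://example.org/en", "fr": "https://example.org/fr"},
    source_system_version="4.2",
    schema_version="x.y",
)
```

Keeping the name and URL as per-language maps is one way to cover the "all applicable languages" requirement without adding a separate field per language.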
At the record level we would have:
- Record owner agency information (we have a unique code + name for attribution purposes and funder use)
- URL for submission of changes to the record (so that the public and/or other end users can suggest possible corrections to the data when they do not have direct editorial permissions)
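For the record-level fields, a minimal sketch of per-record attribution and correction links might look like the following. The names `RecordMetadata`, `contributions_by_owner`, and `correction_link` are hypothetical, and falling back to a feed-level URL when a record has no change-request URL is my assumption, not an agreed rule.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RecordMetadata:
    """Record-level metadata; field names are illustrative only."""
    external_id: str
    owner_agency_code: str                     # unique code for attribution
    owner_agency_name: str                     # name shown to the public / funders
    change_request_url: Optional[str] = None   # where to suggest corrections


def contributions_by_owner(records: Iterable[RecordMetadata]) -> Counter:
    """Count records per owning agency, e.g. for funder verification."""
    return Counter(r.owner_agency_code for r in records)


def correction_link(record: RecordMetadata, feed_url: str) -> str:
    """Prefer the record's own change-request URL; otherwise use the feed-level URL."""
    return record.change_request_url or feed_url


# Two agencies contributing through the same feed (placeholder data).
sample = [
    RecordMetadata("1042", "AG1", "Agency One", "https://example.org/fix/1042"),
    RecordMetadata("1043", "AG1", "Agency One"),
    RecordMetadata("2001", "AG2", "Agency Two"),
]
counts = contributions_by_owner(sample)
```

Because the owner is stored on each record, a single mixed feed can still be split out by contributing agency after import.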