Hi @ananya_iyer,
This all looks very exciting! Really impressive to see what you’ve achieved here.
A few thoughts follow.
Transformer
This is really in-depth and impressive. I have some high-level, exploratory questions about the use-cases of the Transformer. This is purely based on my reading of the README and the docs, so I might’ve missed something in this thread, and I definitely don’t have much knowledge of what source data generally looks like; I’m keen to learn about the motivations you had and the design choices you had to make.
My main question is: what’s my input expected to look like? This might be obvious to everyone else, but from the docs I can’t tell whether I’m supposed to have some non-standardised CSVs or whether I’m supposed to be massaging my source data into another shape for the Transformer to then convert. How are people getting their CSVs for input? Are they database exports, or do people already have CSVs lying around?
This feeds into my larger question: where does this tool sit in the ecosystem? My experience/knowledge of HSDS is very oriented towards the API folk, who are talking about exchanging HSDS JSON from endpoints. Since that’s abstracted from implementation details there isn’t a need for a transformer, so I don’t know enough about the problems a Transformer solves to contextualise how I’m supposed to use it.
This isn’t a critique of the tool; I just don’t know enough about this side of the ecosystem to understand what problem it solves. I’m keen to hear where you see this fitting in, since I think we’d probably like to signpost it to people and we’ll need to give them enough context to decide whether this addresses their particular problem.
All of that could probably be addressed by adding a few lines to the README so that a newcomer to the Open Referral ecosystem can understand what role this plays.
Last thought on this: there’s a lot of manual back-and-forth of downloading various CSVs to use as mapping files etc. It might be that this is simply a reality of how the source data works, but if people are giving you CSVs, you can sometimes shave off a bunch of complexity by supporting them in producing a standardised tabular representation and then converting that to JSON.
We often recommend tools like flatten-tool for this. Its job is to support “round-tripping” of JSON data with flattened representations including CSV, XLSX, and ODS (the file format, not my company!). One interesting feature is that you can give it a JSON Schema and it can generate a series of “templates”. You can even tell it to use the human-readable names for the column titles.
So you could, for example, generate a template from the service schema and then people could use this to fill out their rows of data manually. Or you could let them put their data into a “src” tab and use some spreadsheet-fu to auto-populate the template.
Once you have the template populated with data, you can then just use flatten-tool to convert it to JSON. If you’ve generated the template from the service.json file, you’ll be expecting instances of service.json on the other side.
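To illustrate what I mean by round-tripping, here’s a toy sketch of the idea in plain Python. To be clear: this is not flatten-tool itself, and the column names are made up for illustration rather than taken from the real service schema — it just shows how path-style CSV headers map back to nested JSON.

```python
# Toy sketch of the flatten/unflatten idea in plain Python (NOT flatten-tool
# itself): column headers are JSON paths, so a CSV row round-trips to
# nested JSON. Field names are illustrative, not the real service schema.
import csv
import io
import json

# A flattened "services" sheet: slash-separated headers name JSON paths.
flat_csv = """id,name,contacts/0/name
svc-1,Food Bank,Jane Doe
"""

def unflatten_row(row):
    """Build a nested dict/list structure from path-style column names."""
    out = {}
    for path, value in row.items():
        parts = path.split("/")
        node = out
        # Walk/create intermediate containers, choosing list vs dict by
        # whether the *next* path segment is numeric.
        for part, nxt in zip(parts, parts[1:]):
            child = [] if nxt.isdigit() else {}
            if isinstance(node, list):
                idx = int(part)
                while len(node) <= idx:
                    node.append(None)
                if node[idx] is None:
                    node[idx] = child
                node = node[idx]
            else:
                node = node.setdefault(part, child)
        node[parts[-1]] = value
    return out

services = [unflatten_row(row) for row in csv.DictReader(io.StringIO(flat_csv))]
print(json.dumps(services, indent=2))
```

flatten-tool’s actual header conventions and array handling are more sophisticated than this, so do check its docs before committing to that route.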
It’s a different approach and might not be appropriate given your inputs, but maybe worth considering, even if only to discard it. It might be that your tool could focus on getting the really-messy-non-standard files into a shape that flatten-tool understands (a folder of CSVs, or a spreadsheet with headers generated from a chosen schema), and then let it handle the conversion from CSV.
Or I could be fundamentally misunderstanding the shape of the input data you’re working with!
Validator
Really good to see more things happening in the validator space for HSDS! The more of this, the better imho.
I’m aware that you’ve said you’re working on documentation for the validation tool, so forgive me for these questions which you’ll likely be covering there!
My questions are about the scope of the validator’s functions and where you foresee it fitting into a workflow. Is it designed to validate local HSDS JSON files, or to be pointed at an API endpoint to retrieve and validate data?
I’ve had a quick poke around the source code for the validator and I’ve probably missed something or misunderstood, but I think (emphasis on think) that:
- The CLI takes arguments for a directory of HSDS JSON files and a directory of the HSDS Schemas
- There’s also an option to spin up the validation library as an API with the same arguments available
- You’ve modelled the HSDS Schemas in Python objects
- When validating lots of files, you’re mapping the input file against a HSDS Schema based on the filepath of the input
- You’re having to dereference schemas manually in your tool
Broadly, I think this is a decent way to validate local HSDS files but I have some questions (these are my attempts to learn, not criticisms!):
- Why do you need to model the HSDS Schemas as Python classes? The jsonschema library (which you’ve already required) doesn’t really care about that, and you can use it to validate instance data against schemas directly. It seems you’ve done the hard work of matching models to schemas already; I just don’t know why one would need class representations of the HSDS schemas unless they were doing something specific.
- What’s the motivation to manually dereference schemas? In most cases I don’t think you need to dereference the schemas manually, except as a cool learning activity (which is worthwhile by itself). The jsonschema library handles this for you, and you can even define custom referencing behaviour if you’re having challenges with local files: JSON (Schema) Referencing - jsonschema 4.25.1 documentation
I’m also curious how you got on with the “compiled” HSDS schemas. These are the ones that the openapi definition is “wired up” to, and they’re already fully dereferenced. Notably, they’re only available for a few key schemas.
If you wanted to develop the tool further with the ability to validate external API endpoints: I have a working proof-of-concept validator also written in Python. It’s GPL-3 licensed so open for study and is copyleft (so has implications for any borrowing of code):