HSDS Transformer Use Cases

Hi all,

My name is Ananya Iyer. I am a member of Blueprint @ Stevens Institute of Technology in Hoboken, NJ, working with Open Referral on the transformer project. I’ve been drafting some user stories for the Transformer, to handle CSV/JSON to HSDS 3.0 conversions. Since we are relatively new to the project, I wanted to ask for some input based on your experiences with the old transformer.

So far I have put together a small first-draft spreadsheet with 10 user stories, covering CSV and JSON transformation, batch processing, error handling, API inputs, and configurable options. I have also drafted a couple of Gherkin-style scenarios to see how these stories might look as acceptance tests. Before I expand the list further, I’d really appreciate your feedback:

  • Are the right user roles (publisher, commissioner, consumer, developer, data manager) represented?

  • Are there any edge cases that should be included but aren’t yet?

  • Should we emphasize batch processing, error handling, or API support more strongly?

  • Are there lessons from the Validator stories that should definitely be applied here?

To make this easier, I’ve also created a short feedback form.

Thanks so much for your time and guidance — I’m looking forward to learning from your expertise and making sure these stories are useful to everyone.


Hi Ananya, and welcome to the Open Referral forums!

I work with the HSDS standard itself, so I can’t comment directly on the use cases from the perspective of someone wanting to produce the data; but these look fairly good from my understanding of the ecosystem.

I was chatting to @bloom about this and it tickled my memory in a few ways, so here are some things that you might find interesting in your quest to build a transformer:

First is from another Open Data initiative — Open Contracting and OCDS — with the OCDS ETL Pipeline tool. Also called OCDS Kavure’i (there is a tradition of naming OCDS tools after various birds…), it works by having the user develop some SQL queries, which are designed to produce data in OCDS’ CSV serialization format; this can then be transformed into OCDS-compliant JSON using flatten-tool and the OCDS schema files. The tool then loads it into an Elasticsearch database.

The tool was originally developed by the DNCP team in Paraguay and meets their use-cases, but I think the model is interesting even if you can’t fork it or re-use any of the code. HSDS is materially different to OCDS in a few significant ways (which I’m happy to chat through on a call if you need), but crucially HSDS still supports the Tabular Data Package format… which is a bunch of CSV files! So it’s feasible to think that one way to develop a transformer is to develop a pipeline similar to Kavure’i, where the input is .sql query files against the user’s database, resulting in a bunch of CSV files which technically constitute valid HSDS according to the Tabular Data Package format.
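For anyone unfamiliar with that pattern, the extract stage of such a pipeline can be sketched in a few lines. This is a hedged illustration, not Kavure’i’s actual code: it assumes a SQLite database and hypothetical `.sql` query files, and simply writes each query’s result set out as a CSV.

```python
import csv
import sqlite3

def run_queries_to_csv(db_path, sql_files):
    """Run each .sql file against the database and write its result set
    to a same-named CSV file (the extract stage of a Kavure'i-style pipeline)."""
    conn = sqlite3.connect(db_path)
    outputs = []
    for sql_file in sql_files:
        with open(sql_file) as f:
            cursor = conn.execute(f.read())
        # Column names from the query become the CSV header row.
        headers = [col[0] for col in cursor.description]
        out_path = sql_file.rsplit(".", 1)[0] + ".csv"
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(headers)
            writer.writerows(cursor)
        outputs.append(out_path)
    conn.close()
    return outputs
```

From there, the resulting CSVs would be the input to the tabular-to-JSON stage, which a tool like flatten-tool could handle.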

From there, I think the job would be to try and convert these to the HSDS JSON format as a final stage; although the serialization would technically still be valid HSDS I think the JSON format is more useful broadly and it’s what people think of when they think of HSDS APIs.

To be clear, I don’t think this would address all of the use-cases in your spreadsheet, but it might go some way to demonstrating how these types of tools have been built in practice in the past.

The second one is a lot jankier, and comes from my (now very outdated) PhD work trying to transform heterogeneous charity spreadsheet files into a common format. The method I used was to have charities export files into CSV format and then built an interface to support them doing “mapping” from the headers to the shared model, and then the mapping could be re-used for future uploads. While I wouldn’t strictly recommend this approach nowadays, I think there’s been great work in other standards producing mapping templates and I think there’s potential in taking such artefacts and using them to automate data transforms.
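As a concrete (and entirely hypothetical) sketch of that header-mapping idea: a reusable mapping is just a dictionary from source column names to shared-model field names, applied on each upload. The column and field names below are made up for illustration.

```python
import csv
import io

# Hypothetical reusable mapping: source column header -> HSDS field name.
MAPPING = {"PhoneNumber": "number", "OrgName": "name"}

def apply_mapping(csv_text, mapping):
    """Rename mapped columns and drop unmapped ones, yielding
    records keyed by the shared model's field names."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({mapping[k]: v for k, v in row.items() if k in mapping})
    return records

src = "OrgName,PhoneNumber,Notes\nFood Bank,555-0101,internal\n"
```

Because the mapping is plain data, it can be stored and re-applied to future uploads from the same organisation, which is the re-use property described above.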

Hopefully that helps somewhat, I’m looking forward to this work regardless of which direction you go in.

Hi @ananya_iyer welcome. Thanks for getting this started. I’ve just left a bunch of comments, including a suggestion that we might want to build this table in the main Tools spreadsheet (this is not a huge deal) and also some suggestions that the next steps are probably to start telling the ‘stories’ about how the contents of the source data can/should be translated into the contents of the HSDS schema (this is probably a bigger deal).

Since the contents of resource databases are wildly variable (even though they’re mostly about the same concepts, just named and structured very differently) the most important initial questions are probably about how the user specifies what should be transformed into what.

I do think the stories you have here are helpful too, in a big-picture sense of what users expect in terms of inputs and outputs. But content is the devil we have to tussle with.

cc’ing @skyleryoung @sasha @CheetoBandito

Re @mrshll’s suggestion: yes, we should definitely understand how their transformer works, because there may be important patterns there.

That said, I’m not sure the tabular data format is helpful at this point, even though it’s technically still compatible. If it were, we could also just return to updating @shelby’s HSDS 2.0 transformer! Still, yes, a helpful reference point.

Hi all, thank you so much for the helpful responses! @mrshll I will definitely take this information into consideration with these use cases. I’m interested in knowing more about the Tabular Data format and will research it in the coming days. Further, I want to ask whether the heterogeneous files would be part of general use cases for Open Referral’s purposes, or whether that is separate from the organization’s needs.

@bloom Thank you for the feedback, I will take a look and implement those changes soon. Hopefully I’ll get a lower-level view of the use cases with the feedback I’ve gotten, and hopefully some more responses in the form.

Looking forward to working more with you all!

I’m interested in knowing more about the Tabular Data format and will research that in the following days

Like @bloom suggested, I’d also advise that the Tabular Format is much less useful than the JSON format. Imho, it might be useful as an intermediary format and then you can focus on the challenge of transforming that into the HSDS JSON. So that way you break the challenge into two: 1) how do I transform HSDS from the Tabular Serialization to the JSON format and 2) Can I write a flexible tool that allows me to generate the Tabular Serialization from other people’s data?

I’m not quite sure I understand this, can you elaborate? In my previous work, the issue was that I was trying to produce a general-purpose application which could be useful for multiple organisations, but each of them had their own spreadsheets with different columns etc. This is the classic “we need a data standard” problem. However, I chose to abstract the standard away from them and produce an interface which effectively allowed them to write a mapping from their spreadsheet to the format I wanted it in. It had quite a few limitations as a piece of software, but I think the broader work around mapping templates in other data standards offers an alternative view of what mapping work actually is and where it sits in the publication process.

Hello and welcome @ananya_iyer. And thanks for being so thorough in describing your use cases. I’d like to describe two scenarios in the UK where your proposed work might be useful. I think you have use cases that, brought together, cover these.

Firstly, I should say that Open Referral UK (ORUK) is a “profile” of the full Open Referral HSDS/HSDA. The profiling methodology allows us to define restrictions, extensions and/or rules on top of HSDS. The ORUK profile just uses a subset of the full HSDS, so anything transformed to HSDS should fit ORUK.

Secondly, the UK (and arguably Open Referral in general now) is only concerned with data interchange via APIs, although people may want to put dumps of full datasets into their databases from HSDS-compliant JSON or from CSV files.

1. Converting proprietary ‘syndication’ feeds to the HSDA format

ORUK is promoted as an independent open standard designed to free dataset owners from vendor lock-in, so they can switch software suppliers and we can communicate between software from different suppliers.

One supplier has a ‘syndication’ tool which lets someone who owns its product share a feed of data with others who can choose to import service records from the syndicated feed. I would like to explore taking data from this proprietary format feed (used in many directories) and converting it to an HSDA feed (ideally in the ORUK API format but the general HSDA is fine).

If I can get an owner of this syndication product to open its feed for us, do you think you might examine it to see if it can be transformed to an HSDA feed? Let me know if you’d like to explore this. If so, I’d be happy to continue talking with you and reviewing your work.

2. One-off import of data from CSV

I believe some organisations start with simple CSV data which they want to put into an HSDS format, for import into a database that supports HSDS and provides a compliant API. A tool to do this might be useful if it can be configured to cross-reference the CSV column names against HSDS fields (entity properties).

Of course, CSV is not good at handling one-to-many relationships. FYI, my old company developed a transformation from an ORUK API feed to a Google spreadsheet. It created one tabbed page in the Google spreadsheet for each table/entity and used the VLOOKUP formula to link between sheets. @Dominic could probably answer any questions you have on that.
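To illustrate the one-to-many problem: once the child rows live in a second CSV (or tabbed sheet) keyed by a foreign key, nesting them back together is a straightforward group-by. This is a hedged stdlib sketch with made-up column names, not any particular tool’s implementation.

```python
import csv
import io
from collections import defaultdict

def nest_one_to_many(org_csv, phone_csv):
    """Group child rows (phones) under their parent (organizations) by
    foreign key - the nesting a single flat CSV cannot express directly."""
    phones_by_org = defaultdict(list)
    for row in csv.DictReader(io.StringIO(phone_csv)):
        org_id = row.pop("organization_id")
        phones_by_org[org_id].append(row)
    orgs = []
    for row in csv.DictReader(io.StringIO(org_csv)):
        row["phones"] = phones_by_org.get(row["id"], [])
        orgs.append(row)
    return orgs
```

The same grouping step generalises to any parent/child pair of tables linked by an id column.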

Keep in touch.

Hello Mike! I’m Sotiris, another PM alongside Ananya at Blueprint at Stevens Institute of Technology. Thank you so much for your reply and for the scenarios you mentioned.

On the syndication feeds, I’d be interested in exploring this further. If you’re able to share access to a feed, we would be extremely happy to examine it and work from it.

Best regards,
Sotiris


Hello @sotiris . I’ll make enquiries about getting access to such a feed.

Hi everyone!

Following up on recent discussions, here are the latest resources from our work on the HSDS Transformer (a large change from the last time we posted!). These include the main repository, a sample mapping spreadsheet, and our current mapping documentation, whose final page includes step-by-step instructions on how to use the mapping spreadsheet.

HSDS Transformer Repository:
https://github.com/stevensblueprint/hsds-transformer

Mapping Documentation:
https://docs.google.com/document/d/1km9xY7G_f-4mciHj-6hRhHE3tuk_mhMVLM9jGsrAZn4/edit?usp=drive_link

Sample Mapping Spreadsheet:
https://docs.google.com/spreadsheets/d/1pE8kLsQlLfoGgRzWLNGZdPUmspjFaLSp2S65b-ds2E4/edit?usp=drive_link

The transformer is currently designed as a command line interface (CLI) tool to support transformation from CSVs (flat files) into JSON files conforming to the HSDS specification. It supports the following cases for input data:

  • Different column headers/field names from HSDS (e.g. PhoneNumber in the input data maps to number in HSDS)

  • Multiple CSV files linked by IDs (i.e. the deprecated tabular HSDS format)

  • Multiple columns in input data that all correspond to one field in HSDS

  • Splitting cells up into multiple HSDS objects (i.e. the input file contains a Languages column which has {English,German,Spanish,etc})

  • Stripping extra characters/sets of characters from all cells in a column (i.e. removing {} from the previous example)

  • Multiple types of objects in one spreadsheet (and their subsequent linking) (i.e. storing (HSDS) Organizations and Services in one file)

The sample mapping at this moment includes some but not all of the objects and fields/paths from HSDS. The mapping is described in the mapping documentation. Any feedback on whether the mapping and process feels intuitive, and whether there are additional cases we should support is greatly appreciated!
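For readers skimming, the splitting and stripping cases above amount to something like this toy sketch (the `name` field and `{}` wrapper characters are assumptions for illustration, not the transformer’s actual code):

```python
def split_cell_to_objects(cell, field="name", strip_chars="{}"):
    """Strip wrapper characters from a delimited cell and emit one
    HSDS-style object per value (e.g. a Languages column)."""
    values = cell.strip(strip_chars).split(",")
    return [{field: v.strip()} for v in values if v.strip()]
```

So a cell like `{English,German,Spanish}` would become three separate language objects rather than one string.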

We also wanted to share that we have developed an HSDS Validator as an extension of the original project. This Validator was mainly developed as a way to test our own transformer (i.e. as a CLI tool meant to validate directories of JSON files of a single HSDS type) and to familiarize ourselves with the HSDS format/standard. We currently have a version published as a Python package on PyPI for ease of use. We hope it can be useful in any way and welcome any feedback or suggestions!

HSDS Validator Python Package:
hsds-validator · PyPI

HSDS Validator Repository:
https://github.com/stevensblueprint/hsds-validator

We are currently working on documentation for this validation tool and will be sure to post updates along the way, as well as updates on our work on the transformer.

Thank you very much!

The Blueprint Team


Hi @ananya_iyer,

This all looks very exciting! Really impressive to see what you’ve achieved here.

A few thoughts follow.

Transformer

This is really in-depth and impressive. I have some high-level exploratory questions about the use-cases of the Transformer. This is purely based on my reading of the README and the docs, so I might’ve missed something in this thread, and I definitely don’t have much knowledge of what source data generally looks like; I’m keen to learn about the motivations you had and the design choices you had to make.

My main question is: what’s my input expected to look like? This might be obvious to everyone else, but from the docs I can’t tell whether I’m supposed to have some non-standardised CSVs, or whether I’m supposed to be massaging my source data into another shape for the Transformer to then convert. How are people getting their CSVs for input? Are they database exports, or do they already have CSVs lying around?

This feeds into my larger question of: where does this tool sit in the ecosystem? My experience/knowledge of HSDS is very oriented towards the API folk, who are talking about exchanging HSDS JSON from endpoints. Since that is abstracted from implementation details, there isn’t a need for a transformer there, and so I don’t know enough about what problems a Transformer is solving to contextualise how I’m supposed to use it.

This isn’t a critique of the tool; I just don’t know enough about this side of the ecosystem to understand what problem it solves. I’m keen to hear where you see this fitting in, since I think we’d probably like to signpost it to people and we’ll need to give them enough context to decide whether this addresses their particular problem.

All of that can probably be addressed by adding a few lines to the README so that a newcomer to the Open Referral ecosystem can understand what role this plays.

Last thought on this: there’s a lot of manual back-and-forth of downloading various CSVs to use as mapping files etc. It might be that this is simply a reality of how the source data works, but if people are giving you CSVs, you can sometimes shave off a bunch of complexity by helping them produce a standardised tabular representation and then converting that to JSON.

We often recommend tools like flatten-tool for this. Its job is to support “round-tripping” of JSON data with flattened representations including CSV, XLSX, and ODS (the file format, not my company!). One interesting feature is that you can give it a JSON Schema and it can generate a series of “templates”. You can even tell it to use the human-readable names for the column titles.

So you could, for example, generate a template from the service schema and then people could use this to fill out their rows of data manually. Or you could let them put their data into a “src” tab and use some spreadsheet-fu to auto-populate the template.

Once you have the template populated with data, you can then just use flatten-tool to convert it to JSON. If you’ve generated the template from the service.json file, you’ll be expecting instances of service.json on the other side.

It’s a different approach and might not be appropriate given your inputs, but worth considering and discarding maybe. It might be that your tool could focus on getting the really-messy-non-standard files into a shape that flatten-tool understands (a folder of CSVs or a spreadsheet with headers generated from a chosen schema); and then you can let it handle the conversion from CSV.

Or I could be fundamentally misunderstanding the shape of the input data you’re working with!

Validator

Really good to see more things happening in the validator space for HSDS! The more of this, the better imho.

I’m aware that you’ve said you’re working on documentation for the validation tool, so forgive me for these questions which you’ll likely be covering there!

My questions are on scope of the validator functions and where you foresee it fitting into a workflow. Is this designed to support validation of local HSDS JSON files, or is it designed to be pointed at an API endpoint to retrieve and validate data?

I’ve had a quick poke around the source code for the validator, and I’ve probably missed something or misunderstood, but I think (emphasis on think) that:

  • The CLI takes arguments for a directory of HSDS JSON files and a directory of the HSDS Schemas
  • There’s also an option to spin up the validation library as an API with the same arguments available
  • You’ve modelled the HSDS Schemas in Python objects
  • When validating lots of files, you’re mapping each input file against an HSDS Schema based on the filepath of the input
  • You’re having to dereference schemas manually in your tool

Broadly, I think this is a decent way to validate local HSDS files but I have some questions (these are my attempts to learn, not criticisms!):

  • Why do you need to model the HSDS Schemas as Python Classes? The jsonschema library (which you’ve required) doesn’t really care about that, and you can use it to validate instance data against schemas. It seems you’ve done the hard work of matching models to schemas already; I don’t know why one would need class representations of the HSDS schemas unless they were doing something specific.
  • What’s the motivation to manually dereference schemas? In most cases I don’t think you need to manually dereference the schemas except as a cool learning activity (which is worthwhile by itself). The jsonschema library handles this for you and you can even define custom referencing behaviour if you’re having challenges with local files: JSON (Schema) Referencing - jsonschema 4.25.1 documentation

I’m also curious about how you found using the “compiled” HSDS schemas? These are the ones that the openapi definition is “wired up” to, and are already fully de-referenced. Notably, they’re only available for a few key schemas.

If you wanted to develop the tool further with the ability to validate external API endpoints: I have a working proof-of-concept validator, also written in Python. It’s GPL-3 licensed, so open for study, and copyleft (which has implications for any borrowing of code):

Hello Everyone!

Stevens Blueprint is excited to share another update on our progress with the HSDS Transformer. We’ve been focused on testing and documentation, along with progress on some newer features for which we would love your feedback.

Additionally, we’d like to say a huge thank you to Skyler, Sasha, and Greg for their continued support and feedback! It’s been incredibly helpful as we work through some of the project’s major pain points.

We’re now looking for additional testers. If you’re a potential user of the transformer and would be willing to try it out, we would greatly appreciate your help. We’ve attached an overview video and a link to the repository. If you’re able to test the tool, please let us know and share any feedback you may have. All of these resources and more can be found on the project document.

As for some updates, on a technical level:

  • Logging for each transformation run on a dataset

  • Optional generation of UUID-5s upon transformation
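On the UUID-5 point, for anyone unfamiliar: version-5 UUIDs are deterministic, which is what makes them attractive for re-runnable transformations. A hedged sketch follows; the namespace and name format here are illustrative assumptions, not necessarily what the transformer does.

```python
import uuid

# UUID-5 is deterministic: the same namespace + name always yields the
# same id, so re-running a transformation reproduces stable record ids.
NAMESPACE = uuid.NAMESPACE_URL  # assumed namespace, for illustration

def record_id(source_name, row_key):
    """Derive a stable id from the source file and a per-row key."""
    return str(uuid.uuid5(NAMESPACE, f"{source_name}/{row_key}"))

a = record_id("my_directory.csv", "row-42")
b = record_id("my_directory.csv", "row-42")
```

Here `a == b` across runs and machines, unlike random UUID-4s.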

Currently we are working on implementing a couple newer features, including:

  • Custom-defined Python functions, so that developers can implement particular data operations on datasets

  • Multiple source fields to one HSDS field

  • Addressing the service_at_location edge case
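One common shape for the custom Python operations feature is a registry of named transforms that a mapping can refer to by name. This is purely a hypothetical sketch of that pattern, not the transformer’s actual design:

```python
# Hypothetical plugin-style hook for user-defined data operations.
TRANSFORMS = {}

def transform(name):
    """Decorator that registers a named operation the pipeline can apply."""
    def register(fn):
        TRANSFORMS[name] = fn
        return fn
    return register

@transform("strip_braces")
def strip_braces(value):
    # Example user-defined operation: remove wrapper braces from a cell.
    return value.strip("{}")

def apply_transform(name, value):
    """Look up a registered operation by name and apply it to a value."""
    return TRANSFORMS[name](value)
```

A mapping spreadsheet could then reference `strip_braces` by name, keeping the mapping declarative while the operation itself lives in user code.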

And most excitingly:

  • A GUI for non-technical users to access and use the tool.

  • A reverse transformation tool, which takes in HSDS (or another schema) and transforms it into a given output schema.

  • Generating mapping templates from schemas, as a way to ease transitions between versions of HSDS.

With these new updates, we wanted to go ahead and ask a couple questions to the people who would use this tool the most:

  • Currently, we plan for the GUI to remove the need to work directly with Git, Python, etc. However, the mapping process would remain the same. Is that aligned with your use case?

  • What is the use case of a reverse transformation tool (HSDS → some schema)? Who would be the ideal user of this tool?

  • Is there a use case for having the transformer as part of a pipeline that accepts source data via an API rather than CSV files?

  • Currently we go from flat CSVs to nested JSON ⇒ is there a use in transforming from JSON to JSON?

  • What are aspects of transformations you would like to see recorded in a log after the transformation process?

  • What would you like to see in this tool that you do not see implemented/considered?

Thank you and we hope to hear from you all!