HSDS Transformer Use Cases

Hi all,

My name is Ananya Iyer. I’m a member of Blueprint @ Stevens Institute of Technology in Hoboken, NJ, working with Open Referral on the Transformer project. I’ve been drafting some user stories for the Transformer to handle CSV/JSON to HSDS 3.0 conversions. Since we are relatively new to the project, I wanted to ask for your input, drawing on your experiences with the old transformer.

So far I have put together a small first-draft spreadsheet with 10 user stories, covering CSV and JSON transformation, batch processing, error handling, API inputs, and configurable options. I have also drafted a couple of Gherkin-style scenarios to see how these stories might look as acceptance tests. Before I expand the list further, I’d really appreciate your feedback:

  • Are the right user roles (publisher, commissioner, consumer, developer, data manager) represented?

  • Are there any edge cases that should be included but aren’t yet?

  • Should we emphasize batch processing, error handling, or API support more strongly?

  • Are there lessons from the Validator stories that should definitely be taken here?

To make this easier, I’ve also created a short feedback form.

Thanks so much for your time and guidance — I’m looking forward to learning from your expertise and making sure these stories are useful to everyone.


Hi Ananya, and welcome to the Open Referral forums!

I work with the HSDS Standard itself, so I can’t comment directly on the use cases from the perspective of someone wanting to produce the data; but these look fairly good from my understanding of the ecosystem.

I was chatting to @bloom about this and it tickled my memory in a few ways, so here are some things that you might find interesting in your quest to build a transformer:

First is from another Open Data initiative — Open Contracting and OCDS — with the OCDS ETL Pipeline tool. Also called OCDS Kavure’i (there is a tradition of naming OCDS tools after various birds…), it works by having the user develop some SQL queries, which are designed to produce data in OCDS’ CSV serialization format; this can then be transformed into OCDS-compliant JSON using flatten-tool and the OCDS Schema files. The tool then loads the result into an Elasticsearch database.

The tool was originally developed by the DNCP Team in Paraguay and meets their use-cases, but I think the model is interesting even if you can’t fork it or re-use any of the code. HSDS is materially different to OCDS in a few significant ways (which I’m happy to chat through on a call if you need), but crucially HSDS still supports the Tabular Data Package Format… which is a bunch of CSV files! So it’s feasible to think that one way to develop a transformer is to develop a pipeline similar to Kavure’i, where the input is .sql query files against the user’s database, resulting in a bunch of CSV files which technically constitute valid HSDS according to the Tabular Data Package Format.
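To make that pipeline shape concrete, here’s a rough sketch of the extract step under some assumptions: the user supplies one .sql file per HSDS table (e.g. organization.sql, service.sql) and a SQLAlchemy-compatible connection string. The file layout and function name are purely illustrative, not part of Kavure’i or any existing HSDS tool.

```python
import csv
from pathlib import Path

import sqlalchemy  # assumed choice of database layer for this sketch


def extract_to_csvs(db_url: str, sql_dir: Path, out_dir: Path) -> None:
    """Run each user-written query and write its result set as <name>.csv."""
    engine = sqlalchemy.create_engine(db_url)
    out_dir.mkdir(parents=True, exist_ok=True)
    with engine.connect() as conn:
        for sql_file in sorted(sql_dir.glob("*.sql")):
            result = conn.execute(sqlalchemy.text(sql_file.read_text()))
            fieldnames = list(result.keys())
            rows = [dict(mapping) for mapping in result.mappings()]
            with (out_dir / f"{sql_file.stem}.csv").open("w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(rows)


# e.g. extract_to_csvs("postgresql:///directory", Path("queries"), Path("hsds_csv"))
```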

From there, I think the job would be to try and convert these to the HSDS JSON format as a final stage; although the tabular serialization would technically still be valid HSDS, I think the JSON format is more broadly useful and it’s what people think of when they think of HSDS APIs.
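As a very rough illustration of that final stage (not the actual HSDS 3.0 schema, which has many more tables and required fields than shown here), assuming an organization.csv and a service.csv where service rows carry an organization_id column, the nesting step could look something like this:

```python
import csv
import json
from pathlib import Path


def csvs_to_hsds_json(csv_dir: Path) -> dict:
    """Nest service rows under their organizations; assumes referential integrity."""
    with (csv_dir / "organization.csv").open(newline="") as f:
        organizations = {row["id"]: {**row, "services": []} for row in csv.DictReader(f)}
    with (csv_dir / "service.csv").open(newline="") as f:
        for row in csv.DictReader(f):
            organizations[row["organization_id"]]["services"].append(row)
    return {"organization": list(organizations.values())}


# print(json.dumps(csvs_to_hsds_json(Path("hsds_csv")), indent=2))
```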

To be clear, I don’t think this would address all of the use-cases in your spreadsheet, but it might go some way to demonstrating how these types of tools have been built in practice in the past.

The second one is a lot jankier, and comes from my (now very outdated) PhD work trying to transform heterogeneous charity spreadsheet files into a common format. The method I used was to have charities export their files to CSV and then build an interface that supported them in “mapping” their headers to the shared model; the mapping could then be re-used for future uploads. While I wouldn’t strictly recommend this approach nowadays, I think there’s been great work in other standards producing mapping templates, and I think there’s potential in taking such artefacts and using them to automate data transforms.
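For what it’s worth, the core of that idea can be sketched in a few lines. The column names and the saved-mapping file below are invented for the example; the point is that the mapping is captured once and then re-applied to every future upload:

```python
import csv
import json
from pathlib import Path


def apply_mapping(csv_path: Path, mapping_path: Path) -> list[dict]:
    """Rename the uploader's columns to the shared model using a saved mapping."""
    mapping = json.loads(mapping_path.read_text())  # e.g. {"Org Name": "name", ...}
    with csv_path.open(newline="") as f:
        return [
            {mapping[col]: value for col, value in row.items() if col in mapping}
            for row in csv.DictReader(f)
        ]


# rows = apply_mapping(Path("charity_upload.csv"), Path("saved_mapping.json"))
```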

Hopefully that helps somewhat; I’m looking forward to this work regardless of which direction you go in.

Hi @ananya_iyer welcome. Thanks for getting this started. I’ve just left a bunch of comments, including a suggestion that we might want to build this table in the main Tools spreadsheet (this is not a huge deal) and also some suggestions that the next steps are probably to start telling the ‘stories’ about how the contents of the source data can/should be translated into the contents of the HSDS schema (this is probably a bigger deal).

Since the contents of resource databases are wildly variable (even though they’re mostly about the same concepts, just named and structured very differently) the most important initial questions are probably about how the user specifies what should be transformed into what.

I do think the stories you have here are helpful too, in a big-picture sense of what users expect in terms of inputs and outputs. But content is the devil we have to tussle with.

cc’ing @skyleryoung @sasha @CheetoBandito

Re @mrshll’s suggestion: yes, we should definitely understand how their transformer works, because there may be important patterns there.

That said, I’m not sure the tabular data format is helpful at this point, even though it’s technically still compatible. If it were, we could also just return to updating @shelby’s HSDS 2.0 transformer! Still, yes, it’s a helpful reference point.

Hi all, thank you so much for the helpful responses! @mrshll I will definitely take this information into consideration with these use cases. I’m interested in knowing more about the Tabular Data format and will research that in the following days. Further, I want to ask whether the heterogeneous files would be part of general use cases for Open Referral’s purposes, or if that is separate from the organization’s needs.

@bloom Thank you for the feedback; I will take a look and implement those changes soon. Hopefully I’ll get a lower-level view of the use cases from the feedback I’ve received and from further responses in the form.

Looking forward to working more with you all!

I’m interested in knowing more about the Tabular Data format and will research that in the following days

As @bloom suggested, I’d also advise that the Tabular Format is much less useful than the JSON format. IMHO, it might be useful as an intermediate format, so you can then focus on the challenge of transforming that into the HSDS JSON. That way you break the challenge in two: 1) how do I transform HSDS from the Tabular Serialization to the JSON format? and 2) can I write a flexible tool that allows me to generate the Tabular Serialization from other people’s data?
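Read as an interface, that split might look something like the stubs below (names are illustrative only): part 2 is the flexible, source-specific half, while part 1 can in principle be written once and re-used for every source.

```python
from pathlib import Path


def source_to_tabular(source: Path, mapping: Path, out_dir: Path) -> None:
    """(2) Generate HSDS Tabular Data Package CSVs from someone else's data."""
    ...


def tabular_to_json(csv_dir: Path, out_file: Path) -> None:
    """(1) Convert the HSDS tabular serialization into HSDS JSON."""
    ...


def transform(source: Path, mapping: Path, workdir: Path, out_file: Path) -> None:
    """Chain the two halves: source data -> tabular HSDS -> HSDS JSON."""
    source_to_tabular(source, mapping, workdir)
    tabular_to_json(workdir, out_file)
```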

I’m not quite sure I understand this; can you elaborate? In my previous work, the issue was that I was trying to produce a general-purpose application which could be useful for multiple organisations, but each of them had their own spreadsheets with different columns etc. This is the classic “we need a data standard” problem. However, I chose to abstract the Standard away from them and produce an interface which effectively allowed them to write a mapping from their spreadsheet to the format I wanted it in. It had quite a few limitations as a piece of software, but I think the broader work around mapping templates in other data standards offers an alternative view of what mapping work actually is and where it sits in the publication process.

Hello and welcome @ananya_iyer. And thanks for being so thorough in describing your use cases. I’d like to describe two scenarios in the UK where your proposed work might be useful. I think you have use cases that, brought together, cover these.

Firstly I should say that Open Referral UK (ORUK) is a “profile” of the full Open Referral HSDS/HSDA. The profiling methodology allows us to define restrictions, extensions and/or rules on top of HSDS. The ORUK profile just uses a subset of the full HSDS, so anything transformed to HSDS should fit ORUK.

Secondly, the UK (and arguably Open Referral in general now) is only concerned with data interchange via APIs, although people may want to load dumps of full datasets into their databases from HSDS-compliant JSON or from CSV files.

1. Converting proprietary ‘syndication’ feeds to the HSDA format

ORUK is promoted as an independent open standard designed to free dataset owners from vendor tie-in, so they can switch software suppliers and software from different suppliers can communicate with each other.

One supplier has a ‘syndication’ tool which lets someone who owns its product share a feed of data with others who can choose to import service records from the syndicated feed. I would like to explore taking data from this proprietary format feed (used in many directories) and converting it to an HSDA feed (ideally in the ORUK API format but the general HSDA is fine).

If I can get an owner of this syndication product to open its feed for us, do you think you might examine it to see if it can be transformed to an HSDA feed? Let me know if you’d like to explore this. If so, I’d be happy to continue talking with you and reviewing your work.

2. One-off import of data from CSV

I believe some organisations start with simple CSV data which they want to put into an HSDS format for import into a database that supports HSDS and provides a compliant API. A tool to do this might be useful if it can be configured to cross-reference the CSV column names against HSDS fields (entity properties).

Of course, CSV is not good for handling one-to-many relationships. FYI, my old company developed a transformation from an ORUK API feed to a Google spreadsheet. It created one tabbed page in the Google spreadsheet for each table/entity and used the VLOOKUP formula to link between sheets. @Dominic could probably answer any questions you have on that.
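One common workaround, sketched below with invented column names (not the actual HSDS field list), is a flat CSV where the parent columns repeat on every row and the importer groups rows by the parent’s id to recover the one-to-many structure:

```python
import csv
from pathlib import Path


def group_one_to_many(csv_path: Path) -> list[dict]:
    """Rebuild service -> phones nesting from a flat CSV with repeated service rows."""
    services: dict[str, dict] = {}
    with csv_path.open(newline="") as f:
        for row in csv.DictReader(f):
            service = services.setdefault(
                row["service_id"],
                {"id": row["service_id"], "name": row["service_name"], "phones": []},
            )
            if row.get("phone_number"):
                service["phones"].append({"number": row["phone_number"]})
    return list(services.values())
```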

Keep in touch.

Hello Mike! I’m Sotiris, another PM alongside Ananya at Blueprint at Stevens Institute of Technology. Thank you so much for your reply and for the scenarios you mentioned.

On the syndication feeds, I’d be interested in exploring this further. If you’re able to share access to a feed, we would be extremely happy to examine it and work from it.

Best regards,
Sotiris


Hello @sotiris. I’ll make enquiries about getting access to such a feed.