Technical Report on OpenLineage

Hi folks,

Another technical report! This time based on the investigations I’ve been doing around OpenLineage. This was kicked off a few months ago after the Technical Committee Meeting where @klambacher discussed her needs around understanding the entire history of a record as well as the ownership of that data.

Headlines:

  • OpenLineage is another JSON-based data standard. It provides models to describe the history of a record and the processes which have been applied to it, as well as things such as source systems.
  • At a technical level, HSDS and OpenLineage are compatible by virtue of both being standards defined in JSON Schema, so integration is possible; we just need to determine whether we want to, how we want to, and to what extent we would provide structures to link up these two models.
  • It may be that OpenLineage doesn’t cover all of the use-cases described by Kate in the committee. Another standard, W3C-PROV, does cover these use-cases, but it looks very heavyweight and is based on semantic web technologies. Further research is needed into this.
  • There’s nothing stopping HSDS implementers from outputting OpenLineage now if they wanted…
  • If we wanted to implement something which integrated OpenLineage, we might be blocked or hampered by needing to adapt the model to support the Frictionless Data Package serialization. Dropping support for Frictionless would be possible after a MAJOR upgrade, but we probably shouldn’t issue a MAJOR upgrade just for this so it might be something to look at in the future.
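
To make the point about outputting OpenLineage today a bit more concrete, here’s a rough sketch of what emitting a single lineage event alongside an HSDS import might look like. It uses only the Python standard library and builds a dict shaped like an OpenLineage run event. Treat the details as assumptions rather than a recommendation: the job name, namespaces, file name and producer URL are all made up for illustration, and the spec version in `schemaURL` is a placeholder that should be checked against the published OpenLineage spec.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, input_name, output_name):
    """Build a dict shaped like an OpenLineage run event (illustrative only)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical identifier for whatever tool produced this event.
        "producer": "https://example.org/hsds-publisher",
        # Placeholder spec version; verify against the published spec.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        # Namespaces and names below are invented for this example.
        "job": {"namespace": "example-hsds", "name": job_name},
        "inputs": [{"namespace": "example-source", "name": input_name}],
        "outputs": [{"namespace": "example-hsds", "name": output_name}],
    }


event = build_run_event("import_services", "partner_feed.csv", "service")
print(json.dumps(event, indent=2))
```

Nothing here touches the HSDS schema itself; the event just sits next to the data, which is why implementers could do this without waiting on any decision from us.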

On a side note: if the “Technical Report” format is useful or explicitly not-useful, I’m keen to hear feedback on this. I figured it was a nice way to share knowledge with a URL attached to it, and to allow comment threads on the reports while leaving higher-level discussion to forum posts. Let me know if you prefer your knowledge in another format.