Through hands-on hacking, perform a detailed, practical investigation of Bacalhau (https://www.bacalhau.org/) - a software solution for distributed compute over data - aiming to achieve technical goals and report on outcomes as detailed below.
At a minimum, this will involve running existing open-source computational workflows and tools for Cohort Discovery developed at the University of Nottingham (https://github.com/health-informatics-uon/rquest-tools, https://github.com/health-informatics-uon/beacon-tools).
Areas of investigation include:
- Achieving successful usage with existing workflows for known HDR UK federated use cases; at a minimum the above Cohort Discovery workflows.
- Comparison with existing tools in the landscape.
- At a minimum, tools that already run the above workflows:
- Workflow Execution Service (WfExS), an Elixir tool
- TESK, a Kubernetes implementation of GA4GH Task Execution Service (TES)
- Considering the above in the wider context of the DARE UK TRE-FX stack as well
- Suitability assessment
- for use in emerging Federated Analytics ecosystems such as TRE-FX, HDR UK Federated Analytics
- for use in Trusted Research Environments (TREs) (e.g. in the context of the DARE UK Blueprint's Research Analysis and Query Management Zones)
- Gap analysis of all of the above
- What additional effort if any is required to run existing workflows?
- What shortfalls are there when compared with other tooling?
- What gaps are there for Federated Analytics use, particularly in TREs?
- In what areas are these gaps?
- Governance compliance
- Technical capabilities
- Standards compliance
- FAIR data, e.g. provenance facilitation
- Interoperability
- What effort might be required to bridge these gaps?
- What advantages are there over existing tooling?
- Could this investigation inform future development work on other tools to improve their suitability for future use?
Outcomes of the investigation could inform discussions within projects associated with HDR UK and Elixir UK, such as the HDR UK Federated Analytics programme and the EOSC ENTRUST project. Concrete information on Federated Analytics capabilities and requirements is extremely valuable to such projects as they strive to deliver value and impact while respecting the nuances of a heavily regulated governance environment, so this investigation would help shape the future of that work.
More concretely, on the technical side, this project could lead to a recommendation to use Bacalhau where suitable, or otherwise directly inform development work on other HDR UK- or Elixir-associated tools such as Hutch or WfExS.
The Cohort Discovery workflows will query an existing synthetic patient dataset in the OMOP Common Data Model.
Project participants should be comfortable programming in C# or Python and working with Common Workflow Language (CWL), Linux, and Docker.
Jonathan Couldridge, Vasiliki Panagi