Open WASH data by establishing Data Stewards and increasing FAIRness
A proposal submitted to the Open Research Data Program of the ETH Board
This is a copy of the proposal as it was submitted to the 2nd call for Explore projects of the Open Research Data Program of the ETH Board on 29th February 2024. The original version is stored in this GitHub repository.
Abstract
We established the openwashdata community for the Water, Sanitation, and Hygiene (WASH) sector. We built infrastructure and communication channels, taught 100 WASH professionals the basics of data science, developed a workflow to publish WASH data following FAIR data principles, and mobilized those in the sector who were interested in joining our vision and mission. Our next step is establishing a data stewardship network, actively working with strategic partners in Malawi and South Africa by placing a fully-funded data steward within a research institute and a non-governmental organization. A newly developed 12-module “data stewardship for openwashdata” training programme will develop data management strategies and help our partners to institutionalize ORD practices long-term within their organizations. We will also further invest in the openwashdata publishing arm of the community by increasing the FAIRness of our data, critically analyzing how to better address the details of all four components of FAIR: Findability, Accessibility, Interoperability, and Reusability. We will also set up a governance structure and sounding board to ensure the long-term sustainability of the community. Through our activities and open communication channels, we expect to create demand for data stewardship in the WASH sector, assess the role of data stewards, and define a more general profile for them.
Keywords
data stewardship, FAIR principles, network, collaboration, competencies
Proposal full title
Open WASH data by establishing Data Stewards and increasing FAIRness
Acronym: openwashdata
Background and motivation
openwashdata community
The openwashdata community has found its home. Thanks to the funding we received from the first call for Explore projects by the ETH Board, we have successfully fostered a community of individuals who share a keen interest in open data and open code within the broader Water, Sanitation, and Hygiene (WASH) sector.
Organogram
This proposal is structured along the figure below, which serves as an overall organogram of the different openwashdata elements.
ORD Phase 1
We have established a website for the openwashdata community that attracts 10 unique visitors daily, we disseminate a monthly newsletter that reaches 200 people, and we maintain a community chat with 100 active members.
We also established the openwashdata academy with our 10-week “data science for openwashdata” training programme, which has produced 40 graduates.
Lastly, we started “openwashdata publishing”: a clear and efficient data publishing workflow for datasets and code that follows FAIR principles and the highest standards for computational reproducibility and version control.
We have successfully achieved the objectives of all our work packages in phase 1. As we transition into the next phase, our aim is to lay the groundwork for long-term sustainability of ORD practices in the sector.
ORD Phase 2
The core of this proposal, as the openwashdata community transitions into phase 2, is to extend the established openwashdata academy. Our first course has had the most significant impact on community development, and increasing competencies for ORD practices remains crucial for this community. We will also further invest in the openwashdata publishing arm of the community, as our data packages still have significant potential to improve on FAIR data practices. We will also set up a governance structure and sounding board to ensure the long-term sustainability of the community.
openwashdata academy
Along with the 10-week “data science for openwashdata” training programme, we will develop a 12-module “data stewardship for openwashdata” training programme, offered to a newly established data stewardship network.
data science for openwashdata
With an impressive 222 sign-ups, we launched a 10-week data science program that has made a significant contribution to the growth of our community. Each module, lasting 2.5 hours, devoted roughly one-third of the time to break-out sessions where participants, in groups of two or three, collaboratively tackled coding assignments. This approach has fostered a network of individuals who are not only connected, but also share a unified understanding of the data science workflow we advocate for within the community. We’ve noticed that participants began collaborating on homework assignments and sharing contact details to stay connected beyond the course. All graduates have completed and published a full data science report on GitHub Pages, along with the corresponding data in a GitHub repository.
We intend to invite the graduates to a newly established data stewardship network and continue our engagement by offering additional training in the openwashdata data publishing workflow.
data stewardship network
We envision a more hands-on approach by embedding a data steward within each of two partner organizations. This steward will collaborate with the organization to identify existing ORD practices, formulate a preliminary data management strategy, pinpoint datasets ready for open publication, and identify datasets requiring more careful scrutiny before publication due to ethical approvals, data privacy, or industry partnerships. Our openwashdata team will provide the steward with comprehensive guidance and technical support. These partnerships will bolster the reputation of the Global Health Engineering group at ETH Zurich as a pioneer in open data and open code within the WASH sector. Together with our partners, we aim to introduce and eventually solidify ORD practices (e.g. Data Management Strategy & Data Management Plans) in the WASH sector. Already established partnerships from Phase 1 of the project will continue to be a place for collaboration and exchange within the ETH Domain (https://rseed.ch/) and beyond (https://washweb.org/).
In addition to the two core data stewards, we will identify graduates of our first “data science for openwashdata” course to become part of the data stewardship network, limiting it to six individuals in total in the first year.
UKZN WASH R&D Centre
Established in the 1970s, the Water, Sanitation & Hygiene Research & Development Centre (WASH R&D Centre) at the University of KwaZulu-Natal in Durban, South Africa, is a transdisciplinary research and development hub. Formerly known as the Pollution Research Group, the Centre has become a leader in WASH R&D, providing a platform for testing innovative sanitation solutions for both local and international service providers and technology developers. Both Prof. Elizabeth Tilley and Open Science Specialist Lars Schöbitz spent time at the Centre as researchers between 2011 and 2015, enabling a long-lasting relationship built on trust and mutuality.
One of the Centre’s key initiatives from 2017 to 2023 was the Sanitation Engineering Field-Testing (EFT) Platform. This platform was designed to expose sanitation technologies to real-world conditions early in their development. It achieved this by testing prototypes in informal settlements and rural households in Durban, South Africa. From 2017 to 2020, nearly 20 prototype technologies were tested on this platform.
The platform is a result of a multidisciplinary partnership involving university partners, the local consultancy firm Khanyisa Projects, and the local eThekwini Municipality’s Water and Sanitation Unit (EWS). On a national level, the work was supported by the Water Research Commission (WRC) and financially backed by the Bill & Melinda Gates Foundation (BMGF). Drawing from the experiences of prototyping various sanitation technologies, the South African Sanitation Technology Enterprise Programme (SASTEP) was established. The programme was created by the Water Research Commission (WRC) in partnership with the Department of Science and Innovation (DSI), and the Bill & Melinda Gates Foundation (BMGF), with support from the Department of Water and Sanitation (DWS). The WASH R&D Centre serves as the research partner for the implementation of the programme, utilizing the systems established under the EFT platform.
Over the years, the WASH R&D Centre has amassed a significant amount of data on WASH-related topics, including data from the EFT Platform and the broader research community. This data is currently stored in various locations, such as the university’s network drives, the university’s data repository, the technology developers’ systems, and the personal computers of researchers, without a holistic data management strategy. While open access journal publications from the research are available, the data is not openly published and is not easily accessible to the research community.
BASEflow Malawi
BASEflow, a Malawian social enterprise established in 2017, is dedicated to enhancing water security in the country. Operating from Blantyre, this non-profit organization has a committed team of 15 individuals. They have demonstrated leadership in data collection and sharing within the Water, Sanitation, and Hygiene (WASH) sector in Malawi, offering support to research institutions and government agencies in both data collection and management. During the initial phase of openwashdata, BASEflow published a dataset using the openwashdata workflow and has expressed interest in continuing the collaboration to develop a comprehensive data management strategy. This strategy aims to ensure that all shareable data is openly published. BASEflow’s commitment to data transparency is exceptional in the WASH sector, where non-academic organizations typically have fewer incentives to share their data. Nevertheless, the data they provide is invaluable and often in high demand by government entities and researchers. By adopting this open approach, BASEflow is poised to set a new standard for the sector and gain recognition as a leader in promoting data openness.
data stewardship for openwashdata
This training will focus more holistically on data management and stewardship, including data management plans, data privacy, data ethics, and data publication. It will cover aspects of file organisation, file types, and different pathways for collaboration and task management.
We will host modules to identify current file and data management practices within each organisation and support the definition of a strategy for (research) data management. In contrast to the “data science for openwashdata” training, we will not prescribe one way of doing things. Instead, we will help determine the best practices for each group, allowing individuals to keep their autonomy and use the workflows and tools that suit them best, while still encouraging best practices for ORD.
The programme will consist of a monthly offering of 12 modules, each of which is composed of a 1-hour teaching module, a 1-hour hands-on workshop, and a 1-hour discussion group exchange.
Governance
We recognize the necessity of a well-defined governance process and a feedback mechanism for the project’s long-term success. We propose an advisory board composed of six members to provide feedback on the project’s direction. The governance process will outline decision-making within the community and contribute to the establishment of a sustainable funding strategy for the openwashdata community.
openwashdata publishing
Increase FAIRness
The initial phase of our project successfully established a data publishing workflow that adheres to FAIR data practices and the highest standards of rigour, transparency, and reproducibility [1]. All work is under git version control, publicly available on GitHub, and archived on Zenodo. This workflow includes the submitted raw data, the code required to process and prepare the data for analysis, and the code needed to generate figures and tables for articles on websites created to publicise and curate the data. We publish the website using the R data package ecosystem, which provides templates for detailed documentation and metadata provision for each dataset. Assigning a DOI through Zenodo and identifying all authors by their ORCID iD ensures the data is findable and reusable.
In the next phase, we aim to enhance the findability of our data packages. We plan to identify a list of existing metadata standards and schemas suitable for further publication, and develop the necessary code to transform the data package into these standards (e.g., JSON Table Schema of Frictionless Standards, Dublin Core Metadata Element Set).
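A minimal sketch of what such a transformation could look like is given below. It assumes that each data package ships a data dictionary with one row per variable, and it maps that dictionary onto a Frictionless Table Schema descriptor and the package-level metadata onto Dublin Core element names. The function names, the dictionary keys (`variable_name`, `variable_type`, `description`), the R-to-schema type mapping, and all example values are hypothetical and only serve as illustration.

```python
import json

# Assumed mapping from R column classes to Table Schema field types.
R_TO_TABLE_SCHEMA = {
    "character": "string",
    "factor": "string",
    "numeric": "number",
    "double": "number",
    "integer": "integer",
    "logical": "boolean",
    "Date": "date",
}


def to_table_schema(dictionary_rows):
    """Turn data dictionary rows into a Table Schema descriptor."""
    return {
        "fields": [
            {
                "name": row["variable_name"],
                # Fall back to "any" for classes the mapping does not cover.
                "type": R_TO_TABLE_SCHEMA.get(row["variable_type"], "any"),
                "description": row["description"],
            }
            for row in dictionary_rows
        ]
    }


def to_dublin_core(package_meta):
    """Map package-level metadata onto Dublin Core element names."""
    return {
        "title": package_meta["title"],
        "creator": package_meta["authors"],
        "identifier": package_meta["doi"],
        "rights": package_meta["license"],
        "type": "Dataset",
        "format": "text/csv",
    }


# Hypothetical example input; not taken from a published dataset.
example_dictionary = [
    {"variable_name": "site_id", "variable_type": "character",
     "description": "Identifier of the sampling site"},
    {"variable_name": "ph", "variable_type": "numeric",
     "description": "Measured pH value"},
    {"variable_name": "sampled_on", "variable_type": "Date",
     "description": "Sampling date"},
]
example_meta = {
    "title": "Example WASH dataset",
    "authors": ["Example Author"],
    "doi": "10.0000/example",
    "license": "CC-BY-4.0",
}

schema = to_table_schema(example_dictionary)
dublin_core = to_dublin_core(example_meta)
print(json.dumps(schema, indent=2))
print(json.dumps(dublin_core, indent=2))
```

Keeping the type mapping in a single lookup table means that unsupported column classes degrade to the generic `any` type instead of failing, which seems a reasonable default for a first prototype.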
While our data is easily accessible to the R community and we continue to provide CSV and XLSX formats, we recognize that a significant portion of the scientific community uses Python. Therefore, we will explore ways to make our datasets more interoperable and accessible to the Python community by prototyping Python data packages for the pip and conda package managers.
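One interoperability detail such a prototype would need to handle is that CSV files written from R encode missing values as `NA` and logical values as `TRUE`/`FALSE`. The sketch below shows, under that assumption, how a loader inside a hypothetical Python data package could translate these tokens into native Python values using only the standard library. The function names, the column names, and the sample rows are invented for illustration.

```python
import csv
import io


def parse_value(text):
    """Convert R-style CSV tokens into Python values."""
    if text == "NA":
        return None
    if text == "TRUE":
        return True
    if text == "FALSE":
        return False
    # Try integers first, then floats, and keep the text otherwise.
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text


def load_dataset(source):
    """Read a CSV file object into a list of dictionaries with typed values."""
    reader = csv.DictReader(source)
    return [
        {key: parse_value(value) for key, value in row.items()}
        for row in reader
    ]


# Hypothetical sample standing in for a CSV file shipped with a package.
SAMPLE_CSV = """site_id,functional,depth_m
site_a,TRUE,12.5
site_b,FALSE,NA
"""

rows = load_dataset(io.StringIO(SAMPLE_CSV))
print(rows)
```

A real package would likely take column types from the published metadata rather than guessing them per value, since a legitimate text value of `NA` would otherwise be read as missing.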
Currently, our R data packages are installed from GitHub using the remotes or devtools R packages. Submitting every dataset individually to the Comprehensive R Archive Network (CRAN) is currently too time-intensive to justify. Instead, we plan to explore the development of a single R package and its submission to CRAN, which would provide a way to explore all published datasets and install them through this package. Publishing an R package on CRAN will significantly enhance the findability of our data and further streamline the process of exploring and reusing the data.
Lastly, we will use existing tools to assess the FAIRness of our datasets and use the results to guide further improvements.
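To illustrate the kind of result such an assessment produces, the sketch below runs a deliberately simplified checklist with one check per FAIR component against a metadata record. It is not one of the existing assessment tools, which evaluate many more indicators. The checks, the metadata keys, and the example record are all hypothetical.

```python
# Hypothetical, simplified checks: one per FAIR component.
CHECKS = {
    "findable": lambda m: bool(m.get("doi")),
    "accessible": lambda m: str(m.get("download_url", "")).startswith("https://"),
    "interoperable": lambda m: bool(m.get("table_schema")),
    "reusable": lambda m: bool(m.get("license")),
}


def assess(metadata):
    """Return the per-component results and the share of checks passed."""
    passed = {name: check(metadata) for name, check in CHECKS.items()}
    score = sum(passed.values()) / len(passed)
    return passed, score


# Hypothetical metadata record without a machine-readable schema.
example_metadata = {
    "doi": "10.0000/example",
    "download_url": "https://example.org/data.csv",
    "license": "CC-BY-4.0",
}

passed, score = assess(example_metadata)
print(passed)
print(score)
```

In this example the record passes three of the four checks, and the failed `interoperable` check points directly at the missing schema as the next improvement.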
Journal for openwashdata
Under our publishing arm, we will start to define the process of establishing our own journal for openwashdata. The purpose of this journal is not to highlight scientific results, but to provide a platform for data publishing. Our initial vision is to host this journal entirely through GitHub, using Quarto documents as a submission format, from which we can generate an attractive online journal publication. The first issue of this journal will include the capstone project reports of graduates from our “data science for openwashdata” course. Further, we will use the R data packages created through our data publishing workflow as a submission format.
ORD project plan
WP1: Data stewardship network
Goal
The goal of WP1 is to establish and strengthen strategic partnerships with key institutions and organisations in the WASH sector to promote ORD practices and enhance the impact of the openwashdata community.
Activities
- Activity 1.1: Identify and engage potential partner organizations for collaboration.
- Activity 1.2: Develop and implement a data stewardship training programme with the network.
Research Questions
- RQ 1.1: What are the key factors for successful strategic partnerships in promoting ORD practices?
- RQ 1.2: How does the integration of data stewards within organizations affect the adoption of ORD practices?
Aims addressed
This WP addresses the aim of building communities to engage in ORD practices by fostering collaborations that emphasise exchanges in the spirit of Open Science with other researchers, technical experts, and stakeholders. It further addresses specific knowledge gaps around ethical approval in different countries, for different organisations (NGO & University) and for different types of data (qualitative, quantitative, observational, experimental, etc.) [2]. Our strategy is to establish a cohort of data stewards organised through agile project management practices (bi-weekly sprint meetings, quarterly strategy meetings, etc.). The technical infrastructure already exists from Phase 1, and we have experience in asynchronous communication, global collaboration, and working with the open source community. This strategy will enable us to quickly respond to the needs of the community and develop a strong network of data stewards that can support each other and the community.
WP2: Governance
Goal
The goal of WP2 is to establish a robust governance framework that ensures the sustainability and integrity of the openwashdata community and its initiatives.
Activities
- Activity 2.1: Develop a governance structure for a community organization and decision-making processes.
- Activity 2.2: Form a sounding board comprising community members to provide directional feedback.
- Activity 2.3: Create a long-term funding strategy for the openwashdata community.
Research Questions
- RQ 2.1: What governance structures best support the sustainability of ORD communities?
- RQ 2.2: How does community feedback influence the direction and effectiveness of ORD initiatives?
- RQ 2.3: What are the key elements of a successful long-term funding strategy for ORD initiatives?
Aims addressed
This WP contributes to the aim of building communities by establishing a governance process that involves the community in decision-making and strategic planning.
WP3: Community building
Goal
The goal of WP3 is to expand and strengthen the openwashdata community through targeted engagement and capacity-building activities.
Activities
- Activity 3.1: Offer advanced data science training and workshops to community members.
- Activity 3.2: Develop a mentorship program to support new members in adopting ORD practices.
- Activity 3.3: Organize community events to foster networking and collaboration.
Research Questions
- RQ 3.1: What are the most effective methods for building and maintaining an active ORD community?
- RQ 3.2: How does mentorship contribute to the adoption of ORD practices among community members?
- RQ 3.3: What role do community events play in promoting collaboration and knowledge exchange?
Aims addressed
This WP aligns with the aim of building communities to engage in ORD practices by providing training, mentorship, and networking opportunities that support community growth and engagement. Our strategy is to further engage with the cohort of students and graduates of our first ORD training course (data science for openwashdata). The graduates will become advocates for the already established openwashdata data publishing workflow and become mentors in onboarding new community members.
WP4: Increase FAIRness
Goal
The goal of WP4 is to enhance the Findability, Accessibility, Interoperability, and Reusability (FAIR) of data produced by the openwashdata community [1].
Activities
- Activity 4.1: Develop and submit an R package to CRAN for easier data exploration and installation.
- Activity 4.2: Assess and improve the FAIRness of datasets using existing tools.
- Activity 4.3: Prototype Python data packages for increased interoperability with the Python community.
Research Questions
- RQ 4.1: How does the availability of an R package on CRAN affect the findability and reuse of datasets?
- RQ 4.2: What improvements in FAIRness metrics can be achieved through targeted interventions?
- RQ 4.3: What are the challenges and benefits of creating cross-platform data packages for R and Python users?
Aims addressed
This WP addresses the aims of specifying ORD standards and prototyping ORD tools by creating accessible data packages and improving the FAIRness of datasets, thus facilitating and fostering sharing research data based on ORD principles.
Impact
How sustainable is the proposed project inside the ETH Domain?
The first phase funding from the ETH ORD Explore programme has not only allowed us to mobilise a growing and engaged audience, but also to build new connections and collaborations within the ETH Domain itself (e.g. the Research Software Engineering and Economic Data (RSEED) division at KOF Swiss Economic Institute). Concurrent with our funded work, we have started an openwashdata internship programme targeted at graduates of the Data Science and Computer Science programmes at ETH Zurich. This programme enables us to engage with the next generation of data stewards and advocates for ORD practices at ETH, while benefiting from the excellent competencies of the graduates. We are actively promoting the openwashdata R data package workflow within the ETH Domain to further advance its application and increase the number of people applying these principles in domains beyond WASH. Lastly, we are in the process of establishing a governance structure that will ensure appropriate decision-making and the financial sustainability of the openwashdata community and its initiatives. In general, we are building our network, expanding the competencies of our partners, and raising the profile of ORD practices within ETH to ensure the long-term, decentralised sustainability of this work.
To what extent will the planned project substantially advance the ORD field, or solve a critical outstanding problem in the targeted field(s)?
This project will contribute to filling knowledge gaps in three areas: data sharing for research that previously received ethical approval under conditions where data sharing wasn’t specified; data generated at research institutions working with private sector partners; and data management for different types of data (observational vs. laboratory research) and organisations (research institute vs. NGO). The workflows and tools developed will advance the overall ORD field and contribute to the progress of data management strategies. The WASH sector, as a targeted field, is particularly in need of ORD practices and dedicated data stewards, as it is often data-rich but information-poor, a condition related to the concept of the Data-Information-Knowledge-Wisdom pyramid (https://en.wikipedia.org/wiki/DIKW_pyramid). Our goal is to give open data the same recognition and prestige as academic papers, and in doing so, enable a new way of measuring impact in the WASH sector and beyond. Once people (especially researchers who are junior, based outside of academia, and/or in the Global South) notice they are being recognized and valued for sharing their data, rather than criticised, we expect a substantial cultural shift in how work in the WASH sector is measured and valued, which will in turn lead to more data-driven decision-making, more transparency, and more focus on actually addressing the global disparities in water and sanitation.
To what extent does the proposal support a collaborative approach, involving significant synergies, complementarities, or/and scientific added-value to achieve its objectives?
The proposal strongly supports a collaborative approach. Our strategic partners have been carefully chosen because of their motivation and the synergies they offer. We have the day-to-day experience of a data steward employed in our research group and will now be able to guide other organisations in establishing this position within their own work culture. The skills of our partners are highly complementary, as they hold significant quantities of valuable data and we can provide the support in getting it published as ORD resources. This two-way support system will result in significant scientific added-value as we increase the overall quantity and quality of the data available in the WASH sector, making it re-usable for other researchers.
Work Packages and milestones
Table 1 below shows a basic Gantt chart for the four work packages, including programme activities. Column “Lead” abbreviations: (LS) Lars Schöbitz, (DS1) Data Steward 1, (DS2) Data Steward 2, (SA) Scientific Assistant.
Any publications derived from this program will be published as open access material, following ORD practices and Open Science standards for computational reproducibility and sharing of data and code under FAIR principles.
Table 1: https://docs.google.com/spreadsheets/d/1KbxWuLBKi2mEWWeE935yR5bqoZskq6Joe-6-PviqXsY/edit#gid=0
Resources (including project costs)
Table Y: https://docs.google.com/spreadsheets/d/1jl2yIyX7-xsnawdO26IlJY5hcOhD0Bzg9j7yt3p47uc/edit?usp=drive_web&ouid=106935258332658829405
Budget justification and description of co-financing
Table Z: https://docs.google.com/spreadsheets/d/1DHb4j8t1yhV1AyFqCY0ZkpAUwDQT2JArBAeXgiWUmNM/edit#gid=0
Bibliography
Citation
@online{schöbitz,
author = {Schöbitz, Lars and Tilley, Elizabeth},
title = {Open {WASH} Data by Establishing {Data} {Stewards} and
Increasing {FAIRness}},
url = {https://openwashdata.org/pages/gallery/proposal-02/},
langid = {en}
}