This is the first of a series of posts on “Collaborative Data Design”. Here we explain why a standard approach to data collaboration is desirable. Future posts will cover existing community initiatives, techniques and best practices, and community goals. Please treat this series as a conversation – you are invited to annotate and comment directly on this document, to email the mailing list, or to get in touch directly.
Progress on publishing data to open platforms in the past few years is an important step towards an open data landscape, but it is insufficient for sustained collaboration. Although already valuable, the common practice of sharing a database without its generating process, and without sufficient documentation to link it to other work, raises a number of critical issues.
It endangers dataset reusability. Datasets don’t become reusable within a community without careful consideration, from the outset, of how they interact with other entities (datasets, platforms, repositories, codebases).¹
It harms the cumulative nature of empirical research. Without access to the data generating process (usually the codebase), it is virtually impossible for new entrants to benefit from previous efforts and continuously improve upon them.
It contributes to the misallocation of resources within the research community. Empirical research is increasingly becoming a team sport (Jones), yet without community access to the data generating process, vast amounts of resources and ideas available in the rest of the community remain unexploited.
We advocate a pragmatic shift in how data is considered: from a snapshot of research (tied to a publication) that is shared at a point in time, to a living resource that is maintained and developed as a communal good. The community already benefits from such initiatives when they are steered by larger organisations. What we are proposing is that all shared datasets should be treated as a potential contribution, and as such be shared with re-use and continuous improvement in mind. We call this approach “collaborative data design”.
Numerous principles, tools and policies already exist for effective data collaboration. Their adoption, however, requires widespread buy-in to be truly effective. In this series of posts, we will outline existing community initiatives that contribute to this shared effort, share techniques and best practices, and describe goals for advancing collaboration around data. In particular, we seek to highlight small, low-cost actions used in other disciplines that could have a transformative effect on the Innovation Data landscape.
We believe that a practical and collaborative approach to data sharing will benefit the empirical research community as a whole, in particular researchers whose work focuses on developing and improving datasets.