A perennial challenge is streamlining data sharing and finding the best repository to host an open dataset.
Some things we’re working on:
A set of community data-sharing guidelines and best practices
A catalog of datasets used across the community
Experimenting with office hours and other resources to work through related issues
Please find below an initial set of questions to ask yourself when sharing data, and a top-line set of core guidelines. These will evolve over time, and your feedback and troubleshooting are invaluable.
What kind of data do you want to share? (file size? compression? updatable?)
Where is the best place to put the data? (For how long? Will users need to register? What usage statistics do you need? What affiliation is required? Will you be maintaining the data and responding to queries?)
Description of the project/dataset:
What was the project about, why are you sharing this data, and who should use it?
Who are the individuals involved (if more than just yourself), and what are their contact details?
Any plans/schedule for updating the data, and how users can submit issues/requests (GitHub has this built in)
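The project-level details above can also travel with the data as a small machine-readable record. A minimal sketch in Python (the field names, contact, and URL are illustrative placeholders, not a required standard):

```python
import json

# Illustrative project-level metadata record; the field names here are
# an assumption for the sketch, not a community standard.
metadata = {
    "title": "Example Dataset",  # hypothetical dataset name
    "description": "What the project was about and who should use the data.",
    "contacts": [
        {"name": "Jane Doe", "email": "jane@example.org"}  # placeholder contact
    ],
    "update_schedule": "quarterly",
    "issues_url": "https://github.com/example/dataset/issues",  # placeholder URL
}

# Serialize next to the data files so both tools and users can read it.
record = json.dumps(metadata, indent=2)
print(record)
```

Keeping this record in the repository alongside the data means the description, contacts, and update schedule are versioned together with the files they describe.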
Describe the data:
Data files and datasets (size, which kinds of analysis they support, etc.)
Data schemas and/or field descriptions
How the data is updated (if relevant)
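One lightweight way to publish schemas and field descriptions is a data dictionary that ships with the files, which can also be used to flag undocumented columns. A hedged sketch, with column names and types invented for illustration:

```python
# Hypothetical data dictionary: one entry per field in the dataset.
data_dictionary = [
    {"field": "site_id", "type": "string", "description": "Unique site identifier"},
    {"field": "date",    "type": "date",   "description": "Observation date (YYYY-MM-DD)"},
    {"field": "temp_c",  "type": "float",  "description": "Temperature in degrees Celsius"},
]

def check_row(row, dictionary):
    """Return any columns in a data row that the dictionary does not describe."""
    documented = {entry["field"] for entry in dictionary}
    return sorted(set(row) - documented)

# Example row containing a column the dictionary forgot to describe.
row = {"site_id": "A17", "date": "2024-01-15", "temp_c": "3.2", "humidity": "0.8"}
undocumented = check_row(row, data_dictionary)
print("Undocumented fields:", undocumented)  # → ['humidity']
```

Running a check like this before each release helps keep the field descriptions in sync with the files as the data evolves.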
Describe the code:
Document the source code and process used to create the dataset
How to use and build on the data
Tuneable parameters in key steps, explicit nods to replication, etc.
URLs / links to other datasets, software, or code used
URLs / links to related papers
How should it be cited (‘If you use the data, please cite…’)
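The "tuneable parameters" point above can be made concrete by surfacing every parameter, including the random seed, at the top of the build script, so a reader can replicate the released dataset or vary the run. A minimal sketch, with the function and parameter names invented for illustration:

```python
import argparse
import random

def build_dataset(n_samples: int, noise: float, seed: int) -> list:
    """Hypothetical dataset-build step: every tuneable parameter is an
    explicit argument, so the exact release can be reproduced."""
    rng = random.Random(seed)  # fixed seed => reproducible output
    return [rng.gauss(0.0, noise) for _ in range(n_samples)]

def parse_args(argv=None):
    # The defaults double as documentation of the values used for the release.
    p = argparse.ArgumentParser(description="Rebuild the example dataset.")
    p.add_argument("--n-samples", type=int, default=100)
    p.add_argument("--noise", type=float, default=0.5)
    p.add_argument("--seed", type=int, default=42)
    return p.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    data = build_dataset(args.n_samples, args.noise, args.seed)
    print(f"built {len(data)} samples")
```

Documenting the defaults this way is itself the "explicit nod to replication": anyone rerunning the script with the released parameters should get the released data.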
Here is a link to submit reference datasets or papers you feel are useful for working with your datasets: _________________.
We look forward to discussing these guidelines, and any issues you have encountered as you prepare to share datasets, in our upcoming virtual office hours.