
Data checklists + considerations

Published on Feb 20, 2020

1) Compile a checklist that feels sufficient for public datasets
2) Compile a checklist that captures the needs of incomplete/embargoed data
3) Draft a data-sharing use case for each of: the original NBER dataset; Reliance on Science; Lens Lab data; ScalingScience metrics; IProduct; Bronwyn’s data page; other presenters at the Dec. meeting
4) Classify / prioritize guidelines: what should we require; what is good style?

0. Universal form

  • What sort of data are you gathering? (short text)

  • URL / other link to data [link to DV in case they want to upload first] [optional]

  • URL / other link to related paper [optional]

  • Contact email

  • What datasets are you currently working with?
    Who else should we reach out to working on similar data?
    What would you like feedback on from the community?

1. Sharing a public dataset

as an example, see

  • What data does this capture? (high level: ‘citations from worldwide patents to scientific articles’)

  • Describe schemas + their components (‘We link A,B,C; each linkage has D, E’ ‘full details in __’ ‘we redistribute [existing datasets] F, G’)

  • Describe files + datasets (size, relevance to what sort of analysis)

  • How to use + build on this data? (tunable parameters in key steps, explicit nods to replication)

  • How should it be cited? (‘If you use the data, please cite…’)

  • How is it updated?

  • License details (code, other materials)

  • Where to find source code

  • Where to send queries/requests/suggestions (contact email)

  • What other context do you track / would be relevant to other researchers replicating or building on this work?
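
One way to make this checklist machine-checkable is to treat it as a small record with one field per item. The sketch below is only an illustration: the class name, field names, and the choice to treat the last item as optional are all assumptions, not part of any agreed form.

```python
from dataclasses import dataclass, fields


@dataclass
class PublicDatasetRecord:
    """Hypothetical record mirroring the public-dataset checklist above."""
    summary: str = ""        # What data does this capture?
    schemas: str = ""        # Schemas + their components
    files: str = ""          # Files + datasets (size, relevance)
    usage: str = ""          # How to use + build on this data
    citation: str = ""       # How should it be cited?
    update_policy: str = ""  # How is it updated?
    license: str = ""        # License details (code, other materials)
    source_code: str = ""    # Where to find source code
    contact: str = ""        # Where to send queries/requests/suggestions
    context: str = ""        # Other context for replication


# Assumption: only the free-form "context" item is optional.
OPTIONAL = {"context"}


def missing_items(record):
    """Return the names of required checklist items that are still blank."""
    return [
        f.name
        for f in fields(record)
        if f.name not in OPTIONAL and not getattr(record, f.name).strip()
    ]


record = PublicDatasetRecord(
    summary="citations from worldwide patents to scientific articles",
    contact="maintainer@example.org",  # placeholder address
)
print(missing_items(record))
```

A form like this could back the "what should we require; what is good style?" question: required items go in the dataclass, style suggestions go in `OPTIONAL`.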

2. Sharing a dataset in progress

see IProduct:

  • What data does this capture?

  • Describe schemas + components

  • How should it be cited?

  • Embargo details; plans for updates until/after release

  • Other access/use policies

  • How is it updated?

  • License details

  • Where to find source code, any closed data

  • Contact information

3. Challenges to be aware of

  • Context: Huge replication crisis. ~50% of papers use closed data.

    • Most journals make data available for replication only.

      Support cumulative innovation! - not rote replication

  • What recipes can we use, to document and tell people about it?

    • Examples of projects using that recipe, qualifications + caveats

  • What guidelines are there for those who want to be open?


  • Publish guidelines, best practices; possibilities/recipes, pros/cons.

    BR: Recommend a shortlist of repository options.

    • Maintain a meta-catalog pointing to everything using one of the recipes

    • Include recipes for syncing across repositories (MM:ICPSR+Z)

  • MM: Go to noted collections/people (e.g. Jeffrey Kuhn) and offer to permanently curate their data for them. Migrate from their site into the borg.

    • Highlight great datasets we admire and want to preserve.

  • Options for long-term preservation (SF, Wellcome)
    Arrange collection by an endowed library (MIT, IA, Zen)

    • see NSF on open science? for research data [MM talked to Skip, <-> AJ]

  • AJ: Ensure people make public not just data but code. Keep updating by public information, even when initial contributors retire.

    • Standardized form: for keeping snapshots and archiving going forever

  • Dedicate summer meeting to news you can use: how-tos, workshops

    • Index: half-decent catalog of what exists and is widely used

    • What to do with your dataset! Options, tool workshops.

    • Consumer Reports for repositories (interview, summarize)

    • Offer free consults: to map workflows that don’t yet work for that.

Notes from discussions with participants


Chat w/ Phil Durbin: DV at IQSS has a liaison working w/ projects (like III) to sort out how to map their datasets and community into DV accounts. (Some of us should meet with her.) They can help us get set up as a group.

Some limitations of the Harvard Dataverse: 3000 files per directory, 10MB per uncompressed dataset, 50MB per file. Datasets can be queried / subsetted / visualized without moving them off of Dataverse.
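
A pre-upload check against limits like these can be sketched in a few lines. The defaults below are just the per-directory and per-file numbers from this chat, taken as assumptions (including reading "MB" as 1024 × 1024 bytes); confirm current limits with the repository before relying on them. The function name is hypothetical.

```python
import os

# Assumed limits, copied from the notes above; verify before use.
MAX_FILES_PER_DIR = 3000
MAX_FILE_BYTES = 50 * 1024 * 1024


def check_tree(root, max_files=MAX_FILES_PER_DIR, max_bytes=MAX_FILE_BYTES):
    """Walk `root` and report directories or files that exceed the limits."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if len(filenames) > max_files:
            problems.append(
                f"{dirpath}: {len(filenames)} files (limit {max_files})"
            )
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > max_bytes:
                problems.append(f"{path}: {size} bytes (limit {max_bytes})")
    return problems
```

Calling `check_tree("path/to/dataset")` returns a list of human-readable problems, empty if everything fits.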

Enclaving: One feature in DV is a canonical ‘enclave’ option: you put a pointer and metadata in DV which points to the underlying data. This can address and point to detailed data elsewhere.
[MATT: that’s interesting so you could keep your files at Zenodo but have essentially a ‘pointer’ to them on DV? That’s cool. I originally posted to ICPSR but now I have to update both copies of the files, yuck.]

Chat w/ Matt + Laura

File sizes -- what size range is reasonable? [MATT: most laptops have 8G memory, many have 16. Keeping under 8 would be good. Not sure whether it is helpful to split larger files into smaller pieces like A-L and M-Z, though. Could post both?]
--> use uncompressed where possible
(Ex: reliance on science core files could be unzipped.)

Compression toolchain:
--> What are considered standard? (for text, binaries…) [MATT: many people like .csv but these are HORRIBLE because you have to put quotes around any field with a comma in it, very wasteful compared to tab-separated which is what I use and also]
If using anything else, include a link to (or binary of) the decompression tool
--> Compression options on download? Can we get the best of both worlds: uncompressed in the archive [for online visualizations] and compression on download w/ no fuss? (ask Zenodo and DV how they think about this) [MATT: true but compression can be needed to avoid hitting total-storage or per-file constraints. For example I’m at 46G on zenodo now and only because it’s all compressed.]
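
To make the quoting point concrete, here is a small sketch using Python's standard `csv` and `gzip` modules. The sample rows are invented for illustration. It writes the same rows comma-separated and tab-separated, then round-trips the tab-separated text through gzip, which is used here only as an example of a widely available compression format, not as an agreed standard.

```python
import csv
import gzip
import io

# Invented sample rows; two fields contain commas.
rows = [
    ["patent_id", "title", "assignee"],
    ["P1", "Method, system, and apparatus", "Acme, Inc."],
    ["P2", "Widget", "Example Corp"],
]


def serialize(rows, delimiter):
    """Write rows with the csv module's default (minimal) quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


csv_text = serialize(rows, ",")   # fields containing commas get quoted
tsv_text = serialize(rows, "\t")  # no quoting needed for these rows

print(len(csv_text), len(tsv_text))

# Round-trip through gzip to show compression is lossless.
compressed = gzip.compress(tsv_text.encode("utf-8"))
restored = gzip.decompress(compressed).decode("utf-8")
```

Tab-separated files avoid quoting only as long as no field contains a tab or newline; with the writer above, such fields would be quoted just as comma-containing fields are in CSV.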

--> Every variable needs to be defined and explained as to its origin.
--> Ideally source code would be supplied, or at least described textually.
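
The "every variable is defined" rule could be checked automatically by comparing a data file's header against a codebook. The sketch below assumes a tab-separated file with a header row and a codebook held as a plain dictionary; the column names, codebook layout, and function name are all hypothetical.

```python
import csv
import io


def undocumented_columns(data_file, codebook):
    """Return header columns that lack a definition or an origin."""
    header = next(csv.reader(data_file, delimiter="\t"))
    missing = []
    for name in header:
        entry = codebook.get(name, {})
        if not entry.get("definition") or not entry.get("origin"):
            missing.append(name)
    return missing


# Hypothetical codebook: one entry per variable.
codebook = {
    "patent_id": {
        "definition": "Identifier of the citing patent",
        "origin": "Patent office bulk data",
    },
    "paper_id": {
        "definition": "Identifier of the cited article",
        "origin": "Bibliographic database",
    },
}

# Stand-in for an open data file; "confidence" has no codebook entry.
sample = io.StringIO("patent_id\tpaper_id\tconfidence\nP1\tA9\t0.8\n")
print(undocumented_columns(sample, codebook))
```

With a real file, `data_file` would be something like `open(path, newline="")`.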

