
Data checklists + considerations

Published on Feb 20, 2020

1) Compile a checklist that feels sufficient for public datasets
2) Compile a checklist that captures the needs of incomplete/embargoed data
3) Draft a data-sharing use case for each of: the original NBER dataset; Reliance on Science; Lens Lab data; ScalingScience metrics; IProduct; Bronwyn’s data page; other presenters at the Dec. meeting
4) Classify / prioritize guidelines: what should we require; what is good style?

0. Universal form

  • What sort of data are you gathering? (short text)

  • URL / other link to data [link to DV in case they want to upload first] [optional]

  • URL / other link to related paper [optional]

  • Contact email
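
A rough sketch of how the universal form's responses could be stored as a structured record; every field name and value here is a hypothetical placeholder, not a fixed schema:

    # Hypothetical record mirroring the universal form fields above.
    universal_form = {
        "data_description": "Citations from worldwide patents to scientific articles",
        "data_url": "",            # optional: link to the data, e.g. a Dataverse or Zenodo deposit
        "paper_url": "",           # optional: link to a related paper
        "contact_email": "author@example.edu",
    }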

1. Sharing a public dataset

As an example, see https://zenodo.org/record/3685972

  • What data does this capture? (high level: ‘citations from worldwide patents to scientific articles’)

  • Describe schemas + their components (‘We link A, B, C; each linkage has D, E’; ‘full details in __’; ‘we redistribute [existing datasets] F, G’)

  • Describe files + datasets (size, relevance to what sort of analysis)

  • How to use + build on this data? (tunable parameters in key steps, explicit nods to replication)

  • How should it be cited? (‘If you use the data, please cite…’)

  • How is it updated?

  • License details (code, other materials)

  • Where to find source code

  • Where to send queries/requests/suggestions (contact email)

  • What other context do you track / would be relevant to other researchers replicating or building on this work?
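
For illustration, the checklist above could travel with the data as a small machine-readable record; every field name and value in this sketch is an invented placeholder, not a required format:

    # Hypothetical metadata record mirroring the public-dataset checklist above.
    public_dataset_metadata = {
        "summary": "Citations from worldwide patents to scientific articles",
        "schemas": "We link A, B, C; each linkage has D, E; full details in the paper",
        "files": [{"name": "citations.tsv", "size_gb": 2.1, "use": "patent-to-paper analysis"}],
        "usage_notes": "Tunable matching threshold; replication notes in the repository",
        "citation": "If you use the data, please cite ...",
        "update_policy": "Refreshed annually",
        "license": "Code: MIT; data: CC BY 4.0",      # example values only
        "source_code": "https://example.org/repo",    # placeholder URL
        "contact": "maintainer@example.edu",
        "context": "Known coverage gaps; provenance of redistributed datasets",
    }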

2. Sharing a dataset in progress

See IProduct: http://www.iproduct.io/

  • What data does this capture?

  • Describe schemas + components

  • How should it be cited?

  • Embargo details; plans for updates until/after release

  • Other access/use policies

  • How is it updated?

  • License details

  • Where to find source code, any closed data

  • Contact information
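
The in-progress case mostly layers embargo and access fields onto the public-dataset record sketched above; field names and dates here are again hypothetical:

    # Hypothetical additions for an embargoed / in-progress dataset.
    in_progress_extras = {
        "embargo_until": "2021-01-01",     # example date only
        "update_plan": "Quarterly refreshes until release, then annual",
        "access_policy": "Request access by email during the embargo",
        "closed_data": "Raw vendor files are not redistributed; processing code is public",
    }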

Notes from discussions with participants

Dataverse

Chat w/ Phil Durbin: DV at IQSS has a liaison working w/ projects (like III) to sort out how to map their datasets and community into DV accounts (some of us should meet with her). They can help us get set up as a group.

Some limitations of the Harvard Dataverse: 3000 files per directory, 10 MB per uncompressed dataset, 50 MB per file. Datasets can be queried / subsetted / visualized without moving them off of Dataverse.
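
A small sketch, assuming the limits quoted above, that scans a local directory and flags anything that would exceed the per-file size or per-directory file count before uploading:

    import os

    MAX_FILES_PER_DIR = 3000   # per-directory limit quoted above
    MAX_FILE_MB = 50           # per-file limit quoted above

    def check_deposit(root):
        """Flag directories and files that would exceed the quoted limits."""
        for dirpath, dirnames, filenames in os.walk(root):
            if len(filenames) > MAX_FILES_PER_DIR:
                print(f"{dirpath}: {len(filenames)} files (limit {MAX_FILES_PER_DIR})")
            for name in filenames:
                path = os.path.join(dirpath, name)
                size_mb = os.path.getsize(path) / 1e6
                if size_mb > MAX_FILE_MB:
                    print(f"{path}: {size_mb:.0f} MB (limit {MAX_FILE_MB} MB)")

    check_deposit("my_dataset")   # hypothetical local directory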

Enclaving: One feature in DV is a canonical ‘enclave’ option: you put a pointer and metadata in DV that point to the underlying data, so detailed data held elsewhere can still be addressed from DV.
[MATT: that’s interesting so you could keep your files at Zenodo but have essentially a ‘pointer’ to them on DV? That’s cool. I originally posted to ICPSR but now I have to update both copies of the files, yuck.]

Chat w/ Matt + Laura

File sizes -- what size range is reasonable? [MATT: most laptops have 8 GB of memory, many have 16 GB. Keeping files under 8 GB would be good. Not sure whether it is helpful to split larger files into smaller pieces like A-L and M-Z, though. Could post both?]
--> use uncompressed where possible
(Ex: Reliance on Science core files could be unzipped.)
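
One way to keep individual files in a laptop-friendly range is to split a large table on a key column, along the lines of the A-L / M-Z idea above; a sketch assuming a tab-separated input with the key in the first column (the file name is hypothetical):

    import csv

    def split_by_initial(path, column=0):
        """Split a large TSV into per-initial files (a.tsv, b.tsv, ...) on one column."""
        writers, handles = {}, {}
        with open(path, newline="") as infile:
            reader = csv.reader(infile, delimiter="\t")
            header = next(reader)
            for row in reader:
                initial = (row[column][:1] or "_").lower()
                if initial not in writers:
                    handles[initial] = open(f"{initial}.tsv", "w", newline="")
                    writers[initial] = csv.writer(handles[initial], delimiter="\t")
                    writers[initial].writerow(header)
                writers[initial].writerow(row)
        for handle in handles.values():
            handle.close()

    split_by_initial("citations.tsv")   # hypothetical input file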

Compression toolchain:
—> What are considered standard? (for text, binaries…) [MATT: many people like .csv but these are HORRIBLE because you have to put quotes around any field with a comma in it, very wasteful compared to tab-separated which is what I use and also patentsview.org]
If using anything else, include a link to (or a binary of) the decompression tool.
—> compression options on download? Can we get the best of both worlds: uncompressed in the archive [for online visualizations] and compression on download w/ no fuss? (Ask Zenodo and DV how they think about this.) [MATT: true, but compression can be needed to avoid hitting total-storage or per-file constraints. For example, I’m at 46 GB on Zenodo now, and only because it’s all compressed.]
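
A sketch of the tab-separated-plus-compression workflow discussed above, using only the Python standard library; the rows and file names are invented for illustration:

    import csv
    import gzip

    rows = [["patent_id", "paper_id", "score"],
            ["US123456", "10.1000/xyz", "0.97"]]

    # Tab-separated output avoids the per-field quoting that commas force on CSV.
    with open("links.tsv", "w", newline="") as out:
        csv.writer(out, delimiter="\t").writerows(rows)

    # Gzip the same table when total-storage or per-file limits bite.
    with gzip.open("links.tsv.gz", "wt", newline="") as out:
        csv.writer(out, delimiter="\t").writerows(rows)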

Documentation:
—> every variable needs to be defined, and its origin explained
—> ideally, source code would be supplied; at a minimum, the processing should be described textually.
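
The data dictionary itself can be shipped as a small tab-separated file; a sketch with invented example rows (variable, definition, origin):

    import csv

    # Hypothetical data-dictionary rows: variable name, definition, origin.
    dictionary = [
        ["variable", "definition", "origin"],
        ["patent_id", "Publication number of the citing patent", "Source patent data"],
        ["paper_id", "Identifier of the cited scientific article", "Assigned in the matching step"],
    ]

    with open("data_dictionary.tsv", "w", newline="") as out:
        csv.writer(out, delimiter="\t").writerows(dictionary)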
