A moderated discussion of trade-offs and choices when sharing datasets, drawing on experience from building the Reliance on Science dataset.
We will host the initial session at 12:00 - 13:30 ET
A second session, for Asia/Australasia timezones, will be held the following day (late Thursday ET / early Friday ACT).
Any questions recorded in these notes after the session will be forwarded to the presenters.
This session is part of the I3 Spring 2021 Data Sharing workshop series.
Worth noting that data on patents - which involves granting monopolies - is generally free, whereas data on open scholarship is generally locked up.
“What is the heritage of innovation?”
Open science and commercial innovation worlds don’t talk well to each other. Patent-to-patent citations are more straightforward, since they’re in the same world. Patent citations to papers are a mess: “Ibid.” references, no citation numbers, a total nightmare.
Using both a machine learning approach (GROBID) and a heuristic approach. Currently the heuristics have the edge; they try to encode human intelligence. Not sure that will always be true, and there are other projects, but in this instance the heuristics find about 23% more matches.
- what data to share
- where to share it
- how to share it
- who to share it with
- when and whether to share at all? (more of a philosophical question; an invitation to discussion)
Note: the bulk of the work was done with the 🔥 Aaron Fuegi from BU, who helped parallelise the algorithm, which was a huge help.
This is different from a mere journal obligation: there, your obligation ends when you submit some data and some code, and you’re not required to update it. We’re talking about data sharing that you want to do.
Ask: who are your customers?
Benefits / goals: broader adoption + fewer support-threads.
Alan Kay: “simple things should be simple. complex things should be possible”
--> guiding thoughts about enabling 80% of users to do what they want easily, but also enabling 20% of power users.
In this case customers are mostly social scientists:
Working assumption: 80% of reusers just want counts of citations per patent. If you look at the thousands of papers that use Adam Jaffe’s data, most people use counts. If we can get 80% of the way with just numbers, that might be good for some, but we’d shut out the 20% who want more information to allow more sophisticated analysis.
However, you end up in an information-about-papers discussion anyway.
Probabilistic matching leads to a dilemma: do I share only the matches I’m very sure of, or do I include more matches? There is a strong bias in the social sciences against false positives, but if you have no false positives you have a lot of false negatives: there’s a tradeoff there. In the interest of reducing support load, we did not share only the perfect matches.
We decided to share a larger set, but didn’t want to share it without any insight, as that’s not responsible. We assign a confidence score to each match, giving a 9-to-1 scale of sureness. We spent a lot of the Sloan grant 🌱 providing and interpreting the confidence scores.
e.g. 93% coverage at just 1.2% false positives: choosing your point at the knee of the curve.
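That knee-of-the-curve choice can be sketched as a tiny threshold search. The curve numbers below are invented for illustration (only the 93% / 1.2% point echoes the example above), and the function name is hypothetical:

```python
# Hypothetical sketch: pick the loosest confidence threshold whose
# false-positive rate stays under a tolerance. Curve numbers invented.
def pick_threshold(curve, max_fpr):
    """curve: list of (min_score, coverage, false_positive_rate).
    Returns the entry with the highest coverage whose FPR <= max_fpr."""
    best = None
    for min_score, coverage, fpr in curve:
        if fpr <= max_fpr and (best is None or coverage > best[1]):
            best = (min_score, coverage, fpr)
    return best

curve = [
    (9, 0.60, 0.001),
    (7, 0.80, 0.006),
    (5, 0.93, 0.012),  # e.g. 93% coverage at 1.2% false positives
    (3, 0.97, 0.040),
]
print(pick_threshold(curve, max_fpr=0.02))  # -> (5, 0.93, 0.012)
```

A stricter tolerance simply walks you back along the same curve toward higher precision and lower coverage.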
An accompanying dilemma: do you share the test set you used to generate this data? We did, in the interest of transparency, and so that you can try your own algorithm.
How to address these? Linked before to Web of Science: Clarivate pays publishers for data, so you get really rich data, but you pay for it and the license says you can’t post it.
So how do you share information about papers? You can’t download Google Scholar data in bulk, and we don’t recommend trying to scrape it. We more or less accidentally discovered Microsoft Academic [and Open Academic]: not as well known, but they have a site that looks a lot like Google Scholar and you can bulk download all the data.
Microsoft just scrapes this data (which might mean more errors), but since they don’t pay publishers, you can access the whole thing for free. We could link to entities in the Microsoft Academic Graph — they have a MAG ID and Author IDs — but when we started sharing, not many people had used MAG, so that wasn’t terribly helpful by itself.
One option we considered was merging all the fields into the data we provide. Another was to point people to the MAG data, but it’s on Azure and a bit of a pain: it costs $80 to download, you need to set up a data lake, etc.
Some people have mirrored it (maybe what we should have used), but we weren’t sure how often the mirrors updated (MAG updates each week). The other issue is that it’s massive (63GB). Target researchers are using their laptops, which makes opening this file hard.
We created a redistribution of the MAG data where we chop up these files by field (year, vol/issue/page, title, author, affiliation).
Even the titles by themselves are 17GB, but most people just want the year.
Aside: Microsoft has author IDs! It’s nice.
This facilitated people doing simple things.
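The chopping idea can be sketched as a single streaming pass over a big tab-separated file, writing one small file per field keyed by paper ID. Column names here are hypothetical, not the real MAG schema:

```python
# Illustrative sketch (column names invented, not the real MAG schema):
# stream a huge TSV once and write each requested field to its own,
# much smaller file keyed by paper ID.
import csv

def chop_by_field(src_path, fields, id_col="PaperId"):
    outs = {f: open(f"{f}.tsv", "w", newline="") for f in fields}
    writers = {f: csv.writer(outs[f], delimiter="\t") for f in fields}
    with open(src_path, newline="") as src:
        for row in csv.DictReader(src, delimiter="\t"):
            for f in fields:
                # one (id, value) pair per line in each per-field file
                writers[f].writerow([row[id_col], row[f]])
    for fh in outs.values():
        fh.close()

# e.g. chop_by_field("Papers.tsv", ["Year", "Title"])
```

Because it streams row by row, this works even when the source file is far larger than the researcher’s RAM.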
— Don’t redistribute the full MAG abstracts file (e.g. 200GB of abstracts, encoded as all the words in numbered order to avoid copyright violations)
— Created complementary assets like journal impact factor
Early feedback: people wanted to use particular sections of the data, e.g. the way you can grab sections in Web of Science
We built a frequency-distribution mapping from keywords to Web of Science categories and OECD categories
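One simple way to build such a mapping (the keyword/category pairs below are invented) is to count how often each keyword co-occurs with each category and keep the most frequent one:

```python
# Sketch of a frequency-based keyword -> category mapping (data invented):
# count co-occurrences in papers indexed under both schemes, then map
# each keyword to its most frequent category.
from collections import Counter, defaultdict

def build_mapping(pairs):
    counts = defaultdict(Counter)
    for keyword, category in pairs:
        counts[keyword][category] += 1
    # most_common(1) picks the category seen most often for each keyword
    return {kw: c.most_common(1)[0][0] for kw, c in counts.items()}

pairs = [
    ("neural networks", "Computer Science"),
    ("neural networks", "Neuroscience"),
    ("neural networks", "Computer Science"),
]
print(build_mapping(pairs))  # -> {'neural networks': 'Computer Science'}
```

Keeping the full frequency distribution (rather than only the winner) also lets users see how ambiguous each keyword is.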
The main issue was the papers, but patents are also a complementary asset. We considered doing something similar to the Microsoft data, but it didn’t seem like a good use of time.
In the US: we used PatentsView — which is great. To chop the files we basically stole the PatentsView schema.
Outside of US:
- you need to buy PATSTAT.
- DocDB data is downloadable from google patents.
We decided not to redistribute the entire thing, but you get patents that are the same in multiple jurisdictions, so we released the patent family index to avoid double counting.
Merging issues: just because you have the patent ID doesn’t mean you can merge easily. We tried to massage the Google Scholar data to match USPTO, but it wasn’t perfect. One enquiry about errors turned out to be an issue with leading zeroes.
Everyone has a different way of writing patent ids. If you’re going to provide mergeable data then this is something you should think about. (ask Nicholas Pairolero)
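A sketch of the kind of normalisation that catches leading-zero and formatting mismatches before a merge. The patterns handled here are illustrative only; real patent numbering (reissues, kind codes, per-jurisdiction quirks) is messier:

```python
# Hypothetical merge-key normaliser: patent numbers arrive in many shapes
# ("US-5,123,456-A", "us5123456", "05123456"). Stripping punctuation and
# leading zeroes avoids silent non-matches when merging datasets.
import re

def normalize_patent_id(raw, default_country="US"):
    s = re.sub(r"[^A-Za-z0-9]", "", raw).upper()   # drop dashes, commas, spaces
    m = re.match(r"^([A-Z]{2})?0*(\d+)([A-Z]\d?)?$", s)
    if not m:
        return None
    country, number, _kind = m.groups()            # kind code discarded here
    return (country or default_country) + number

assert normalize_patent_id("US-5,123,456-A") == "US5123456"
assert normalize_patent_id("05123456") == "US5123456"
```

Agreeing on (and documenting) one canonical form like this is cheaper than debugging mysterious merge losses later.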
Q: How does this relate to the work done with Lens data in 2017-18? (front-page citations)
A: This includes in-text/body citations. RonS has both sets of citations, globally; it should complement Lens. Worth discussing later.
You could put the data on a personal website (e.g. Jeff Kuhn does a great job of this with his Patent Citation Similarity dataset). A lot of flexibility from hosting it himself.
Pedro Matos and Jan Bena: an interesting entity-reconciliation approach using Google Search matching. They published their data and built their own website to host it. If you want their data you put in your info and need approval to access it. They have total control and flexibility, but you also have to build it. These sites also have issues with persistence and versioning: can people get past versions for a paper? Do versions have data DOIs? Also trivial things like a download counter; this is done for you in a number of repositories.
3 hosted options to consider:
ICPSR (comes in two flavours) — originally started as a curating service, where you pay people to clean, document and version your dataset. They have now added openICPSR, where you can post data for free.
DOIs for data
Default capacity 2GB, can increase it if you send them an email
Requires registration (we eventually moved because we wondered if it was creating friction). This was a huge barrier! An order of magnitude more downloads as soon as we posted on Zenodo.
Bitsy: wonder where all of those are coming from
“icpsr is where all the 1990s econ history datasets are, so if you want to hang out with published 19th c. census tables, that’s where to do it :-)” - Bitsy Perlman
also a barrier if your institution is not signed up to ICPSR
Dataverse — an IQSS project. You can download and install your own; people mostly use the Harvard Dataverse. Per-datafile capacity limit, at the time 2.5GB but it may have increased. We have posted smaller datasets below this limit to Dataverse; they can also increase it if you ask.
Zenodo — chosen mostly for capacity.
50GB default quota (just enough for this)
A lot of great solutions here, and you don’t have to worry about a lot of persistence issues. E.g. hosting on your old faculty page is terrible if you change institutions.
Both Dataverse and ICPSR are great places to search when beginning a project.
Bulk data vs APIs — other people have done APIs, e.g. PatentsView. This is something we could think about adding for commercial users, but we’ve never had a request (so far) out of 34k+ downloads.
Format: proprietary formats tend to be smaller than flat files like CSVs, as some formats allow internal pointers that shrink the files.
TSVs are smaller, but CSVs are better understood; we chose TSVs in the end.
There are as many compression algorithms as there are types of files. Ultimately decided to provide the chief file uncompressed.
We get emails asking how to unzip files.
In some repositories, uncompressed files can be previewed, showing the fields in a display format (Dataverse does this, which is really cool).
Zipped the rest of the files; even though 7zip is better, we didn’t want people to have to download a new tool in order to use the data.
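For consumers who do struggle with zip files, the zipped TSVs can also be read in place without unpacking; the file and member names below are made up for illustration:

```python
# Read a TSV straight out of a zip archive without extracting it.
# File/member names are invented for illustration.
import csv
import io
import zipfile

def read_zipped_tsv(zip_path, member):
    """Yield rows (lists of strings) from one TSV member of a zip file."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as raw:
            # wrap the binary stream so csv can read text with tab delimiters
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.reader(text, delimiter="\t")

# e.g. rows = list(read_zipped_tsv("citations.zip", "citations.tsv"))
```

This streams row by row, so even a large archived file never has to be fully decompressed to disk.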
Documentation: describe all the values and the whole schema, provide links to everything, and provide descriptions, e.g. of the error scoring. Because both front-page and in-text citations tell you where a citation appears, a number of scholars compare those two; that’s a key indicator.
PubMed is a public-access database; anyone can use it. That’s the path to the widest diffusion of data. One step back from that is to apply a license.
Three licenses we have seen used:
(CC-BY, ODC-BY) — use for anything but show where you got it
non-profit, e.g. attribution non-commercial
non-redistribution, no-derivatives (you can’t make a new dataset from this data)
Jeff Kuhn does this
Just because you put a license on a dataset doesn’t mean people won’t ignore it. Another option is to restrict access (e.g. for the UVA dataset you have to email them). It could also be that this decision is made for you: ours was made using MAG, which is ODC-BY, and if you make a derivative dataset you have to pass the license through. We could not make it fully public-access even if we wanted to.
Bitsy: how strict is no-derivs? could you use in a paper?
Sam: No-Derivs is quite restrictive; it suggests not only that you can’t make a derivative dataset, but also that you may not be able to include an excerpted table in a paper. You would still need special permission from the author to publish composites or extracts. A regression table would be a derivative under copyright law (or, well, this is underspecified). This license wasn’t really intended to be used with data; it’s a grey area people try not to plumb.
The lesson here is that a no-derivs license is probably not what authors actually want: in practice you don’t see authors suing other people over it.
Obviously, if you’re required to post data, you do that. But what if it’s not required? What if you want to spur on other researchers and contribute to the discourse? When do you share: after publishing? Or right now, taking payment in citations? Or do you want a model that sustains over time (not sharing now in order to keep sharing later)?
Adam, putting you on the spot as a pioneer of data sharing in this field: how did you and your collaborators think about making data available?
Adam: the solution is to have Manuel as a co-author. Talking on a beach, they got a grant to make the dataset; the thought was that they should publish papers, but Manuel reckoned they should just publish the dataset and everyone would cite them.
In retrospect: definitely the right decision — got so many citations, so much attention for the dataset. 0 regrets.
A decision they didn’t worry about too much: what’s the weight of a small number of highly cited papers vs many less-cited ones? The NBER dataset has >4k cites on Google Scholar, fewer on other platforms that count less stuff.
Matt published papers using the Web of Science version of this. People were asking for the data and he couldn’t give it away, but then he found and integrated MAG in order to share.
Adam: the update question becomes enormous. People are still writing papers with the NBER dataset that ends in 2006, itself an update of the original that ended in the early 1990s.
Q: why do people use this vs PatentsView, even though PatentsView has much more up-to-date data?
Adam: not sure if PatentsView has the same calculations (generality/originality), but the OECD does. Information, despite simple models, does not diffuse freely; people don’t know what’s out there.
Matt: speaking of which, we’re a little behind on updates :)
Matt’s asides on his sweet video setup:
Greenscreen for zoom
PPT add-on: ‘transforms each slide into several slides’
Mic w/ stand