We need a GitHub repository for sharing code and documentation! This elaborates on discussions from our workshop in December and related ideas from the past year.
The challenge: finding the latest work on a public dataset
Many researchers develop data and methods which are “adjacent” to large public datasets, and others develop improvements on those data and methods. These should be archived in a way that they can be easily discovered, compared, and used in tandem.
Azoulay et al. (2019) developed a matching algorithm to link publications in PubMed to patents which cite them, and a PhD student at Stanford developed a modification of this matching algorithm which produces around 490,000 more (seemingly valid) matches in the same sample – an increase of roughly 12.5%.
I digitized historical versions of the FDA Orange Book patent and exclusivity data and posted them publicly online, where they are available for merging with standard USPTO administrative datasets on patent grants. Other researchers (law professors, and Bhaven Sampat) have hand-coded classifications of the Orange Book patents, which could lead to an improved version of these data.
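To make the "merging" concrete: a minimal sketch of linking Orange Book records to USPTO grant data on patent number, using pandas. The file layouts, column names, and sample values here are all hypothetical, not the actual schemas of either dataset.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; the real Orange Book and
# USPTO extracts will have different columns -- adjust keys accordingly.
orange_book = pd.DataFrame({
    "patent_no": ["5034394", "6316443"],
    "exclusivity_code": ["NCE", "ODE"],
})
uspto = pd.DataFrame({
    "patent_no": ["5034394", "7654321"],
    "grant_year": [1991, 2010],
})

# Inner join keeps only patents present in both sources; switch to
# how="left" to retain every Orange Book record and audit non-matches.
# validate="m:1" raises if a patent number is duplicated on the USPTO side.
merged = orange_book.merge(uspto, on="patent_no", how="inner", validate="m:1")
print(merged)
```

A left join with `indicator=True` is often more useful in practice, since the unmatched rows are exactly where hand-coded corrections (like those noted above) would add value.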
The innovation/patent research community has good sharing norms, but code and data could be shared more effectively. Two inefficiencies:
Many datasets – such as digitized Orange Book data – are hosted on one-off websites which can be easily missed; and many algorithms for analyzing USPTO and complementary public datasets are shared via social networks rather than posted to a public archive, which can generate inequality in reuse.
Sometimes “technological progress” occurs that advances an initial contribution, but researchers are only aware of the initial contribution (which gathers more citations over time), and it takes a long time for improvements to make their way into iterative work. For instance, see the Azoulay et al. 2019 algorithm noted above.
A natural step forward for addressing both inefficiencies would be to set up a shared public catalog of code and documentation, plus a GitHub repository for code that doesn’t otherwise have an archival home. The catalog would include bidirectional links to derivatives and related historical work. The repository would track use and downloads for each subproject and provide DOIs for code (via Zenodo), so researchers can track how their work is used.
Researchers would post code and documentation (and associated data where needed, as in the Orange Book case), including citations to papers which apply that code. This would let future researchers more easily build on past methodological improvements.
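As a sketch of what one catalog record might hold, here is a minimal entry covering the pieces described above: bidirectional links to upstream and derivative work, a DOI slot, and citations to applying papers. Every field name here is illustrative, not a proposed standard, and the repository URL is a placeholder.

```python
import json

# A hypothetical catalog entry; field names and structure are assumptions.
entry = {
    "name": "orange-book-patents",
    "description": "Digitized historical FDA Orange Book patent and exclusivity data",
    "doi": None,            # filled in once the code/data is archived via Zenodo
    "repository": "https://github.com/<org>/orange-book-patents",  # placeholder
    "derived_from": [],     # upstream datasets/code this builds on
    "superseded_by": [],    # later improvements, so readers find the latest version
    "papers": [],           # citations to papers that apply this code/data
}

# Entries would be stored as JSON so the catalog is easy to script against.
print(json.dumps(entry, indent=2))
```

The `superseded_by` link is what would have surfaced the improved Azoulay et al. matching algorithm discussed above, rather than leaving readers with only the original contribution.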
This might start by focusing on data and code that are “adjacent” to USPTO administrative datasets, and on code used to generate various metrics from documents or citation graphs. The same framework could then be extended to other public innovation data. While simple, this has the potential both to reduce the lag from research idea to initial analysis and to improve the quality of communication, collaboration, and subsequent research.
A first step would be getting a group to consolidate the already existing packages and papers. The group could organize these by category, including a review of the outputs that exist, the code used to create them (including which languages are most often used), and current documentation describing how to use them and best current practices.
I’ve used GitHub as the example repository here, as it feels easy to get started there. However, if we decide later that a different setup outside of GitHub would be more effective, it would be easy to transfer the work.
Other notes (to convert into comments on the related text above?)
Many PhD students come to my office with creative, novel, and interesting research ideas where “step 1” is to go replicate five other linkages or methods that have been done in the literature, which can frequently take several months or longer. (HW)
A few working group participants indicated interest in an overlay journal that could feature all work associated with or extending III datasets, whether producing those datasets or using them. One requirement to publish in such a journal could be posting the work/code on GitHub. (via MK)
Software communities -
R and Stata centralize their packages because each is a single coding language.
In our case, the metrics community (for instance) is not tied to one coding language, so creating a community within GitHub would be an ideal space to upload and update packages and flag any issues.
Bio - Protein Data Bank
Census data - alignment on names
Genomics - alignment on average genomes by species