Here are public innovation datasets, some newly shared at our first working group meeting. If you have additional data to share, please contact us.
Reliance on Science - Citations from patents to scientific articles, from both worldwide front pages and USPTO body text (data + code)
(Matt Marx) This dataset includes citations extracted from the front pages and body text of patents, confidence scores for each extraction, related code, and intermediate steps in the process, to simplify replication or parallel work.
_pcs.tsv has patent citations to science, with confidence scores.
_pcs_pubmed.tsv has a PubMed-specific match for USPTO patents.
_pcs_bodytextbeta.tsv is a preliminary release including citations from the body text of USPTO patents since 1836. This adds a field indicating whether a citation appeared on the front page, in the body text, or both.
Other files redistribute the 1/1/2019 release of the Microsoft Academic Graph, carved up into smaller, variable-specific files, with extensions for journal impact factor & technical classifications.
Source code: mattmarx/reliance_on_science
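Because each citation carries a confidence score, and the body-text release adds a field saying where the citation appeared, a common first step is to keep only high-confidence rows, optionally restricted to one location. The following is a minimal sketch of that filter, not the project's own code. The column names (`patent`, `paperid`, `confscore`, `wherefound`), the location labels, and the sample rows are all assumptions made for illustration; check the header of the actual file before using it.

```python
import csv
import io

# Hypothetical column names; the real TSV header may differ.
CONF_COL = "confscore"
WHERE_COL = "wherefound"


def filter_citations(tsv_lines, min_conf, where=None):
    """Return rows whose confidence is >= min_conf.

    tsv_lines: any iterable of tab-separated lines, header first
               (an open file handle works).
    where:     optional location label to keep (e.g. front page vs. body text).
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    kept = []
    for row in reader:
        if float(row[CONF_COL]) < min_conf:
            continue
        if where is not None and row[WHERE_COL] != where:
            continue
        kept.append(row)
    return kept


# Made-up sample rows for illustration only; not taken from the dataset.
SAMPLE = (
    "patent\tpaperid\tconfscore\twherefound\n"
    "us-0000001\t111\t10\tfrontonly\n"
    "us-0000002\t222\t4\tbodyonly\n"
    "us-0000003\t333\t8\tboth\n"
)

if __name__ == "__main__":
    for row in filter_citations(io.StringIO(SAMPLE), min_conf=8):
        print(row["patent"], row["paperid"], row["confscore"], row["wherefound"])
```

To run this against a downloaded file, pass an open file handle instead of the in-memory sample, for example `open("_pcs_bodytextbeta.tsv", newline="")`, and adjust the column constants to match the real header.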
As an example of custom datasets that can be shared from Lens.org, these are public dumps of metadata associated with patents and scholarship from MIT faculty:
MIT scholarly works (1950-2018) [372 MB]
MIT scholarly works cited by patents (1950-2018) [79 MB]
MIT Citing Patents [3.9 GB]
MIT Citing Patents collection (online)
MIT Patents [300 MB]
MIT Patents collection (online)
French author disambiguation dataset
Portal & Beta API
Explore: visualizations of different impact metrics
Initial list courtesy of Bronwyn Hall
Additional files for the 2006 edition on Bronwyn’s website
1999 NBER patent data files (2002: Hall, Jaffe, and Trajtenberg version)
Chilean IP and firm data (1995-2005)
PatentsView (USPTO parsed data 1976 and later, with inventor/assignee disambiguation, inventor gender, and more)
PATSTAT global documents, curated by EPO + OECD. Highly recommended.
Google Patents Public Datasets (worldwide bibliographic and USPTO full-text, available via BigQuery)
Match of US patents to CRSP, 1926-2010 (Kogan et al.)
Match of EPO data to European firm (and R&D) data (Grid Thoma)
Also has matches with US Amadeus data, and some trademark information
Japanese Patent Data from IIP - 9M applications + 2.7M registrations, from the early 20th century to 2004. Includes citations and owner information
Match of Chinese patent data to firm names, 1998-2009 (He, Tong, & Zhang) - Separate files for utility, invention, and design patents.
Harvard Business School Patent Dataverse - Name disambiguation of US inventors, 1975-2010. (Lai, D'Amour, & Fleming)
In addition to the relianceonscience snapshot of the Microsoft Academic Graph, here are other sources for scholarly-graph data:
via Semantic Scholar (Kohlmeier et al.)
Open Research Corpus: articles, citations, citation types, and extracted entities for a deduplicated superset of the MAG data (175M papers in the final set). Available for bulk download via an S3 endpoint.
via ma-graph.org (Michael Färber)
RDF dump files of the Microsoft Academic Knowledge Graph (last: 12/2018)
URI resolution of MAKG within Linked Open Data.
HTML page descriptions of resources in the graph
Entity embeddings for all papers in the graph