I³ members have been working toward developing and collaboratively refining public datasets, and sharing methods and best practices. In addition to individually supported data projects, this community has been anchored by regular meetings to share work, methods, and data sources.
The first meeting of the I³ Technical Working Group was held in Cambridge, MA on Dec 6-7, 2019. Roughly 65 people attended. We heard presentations on the specific data construction efforts being funded by the project, as well as from others working on disambiguation of company patent ownership, the challenges of working with Chinese patent data, race and ethnicity of inventors, analysis of patent texts, and measurements of novelty and impact. We had wide-ranging discussions, both in the formal session and in informal conversations, of how the Initiative can best work to support effective data sharing. These discussions are being followed up with ongoing proposals.
The convening generated an initial mailing list of 60 interested contributors, and legal scholars studying intellectual property have since joined. This summer, a series of online discussions around data sharing will help community members troubleshoot making their research data public. Our next full meeting will be at the end of 2020. The plan for a one-day research workshop as part of the 2020 NBER Summer Institute was disrupted by the pandemic. The Summer Institute is being held virtually, with fewer presentations. While part of the Innovation program on July 14-15 will consist of I³-related papers, the virtual format does not incur any costs that will be charged to the grant. The 70K budgeted for this meeting will therefore be held for a future meeting, likely in 2021. In addition, the NBER has announced a decision to have no in-person meetings through the calendar year 2020. We will therefore have a virtual meeting (or series of meetings) of the Technical Working Group in the fall of 2020, which means that the 35K in the second-year budget for a TWG meeting will also be saved and pushed off until the third year of the project.
The I³ website is gathering models for sharing data to spur cumulative research, which involve capturing context, comments, and updates beyond simply providing data for replication of results in a published article. We are maintaining an index of data and publications produced by members, and compiling case studies on data-sharing workflows. An I³ dataverse will preserve a copy of most of the data published by the collaborative, while the largest files, such as those used by Reliance on Science, are being hosted on Zenodo.
For reasons unrelated to I³, YarnLabs has decided to cease operations in the near future. We therefore request that the Sloan contract with YarnLabs be terminated at the end of the first year of the project. We have made arrangements with Code for Science and Society (‘CSS’) to take over as the primary grantee for the project. The budget for the final year of the project with CSS will be exactly the same as the final year of the originally proposed project, and the subcontracts from CSS to BU, NBER, Lens.org and EPFL will be exactly the same as in the original proposal.
A first version of the Lens Lab portal is available at https://staging.lens.org/lens/labs. It highlights relevant patent datasets, points visitors to related open innovation data, and uses open, granular metrics to explore the influence of science and technology on society. It is currently being tested by researchers. The first set of data dumps of bulk patent and cited data has been made available through the portal. The Lab will link users to all Lens API & data facilities, to the MIT bulk patent and scholarly works datasets and associated schemas, and to example dashboards.
New data sources and APIs: A new patent data architecture and API for the Lens has been finalized and tested, with a beta version scheduled for release in June. We are aligning US, EP full-text and WIPO full-text data with a common data model. New prior art data and INPADOC data are being reviewed for inclusion, including changes in patent ownership and legal status.
US full-text dataset from 2018, training datasets and other quality control datasets. (NC)
COVID data: 35+ patent and scholarly datasets, including biological sequences disclosed in patents.
Patent full text (stripped of HTML) for “Nanotechnology” US patents, used in the project Patent Disclosure - An Economic Analysis Using Computational Linguistics. (Nancy Kong and Adam Jaffe)
A public innovation dataverse (12/2019), webinar on prior art and patent data to IPOS (3/2020)
The Scaling Science project uses machine learning algorithms to find patterns of collaboration and interest in citation and co-authorship graphs. It is designing related metrics to estimate impact and influence over time, to help identify new ideas and collaborations with high potential. A demonstration site is available for biology; this will later be extended to all fields. Metrics are calculated for each year after the publication of a work, so that one can see how the perceived impact changes over time. These metrics will themselves be available for download and reference.
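As an illustration of what a per-year metric looks like, the sketch below computes a cumulative citation count for each year after publication. It is a minimal example under stated assumptions, not the project's pipeline: the function name, its parameters, and the sample years are all hypothetical, and raw citation count stands in for whichever metric is of interest.

```python
from collections import Counter


def citations_by_year_since_publication(pub_year, citing_years, horizon):
    """Cumulative citation counts for years 0..horizon after publication.

    pub_year     -- year the work was published (hypothetical input)
    citing_years -- publication years of the works that cite it
    horizon      -- number of years after publication to report
    """
    # Count citations by offset from the publication year,
    # ignoring any citing year recorded before publication.
    per_offset = Counter(y - pub_year for y in citing_years if y >= pub_year)

    cumulative = []
    running_total = 0
    for offset in range(horizon + 1):
        running_total += per_offset.get(offset, 0)
        cumulative.append(running_total)
    return cumulative


# Hypothetical work published in 2010 and cited in the years listed.
trajectory = citations_by_year_since_publication(
    2010, [2010, 2011, 2011, 2013, 2015], horizon=5
)
print(trajectory)
```

Storing one value per year, rather than a single total, is what makes it possible to see how perceived impact changes over time.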
The team used existing data structures and database formats to build a Neo4j graph database (with citation clusters visualized to the right) to calculate graph-based metrics from scratch. In early tests it efficiently calculates ArticleRank, PageRank, node2vec and h-index measures, with existing Neo4j plugins and pipelines.
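For readers unfamiliar with these measures, the sketch below shows two of them, h-index and PageRank, in plain Python on a toy citation graph. It is an illustration only: the project computes its metrics with Neo4j plugins, and the function names, the damping value of 0.85, the iteration count, and the sample graph are assumptions made for this example.

```python
def h_index(citation_counts):
    """Largest h such that h works have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (citing, cited) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Works that cite nothing spread their rank evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for citing in nodes:
            targets = out_links[citing]
            for cited in targets:
                new_rank[cited] += damping * rank[citing] / len(targets)
        rank = new_rank
    return rank


# Toy graph: B and C cite A, and C also cites B.
toy_edges = [("B", "A"), ("C", "A"), ("C", "B")]
ranks = pagerank(toy_edges)
top_paper = max(ranks, key=ranks.get)
author_h = h_index([10, 8, 5, 4, 3])
```

On a graph of real size, a graph database or a dedicated library would replace these loops. The sketch is only meant to make the definitions concrete.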
A visualization framework is being built now to compare traditional and new metrics calculated from the Scaling Science graph, with a focus on prediction of impact. Scaling Science explored over 40 other impact metrics from the literature, to compare their predictive capacity.
This analysis was first done with Microsoft Academic Graph [MAG] data, then repeated with Dimensions data (with little improvement), and finally with Lens data (better coverage of edge cases, updated and reconciled using Wikidata and Lens IDs). Future visuals will use the latter. A Python framework was developed for efficient access to Lens data via its API.
Presentation: Open Innovation Metrics
The Reliance on Science project has built out its own small website and already enjoys wide reuse. Its dataset advances patent-to-paper citations in two respects: by enhancing existing front-page matches from previous work, and by curating a set of full-text matches (i.e., from the body text of patents). Each link includes the patent number, paper identifier, applicant/examiner flag, confidence score (1-10), and whether the reference was in the front page, the body text, or both.
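The link record described above can be pictured as a small typed structure. The sketch below is one possible representation, not the published file schema: the class, field, and function names are hypothetical, as are the sample identifiers and the confidence threshold of 8.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    """Where in the patent the reference appears."""
    FRONT_PAGE = "front"
    BODY = "body"
    BOTH = "both"


@dataclass(frozen=True)
class PatentPaperLink:
    patent_number: str   # hypothetical field names throughout
    paper_id: str        # e.g. a MAG or PubMed identifier
    cited_by: str        # "applicant" or "examiner"
    confidence: int      # 1 (lowest) to 10 (highest)
    location: Location

    def __post_init__(self):
        if not 1 <= self.confidence <= 10:
            raise ValueError("confidence must be between 1 and 10")
        if self.cited_by not in ("applicant", "examiner"):
            raise ValueError("cited_by must be 'applicant' or 'examiner'")


def high_confidence(links, threshold=8):
    """Keep only links at or above a chosen confidence score."""
    return [link for link in links if link.confidence >= threshold]


# Made-up records for illustration only.
sample_links = [
    PatentPaperLink("US0000001", "paper-1", "applicant", 10, Location.FRONT_PAGE),
    PatentPaperLink("US0000001", "paper-2", "examiner", 4, Location.BODY),
    PatentPaperLink("US0000002", "paper-3", "applicant", 8, Location.BOTH),
]
reliable = high_confidence(sample_links)
```

Filtering on the confidence score lets a user trade coverage against accuracy to suit the analysis at hand.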
Front-page matches have been updated through the end of 2019, as written up in Strategic Management Journal. The matches have been posted at relianceonscience.org with full documentation and downloaded roughly 10,000 times. Matches are now available for both the Microsoft Academic Graph [MAG] and for PubMed.
Full-text matches were curated and presented at the December working group meeting and also posted to Reliance on Science in beta status. This spring, a team of six research assistants has been harvesting known-good matches from a random sample of 9,000 patents (oversampling on the 1800s and pre-1975 OCR era, with double RA coverage). They hope to have an estimate of false positives and false negatives for full-text matches by the next update of the data.
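Once the known-good matches are in hand, the error estimate reduces to comparing two sets of patent-paper pairs. The sketch below shows that comparison in its simplest form. It is not the team's procedure: the function name and sample pairs are hypothetical, and it ignores the sampling weights that the oversampled eras would require.

```python
def match_error_rates(predicted, known_good):
    """Compare algorithmic matches against hand-verified matches.

    Both arguments are iterables of (patent, paper) pairs drawn from
    the same sample of patents.
    """
    predicted = set(predicted)
    known_good = set(known_good)

    true_positives = len(predicted & known_good)
    false_positives = len(predicted - known_good)   # matched, but wrong
    false_negatives = len(known_good - predicted)   # real, but missed

    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        # Share of reported matches that are wrong.
        "false_positive_share": (
            false_positives / len(predicted) if predicted else 0.0
        ),
        # Share of real references that were missed.
        "false_negative_share": (
            false_negatives / len(known_good) if known_good else 0.0
        ),
    }


# Made-up pairs for illustration only.
algorithm_matches = [("p1", "a"), ("p1", "b"), ("p2", "c")]
verified_matches = [("p1", "a"), ("p2", "c"), ("p2", "d"), ("p3", "e")]
rates = match_error_rates(algorithm_matches, verified_matches)
```

With double RA coverage, the verified set could be restricted to pairs on which both assistants agree before the rates are computed.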
All computation was done on the Boston University Shared Computing Cluster. Reliance on Science has been developed as a model for data sharing, with a companion monograph on how to capture data structure and process as well as the result set. We continue to share standards for publishing data schemas, documenting data process, and effective use of repositories.
The team is making the data platform for IPRoduct collaborative, in order to use crowd contributions to keep up with the changing data landscape. Platform credits for downloading the most recent data will be offered to encourage user participation. A private beta of the classifier was released in April 2020, with 1500 inputs from a dozen beta users.
Samuel Arnod-Prin began developing the platform in January, and the coming months will see development of user training and user tasks, including web page classification, and data enrichment for firms (firm size, creation year, industry) and products (product price, product codes). The platform is currently hosted on EPFL’s internal network, but a public beta release is planned for July, when data export options will be made available.
Collaborations: The data has been used by the Novo Nordisk Foundation (NNF) to track the commercial uses of research findings they sponsor. The team has shared subsets of data with U.S. and European scholars (Jonathan Ashtor and Carolina Castaldi) for their research. Insurance and patent analytics companies have also indicated interest.
Linking Patents to Products (public talk, Dec 2019)
de Rassenfosse, G., & Gruber, M. (2019). Technology search and the two faces of appropriability: An empirical study in the medical device industry. Under review.
de Rassenfosse, G., & Higham, K. (2020). Wanted: A Standard for Virtual Patent Marking. Journal of Intellectual Property Law & Practice, accepted for publication.