The first meeting of the I³ Technical Working Group was held in Cambridge, MA on December 6-7, 2019. Roughly 65 people attended. We heard presentations on the specific data construction efforts being funded by the project, as well as from others working on disambiguation of company patent ownership, the challenges of working with Chinese patent data, race and ethnicity of inventors, analysis of patent texts, and measurements of novelty and impact. We had wide-ranging discussions, both in the formal session and in informal conversations, of how the Initiative can best work to support effective data sharing.
The convening generated an initial mailing list of interested contributors and legal scholars studying intellectual property. A series of online discussions around data sharing will help community members troubleshoot making their research data public. Our next full meeting will be at the end of 2020. We will hold a virtual meeting (or series of meetings) of the Technical Working Group in the fall of 2020.
The I³ website is gathering models for sharing data to spur cumulative research, which involve capturing context, comments, and updates beyond simply providing data for replication of results in a published article. We are maintaining an index of data and publications produced by members, and compiling case studies on data-sharing workflows. An I³ dataverse will preserve a copy of most of the data published by the collaborative, while the largest files, such as those used by Reliance on Science, are being hosted on Zenodo.
A first version of the Lens Labs portal is available at https://www.lens.org/lens/labs. It highlights relevant patent datasets, points visitors to related open innovation data, and uses open, granular metrics to explore the influence of science and technology on society. It is currently being tested by researchers. The first set of data dumps of bulk patent and cited data has been made available through the portal. The portal will link users to all Lens API and data facilities, to the MIT bulk patent and scholarly works datasets and associated schemas, and to example dashboards.
New data sources and APIs: A new patent data architecture and API for the Lens has been finalized and tested, with a beta version scheduled for release in June. We are aligning US, EP full-text and WIPO full-text data with a common data model. New prior art data and INPADOC data, including changes in patent ownership and legal status, are being reviewed for inclusion.
Datasets
MIT scholarly works (1950-2018) (372 MB); MIT scholarly works cited by patents (1950-2018) (87.6 MB); Patents citing MIT works (1950-2018); MIT Draft Patent Portfolio
US full-text dataset from 2018, training datasets and other quality control datasets. (NC)
COVID data: 35+ patent and scholarly datasets, including biological sequences disclosed in patents.
Patent full text (stripped of HTML) for “Nanotechnology” US patents, used in the project Patent Disclosure - An Economic Analysis Using Computational Linguistics (Nancy Kong and Adam Jaffe).
Presentations
A free and open platform for science and technology mapping (10/2019)
A public innovation dataverse (12/2019)
Webinar on prior art and patent data to IPOS (3/2020)
The Scaling Science project uses machine learning algorithms to find patterns of collaboration and interest in citation and co-authorship graphs. It is designing related metrics to estimate impact and influence over time, to help identify new ideas and collaborations with high potential.
A demonstration site is available for biology; this will later be extended to all fields. Metrics are calculated for each year after the publication of a work, so that one can see how the perceived impact changes over time. These metrics will themselves be available for download and reference.
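As a rough illustration of a metric tracked for each year after publication, the sketch below computes a cumulative citation count per year since a work appeared. It is not the Scaling Science implementation; the function name, its parameters, and the sample years are assumptions made up for this example.

```python
from collections import Counter


def citations_by_year_since_publication(pub_year, citing_years, horizon):
    """Cumulative citation counts for years 0..horizon after publication.

    pub_year, citing_years and horizon are hypothetical inputs for this sketch.
    Citations dated before the publication year are ignored.
    """
    counts = Counter(y - pub_year for y in citing_years if y >= pub_year)
    cumulative = []
    total = 0
    for offset in range(horizon + 1):
        total += counts.get(offset, 0)
        cumulative.append(total)
    return cumulative


# Invented example: a 2010 work cited in 2010, 2011 (twice), 2013 and 2015.
series = citations_by_year_since_publication(
    2010, [2010, 2011, 2011, 2013, 2015], horizon=5
)
print(series)  # [1, 3, 3, 4, 4, 5]
```

Storing one value per year, rather than a single total, is what lets a reader see how the measured impact of a work changes over time.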
The team used existing data structures and database formats to build a Neo4j graph database (with citation clusters visualized to the right), so that graph-based metrics can be calculated from scratch. In early tests the database efficiently calculates ArticleRank, PageRank, node2vec and h-index measures, using existing Neo4j plugins and pipelines.
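To make two of these measures concrete, here is a minimal pure-Python sketch of PageRank (by power iteration) and the h-index on a toy citation graph. It does not use Neo4j or its plugins and is not the project's code; the function names, the damping factor of 0.85, the iteration count, and the three-paper graph are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (citing, cited) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node with no outgoing citations spreads its rank evenly.
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


def h_index(citation_counts):
    """Largest h such that h works have at least h citations each."""
    h = 0
    for i, c in enumerate(sorted(citation_counts, reverse=True), start=1):
        if c < i:
            break
        h = i
    return h


# Invented toy graph: B cites A; C cites A and B.
ranks = pagerank([("B", "A"), ("C", "A"), ("C", "B")])
print(max(ranks, key=ranks.get))   # A
print(h_index([10, 8, 5, 4, 3]))   # 4
```

In a graph database the same quantities would typically come from library procedures rather than hand-written loops; the sketch only shows what is being computed.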
A visualization framework is being built now to compare traditional and new metrics calculated from the Scaling Science graph, with a focus on prediction of impact. Scaling Science explored over 40 other impact metrics from the literature, to compare their predictive capacity.
This analysis was first done with Microsoft Academic Graph [MAG] data, then repeated with Dimensions data (with little improvement), and finally with Lens data (better coverage of edge cases, updated and reconciled using Wikidata and Lens IDs). Future visuals will use the Lens data. A Python framework was developed for efficient access to Lens data via its API.
Presentation: Open Innovation Metrics
The Reliance on Science project has built out its own small website and already enjoys wide reuse. Its dataset advances patent-to-paper citations in two respects: by enhancing existing front-page matches from previous work, and by curating a set of full-text matches (i.e., from the body text of patents). Each link includes the patent number, paper identifier, applicant/examiner flag, confidence score (1-10), and whether the reference was in the front page, the body text, or both.
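The fields listed above could be modeled along the following lines. This is a sketch only: the class, field names, and sample values are hypothetical and do not reflect the actual column names or file layout of the published dataset.

```python
from dataclasses import dataclass
from enum import Enum


class Location(Enum):
    FRONT = "front"
    BODY = "body"
    BOTH = "both"


@dataclass(frozen=True)
class PatentPaperLink:
    """One patent-to-paper citation (all field names are hypothetical)."""
    patent_number: str
    paper_id: str
    cited_by: str        # "applicant" or "examiner"
    confidence: int      # 1 (lowest) to 10 (highest)
    location: Location

    def __post_init__(self):
        if not 1 <= self.confidence <= 10:
            raise ValueError("confidence must be between 1 and 10")
        if self.cited_by not in ("applicant", "examiner"):
            raise ValueError("cited_by must be 'applicant' or 'examiner'")


def filter_links(links, min_confidence):
    """Keep only links at or above a chosen confidence threshold."""
    return [link for link in links if link.confidence >= min_confidence]


# Invented sample records, not real data.
links = [
    PatentPaperLink("US0000001", "paper-1", "applicant", 10, Location.FRONT),
    PatentPaperLink("US0000001", "paper-2", "examiner", 4, Location.BODY),
    PatentPaperLink("US0000002", "paper-3", "applicant", 8, Location.BOTH),
]
print(len(filter_links(links, min_confidence=8)))  # 2
```

Filtering on the confidence score is one way a user might trade coverage for accuracy when reusing the links.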
Front-page matches have been updated through the end of 2019, as written up in Strategic Management Journal. The matches have been posted at relianceonscience.org with full documentation and have been downloaded roughly 10,000 times. Matches are now available for both the Microsoft Academic Graph [MAG] and PubMed.
Full-text matches were curated and presented at the December working group meeting and also posted to the Reliance on Science site in beta status. This spring, a team of six research assistants has been harvesting known-good matches from a random sample of 9,000 patents (oversampling on the 1800s and pre-1975 OCR era, with double RA coverage). They hope to have an estimate of false positives and false negatives for full-text matches by the next update of the data.
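One simple way to turn hand-harvested matches into false-positive and false-negative estimates is to compare the algorithm's matches against the known-good set for the same sampled patents. The sketch below assumes both are available as sets of (patent, paper) pairs; the function name and sample pairs are invented, and it ignores the reweighting that an oversampled design would need.

```python
def error_rates(predicted, known_good):
    """Compare predicted matches with hand-verified matches.

    Both arguments are sets of (patent, paper) pairs restricted to the
    sampled patents. Returns raw counts plus the share of predicted
    matches that are wrong and the share of true matches that were missed.
    """
    tp = len(predicted & known_good)   # matches confirmed by hand
    fp = len(predicted - known_good)   # predicted but not confirmed
    fn = len(known_good - predicted)   # confirmed but not predicted
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "false_positive_share": fp / (tp + fp) if tp + fp else None,
        "false_negative_share": fn / (tp + fn) if tp + fn else None,
    }


# Invented example pairs, not real data.
predicted = {("P1", "a"), ("P1", "b"), ("P2", "c")}
known_good = {("P1", "a"), ("P2", "c"), ("P2", "d")}
result = error_rates(predicted, known_good)
print(result["false_positives"], result["false_negatives"])  # 1 1
```

Because the sample oversamples older patents, per-stratum rates would have to be weighted back to the full population before being reported as overall estimates.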
All computation was done on the Boston University Shared Computing Cluster. Reliance on Science has been developed as a model for data sharing, with a companion monograph on how to capture data structure and process as well as the result set. We continue to share standards for publishing data schemas, documenting data process, and effective use of repositories.
The team is making the data platform for IPRoduct collaborative, in order to use crowd contributions to keep up with the changing data landscape. Platform credits for downloading the most recent data will be offered to encourage user participation. A private beta of the classifier was released in April 2020, with 1,500 inputs from a dozen beta users.
Samuel Arnod-Prin began developing the platform in January, and the coming months will see development of user training and user tasks, including web page classification, and data enrichment for firms (firm size, creation year, industry) and products (product price, product codes). The platform is currently hosted on EPFL’s internal network, but a public beta release is planned for July, when data export options will be made available.
Collaborations: The data has been used by the Novo Nordisk Foundation (NNF) to track the commercial uses of research findings they sponsor. The team has shared subsets of data with U.S. and European scholars (Jonathan Ashtor and Carolina Castaldi) for their research. Insurance and patent analytics companies have also indicated interest.
Publications:
Linking Patents to Products (public talk, Dec 2019)
de Rassenfosse, G., & Gruber, M. (2019). Technology search and the two faces of appropriability: An empirical study in the medical device industry. Under review.
de Rassenfosse, G., & Higham, K. (2020). Wanted: A Standard for Virtual Patent Marking. Journal of Intellectual Property Law & Practice, accepted for publication.