
2023 Technical Working Group Meeting

December 2023 meeting of the I3 TWG

Published on Dec 01, 2023

December 1-2, Cambridge MA
Agenda (NBER) | YouTube recording
All TWG meetings: 2019 | 2020 | 2021 | 2022 | 2023

Matching Scientists and Inventors

Lee Owen Fleming, University of California, Berkeley
Matt Marx, Cornell University and NBER
Emma Scharfmann, University of California, Berkeley


If anyone knows about the bump of Chemistry papers and gatekeepers around 2010, let me know.
A: That's an OpenAlex error from using the date of publication, not the date of invention; that's the date when lots of documents were published online.

We use the term ‘gatekeepers’ for someone who both publishes and patents. Also 'scientist-inventor'.
Comment: Maybe not a good term for you, esp. seeing the geographic patterns; the opening slide makes me think of centrality and eigenvectors; these aren't gatekeepers of info flow. It also feels normative.

Q: for many East Asian countries, there's an oversupply of grad students, just part of the culture that you'll overeducate a bit [then students will go to firms?] 
A: I never connected that to the Asian angle until yesterday; that makes a lot of sense.

Q: You started from paper-patent pairs, where Matt started. I'm guessing that the vast majority of output is not actually such pairs; it's people doing both. Have you looked at that at all?
A: that's my intuition also, but haven't looked.  

Q: you have the # of patents and papers, maybe you can classify based on the distribution?  some write a lot and patent once, some patent a lot and write once.  impact of each one?  
A: One slide that didn't make it is a pie chart: gatekeepers seem to be prolific scientists; more than 50 papers.
A: Old work on university patenting shows that's true. Profs who patent are on average the most productive.

Q: In general you’re looking at a slew of papers/patents by the same person?
A: We start with patent/paper pairs, then look at their ORCID papers to expand our training set. That’s where we get model numbers. But we didn’t look at accuracy after we’re done with clustering, which is a headache.

Adam: Thanks all! Looking forward to working with the data when it comes out next summer.

The Commercial Potential of Science & its Realization (Evidence from a Measure Using an LLM)

Sharique Hasan, Duke University
Roger Masclans Armengol, Duke University
Wesley M. Cohen, Duke University and NBER

We're building here on the work and data of people in this room, particularly Reliance on Science; thanks for this!

Our work studies ex ante commercial potential. We want to measure commercial potential w/ ML, making forward looking predictions given data we have now.

We realize that commercial potential is a boogeyman.  Some of the questions we have: Are there gaps or biases against women or minorities from certain regions, or specific regions where research is published? 

Outcome: scientific finding, cited in a patent that is renewed.
Predictor: knowledge contained in article abstracts alone.
Method: relying on SciBERT embeddings for abstracts, using NN with predictors as embeddings. We also use two secondary measures of potential: academic cites vs patent cites, and social impact.

We develop one model per year, using citations up to the year prior to the prediction year. This provides rolling predictions.
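As a rough illustration of the rolling setup, here is a minimal Python sketch. The record layout (dicts with a `year` key) and the function names are assumptions made for the example, not the authors' pipeline; the only point is that each yearly model sees citations observed strictly before its prediction year.

```python
from typing import Dict, List, Tuple


def rolling_windows(first_year: int, last_year: int) -> List[Tuple[int, int]]:
    """Return (prediction_year, last_training_year) pairs, one per yearly model."""
    return [(year, year - 1) for year in range(first_year, last_year + 1)]


def training_citations(citations: List[Dict], prediction_year: int) -> List[Dict]:
    """Keep only citations observed strictly before the prediction year."""
    return [c for c in citations if c["year"] < prediction_year]


if __name__ == "__main__":
    # Hypothetical citation records, used only to show the cutoff.
    cites = [
        {"paper": "p1", "year": 2004},
        {"paper": "p2", "year": 2005},
        {"paper": "p3", "year": 2006},
    ]
    for pred_year, train_until in rolling_windows(2005, 2007):
        usable = training_citations(cites, pred_year)
        print(pred_year, train_until, [c["paper"] for c in usable])
```

Filtering on `year < prediction_year` is what keeps the predictions forward-looking: no model is trained on outcomes from the year it is asked to predict.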

Example: Nobel prize research papers…

Our model accuracy was around 0.74 (both AUROC and accuracy).
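For readers less familiar with the two metrics, the sketch below computes both on a toy example. The data are invented and the 0.5 threshold is an assumption. AUROC is the share of (positive, negative) pairs that the model ranks correctly; accuracy depends on a chosen cutoff, so the two need not coincide.

```python
from typing import List


def auroc(labels: List[int], scores: List[float]) -> float:
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels: List[int], scores: List[float], threshold: float = 0.5) -> float:
    """Share of examples classified correctly at a fixed score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


if __name__ == "__main__":
    # Toy data: 1 = cited in a renewed patent, 0 = not (hypothetical values).
    y = [1, 0, 1, 0]
    s = [0.9, 0.8, 0.4, 0.1]
    print(auroc(y, s), accuracy(y, s))
```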

To test this outside of our training set, we started with invention disclosure and outcomes from a TTO at a leading research uni. Matched inventions to articles they rely on. From the resulting set of 96k pubs we could match 13k pubs to 2700 inventions, w/ a median of 2 papers per invention.

The model picks up specifics from the article that suggest future use in patents independent of things like h-index and fixed effect for field and year.

We used the model to estimate whether it gets disclosed, cited in a patent, &c.

Conditional on disclosure, we looked at whether there is investment, patents filed, licensed, &c. Conditional on investment, we looked at whether this model’s predictions continued to predict future stages of commercial success.

We also compared the impact of TTO involvement, and compared across different institutions. About 60% of the variation in patent citations of papers can be explained by mean ex-ante commercial potential for that institution.

Noting limitations: many contributions reach the market without associated patents. This may only partially capture potential due to indirect paths to commercialization. Long time horizons are often required before scientific contributions are embodied in new products.

Q: Do public university limitations on self-funding limit the potential that would be illustrated here? They may have high potential science but are restricted (via TTO options) in how they can make profitable investments. (from the UC perspective)
A: That’s a great question, we’d like to use our model to consider this

Q: Maybe you can look at attributes that characterize good ideas that haven’t been commercialized. That’s the question really on the table: what is undiscovered, what characteristics are undiscovered?
A: We’re looking more generally at determinants of what we call the “realization gap”.

Q: Long ago I made a credit-scoring algorithm re: whether startups would fail within 4 years. I talked to a lot of banks, and partners continued to work on this after. The AUC was just like yours. This algorithm didn’t commercialize. Our problem was: even though AUC is 0.74, what’s the value of that signal? If it’s too noisy, even if it’s fairly good, that doesn’t help the user. Thoughts on what success you need for this to get used?
A: Good question. Also how does this compare to human evaluators’ AUC? We can get away with this for a research exercise.

Q: I love the scalability here, you can see the whole field. How does this work on subfields? Compare to fine-graining subfields, where you have easy breakdown by schools, topic area, &c. Are there fields where this is a significant improvement?
A: We are thinking about this. We’ve done field-specific models; you get some extra juice; in some you can get clear separation. Physics, clear difference between basic physics and quantum.

Q: Did you look qualitatively at the embeddings to see how much the model is picking up on language suggesting the author is thinking about patenting or preparing for a patent? Which may be separable from the commercial potential of the idea.
A: We looked at whether asking ChatGPT to make abstracts “sound more like a patent” affected the model prediction; it only had 2-3% of an effect. But things that did have an effect were: indications of tackling a big vs narrow problem, something likely to affect a lot of people, [subjects related to a commercial product or service. See note above re: theory vs applied physics]

Q: What about more path-breaking innovations? Things w/ long gestation periods?

Q: Did you consider whether you’re picking up on terms indicating where the field is going, so that those mentioning that are more likely to be cited in patents? Could this lead to the high predictions for papers that were later Nobel Prizes?
A: That’s interesting, want to think it through clearly. Possibly

This data will be public, yes?
A: Yes, soon!

New Facts and Data about Professors and their Research

Kyle R. Myers, Harvard University
Wei Yang Tham, Harvard University

Context: Discourse on “Science! Does it work, is it broken? Will we ever know?” Since we don’t have prices, or traditional labor and capital, we have a wide range of inputs and outputs and payoffs and preferences. So we make lots of assumptions about these things to analyze them.

Challenge: databases we’re using were created for scientists advancing science, not to evaluate or study science. We want to subjectively measure un-observable things, focus on breadth over depth, across fields. Consider this a potpourri of correlations. When you need a motivating fact, hopefully this dataset will have something for you.

Data selection: we emailed a subset of 260k people to get data.

~ ~ ~

Lots of areas to explore:

  1. Much more earnings variation is within-field compared to across-fields

  2. Institutions, ranks, tasks and sources are quite important for earnings

  3. Research output, as standardly used in meta-science, is not very important for earnings

  4. Pubs-per-year is a mediocre proxy for pubs-per-research-hour

  5. Research risk beliefs: fundraising, personal risks, and generating theories

  6. Life-cycle changes in research inputs and outputs, but not audiences

  7. ‘Edisons’ tend to take more personal risks and earn more than ‘Bohrs’

6:45 pm Dinner

Saturday, December 2

Index and Validation Datasets

Agnes Cameron, Knowledge Futures

The I³ open innovation data index turns 2 this year. You can see and browse it at, and add new data or edit metadata directly on github.

Validation datasets are tools in their own right, and we’ve started to gather them. Please publish yours. Many papers have open data but don’t publish their validation data. Talking to authors, people either feel they are very valuable and are wary of publishing, or on the other end of the spectrum don’t think they would be useful to anyone else.

Existing models for this in machine learning are robust, in part because dataset creation and validation are a central feature of sharing research, and everyone appreciates the release of validated datasets and the chance to reuse them. So there is a lot of infrastructure for this: 🤗 (Hugging Face), Kaggle, Papers with Code. Hugging Face in particular shows you which tasks projects and models are used for.

Current gaps: projects that make heavy use of validation data and rarely publish those datasets. How can we credit and promote what’s happening? What validation data do we want to bring into being? Contributions welcome!

Q: Do we have a sense of how many people use validation datasets vs other datasets? [other than their own]
A: This comes into its own once there is a norm around sharing and reusing them. Rarely considered something to credit. People currently just use it in the construction of new datasets, more than in reevaluating existing ones or shared between projects.

Q: Good to compare the ML models of credit for naming such things. But credit in CS has often been different from other fields. We need to think about how to adapt to the field norms. For instance: creating a dataset paper alongside the dataset helps [for NBER data]; something like that here could be important for this community.

Logic Mill - A Knowledge Navigation System

Erik Buunk, Max Planck Institute for Innovation and Competition
Sebastian Erhardt, Max Planck Institute for Innovation and Competition
Mainak Ghosh, Max Planck Institute for Innovation and Competition
Dietmar Harhoff, Max Planck Institute for Innovation and Competition
Michael E. Rose, Max Planck Institute for Innovation and Competition

Current status of Logic Mill: Beta w/ 8 API functions, 228M documents (200M from S2). 130 users from 30 institutions.

Road map: improving our language model, moving beyond our current 512-token limit, updating/expanding data sources (to include OpenAlex and PATSTAT). We also want to add API functionality and offer precomputed datasets.

Q: What about providing pairwise similarities for all ~50Q pairs?
A: [That might be hard.]

Q: what does it take to compile this, technically?
A: we need to store a lot of vectors in memory, we have > 1TB of RAM in our HPC. Thanks to Max Planck for this.
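A back-of-envelope calculation shows why both answers are plausible. Only the document count (228M) comes from the talk; the embedding dimension (768) and 4-byte floats are assumptions chosen for illustration, and reading “~50Q” as roughly 50 quadrillion ordered pairs is an interpretation rather than something stated.

```python
N_DOCS = 228_000_000   # document count reported in the talk
EMBED_DIM = 768        # assumption: embedding dimension, not stated in the notes
BYTES_PER_FLOAT = 4    # assumption: float32 storage


def ordered_pairs(n: int) -> int:
    """Number of (a, b) pairs, counting both directions and self-pairs."""
    return n * n


def unordered_pairs(n: int) -> int:
    """Number of distinct pairs {a, b} with a != b."""
    return n * (n - 1) // 2


def vector_store_bytes(n: int, dim: int, bytes_per_value: int) -> int:
    """Memory needed to hold one dense vector per document."""
    return n * dim * bytes_per_value


if __name__ == "__main__":
    pairs = ordered_pairs(N_DOCS)
    print(f"ordered pairs: {pairs:.3e}")
    print(f"unordered pairs: {unordered_pairs(N_DOCS):.3e}")
    print(f"similarity matrix at 4 bytes each: {pairs * 4 / 1e15:.0f} PB")
    vec_bytes = vector_store_bytes(N_DOCS, EMBED_DIM, BYTES_PER_FLOAT)
    print(f"vectors alone: {vec_bytes / 1e12:.2f} TB")
```

Under these assumptions the vectors alone take about 0.7 TB, in line with the reported need for more than 1 TB of RAM, while a full similarity matrix would run to hundreds of petabytes. That is one reason computing similarities on demand is more practical than precomputing them all.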

Q: Have you segmented the overall corpus and found any interesting subsets with shared structure?
A: I’ve looked at this a lot, for instance we looked at the patent families of single companies, including large ones like Siemens. There are definitely useful clusters that can be seen there with this tool; we haven’t done something like this across the entire corpus. Algorithms for this tend to take additional memory so we’d need even more compute.

DISCERN 2.0: Extending and Enhancing the DISCERN Dataset

Ashish Arora, Duke University and NBER
Sharon Belenzon, Duke University and NBER
Larisa C. Cioaca, Duke University
Lia Sheer, Tel-Aviv University
Dror Shvadron, Fuqua School of Business

Updates in 2.0: extended coverage to 2021, added R&D firms that don’t patent. Moved from ORBIS to SEC filings, from PATSTAT to PatentsView, added patent-level reassignments and pre-grant applications. Moved from Web of Science to OpenAlex

We’ve improved our matching algorithms and expanded the doc-family by moving to OpenAlex. We used raw affiliation strings from the PDFs (81M uniques!) and not the normalized data in OA — for data quality reasons?

We’re using gen AI models every day, tried to feed these strings into an LLM to get out the firm name. We took the raw string, asked a model to guess the firm, and get back a clean name to aggregate. We tried the same for entity resolution.

We’re using Llama and vLLM for this clean-up.
There is a cascade of cleanup steps for OpenAlex matching.
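One way to picture such a cascade is sketched below. This is not the DISCERN pipeline: the suffix list, the `known_firms` lookup table, and the `llm_guess` callable are all hypothetical stand-ins. The LLM step is deliberately left as a pluggable function, so the sketch makes no claims about any particular model or serving API.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical list of legal suffixes to strip before comparing names.
LEGAL_SUFFIXES = re.compile(r"\b(inc|corp|corporation|ltd|llc|gmbh|co)\b\.?", re.IGNORECASE)


def normalize(raw: str) -> str:
    """Lowercase, drop legal suffixes and punctuation, collapse whitespace."""
    text = LEGAL_SUFFIXES.sub(" ", raw.lower())
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def resolve_firm(
    raw: str,
    known_firms: Dict[str, str],
    llm_guess: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Try cheap steps first; fall back to a model guess only when they fail."""
    key = normalize(raw)
    # Step 1: exact match on the normalized string.
    if key in known_firms:
        return known_firms[key]
    # Step 2: a known firm name appears as whole words inside the affiliation.
    for norm_name, canonical in known_firms.items():
        if norm_name and re.search(rf"\b{re.escape(norm_name)}\b", key):
            return canonical
    # Step 3: ask the (hypothetical) LLM for a firm name, then re-normalize it.
    if llm_guess is not None:
        guess = llm_guess(raw)
        if guess:
            return known_firms.get(normalize(guess), guess)
    return None


if __name__ == "__main__":
    firms = {"ibm": "IBM"}
    print(resolve_firm("IBM Research, Yorktown Heights, NY", firms))
    # A stub stands in for the model call here.
    print(resolve_firm("Intl. Business Machines Corp.", firms, llm_guess=lambda s: "IBM"))
```

Ordering the steps from cheapest to most expensive means the model is only called on strings that the deterministic rules cannot resolve, which matters at the scale of tens of millions of unique strings.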

Q: Can you ask OpenAlex to do that [themselves]?
Audience to heckler: no, but you can :)

Q: You mention looking at ownership changes. Are you looking at other things such as name changes, across renewals, and publications following assignment?

Q: Do you also track sales, as part of tracking ownership changes?
A: We do look at that. [to the extent data exists]
Adam: USPTO asks firms to report, but it’s not obligatory. I don’t know if we have estimates of how complete the reassignment file is. Many of us have said for years the law should be changed so that you have to tell the PTO for assignment to be enforceable. I’ve said publicly that most of the things we argue about have two sides; here there’s only one thing to do.

Q: I wonder about the reliability of the LLM system you used. That seems like it should depend on the quality of data used; how did you check its validity?
A: We did look at it, can discuss it more after.

Q: You said you’re looking at pre-grant pubs as well:
1) assignment is often missing for ungranted applications, how do you think about it? 2) if you look at European pairs, can you pick up that data anyway?
A: Interesting idea, thanks

10:00 am — Patenting Firms

Improving Patent Assignee-Firm Bridge with Web Search Results

Yuheng Ding, The World Bank
Karam Jo, Korea Development Institute
Seula Kim, Princeton University

Firm innovation is a major source of creative destruction and economic growth. We constructed a longitudinal patent-assignee-firm bridge between assignees and firms, using administrative data from Census. I’ll talk about data sources, matching methodology, and a practical example.

We link the USPTO to the Business Register and Longitudinal Business Database. Stages: name standardization, fuzzy-name matching, patent-firm crosswalk (a “search-aided bridge”: search within USPTO + extraction from internet searches). Search-aided matching accounts for 4-8.5% of total assignee and patent matches, provides a more stable bridge over a longer horizon, and can help study non-public firms that aren’t in the PTO data.

Example: looking at impact of Chinese competition on firm innovation.
We were able to use the bridge to help match with other sources like young firm data.

Q: Are there any characteristics of the new matches you were able to make?
A: Fuzzy name matching only fixes small character shifts. But significantly different formats, such as “IBM Corp” and “International Business Machines corp” benefit from this bridge.
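The point about fuzzy matching can be illustrated with a short sketch using Python's standard `difflib`. The suffix list, the 0.85 threshold, and the `aliases` table (standing in for names recovered from web search results) are assumptions for the example, not the authors' method.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional

# Hypothetical set of corporate suffixes dropped during standardization.
SUFFIXES = {"corp", "corporation", "inc", "incorporated", "co", "company", "ltd", "llc"}


def standardize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between standardized names."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


def bridge_match(assignee: str, firm: str, aliases: Dict[str, str],
                 threshold: float = 0.85) -> Optional[str]:
    """Return which stage links the two names, or None if neither does."""
    if similarity(assignee, firm) >= threshold:
        return "fuzzy"
    # Stand-in for the search-aided step: an alias learned from web results.
    if aliases.get(standardize(assignee)) == standardize(firm):
        return "search-aided"
    return None


if __name__ == "__main__":
    aliases = {"ibm": "international business machines"}  # hypothetical
    print(bridge_match("Micorsoft Corp", "Microsoft Corporation", aliases))
    print(bridge_match("IBM Corp", "International Business Machines corp", aliases))
```

A transposed letter still clears the similarity threshold, while an abbreviation shares almost no characters with the full name and is only linked through the alias table.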

Q: Any idea what fraction are private firms?
A: Not sure exactly, but one example might be young firms that wouldn’t be in the public firms list. These are ~30% of the total population.

Q: How does accuracy change over time? I can imagine internet search is more helpful recently than for firms in the 1980s.
A: We do look at this; nothing published yet, but we haven’t considered working this into the overall stats about matching-improvement.

Q: I understand this is set up for people who can access it through the Census system. Have you thought about a version of this that could be used by anyone online? A scaled-down version you could release publicly?
A: Absolutely. There’s a version of matched assignees that is totally public, that could be published.
Q: Aside from Compustat firms, what about small firms, to have names standardized within patent data? That would be valuable, I think. PatentsView has done some of this for name disambiguation. But this seems better (esp. for private firms). You might want to talk w/ someone in the Office of the Chief Economist, which owns PatentsView, to incorporate your better assignee-name disambiguation into their public data.
A: Great idea.

Q: (AT) a) Since you’re working in the Census Bureau, what’s the relationship b/t your work with the BR and what others are doing in CES and Census?
A: Work by Nathan and others starts at 2000, triangulating by inventors. He said he’s planning to accommodate all these methodologies, collecting existing bridges over multiple periods. That’s in progress within Census.
b) Impressive to see what you’re doing linking patent data to companies. Now we have lots of different versions around: DISCERN, PTO, NETS (nat’l time-series database) w/ D&B. It would be nice to know to what degree these datasets agree with one another about linkages. As this group wants to reduce duplication, when we have many varieties of the same linkage data going on, how can we support comparison across them?
A: That’s certainly part of the I3 Index, Agnes can speak to how to facilitate that comparison. It’s hard for dataset-generators who can’t access/read the Census alternatives.
c) You mentioned patent applications; that word has a specific meaning: a company might file 10 applications at one moment. Then over time, some are granted/allowed and become visible, some remain pending, often invisible but could be in pre-grant data, and some are abandoned. Those may never show up in public data, depending on whether that happens before publication at 18 months after the filing date. I think when you say ‘applications’ you mean “granted patents”. We’ve tried to link this via the lawyers, as lawyers are consistent over time for a company; that can be used to confirm that something abandoned is from a given firm.

The Government Patent Register: A New Lens on Historical U.S. Government-Funded Patenting

Daniel P. Gross, Duke University and NBER
Bhaven N. Sampat, Arizona State University and NBER

World War II was the first time we had a shock of government funding. A proliferation of funding agencies included PHS, DoD, AEC, NSF. We looked at this; ideally would have precise geographical and field division of funding.

We looked at government-funded patenting: traditional measures include - assignment to a government agency; those with a gov. interest statement, which links it back to a grant or contract. Those don’t work well historically, so we tracked down another: the government patent register.

Bayh-Dole in 1981 created a uniform license policy & required gov interest statements. …

We scanned and digitized the Register; at some point there’s a transition to a modern electronic register. Our records go through the mid-1990s. Based on this, the gov share of patents peaks during WWII, mostly DoD, then declines.
[Ex: Fermi patents]

One of the best sources for patent citations to science is the Fleming, et al. dataset; there’s also material since the 70s that isn’t in the register but is in the Fleming dataset, and we’re still looking into what led to that shift.

The way forward: we want to link the Register with work that Matt and Lee have done, and provide a modern register that stitches them together and irons out the creases. We’ve also thought about using words in patents, to see how those relate to different regimes.

Uses: as historical indicator of pub vs private R&D. but patents != R&D…
As control for evaluations of specific changes, especially where correlated w/ DoD funding. Evaluating “title” vs “license” policies, exploiting all the variation before Bayh-Dole.


Q: How big a role does gov funding play in innovation direction for major new paradigms and breakthroughs? E.g. recent COVID vaccines? iPhone tech based on gov-funded research?
A: good use case for us and others!

Q: Do you have a measure, citations or other, to show whether gov funded patents are more important than private, by industry / by department, per patent or sectorally?
A: We’d like to go beyond citations here.

Q: Is the stitching together available yet? Can we email you?
A: It’s available, we’re still working on stitching for a few months but will share it with you.

Q: Do you have a sense of the connection b/t these patents and the subsequent creation of institutes like Stanford Engineering, pushing a new frontier?
A: We do; we’re getting into the creation of new fields, in medical contexts, not using patent data. That’s something we’re pursuing. It’s a longer answer, but happy to talk about that after.

Q: Niggling Q - you posted the Roosevelt exec order but also showed data from before…
A: They backfilled; the agencies did. Great question, and it relates to another point I didn’t make: agencies did keep records. I don’t have the total # in the Register over time; it’s small before WWII. How reliably records were kept I don’t know; there wasn’t much funding of R&D, and debates then were about contexts where someone in Ag had their own invention, intramurally; who gets the rights to that. Those are USC 266 patents.
Q: Might be useful to see which agencies did and didn’t backfill.

Q: How well do people comply with requirements to report back?
A: There are requirements to report back to agencies; we didn’t get into too much detail on compliance, but we could explore that with modern data.

Startup Patenting

Michael Ewens, Columbia University and NBER
Matt Marx, Cornell University and NBER

If you asked 10 people on the street, who was more responsible for innovation: startups or big firms, they’d all say startups. But we don’t have a lot of evidence on this point. What are our public options for matching startups to patents? (for those w/o Census data)

PatEx has a small/micro firm indicator, but size isn’t a good proxy for age.
Still nearly impossible to find new ventures that hold patents. You can download a Crunchbase-2013 snapshot, but don’t have firm founding year for most firms. Even DISCERN only has year it was publicly listed.

We use OpenCorporates: yet another Sloan project, a B corp that has solved the conundrum of making data sustainable. (They have a successful model: free web searches up to a certain amount, but you pay to be able to harvest.)
First we try a loose API search, insisting on an exact match for short names and common names. We looked at 190k assignees, got results from 160k, and went through scoring + cleaning. We penalized a candidate if it dissolved long before the 1st patent, or incorporated too long after.
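The penalty logic described here might look something like the sketch below. The `Candidate` fields, the grace periods, and the penalty size are all hypothetical; the notes only say that candidates are penalized when they dissolved long before the first patent or incorporated too long after it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A company record returned by a name search (hypothetical fields)."""
    name_score: float            # 0..1 similarity between assignee and company name
    incorporated: Optional[int]  # incorporation year, if known
    dissolved: Optional[int]     # dissolution year, if any


def score(c: Candidate, first_patent_year: int,
          dissolved_grace: int = 2, incorporated_grace: int = 3,
          penalty: float = 0.5) -> float:
    """Start from name similarity, then subtract penalties for implausible dates."""
    s = c.name_score
    # Dissolved long before the first patent: unlikely to be the assignee.
    if c.dissolved is not None and c.dissolved < first_patent_year - dissolved_grace:
        s -= penalty
    # Incorporated too long after the first patent: also implausible.
    if c.incorporated is not None and c.incorporated > first_patent_year + incorporated_grace:
        s -= penalty
    return s


def best_candidate(cands: List[Candidate], first_patent_year: int) -> Optional[Candidate]:
    """Pick the highest-scoring candidate, or None if there are none."""
    return max(cands, key=lambda c: score(c, first_patent_year), default=None)


if __name__ == "__main__":
    cands = [
        Candidate(0.90, 1985, 1999),  # dissolved long before a 2010 first patent
        Candidate(0.80, 2008, None),  # plausible timing
        Candidate(0.95, 2018, None),  # incorporated long after
    ]
    print(best_candidate(cands, first_patent_year=2010))
```

In this toy case the candidate with the best name match loses to a slightly weaker match whose dates are consistent with the patent record.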

We also used some PitchBook data (which we can’t give out; can only give a flag saying ‘this firm raised VC’)

Current status: Beta, 142k assignees found.
That’s 85% of patents, 73% of assignees. Not as good as Census, but fully open.
— 135k from OC, 2k from Discern 1.0, 4k from PitchBook
— PV assigneeID from 2022, OC ID
+ founding yr / confidence score, + VC / confidence score

Young vs old firms: more likely to be cited, and cited by a larger range of patent classes. Can I say these are better patents? (A: no…)
Young vs (small, not young): fewer cites to science.
There’s selection bias: all old firms used to be young. So say we only look at firms that stop patenting after 3-8y. They peak early in # and in novelty of patents

Do young firms shape new industries? There’s clear correlation.
VC-backed firms are about 1/3 of all. There’s no difference in novelty (compared to non-VC funding). But they don’t cite science as often and don’t get cited as often.

“Burden of knowledge”: is innovation getting harder to find? People staying in PhD programs longer? Do we see this in patent assignees (in firm age)? No. First patent assignment is coming earlier. But they’re also hiring older inventors, and people who have patented in more classes before they are hired.

Next Steps: international assignees; work with the updated patent-paper pairs dataset, mentioned over the summer, which is now updated to 2022, fully moved to OpenAlex: at the new


Q: Which comes first, good patent or startup? Do good ideas get funding b/c the idea is good and patented, or do they patent after starting up?
A: Unclear. Maybe both? See related work on non-startup firms.

Q: Any anomalies you’ve found in there?
A: The dataset is now online, you’ll see the first patents are sometimes in -5 years. There are a lot of anomalies; we find historical assignees harder to get.

Q: Compare this to the Startup Cartography work?
A: I think they also build on OpenCorporates, but we could get data directly.

Q: if time to first patent is declining, but inventor age is increasing, does that say anything about time to invent? Is it partly that they delay founding and patent fast once they start up?
A: That’s a very compelling idea. We could look at whether inventors come from academia, or leave a firm and do this and decide to do it as a startup?

Q: Cf. the Graham/Samuelson survey from 2008: if you just use startup assignees, you miss out on founders who had patents that weren’t assigned to the firm.
A: Aha!

Q: Interesting you’re working w. PitchBook. We’re doing something mixing PitchBook, Crunchbase, Capital IQ — and found less than 20% of firms in dataset that showed up in more than one source. We matched to NETS which is great, 80% match there. But be cautious about relying on just one of these.

11:30 - 11:45 Break

I3 Fellows presentations!

Patents or Defensive Disclosures? — Bernhard Ganglmair, Univ. of Mannheim
Using Pydrad — Bernardo Dionisi, Duke University
Linking Scientific Articles to Media Mentions — Saqib Mumtaz, Berkeley
Statistical Evidence from Clinical Trials — Maya Durvasula, Stanford

Patents or Defensive Disclosures? Bernhard Ganglmair

Do patents tell the whole story? How do defensive disclosures relate to patents? Learn CPC classes for disclosures with transfer learning, find similar patents.

Challenges: Heterogeneity (no given format) + Domain differences (between patents and disclosures). We used an NLP pipeline with Llama2 to help generate abstracts to reduce heterogeneity, and an elastic search to support domain adaptation to reduce differences.

We want to build a model good at classifying [patents] but bad at distinguishing b/t disclosures and patents. Similar to DANN (2015)
Our main contribution will be our general pipeline to map texts into CPC space; aiming for two papers: one descriptive on patents vs disclosures, one analytic on the effect of changes in patentability.

Questions for the audience: can’t share the raw data yet. If you have something we can test this on, where you want CPC classes, let us know and we’ll try to do it, and that could be published on the I3 site / in the Index.

Q: Why do people use disclosures vs publication (scholarship, &c)? Can you compare disclosures to other publication, in various fields?
A: It’s easier; don’t pay as much. $120 vs $1k+, no referees, just a single invention or idea rather than a writeup. But yes, could be an extension

Using Pydrad Bernardo Dionisi

How do we access data? Some of us are still using flatfiles :)
Pydrad is a Python library that lets you pull in a range of common large datasets: Reliance on Science, Open FDA, PatentsView, &c. You can import the library and then reference the datasets by name.

Registering and integrating a dataset is easiest if it’s in a repository like Zenodo with a clean API. Aiming to integrate registration with the I3 Index once it is online (not quite yet).

Linking Scientific Articles to Media Mentions Saqib Mumtaz

Many researchers share updates at conferences and public talks, and there’s a media landscape dedicated to promoting science; this is understudied compared to paper publication and patents.

We started with EurekAlert! data, then linked it to author data on OpenAlex. First-pass matching is 58%. Some of the misses come from date errors in OpenAlex. Using dates from CrossRef increases the match rate to over 80%.
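To see why a second date source can lift the match rate, consider a minimal sketch of date-window matching. The 30-day window, the function names, and the choice to fall back to the second date (rather than replace the first) are assumptions for illustration; the notes do not describe the actual matching rule.

```python
from datetime import date
from typing import Optional


def within_window(release_date: date, paper_date: Optional[date], max_days: int = 30) -> bool:
    """True if the paper date is known and close enough to the press release date."""
    return paper_date is not None and abs((release_date - paper_date).days) <= max_days


def date_match(release_date: date, openalex_date: Optional[date],
               crossref_date: Optional[date] = None, max_days: int = 30) -> Optional[str]:
    """Return which date source (if any) supports linking the release to the paper."""
    if within_window(release_date, openalex_date, max_days):
        return "openalex"
    if within_window(release_date, crossref_date, max_days):
        return "crossref"
    return None


if __name__ == "__main__":
    release = date(2021, 3, 10)
    # Hypothetical case: the first source carries a wrong date, the second is right.
    print(date_match(release, date(2021, 1, 1)))
    print(date_match(release, date(2021, 1, 1), date(2021, 3, 8)))
```

A paper whose recorded date is off by a couple of months falls outside the window and is missed on the first pass; a second, more accurate date recovers it.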

Q: There’s work on hype in scientific publications, and Qs around gender dynamics, and characterizations of uncertainty (does confident science get more attention than that which exposes its uncertainty?) Have you looked at these aspects?

Q: Are things more likely to be licensed if they get press releases?
A: Certainly things that are licensed get more press attention; I haven’t seen results that are too closely linked [for novel discoveries?], but the further you are from the source of [the science], the more a press release helps.

LLMs as Research Assistants: constructing a corpus of medical evidence Maya Durvasula, w/ Sabri Eyuboglu and David Ritzwoller

Excited to share this work!
Recently we’ve been using clinical trials as measures of innovation: how firms respond to gov policy and prioritize investments. But just as unweighted patent counts seem to leave information on the table, unweighted trial counts miss out on their details: trials are standardized, for quantitative output; studies can be more or less well designed and more or less successful.

We combine scientific pubs, records, and FDA approvals.
Most of PubMed doesn’t include clinical trials; we flag abstracts by keyword, and look at NLM categories, to get 1.8M abstracts. Colleagues made us a nice interface speeding hand-labelling by 5-10x, to build meaningful training data. Then we iteratively developed prompts for GPT-3 and GPT-4; final prompts got us down to 5% false negatives/positives. [trying a range of visually similar prompts improved the success rate by a factor of 4-5, esp with GPT-3]. We also extracted 60k ‘noisy’ labels from each model.
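As a small illustration of how error rates like these can be computed against hand labels, here is a sketch. The toy labels are invented, and the definitions used (false negatives as a share of hand-labelled positives, false positives as a share of hand-labelled negatives) are one common convention; the notes do not say which definition was used.

```python
from typing import Dict, List


def error_rates(hand_labels: List[int], model_labels: List[int]) -> Dict[str, float]:
    """Compare model labels to hand labels (1 = clinical trial, 0 = not)."""
    if len(hand_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(hand_labels, model_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return {
        # Share of true trials that the model missed.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        # Share of non-trials that the model flagged as trials.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


if __name__ == "__main__":
    hand = [1, 1, 1, 1, 0, 0, 0, 0]   # hypothetical hand labels
    model = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical model output
    print(error_rates(hand, model))
```

Tracking the two rates separately is useful when comparing prompts, since a prompt change can trade one kind of error for the other.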

Q: “why not use GPT-4 for everything?”
A: This was quite expensive. It seems expensive for them as well, and not suitable for this sort of labelling at commercial scale. Also, ChatGPT is a black box, and we’d like more visibility into its workings.

We chose LLaMa, Mistral, and Pythia; but got comparable performance from BERT directly. Fine-tuning w/ GPT-3 labels we could match perf of GPT-3 with open models. Fine-tuning w/ GPT-4 labels we could match perf of GPT-4!

Finally, we tried a few versions of the open models: BiomedBERT trained on PubMed, &c. These didn’t necessarily do better than off-the-shelf BERT.

Conclusions: we can use LLMs to get very close to hand labelling. Prompt design matters. Fine tuning let us beat the perf of proprietary models.

Next Step: streamlining a workflow for label extraction, getting better statistical information, using that to reduce errors.

Q: Have you talked to the National Library of Medicine about updating MeSH tags in a similar way? A: They’ve been using algorithms to do that since 2022. Not this technique, but they’re working towards something like this.

1:00 pm — Lunch

Thanks for joining! For further questions or conversation, please join the I3 discussion list (i3-open). To register new datasets or validation datasets, see


Friday, December 1

4:00 pm — Welcome!
Matching Scientists and Inventors
The Commercial Potential of Science & its Realization
New Facts and Data about Professors and their Research
6:45 — Dinner

Saturday, December 2

8:30 am — Coffee and Pastries
9:00 am — Reports on Tools
Index and Validation Datasets
Logic Mill - A Knowledge Navigation System
10:00 am — Patenting Firms
Improving Patent Assignee-Firm Bridge with Web Search Results
The Government Patent Register: A New Lens on Historical USG Funded Patenting
Startup Patenting


I3 Fellows Presentations! (lightning talks)
