
Winter 2022 Workshop

Published on Dec 02, 2022


Friday, December 2

4:30 pm — Welcome!

Building a Corpus of “Patent-Article Siblings”

Jean-Marc Deltorn, University of Strasbourg
Dominique Guellec, Observatoire des Sciences et Techniques
Jiangyin Liu, Observatoire des Sciences et Techniques
Chenyin Wu, University of Strasbourg

Scientific papers can serve as a background approximation to the field in which a patent exists. Trying to use citations for this has limitations. So does language similarity: language serves different purposes in the scientific and patent worlds, and standards of clarity differ.

Keywords are brittle: latent semantic analysis is not currently capable of going beyond the style to the underlying technical object. We are looking at embeddings, and want to evaluate whether text embeddings can discriminate true patent-paper pairs from non-pairs.

We started with a manual ground-truth corpus: we chose a subset of domains and their CPC codes, then evaluated different embeddings (GANs, transformers, and other models) and developed a classifier based on them.
Selection of corresponding articles: based on inventor last names, publication date, and priority date. We then query the resulting corpus, looking at the Jaccard index, date difference, and text-embedding distance (of abstracts)…
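The candidate-scoring step described above can be sketched roughly as follows. This is a minimal illustration with hypothetical record fields (`abstract`, `priority_date`, `pub_date`, `embedding`); the authors' actual pipeline, embeddings, and thresholds are not reproduced here.

```python
from datetime import date

def jaccard(a: set, b: set) -> float:
    """Jaccard index between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def score_candidate(patent: dict, paper: dict) -> dict:
    """Combine the three signals from the talk: token overlap of abstracts,
    date difference in days, and abstract-embedding similarity."""
    return {
        "jaccard": jaccard(set(patent["abstract"].lower().split()),
                           set(paper["abstract"].lower().split())),
        "date_diff_days": abs((patent["priority_date"] - paper["pub_date"]).days),
        "embedding_sim": cosine(patent["embedding"], paper["embedding"]),
    }
```

A classifier would then be trained on these per-pair features against the manual ground truth.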

Ongoing work:
- extend to other technical fields (ML, crypto, quantum computing, mRNA)
- build a public corpus of field-dependent PPPs
Future work:
- extend to other sources (preprints, other media)
- fine tune language models
- address limitations: reliance on author/inventor names

Questions + Comments:

  • Q: Timeframe? A: We want to extend this to a much larger set…

  • Q: How do you ensure patent/paper have an overlapping author?
    A: we query Semantic Scholar, to maximize proximity of patent + author

  • Q: my understanding is that there are differences normatively and legally re: what counts in being an author / inventor. How do you think about that?
    A: the rules are different, though it’s a similar principle in Europe and the US

  • Comment: great to see this, fantastic use of patent data. As you expand on this project: institutional affiliation could help evaluate some matches. And have you considered looking at full claims, not just abstracts?
    A: yes, affiliation is interesting.
    Our experience from a first test was that the full claim tree did not add much value in classifying true matches, and takes a lot of computation time. (LeeF: seconded)

    • Comment: Affiliations are missing for 20% of OpenAlex. When there are multiple authors + institutions, they don’t always link correctly.

    • From the other side, patents from universities can be assigned to someone else (a grant provider, a startup) leading to other notable gaps

  • This is a perfect example of what we want presented at this workshop: a sense of what’s going on, before it is finalized. Thank you!

  • Q: (Adam) in your first slide you had something like “we looked at PPPs to understand the relation b/t science and invention.” If we knew which were the pairs: can we use this analysis to get a deeper understanding of the link b/t the science and the invention? And predict which science linked to which inventions would produce valuable inventions?

    A: that’s part of the idea, one of my colleagues (Dominique?) should join there. that’s a longer goal, to get that kind of understanding
    A: often you might take inspiration from a scientific idea…

  • NB: Lee Fleming is talking about related work tomorrow.

5:20 pm

Patents Phrase to Phrase Semantic Matching Dataset

Grigor Aslanyan, Google
Ian Wetherbee, Google

This is a new public phrase-to-phrase dataset for semantic textual similarity (STS): the Google Patent Phrase Similarity Dataset
- human rated, contextual, focused on technical terms, w/ similarity scores
- with granular ratings/flags (e.g. synonym, antonym, hypernym, hyponym, holonym, shared domain, unrelated…)

Existing STS collections build on Wikipedia, books, &c., not technical terms.
Our dataset was used in a Kaggle competition in March 2022.

We focused on phrase disambiguation (adding CPCs for context), adversarial keyword matching, and hard negatives (explicit non-relation)

Dataset features: >100 CPC classes… (pull out this slide)

Choosing anchors:
- we keep phrases that appear in 100+ patents,
- randomly choosing 1000 anchor phrases from these.
- randomly sampling up to 4 CPC classes for each, for context
Generating targets:
- we randomly select phrases from the full corpus with a partial keyword match with an anchor,
- we use a masked language model, asking BERT to generate replacement phrases after masking out each instance of the phrase
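The partial-keyword-match step for target generation can be sketched as below. This is an illustration under assumed tokenization and a toy corpus; the BERT masked-language-model replacement step described above is omitted.

```python
def partial_keyword_match(anchor: str, phrase: str) -> bool:
    """True if the two phrases share at least one word but are not identical
    (the rough selection criterion described in the talk)."""
    a, p = set(anchor.lower().split()), set(phrase.lower().split())
    return bool(a & p) and a != p

def candidate_targets(anchor: str, corpus: list[str]) -> list[str]:
    """Phrases from the corpus that partially keyword-match the anchor."""
    return [ph for ph in corpus if partial_keyword_match(anchor, ph)]
```

For example, for the anchor "acoustic sensor", "optical sensor" would be kept as an adversarial keyword-match target, while an identical phrase or an unrelated one would not.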

Questions + Comments

— This is meant to be a tool to fine tune models rather than intended as a model itself

6:30 pm — Dinner : Taylor Ballroom

Saturday, December 3

8:30 am — Continental Breakfast Bagels

Mapping Patents to Technology Standards (slides)

Fabian Gaessler, Pompeu Fabra University
Dietmar Harhoff, Max Planck Institute for Innovation and Competition
Lorenz Brachtendorf, Max Planck Institute for Innovation and Competition

Summary: We try a new method to link patents to standards based on semantic similarity. Useful for SEP litigation, strategic patenting, contribs to tech progress.
Future Improvements: refine similarity measure, extend to other standards, historical and other regions

Data sources + challenges

  • Standards: We focus on ETSI, a standards db with 40k standard docs, varying in size (some >1k pages). A doc may describe multiple technologies, so we split at the chapter level.

  • Also 18,000 declared SEPs, at family level

  • Challenges: long texts, multiple docs per standard, matching across two different corpora.

Similarity approaches

  • octimine: closed source 🙁, via Natterer (2016). ‘Vector space model’, cosine + possibly other fine-tuning

  • tf-idf, embeddings: we checked results with this

[Statistical overviews]

— linking essentiality status to predicted essentiality
“if this worked better I would be out founding a company offering patent-tech”

We compared this with WiFi patents and a hand-marked gold standard, with similar results.

We posted everything on Harvard Dataverse: with CSVs for 60k docs, use cases, and more.
Semantic similarity of patent-standard pairs (ETSI, IEEE, ITU-T)


  • Tim — we are basically citing one another’s unpublished papers here. great to see this. Can you say more about similarity score coeffs and how this changes for claims?

    • w/ claims: Effect size gets smaller. There’s a tradeoff b/t the noise you have from just comparing [claims] vs [standards], and bias that may come in (from also looking at the description, which is less uniform?)

    • also interesting that the final decisions (across the corpus) didn’t change that much when using just claims or using full.

  • Tim: ideas - take claim + description, find ones that are more similar on one or the other. Perhaps claims change in ways that make them fall w.r.t. the standard. There’s such time variation in patenting and standardization processes, this micro-longitudinal change could show which ones move toward or away from specs. From a policy standpoint: can you find things that were close substitutes but aren’t essential? The policy debate here is that the standards process created a monopoly that wasn’t there before. So: what were the alternatives early in the standards process?

    • We looked at undeclared-by-humans but high-similarity. overall seem less valuable. also much older at the time; patent had expired. Share of SSO membership was not lower among these.

  • One way to get essentiality is to put language from your patent into the standard (in one direction or the other). Given the slow standards process, you have opportunities through continuations + more. My concern w/ this approach: it’s not the case that patents are written in a eureka moment and the language simply summarizes that insight!

    • A: this is complex. there are some just-in-time patenting studies, also personal ties: can you see prelim drafts, file a patent just before adoption? This is an issue for many such efforts.

    • Adam: sounds like you could document this phenomenon by looking at increasing similarity over the course of the standards process

9:40 am

The NBER Orange Book Dataset: A User’s Guide

Maya Durvasula, Stanford University
Scott Hemphill, NYU Law School
Lisa Larrimore Ouellette, Stanford University
Bhaven Sampat, Columbia University and NBER
Heidi Williams, Stanford University and NBER

Patents are used as a measure of innovation, but mappings to products can be unclear; there’s no straightforward linkage. This is different in pharma, via the Orange Book.

We introduce a newly digitized OA dataset of Orange Book records, w/ annual editions providing snapshots at points in time; giving a comprehensive portrait of legal protections over drug lifecycles.


  • OB records are self-reported, not audited by the FDA. we validated against external benchmarks

  • appropriate use may differ across researchers. use case: calculating market exclusivity.

“Orange Book” is a misnomer. Real name: Approved Drug Products with Therapeutic Equivalence Evaluations, for pharmacists to track generic-brandname links. Related: Hatch-Waxman Act for drug price competition.

Not everything is eligible for listing. Only patents that could reasonably be asserted (to track infringement of generics). Incentives to list all patents: generic competitors have to challenge every patent that’s still in force at the point they enter the market. [challenge means: legal certification, required to argue each specific patent is either invalid or not infringed]

Any infringement suit leads to 30 months of blocking generic approval. Free extra exclusivity! Sometimes called ‘pseudo-patents’. We see 5 categories of exclusivity:

  1. New chemical entity (NCE): 5y

  2. Orphan drug (ODE): 7y

  3. Pediatric (PED): 6mo

  4. Generating Antibiotic Incentives Now (GAIN): 5y

  5. 180-day exclusivity (Generic DE): 6mo
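The market-exclusivity use case mentioned earlier can be sketched from these pieces. This is a rough illustration with approximate durations (1 year = 365 days) and hypothetical inputs; real Orange Book records carry the self-reporting caveats discussed in the talk.

```python
from datetime import date, timedelta

# Statutory exclusivity lengths from the list above, in days (approximate)
EXCLUSIVITY_DAYS = {
    "NCE": 5 * 365,   # New chemical entity: 5y
    "ODE": 7 * 365,   # Orphan drug: 7y
    "PED": 183,       # Pediatric: 6mo
    "GAIN": 5 * 365,  # GAIN: 5y
}

def last_protection_date(approval: date, excl_codes: list[str],
                         patent_expiries: list[date]) -> date:
    """Latest of: each regulatory exclusivity period (counted from approval)
    and each listed patent's expiry date."""
    ends = [approval + timedelta(days=EXCLUSIVITY_DAYS[c]) for c in excl_codes]
    ends.extend(patent_expiries)
    return max(ends)
```

Whether a given listed patent should count toward exclusivity is exactly the kind of judgment call the authors warn differs across use cases.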

Find our dataset on NBER: the NBER Orange Book Dataset (Readme). Good news: you can just use this w/o worrying about gaps!
[4 data files: Drug patents, patent use codes, drug exclusivity, drug exclusivity code…]

Of 2500 new drugs 1985-2015, 80% have one form of such protection. Of 800 new molecular entities, 96% of these ‘innovative drugs’ have one. This exclusivity is granted and recorded directly by FDA.

We run a comparison to IQVIA/Ark, look at litigation records, and at extended patents under Hatch-Waxman. The majority of drug patents do get recorded in the Book.

We also looked at 4 predictable changes to expiry.
4. Maintenance fee non-payment (at 3.5/7.5/11.5 years)

At a high level all things are accurate, but non-payment doesn’t appear! Firms just stop self-reporting. 45% of patents expire before full term. Must find other resources to track this.

Context: this was originally an appendix for another project estimating market exclusivity: Nominal, Expected, Realized
(what do legal protections say they confer, what shielding should be expected, how much time actually elapsed in a case)


  • Dan Gross: could you list some potential misuses as well as good uses of this dataset?

    • Bhaven: positive use cases included… Don’t count all entries as if they are equal. e.g. those that actually matter for generic entry.

  • Errors/disputes

    • If you’re a competitor you can state that a record doesn’t actually conflict w/ listed drugs. There are 54 active disputes here, from generic competitors. This is hard if you’re not yourself in the weeds re: what the patent text allows. Our period ends in 2015, and there weren’t disputes that ended in delisting in that period.

    • Until 2003, you could get multiple 30mo stays. firms would start pulling new OB listings out of nowhere. Had to be fixed in the Medicare Modernization Act

  • Q: have you thought about how tool/IP rights complement one another?

    • The project this was originally an appendix for is doing that: looking at new uses for approved drugs, and how patent + regulatory exclusivity interact. Definitely something you can do with this data!

  • Comment: [from USPTO] the patent examination research dataset (PatEx) includes patent status, expiry due to non-maintenance. Published annually since 2014. Google Patents may have more; if you need older years reach out and we can track down where it actually is.

  • Comment: renewals can be found in PatStat.

10:35 am

Progress Report on an Inventor-Author Crosswalk

Lee Fleming, University of California, Berkeley

We have initial results, model, and proposed flow.

[Doudna, Langer examples]

The Gatekeepers of Science:
Papers for pure scientist, light bulb for pure inventor, gate for scientist-inventor.

Relationship width: # of coauthorships.
Heterogeneity across lab clusters?


  • Sankey diagram of Doudna’s OA ID / PatentsView ID shows over-splitting in OA.

  • For a ML model estimating missed links:

    • take subset of Marx/Fuegi patent-article citations that are self-citations. look for scientists w/ ORCID IDs [‘best thing we have right now’]

    • Features of prediction model: many! Clean up all fields, plug in, publish accuracy estimates.

    • Enable sub-setting, intermediate results

some features used in distance calculation
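A distance calculation of this kind might use features like the following. This is a hypothetical sketch (`name_features` and its fields are illustrative); the project's actual feature set and model are not reproduced here.

```python
from difflib import SequenceMatcher

def name_features(inventor: str, author: str) -> dict:
    """A few cheap string features one might feed an inventor-author
    matching classifier."""
    inv, aut = inventor.lower().strip(), author.lower().strip()
    return {
        "exact": inv == aut,
        "last_name_match": inv.split()[-1] == aut.split()[-1],
        "first_initial_match": inv[0] == aut[0],
        "string_sim": SequenceMatcher(None, inv, aut).ratio(),
    }
```

In practice these would be combined with non-name signals (fields, affiliations, citations) before clustering, since name features alone over-split cases like the Doudna example above.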

Challenges: Queries directly draw from OAlex right now. Encoding takes time. For clustering, let users choose an algorithm, show both sanity check + confidence measure.


Lee - Our server died, so we’re trying to figure out the economics of hosting our own servers vs. BigQuery.

  • Say more about this? How could BQ help?

    • Some IT guys suggested using [it] to help keeping this running

    • NB: for Gaetan, curation feels like the critical upkeep cost

    • What are the most expensive parts of keeping this updated?

      • OA updates their dataset every 6mo. PatentsView each year. We want to turn the crank. Needs ~10k compute each time…

OA is changing in real time, since last year. Maybe they’re trying to cluster researchers. (A: we should incorporate those)

Q: Are you using OA’s field concepts? They’re a bit wonky, too sparse, too many of them associated w/ each researcher.
A: We’re using those and rolling our own, trying both. I’m a pathological case: my patents are in materials science, and I’m a social scientist now.
A: (Adam) I waded into this in NZ, looking at the work of NZ scientists who worked in NA/Europe and returned. typically in db’s they are treated as different people. Different location + institution! May be impossible to handle

  • Doesn’t seem so impossible for a human reader, or a great machine reader

Q: is there a way to [use email to prompt to dig deeper]?
Q: what fraction of scientists have an ORCID?
A: as of 2022: about 6M distinct ORCIDs are associated with at least one work or external identifier. Recency bias.


11:30 am

Panel: How do we validate patent metrics derived from semantic analysis?

There has been an explosion of semantic-based measures seeking to capture similarity across patents, novelty, disruptiveness, value or impact, etc. The panel brings together researchers who’ve been working on these metrics to discuss questions such as: what is the relationship of these metrics to older ones based on, e.g., patent classification or citations? How should validation be structured (should people validate their own measures, or should we try to create some kind of shared or overarching validation process)? What is the relationship between different modes of validation (e.g., correlation with existing metrics, testing against subjective expert judgments, correlation with outcome indicators such as productivity or prizes)?

Moderator: Bronwyn Hall, Stanford University and NBER (TBC)
Sam Arts, KU Leuven
Dokyun Lee, Boston University
Ina Ganguli, University of Massachusetts Amherst
Josh Lerner, Harvard University and NBER

1:00 pm — Lunch : Taylor Ballroom

Thanks for joining! For further questions or conversation, please join the I3 discussion list (i3-open).
