
Winter 2022 Technical Working Group Meeting

Published on Dec 02, 2022

Winter I³ technical working group, Cambridge MA
Agenda (NBER) | All TWG meetings: 2019 | 2020 | 2021 | 2022 | 2023

Friday, December 2

4:30 pm — Welcome!

Building a Corpus of “Patent-Article Siblings”

Jean-Marc Deltorn, University of Strasbourg
Dominique Guellec, Observatoire des Sciences et Techniques
Jiangyin Liu, Observatoire des Sciences et Techniques
Chenyin Wu, University of Strasbourg

Scientific papers can serve as a background approximation to the field in which the patent exists. Trying to use citations for this has limitations. So does language similarity: Language serves different purposes in the scientific and patent worlds. Standards of clarity are different.

Keywords are brittle: latent semantic analysis is not currently capable of going beyond the style to the underlying technical object. We are looking at embeddings, and want to evaluate whether text embeddings can discriminate true patent-paper pairs from non-pairs.

We started with a manual ground-truth corpus. Chose a subset of domains and their CPC codes. Then evaluated different embeddings: GANs, transformers, and other models, and developed a classifier based on them.
Selection of corresponding articles: based on inventor last names, pub date, and priority date. Then querying the resulting corpus, looking at Jaccard index, date difference, and text embedding difference (of abstracts)…
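
A minimal sketch of the three pair features mentioned above (Jaccard index, date difference, abstract embedding distance). The notes do not say what the Jaccard index is computed over; taking it over last-name sets is an assumption, as are the field names and the use of cosine distance. The vectors stand in for whatever embedding model is used.

```python
from datetime import date
import math

def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def pair_features(patent, article):
    """Features for one candidate patent-paper pair (hypothetical field names)."""
    return {
        # Assumption: Jaccard over inventor vs. author last names.
        "name_jaccard": jaccard(patent["inventor_last_names"],
                                article["author_last_names"]),
        # Absolute gap between article publication and patent priority date.
        "date_gap_days": abs((article["pub_date"] - patent["priority_date"]).days),
        # Distance between precomputed abstract embeddings.
        "embedding_distance": cosine_distance(patent["abstract_vec"],
                                              article["abstract_vec"]),
    }
```

These features could then feed the classifier described above; thresholds or weights would need to be learned from the manual ground-truth corpus.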

Ongoing work:
- extend to other technical fields (ML, crypto, quantum computing, mRNA)
- build a public corpus of field-dependent patent-paper pairs (PPPs)
Future work:
- extend to other sources (preprints, other media)
- fine tune language models
- address limitations: reliance on author/inventor names

Questions + Comments:

  • Q: Timeframe? A: We want to extend this to a much larger set…

  • Q: How do you ensure patent/paper have an overlapping author?
    A: we query Semantic Scholar, to maximize proximity of patent + author

  • Q: my understanding is that there are differences normatively and legally re: what counts in being an author / inventor. How do you think about that?
    A: the rules are different, though it’s a similar principle in Europe and the US

  • Comment: great to see this, fantastic use of patent data. As you expand on this project: institutional affiliation could help evaluate some matches. And have you considered looking at full claims, not just abstracts?
    A: yes, affiliation is interesting.
    Our experience from a first test was that the full claim tree did not add much value in classifying true matches, and takes a lot of computation time. (LeeF: seconded)

    • Comment: Affiliations are missing for 20% of OpenAlex. When there are multiple authors + institutions, they don’t always link correctly.

    • From the other side, patents from universities can be assigned to someone else (a grant provider, a startup) leading to other notable gaps

  • This is a perfect example of what we want presented at this workshop: a sense of what’s going on, before it is finalized. Thank you!

  • Q: (Adam) In your first slide you had something about “we looked at PPP to understand the relation b/t science and invention.” If we knew which were the pairs: can we use this analysis to get a deeper understanding of the relation b/t the science and the invention, and predict which science linked to which inventions would produce valuable inventions?

    A: that’s part of the idea, one of my colleagues (Dominique?) should join there. that’s a longer goal, to get that kind of understanding
    A: often you might take inspiration from a scientific idea…

  • NB: Lee Fleming is talking about related work tomorrow.

5:20 pm

Patents Phrase to Phrase Semantic Matching Dataset

Grigor Aslanyan, Google
Ian Wetherbee, Google

This is a new public phrase-to-phrase dataset for semantic textual similarity (STS): the Google Patent Phrase Similarity Dataset
- human rated, contextual, focused on technical terms, w/ similarity scores
- with granular ratings/flags (e.g. synonym, antonym, hypernym, hyponym, holonym, shared domain, unrelated…)

Existing STS collections build on Wikipedia, books, &c. not technical terms.
Our dataset was used in a Kaggle competition in March 2022.

We focused on phrase disambiguation (adding CPCs for context), adversarial keyword matching, and hard negatives (explicit non-relation)

Dataset features: >100 CPC classes… (pull out this slide)

Choosing anchors:
- we keep phrases that appear in 100+ patents,
- randomly choosing 1000 anchor phrases from these.
- randomly sampling up to 4 CPC classes for each, for context
Generating targets:
- we randomly select phrases from the full corpus with a partial keyword match with an anchor,
- we use a masked language model, asking BERT to generate replacement phrases after masking out each instance of the phrase
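
A rough sketch of the anchor selection and the partial-keyword-match target step described above. The input shapes (`phrase_to_patents`, `patent_to_cpcs`) and function names are hypothetical, and the masked-language-model step is left out because it depends on the specific model setup.

```python
import random

def choose_anchors(phrase_to_patents, patent_to_cpcs,
                   n_anchors=1000, min_patents=100, max_cpcs=4, seed=0):
    """Pick random anchor phrases and up to `max_cpcs` CPC contexts for each.

    phrase_to_patents: dict mapping phrase -> set of patent ids containing it
    patent_to_cpcs:    dict mapping patent id -> iterable of CPC classes
    """
    rng = random.Random(seed)
    # Keep only phrases that appear in enough patents.
    frequent = sorted(p for p, pats in phrase_to_patents.items()
                      if len(pats) >= min_patents)
    anchors = rng.sample(frequent, min(n_anchors, len(frequent)))
    contexts = {}
    for phrase in anchors:
        # CPC classes of all patents in which the phrase appears.
        cpcs = sorted({c for pat in phrase_to_patents[phrase]
                       for c in patent_to_cpcs.get(pat, ())})
        contexts[phrase] = rng.sample(cpcs, min(max_cpcs, len(cpcs)))
    return contexts

def partial_match_targets(anchor, corpus_phrases, k, rng):
    """Randomly pick up to k phrases sharing at least one word with the anchor."""
    anchor_words = set(anchor.split())
    candidates = sorted(p for p in corpus_phrases
                        if p != anchor and anchor_words & set(p.split()))
    return rng.sample(candidates, min(k, len(candidates)))
```

Sorting before sampling keeps runs reproducible for a fixed seed, which matters if raters need to be sent the same pairs again.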

Questions + Comments

— This is meant to be a tool to fine tune models rather than intended as a model itself

6:30 pm — Dinner : Taylor Ballroom

Saturday, December 3

8:30 am — Continental Breakfast Bagels

Mapping Patents to Technology Standards (slides)

Fabian Gaessler, Pompeu Fabra University
Dietmar Harhoff, Max Planck Institute for Innovation and Competition
Lorenz Brachtendorf, Max Planck Institute for Innovation and Competition

Summary: We try a new method to link patents to standards based on semantic similarity. Useful for SEP litigation, strategic patenting, contribs to tech progress.
Future Improvements: refine similarity measure, extend to other standards, historical and other regions

Data sources + challenges

  • Standards: We focus on ETSI, a standards db with 40k standard docs. Docs vary in size (some >1k pages), may describe multiple technologies, and are split at chapter level.

  • Also 18,000 declared SEPs, at family level

  • Challenges: long texts, multiple docs, two corpora.

Similarity approaches

  • octimine: closed-source🙁, via Natterer (2016). ‘vector space model’, cosine + other? fine tuning

  • tf-idf, embeddings: we checked results with this
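
For reference, a small sketch of the tf-idf baseline idea: score a patent text against standard chapters by cosine similarity of tf-idf vectors. This is not the authors' pipeline or the octimine measure; whitespace tokenization and the smoothed idf formula are simplifying assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse tf-idf vectors (dict term -> weight) for a list of texts."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of docs containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf, so terms present in every doc keep a positive weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_chapters(patent_text, chapters):
    """Return (chapter_index, similarity) pairs, most similar first."""
    vecs = tfidf_vectors([patent_text] + list(chapters))
    scores = [cosine(vecs[0], v) for v in vecs[1:]]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
```

Splitting standards at chapter level, as noted above, is what makes a per-chapter ranking like this meaningful for very long documents.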

[Statistical overviews]

— linking essentiality status to predicted essentiality
“if this worked better I would be out founding a company offering patent-tech”

We compared this w/ Wi-Fi patents and a hand-marked gold standard, w/ similar results.

We posted everything on Harvard Dataverse, with CSVs for 60k docs, use cases, and more:
Semantic similarity of patent-standard pairs (ETSI, IEEE, ITU-T)


  • Tim — we are basically citing one another’s unpublished papers here. great to see this. Can you say more about similarity score coeffs and how this changes for claims?

    • w/ claims: Effect size gets smaller. There’s a tradeoff b/t the noise you have from just having [claims] vs [standards], and bias that may come in (from also looking at description, less uniform?).

    • also interesting that the final decisions (across the corpus) didn’t change that much when using just claims or using full.

  • Tim: ideas - take claim + description, find ones that are more similar on one or the other. Perhaps claims change in ways that make them fall w.r.t. standard. There’s such time variation in patenting and standardization processes, this micro-longitudinal change could show which ones move in or away from specs. From a policy standpoint: can you find things that were close substitutes but aren’t essential? The policy debate here is that the standards process created a monopoly that wasn’t there before. So: what were the alternatives early in standards?

    • We looked at undeclared-by-humans but high-similarity. overall seem less valuable. also much older at the time; patent had expired. Share of SSO membership was not lower among these.

  • One way to get essentiality is to put language in your patent into the standard (in one direction or the other). Given the slow standards process, you have opportunities through continuations + more. My concern w/ this approach: it’s not that patents are written in a eureka moment, and language summarizes that insight!

    • A: this is complex. there are some just-in-time patenting studies, also personal ties: can you see prelim drafts, file a patent just before adoption? This is an issue for many such efforts.

    • Adam: sounds like you could document this phenomenon by looking at increasing similarity over the course of the standards process

9:40 am

The NBER Orange Book Dataset: A User’s Guide

Maya Durvasula, Stanford University
Scott Hemphill, NYU Law School
Lisa Larrimore Ouellette, Stanford University
Bhaven Sampat, Columbia University and NBER
Heidi Williams, Stanford University and NBER

Patents are used as a measure of innovation, but mappings to products can be unclear. There’s no straightforward linkage. This is different in pharma, via the Orange Book.

We introduce a newly digitized OA dataset of Orange Book records, w/ annual editions providing snapshots at points in time; giving a comprehensive portrait of legal protections over drug lifecycles.


  • OB records are self-reported, not audited by the FDA. we validated against external benchmarks

  • appropriate use may differ across researchers. use case: calculating market exclusivity.

“Orange Book” is a misnomer. Real name: Approved Drug Products with Therapeutic Equivalence Evaluations, for pharmacists to track generic-brandname links. Related: Hatch-Waxman Act for drug price competition.

Not everything is eligible for listing. Only patents that could reasonably be asserted (to track infringement of generics). Incentives to list all patents: generic competitors have to challenge every patent that’s still in force at the point they enter the market. [challenge means: legal certification, required to argue each specific patent is either invalid or not infringed]

Any infringement suit leads to 30 months of blocking generic approval. Free extra exclusivity! Sometimes called ‘pseudo-patents’. We see 5 categories of exclusivity:

  1. New chemical entity (NCE): 5y

  2. Orphan drug (ODE): 7y

  3. Pediatric (PED): 6mo

  4. Generating Antibiotic Incentives Now (GAIN): 5y

  5. 180-day exclusivity (Generic DE): 6mo
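
A simplified sketch of how the nominal durations listed above could be turned into end dates. The code labels are hypothetical shorthand, the start date is whatever event the researcher decides triggers each period, and the sketch ignores how some of these periods attach to or extend others.

```python
from datetime import date
import calendar

# Nominal durations in months, as listed in these notes (labels are hypothetical).
EXCLUSIVITY_MONTHS = {
    "NCE": 60,        # new chemical entity: 5y
    "ODE": 84,        # orphan drug: 7y
    "PED": 6,         # pediatric: 6mo
    "GAIN": 60,       # Generating Antibiotic Incentives Now: 5y
    "GENERIC_DE": 6,  # 180-day generic exclusivity, recorded here as 6mo
}

def add_months(d, months):
    """Add calendar months to a date, clamping the day to the month's length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def nominal_exclusivity_end(start, code):
    """Nominal end date of one exclusivity period starting on `start`."""
    return add_months(start, EXCLUSIVITY_MONTHS[code])
```

For example, `nominal_exclusivity_end(date(2010, 3, 15), "NCE")` gives a nominal end of 2015-03-15. Realized exclusivity can differ, which is the distinction drawn later in these notes.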

Find our dataset on NBER: the NBER Orange Book Dataset (Readme). Good news: you can just use this w/o worrying about gaps!
[4 data files: Drug patents, patent use codes, drug exclusivity, drug exclusivity code…]

Of 2500 new drugs 1985-2015, 80% have one form of such protection. Of 800 new molecular entities, 96% of these ‘innovative drugs’ have one. This exclusivity is granted and recorded directly by FDA.

We run a comparison to IQVIA/Ark, look at litigation records, and at extended patents under Hatch-Waxman. The majority of drug patents do get recorded in the Book.

We also looked at 4 predictable changes to expiry.
4. Maint. fee non-payment (at 3.5/7.5/11.5 years)

At a high level the records are accurate, but non-payment doesn’t appear: firms just stop self-reporting. 45% of patents expire before full term. You must find other resources to track this.
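
Since non-payment has to be tracked from other sources, a small sketch of the maintenance-fee schedule may help when merging in payment records. It only computes the 3.5/7.5/11.5-year due dates from a grant date; grace periods and the resulting lapse dates are deliberately left out, and the function names are hypothetical.

```python
from datetime import date
import calendar

# Maintenance fee due points, in months after grant (3.5, 7.5, 11.5 years).
MAINTENANCE_DUE_MONTHS = (42, 90, 138)

def add_months(d, months):
    """Add calendar months to a date, clamping the day to the month's length."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def maintenance_due_dates(grant_date):
    """Due dates of the three maintenance fees for a patent granted on grant_date."""
    return [add_months(grant_date, m) for m in MAINTENANCE_DUE_MONTHS]

def first_unpaid_due_date(grant_date, fees_paid):
    """Due date of the first fee not paid, or None if all three were paid."""
    dues = maintenance_due_dates(grant_date)
    return dues[fees_paid] if fees_paid < len(dues) else None
```

Comparing `first_unpaid_due_date` against an Orange Book expiry date is one way to flag listings where the patent may have lapsed earlier than recorded.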

Context: this was originally an appendix for another project estimating market exclusivity: Nominal, Expected, Realized
(what do legal protections say they confer, what shielding should be expected, how much time actually elapsed in a case)


  • Dan Gross: could you list some potential misuses as well as good uses of this dataset?

    • Bhaven: positive use cases included… Don’t count all entries as if they are equal. e.g. those that actually matter for generic entry.

  • Errors/disputes

    • If you’re a competitor you can state that a record doesn’t actually conflict w/ listed drugs. There are 54 active disputes here, from generic competitors. This is hard if you’re not yourself in the weeds re: what the patent text allows. Our period ends in 2015, and there weren’t disputes that ended in delisting in that period.

    • Until 2003, you could get multiple 30mo stays. firms would start pulling new OB listings out of nowhere. Had to be fixed in the Medicare Modernization Act

  • Q: have you thought about how different IP rights complement one another?

    • The project this was originally an appendix for is doing that: looking at new uses for approved drugs, and how patent + regulatory exclusivity interact. Definitely something you can do with this data!

  • Comment: [from USPTO] the Patent Examination Research Dataset (PatEx) includes patent status, including expiry due to non-maintenance. Published annually since 2014. Google Patents may have more; if you need older years, reach out and we can track down where it actually is.

  • Comment: renewals can be found in PatStat.

10:35 am

Progress Report on an Inventor-Author Crosswalk

Lee Fleming, University of California, Berkeley

We have initial results, model, and proposed flow.

[Doudna, Langer examples]

The Gatekeepers of Science:
Icons: papers for pure scientist, light bulb for pure inventor, gate for scientist-inventor.

Relationship width: # of coauthorships.
Heterogeneity across lab clusters?


  • Sankey diagram of Doudna’s OA ID / PatentsView ID shows over-splitting in OA.

  • For a ML model estimating missed links:

    • take subset of Marx/Fuegi patent-article citations that are self-citations. look for scientists w/ ORCID IDs [‘best thing we have right now’]

    • Features of prediction model: many! Clean up all fields, plug in, publish accuracy estimates.

    • Enable sub-setting, intermediate results

some features used in distance calculation

Challenges: Queries draw directly from OpenAlex right now. Encoding takes time. For clustering, let users choose an algorithm, show both sanity check + confidence measure.


Lee - Our server died, so we’re trying to figure out the economics of hosting our own servers and BigQuery.

  • Say more about this? How could BQ help?

    • Some IT guys suggested using [it] to help keeping this running

    • NB: for Gaetan, curation feels like the critical upkeep cost

    • What are the most expensive parts of keeping this updated?

      • OA updates their dataset every 6mo. PatentsView each year. We want to turn the crank. Needs ~10k compute each time…

OA is changing in real time, since last year. Maybe they’re trying to cluster researchers. (A: we should incorporate those)

Q: Are you using OA’s field concepts? They’re a bit wonky, too sparse, too many of them associated w/ each researcher.
A: We’re using those and rolling our own, trying both. I’m a pathological case: my patents are in materials science, and I’m a social scientist now.
A: (Adam) I waded into this in NZ, looking at the work of NZ scientists who worked in NA/Europe and returned. typically in db’s they are treated as different people. Different location + institution! May be impossible to handle

  • Doesn’t seem impossible for a human, or a great machine reader (esp if they can trace the full timeline of other work of each person)

Q: is there a way to [use email to prompt to dig deeper]?
Q: what fraction of scientists have an ORCID?
A: as of 2022: about 6M distinct ORCIDs are associated with at least one work or external identifier. Recency bias.

11:30 am

Panel: How do we validate patent metrics derived from semantic analysis?

There has been an explosion of development of semantic-based measures seeking to capture similarity across patents; novelty; disruptiveness, value or impact, etc. The panel will bring together researchers who’ve been working on these metrics to discuss questions such as: what is the relationship of these metrics to older ones based on, e.g., patent classification or citations; how should validation be structured (should people validate their own measures or should we try to create some kind of shared or over-arching validation process); what is the relationship between different modes of validation (e.g. correlation with existing metrics, testing against subjective expert judgements, correlation with outcome indicators such as productivity or prizes).

Moderator: Bronwyn Hall, Stanford University and NBER (TBC)
Sam Arts, KU Leuven (slides below)

Dokyun Lee, Boston University — InnoVAE
Ina Ganguli, UMass Amherst — Using NLP to study innovation
Josh Lerner, Harvard + NBER

Approaches to validation:

  1. Internal validity: expert ratings (13 from 5 fields)

  2. External validity: citations, shared family, shared ID
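
One way to check a similarity measure against expert ratings is a rank correlation. The sketch below computes Spearman's rho with average ranks for ties; choosing Spearman is an assumption for illustration, not necessarily what the panelists used.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(metric_scores, expert_ratings):
    """Spearman rank correlation between a metric and expert ratings."""
    return pearson(average_ranks(metric_scores), average_ranks(expert_ratings))
```

A rank correlation only asks whether the metric orders pairs the way experts do, which sidesteps differences in scale between a cosine score and a rating rubric.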

Ideas: use new combinations of classes to estimate novelty
Special case: look at 400 patents that got rare awards.
Control case: not novel / impactful? (US grant, EPO/JPO rejection).
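
The “new combinations of classes” idea can be sketched as counting, for each patent, the class pairs that no earlier patent has carried. The input tuple layout is hypothetical, and counting pairs (rather than larger combinations) is an assumption.

```python
from itertools import combinations

def new_class_pair_counts(patents):
    """Count first-time class pairs per patent.

    patents: iterable of (patent_id, filing_date, classes), where filing_date
    sorts chronologically (e.g. an ISO date string or a date object).
    Returns a dict mapping patent_id -> number of class pairs not seen before.
    """
    seen = set()
    counts = {}
    for patent_id, _filed, classes in sorted(patents, key=lambda p: (p[1], p[0])):
        # Unordered pairs of distinct classes on this patent.
        pairs = {tuple(sorted(pair)) for pair in combinations(set(classes), 2)}
        counts[patent_id] = len(pairs - seen)
        seen |= pairs
    return counts
```

A measure like this could then be validated against the special and control cases listed above, for example by checking whether award-winning patents score higher.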

General remarks

  • preprocessing text makes a big difference.
    articulate tradeoffs b/t simple and advanced approaches.
    explain why text works better (than trad measures).
    define the goal (better to estimate what?).
    we need new benchmarks to validate metrics.

  • To everyone: provide raw data, and the code that processes it!
    That lets others replicate, vary steps, chain processes.

InnoVAE - Generative AI for patent innovation

We looked at different representations for innovation.

— token-based methods, embedding methods, topic methods are used for similarity, but there is no regularization of the learned manifolds/ latent spaces, so distances are hard to interpret.

— semantic orthogonality + dimension independence is not internalized or sharpened. —> problem of unique interpretability

InnoVAE: develops Innovation Space, constructing economically interpretable measures in a well-disentangled representation.

  • What could you get by combining two patents? E.g., combinational creativity for claims. How exceptional is one patent in the context of a portfolio / with respect to one tech factor?

  • Desirable features: Multimodal, … (insert slide)

  • Innovation factors, for one topical specialty:

  • Our ideal: A common dataset, task, metric + benchmarks

    • compare how speech + language models evolved; Kaggle today

    • unite scattered approaches to this for patent/scholarly data

  • Downstream application: predict Tobin’s q.


  • Cautious optimism about advancing the status quo with language models

  • Is there a way to look inside black boxes to describe what’s driving the process (and compare options)?

  • Limits of language similarity: (Andy Toole) Applicants have an incentive to morph language to differentiate their applications and help overcome objections related to novelty and nonobviousness. Now Bronwyn is talking about strategic language

  • (Adam) One concrete thing that seems to be coming out of this:
    We should create a parallel index of validation datasets
    Get everyone [running contests] to compare a) their prize matching[?], b) standard tests to run new measures against. Then referees could say “you have to run your new metric against those testing datasets”

  • Josh’s comments hinted at this:
    Patents are issued through a resource-intensive admin process. There are just a lot of eyes on these docs. As we saw today there’s a lot of emphasis on similarity, but we haven’t targeted what Julian was talking about: examination process, assignment to examiners, to a unit, what kinds of rejection specific claims get.

    • That’s more informed than a lot of RAs who look at patents

    • That plugs into a community of commercial/professional services [tools to get your patent assigned to a favorable art unit]

    • If we do things relevant to that community they could come help?

  • Validation tasks seem specific to an application… but for human coding, we could draw standards from how to design the coding task.
    People are expensive, if consistent over time… and a recent emerging patent pool is “Open RAN” for 5G. They decided there was so much IP clamoring to get in, they had to turn it over to computers to determine essentiality.

    • [machine reading / synthesis is good enough to get lots of leverage here]

    • (Adam) we could explore this more systematically.

    • the first AI winter came about due to a lack of shared validation?

    • compare how Papers With Code/Papers With Data work now

  • More clarity would be useful for specific common ideas:

    • Similarity (much more than just cosine of feature vectors)
      ex: consider what we’re looking at similarity between

    • Quality (Bronwyn has a whole chapter on this)

    • Novelty (compare to what, on what timescale)

  • General Q: Is Public PAIR data available?
    A: it is discontinued; queries go through Patent Center

1:00 pm — Lunch : Taylor Ballroom

Thanks for joining! For further questions or conversation, please join the I3 discussion list (i3-open).
