
Linguistics Workshop 10/28

I³ Fall Workshops, #1

Published on Oct 19, 2020

Part of the 2020 I³ Fall Workshops series.

Workshop and Notes

Linguistics seminar: Wed. 10/28, 12 pm – 1:30 pm ET (recording below)

Linguistic similarity measures for patents (workshop)


Nancy Kong: Linguistic Metrics for Patent Disclosure: Evidence from University vs Corporate Patents

Jonathan Ashtor: Linguistic Similarity in Patents: Investigating Cohort Similarity as an Ex Ante Alternative to Patent Forward Citations

Ryan Whalen: Patent Similarity Data and Innovation Metrics


Collaborative notes below. Questions and links welcome!

Nancy Kong: Linguistic Metrics for Patent Disclosure

Nancy Kong, Uwe Dulleck, Adam Jaffe, Shupeng Sun, and Sowmya Vajjalak

Sufficient disclosure helps avoid repetitive inventions. How can we improve this process? We define a method for estimating [depth of] disclosure, to see if corporations and universities tend to disclose different amounts.

Conceptual challenges the study needs to control for:

  • Corporations + unis could have different budgets and models.

  • They could focus on different kinds of inventions.

We try to control for both effects by looking at the patent attorney and at patent-citation patterns.

Hypothesis: corporate patents are harder to read.

Are differences in readability driven by the nature of inventions? With more experienced patent applicants, differences between corporate and university patents are more pronounced.

Linguistic analysis of patent text is a promising area of further exploration.


  • Are there other patent-citation patterns that you considered / that would be of interest for classifying patent type?

  • Q: I am looking for a discussion of why the Flesch results were different. Are the correlations expected to be negative or not? If not, why are they?

    • Yes, 0 in Flesch means “hardest to read”. [suggest: sign it the other way in the paper]

  • Q: Very interesting analysis! I wonder how much of this is driven by the fact that academics often have complementary publications and sometimes the text is copied verbatim. One could look explicitly at patents with/without such pairs. Or compare industrial vs academic publications by the same metrics?

    More generally: Is readability the same as “disclosure” (which is about describing in a way that a person skilled in the art can carry it out)? Maybe it is necessary but not sufficient? (Bhaven Sampat)

    • A: Related in part to the existence of publications parallel to university patents. Patent attorneys told us that with university patents they often cut and paste text from papers. The second question is harder. We don’t have an answer other than the empirical regularities we report.

    • We included the academic papers used in university patents as one explanation for the differences in the paper. It is a great idea to look at the pairs as you suggested!

      Regarding the relationship between readability and disclosure, we are currently collecting subject evaluations of the two and will analyze the correlation.

  • Q: Do your linguistic measures mostly capture general (public) readability or technical readability? I am asking because your motivation is about technical information contained in patent documents often being inadequate, so it seems you are interested in technical readability/completeness. But my sense of the finance literature (which you are referring to and building on) is that it is mostly about general readability.

    If your measures are mostly about general readability (university patents are more readable and require less education), then one interpretation of your results is that university patents target a more general public (because they want a broader set of actors to use/license their patents), and corporate patents might target a more technical crowd of researchers at other corporations (they are less generally readable not because of inadequate technical information, but because of more of it).

  • Q: Related to your original motivation on measuring obfuscation in patent text: Can we expect that patents with obfuscative text are more likely to be associated with patent infringement lawsuits (obfuscating firms sue other patentees who could not figure out if the obfuscative patent is related)? If so, could you use follow-on lawsuits as another cross-sectional measure to correlate with your patent text readability measures?

    • Great point! We do plan to look into infringement in another paper, especially patent trolls that intentionally obfuscate information to cause infringements.

  • Q: can you analyze further why universities might have different goals in patenting?

    • There is further discussion in the paper about companies being more likely to patent for reasons other than licensing + clear communication

    • Bronwyn: Don’t some companies patent for licensing?
      But also many patents explicitly try to use new language + buzzwords to avoid a conflict w/ existing descriptions of the same root concept + flow.

  • Q: What do you think about the rise of software that helps people avoid review? Will that make these sorts of analyses harder?

    • (like a patent ‘Lucas Effect’?)
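
For readers following the Flesch thread above, here is a minimal sketch of a readability score, assuming the measure under discussion is the standard Flesch Reading Ease formula (206.835 − 1.015 × words per sentence − 84.6 × syllables per word), on which higher values mean easier text. The syllable counter is a rough vowel-group heuristic and the two sample sentences are invented for illustration; none of this is taken from the paper.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing "e" as silent, except in "-le" endings such as "table".
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Invented examples: plain prose versus claim-like language.
plain = "The cat sat on the mat. It was warm."
dense = (
    "A semiconductor apparatus comprising a plurality of interconnected "
    "heterogeneous substrates configured for electromagnetic isolation."
)

print(flesch_reading_ease(plain))
print(flesch_reading_ease(dense))
```

Because long sentences and polysyllabic terms both lower the score, claim-style text scores far below everyday prose, which is why the sign of a correlation with this measure is easy to misread.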

Jonathan Ashtor: Linguistic Similarity in Patents

Investigating Cohort Similarity as an Ex Ante Alternative to Patent Forward Citations
Jonathan Ashtor, Journal of Empirical Legal Studies, November 22, 2019


  • USPTO Patent Claims Research Dataset, weekly XML dumps from Reed Tech

  • EPO backfile (of claims) + XML downloads (in English)

  • use a handful of personal scripts (follow up for detail)

    • Claim text processing: two-word terms are sometimes grouped into pairs. Get a full count of terms + frequencies, and put them into a term-frequency matrix


  • Build a measure of pairwise cohort [technology, age, filing-year specific] similarity for every patent in the dataset

  • Only working with Claim Text (standard pre-processing, removal of stop words)
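
The pipeline outlined above (term counts from claim text, then a cohort-specific pairwise similarity) can be sketched roughly as follows. This is only an illustration under stated assumptions: the notes do not say which similarity function is used, so cosine similarity is swapped in here; the cohort key (technology class plus filing year), the stop-word list, the record fields, and the sample claims are all hypothetical; and the optional pairing of two-word terms is not implemented.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "for", "in", "is", "said", "wherein"}


def term_counts(claim_text: str) -> Counter:
    """One sparse row of a term-frequency matrix: term -> count."""
    tokens = re.findall(r"[a-z]+", claim_text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cohort_similarity(patents):
    """Mean similarity of each patent to the other members of its cohort.

    `patents` is a list of dicts with hypothetical keys
    "id", "tech_class", "filing_year", and "claims".
    Patents that are alone in their cohort get None.
    """
    cohorts = defaultdict(list)
    for p in patents:
        key = (p["tech_class"], p["filing_year"])  # assumed cohort definition
        cohorts[key].append((p["id"], term_counts(p["claims"])))

    scores = {}
    for members in cohorts.values():
        for pid, vec in members:
            others = [cosine(vec, v) for oid, v in members if oid != pid]
            scores[pid] = sum(others) / len(others) if others else None
    return scores


# Invented sample records.
patents = [
    {"id": "P1", "tech_class": "G06F", "filing_year": 2010,
     "claims": "A method for encrypting data using a key."},
    {"id": "P2", "tech_class": "G06F", "filing_year": 2010,
     "claims": "A method for encrypting data using a public key."},
    {"id": "P3", "tech_class": "G06F", "filing_year": 2010,
     "claims": "A hinge assembly comprising a spring."},
    {"id": "P4", "tech_class": "A61K", "filing_year": 2010,
     "claims": "A composition comprising a compound."},
]
scores = cohort_similarity(patents)
print(scores)
```

Since the score depends only on text available at filing, it can be computed ex ante, unlike forward citations, which accumulate over time.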


  • Q: Here you are comparing word similarity directly, not sentence representations in some embedding space (a text model) — why not try something like BERT?

    • I wanted to focus on claim text, not the full text + description + discussion of prior art. Just looking at claims filed or issued.

    • It’s not necessarily helpful to use a pretrained model (or we should identify regimes in which it is) -Ryan

  • Q: We should think about articulating the different sections of patents and what we’ll learn from linguistic analysis of each (claims, descriptions, cite-nets). As we push the frontier forward, researching those sorts of Qs, what kinds of tools can we leverage for each subset?

Ryan Whalen: Patent Similarity Data and Innovation Metrics

Patent Similarity Data and Innovation Metrics
Ryan Whalen, Alina Lungeanu, Leslie DeChurch, and Noshir Contractor,
Journal of Empirical Legal Studies, Vol 17, September 2020

  • Introducing the Patent Similarity Dataset (Zenodo + Github)

  • And a demo Jupyter notebook

Examples that you can explore:

  • What is the network of embeddings of inventions by a given author? By the coauthors of a patent?

  • What is the centroid of those inventions in the embedding space? This is an estimate of the typical sort of concept being patented.

  • (many more explicitly in the notebook!)
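
As a rough illustration of the centroid idea above, the sketch below averages a set of patent embedding vectors and ranks patents by cosine similarity to that average. The three-dimensional vectors and patent ids are toy values invented here; the actual layout of the Patent Similarity Dataset and the notebook's own code are not reproduced, so treat every name as an assumption.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy embeddings keyed by hypothetical patent ids.
embeddings = {
    "P1": [0.9, 0.1, 0.0],
    "P2": [0.8, 0.2, 0.1],
    "P3": [0.1, 0.9, 0.3],
}

# Hypothetical: the patents attributed to one inventor.
inventor_patents = ["P1", "P2"]
center = centroid([embeddings[p] for p in inventor_patents])

# Rank every patent by closeness to the inventor's "typical" invention.
ranked = sorted(
    embeddings, key=lambda p: cosine(embeddings[p], center), reverse=True
)
print(center)
print(ranked)
```

The same pattern would apply to the co-inventors of a patent: collect their patents' vectors, average them, and compare new patents against that centroid.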


(Zenodo) + the Jupyter notebook in the last presentation are so helpful. Thanks Ryan! - SS


  • What other data sources did you consider using?

  • Q: These measures seem like they are analogous to originality and generality (HJT), so it would be interesting to compare. How do you control for the fact that forward cites are not as available in later years? Could you use your random match as a control?

    • We could set up something like that as a control.

  • Q: You’re doing exactly what we’re trying to encourage! Creating new measures, and making them available to others to work with them :)

  • It seems to me semantic analysis has to supersede citation analysis. Much better than asking if one chose to cite the other. The challenge is to unpick all of this, and use that information.

  • For individual citations — you could come up with predictive models: did patent A cite patent B?

    • A: There’s a body of work on “missing citations” — cites that should have been but weren’t made. That could also be done here…

    • Zhen Lei at Penn State and Brian Wright at Berkeley have done such work using semantic analysis to identify "missing" citations.

Q: Why compare word similarity directly instead of using a larger text model like BERT?

  • Mainly, BERT and similar models are trained on text online like Wikipedia, which doesn’t map well onto the technical specificity of patent language. (Follow-up Q: are there good (larger) text models fine-tuned on patent text?)

Q: Are there good larger text models that are fine-tuned on patent text?

  • There are some — you can use the pretrained model from the Zenodo repository.

Q: Have you looked into using extra-close matches as being predictive of suits and revocations?

  • I’ve looked a bit at rejection and litigation. I wouldn’t expect a strong signal for litigation. (too rare?) But yes, would expect for rejection.

    • M-Cam out there does litigation prior art search
