
Citations Workshop 11/04

I³ Fall Workshops, #2

Published on Nov 04, 2020

Part of the 2020 I³ Fall Workshops series.

Citations and citation metrics: November 4, 12:00-13:30 ET

Workshop and notes


Matt Marx: Patent/paper linkages; documentation and reliability (Reliance on Science)

Cyril Verluise: Comprehensive database of citations; in-text and citations to non-standard sources (PatCit)

Osmat Jefferson: Lens Lab reflections


Aside: mention global context — appreciating everyone’s time in this moment.


Cyril Verluise (noting contribs: Cristelli, de Rassenfosse, Gerotto, Higham, … , and you!)

Code: We welcome new contributors on the GitHub repo.


  1. Linking patents to larger innovation system

  2. Improving measurement of knowledge flows

Context: front-page NPL references have increased rapidly since 2000, and are increasingly heterogeneous, including office actions, search reports, and Wikipedia references.

Precedent: Jefferson et al., Marx & Fuegi, Bryan et al.

Contributions: [list]

  • one-stop shop for community (many kinds of citations)

  • non-scientific NPL

  • in-text citations


  1. Categorization + classification

  2. Extraction + sequence labelling

  3. Parsing (domain-specific design, plugins)

  4. Consolidation

There’s a big difference between in-text and front-page citations.


  1. on GitHub:

  2. patcit-public-data is available to play with on BigQuery, and can be accessed from your own Google Colab notebook (example given)
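The Colab route can be sketched as below. The dataset/table path and the `run` helper are illustrative assumptions, not the project's documented names; check the patcit-public-data project on BigQuery for the current tables.

```python
# Sketch of querying PatCit from a Colab notebook (or any environment
# with google-cloud-bigquery installed and authenticated).
# NOTE: the table path below is an illustrative assumption; check the
# patcit-public-data project on BigQuery for the current names.
QUERY = """
SELECT *
FROM `patcit-public-data.frontpage.frontpage`
LIMIT 10
"""

def run(query: str = QUERY):
    """Execute the query and return a DataFrame. Requires the
    google-cloud-bigquery client, which Colab ships by default."""
    from google.cloud import bigquery  # deferred import; needs auth to run
    client = bigquery.Client()
    return client.query(query).to_dataframe()
```

In Colab, `google.colab.auth.authenticate_user()` handles credentials before the client is created.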


We combine different platforms to ensure discovery, ongoing access, reproducibility, and interoperability. You can find links to each of these from our GitHub and Zenodo projects.

Stay informed: (newsletter)

Lens Labs and Patent API

Two take-home messages from Cyril’s presentation: the data is messy, complex, and heterogeneous, so we need transparency; and science and enterprise are key components of the innovation process, so to unlock their knowledge we need to build bridges between the two corpora.

Lens tools are not just for academics but for the public: investors, policymakers, analysts, etc., to support evidence-based decision-making and guide precise partnerships for faster social outcomes.

Recent features: informed by user base

  • “Dynamic collections” that update automatically with each release once you save your search and enable notifications.

  • Private and secure institutional accounts

Lens Labs: engages the academic community in the practice of transparency.

Considering patents as ‘dynamic meta-record’ — readability is multi-dimensional and contextual: legal, technical, industrial, and scientific. Lens seeks to capture this meta-information for both patent and scholarly records using a meta record concept.

  • We use a 15-digit open and persistent identifier, the LensID, to expose credible variants, sources, and context of knowledge artifacts, such as scholarly works or patents,

  • while maintaining provenance, and allowing aggregation, normalization, and quality control of diverse metadata.

About to release: a new patent architecture (API in beta testing) that will expose the extended family, as well as the simple family, of patents; potentially integrating INPADOC along with Canadian patents next year.

Future goals:

  • Search is moving from “patent-centric” to “invention-centric”, to help you understand innovation based on inventions rather than patent clusters.

  • Adding additional citation types, w/ deduplication + harmonization

  • Adding cluster + class. codes, to support open metrics like In4M

  • Launching a Lens Cooperative for orgs joining with paid membership in January 2021.

Upcoming institutional feature allows ‘narrative landscape’ documents: ‘live, interactive, dynamic, accountable’ documents bringing together diverse search/collections and 3rd-party content in one place.

Q: (GG) You mentioned patent families. But citations seem largely in a world of citations by or to individual patent documents. What is the state of the art in managing citations linkages at the family level?

A: (Osmat) Look at PatCite (on Lens) as a way to explore citations based on simple patent family. Different jurisdictions provide different citations for the same invention and so it is important to consider citations based on a family rather than a single patent record
A: (Cyril) thoughts on how to evaluate [family-like] clusters
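The family-level point above can be illustrated with a toy roll-up; the patent and family IDs below are invented for illustration.

```python
# Toy example: collapse patent-level citations to simple-family level.
# Patent and family IDs are invented for illustration.
family_of = {"US-A": "F1", "EP-A": "F1", "US-B": "F2"}

# (citing patent, cited patent) pairs; the US and EP members of
# family F1 each cite the same F2 patent.
citations = [("US-A", "US-B"), ("EP-A", "US-B")]

def family_citations(citations, family_of):
    """Deduplicate citations by mapping each endpoint to its family."""
    return {(family_of[citing], family_of[cited])
            for citing, cited in citations}

# Two patent-level citations collapse into one family-level link.
```

This is the sense in which counting at the family level avoids double-counting the same knowledge flow across jurisdictions.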

Matt Marx (+ Aaron Fuegi): In-text Patent Citations : core dataset + results.
~ so far 30k downloads; 9-10 papers have been published with the data
~ current NBER working paper

2019 version:

  • worldwide citations to science on front-page citations. Originally built on USPTO data, now switched to Google Patents. However, the pre-OCR era (1947-1975) is dodgy, so they wrote a classifier.

  • Beta release includes in-text citations.

2020 version:

  • no longer beta

  • worldwide, not just USPTO

  • performance characterisation

  • studying pre-47 citations to patents (pre-47 has no front-page citations)

Little overlap between front-page and in-text citations (Bryan/Ozcan/Sampat), and qualitative differences. In general, in-text citations are less provincial, more scientific, and more likely to come from the inventors themselves rather than attorneys (less recycling of in-text citations: “perhaps the patent attorneys are doing a little bit of copy-paste”).

Heuristics outperform machine learning (GROBID): 23% of citations were found by heuristic methods but not by ML, compared to only 6% found by ML alone.

  • Q: What’s the difference b/t heuristics and ML here?

Performance metrics: (spot checking by hand)

  1. Precision: the fraction of extracted citations that are correct (1 - false-positive rate)

    • hand-check samples

  2. Recall: the fraction of true citations that were found (1 - false-negative rate)

    • curated known-good test set

    • citations that “we should have found”

    • citations categorised as correct or reasonable (accounting for errors)

Neither Matt nor Aaron has seen the test set (so it is not being tailored to).
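The two metrics above reduce to the usual definitions; a minimal sketch with made-up counts:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts (not from the talk): 90 correct extractions,
# 10 spurious ones, and 30 known-good citations that were missed.
p, r = precision_recall(tp=90, fp=10, fn=30)
# p = 0.90: 1 minus the share of false positives among extractions
# r = 0.75: 1 minus the share of known-good citations missed
```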

Data sharing:

Free-knowledge license; DOI per release; archival storage via Zenodo.

Comment (KB): this is great, Matt. We also noticed a couple years ago that without starting with paper lists and heuristics we couldn't come close. There must be an ML way to do this better, of course, but we couldn't do it.

Q: Can you discuss processing before this? E.g., check "Muller" and "Mueller" and "Muller" with umlaut? I think this is one of the most useful heuristics to fix, but I imagine you do much more than this. (how did you do this w/o looking at data?)

A: We built our own Perl script for flattening that does everything we wanted re: Unicode and other charsets → ASCII. We’d love to hear a better way.

here's our "flattening" code if anyone wants it
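Their Perl script isn't reproduced here, but a minimal Python stand-in for this kind of flattening, using NFKD decomposition, looks like:

```python
import unicodedata

def flatten(s: str) -> str:
    """Decompose accented characters (NFKD), then drop anything
    outside ASCII. A rough stand-in for the authors' Perl script,
    not a reconstruction of it."""
    return (unicodedata.normalize("NFKD", s)
            .encode("ascii", "ignore")
            .decode("ascii"))

# flatten("Müller") -> "Muller"
# Note this does NOT unify "Müller" with "Mueller": the German
# transliteration u-umlaut -> "ue" needs an explicit mapping table.
```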

Q: One thing that will be useful to do on U.S. side for in text (patent-patent and patent-science in text) is to flag those “incorporated by reference.” These may be in there for different reasons than those in rest of text. Check out 37 C.F.R. § 1.57. I think this is easy to do since the language around this is standard.
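Since the statutory language is formulaic, a simple pattern could flag these passages. The regex below is a hypothetical starting point, not a validated extractor; real patent text has more variants.

```python
import re

# Hypothetical flag for incorporation-by-reference boilerplate
# (37 C.F.R. 1.57). Real patent text has more variants than this.
IBR_PATTERN = re.compile(
    r"incorporated\s+(?:herein\s+)?by\s+reference", re.IGNORECASE)

def flag_incorporation_by_reference(text: str) -> bool:
    """True if the text contains incorporation-by-reference language."""
    return bool(IBR_PATTERN.search(text))
```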

Q: You mentioned using fields, and this is a clear (hard-work) extension of that: have you thought about using the citations in the papers that you know are cited in patents to try to disambiguate others you feel less sure of?

Questions around consolidation

Q: how can we maximize our effort, rather than duplicating? Making a whole greater than the sum of its parts?

A: [OJ] I know everyone looks at different aspects, including links to products (IPR: nonstandard NPLs) and heterogeneous links (PatCit)

A: There are some big shared dbs for people [see the DocDB master db - they have a citation db. Is our effort to feed into that, alongside OECD’s work?]. Can we harmonize this work w/ people who are maintaining a self-declared global citation db. [see also: work by MAG and others]

A: [Bronwyn] Letting 1000 flowers bloom is useful. I like the idea of building on DocDB, ultimately. Efforts listed today all seem well-defined and self-contained. It’s not clear to me that they should spend their time finding ways to come together.
But a useful activity now might be: using these different dbs to look at basic questions that all could answer, and see if answers differ. [e.g.: where do the differences in approach make a big difference in estimated outcomes?].

A: [CV] We asked ourselves this at the start. So an OS project w/ GitHub and an open community was important. Working w/ EPO and NLP-research networks. [NB: their slogan! -Ed.] So a parallel answer about approaches is needed:
~ Share code, training data and models. Version control data.

Q: (Kevin Bryan) Are you still planning to include these patent-patent and patent-science intext cites in an NBER dataset also? It looks from the folks here that we have some quite reliable extractions that can be used. Osmat's point about OECD as a curator makes sense too.

Q: We’re going to continue to talk about how to make multiple different sources of data maximally useful. Logistics: limitations of what volume of data you can hold; what is allowed or not by different repositories.
What’s a framework for what we need to work with? And see if NBER would be the best place?

Bronwyn’s point about a thousand flowers blooming: there are some things that are closer to being standardized (what data is std?) and others where we’re figuring out how to capture and represent!

B: there’s preprocessing needed to produce a flatfile everyone’s happy to download. Current arrangements require a certain amount of learning.

Adam Jaffe: what’s the tradeoff between centralised warehouses, vs letting everyone know what’s out there.

B: I don’t see an advantage to NBER hosting except that:
[1] a simple flatfile / better cite file would be useful to people who find startup costs high; and
[2] a central repo could explain the access and coverage of other sources.

An I3 repo itself as opposed to NBER.

M: A central place to go can be nice, but standardisation can limit innovation: I am more in favour of an index that can point and compare, as long as the datasets can be joined. Jim Bessen’s index is used a lot because it merges with everything. A real issue is discoverability.

For type-[2] solutions: defining the schema + access constraints of each repo is enough; it lets anyone host their own subset of a [theoretical, actual] global repo.

Q: How do you make things appropriately available? How do you register pointers to it?
A: At least ask NBER to link to the I3 site, to point people in our direction (so they’re not using out of date info)

Q: Asides — Matt’s notes on how + why to share data, and detailed questions to answer when sharing data + models.

Q: Is there a journal looking for a special issue on novel uses of patent citation data? [Kevin Bryan] —> Something we could publish here? Sense of the room : yes!
[asides: would Gems do that? that would be great.]

Q: (Bitsy) Everyone seems pretty well sold on the decentralized idea, but one more advantage: I've seen people treat the NBER patent data as mysterious data dropped from on high, rather than created by humans they could talk to. I think this is bad, and decentralization would help.

(SJ) Gaetan + Matt: yes, we should make it possible to register that sort of detail about each resource. Bitsy: to your point, including details about who produced it, how/if it is maintained, and how to engage, discuss, or revise.

Ways to extend this work


  • How can this work be harmonized with long-standing citation efforts like DOCDB? 

  • What other efforts should be on this list, how are they currently updated?

Registration + visibility:

  • How can we track the existence, access, coverage, and maintenance of these datasets?  

    • We can keep on the I3 site a list of datasets, decentrally hosted, that are of interest to people studying citations, all described in this way. (related reflections + questions)

  • How can we best map new datasets into such a framework? 

    • Contributors could include: researchers generating data, those relying on it for other analysis, curators + ontologists linking different datasets  

  • Where do people currently look for such data? Connecting those places, informing them about I3, inviting input. 


  • How does construction of different datasets + models differ?  (ROC curves, where available for parameterizable models)

  • What interesting questions could be answered from more than one of these datasets? How would the answers differ?

  • Where are current workflows for creating / cleaning / spot-checking / reconciling duplicative? What tools or scripts can be shared? [sometimes: worth its own process paper. Where could this be published?]
