History Workshop 12/02

I³ Fall Workshops, #5

Published on Nov 01, 2020
Part 5 of the 2020 I³ Fall Workshops series.

December 2, 2020, 1200-1330 ET. Register to participate (via Zoom)


Mike Andrews: Identifying Patentees through U.S. History by Linking Patent Records to FamilySearch

Historical Patent Datasets: A Practitioner's Guide - Mike Andrews
This compares several historical datasets that are available.

Comprehensive Universe of U.S. Patents (CUSP): Data & Facts - Enrico Berkes
This describes version 2.0 of the CUSP patent dataset.
We are making version 3.0, which we will describe in the talk.

Dan Gross: Research on historical innovation with archival and other non-USPTO data sources

Tania Babina: Patents, entrepreneurship and the Great Depression


Mike Andrews:

Using USPTO .tiff files to get much higher quality OCR

Nothing disciplines how an inventor records their location on an application. They geolocate each patent to GNIS, then look for people within a given radius of each patentee and only consider candidate matches within that radius. This deals well with administrative boundaries.

Patent-to-census matching algorithm: quite conservative cutoffs when it comes to matching names. Matches 58% of patentees to the closest census.

It’s possible to get ~100% when you liberalise the matching metrics but then the data gets a bit meaningless. Much higher match rates in the year of the census that they’re matching to.
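The radius-plus-name-cutoff idea in these notes can be sketched roughly as follows. This is only an illustration, not the speakers' actual algorithm: the 25 km radius, the 0.9 similarity cutoff, the use of `difflib` for name similarity, the unique-candidate rule, and all record fields and names are assumptions made up for the example.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Similarity in [0, 1] between two names (stand-in for a real name metric)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_patentee(patentee, census_records, radius_km=25.0, min_similarity=0.9):
    """Return the single census record that matches the patentee, or None.

    A record is a candidate only if it lies within radius_km of the patentee's
    geocoded location AND its name clears the similarity cutoff. Ambiguous
    cases (zero or several candidates) are left unmatched; that rule is an
    assumption of this sketch.
    """
    candidates = [
        rec for rec in census_records
        if haversine_km(patentee["lat"], patentee["lon"],
                        rec["lat"], rec["lon"]) <= radius_km
        and name_similarity(patentee["name"], rec["name"]) >= min_similarity
    ]
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical records: one patentee geocoded to Lowell, MA, and three census rows.
patentee = {"name": "John A. Whitcomb", "lat": 42.6334, "lon": -71.3162}
census_records = [
    {"id": 1, "name": "John A. Whitcomb", "lat": 42.6334, "lon": -71.3162},
    {"id": 2, "name": "John A. Whitcomb", "lat": 42.1015, "lon": -72.5898},
    {"id": 3, "name": "Mary Sullivan", "lat": 42.6400, "lon": -71.3100},
]

strict = match_patentee(patentee, census_records)                  # only record 1 is in range
loose = match_patentee(patentee, census_records, radius_km=500.0)  # two namesakes, so ambiguous
```

Widening the radius or lowering the cutoff admits more candidates, which is one way to see why liberal settings raise the nominal match rate while making individual matches less trustworthy.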

Many sources (like the 1855 Mass. state census) are pretty unknown! The non-profit FamilySearch has a lot of these documents digitised, which allows much greater granularity.

Dan Gross:

Research on archival / non-PTO sources. Encouraging everyone to use ‘non-traditional’, non-digitised data. Many historical examples:

  • world’s fair exhibition catalogs

  • corporate R&D labs


  • individual historical patent examinations

  • registers of patent applications

  • interference case files

  • trademarks and trademark cancellation files

Structure of the talk

  • finding and obtaining data

  • extracting contents in a dataset

Places to look:

  • US national archives (NARA) <- one of the richest sources here!

  • corporate archives

  • ‘digitised’ books and trade journals

    • google books, hathitrust

Some tacit knowledge in working with NARA… reach out to talk if interested.

Getting quality data requires some research budget and some TLC.

Tania Babina:

A measure of US Technological Entrepreneurship over past 200 years.

2 methods, one is based on Enrico’s method, and another linked to Mike’s methods.

Longitudinal county-level measure of US technological entrepreneurship. Application: studying the effect of the local severity of the Great Depression on innovation.

Longitudinal inventor-level data 1900-1945 linked with 4 full-count censuses. Application: measure reallocation in response to the local severity of the Great Depression.


Derived from USPTO data using Enrico’s method; also gives patent citations, names, locations, classifications etc. (see Appendix A of the ‘crisis innovation’ data on the website).

You can compare different counties by how hard they were hit by the Great Depression. We find a big drop in independent-inventor patent counts.


Similar to Mike’s method of linking inventors to censuses. Linking people across different censuses to study relocation effects. You can see the 200-page Appendix B on my site. (link needed)

Issue: we don’t have a ground truth sample here to compare to. Does the distribution of inventors look like the rest of the population? No:

In some cases, for instance, [the proportion of inventors who are farmers is very different from that in the general population]

Some observations:
- Geographically, tech entrepreneurs were much less concentrated than they are currently.
- We can study the impact of shocks on innovation — the distribution of quality across different innovations.


(to Mike Andrews:)

Matt Marx: does CUSP include full-text or is it focused on metadata?

Enrico Berkes: We start from the full text. We OCR the original documents from the USPTO website.

Kevin Bryan: That is a crazily ambitious thing to do, Enrico - amazing!

Kevin Bryan: Are the matches to prior or subsequent census? For questions like inventor mobility, I imagine it matters quite a bit

Enrico Berkes: We match to the closest Census (e.g. 1855-1865 inventors to the 1860 Census). We hope to be able to use FamilySearch to link people across Censuses.
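The "closest Census" rule can be illustrated with a small sketch. The function name and the list of census years are assumptions for illustration. Mid-decade years such as 1855 are equidistant from two censuses and the example in the chat does not settle which one they go to, so the tie-break here (earlier census wins) is arbitrary.

```python
def closest_census_year(patent_year, census_years):
    """Return the census year nearest to patent_year.

    Ties (e.g. 1855 between 1850 and 1860) go to the earlier census;
    this tie-break is an assumption of the sketch.
    """
    return min(census_years, key=lambda c: (abs(c - patent_year), c))


# Decennial federal census years 1790-1950 (assumed range for the example).
FEDERAL_CENSUS_YEARS = list(range(1790, 1951, 10))

nearest_for_1858 = closest_census_year(1858, FEDERAL_CENSUS_YEARS)
nearest_for_1866 = closest_census_year(1866, FEDERAL_CENSUS_YEARS)
```

Passing a different list (for instance one that also includes state census years such as 1855) changes which census each patent is paired with, which matters for questions like inventor mobility.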

Mary Kaltenberg: Did you work in cooperation with FamilySearch or webscrape the information?

Enrico: Joe Price is our link to FamilySearch

Bitsy Perlman: The full count census that is on the NBER servers is, I think, the FamilySearch version.

Joe Price: The data he is accessing is the Family Tree on FamilySearch. It is not possible to download the whole database, but they have an API that you can use to query it.

Osmat: @Enrico, as you may know, Cambia was the first to OCR full text patent data back in the 1990s to make it public. I’ll check with our team on whether there is another good source and if there is, I’ll let you know.

Ina Ganguli: Bitsy, is it correct that full count census on the NBER server is only the U.S. Census but not the State ones, so we couldn’t find cases like the person Mike mentioned?

Bitsy: Yes, only the Federal Census.

Bronwyn: surprised by how much information is able to be extracted from some of these census records

Mike: Just because we can’t match from the patent to the census doesn’t mean we can’t match: FamilySearch is really good for this.

(to Dan Gross:)

Bitsy: Pre-1836 patents are summarized in The Journal of the Franklin Institute… a lot of them are in Google Books. It's breaking them up into parts that is the trick.

Enrico: Is there a sense of how many of them there are?

Bitsy: They start publishing in 1822 -- 4 times a year?

SJ: 606 docs available here

Enrico: How many patents were filed between 1790 and 1836?

Bitsy: About 10k

Mike Andrews: My understanding is that the pre-1836 patents were not numbered. So it makes it hard to know if the Franklin Institute books are complete and what is still missing. Others may know more. But there are a lot of fascinating pre-1836 patents!

Bitsy: I'm sure they aren't complete, and they also have foreign patents. But the list we have of X patents has names and titles, so one could probably get a number on them.
Of course, some X patents are still just totally lost

SJ: An interesting overview from the Franklin Institute librarian in 1888 (summarizing the overviews they had published previously)

Enrico: I guess, the full-text is not available, or is it?

Bitsy: Remember the fire (of 1836)… Some of the patents (not typeset) have been found elsewhere. Re: counties across time. Also, if you want to geo-code historical towns come talk to me.

SJ: The USPTO has image-scans of ~2600 X patents that were recovered, but OCR of the script used is hard… cursive handwriting! Has anyone tried to digitize that?

Bitsy: I believe M-CAM made a pass at trying to digitize them. They did some hand transcription; gave me a few examples.

Alenka: Is it possible to join patents to other sources?

(To Tania: )

Adam Jaffe: In the linking of early patents to other places…. they’re potential, not actual entrepreneurship… do we have other indicators?

Tania: Correlating with other measures of local entrepreneurship: empirically highly correlated. This raises a deeper question and discussion: what do these independent inventors represent? We know that most won’t start their own company. Invent, sell patent, move on.

Think of it instead as early-stage innovation! In terms of matching to other datasets, one thing I’ve done, which is available from the patents themselves, is to look at patents assigned to companies named after the inventor. That’s a small fraction of innovation, though.

Looked into case studies: some start firms, some don’t, but I don’t know of comprehensive data that matches to people who go on to start companies or not. We see that independent inventors are more likely to be entrepreneurs.
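One rough way to flag "patents assigned to companies named after the inventor" is to check whether the inventor's surname appears as a word in the assignee name. The sketch below is only an illustration under that assumption: the function names and the example inventor and firms are hypothetical, and it will mishandle suffixes such as "Jr." and very common surnames.

```python
import re


def surname(inventor_name):
    """Last alphabetic token of the name, lower-cased (a crude surname guess)."""
    tokens = re.findall(r"[a-z]+", inventor_name.lower())
    return tokens[-1] if tokens else ""


def is_eponymous_assignee(inventor_name, assignee_name):
    """True if the inventor's surname appears as a whole word in the assignee name."""
    last = surname(inventor_name)
    if not last:
        return False
    assignee_tokens = re.findall(r"[a-z]+", assignee_name.lower())
    return last in assignee_tokens


# Hypothetical examples.
own_firm = is_eponymous_assignee("Elias Whitcomb", "Whitcomb Manufacturing Co.")
other_firm = is_eponymous_assignee("Elias Whitcomb", "General Electric Company")
```

Matching on whole words rather than substrings avoids flagging, say, an inventor named "Lee" against an assignee called "Fleet Motor Works".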

Osmat: Thinking of patents as business assets though did not start until the early 1900s, no? And so before that time were inventors patenting to sell their invention?

Kevin: Yes.

Bitsy: They also licensed them. Often several licenses to people with geographic bounds.

Adam: correlating with other measures of entrepreneurship: how do the magnitudes compare? Is the average rate of patenting 10x the entrepreneurship rate?

Bronwyn: I don’t think that’s what you want, I think you want independent patents per total employees. The RHS tells you about patent composition, which is not the right thing

Adam: I agree, I think it’s not the right comparison

Nicholas Ziebarth: Look at stuff like the Census of Manufactures in the ?19th? century. Random sample now but currently being digitised. Dun & Bradstreet. For the cases where you know the business you could match it in those records. Definitely some more work to dig through it.

Bitsy: In the 1930s there were other censuses too: economic and other censuses…

Nicholas: Those are a bit of a mess.

Mike Andrews: NB: The assignee is as of the time of filing the patent.

Look at patent assignment documents… they contain not just who patent transfers to but the terms of the sale…. good for someone with a lot of money and time

Adam: Does the PTO track assignment?

Mike: The legal requirement has stayed consistent; you want to report it to the PTO if you want to assert the patent, but there’s probably a lot of assignment and even more licensing that doesn’t get reported.

[Sarada + Tania] For future discussions: do they have to be about patents, or can they be about innovation more broadly? Sarada: I have some work at the Philly Fed, looking at hiring + job vintages, and how that affects R&D. Will share a paper of what I’m thinking about for this.
Adam: I’d love to hear your thoughts, w/ the caveat that we haven’t decided yet how we might broaden the scope. If you have suggestions of what our domain should be, I’d love to hear it.


