Notes from the first Working Group (agenda), Dec. 6-7, 2019
Feel free to update or leave comments below.
Answer on the dedicated page for Q&A
~ Do you have datasets to share?
~ What do you need for your research?
~ General requests of one another (mentioned during the day)
Adam (slides)
History of the I³ — NBER’s long-standing interest, Sloan helping to link related projects (and influencing the name!), MIT facilitating. This is the first working-group meeting of a series.
Plans for the next 2 years: two of these working meetings and one summer session each year, publishing research that comes out of this. Hope to expand this to include post-docs. In two summers, we will have to revisit continuing this work; there may be opportunities for ongoing support.
I³ is open for contribution and participation by all. However, the grant behind the initiative specifically funded four efforts that will produce data and data services, in addition to making space for sharing and discussion:
Osmat (Slides)
Lens.org: an overview of current data available online, via search and API
LensLab: a planned portal in collaboration w/ MIT, with access to bulk and API data drawing from Lens.org
Initial data snapshots are available (see above), more available on request
The LensID aligns varied data — MAG, DOI, PubMed — into a metarecord (see the sketch at the end of this section).
Future: Adding new sources all the time, aim to integrate prior art data
NB: Data cleaned up via this alignment has been reintegrated into MAG data.
Comment: Loved your past gene sequence data! As economists: please release time-stamped dataset snapshots, not just API output.
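(Aside, for illustration only: a minimal Python sketch of what such an identifier-aligning metarecord might look like. The field names and LensID format below are invented, not the actual Lens schema.)
```python
# Hypothetical illustration (invented fields, not the actual Lens schema):
# a "metarecord" keyed by a LensID that aligns identifiers from several sources.
metarecord = {
    "lens_id": "020-559-012-122-885",        # made-up LensID format
    "external_ids": {
        "doi": "10.1000/example.doi",
        "pmid": "12345678",                   # PubMed ID
        "mag_id": 2741809807,                 # Microsoft Academic Graph paper ID
    },
    "title": "Example article title",
    "cited_by_patents": ["US-9876543-B2"],    # alignment enables patent<->paper links
}

def resolve(records, id_type, value):
    """Find the metarecord that carries a given external identifier."""
    return next(
        (r for r in records if r["external_ids"].get(id_type) == value),
        None,
    )

print(resolve([metarecord], "doi", "10.1000/example.doi")["lens_id"])
```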
Sam (slides)
The Knowledge Futures Group - Working to make public knowledge graphs a public resource. An underlay (public graph), publishing tools for docs and data, protocols for data registries, and applications building on them.
Prior art archive: IDs, classification, permalinks, Google Patent integration.
Future: add traditional knowledge, wikidata integration
Open metrics: IDs, links to versioned algorithm + full dataset used
Hosting LensLab; data archived on IPFS.
Future: shared dataverse for community data
Q: Consider the fragmentation of corporations into shell companies and divisions. How does that fit in; could that be traced through [such graphs]?
A: Different datasets + curators would have different approaches. This would help you to find many such datasets, and to properly cite the one you used
A: (general debate on different clustering + disambiguation approaches - to continue tomorrow morning)
Matt Marx (slides)
RelianceOnScience.org: a new public dataset.
Beyond the first page: one must extract links from the body text.
Citation extraction: how our process works - lots of Perl/shell, hat-tip to Aaron Fuegi; rough overview of the algorithm for extracting + estimating links (illustrative sketch after this section’s Q&A)
Many thanks to BU’s 18,000-core data cluster, which is used very heavily!
(noted later: their tweaks helped speed up core analysis by 4000x)
Large resulting datasets are all public at http://relianceonscience.org, source code at https://github.com/mattmarx/reliance_on_science. Body-text cites are in beta, feedback encouraged. Front-page cites have been expanded to worldwide patents using DOCDB (hat tip: Google Patents). Data appendix here.
NB: the % of citations that appear only in the body text has been declining recently
Comment: this is extra helpful for economists, because it adds identifiers at every step in the process. Very useful for cross-linking and reproducibility
Q: Regarding coverage for older patents: how good is old Microsoft Academic Graph (MAG) data?
A: Hug & Braendle (2017) benchmark MAG against Scopus and the Web of Science using 91,215 verified, multidisciplinary publications from the University of Zurich’s Open Archive and Repository as of October 2016. Coverage of these publications was 47.2% in WoS, 52.0% in Scopus, and 52.5% in MAG. Are there gaps? A few opinions shared; some reusers like Lens have been sharing enriched data back w/ MAG; more clarity needed.
Gaétan (slides)
Details of the IPRoduct Patent-to-Product links. Data + workflow
Lots of work by hand, facilitated by extraction algorithms. Exploits information that companies put on product packaging.
Lots of web scraping, ~500M pages; limited in part by scraping capacity.
ML classifier for early pruning + estimating which pages need a closer look (sketch below)
Current stage: prototype and site built; expanding the project to include community data enhancement. Idea: bulk access to enhanced data will be proportional to contribution, else limited to researchers.
Neat video walkthrough of using the current tool.
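(Illustrative only: a minimal sketch of the kind of early-pruning classifier mentioned above, using tf-idf + logistic regression in scikit-learn. The pages and labels are invented; this is not the IPRoduct pipeline.)
```python
# Minimal sketch (not the IPRoduct pipeline): score scraped pages so that
# obviously irrelevant ones can be pruned before closer inspection.
# Requires scikit-learn; the example pages and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pages = [
    "This product is covered by U.S. Patent 9,876,543 and other patents.",
    "Read our blog post about summer recipes and travel tips.",
    "Virtual patent marking: protected by patents listed at example.com/patents.",
    "Contact us for careers, press inquiries, and investor relations.",
]
labels = [1, 0, 1, 0]  # 1 = likely mentions patent marking, 0 = likely not

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(pages, labels)

new_page = "All models are protected under US Patent 8,765,432."
prob = model.predict_proba([new_page])[0, 1]
print(f"probability the page needs a closer look: {prob:.2f}")
```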
Q: Links to UPC codes? A: Not yet, but plan to allow users to add the information.
Q: Can you use Orange Book data to estimate coverage, at least in pharma?
A: Maybe… (limitations noted; access to the archives needed; Heidi has access and can share)
Comment: Alibaba has internal data similar to this, but won’t share. Perhaps other retailers do and could be convinced to.
Q: Tacit knowledge is essential - there is so much in the room here. But no one likes writing the data appendix… (mumbling as a few people say they do)
Idea: Set a standard and templates for how to do this well. Cf the OEC website.
Manuals; A page for each term or measure with definitions + limitations; what users can and cannot expect from using each {dataset, project}. Distinguish “widely known by experts” from “sufficiently documented for newcomers”.
Q: Heidi is excited about drafting data documentation, can share; discuss w/ her before the final session (goals, mission, priorities).
~ Morning delicacies at 0800 ~
Chair: Bronwyn H. Hall
Disambiguation + resolution.
Matching names to another list (authors, papers)
In the case of firms: clustering and evolution of firm history (and ownership transfer!)
Company Names and Ownership Changes in the Dynamic Reassignments of Patents (slides)(Lia Sheer, Ashish Arora, Sharon Belenzon)
— Extending NBER 2006 data to 2015, and tracking ownership across 35y
Example: Conoco-Phillips: Compustat doesn’t give you the name. CRSP’s monthly stock file has historical ownership; needs a crosswalk, and a manual check of 10-K filings by year to see in which year it changed.
Outcome: a standardized name list: from a string to an ID per year.
Can cover 55% of utility patents in the US. In 80% of cases before ‘06 there was agreement, in 17% they updated / found better data.
v. important for individual-patent (asset-) level analysis.
C: IP may not shift w the name change. [no idea how to register that]
Likewise reassignment even with no name change.
Challenges and Solutions in the Construction of Chinese Patent Database
(Deyun Yin) (slides)
CNIPA challenges: no inventor disambiguation, no standardized names. No citation database!
Improving Patstat coverage, geocoding. Map innovation hotspots.
Supplement missing data from CNIPA, JPO, KIPO.
Improve: patent families > 1; identify tech clusters w/ DBSCAN (sketch after this section).
Coverage: only 60k Chinese companies in US dbs?
Quality: 2/3 of applications not examined; only 1/15 renewed (vs. ~50% elsewhere); 1/4 the claims per application.
Name disambiguation (paper just published in Scientometrics, covering data through 2016)
2-step ML model.
Avoid duplicating disambiguation and harmonization work.
Applicant harmonization; inventor disambiguation; gender identification; geocoding.
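(Sketch of the DBSCAN step mentioned above, on toy vectors only; the real work would use actual patent features, and the eps / min_samples values here are arbitrary.)
```python
# Illustrative only: DBSCAN over toy patent vectors to find technology clusters.
# Real features (e.g. class co-occurrence or text embeddings) would replace
# the random vectors below.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Three synthetic "technology areas" plus a few noise points.
vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(30, 5)),
    rng.normal(loc=1.0, scale=0.05, size=(30, 5)),
    rng.normal(loc=2.0, scale=0.05, size=(30, 5)),
    rng.uniform(low=-1, high=3, size=(5, 5)),
])

labels = DBSCAN(eps=0.3, min_samples=5, metric="euclidean").fit_predict(vectors)
print("clusters found:", sorted(set(labels) - {-1}),
      "| noise points:", int((labels == -1).sum()))
```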
Race, Ethnicity, and Patenting: USPTO’s New Data Collection Effort (slides)
(Lisa Cook)
”Counting who’s patenting and whose patents count”!
Previously: collected the largest dataset of patents by African Americans; first systematic documentation of pre-Civil Rights era “Black names”.
SUCCESS Act of 2018 - a rare bipartisan effort to promote women, minority, and veteran participation in patenting. Could only use publicly-available data. Everyone has to be able to replicate our work (to support this policy). The report noted: little publicly available data!
IDEA Act of 2019 - provide for collection of demographic info, including gender, race, ethnicity, national origin, sexual orientation, age, veteran status, disability, education level attained, income level — for each inventor. Annual reports, w/ no PII.
Historically: agents were very wary of giving away the identity of clients. (consider the general case for human- and machine-processed forms)
Please give feedback — I have to submit corrected testimony next week.
What suggestions might we pass on to Congress?
C: why not ask for IDs (SSNs?) rather than survey data? (from census)
A: still undercounted by census data. so … spot check for undercounting?
(doesn’t get those w/o SSNs, intl applicants; adds security problems)
C: what about PTO customer number?
Chair: Samuel J. Klein
The Complexity of Knowledge (slides)[novelty]
(Martina Iori)
— Measuring concept novelty w/ topic-modelling
Why a new novelty measure: the citation-network approach only looks at inputs, and is similar to interdisciplinarity/diversity measures. Text analysis focuses only on the final text: it loses context, depends on the # of topics, and yields a Boolean ‘breakthrough’ meme.
Look at how things are being recombined, not just what was combined. Measure for novelty in ideas: how, not just what, was combined. Topic modeling + (dis)similarity
~> Hellinger distance ~> avg distance from a small neighborhood (sketch after this section).
Fine-grained measure, to avoid bleed-over from the novelty of the field.
Work in progress: test on Economics, then on Physics
Measured on Nobel Prize winners.
Ex: Romer’s endogenous growth theory scored low in both dimensions…
Ex: a certain Jaffe paper…
Robust to: topics, neighborhood size, technicalities in language!
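(Sketch of the idea only, not the authors’ code: documents as topic distributions, Hellinger distance between them, and novelty as the average distance to a small neighborhood of closest documents.)
```python
# Sketch of the idea: topic-model each document into a probability distribution
# over topics, then score novelty as the average Hellinger distance to the
# document's k closest neighbors in the corpus. Toy data throughout.
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def novelty(doc_topics, corpus_topics, k=5):
    """Average Hellinger distance from a document to its k nearest neighbors."""
    dists = np.sort([hellinger(doc_topics, other) for other in corpus_topics])
    return dists[:k].mean()

rng = np.random.default_rng(1)
corpus = rng.dirichlet(alpha=np.ones(10), size=200)  # toy topic mixtures
doc = rng.dirichlet(alpha=np.ones(10))

print(f"novelty score: {novelty(doc, corpus):.3f}")
```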
Mapping Firms' Locations in Technological Space (slides) [clustering]
(Mitsuru Igami)
— Topological analysis of embedding in a high-dimensional space
(next up: try w/ text-based metrics!)
Mapping firms’ positions in patent-portfolio space: what can we say about the topology of the resulting distribution of firms? (also a modern use of original VSM approaches).
Distance ideas: correlation, cosine, min-complement (2012)? (sketch after this section)
dim reduction ideas: PCA, MDS, k-means?
Collapsing is often done down to 2 dimensions. [Maybe too far?]
Alt proposal: a toy model where 2D plays the role of the high-dimensional space, projected down to 1D.
Recover a sense of continuity by choosing overlapping regions (like the parlor game of drawing-telephone!). Applied to RNA, tumors, politics…
You see ‘flares’ — differentiation over time.
Q: what happens to the topology as you shift granularity? [animate it!]
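(Sketch of two of the distance ideas above, on toy patent-class portfolio shares. The min-complement formula used here, 1 - sum_i min(x_i, y_i) on share vectors, follows my reading of the 2012 measure cited in the talk; treat it as an assumption rather than a definitive reference.)
```python
# Illustrative distances between two firms' patent-class portfolio shares.
import numpy as np

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def min_complement_distance(x, y):
    # x, y are portfolio shares over the same set of patent classes (sum to 1)
    return 1.0 - np.minimum(x, y).sum()

# Toy portfolios over 4 patent classes.
firm_a = np.array([0.50, 0.30, 0.20, 0.00])
firm_b = np.array([0.40, 0.40, 0.10, 0.10])

print("cosine distance:        ", round(cosine_distance(firm_a, firm_b), 3))
print("min-complement distance:", round(min_complement_distance(firm_a, firm_b), 3))
```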
Applications of Textual Similarity to Measure Construction and Evaluation
(Jeffrey M. Kuhn, w Ken*)(slides)
Ideas from our long-running patent-to-patent text-similarity work: a vector model built from the technical description; patents characterized by terms, weighted by tf-idf.
Cosine distance calculated pairwise b/t patents.
Find a first mover with priority advantage and property rights; other firms can follow on. BUT: this requires the full similarity/correlation matrix to find pairs of very similar patents fitting some constraint [same classes, similar timing] —> 100s of TB in a dataset, i.e. ~10M observations per patent.
Approach: release a public dataset w/ the 100 / 10k K-nearest neighbors under a given matching algorithm from a given base dataset.
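(General pattern only, not the authors’ pipeline: tf-idf vectors over patent text, then the K nearest neighbors per patent by cosine distance, which avoids materializing the full pairwise matrix. The texts are invented.)
```python
# Sketch of the tf-idf + cosine + KNN pattern described above, on toy texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

patent_texts = [
    "A lithium-ion battery electrode with silicon nanoparticles.",
    "Battery electrode comprising silicon-coated graphite particles.",
    "A method for brewing coffee using pressurized water.",
    "Espresso machine with a pressurized brewing chamber.",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(patent_texts)

knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
distances, indices = knn.kneighbors(X)

for i, (dist, idx) in enumerate(zip(distances, indices)):
    # idx[0] is the patent itself (distance 0); idx[1] is its nearest neighbor.
    print(f"patent {i}: nearest neighbor {idx[1]} (cosine distance {dist[1]:.2f})")
```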
NB: similarity and novelty have many possible definitions.
similar text or claim? new idea or combination?
We need multiple measures, [named and versioned!], at the same time, grounded in crisp definitions w/ qualifiers (“combinatorial novelty” and “distance novelty”, not just single terms)
We need measures to understand our measures.
Evaluate potential corrections to biases. Hard to establish ground-truth training data. How do you train on an abstract concept? Develop alternate approaches: a federated bootstrap.
Novelty and Impact (slides)
(Dominique Guellec) - measurements using text, metadata + NLP
The patent examiner’s approach to novelty is not just novelty; it is non-obviousness and quantifying the ‘inventive step’. How obvious or surprising is a new step given its past? There is a literature on surprise.
Novelty is not distance: it is not symmetric; there is time and asymmetry. No magic recipe; some ideas, but also concerns. Worth delving into the experience of patent examiners.
Past measures of novelty: originality, radicalness. Often drawn from metadata, expressed in combinations of cites and classes. Limits: too reductive; citations include Type I + II errors; little detail/hierarchy.
Cleaning microdata is more art than science. Preparing the textual data is a dark art: arbitrary choices, tokens, stop words, stemming… fractal dependence? Be transparent! Test variations? (Illustrative sketch just below.)
Proposal to all: in addition to cleaned, enriched patent data: please include a reference database w/ the text of documents, and various vector representations of text.
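(Tiny illustration of the “arbitrary choices” point above: two reasonable preprocessing settings give different token sets, which is why those settings belong in the documentation. The stop-word list is a toy one; NLTK’s Porter stemmer is optional.)
```python
# Two preprocessing configurations for the same claim text, to show how
# stop-word and stemming choices change the tokens downstream code sees.
import re

try:
    from nltk.stem import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    def stem(word):  # fall back to a no-op if NLTK is not installed
        return word

STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in"}

def tokenize(text, remove_stop_words=True, apply_stemming=True):
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if apply_stemming:
        tokens = [stem(t) for t in tokens]
    return tokens

claim = "A method for the controlled charging of battery cells"
print(tokenize(claim, remove_stop_words=True, apply_stemming=True))
print(tokenize(claim, remove_stop_words=False, apply_stemming=False))
```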
Questions overall
Q: Are you naming your metrics, versioning them?
Important for cross-measure analysis: to see other measures and what they are for. In a tech setting you’d have a plug-fest, or an “interop fair”: take one output, put it into another paper, and try to get the same result.
A: ”If only someone would publish a whitepaper on that process!”
Q: If I want to explore a potential part of the graph, where is that possible?
Q: In some fields you have math and equations in titles. In others, like patents, they try to turn diagrams and math into English in the claims. Is there a cross-walk for this / a way to solve this?
A: In some settings, an ML model could distinguish different contexts for [rho]
When burrowing into the text of claims, how do you deal with this? Has anyone integrated topic modeling w/ vector similarity methods? [beyond Martina’s work - use a known [named?] framework?]
A: Some models: you start w/ an embedding, a given matrix representation.
There’s a lot to be said about positioning these things hierarchically; inherit a submeasure from a general one, or convolve it. [“claim length”…]
B: for chem/gene similarity: you can just take text that describes that piece — patents might be identical save the chems, but be totally different.
Context: what is the critical piece? compare text and chem and other similarity. articulate the limits.
Q: is there a good measure for distance, not similarity?
Ex: BLAST algorithm (in bioinformatics).
There are also a lot of IBM patents that are looking for 1D stream similarity…
Google Ex: Internal to Google, we used document models, bag of words, cite- and class-code-based — as proxies for similarity. Didn’t get meaningful variants. We need to come up w/ gold sets of similarity; we have resources dedicated to this. If you have thoughts on how you’d collect/define Golden Sets of well-structured evaluations, come talk to us.
As a parallel example, we did this just for ClaimsBreadth, and got wildly different definitions.
Ex: PubMed has a canonical set of keywords that hint at related citations. I’m drooling over the idea that we could do this w/ patents. [What social and tech norms were needed to make this work?]
Chair: Osmat Azzam Jefferson, Cambia
Gaétan de Rassenfosse, EPFL : PatStat (slides)
(nb: Dominique + Julia were parents of PatStat and PatentsView…)
Andrew Toole, USPTO
Name disambiguation (from 9 different validation sets)
PatentsView: [more visuals later]
Sharing new classes of data; maintaining tools and interfaces.
New datasets coming out this year: (3-5 specifics [litigation data?]; get links!)
Ian Wetherbee, Google Patents and BigQuery (slides?)
Contact: [email protected]
(G Patents data is drawn from IFI Claims + old US OCRs)
(BigQuery designed to smooth out the process of sharing large datasets), subscription model?
Shared primary keys: DocDB isn’t complete enough. Pricing concerns w/ datasets — keeping them online and available for processing is hard!
Distributing platforms to let people do research on this: how do we make it work?
Pricing is very important. How do you prioritize and make sure it’s available ~forever? How can you recover costs? Cost split b/t storage ($0.02/GB/mo, paid by the uploader) & query output (1 TB/user/mo free, then $5/TB).
Host a CSV file at some URL, and you’re done.
Spent 1 yr negotiating w/ the Google data provider and EPO to republish their data freely for everyone. Now this is done for most data! All biblio data in DocDB, plus PTO-specific data.
Anyone can access it within BigQuery, run SQL, download it. Alleviates a whole host of problems.
BigQuery, federated: acts like a giant SQL database. You pay only for the marginal bits.
There are no separate silos at each university. Just a shared ACL per table. Inside BQ we have PatentsView; DocDB keys; OCE data; ChEMBL data. (Example query sketch below.)
“Google Patents Research” table where we share our similarity vectors, translations, top terms and other extracts.
Similarity vectors are trained on the full text of patents, predicting their CPCs (WSABIE, https://media.epo.org/play/gsgoogle2017).
We trained similarity embeddings, eyeballed them… launched them on Google Patents. When you click “find similar”, that’s what you get. Iterate w/ human feedback.
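(Minimal sketch of the federated-query idea from Python; assumes google-cloud-bigquery is installed and a GCP project with billing is configured. The table and field names below match the public patents dataset as I understand it; verify before relying on them.)
```python
# Run standard SQL against the public patents tables from Python.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP project and credentials

query = """
    SELECT country_code, COUNT(*) AS n_publications
    FROM `patents-public-data.patents.publications`
    GROUP BY country_code
    ORDER BY n_publications DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.country_code, row.n_publications)
```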
Q: is there a way to accept submissions of public data?
A: talk to Ian ([email protected]) for Cloud Credits/a spot to upload. There’s no default way yet (for datasets > 10GB).
Q: Algorithm naming and sharing: are the internal metrics for an output X described w/ links to code and a name for the metric or script?
A: Aside on Metarecords: LensID, WikidataID, dbPediaIDs
Chair: Adam B. Jaffe, Massachusetts Institute of Technology and NBER
~Bronwyn H. Hall, University of California at Berkeley and NBER
Let’s not recreate PATSTAT: one central data repository.
1) Align identifiers. We have international ? [standards?] on patstat-ID and doc-identifiers.
2) Standards for data release. Documentation: definitions of each term
On bulk data access — you may take a while to publish, need a fixed source of data. Need at least a permalink to a snapshot, preferably a download.
3) Metaresearch: catalog of definition of terms. Resulting definitions get a ~unique, qualified name, specific enough to avoid confusion over a decade.
4) How do we validate a [similarity] measure? What is it good for? [connect to research into limitations on fairness]
5) Our big advantage is communication w/ one another
~ Elisabeth Ruth Perlman, Bureau of the Census
1) We’re sharing knowledge by sending people to this room. Also talking about federating data automatically and writing it down… not always the best model. Benefits of [central? multicentral?] organization:
Have paid staff, have Heidi and Pierre manage people to write data documentation rather than writing it themselves [cf people w/ ample funding: G, PTO]. Have some long-term commitment
2) Lots of things that ?? care about that we don’t?
3) I’ve never worked w/ patents since 1900 unless I was paid to. The older ones are labors of love; how do we connect these w/ recent active commercial work?
4) Warning I put on all my patent work: warn them about using patent data; then show them how to use it… “research into patent records is frustrating; one comes in hoping to find invention, and sees property.” — get source, put this on the site!
~ Bhaven N. Sampat, Columbia University and NBER
1) Exciting time to reflect. If we just met like this it would be useful.
What are the first-order problems we need to solve, what’s the state of the art, what constraints need to be overcome? Let’s find ways to set priorities [and explore more than one]
2) Some things we can try that might not be too novel. Patent refs to science — we have a survey of patentees asking about why they cited that. We might not be able to help that much in getting such feedback. But we could help more in: validating concepts [bring in legal scholars], validating samples of ground truths [from historical OCR to patent references]
3) Documentation — we should include the data generation step. Where is it coming from, with what founder effects.
3b) Think about not just taking in new datasets, but teaching how to use old ones. Best practice in using bigquery and other tools; [in using older standard datasets]
4) A listserv where we can ask questions of one another. Now I know who to contact re: the bodies buried in PATSTAT, or about Chinese patents.
~ Heidi L. Williams, Stanford University and NBER
The closer! We’re all here excited about something in particular. I want to know how to get more econ students to work on the most important set of questions.
1) —> Develop course materials about this. These issues don’t get taught to the groups who could help solve them!
2) —> Lower the barrier to entry for students to start on this research. Expand inclusivity in who is working on the problems. Reduce cost and angst.
3) Simplify language: lower the cost to break in. Don’t have people use data w/out knowing what it means; do bring in people who would otherwise do something else. Cf. the tacit-knowledge discussion above.
4) What’s the simplest way to get started — imagine this for students; it doesn’t have to be just one person. Have a group write this set of pages about techniques and sources. A GitHub page or WP entry, whatever is appropriate.
~ Other thoughts for the future:
Task: bootstrap federation - make federation work, iteratively.
How does this interact w/ social media? To complement social media: Twitter is good at expanding science circles! Connect w/ current students. [Bitsy, +atl]
Pierre - Specific example of outreach to core audiences: how far are we from the point where (PatentsView keywords) are used inside the Census?
Mary - Imagine an R or Python or Stata package where you document both where the data came from and in which papers it was used…
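(One possible shape for that package’s metadata, purely illustrative; the class and field names are invented.)
```python
# Each dataset ships with provenance metadata and the papers that have used it.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetProvenance:
    name: str
    version: str
    source_url: str
    construction_notes: str
    used_in_papers: List[str] = field(default_factory=list)

example_entry = DatasetProvenance(
    name="patent-to-paper citations (illustrative entry)",
    version="2019-12",
    source_url="http://relianceonscience.org",
    construction_notes="Front-page and body-text citations; see data appendix.",
    used_in_papers=["<add DOIs of papers that used this snapshot>"],
)

print(example_entry)
```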
Look at case studies: what’s been successful in other fields?
Invest in a mapping exercise:
What variables/data are we looking for, and who has that now?
An argument for centralization, at least of badging: if your algorithm is posted on “the site” then your tenure committee can appreciate that. Like the Stata Journal. Cf. UMass Amherst, w/ a data page w/ disambiguation programs in Python. GitHub has lovely templates.
Templates for bundling the (arbitrary) configuration choices of a complex data-cleaning process into a package (named, versioned) that can be reused.
Catalog supply and demand, have a little bulletin board covering both (a number of people here offered data, offered metrics, asked for requests, asked for data they could host for free, offered analytics, asked for recommended metrics and training data)
Discuss pricing: visualizing, paying for up-front and marginal costs (for storage, computing, cleaning)
Papers for the summer: not those that would be in a session like this; technically oriented? papers being published elsewhere?
Wonderful to meet everyone! Please share answers to questions at the top, and other thoughts for the future in the last section, immediately above.