10/13/23
Work with Bernhard Ganglmair
Is there something we’re missing by discounting other sources? Defensive disclosures published in Research Disclosure: the journal / publication service publishes them and sends them to patent offices, where they are considered prior art.
Research questions:
do patents tell the whole story?
How do (defensive) disclosures relate to patents?
2 motives:
straightforward: defensive, don’t want anyone else to get a patent on this thing
strategic: extend a patent race; with 2 competing firms working on a similar invention, there is a motive to disclose to prevent the other firm from continuing the patenting process / to raise the bar
little empirical work on this.
For the fellowship: search for CPC classes for disclosures, find which patents are similar to disclosures, both using NLP.
56k disclosures in Research Disclosure. Pay to publish, disclosure fees are now $120.
Matt Marx: the DISCLOSURES are cited in patents? would love to see an example of that
peaks in the 2000s, now constant in numbers.
‘companies that publish with us know that they can rely on the disclosures being found. PCT minimum documentation status’
90% of “leading” companies have published in RD
Challenges for NLP Classification:
where are they? what patents are similar?
Disclosures don’t follow a particular format. Lots of heterogeneity: very long and very short disclosures, illustrated and not illustrated. Difficult for an LLM to make comparisons between these things.
Solution: ask the LLM to write abstracts for disclosures.
Second problem: no CPC classes for disclosures. There are tools in the ML / deep learning world to transfer patent classification to disclosure classification. Use transfer learning to train a classifier for disclosures with patent data.
Used Llama-7B, prompted to mimic the style of a patent abstract.
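A minimal sketch of what this abstract-generation step could look like, assuming a Hugging Face Llama-2-7B chat checkpoint and a made-up prompt; this is not the presenters' actual prompt or pipeline.

```python
# Hypothetical sketch: rewrite a heterogeneous disclosure as a patent-style abstract.
# Model id and prompt wording are assumptions, not the presenters' actual setup.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed 7B checkpoint
    device_map="auto",
)

def disclosure_to_abstract(disclosure_text: str) -> str:
    prompt = (
        "Rewrite the following defensive disclosure as a single-paragraph abstract "
        "in the style of a patent abstract. Keep all technical details.\n\n"
        f"Disclosure:\n{disclosure_text}\n\nAbstract:"
    )
    out = generator(prompt, max_new_tokens=200, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()
```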
Domain adaptation: use transfer learning
Source domain: patent
Target domain: disclosures
Used Google’s BERT for Patents. It produces representations of the text input; the BERT model then just does classification: here’s a patent, tell me the CPC class. Domain loss: penalise the model for differentiating between patents and disclosures.
Homogenisation of the learning pipeline. Didn’t get the accuracy they wanted. (How is this being evaluated?) Patents and disclosures end up in the same space and can be compared.
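The domain loss idea is essentially domain-adversarial training: one head predicts the CPC class on patents, a second head tries to tell patents from disclosures, and a gradient-reversal layer punishes the encoder when that second head succeeds, pushing both document types into a shared space. A rough PyTorch sketch under those assumptions; the checkpoint name, head sizes, and loss weighting are guesses, not the presenters' code.

```python
# Hypothetical sketch of the patent -> disclosure domain adaptation described above.
# Checkpoint name, label counts, and loss weighting are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient so the encoder is punished for separable domains.
        return -ctx.lamb * grad_output, None

class CpcDomainModel(nn.Module):
    def __init__(self, n_cpc_classes: int, lamb: float = 0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("anferico/bert-for-patents")  # assumed checkpoint
        hidden = self.encoder.config.hidden_size
        self.cpc_head = nn.Linear(hidden, n_cpc_classes)  # CPC classification (patents only)
        self.domain_head = nn.Linear(hidden, 2)           # patent vs disclosure
        self.lamb = lamb

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        cpc_logits = self.cpc_head(h)
        domain_logits = self.domain_head(GradReverse.apply(h, self.lamb))
        return cpc_logits, domain_logits

# Training loss (sketch): CPC cross-entropy on patent examples
# plus domain cross-entropy on both patents and disclosures.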
Share of patents published with assigned class tracks roughly with disclosures published.
Papers:
1) descriptive paper on patents vs disclosures, trend comparison, anonymous vs non-anonymous. Case studies of firms (e.g. HP, Dupont)
2) Analysing the effect of changes in patentability: potential for a reduction, but the more positive reading would be that people still continue inventing
Bhaven: did Deepak have a paper on IBM technical disclosures?
The questions are twofold:
1) how do you think about these vs publications?
A: firms are publishing now, especially in AI/ML, but the disclosures are still unique in that publication is their only purpose. Papers go through a peer review process, which is much more work than just paying for a disclosure. If you’re big enough as a firm there are also ways to publish on your own website. That is also part of what makes it unique. On how relevant it is: it dates back to the ‘hard copies’ era. Trying to have a way to measure similarity to patents without mapping into CPC space. Also thinking about standards using code; the code will be a part of the publication.
2) the general-purpose contribution here might be moving arbitrary texts into CPC space
Agnes: for the CPC classification, do you have a validation process? What about the ‘abstractness’ of the abstracts?
A: we have some CPC classifications, self-reported from 1970-1976, because those weren’t yet part of the dataset (obtained from a physical archive!). Difficult to argue that this early period is representative of the whole thing. The other thing would be to look, case by case, at which patents are similar to them. On the ‘abstractness’ question: for now we don’t have real validation apart from subjective analysis; among several possible methods, asking a more powerful LLM whether that’s a good abstract can at least scale to a larger level.
Matt: would be great to see an example of disclosures being cited! Could be great to look at whether these affect patenting, but also science itself. Using PPP to try to do both at once; it looks like it’s more about the disclosure than about the patent granted. But: what happens when there is no patent? Why disclose and not publish? Very exciting.
A: with similarity, we can look at them being cited, could use value measures that we see in patents. Moving into disclosures in general
This presentation is about software: Python for Difference, Rate and Direction. A Python interface designed around a core of innovation datasets, built on top of Polars, a query engine.
Auto-downloads data/metadata to your computer. Pulls from the web and gives access to a set of preset pipelines: functions that manipulate datasets in certain ways. Each takes a couple of parameters, e.g. weights and dates. Example: forward citations for granted patents.
Entities -> Panels -> Pipelines -> Datasets
Datasets: pull down and load data into memory. returns a dataframe with metadata attached to it
Pipelines: functions that build on the dataset, manipulate it
Panels: cool specialised pipelines :) Instead of giving you a Dataset, gives you a DataPanel, which has a plan associated with it and does operations on the dataset, e.g. counting and weighting. You can change the plan; the panel is only created on ‘build’.
Entities: specific methods attached to objects that refer to units of analysis. e.g. ‘want to create a counterfactual for method of CPC’.
Talking about what you can do in datasets/pipeline models.
Walkthrough example: do drug patents cite more papers than non-drug patents?
calls reliance on science
returns a Dataset object
build a comparison function using filters and builtin methods
can still use external datasets via the ‘read’ method
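Pydrad’s actual API wasn’t spelled out in the talk, so the snippet below is not it; it just sketches the drug vs non-drug comparison with Polars directly, which the package sits on top of. The file name, column names, and the drug-patent proxy are made up.

```python
# Illustrative only: the walkthrough comparison done in raw Polars, not pydrad.
# File name, column names, and the drug-patent proxy are assumptions.
import polars as pl

patents = pl.read_parquet("patents_with_reliance_on_science.parquet")

comparison = (
    patents
    .with_columns((pl.col("cpc_class") == "A61").alias("is_drug"))  # crude proxy for drug patents
    .group_by("is_drug")
    .agg(pl.col("n_paper_citations").mean().alias("mean_paper_citations"))
)
print(comparison)
```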
also possible to add a dataset/source to pydrad
aim: to make as maintainable as possible — how to handle changes to names etc on datasets.
Bhaven: would advise starting with the problem it’s meant to solve; you should make the case for this! Are there similar packages in other data domains? Potentially talk about why it was needed for your workflow.
Organising publicly-available clinical trial data
10,000-foot view of the project: very much a work in progress. Joint with David Ritzwoller (econ) and Sabri Eyuboglu, Arjun Desai, Karan Goel (compsci).
Came out of PhD research. Clinical trials are increasingly used as measures of innovation, especially when patent structure looks different across categories of technology. Standard use of clinical trial data: clinical trials are randomised trials with structured designs. Outputs are quantitative objects; they result in P-values and confidence intervals.
People have made the argument that these are not all equal; they can be more or less useful/informative. Comparison to patents: some have more exclusionary force than others.
Clinical trial as count -> clinical trial as bundle of statistical information
issue: this is usually presented in unstructured text.
possible questions:
treatment-outcome pairs
questions about which drugs with similar statistical information get prioritised
just need a little bit more information: scientific publications, FDA, clinicaltrials.gov
preliminary work: the 3 sets of information record different information; firms are disclosing different information.
Approach: LLMs to extract / validate structured data from unstructured text. Clinical trials are the pet problem chosen for this. Main challenge: with LLMs it can be very hard to ensure quality output; the concern is that any errors in P-values extracted from publications / counting data in a political, fraught industry would undermine the value of the research.
Validation/checking is a very high priority.
Trilemma: extraction from varying structure, validation that quality is high, implementation at scale (100s of thousands of docs)
‘data production frontier’ -> regex, which gives lots of quantity and not lots of quality (e.g. Noel-Storr et al., Feldman et al.); works for very structured info
Hand Labelling — one paper ~150, very low volume but valuable
Proposition is that reduction in quality is small compared to increase in scale.
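To make the regex end of that frontier concrete: a pattern like the one below picks up explicitly reported p-values cheaply, but misses anything phrased differently or reported only in tables, which is why quantity comes at the cost of quality. The pattern is illustrative, not taken from the cited papers.

```python
# Illustrative only: a naive regex for explicitly reported p-values.
# Cheap at scale, but brittle to phrasing ("significant at the 5% level") and table-only results.
import re

P_VALUE = re.compile(r"[Pp]\s*[=<>]\s*(0?\.\d+)")

text = "The primary endpoint improved (p = 0.03); the secondary endpoint did not (p=0.41)."
print([float(m) for m in P_VALUE.findall(text)])  # [0.03, 0.41]
```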
Attractive: clinical science publications are highly structured. Information presented in roughly same order. Common elements include design details etc, measurements. Simplifies task, but still variation.
Even GPT-3 can do a good job with pulling information and structuring from an abstract. Structure can vary across publications.
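A hedged sketch of what that extraction step could look like, assuming the OpenAI chat API, a made-up JSON schema of treatment-outcome-p-value records, and a cheap sanity check on the extracted p-values; not the authors' actual pipeline.

```python
# Hypothetical sketch: extract treatment-outcome pairs and p-values from an abstract as JSON,
# then run a minimal validation pass. Model choice, prompt, and schema are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def extract_results(abstract: str) -> list[dict]:
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed model choice
        messages=[
            {"role": "system",
             "content": "Return only a JSON list of objects with keys 'treatment', 'outcome', 'p_value'."},
            {"role": "user", "content": abstract},
        ],
    )
    records = json.loads(resp.choices[0].message.content)
    # Validation: keep only records whose p-value parses and lies in (0, 1].
    return [r for r in records
            if isinstance(r.get("p_value"), (int, float)) and 0 < r["p_value"] <= 1]
```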
methods: defining a sample, extraction and structure, processing and validation
How do we figure out what a clinical trial is?
Samples: exclusively human subjects; reduce 35 million papers in PubMed to clinical trials. Hand-labelling abstracts, right now at about 4k; labels ‘does this abstract conform to the definition’.
Using the Meerkat Python library (meerkat.wiki); built a platform optimised for hand-coding and producing innovation datasets.
GPT-3 and GPT-4 do well on the data, with false positive rates under 10% and 5% respectively. Producing labels of varying quality at scale. Not feasible on the sample of 1.8 million. Working with BERT and a Llama model; GPU access needed to run the Llama-7B model.
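One way the ~4k hand labels could be scaled to the full 1.8 million abstracts is by fine-tuning a small classifier on them; a sketch under that assumption, with made-up file and column names, not the authors' training code.

```python
# Hypothetical sketch: fine-tune a small BERT classifier on the hand-labelled abstracts
# ("is this a clinical trial?") so labels can be produced for the full sample.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Assumed CSV with columns 'text' (abstract) and 'label' (0/1 from hand-coding).
ds = load_dataset("csv", data_files="hand_labelled_abstracts.csv")["train"]
ds = ds.map(lambda x: tok(x["text"], truncation=True, padding="max_length", max_length=256), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="trial_classifier", num_train_epochs=3),
    train_dataset=ds,
)
trainer.train()
```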
Data availability as active constraint. Tools for high quality data at scale.
Meerkat coauthors
Media and the Diffusion of Scientific Knowledge
Use of news media to disseminate/diffuse knowledge. Press offices publicise research. Looking at lots of things: geography, networks, technology transfer offices. Press information has so far been quite ignored. Most major newspapers have science journalists. Can we make some progress on this discussion?
More on the data part of the project. Existing datasets:
Altmetrics, DOIs to media sources, don’t know extent of coverage compared to ground truth. ~40% of true mentions
CrossRef: more comprehensive, best post-2019
PlumX: not as familiar; coverage of finds seems similar to CrossRef
Saqib — look at entire news ecosystem, paper with 4 coauthors at different universities, press offices make independent decisions on press releases. Central platform called EurekAlert — consumers are science journalists subscribing to the platform. Continued to be the central platform from which most science communication happens.
Created by AAAS in 1996. Very important role in the process. >300k covered between 1996 and 2020. Data on awards, grants, businesses.
Focus on research side. Data from press releases and universities joining.
Structured information about coauthors, affiliations etc. Fuzzy-matching approach. Links to OpenAlex papers. Extracting information from downloaded databases, then running it through potential matches.
Using the fact that most press releases come within 1-2 weeks of publication narrows the set of papers to look for in OpenAlex. Matched 36% of posts on EurekAlert to the top 100 US universities.
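A toy sketch of that matching step: restrict OpenAlex candidates to a window around the press-release date, then fuzzy-match titles with rapidfuzz. Field names, the window, and the threshold are assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of matching a press release to an OpenAlex record.
# Field names, date window, and score threshold are assumptions.
from rapidfuzz import fuzz

def match_press_release(release, candidate_papers, window_days=14, threshold=90):
    """release: dict with 'title' and 'date'; candidate_papers: dicts with
    'title' and 'publication_date' (e.g. from an OpenAlex snapshot)."""
    in_window = [
        p for p in candidate_papers
        if abs((p["publication_date"] - release["date"]).days) <= window_days
    ]
    scored = [(fuzz.token_sort_ratio(release["title"], p["title"]), p) for p in in_window]
    best_score, best_paper = max(scored, key=lambda t: t[0], default=(0, None))
    return best_paper if best_score >= threshold else None
```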
Validation: last 6-7 years, links to DOIs in press releases. Where this is present can validate whether approach is working. 70% of true matches where DOI linked.
Issue: publication dates not always accurate on OpenAlex — better on CrossRef. Sometimes quality of press release also not good, but the date is improvable.
In terms of results, in the Altmetrics data: being in a press release strongly predicts a media mention, and such papers also tend to get more mentions. 300% increase in total patent citations; people further away benefit more from mentions. Mid-rank universities and journals benefit more. No difference between male and female authors in press releases, but female authors are less likely to be mentioned in the press.
Matt: can you clarify the top 100 universities thing?
A: it takes 3 days for the matching to run; to start with we want to do it with a sample we’re able to manage. Just a bootstrapping thing.
Matt: shouldn’t it be sensitive to commonality/notoriety?
A: no not at all
Bhaven: interesting to know what kinds of fields get a mention. Difference between fields a really interesting dimension.
A: the fields that get a lot are med, bio, chem, psychology. Outcome measures: policy documents, econ+polsci more likely to contribute
Connected with Maya’s work: potential connection to clinical trial outcomes.
Agnes: how do press offices play a role in this? who decides?
Saqib: very unstructured approach. Top journals are by default on everyone’s radar, but there’s a lot of variation there; if I know the press office, I can show how the office is concerned.