Lessons from the Lens Lab
These are collaborative notes; feel free to add your own.
A discussion on strategies for knowledge reconciliation, moderated by Osmat Jefferson from the Lens Lab.
This is part of the I3 Spring 2021 Data Sharing workshop series.
The initial session was at 1200 EDT/1600 UTC on April 14th, with a second at 0100 UTC the next day.
Video will be posted shortly.
What research matters in the innovation workflow?
Science, patents, policy, and context are all relevant. Tradeoffs:
Look at “traffic on the bridges” across corpora of knowledge
Findable, Accessible, Interoperable, Reusable, and Enabling (FAIRE) access to current knowledge.
Focusing on patents as indicators of invention: socioeconomic status of parents is strongly correlated w/ rate of invention. This could be limiting the rate of [realized, published] discoveries by a factor of 4.
Mariana Mazzucato research: sees decreasing reinvestment into economy. Industry pays more in dividends than re-investment. How to mobilise public sector to become more effective?
Comment: (BH) Mazzucato's argument is incomplete - there may be better investment opportunities outside the 500 firms, and the dividends and repurchase proceeds are doubtless being reinvested elsewhere
Multiple elements need to come together to produce a product or a service, and different types of data need to be connected to test hypotheses collectively. For the Lens, this means building bridges between the knowledge silos and creating public tools to navigate, map, and share them under the FAIRE principles.
Q: (TB) Related to bullet point 2: Do we have statistics on what fraction of, say, new products are covered by patents? Process innovations covered by patents?
Where? In the USA? Over which time frame? I think it would be hard to estimate, as many utility patents can have both a method and a product claim (biotechnology-related patents).
Started in 1999 with 60k biotech records, when Cambia developed its first IP resource: records were bought from the USPTO and the OCRed content was released as searchable full text. The Lens currently has 2 corpora:
scholarly 230mil+ (updated biweekly, moving to weekly)
core data (supplemented full-text data)
patent data 130mil+ from 105 jurisdictions
These are parsed, merged, and linked using an aligned architecture: by record, by person, by organisation. We are developing a taxonomy of functional relationships between different entities, assisted by the Lens ID.
The LensID, a global, open, persistent identifier, is used to aggregate and catalogue contextual information around an entity, creating a patent MetaRecord normalized against different standardized information; supplementary information is then added over time.
The bottom diagram shows how the scholarly MetaRecord actually works, and how we feed back the outputs of some of our data pipeline.
3 major public sources (MAG, PubMed, Crossref) are ingested and each given a Lens ID.
We put these into a resolver (crosswalk of their identifiers)
Then this goes into a modeler [Scholarly MetaRecord Modeller] for matching metadata, quality control, and deduplication along with maintaining provenance of the logging history
Where there’s a conflict between two records, we maintain separate records (one record may have Open Access status, another may not)
For the rest we merge metadata records, enriching w/ supplementary data
Sharing data upstream: biweekly to PubMed, monthly w/ MAG, resolved DOIs w/ others
With every data update, whether from us or in MAG for example, more metadata is added and records are shuffled and re-aligned. Supplementary data from various sources is interjected as well.
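The ingest, resolve, and merge steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Lens implementation: the `MetaRecord` class, the `build_metarecords` function, the input field names (`source`, `ids`, `metadata`), and the placeholder ID format are all hypothetical. It shows a crosswalk keyed on shared identifiers, merging of non-conflicting metadata with provenance, and keeping a separate record where metadata conflicts (for example differing Open Access status).

```python
from dataclasses import dataclass, field


@dataclass
class MetaRecord:
    lens_id: str                      # placeholder ID, not a real Lens ID format
    ids: dict                         # e.g. {"doi": ..., "pmid": ..., "mag": ...}
    metadata: dict
    provenance: list = field(default_factory=list)  # sources merged into this record


def build_metarecords(source_records):
    """Merge source records that share an identifier; keep conflicts separate.

    Each source record is assumed to be a dict with the hypothetical keys
    'source', 'ids' and 'metadata'.
    """
    crosswalk = {}   # (id_type, id_value) -> MetaRecord
    records = []
    for rec in source_records:
        keys = list(rec["ids"].items())
        # Resolver step: find an existing record sharing any identifier.
        match = next((crosswalk[k] for k in keys if k in crosswalk), None)
        # A conflict is a field present on both sides with different values.
        conflict = match is not None and any(
            name in match.metadata and match.metadata[name] != value
            for name, value in rec["metadata"].items()
        )
        if match is None or conflict:
            match = MetaRecord(
                lens_id=f"hypothetical-{len(records) + 1:03d}", ids={}, metadata={}
            )
            records.append(match)
        # Merge step: enrich the record and log where the data came from.
        match.ids.update(rec["ids"])
        match.metadata.update(rec["metadata"])
        match.provenance.append(rec["source"])
        for k in keys:
            crosswalk.setdefault(k, match)
    return records
```

In this sketch a conflicting record keeps pointing at the same identifiers but lives as its own entry, so neither version of the metadata is overwritten.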
Q: (MM) what's the full-text coverage? Google Patents has USPTO ft back to 1836, but EPO only back to 1978
Full corpus coverage can be explored in the preview environment now.
A: Full text from EPO, from USPTO, from WIPO, from Australia (Australia is still to be added). All stats on the data can now be explored and checked on the website by doing a blank search in the structured search page and viewing the analysis tab; the dashboards can be analyzed and checked too.
We share NPL biweekly with PubMed, resolvable by PMIDs, and share patent data monthly with MAG. Microsoft developed its own resolver. We share resolved DOIs with CrossRef. We are also in discussions with patent offices interested in the pipeline.
Users can search, analyze, and export data with and without registration. Registered users can access 50k documents at a time. Users can also use applications such as PatCite or In4M (a custom metric for influence that we developed), navigate the data available in these apps, and download it freely.
Individual + institutional training in using the data
Lens Labs exists to support collaborations w/ universities and other institutions, and to improve access + use of the (open) data.
Possible to get ‘Lens for Institutions’ (https://www.lens.org/lens/institutions) -> librarians can access bigger datasets w/ fewer rate limits and package up to three APIs plus additional features. For example, an institutional user can export 100k instead of 50k at a time and has permission to use the site for their professional work.
Research interest in Citation Chaser (Neal Haddaway; sorry for the misspelled last name in the presentation): a systematic review application whose owner collaborated with the Lens and has access to an open API to share the tool.
Bulk data is also available, with and without regular updates; currently aiming for every 1-2 weeks. In a few months, hoping to add live indexing.
Certain datasets for particular applications, such as COVID-19, available for free on Lens.
Stepwise guidelines are available on the site. You can search all 120 fields, choose particular fields, and make versioned datasets. To speed up the process of getting the data, please ask your librarian to submit the request. Through institutional toolkits, launched in February 2021 (https://www.lens.org/lens/institutions), you and others at your institution will get better and longer access to various tools, including APIs.
Licensing and access:
Research use: just include attribution (‘enabled by the Lens’) + include the LensID in redistributions.
Commercial use: contact the Lens.
All other IDs (MAG, PubMed, DOIs, etc.) are also provided and available.
For anyone working on entity resolution and reconciliation: you can access the whole corpus; we’re making it available to all who address problems in data quality and want to enrich the open data resource.
Tracking reuse is not automatic (unless cited formally in a paper). Citation chaser tracks who reuses their data.
We’ve seen 80k data exports, counted from the email notifications requested by registered users. Other exports cannot be tracked, and these counts aren’t linked to future [re]use.
New Site: preview.lens.org
Through the I3 collaboration we released the patent API. Beyond that, based on other support, we were able to add all the new Lens architecture to the Lens platform: agent/applicant address, legal status, extended/simple family, backward, forward, and NPL citations, etc., scaling the number of searchable fields from 45 in the current production site to 120. But note: whatever you do on preview is not saved! We need feedback, so please use it. It is a great place to explore and experiment.
In a few weeks, we hope it will be released as production. (The Lens Labs site is updated here, and not in production, where there is still the old architecture with just an updated look + feel.)
e.g. structured search page:
last update of data shown (today: April 7th)
for the stats, look at breakdown of different properties of the dataset, number of patents/articles w/ each property or do a blank search and check the estimated counts in the filters.
see what is there from each jurisdiction
e.g. limit to full text then sort by earliest published date per jurisdiction to see coverage
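The coverage check in the last step (limit to full text, then find the earliest published date per jurisdiction) can also be run over exported records. The following is a minimal sketch; the function name and the record keys `jurisdiction`, `date_published`, and `has_full_text` are assumptions for illustration and may not match the field names in an actual Lens export.

```python
from datetime import date


def earliest_full_text_by_jurisdiction(records):
    """Return {jurisdiction: earliest publication date} over full-text records.

    Each record is assumed to be a dict with the hypothetical keys
    'jurisdiction', 'date_published' (a datetime.date) and 'has_full_text'.
    """
    earliest = {}
    for rec in records:
        if not rec.get("has_full_text"):
            continue  # only full-text records count toward coverage
        jurisdiction = rec["jurisdiction"]
        published = rec["date_published"]
        if jurisdiction not in earliest or published < earliest[jurisdiction]:
            earliest[jurisdiction] = published
    return earliest
```

Sorting the resulting dictionary by date gives a quick view of how far back full text reaches in each jurisdiction.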
[TB] Do we have statistics on what fraction of, say, new products are covered by patents? Process innovations covered by patents?
BH: Mansfield asked that, for firms he interviewed; w/ some information in older Carnegie-Mellon surveys. Not very up to date!
[SJ] indicators of how data is being used / reused now?
The only visibility is what people write to us or what is published in the media.
We are happy to support use case applications, and it’s a pretty recent activity since the API release in December 2020. So far we’ve been swamped by requests and interest. Intention is to make the tools and aggregated datasets as open as possible.
[BH] It strikes me you have 2 ways to track use
1 — your database will give you citations (ha), though that could take a while
2 — you have registered users, could in principle survey them
Again, Lens users can be registered or not registered.
[MattM] It's great that you decided *not* to track users. Just so I understand, you have to create an account and log in, but you don't keep records of who downloads what?
You do not need to have an account to use Lens.org
the only thing we can track is if they request export download notification (have 80k)
477k API calls and 123 M records downloaded so far
[BH] we felt [H/J/T] we made a mistake in not tracking downloads and use, until we waited for citations to show up and people started sending us inquiries. You might consider tracking at least emails to follow up. And funders might care about usage stats.
[BH] have you checked for citations?
I tried, but we don’t have consistent referencing: the Lens, lens.org, Patent Lens, Cambia, Jefferson etc — it’s all over the place
[BH] If people find it easy to cite they will cite; you need to give people a detailed way to cite.
[AC] Is there a plan to include a broader range of NPL (e.g. references to GitHub repositories, Wikipedia articles) as entities that can have Lens IDs?
We include these and serve the citation strings. We give each an internal ID to pass them through the resolver but we do not expose that ID.
[MikeM] Question: is Lens.org team working on parts of entity resolution internally? (e.g. recognition, standardization, harmonization, disambiguation)
We are mostly leveraging and reconciling what others have done.
ResearchRabbit uses the API already. The Lens has already ingested ORCID + MAG IDs, and is now starting on human disambiguation (dab). We offer a user-based approach to attest and authenticate authorship and inventorship.
Planning to work on human (author/inventor/applicant) + org dab via GRID, and on quality control to improve matches in MAG.
If we find things are not working with any of the public data sources, from misaligned data to bugs, we let them know, very responsive interaction.
Arash: [from VTT] we are using name disambiguation; when we bring in something like a company name from WOS, do you have a disambiguation method we can use to match this (to something in the Lens)?
We avoid using commercial datasets that we couldn’t make available, or where we couldn’t share the process of how we got the data. When we developed the In4M metric, we used the DOIs from InCites data, but we were not allowed to share the institutions’ scholarly works in that study, so we do not. If the applicant name of a business is available in DocDB, then, as we use their standardized applicant name as a benchmark, you may be able to find that name.
If you want to train your own tool on our dataset, you are welcome to request access to the API.
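For the name-matching question above, one simple starting point is fuzzy matching of an incoming company name against a list of standardized applicant names. The sketch below is not a Lens method; it uses Python's standard `difflib`, and the function names, the legal-suffix list, and the 0.85 cutoff are all assumptions chosen for illustration.

```python
import difflib
import re

# Hypothetical list of legal-form suffixes to ignore when comparing names.
_SUFFIXES = {"inc", "ltd", "llc", "gmbh", "corp", "corporation", "co", "oy", "ab"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop common legal-form suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in _SUFFIXES)


def match_applicant(name, standardized_names, cutoff=0.85):
    """Return the closest standardized name, or None if nothing is close enough."""
    index = {normalize_name(s): s for s in standardized_names}
    hits = difflib.get_close_matches(
        normalize_name(name), list(index), n=1, cutoff=cutoff
    )
    return index[hits[0]] if hits else None
```

A real pipeline would likely need more than string similarity (addresses, identifiers such as GRID, manual review of borderline matches), so the cutoff here should be treated as a tunable guess.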