Challenges of unstructured web data, case study of IPRoduct
Part of the I3 Data Sharing workshop series
This session will feature discussion of the challenges involved in working with unstructured web data, with examples from IPRoduct.
When: Wednesday March 17th, from 12:00 – 13:15 EST (1700 UTC)*
Where: Zoom ~ Registration link
*A second viewing will be held Thu. March 18th, at 0600 UTC. Contact Kyle for details.
Patent to Product - Challenges of unstructured web data & the crowd-based solution of IPRoduct. Moderated by: Gaétan de Rassenfosse
Questions recorded in these notes will be forwarded to the presenters.
Find data online (Virtual Patent Marking or other formats), and enter it.
Prevalence: 10-15% of patents in recent USPTO data had a VPM page online, but not all firms commercialise products. Restricted to the population that has produced a product, this translates to roughly 25%.
(Secondary goal: understand the prevalence of physical marking: these estimates are based on samples)
Q (SJ): are there related efforts to extract the physical markings and UPCs from photos?
—> A: Of interest in the future, not sure of any parallel efforts. Getting the UPC is of particular interest.
Data in focus: from VPM — company metadata + patent metadata
what are product prices? are there associated trademarks?
Economist view of data/knowledge production:
- high fixed cost, low marginal cost
- incentives: publishing papers on public data, support from foundations
IPRoduct doesn’t fit this well: medium fixed cost, URL changes, continuous updating. Need to crawl continuously. No standard data structure for the same document; very hard to automate. (What’s the current status of a once-patented item, or a transferred patent?)
See a lot of patents being transferred (where we know the original assignee). Gives us another way to study the status of patents.
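To illustrate why continuous crawling is needed, here is a minimal sketch of how two crawls of VPM pages could be compared to spot moved, removed, and edited pages. It is not IPRoduct's actual pipeline: the function names, the URLs, and the patent numbers are hypothetical, and the pages are passed in as plain text rather than fetched.

```python
from __future__ import annotations

import hashlib


def fingerprint(page_text: str) -> str:
    """Hash whitespace-normalised text so trivial reflows do not count as changes."""
    normalised = " ".join(page_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def diff_crawls(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two crawls, each a mapping of URL -> page text."""
    report: dict[str, list[str]] = {"new": [], "missing": [], "changed": []}
    for url, text in current.items():
        if url not in previous:
            report["new"].append(url)
        elif fingerprint(previous[url]) != fingerprint(text):
            report["changed"].append(url)
    # URLs seen before but absent now: candidates for link rot or a moved page.
    report["missing"] = [url for url in previous if url not in current]
    return report


# Hypothetical example data (made-up URLs and patent numbers).
old_crawl = {
    "https://example.com/patents": "Widget A: US 1,234,567",
    "https://example.com/old-page": "Widget B: US 7,654,321",
}
new_crawl = {
    "https://example.com/patents": "Widget A: US 1,234,567; US 2,345,678",
    "https://example.com/ip": "Widget B: US 7,654,321",
}
report = diff_crawls(old_crawl, new_crawl)
```

In this example the same content reappears under a new URL, so it shows up as one "missing" and one "new" page. Matching such pairs by fingerprint would be a natural next step, but is left out of the sketch.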
Incentives: how to maximize both production and diffusion?
Club good for now
can access data by working on tasks for the platform
after enough contributions you become a ‘trusted user’, whose submissions we don’t need to validate
Differentiated pricing in the future
reduced for researchers, full price for companies
Full subsidy in farther future, for uni researchers + nonprofits
Q: also some OurResearch approach / brief embargo? [Y, if support develops]
Funding a hard constraint: with sufficient funding (e.g. big 3yr grant), we could release data for free.
Q: what’s your estimation of the costs on an annual basis? A: 100k bare minimum. Would take 200k to hire a data editor or similar.
Q: approaching trade assocs? for funding + practice A: not yet, chicken + egg. (If we only have 30k products, not so useful yet)
NB: dataset sustainability is expensive. It is not costless, no matter how much you automate; but no one wants to fund ‘maintenance’
Dolores: There are some open-ended projects, where we’re not just happy to have a single dataset. There’s a lot of problems in industry where datasets are in their infancy. Having more of a presence in industry forums could be a good place to find support.
Compare: Data Carpentries, CS&S data funds, economies of scale?
Add other non-exclusive, non-rivalrous funding sources
Adam Jaffe: Sloan foundation supports OpenCorporates. If someone wants to think about a project to connect firm names in patent data to the OpenCorporates, they would probably fund it.
Downloading: Currently 30k patents for 80k products.
‘The export function currently says 6,028 credits would be needed to download the entire dataset. Do you have an estimate of how many hours of labor it would take to earn that many credits?’
Still working on incentive schemes.
Can contribute and review, can’t download yet.
Can validate users but little community management, not so friendly for new users.
Classifying potential web pages -> developing a classifier for VPM pages. Needs to be very accurate, need people to train classifier
Sharing web documents: if you find a VPM webpage, open a link
Sending pictures — capturing physical markings
ideally develop this in the phone app
Enrich product information
products can have quite fine-grained classification: particularly useful for trade users.
connecting products to amazon reference really important, you get a lot of metadata
Enrich company information
systematic linking to LinkedIn
here we want to enrich things like no. of employees, age of company, social network info
Still figuring out the exchange value of work on the platform! e.g. 1 edit gives 1 credit… we don’t quite know what that means yet.
Share web docs
Category - Harmonized commodity descriptions
Company product numbers: UPC / ASIN / EAN / Trademark data
Company data: Sector, Industry class
Company name: linked to LinkedIn name (upgrade to OpenCorp!)
Data access and cost:
We depend on open data. So for instance we don’t link to PATSTAT: every 6 months we’d need to install and pay for the new db.
Downloaded the fulltext DocDB from GPatents; it would have been cheaper to buy from PATSTAT. Is there a less expensive way to get it?
We want to avoid harmonization of records being duplicated — ensure this gets integrated once people do it on their own.
Q, Dolores on parents/subsidiary confusion. Good point about parent/subsidiary firms. Or having a way to link to Compustat or another company dataset would be useful.
A: Compustat covers only listed companies, and we want data to be as open as possible. PATSTAT: every 6 months you need to install and pay for a new database. Something like OpenCorporates might be a useful tool. We will also provide code to match to PATSTAT, to help. Don’t want to leave matching tasks to individual researchers! Duplication is such an issue. If someone does something like linking to trademark data, we share data with them on condition that it can then be added to the platform. Linked Open Data by EPO is good, but Google data is very convenient.
TaniaB: Re: matching LinkedIn firms to Compustat, I am working with data from a startup that has done this - data used in this paper, for context. LMK if you are interested in talking to them for a potential data swap along the lines you are pursuing with the trademark data ppl.
Yes, would be interesting! Access to individual data points like this is expensive for researchers; some partnership would help. (cool, will be in touch via email)
Q: What fraction of products don’t link to patents?
Majority don’t, but a lot do. The products that don’t tend to be quite low tech.
User guide - detailed description of each variable, and the sample frame
Visual guide for each interface
Forum or FAQ — but who reads them? maintaining one can be a pain
Patents (DocDB, PATSTAT ID)
Products (ASIN, UPC, EAN)
Trademarks (US IDs)
X Firms — don’t yet have a Firm ID to link to.
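As a rough sketch of what these linking keys could look like on a single record, assuming one record per product: the class and field names below are hypothetical and are not the platform's actual schema, and the identifier values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProductRecord:
    """One product with the external keys mentioned in the notes (hypothetical layout)."""
    product_name: str
    patent_ids: List[str] = field(default_factory=list)     # e.g. DocDB or PATSTAT IDs
    asin: Optional[str] = None
    upc: Optional[str] = None
    ean: Optional[str] = None
    trademark_ids: List[str] = field(default_factory=list)  # US trademark IDs
    firm_id: Optional[str] = None                            # no firm ID exists yet

    def available_keys(self) -> List[str]:
        """List which families of keys another dataset could join on."""
        keys = []
        if self.patent_ids:
            keys.append("patent")
        if self.asin or self.upc or self.ean:
            keys.append("product")
        if self.trademark_ids:
            keys.append("trademark")
        if self.firm_id:
            keys.append("firm")
        return keys


# Placeholder values, not real identifiers.
record = ProductRecord(
    product_name="Example widget",
    patent_ids=["US-0000000-B2"],
    upc="000000000000",
)
joinable = record.available_keys()
```

Here `joinable` contains the patent and product key families only, which mirrors the current state described above: firms cannot yet be joined on.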
Questions, comments, thoughts?
—> examples welcome…
What can this learn from Wikipedia and other participatory efforts?
—> happy to discuss further
What price points and reward structure to use?
—> (see discussion above — 20-100 hrs for a full set?)
Can I share raw data (VPM pages)?
—> looked into Internet Archive &c but they are slow [for regular use]
Tempted to share the simple product-patent table, but worried it would kill survival prospects of the project. With some lag (1yr, other)?
—> OurResearch releases w/ a 3 month lag; Clarivate make old data cheaper
Q: have you noticed if VPM includes patents that are licensed (but not owned) by the firm producing the product? A: yes …
Q: Is visibility in media and industry forums useful? Getting people used to seeing you around, not only in academic and research circles.
Q: See also NASA and ESA, which love to track patent-to-product life cycles over time; they could be sustainers and big contributors/subscribers
Q: Have you thought about a general product catalog, without patent links? How much data in prod catalogs do you see including in IPR’s dataset?
A: That’s not the goal; there are many examples of product catalogs, generally expensive, hard to maintain in their own ways [involve more frequent updates?]
Kyle Higham: What other models did you consider around the production and diffusion of the data?
A: Many of us here are either economists or data scientists. In addition to the tension between maximising diffusion and production, the project needs a continuous income for development. But I don’t think that’s how public money should be used! It should be made open. I also need to publish!
There’s one project with HKUST where we are working with RAs to find trademark data… but we need funders! If it’s possible to get a large sum then it would be wonderful to open source.
Frank Van der Wouden: What variables do you currently capture?
A: prod name, prod type, patent authority, patent number, design, kind, firm sector, firm type, no. of employees.
Frank VDW: What does it look like to motivate participation? Difference with Wikipedia: as a WP editor you have creative agency
Frank VDW: Do we have products sold in specific types of markets
A: that’s why we put trademark data, allows you to derive a lot of that information
Sadao Nagaoka: Does the database cover both virtual marking and physical marking? Who uses the marking system + when is the marking made?
A: We want to cover physical marking as much as possible. Having an app would be wonderful, but that’s a big ask on the technical side. Primarily virtual b/c easier to collect.
On who uses VPM, NBER working paper in the slides. Based on a random sample of US assignees, find their information online. Find that 12% of the US assignees have some sort of virtual marking webpage. Might just be a note in the bottom of the site, but we can find some form of information.
Assignees need to commercialise products, otherwise there is nothing to mark; the share that does is anywhere between 1/3 and 2/3. Say 50%. Thus ~25% of the relevant population uses VPM.
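The back-of-the-envelope estimate can be written out as a short calculation. This is a sketch using only the figures quoted in these notes (12% of US assignees with a VPM page, between 1/3 and 2/3 commercialising), and it assumes that every assignee with a VPM page is also a commercialiser. The function and variable names are made up for illustration.

```python
def vpm_share_among_commercialisers(vpm_share_all: float, commercialising_share: float) -> float:
    """Share of commercialising assignees that use VPM.

    Assumes every assignee with a VPM page also commercialises a product.
    """
    if not 0 < commercialising_share <= 1:
        raise ValueError("commercialising_share must be in (0, 1]")
    return vpm_share_all / commercialising_share


VPM_SHARE_ALL_ASSIGNEES = 0.12  # figure quoted in the notes

# Scenarios for the share of assignees that commercialise products.
scenarios = {"low (1/3)": 1 / 3, "central (1/2)": 0.5, "high (2/3)": 2 / 3}

estimates = {
    label: vpm_share_among_commercialisers(VPM_SHARE_ALL_ASSIGNEES, share)
    for label, share in scenarios.items()
}

for label, value in estimates.items():
    print(f"{label}: {value:.0%}")
```

The central case gives about 24%, in line with the ~25% quoted. Across the 1/3 to 2/3 range the estimate runs from roughly 36% down to 18%.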
Kyle: A lot of issues with link rot in VPM — big issue with judging timings etc.
Gaetan: a lot of the timing depends when the product is commercialised. That’s why physical products are great — ‘patent pending’ can’t be returned and changed by the company so much easier to keep track of different stages.
Tetsuo Wada: Do you have a legal entity in mind that owns/manages/operates this project? (funding/hosting combination)
Gaetan: See a couple of options. Should this be managed by an individual researcher? I do this because it’s interesting from a research perspective. But it would be much better if patent offices managed it (esp. UK, US, CN, Aus, as they use this marking). Maybe reduce renewal fees if you submit this information to your patent office.
Interest from other actors: bodies like NASA are interested in product lifecycles. Foundations are interested in which cited scientific papers end up in products, as a great measure of impact: a paper cited in a patent which ends up in a product. Social returns of public funding of research. Patents to products are the missing piece of the pipeline.
Public funders of science like the NSF should be interested in this data. Not giving up on having long term public funding.
Kyle: US/UK have patent marking explicitly written, other places like Aus in particular, it seems like a VPM would satisfy their requirements, so de facto included. Like that in Germany and the Netherlands.
Gaetan: we do see Japanese firms in data, but many for export to US market (where these markings required)
Frank vDW: Should an aim be to merge something like Matt’s Reliance on Science data into IPRoduct?
Gaetan: the aim is to make datasets that can be plugged into each other. Matt, the Lens, PatentsView, Patentlink… we need to give enough keys so that these things can be plugged in. That’s essential for getting the whole pipeline.
Frank vDW: Integration great for funding. Re: data production -> what if I ask 300 undergraduates to go to a website and search for trademark data, won’t there be a lot of overlap? How do you deal with overlap that’s not exactly a duplicate (e.g. Chinese and Japanese versions of a website)? What’s the real truth?
Gaetan: Implement 3 levels of users: newcomer (contributions validated by human), trusted (no validation), read-only (no more contributing).
Hope we have conflicts as it shows we have enough users. Right now prioritise lowest info. As soon as someone fills in some trademark info, that product priority becomes lower.
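The "lowest info first" rule could be sketched as a priority queue keyed on how many fields of a product are already filled in. This is an illustration only, not the platform's implementation: the tracked fields, function names, and sample products are hypothetical.

```python
import heapq

# Hypothetical set of fields whose completeness drives the priority.
TRACKED_FIELDS = ("trademark", "asin", "upc", "price")


def filled_count(product: dict) -> int:
    """Number of tracked fields that already hold a value."""
    return sum(1 for name in TRACKED_FIELDS if product.get(name))


def build_queue(products: list) -> list:
    """Min-heap of (filled fields, insertion order, product name)."""
    heap = [(filled_count(p), index, p["name"]) for index, p in enumerate(products)]
    heapq.heapify(heap)
    return heap


def next_task(heap: list) -> str:
    """Pop the product with the least information so far."""
    _, _, name = heapq.heappop(heap)
    return name


# Made-up products for illustration.
products = [
    {"name": "A", "trademark": "TM-1", "asin": "X"},
    {"name": "B"},
    {"name": "C", "upc": "0"},
]
queue = build_queue(products)
order = [next_task(queue) for _ in range(len(products))]
```

With these sample products the tasks come out as B, C, A: the empty record first, the most complete one last. Once a contributor fills in trademark info for a product, rebuilding the queue would move that product further back, as described above.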
FVDW: What happens with assigning credit? Do you get the same credit if evidence in system?
Gaetan: Yes. We’ll know if people are abusing the platform, but in general we want to reward people’s time.
FVDW: Redundancies: there’s a universe of products out there… aren’t a lot of people going to photograph named brands first (e.g. Apple)?
Gaetan: With classification, easy to manage, we have a lot of control over prioritising what pages people see on the platform. With product, we could start to exclude over-submitted products.
KH: Can see licensing information in data. There’s not really a database with a list of patent licensing agreements; is this enough to give insight into licensing?
A: There are databases, e.g. the USPTO’s, reported on a voluntary basis, but that covers about 5% of patents. It’s a new type of information; it’s useful, but it’s more useful for studying the transfer of patents. If we go on the Logitech page, we see which are licensed by Logitech / another firm / a university. So a really useful source for patent rights.
KH: Any other crowdsourced projects?
Gaetan: Perhaps millions of unsuccessful ones! Wikipedia was a big inspiration. Need a good amount of human intervention. With stable funding we could pay data curators.
KH: Chicken and egg -> could you provide incentives for early investors?
Gaetan: Companies want large scale data, 30-40k patents not enough. Can’t afford something off the shelf. One thing we’ve been doing is consulting, NovoNordisk wanted to study the data so we did a report for them based on the dataset. Analysis can be a good way to attract funding. Would love data to be free/cheap for academic researchers.
FVDW: What is the count of unique patent numbers in database so far?
Gaetan: 30k patents, US-only. We have many many more, but data not yet fully cleaned and extracted. Very quick potential for 50k patents.
FVDW: What number of unique firms?
Gaetan: (?10k? not 100%) lots of medical device firms, but in part as a demo we searched a lot of medical device firms. next is consumer goods, then consumer electronics, software, sporting goods
difference in tendency to mark?
products easy to reverse engineer, or whether it is in stiff competition, ease of linking