A collaborative home for innovation data
This document is a living record of the current state of the I3 Open Innovation Dataset Index, providing an overview of new features, tools and ways to contribute. For upcoming features, speculation, and plans for the index, please see the I3 Index Roadmap.
The call for a Validation Dataset Index in the Winter 2022 Workshop raises a number of interesting questions: how it fits with the existing structure of the I3 index, how validation datasets are (or aren’t) currently shared, how we want them to be shared, and what role papers and code should play.
My initial reaction to this question is that validation datasets are datasets in their own right, so they should be indexed like datasets, but with a flag that gives their relationship to the parent. I still broadly think this, but since the data model is edited directly by people, it’s worth thinking about how to make the addition of these datasets intuitive.
In order to think through the question, I’m in the process of adding some validation data to the index. Validation datasets are rarely published separately from their ‘parent’ dataset, if at all; sometimes they are simply described in the paper. (NB: publishing this ‘description’ as a pointer to something that builds a dataset from a source is also an interesting pattern to think about supporting.)
A good model for how validation data can be handled is https://paperswithcode.com/, which sets a gold standard for machine learning reproducibility and validation. I think the more machine-learning-oriented ends of the I3 should definitely be aiming at this level of quality.
The other thing that’s true of validation datasets is that they’re typically used in conjunction with another dataset, and the link between them is defined by a paper (or papers, if the data is reused). This is partly making me wonder whether it would be better to do away with the flatfile architecture altogether and move to a site built from a SQL backend, archived by GitHub Actions but not ruled by them. I feel… conflicted about this idea, because there’s also something really nice about the way the site is structured and the public-ness of the data model.
Modelling thoughts from SJ (to expand upon):
Perhaps in general:
dataset D used in a paper
generator G used to make it from shared sources
method M
validation dataset V(M, D)
The parent dataset D` could be a common reference; you should share D = G(D`), as it’s not enough to just point to the parent (as with PatentsView).
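A minimal sketch of this model in Python (the class and field names are illustrative, not the index’s actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dataset:
    """A dataset indexed in its own right (D, D`, or V)."""
    name: str

@dataclass
class Generator:
    """G: code that derives a shareable dataset from a common reference."""
    name: str
    source: Dataset  # the common reference D`

    def make(self) -> Dataset:
        # D = G(D`): share the derived dataset itself, rather than
        # only pointing at the parent (as with PatentsView)
        return Dataset(name=f"{self.name}({self.source.name})")

@dataclass
class ValidationDataset(Dataset):
    """V(M, D): validation data tied to a method M and a parent dataset D."""
    method: str = ""
    parent: Optional[Dataset] = None

d_prime = Dataset("patentsview-raw")   # common reference D`
d = Generator("G", d_prime).make()     # D = G(D`)
v = ValidationDataset(name="V", method="M", parent=d)
```

The point of the sketch is that V is itself a `Dataset`, carrying a flag for its method and a link to its parent, rather than a second-class annotation on D.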
I think perhaps the best way to resolve this (after I’ve done some more tests) is to send round a survey to the I3, asking 1) for validation datasets, and 2) for thoughts on how they’d be most usefully indexed.
The other thing I’ve done in the past few days is tidy up the code a bit, resolve some Python package bugs in the GitHub workflows, and touch up the Advanced Search prototype (boolean fields still need to be added).
The iiindex is back under development! Between now and July, I’m hoping to improve a bunch of the functionality of the index, and generally make the experience a lot more robust. I’ll publish an update here every two weeks, with a view to documenting and getting feedback on the development process.
This week’s theme is the development of an advanced search for the index, something that has been under discussion for a while. Currently, I’m interested in the different modes of search people might use to find datasets — namely, there’s a difference between wanting to find more information about a dataset you already know about (well supported by the current fuzzy search), and wanting more precise tools for finding a new dataset within a given topic area, or one related to a dataset you might already be using.
The collections provide a nice, analog way to do this (and I think in some ways are actually one of the most important bits of the site) but I think the strength of a good search tool is also to throw up datasets that perhaps aren’t as well-known/used but that could still be important for research.
I’ve been experimenting with ElasticLunr to build a lightweight client-side search (also porting the site to React to make the JS tidier); you can test it out here.
You can now add relationships between datasets in the I3 index! After a bit of back and forth, we decided that maintaining consistency (and ease of editing) means that, for now, we will keep a semi-human-readable data structure in the sheet, to be edited either in markdown or directly through the site. (The Google Sheet remains editable directly, but that’s less friendly.)
Currently the supported relationships between datasets are limited to:
similarity (a symmetrical relationship)
parent/child (a directed relationship)
Adding a relationship between 2 datasets will also automatically generate its opposite — e.g. creating a parent-child relationship in one direction will make a child-parent in the other. These can be deleted manually if for whatever reason this is not desired.
Related datasets are added using a form on the page of an existing dataset, which allows them to be searched. The page takes a couple of minutes to update (via GitHub actions, which requires a short time to recompile the site). What it lacks in immediacy, it makes up for in the ability to track each change publicly in the repository.
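The automatic generation of opposites can be sketched like this (the relationship vocabulary here is illustrative, not the index’s exact set):

```python
# Map each directed relationship name to its opposite;
# symmetrical relationships (like similarity) map to themselves.
INVERSES = {
    "parent": "child",
    "child": "parent",
    "similar": "similar",
}

def with_inverse(source, relation, target):
    """Return the stated relationship plus its auto-generated opposite."""
    edges = [(source, relation, target)]
    inverse = INVERSES.get(relation)
    if inverse is not None:
        edges.append((target, inverse, source))
    return edges

edges = with_inverse("patentsview", "parent", "patentsview-validation")
```

Deleting one of the two edges by hand corresponds to the manual-deletion escape hatch described above.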
There are some loose ends that still need to be tied up from May’s tasks, namely integration, handling edit cases, and the expansion of the salient fields sheet. In addition, as described at the end of the iiindex 2.0 doc, I’m prioritising development of a ‘User Friendly Relationship Description’ field that should be ready for use by the July meeting. This involves developing the frontend somewhat, and might be the catalyst for replacing the current search implementation also.
This month, we are knitting together updates to metadata with search, and integrating new sources of metadata into the repository. As a part of this, we have started to think about the development and integration of external harvesting servers, which can suggest new datasets, using the OAI-PMH protocol.
We are also developing ideas around search that can incorporate this new metadata, including thinking about a search based on joins across disparate datasets.
I wrote a longer post about both of these things here.
Specific tasks for this month:
Refactor metadata downloads into a separate action and normalise the scripts used for each source
Integrate OAI-PMH metadata requests into the repository
Integrate requests to the Dataverse and Zenodo APIs into the repository
Expand the ‘aliases’ in the Salient Fields Google Sheet, integrate into fuzzy matching script
Investigate OAI-PMH search queries that can form the basis for a harvesting protocol
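As a sketch of what the OAI-PMH request step might look like, using only the standard library (the endpoint below is a placeholder, not a real harvester):

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import Optional

def list_records_url(base_url: str, resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords URL, resuming a harvest if a token is given."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = "oai_dc"  # request Dublin Core metadata
    return base_url + "?" + urllib.parse.urlencode(params)

def parse_titles(xml_text: str) -> list:
    """Pull Dublin Core titles out of a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("{http://purl.org/dc/elements/1.1/}title")]

url = list_records_url("https://example.org/oai")
```

The resumption-token handling matters because OAI-PMH pages large result sets; a harvesting action would loop until no token is returned.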
The theme of this month is thinking through new lenses for filtering datasets in the index. At present, we are looking at gathering new metadata about datasets that will allow filtering by particular fields (e.g. patent/paper identifiers, families, authors).
Currently, the plan is to:
Periodically pull lists of fields included in datasets (this already exists for BigQuery datasets; we are expanding it to automate as much as possible)
Use that information, plus human input, to tag datasets that contain ‘named fields’ that we care about. This will involve a little bit of NLP to do fuzzy matching between fields
Define an editable list of salient fields that people would like to filter by, along with definitions, variations (e.g. a summary of different conventions for Patent Identifier notation), and aliases defined by different namespaces (e.g. WIPO’s namespace, the Lens namespace)
Create an updated ‘advanced search’ interface that allows datasets to be grouped/filtered by common fields
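The fuzzy matching between pulled field lists and salient-field aliases might look something like this; the alias table and cutoff are invented for illustration rather than taken from the actual Salient Fields sheet:

```python
import difflib
from typing import Optional

# Hypothetical salient fields, each with aliases drawn from different namespaces
SALIENT_FIELDS = {
    "patent_id": ["patent_number", "publication_number", "lens_id"],
    "author": ["inventor", "assignee_name", "creator"],
}

def match_salient_field(column: str, cutoff: float = 0.8) -> Optional[str]:
    """Fuzzy-match a raw column name against salient fields and their aliases."""
    candidates = {}
    for canonical, aliases in SALIENT_FIELDS.items():
        for name in [canonical, *aliases]:
            candidates[name] = canonical
    hits = difflib.get_close_matches(column.lower(), list(candidates), n=1, cutoff=cutoff)
    return candidates[hits[0]] if hits else None

match_salient_field("patent_num")  # matches via the "patent_number" alias
```

The cutoff keeps unrelated columns from being tagged; in practice human input would confirm borderline matches, as described in step two.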
This metadata could also be used to suggest relationships between datasets, relating to another goal on the roadmap.
Two major additions have been made in the past couple of months, namely:
The addition of a linked ‘related datasets’ field in the index, which may be edited either via the Google Sheet or the Github repository
Edits either via GitHub or the Google Sheet get automatically timestamped, and that information is displayed to track ‘most recent updates’ on each page
Accompanying these are a number of smaller infrastructure changes that make the site more robust, and hopefully the code more reusable, as well as the development of accompanying tools like the ability to include Google Forms submissions, improvements to search, and modifications to the appearance.
We have also worked with the Heliophysics Knowledge Network to implement a sheets-based dataset index using the same codebase, the experience of which has been invaluable in developing our ideas for the future.
Lastly, the analytics for the index remain open, and are recorded using the open-source tool GoatCounter: to see the visits to individual pages on the site, visit https://iiindex.goatcounter.com/.
This week, the I3 released the beta version of our collaboratively-edited index for open innovation datasets and tools: iiindex.org. We hope this can become a place where resources across disparate platforms may be shared, annotated, and linked to one another by the research community.
The index is the result of a longer thought process about what a useful home for innovation data might look like. Early on in developing this site, we realised we should not try to replicate the many platforms available for data publication (Dataverse, BigQuery, Zenodo, Dryad, &c.), nor push people to use any one of these — each has its own affordances suited to different projects. Instead, this is a lightweight index of resources, an overlay pointing to their canonical home, with metadata that can be curated and updated by its community of users.
When we set out to design this site, we had a number of requirements. The first, and perhaps the most important, was that anyone should be able to contribute edits, with the lowest possible friction. Another was that the site should be fully versioned, so changes over time can be tracked and annotated. We also wanted it to be easy to archive a static version of the site that could be kept up with minimal maintenance, so that it can remain online for years in a stable state.
Lastly, the first version of the index was a public Google sheet, a popular workflow that we did not want to disturb. As a result, the site remains editable by anyone, via this spreadsheet, and it is managed, hosted and versioned using Github infrastructure (and so fully versioned, while remaining an essentially static site). In addition to the sheet, valid pull requests to the Open Innovation Dataset Index Github repository are automatically integrated into the site, without contributing accounts requiring write access or prior approval.
The index currently houses lists of datasets and tools (and soon, data-publishing platforms) that have been recommended by people and institutions linked to innovation data research.
Each of these lists corresponds to a different tab of the Google Sheet, where each row contains metadata about a particular resource. This metadata includes common fields such as title, DOI, and authors, but also harder-to-find information such as licensing details, derivative and superseding resources, and the range of years each resource covers.
When additions or edits are made to the sheet, a corresponding change is made to a markdown file on Github, containing this metadata (in the file header) and a space for freeform notes, annotations, and code samples (in the body of the file). It is these files that comprise the website itself. Currently, the site also includes a basic search and tag-based filtering.
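Roughly, the sheet-row-to-page step looks like this (the field names and header layout are illustrative; the real format follows the repository’s conventions):

```python
def render_page(row: dict, notes: str = "") -> str:
    """Render one sheet row as a markdown page: metadata in the
    file header, freeform notes in the body."""
    header = "\n".join(f"{key}: {value}" for key, value in row.items())
    return f"---\n{header}\n---\n\n{notes}\n"

page = render_page(
    {"title": "PatentsView", "doi": "10.0000/example", "authors": "USPTO"},
    notes="Freeform notes, annotations and code samples go here.",
)
```

Because each page is a plain text file, every sheet edit becomes a diffable commit in the repository.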
Another question that arose when we designed this site was the role that the Index plays in relation to existing catalogs of innovation data, such as the Lens Labs Apps and Data collection, the NBER Research Data portal, and Google Patents’ project to publish well-used datasets (public and private) as queryable resources on BigQuery. We hope the I3 Index will help track and version changes in these sources, while creating space for others to contribute similar guides and resources without needing the infrastructure to maintain their own website.
As a result, there is a section of the Index dedicated to Collections, where we invite contributions of and to curated sets of resources. These can be thought of as a ‘start point’ for a strand of research — for example, take a look at mine and Matt Marx’s Essential Patent Analysis Datasets collection, which is also rendered on the home page of the site. It is within this section that we also track and link out to collections curated by others in the community.
In the near term, a key goal is to gather contributions from a broad section of the innovation research community, with a particular focus on the development of collections. If you have a dataset that you think people should know about, please add it to the Google Sheet! Likewise, if you see information about a dataset that’s inaccurate or incomplete, the sheet may be used to edit that information. To add or edit longer-form text, code, notes, or a collection, make a pull request to the Github repository. Full instructions (with videos) about how to do these things may be found on the about page of the index site.
Another form of contribution that we are excited to see is how people use this data to produce other resources. Versioned .csv files are available for each of the different indexes (datasets, tools, data publishing platforms), within the Github repository. We also want to let more researchers know about the Index — so if you plan to make use of this in your research, do write about and cite it! As the site does not yet have its own DOI, for now you can simply reference the main URL, https://iiindex.org.
Other than broadening contributions, a key next step is to automate more of the processes that obtain metadata about resources in the index, and run them on a regular basis, flagging any broken links and version changes to ensure the site stays up to date. At present, contributions made through GitHub are augmented with MediaWiki resource search results, and in the near term we’d like to expand that to include calls to common APIs such as Dataverse, BigQuery and Github, and also to replicate this behaviour in the Google Sheet.
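A broken-link check of this kind can be sketched with the standard library alone; in the real index it would run on a schedule via GitHub Actions:

```python
import urllib.error
import urllib.request

def check_link(url: str, timeout: float = 10.0) -> bool:
    """Return True if the resource answers a HEAD request without an error status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError):
        return False

def find_broken(urls: list) -> list:
    """Flag links in the index that need review or updating."""
    return [url for url in urls if not check_link(url)]
```

HEAD requests keep the check cheap; a flagged link would open an issue or annotation rather than being removed automatically.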
One longer-term goal is also to index and link the schemas of datasets as well, so that someone could browse by the indexed variables of different datasets and explore how they might be composed.
This project is still at an early stage, and we are excited to see where it goes, and how people use it. Please email us with any questions, feedback and feature requests.
Thanks to Ian Wetherbee for the recommendation of GitHub actions as a lightweight way to manage the index, and to Matt Marx, Lia Sheer and Cyril Verluise for support, feedback and contributions. This project was developed by the Innovation Information Initiative, and is supported by a grant from the Alfred P. Sloan Foundation.