
IIIndex 2.0

Not the metadata we need, but the metadata we deserve

Published on May 17, 2022

You're viewing an older Release (#3) of this Pub.

  • This Release (#3) was created on May 30, 2022.
  • The latest Release (#5) was created on Feb 07, 2023.

The IIIndex is an index of metadata. Specifically, it is a community-maintained list of pointers to innovation datasets that are hosted across disparate platforms, augmented by an interface to collate and write about these datasets, space to record relevant notes and code samples, and collaboratively maintained structured metadata.

At present, metadata about innovation datasets is added to the index using a hybrid of manual and automatic methods, with the aim of eventually automating as much as possible. For example, when a dataset with a DOI is added, the relevant citation metadata is fetched and added too, using Wikimedia’s Citoid API endpoint.
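
As a rough illustration of the DOI lookup, the sketch below builds a Citoid request URL and picks a few fields out of one item of the response. It assumes the public Wikimedia REST route for Citoid on the English Wikipedia host and its `mediawiki` output format; the helper names, the choice of fields, and the example DOI are hypothetical and are not taken from the IIIndex code.

```python
from urllib.parse import quote

# Assumption: the Citoid route exposed through the Wikimedia REST API.
CITOID_BASE = "https://en.wikipedia.org/api/rest_v1/data/citation"


def citoid_url(doi: str, fmt: str = "mediawiki") -> str:
    """Build a Citoid request URL; the DOI is percent-encoded as a path segment."""
    return f"{CITOID_BASE}/{fmt}/{quote(doi, safe='')}"


def citation_fields(citoid_item: dict) -> dict:
    """Keep a few citation fields from one item of a 'mediawiki'-format response.

    The selection of fields is illustrative; missing keys come back as None.
    """
    return {
        "title": citoid_item.get("title"),
        "date": citoid_item.get("date"),
        "doi": citoid_item.get("DOI"),
        "url": citoid_item.get("url"),
    }


# "10.1234/example" is a placeholder DOI, not a real dataset.
print(citoid_url("10.1234/example"))
```

Fetching the URL (for instance with `urllib.request.urlopen`) and decoding the JSON body would yield a list of items that can be passed to `citation_fields`.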

The metadata populates a basic full-text search, plus a ‘filter by popular tag’ interface (tags are currently added manually). For a while now we’ve been thinking about ways to improve this search interface, which has led us to the question: what are the most useful questions we can ask of this index? And, as a corollary: what metadata do we need to answer these questions?

A class of question we’re really interested in asking goes along the lines of:

What datasets do I need to map from variable X to variable Y?

For example, as in the diagram below: how can I combine datasets in a way that lets me ask a question about researchers and get an answer about trademarks? How about asking about molecules and getting an answer about research institutions?

To do this, you need:

  1. an index of all the fields for each dataset in the index

  2. something mapping these fields to a list of standardised, named fields that represent the things we care about (trademarks, patent identifiers, DOIs, molecules, etc.)

  3. the list of the standardised, named fields that we care about, ideally with a list of aliases to allow this matching to be somewhat automated (check out the Google Sheet here + add your own!)

  4. steps 1-3, carried out on a whole load of relevant datasets, including ones that aren’t strictly ‘innovation-related’ (think NBER’s Public Use Data Archive, plus a lot of structured metadata)
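
The list above can be sketched in a few lines: an alias table maps each dataset's raw field names onto standardised fields, and a breadth-first search over shared fields then finds a chain of datasets linking variable X to variable Y. All dataset names, field names, and aliases below are invented for illustration, and the search strategy is one possible approach rather than the method IIIndex uses.

```python
from collections import deque

# Hypothetical standardised fields with their aliases (item 3).
ALIASES = {
    "doi": {"doi", "article_doi"},
    "patent_id": {"patent_id", "patent_number", "pat_no"},
    "researcher": {"researcher", "author_id", "orcid"},
    "trademark": {"trademark", "tm_serial"},
}

# Hypothetical raw field lists, one per indexed dataset (item 1).
DATASET_FIELDS = {
    "papers": ["author_id", "article_doi"],
    "paper_patent_links": ["DOI", "patent_number"],
    "patent_trademark_links": ["pat_no", "tm_serial"],
}


def standardise(raw_field):
    """Map a raw field name to a standardised field, or None if unknown (item 2)."""
    key = raw_field.strip().lower()
    for name, aliases in ALIASES.items():
        if key in aliases:
            return name
    return None


def find_path(source, target):
    """Return the datasets to join, in order, to get from source to target.

    Breadth-first search where two standardised fields are connected
    whenever some dataset contains both. Returns None if no chain exists.
    """
    covered = {
        ds: {standardise(f) for f in fields} - {None}
        for ds, fields in DATASET_FIELDS.items()
    }
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        field, path = queue.popleft()
        if field == target:
            return path
        for ds, fields in covered.items():
            if field in fields:
                for nxt in fields - seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [ds]))
    return None


print(find_path("researcher", "trademark"))
```

Because the search is breadth-first, the returned chain uses the fewest datasets possible; whether the joins are meaningful in practice still depends on the quality of the alias mapping.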

Work on items 1-3 is currently underway; progress on them is tracked using GitHub issues (1, 2, 3), and you can also read narrative updates in the iiindex: Status and Updates document. This implementation can be thought of as laying the metadata infrastructure for later developments (e.g. asking to what extent we can automate the scraping of all of the fields of a dataset hosted on an arbitrary platform).

Refactoring the IIIndex metadata pipeline

Some metadata can only be human generated (a nice description of why someone finds a dataset interesting), and some metadata should almost always be generated automatically, because of either volume or standardisation (citations, for example). A lot of metadata, however, benefits most from a joint approach (suggestions, related datasets, parent-child relationships, which tags to apply to something). At present, most of this last class is produced manually. It is the development of these fields into scalable, human+scraper collaborations that will characterise IIIndex 2.0.

The current refactoring of the metadata pipeline is shown below. It includes the calls to scrape the named fields described above, as well as retrieval of structured metadata from sources that don’t necessarily require a DOI, and additional metadata obtained through APIs like BigQuery and Zenodo, or protocols like OAI-PMH.

OAI-PMH and Dataset Suggestions

To expand the number of innovation datasets we index, we plan to use the OAI-PMH protocol, both to search existing metadata repositories for potentially relevant datasets, and to act as a ‘Metadata Service Provider’ for innovation data. Large OAI-PMH data repositories such as Dataverse and Zenodo support the use of ‘sets’: families of entries defined by search queries, whose metadata can be harvested periodically. It is these search queries that would allow us to define criteria for what a ‘potentially relevant’ dataset looks like.
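
To make set-based harvesting concrete, here is a minimal sketch that builds an OAI-PMH `ListRecords` request for a given set and extracts identifiers and titles from an `oai_dc` response. The verb, parameters, and XML namespaces come from the OAI-PMH 2.0 specification; the base URL, set name, and sample record are placeholders, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, set_spec, metadata_prefix="oai_dc"):
    """Build a ListRecords request restricted to one set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Extract the identifier and first title from each record in a response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
        })
    return records


# Placeholder response with a single made-up record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example patent dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://example.org/oai", "innovation"))
print(parse_records(SAMPLE))
```

A real harvester would also need to follow the `resumptionToken` element that repositories return when a result list is split across several responses.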

These queries would be run from a cloud server (i.e. separate from the current I3 index repository), based on a combination of statically-written queries and/or queries based on the ‘salient fields’ identified collectively in the I3 Index Google Sheet. Below is a sketch of the whole system, consisting of these three interacting elements (self-hosted OAI-PMH crawling server and service endpoint; GitHub repository with actions, which is also the basis of the iiindex site; collaboratively-edited Google Sheets).

The core element of this process is curation: turning the raw scraped list into a set of useful datasets, which only then get added to the index. This requires human input, as an OAI-PMH search may well return a long tail of datasets of only marginal interest. Once a dataset is flagged as ‘relevant’ (potentially with a note about why), that entry could be automatically copied into the main repository, along with relevant metadata.
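
The flag-then-copy step could look something like the sketch below. The `Candidate` record, its field names, and the in-memory `index` dictionary are all hypothetical stand-ins; the real system would presumably read flags from the shared sheet and write entries to the repository.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One harvested record awaiting human review (hypothetical shape)."""
    identifier: str
    title: str
    relevant: bool = False  # set by a human curator
    note: str = ""          # optional reason for flagging


def promote(candidates, index):
    """Copy candidates flagged as relevant into the index.

    Entries already in the index are left untouched. Returns the
    identifiers that were newly added.
    """
    added = []
    for c in candidates:
        if c.relevant and c.identifier not in index:
            index[c.identifier] = {"title": c.title, "curator_note": c.note}
            added.append(c.identifier)
    return added


index = {}
harvested = [
    Candidate("oai:example.org:1", "Example patent dataset", relevant=True, note="links patents to DOIs"),
    Candidate("oai:example.org:2", "Unrelated weather data"),
]
print(promote(harvested, index))
```

Keeping the human flag as the only gate means the scraper can be as broad as it likes without polluting the main index.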

Streamlining Manual Contribution

In addition to improved human-scraper collaboration, another goal of the next phase of development is to improve the kinds of metadata we are able to crowdsource, in particular dataset relationship information. At present this is collected informally in the form of collections (which narratively describe groups of related entities and make comparisons between them) and has been implemented in a limited way using the ‘record_superceded_by’ and ‘related_project_shortnames’ columns in the Google Sheet.
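
As a sketch of how the two existing columns could be turned into explicit relationship data, the function below reads sheet rows and emits (source, relation, target) triples. The column names `record_superceded_by` and `related_project_shortnames` are from the sheet as described above; the `shortname` key column, the comma-separated format of the related-projects cell, and the relation labels are assumptions.

```python
def parse_relationships(rows):
    """Turn sheet rows into (source, relation, target) triples.

    Assumes each row is a dict with a 'shortname' key, and that
    'related_project_shortnames' holds a comma-separated list.
    """
    edges = []
    for row in rows:
        source = row["shortname"]
        superseded_by = (row.get("record_superceded_by") or "").strip()
        if superseded_by:
            edges.append((source, "superseded_by", superseded_by))
        related = row.get("related_project_shortnames") or ""
        for name in related.split(","):
            if name.strip():
                edges.append((source, "related_to", name.strip()))
    return edges


# Made-up rows standing in for an export of the sheet.
rows = [
    {"shortname": "patents_v1", "record_superceded_by": "patents_v2",
     "related_project_shortnames": "trademarks, papers"},
    {"shortname": "patents_v2", "record_superceded_by": "",
     "related_project_shortnames": ""},
]
print(parse_relationships(rows))
```

Triples like these could feed either a markdown rendering or a graph view without changing how contributors enter the data.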

Making these fields more user-friendly is a core goal: this metadata is important, but it is currently fiddly for a casual user to contribute. This will involve moving the edit interface for relationship descriptions away from the Google Sheet, to be shared between markdown files and a graphical ‘relationships’ interface supported by the site. A sketch of the new interface is included below.
