IIIndex Roadmap

Published on Feb 28, 2022

At the end of last year, the first iteration of the iiindex was presented at I3’s 2021 Technical Working Group meeting. It currently functions as a living repository of pointers to innovation datasets, curated and used by the I3 community; an ongoing record of its current status and features is detailed here. In the coming months, our development plans have two overarching directions. The first is the development of the tool itself: expanding what we index, adding features to the platform, and broadening contributions. The second is the release of a general version of the tool for other communities interested in indexing datasets.

To contact us about anything mentioned in this release, or more generally, please email agnescam@mit.edu.

Adding Support for Multiple APIs

At present, when new datasets are submitted, the Wikimedia Citoid tool is used to scrape additional citation metadata to complement the entry. While this provides a good foundation of basic metadata, we are looking to supplement it with additional metadata from platforms such as Dataverse, BigQuery and Zenodo, and via the OAI-PMH protocol.
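
As one illustration of what gathering metadata over OAI-PMH could involve, the sketch below builds a GetRecord request URL and reads Dublin Core fields out of a response. It is not the index's implementation: the function names, the example.org endpoint and the sample record are all made up for illustration, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def get_record_url(base_url, identifier):
    """Build a GetRecord request URL asking for Dublin Core metadata."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": "oai_dc",
    })
    return f"{base_url}?{query}"


def parse_dublin_core(xml_text):
    """Collect every Dublin Core element in a response as {field: [values]}."""
    root = ET.fromstring(xml_text)
    fields = {}
    for dc in root.iterfind(".//oai_dc:dc", NS):
        for child in dc:
            # Tags look like "{namespace}title"; keep only the local name.
            name = child.tag.split("}", 1)[-1]
            text = (child.text or "").strip()
            if text:
                fields.setdefault(name, []).append(text)
    return fields


# Made-up response for illustration only; not a real record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example innovation dataset</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
          <dc:rights>CC-BY-4.0</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""

if __name__ == "__main__":
    print(get_record_url("https://example.org/oai", "oai:example.org:123"))
    print(parse_dublin_core(SAMPLE))
```

Repeated elements such as creators are kept as lists, since Dublin Core allows any field to appear more than once. Other platforms would need their own adapters, and the fields they expose may differ.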

Open Question: We are interested to hear which other platforms we should gather metadata from, and whether there are specific pieces of metadata that would be of interest.

Contributions to Collections

A key step in using datasets to answer research questions is knowing what values are indexed by particular datasets — e.g. whether a dataset includes patent identifiers, PubMed IDs, trademark information, etc. As a first step, we will start to index dataset schemas to make these fields directly searchable through the site.

Initially, this will be done quite simply: we plan to index every column header included in a table in the dataset as a single flat array (i.e. without indexing hierarchies), then filter and flag the incidence of a handful of key fields (e.g. patent IDs, DOIs, PMIDs, Lens IDs). It will be possible both to run a full-text search over the full array of fields and to filter for datasets that index on a specific kind of identifier.
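
A minimal sketch of this flattening and flagging step might look like the following. The function names, the regular expressions and the example tables are assumptions made for illustration; a real index would need a curated list of header spellings for each identifier.

```python
import re

# Hypothetical patterns for the key identifier fields (assumed spellings).
KEY_FIELD_PATTERNS = {
    "patent_id": re.compile(r"\bpatent (id|number|no)\b"),
    "doi": re.compile(r"\bdoi\b"),
    "pmid": re.compile(r"\b(pmid|pubmed id)\b"),
    "lens_id": re.compile(r"\blens id\b"),
}


def normalise(header):
    """Lower-case a header and turn runs of separators into single spaces."""
    return re.sub(r"[\s_\-]+", " ", header.strip().lower())


def index_schema(tables):
    """Flatten {table name: [column headers]} into one array and flag key fields."""
    fields = []
    for headers in tables.values():
        for header in headers:
            if header not in fields:  # keep first occurrence, preserve order
                fields.append(header)
    flags = {
        key: any(pattern.search(normalise(field)) for field in fields)
        for key, pattern in KEY_FIELD_PATTERNS.items()
    }
    return {"fields": fields, "flags": flags}


def search_fields(schema, term):
    """Case-insensitive substring search over the flattened field array."""
    needle = normalise(term)
    return [f for f in schema["fields"] if needle in normalise(f)]


if __name__ == "__main__":
    # Made-up tables for illustration only.
    schema = index_schema({
        "patents": ["patent_id", "title", "grant_date"],
        "citations": ["patent_id", "DOI", "PMID"],
    })
    print(schema["flags"])
    print(search_fields(schema, "date"))
```

Table names are discarded on purpose here, which matches the flat, non-hierarchical array described above; the flags are what a filter for a specific kind of identifier would read.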

Ideas for schema search are also still being worked out — we’re very much open to suggestions.

Open Question: What kind of fields would you like to be able to filter for?

Relationships Between Datasets

To show the impact and usage patterns of different datasets, another feature we are seeking to implement is a way to indicate an ‘inheritance’ relationship between two datasets, e.g. when one research dataset was constructed from one or more existing datasets in the repository. Recording these relationships would build up a citation graph of datasets. At present, we have a basic version of this implemented: a set of ‘related projects’ plus a field for describing, in plain text, the nature of those relationships.
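
To make the idea of a dataset citation graph concrete, here is a small sketch of how structured inheritance links could be traversed. The `derived_from` mapping, the function names and the dataset names A to E are hypothetical; the current index stores these relationships as related projects with plain-text descriptions rather than in this form.

```python
from collections import defaultdict


def build_children(derived_from):
    """Invert {dataset: [parent datasets]} into {parent: {child datasets}}."""
    children = defaultdict(set)
    for dataset, parents in derived_from.items():
        for parent in parents:
            children[parent].add(dataset)
    return children


def descendants(dataset, children):
    """Return every dataset built directly or indirectly from `dataset`."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:  # guards against revisiting and cycles
                seen.add(child)
                stack.append(child)
    return seen


if __name__ == "__main__":
    # Placeholder dataset names, for illustration only.
    derived_from = {"B": ["A"], "C": ["A", "B"], "D": ["C"], "E": []}
    children = build_children(derived_from)
    print(sorted(descendants("A", children)))
```

Counting the descendants of a dataset gives one rough measure of its downstream usage, which is the kind of impact signal the relationship feature is meant to surface.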

Open Question: We think this is one of the more important aspects of the index, but also one of the harder

Parallel Relational Database Release

As the amount of relational data stored by the index grows (e.g. recorded relationships, schemas and other one-to-many relations), it may become more useful to publish a fully relational version of the index alongside the flattened version that can be seen in the Google Sheet or the archived .csv files. We plan to do this using the Datasette tool, compiling a SQLite database at the same time as the site is compiled. This will open up a second way to query the index, and allow us to explore possibilities for incorporating more structured data.
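
The sketch below shows one way a flattened export could be compiled into a relational SQLite database using only the Python standard library. The column names (`name`, `fields`), the semicolon delimiter, the table layout and the sample rows are all assumptions for illustration, not the index's actual schema.

```python
import csv
import io
import sqlite3


def build_database(csv_text, path=":memory:"):
    """Split a flattened CSV into a datasets table and a one-to-many fields table."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE datasets (
            id INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE dataset_fields (
            dataset_id INTEGER NOT NULL REFERENCES datasets(id),
            field TEXT NOT NULL
        );
    """)
    for row in csv.DictReader(io.StringIO(csv_text)):
        cursor = conn.execute(
            "INSERT INTO datasets (name) VALUES (?)", (row["name"],)
        )
        dataset_id = cursor.lastrowid
        # The flattened cell packs several values into one string.
        fields = [f.strip() for f in row["fields"].split(";") if f.strip()]
        conn.executemany(
            "INSERT INTO dataset_fields (dataset_id, field) VALUES (?, ?)",
            [(dataset_id, field) for field in fields],
        )
    conn.commit()
    return conn


def datasets_with_field(conn, field):
    """Return the names of datasets that list `field`, in alphabetical order."""
    rows = conn.execute(
        "SELECT d.name FROM datasets d "
        "JOIN dataset_fields f ON f.dataset_id = d.id "
        "WHERE f.field = ? ORDER BY d.name",
        (field,),
    ).fetchall()
    return [name for (name,) in rows]


# Made-up rows for illustration only.
SAMPLE_CSV = "name,fields\nDataset A,patent_id;title\nDataset B,doi;pmid;patent_id\n"

if __name__ == "__main__":
    conn = build_database(SAMPLE_CSV)
    print(datasets_with_field(conn, "patent_id"))
```

If a file path is passed instead of `:memory:`, the resulting database file is the kind of artefact that Datasette can serve directly, so a build step like this could run alongside the site compilation.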

Open Question: Let us know if you have any use cases for a database of this format.

Development of General Tool

As the Index is built using free and widely available infrastructure (Google Sheets and GitHub Actions), we are also hoping to produce a more general version of this tool that can be used by other groups conducting similar community-indexing projects. We have collaborated with Ryan McGranaghan of the HelioPhysics Knowledge Network to prototype a generalised version, which can be viewed here.

Open Question: If you are involved in, or know of, another dataset indexing project, please get in touch.

