I3 Datathon 04/28

Contribute to the I3 dataset catalog

Published on Mar 03, 2021

Add questions to the collaborative notes for this session

To conclude our Spring 2021 Workshop Series, we are hosting a set of small-group workshops on tools and techniques for sharing and editing datasets. When: 12:00 EDT / 16:00 UTC, Wednesday, April 28.

This is an open workshop, suitable for students, colleagues, and collaborators outside of academia who work with and transform public datasets.

We will open with an introduction to how to contribute to the I3 data catalog, and a brief overview of each breakout group. After that, the session will split into breakout rooms, each led by a member of the I3 community. Then we will reconvene to share summaries with the wider group.

Session Overview

Main session + Introduction :
Collaborative Data Design :
Data Cleaning + Reconciliation :
Using BigQuery + Kaggle for data analysis :


  • Collaborative Data Design: using GitHub as a data-sharing platform (Cyril Verluise, slides)

    • Theoretical models have long been developed with continuous improvement in mind. By contrast, datasets are often shared (if they are shared at all) as a snapshot, which makes collective, continuous improvement difficult. In this session, I will argue that collaborative data design is both key to the future of empirical research and easy to implement. I will share practical insights on tools and workflows for implementing a collaborative data design standard. This session will be interactive and is open to anyone interested in developing and contributing to collaborative data design projects.

  • Data Cleaning and Reconciliation tools (Agnes Cameron + Sam Klein, slides)

    • A discussion about current tools for entity resolution and data cleaning, building a shared repository of scripts, and demonstrations of tools including OpenRefine.

  • Using BigQuery and Kaggle for data analysis and distribution (Ian Wetherbee + Jay Yonamine, slides)

    • Using BigQuery to perform large-scale analysis across multiple sets of patent data, and to distribute and collaborate on large data analyses.

  • Open Discussion (Adam Jaffe)

Collaborative notes

Please edit the notes below; an overview will be posted here next week.

Google doc — One section per session.
