
What comes after Microsoft Academic?

Potential futures of the open academic graph.

Published on May 27, 2021

This is a draft; feel free to edit this page + comment

Since 2016, when Microsoft expanded Microsoft Academic from a research project in data mining, entity linking, and visualization into public services and datasets, it has transformed access to bibliometric data. Microsoft compiled its distilled and cleaned data into an openly licensed knowledge graph (MAG), which has been a foundational source for many influential tools and datasets (search engines, citation indices, literature-mapping tools, metrics aggregators, and a broader but more sporadic Open Academic Graph) ever since.

Aaron Tay, for one, has written broadly about the impact of this work, and of its closure without a direct replacement. (The next generation of citation indexes; Why Open Alone is Not Enough; a thread on the closure of MAG)

OurResearch (creator of Unpaywall) has offered to continue producing annual datasets in the same format, maintaining many of the elements and features of the MAG data, through a new project named OpenAlex. But the community of reusers is also interested in federated and distributed approaches that maintain a commons of the knowledge we depend on while minimizing single points of failure in the pipelines that produce them.

So: what sort of commons is this, what do we want it to be, and which projects currently maintain pipes that could be part of a robust and distributed pipeline that matches and surpasses what has come before?

Overview of citation graphs and tools

We need an interlayer for scholarly citation graphs, covering:
A. Their scope and purpose (what counts as a scholarly document, or a citation);
B. What comparable graphs + datasets exist now for various contexts;
C. How these are updated, by which curators + what processes;
D. What are the upstream and downstream sources + derivatives; and
E. What do we want the above to become, in the fullness of time?

A. Scope: Focus and challenge

Focus: Compiling a global citation graph (or a subset relevant to a specific research context), in a format that is convenient for [re]calculating metrics and training models, making both citations and derivatives (including common bibliometrics) available as a public resource.

A shared scope across these projects includes articles published in [a broadly recognized subset of] academic journals, and their explicit citations [as recorded in Crossref, or in standard citation templates]. Some include a broader range of source documents (such as patents or books), and some identify and extract references that are more loosely formatted inline.[1]

Challenge: What do we want this to become? Here are six guidelines (0–5) for building an open-academic-graph commons that everyone can contribute to, with shared purpose and low duplication of effort, and a way for different participants + data centers to specialize in producing part of the whole.

Elements of the [citation] commons we want

 0. simple standards for being part of the commons :)
    open, forkable code + data, transparent processes
    commitment to register IDs, scripts, vocabularies, schemas, and processes w/ a shared registry (Wikidata or equivalent)

 1. a federated data pipeline —> what can others build to speed this up?
   source catalog + associated scripts
   a script library for processing/cleaning and disambiguation
   a federated event feed —> what exists, what more is needed?
   named processes for reproducing dataset outputs from the above
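As one illustration of the federated event feed above, a single record might look like the sketch below. All field names here are hypothetical, loosely modeled on the general shape of Crossref Event Data records; a real schema would need to be negotiated and registered across participating data centers.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class CitationEvent:
    """One entry in a hypothetical federated event feed.

    Field names are illustrative only; a shared registry would
    define the real vocabulary for relations and processes.
    """
    subject_id: str   # PID of the citing work (e.g. a DOI)
    object_id: str    # PID of the cited work
    relation: str     # e.g. "cites", "corrects", "is-version-of"
    source: str       # which pipeline/curator emitted this event
    process: str      # named process that produced it, for reproducibility
    observed_at: str  # ISO 8601 timestamp

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

event = CitationEvent(
    subject_id="doi:10.1234/example.2021.1",
    object_id="doi:10.5678/example.2019.7",
    relation="cites",
    source="example-aggregator",
    process="crossref-reference-extraction-v2",
    observed_at=datetime(2021, 5, 27, tzinfo=timezone.utc).isoformat(),
)
print(event.to_json())
```

Naming the producing `process` in each event is what makes dataset outputs reproducible from the feed, per the last element above.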

 2. a vocabulary of core entities, and PIDs others can build against for each 
   (not an internal PK for each project; most projects don't need to generate a new PID for most entities)

 3. a set of datasets released on a time series, w/ explicit + consistent
    (MAG used to provide one; whatever OR builds will be another; incremental updates are a bonus) 
— extendable by the public; writable/editable

 4. a set of services available online, for free / at cost / at burden
— public marketplaces for services, needs, streams of resources
— public reconciliation + disambiguation services
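To make "reconciliation + disambiguation services" concrete, here is a minimal sketch of two building blocks such a service might expose: DOI normalization and fuzzy title matching. The function names and the similarity threshold are illustrative, not drawn from any existing service.

```python
import re
from difflib import SequenceMatcher

def normalize_doi(raw: str) -> str:
    """Strip common URL prefixes and lowercase a DOI for comparison."""
    doi = raw.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)

def titles_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Crude title reconciliation: punctuation-insensitive similarity."""
    def norm(t: str) -> str:
        return re.sub(r"[^a-z0-9 ]", "", t.lower()).strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

print(normalize_doi("https://doi.org/10.1234/ABC.5"))  # 10.1234/abc.5
print(titles_match("What Comes After Microsoft Academic?",
                   "What comes after Microsoft Academic"))  # True
```

A real public service would layer author, venue, and year evidence on top of this, but even shared normalization rules would reduce duplicated cleaning work across pipelines.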

 5. internal documentation + interlayer description
    An overview: What is the future of the OAG? Extending 'outside' reflections like this, w/ contributions from everyone providing part of the above
    A maintenance + dependency checklist: what upstreams + downstreams does the OAG depend on? How can someone rebuild it from scratch, or support its maintainers?

B. What exists now (as a commons)

Concordance of citation graphs

Other lists + aggregators

  • Do other concordances exist?

  • Lists of resources: (github-awesome lists) (wp list of graphs)

Citation graphs themselves

Planned or private

Search engines

Citation intent

Metrics

Derivative projects refining the above

PIDs, metarecords, authority files

  • Articles and works: LensID (Xref / CORE ID / MAG ID), fatcat ID

  • Authors: MAG ID, SS ID, ORCID, LensID 2021

  • Institutions and organizations: OpenCorporates, GRID, future LensID?

C. How are these updated?

Most internal/commercial pipelines are opaque. Dimensions updates some things continuously, other things (GRID) twice a year. MAG used to update biweekly or monthly. Different patent offices update at their own rates; Lens is moving towards weekly updates in 2022.

D. What are related up + downstreams?

  • 170 dataset-papers drawing on MAG (as of 2020)

  • Reliance on Science

  • Most of the search engines are downstream of aggregators like MAG

  • Crossref is both up and downstream at points

  • Tools for researchers (including Publish or Perish) that now include full citation graphs are in some cases downstreams that could sync more explicitly


E. Where do we want to be? + related research

  • Microsoft Academic Graph changed the landscape of possibility for uses of citation graphs.

    • It was mostly-complete and mostly-free to reuse, at launch 7 years ago.

    • It was updated by a talented team at MS, which did extensive document-processing on a wide range of source formats.

    • It quickly became a staple of any aggregator of such data, and people started to rely on its identifiers, author-identification, and topic-mapping.

  • Readings:


Miscellaneous notes from public discussions

Topic maps —>
Citation existence —>
Disambiguating article + author IDs —>
Citation intent: SciCite (S2) and scite (scite.ai) datasets
Crossref : event stream extensions?

Draft specs:
Event feed —> what is needed?
Data pipeline —> what can others build to speed this up?
ID set —>
(OurResearch spec) —> soon :) mainly want people to actually be open!

Data sources:
Limiting what else is possible?
What about automatic sources [CCrawl]

Open requests:
How do people currently use the MAG API ?
What's missing so far?
(conf proceedings, non-DOIs, open list for requesters, ML classification)

IDs —> What new ones exist? what’s being maintained?
: MAG ID —> AlexID?

Attendees to invite
: IDs — GRID / ROR / SS / IA [new primary key] OAIR
: [SS / Meta / BN ? / Crossref / MAG / Lens]
—> clarify degree of open code + data
—> publisher agreements

API access categories
: read-only GETs (as per MAG?)
: savable queries
: write APIs (posting a suggestion, or new item)
: federation APIs (adding a feed to an event-feed network)
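These four access categories can be sketched as a tiny taxonomy. The endpoint paths below are invented purely for illustration and do not correspond to any existing API (MAG's or otherwise).

```python
from enum import Enum

class AccessCategory(Enum):
    """The four API access categories listed above (hypothetical mapping)."""
    READ = "read"            # read-only GETs, as with the MAG API
    SAVED_QUERY = "saved"    # store and re-run a named query
    WRITE = "write"          # post a suggestion or a new item
    FEDERATION = "federate"  # register a feed with an event-feed network

# Illustrative mapping from category to HTTP verb + endpoint shape;
# every path here is made up for the sketch.
ENDPOINTS = {
    AccessCategory.READ: ("GET", "/works/{id}"),
    AccessCategory.SAVED_QUERY: ("POST", "/queries"),
    AccessCategory.WRITE: ("POST", "/suggestions"),
    AccessCategory.FEDERATION: ("POST", "/feeds/register"),
}

for cat, (verb, path) in ENDPOINTS.items():
    print(f"{cat.value:>8}: {verb} {path}")
```

Separating these tiers matters for governance: read access can be open by default, while write and federation access imply the moderation and trust agreements raised under "publisher agreements" above.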

===
TODO: Patent feeds as well?
COAR/BASE compared to UPW?
WD compared to GRID?
OpenCorporates?
