Potential futures of the open academic graph.
Update: As of early 2022, OurResearch is running OpenAlex, which covers most of the use cases described below. Internet Archive Scholar has launched as a transparent upgrade to Google Scholar. Semantic Scholar continues to grow in coverage, and is exploring annotating citations.
Since 2016, when Microsoft expanded Microsoft Academic from a research project in data mining, entity linking, and visualization into public services and datasets, it has transformed access to bibliometric data. Microsoft compiled its distilled and cleaned data into an openly-licensed knowledge graph (MAG), which has been a foundational source for many influential tools and datasets (search engines, citation indices, literature mapping tools, metrics aggregators, and the broader but more sporadic Open Academic Graph) ever since.
Aaron Tay, for one, has written broadly about the impact of this work, and of its closure without a direct replacement. (The next generation of citation indexes ; Why Open Alone is Not Enough ; Thread on the closure of MAG)
OurResearch (creators of Unpaywall) has offered to continue producing annual datasets in the same format, and maintaining many of the elements and features of the MAG data, through a new project named OpenAlex. But the community of reusers is also interested in federated and distributed approaches that maintain a commons of the knowledge we depend on while minimizing single points of failure in the pipelines that produce them.
So: what sort of commons is this, what do we want it to be, and what projects are currently maintaining pipes that could be part of a robust and distributed pipeline that matches and surpasses what has come before?
We need an interlayer for scholarly citation graphs, covering:
A. Their scope and purpose (what counts as a scholarly document, or a citation);
B. What comparable graphs + datasets exist now for various contexts;
C. How these are updated, by which curators + what processes;
D. What are the upstream and downstream sources + derivatives; and
E. What do we want the above to become, in the fullness of time?
Focus: Compiling a global citation graph (or a subset relevant to a specific research context), in a format that is convenient for [re]calculating metrics and training models, making both citations and derivatives (including common bibliometrics) available as a public resource.
A shared scope across these projects includes articles published in [a broadly recognized subset of] academic journals, and their explicit citations [as recorded in Crossref, or in standard citation templates]. Some include a broader range of source documents (such as patents or books), and some identify and extract references that are more loosely formatted inline.
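As a concrete illustration of this shared scope, a citation graph's core is just works and directed edges between them. A minimal sketch (the record fields and example DOIs are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Work:
    """One node in a citation graph: an article plus its outgoing citations."""
    doi: str                     # canonical identifier, when one exists
    title: str
    source: str                  # journal or other recognized venue
    references: list = field(default_factory=list)  # DOIs of cited works

# Explicit citations (as recorded in Crossref or citation templates)
# become directed edges:
a = Work(doi="10.1000/a", title="A", source="J. Examples",
         references=["10.1000/b"])
b = Work(doi="10.1000/b", title="B", source="J. Examples")

edges = [(w.doi, ref) for w in (a, b) for ref in w.references]
```

Projects with a broader scope add more node types (patents, books) and more edge-extraction methods, but the edge list stays the common denominator for recalculating metrics.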
Challenge: What do we want this to become? Here are 5+ guidelines for building an open-academic-graph commons that everyone can contribute to, with shared purpose and low duplication of effort, and a way for different participants + data centers to specialize in producing part of the whole.
0. simple standards for being part of the commons :)
open, forkable code + data, transparent processes
commitment to register IDs, scripts, vocabularies, schemas, processes w/ a shared registry (WD or equivalent)
1. a federated data pipeline —> what can others build to speed this up?
a source catalog + associated scripts
a script library for processing/cleaning and disambiguation
a federated event feed —> what exists, what more is needed?
named processes for reproducing dataset outputs from the above
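One way to picture the federated event feed in the list above: each participant publishes a stream of (entity-ID, event) records, and any consumer merges several feeds, deduplicating on a shared key so no single publisher is a point of failure. A toy sketch, with hypothetical feed contents:

```python
from itertools import chain

def merge_feeds(*feeds):
    """Merge event feeds from several curators, keeping the first
    event seen for each (entity_id, event_type) pair."""
    seen = set()
    merged = []
    for event in chain(*feeds):
        key = (event["id"], event["type"])
        if key not in seen:
            seen.add(key)
            merged.append(event)
    return merged

feed_a = [{"id": "10.1000/a", "type": "new-citation", "target": "10.1000/b"}]
feed_b = [{"id": "10.1000/a", "type": "new-citation", "target": "10.1000/b"},
          {"id": "10.1000/c", "type": "retraction"}]

events = merge_feeds(feed_a, feed_b)  # 2 unique events, not 3
```

The deduplication key is what makes the shared registry of IDs (point 0 above) load-bearing: without agreed identifiers, feeds cannot be merged.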
2. a vocabulary of core entities, and PIDs others can build against for each
(not an internal PK for each project; most projects don't need to generate a new PID for most entities)
3. a set of datasets released on a time series, w/ explicit + consistent
(MAG used to provide one; whatever OR builds will be another; incremental updates are a bonus)
— extendable by the public; writable/editable
4. a set of services available online, for free / at cost / at burden
— public marketplaces for services, needs, streams of resources
— public reconciliation + disambiguation services
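A public reconciliation service, at its simplest, takes a messy string and returns the best-matching registered entity. A toy sketch using fuzzy string matching; the registry contents and threshold here are illustrative, not any real service's API:

```python
from difflib import SequenceMatcher

# Illustrative registry of canonical institution names -> PIDs
REGISTRY = {
    "Massachusetts Institute of Technology": "grid.116068.8",
    "University of Oxford": "grid.4991.5",
}

def reconcile(name, threshold=0.6):
    """Return (canonical_name, pid) for the closest registry entry,
    or None if nothing scores above the threshold."""
    best, best_score = None, 0.0
    for canonical, pid in REGISTRY.items():
        score = SequenceMatcher(None, name.lower(), canonical.lower()).ratio()
        if score > best_score:
            best, best_score = (canonical, pid), score
    return best if best_score >= threshold else None
```

A real service would layer metadata (location, aliases, parent orgs) on top of string similarity, but the contract is the same: string in, PID out, with a confidence score.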
5. internal documentation + interlayer description
An overview: What is the future of the OAG? Extending 'outside' reflections like this, w/ contributions from everyone providing part of the above
A maintenance + dependency checklist: what upstreams + downstreams does the OAG depend on? How can someone rebuild it from scratch; or support its maintainers?
Do other concordances exist?
Lists of resources: (github-awesome lists) (wp list of graphs)
List of academic databases: includes Internet Archive Scholar, fatcat
Microsoft Academic — still updating until 12/2021?
Open Academic Graph — what will aminer do starting next year?
Lens.org — used to feed refinements back into MAG (still this year?)
Web of Science
Planned or private
OpenAlex (pending, by 12/2021?)
Internal graphs @ metrics-providers
Depsy (deprecated): (citations for software)
Derivatives: citation-intent, paper-ID, author-ID
Not yet released: Unfold Research, RR, …
Articles and works: LensID (Xref / CORE ID / MAG ID), fatcat ID
Authors: MAG ID, SS ID, ORCID, LensID 2021
Institutions and organizations: OpenCorporates, GRID, future LensID?
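Structurally, a concordance across these ID systems is a crosswalk table: one row per entity, one column per identifier system, sparse where a system has no mapping. A minimal sketch (all IDs below are invented):

```python
# Hypothetical crosswalk rows: one entity, many identifier systems.
CROSSWALK = [
    {"entity": "work",   "mag": "2100000000", "lens": "000-111"},
    {"entity": "author", "mag": "2200000000", "orcid": "0000-0000-0000-0000"},
]

def translate(system_from, value, system_to):
    """Look up an ID in one system and return it in another, if mapped."""
    for row in CROSSWALK:
        if row.get(system_from) == value:
            return row.get(system_to)  # None when no mapping exists
    return None
```

The sparseness is the point: most projects only need to carry mappings for the systems they touch, as long as every row shares at least one common key.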
Most internal/commercial pipelines are opaque. Dimensions updates some things continuously, other things (GRID) twice a year. MAG used to update biweekly or monthly. Different patent offices update at their own rates; Lens is moving towards weekly updates in 2022.
170 dataset-papers drawing on MAG (as of 2020)
Reliance on Science
Most of the search engines are downstream of aggregators like MAG
Crossref is both up and downstream at points
Tools for researchers (including Publish or Perish) that now include full citation graphs are in some cases downstreams that could sync more explicitly
Some embeddings of OpenAlex in other work
Suave (Davis) draws on OpenAlex author data
Netvis (Davis) draws on same to find paper-collaborations
Microsoft Academic Graph changed the landscape of possibility for uses of citation graphs.
It was mostly-complete and mostly-free to reuse, at launch 7 years ago.
It was updated by a talented team at MS, which did extensive document-processing on a wide range of source formats.
It quickly became a staple of any aggregator of such data, and people came to rely on its identifiers, author-identification, and topic-mapping.
Topic maps —>
Citation existence —>
Disambiguating article + author ID —>
Citation intent : SciCite (S2) and Scite (scite.ai) datasets
Crossref : event stream extensions?
Event feed —> what is needed?
Data pipeline —> what can others build to speed this up?
ID set —>
(OurResearch spec) —> soon :) mainly want people to actually be open!
Limiting what else is possible?
What about automatic sources [CCrawl]
How do people currently use the MAG API ?
What's missing so far?
(conf proceedings, non-DOIs, open list for requesters, ML classification)
IDs —> What new ones exist? what’s being maintained?
: MAG ID —> AlexID?
Attendees to invite
: IDs — GRID / ROR / SS / IA [new primary key] OAIR
: [SS / Meta / BN ? / Crossref / MAG / Lens]
—> clarify degree of open code + data
—> publisher agreements
API access categories
: read-only GETs (as per MAG?)
: savable queries
: write APIs (posting a suggestion, or new item)
: federation APIs (adding a feed to an event-feed network)
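The four access categories above can be read as an ordered set of capabilities, each tier including the ones before it. A sketch of the idea; the route names and tier labels are hypothetical, not any existing API:

```python
# Map each hypothetical endpoint to the minimum capability it needs.
ROUTES = {
    "GET /works/{id}":  "read",        # read-only GETs
    "POST /queries":    "save-query",  # savable queries
    "POST /suggestions": "write",      # posting a suggestion or new item
    "POST /feeds":      "federate",    # adding a feed to the event-feed network
}

# Capabilities are ordered: each tier includes the ones before it.
TIERS = ["read", "save-query", "write", "federate"]

def allowed(user_tier, route):
    """True if the user's tier is at least the tier the route requires."""
    return TIERS.index(user_tier) >= TIERS.index(ROUTES[route])
```

Keeping read-only access at the bottom of the ladder matches the MAG precedent: anyone can consume, while writing and federating require progressively more trust.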
TODO: Patent feeds as well?
COAR/BASE compared to UPW?
WD compared to GRID?