Coverage and Quality of OpenAlex Funding Data

analysis
Authors and affiliations

Nikita Sorgatz, Humboldt-Universität zu Berlin
Alexander Schniedermann, Deutsches Zentrum für Hochschul- und Wissenschaftsforschung (DZHW)

Published: May 9, 2025

Introduction

The availability of high-quality funding information is essential for academics, institutions, funders and policy makers alike to get an accurate picture of the global funding landscape of science. Up to now, such data could only be acquired from commercial database vendors at significant cost. This is changing with the emergence of free and open bibliometric databases, OpenAlex (Priem, Piwowar, and Orr 2022) among them.

Compared to commercial incumbents, OpenAlex is a very dynamic database under constant development. This study is part of recent efforts by bibliometricians and information scientists to assess the quality and coverage of this rapidly changing database (Alperin et al. 2024; Haupka et al. 2024; Culbert et al. 2024). Our focus is funding data, which was added to OpenAlex in May 2023. We compare and contrast the open funding data available in OpenAlex (August 2024 snapshot) with Web of Science and Scopus snapshots from the second half of 2024. We restrict the analysis to publications from 2010 and following years. This analysis relies on the infrastructure of the German Kompetenznetzwerk Bibliometrie (Schmidt et al., n.d.).

In our analysis we observe the following:

  • OpenAlex lags behind commercial databases in terms of coverage. Fewer than half of the publications with funding information in one of the commercial databases hold funding information in OpenAlex (Figure 3). This can be observed on a global scale and also for Germany (Figure 7).
  • Only a few OpenAlex exclusive publications hold funding information (Figure 6).
  • Unlike commercial vendors, OpenAlex does not include funding acknowledgements as raw strings, making independent validation or improvement of their disambiguation process very difficult.
  • OpenAlex is more convenient to use than Crossref because its data is cleaned and disambiguated. However, some data is lost. In our set of five German research funders, between 3.3% and 8.0% of publications have missing funding data in OpenAlex, even though this information is deposited in Crossref (see Section 6.2.3).
  • All three databases have high levels of disagreement regarding the number of funders and grant numbers for a given publication. This is also the case for the two commercial databases, pointing to the low degree of normalisation of funding data in general.

We also explore several routes to improve OpenAlex coverage and find that better integration of Crossref, data from open access publications and research funders are promising avenues to boost the currently lacking coverage of funding information in OpenAlex.

Data and Methods

To compare the quantity and quality of funding data across databases, we first focus our analysis on core publications. Core publications are publications that are simultaneously present in all three databases (OpenAlex, Web of Science and Scopus), whereas margin publications are only covered by one or two databases. Because there is a substantial number of publications exclusive to OpenAlex (see Figure 1), we will also cover margin publications later.

We compare publications between databases by matching their digital object identifiers (DOIs). Relying on DOI matching means that we exclude all items that do not have a DOI. Depending on the study, this practice can be problematic, because unequal access to DOI registration results in the exclusion of publications from so-called developing countries (Turki et al. 2023; Okune and Chan 2023). Another complication is that one DOI can be assigned to multiple items. Within the core publication set, the relationship between DOIs and database item ids is unique in 99.75% of cases. This leaves us with 70387 DOIs for which there are multiple item ids in at least one of the databases. As shown in Table 1, most of these cases originate from the Scopus database. We decided to exclude these DOIs from our analysis.

scp_item_id wos_item_id oa_item_id DOIs perc
unique unique unique 28626999 99.75%
multiple unique unique 50684 0.18%
unique multiple unique 12133 0.04%
unique unique multiple 6575 0.02%
multiple multiple unique 938 0.00%
multiple unique multiple 48 0.00%
unique multiple multiple 7 0.00%
multiple multiple multiple 2 0.00%
Table 1: Unique Item ids of core publications across databases.
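
The uniqueness check behind Table 1 can be illustrated with a small Python sketch. It is not the SQL used for the analysis: the record layout (database, DOI, item id), the function names and the lower-casing of DOIs are assumptions made for this example, and the sample DOIs and ids are made up.

```python
from collections import defaultdict

def classify_dois(records):
    """records: iterable of (database, doi, item_id) tuples.
    Returns {doi: {database: "unique" | "multiple"}}."""
    ids = defaultdict(lambda: defaultdict(set))
    for database, doi, item_id in records:
        # Assumption: DOIs are compared case-insensitively.
        ids[doi.lower()][database].add(item_id)
    return {
        doi: {db: ("unique" if len(items) == 1 else "multiple")
              for db, items in per_db.items()}
        for doi, per_db in ids.items()
    }

def core_unique_dois(records, databases=("scp", "wos", "oa")):
    """Keep DOIs present in all databases with exactly one item id in each."""
    classified = classify_dois(records)
    return sorted(
        doi for doi, per_db in classified.items()
        if all(per_db.get(db) == "unique" for db in databases)
    )

# Made-up records for illustration only.
records = [
    ("scp", "10.1/a", "s1"), ("wos", "10.1/a", "w1"), ("oa", "10.1/A", "o1"),
    ("scp", "10.1/b", "s2"), ("scp", "10.1/b", "s3"),
    ("wos", "10.1/b", "w2"), ("oa", "10.1/b", "o2"),
    ("oa", "10.1/c", "o3"),
]
kept = core_unique_dois(records)
```

In this toy input only 10.1/a survives: 10.1/b has two Scopus item ids and 10.1/c is present in OpenAlex only.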

We compare the August 2024 snapshot of OpenAlex with the Web of Science snapshot from July 2024 and the Scopus snapshot from October 2024, so values for 2024 should be compared with care. To ease the computational burden and offer a meaningful time frame for the more recent database OpenAlex, we restrict the analysis to publications published in or after 2010.

Crossref is one of the two most important data sources for OpenAlex according to the OpenAlex FAQ, and it is available as open data (Hendricks et al. 2020), making it another obvious point of comparison when judging the quality of OpenAlex data. The Crossref snapshot used in this study was provided by our partners from UBG and covers publications from 2012 up to October 2023.

The databases used in this analysis sometimes list different publication years, which is probably caused by differing definitions (for example print versus online first publication date). We unify publication years with the following heuristic: all publications included in OpenAlex get the OpenAlex publication year, publications included in Scopus and Web of Science but not in OpenAlex are assigned the Scopus publication year, and publications found in only one database get the respective year.
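
The heuristic amounts to a fixed priority order over the three sources. A minimal Python sketch follows; the function and parameter names are chosen for this example and are not taken from the analysis code.

```python
def unified_publication_year(oa_year=None, scp_year=None, wos_year=None):
    """Pick one publication year per publication.

    Priority: OpenAlex, then Scopus, then Web of Science.
    A value of None means the publication is not in that database.
    """
    if oa_year is not None:
        return oa_year
    if scp_year is not None:
        return scp_year
    return wos_year

in_all_three = unified_publication_year(oa_year=2019, scp_year=2020, wos_year=2020)
commercial_only = unified_publication_year(scp_year=2020, wos_year=2021)
wos_only = unified_publication_year(wos_year=2021)
```

Here a publication found in all three databases gets the OpenAlex year (2019), one found only in the commercial databases gets the Scopus year (2020), and one found only in Web of Science keeps its own year (2021).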

Figure 1: Distribution of publications across databases.
Figure 2: Database publication overlap.

Coverage of funding data

Comparing the global coverage of funding data between databases, it is apparent that OpenAlex lags far behind the commercial databases. Only 40.15% of core publications with funding data in one of the databases have funding data in OpenAlex, 0.40% of which are unique to OpenAlex. The commercial databases have much better coverage, but each has its own blind spots, represented by the information included exclusively in the competitor's database. At present, analyses striving for completeness need to combine information from both commercial databases (Web of Science and Scopus), whereas the unique information included in OpenAlex seems negligible (see also Figure 6).

Figure 3: Publications with funding data across databases.
Figure 4: Overlap of publications with funding data between databases.

Looking at the coverage across publication years, we see that coverage for core publications in OpenAlex improved steadily between 2013 and 2021, reaching a plateau of around 55%. A similar trend is present in the number of publications with funding data available in all databases. We are not sure why coverage in OpenAlex drops in 2022, while commercial vendors were able to improve theirs.

Figure 5: Funding data coverage across publication years (core publications). Sum of publications with funding data in OpenAlex shown by black line.

OpenAlex exclusive funding data

One of the potential benefits of using OpenAlex is that it has less restrictive inclusion criteria than the established databases and therefore covers a much wider set of publications (see Figure 1). However, OpenAlex provides relatively little funding information for core publications, as we have seen in Figure 3. For the margin publications displayed in Figure 6, the share of publications with funding information is even lower, albeit slowly increasing over time.

Figure 6: Funding data in OpenAlex exclusive documents.

Coverage of German Publications

We consider all publications that have at least one German affiliation in one of the databases as a German publication.

Funding data coverage of German publications is similar to global coverage: only 38.60% of German core publications with funding data have funding data listed in OpenAlex, 2.29% of which are unique to OpenAlex.

Figure 7: German publications with funding data across databases.
Figure 8: Overlap of German publications with funding data between databases.

Looking at coverage over time, we observe improvement in OpenAlex funding coverage largely driven by an increase of funding data available in all three databases. However, as on the global level (see Figure 5) commercial databases still hold substantially more funding information than OpenAlex.

Figure 9: Funding data coverage of German publications across publication years (core publications). Sum of publications with funding data in OpenAlex shown by black line.

According to OpenAlex, the Deutsche Forschungsgemeinschaft (DFG, “German Research Foundation”) and the Bundesministerium für Bildung und Forschung (BMBF, “German Ministry of Education and Research”) fund the largest share of German publications. Due to international cooperation, a wide range of foreign funders is also mentioned on publications with German affiliations, as Table 2 shows.

Table 2: Funders of German publications in OpenAlex.

Quality of funding data

One major benefit of using OpenAlex funding data is that the data is disambiguated, making it much more convenient to use than Crossref. However, neither Crossref nor OpenAlex provides raw strings of funding acknowledgements. This makes it quite difficult to assess the quality of funding data disambiguation in OpenAlex or to improve it. Therefore, we are only able to do some rudimentary quality checks at full database scale.

The quality assurance process employed by OpenAlex removes a number of mistakes present in Crossref. For example, there are 351 Crossref publications that have “Error! Hyperlink reference not valid.” listed as the funder name; this entry is removed in OpenAlex. Another example is two-letter funder names, which are polluted by artefacts like country codes, the word “no” or simply NA. This noise is not present in OpenAlex.

The OpenAlex quality assurance process is not perfect yet, and some noise still makes it into the data. For example, 104490 grant numbers (0.71%) in OpenAlex contain the words “grant” or “number”. This share is considerably higher than in Web of Science (2826 polluted grant numbers, 0.01%) and in Scopus (39679 polluted grant numbers, 0.07%).
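
A check of this kind can be sketched in a few lines of Python. The exact matching rule used in the analysis is not stated, so the case-insensitive substring match below is an assumption, and the sample grant numbers are made up for illustration.

```python
import re

# Assumption: case-insensitive substring match on the two words named above.
POLLUTION = re.compile(r"grant|number", re.IGNORECASE)

def polluted_share(grant_numbers):
    """Return (count, share) of grant numbers containing 'grant' or 'number'."""
    values = list(grant_numbers)
    polluted = [g for g in values if POLLUTION.search(g)]
    share = len(polluted) / len(values) if values else 0.0
    return len(polluted), share

# Made-up grant number strings, not taken from any database.
sample = ["01GQ1001", "Grant No. 12345", "SFB 1234", "number 42"]
count, share = polluted_share(sample)
```

For the sample above, two of the four strings are flagged. A stricter rule (for example whole-word matching) would flag fewer entries, so the reported shares depend on this choice.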

Minor problems with internal consistency

We also noticed that the works_count field in the OpenAlex funders table does not always match the number obtained by counting unique work ids per funder id in the works table.

n p agreement
3190 9.83% ✗
29247 90.17% ✔️
Table 3: Works per funder, agreement of count methods.
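
The comparison behind Table 3 can be sketched as follows. The input shapes (a mapping of funder id to stated works_count, and rows of work id and funder id) are simplifications assumed for this example, and the ids are made up.

```python
from collections import defaultdict

def compare_works_counts(funders, works_funders):
    """funders: {funder_id: works_count as stated in the funders table}.
    works_funders: iterable of (work_id, funder_id) rows from the works table.
    Returns {funder_id: (stated_count, recounted, agrees)}."""
    works_per_funder = defaultdict(set)
    for work_id, funder_id in works_funders:
        works_per_funder[funder_id].add(work_id)  # the set drops duplicate rows
    result = {}
    for funder_id, stated in funders.items():
        recounted = len(works_per_funder.get(funder_id, set()))
        result[funder_id] = (stated, recounted, stated == recounted)
    return result

# Made-up ids for illustration only.
funders = {"F1": 2, "F2": 3}
works_funders = [("W1", "F1"), ("W2", "F1"), ("W2", "F1"), ("W3", "F2")]
report = compare_works_counts(funders, works_funders)
agreement_share = sum(agrees for _, _, agrees in report.values()) / len(report)
```

In the toy data the stated count matches the recount for F1 but not for F2, giving an agreement share of one half.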

Database agreement

Another way to get an idea of funding data quality in OpenAlex is to check whether its data corresponds to data recorded in other databases. For example, the paper “Prediction of post-surgical seizure outcome in left mesial temporal lobe epilepsy”, published in the journal NeuroImage: Clinical, is one of 1,296 papers listed as funded by the German Federal Ministry of Education and Research (BMBF) and the German Research Foundation (DFG) in Crossref, whereas OpenAlex only lists DFG as a funder. The paper itself mentions both BMBF and DFG.

Doing this kind of checking at scale is unfortunately not very straightforward as funding data is not standardized across databases and tends to be noisy. Many different spelling variants can exist for a single funder. Rather than trying to establish if the funders of each publication match across databases we decided just to compare the number of funders as well as the number of grant numbers per paper. We compare OpenAlex coverage of five German funders in Section 6.2.3 and find that between 3.3% and 8.0% of publications have missing funding data in OpenAlex, even though the information is deposited in Crossref.

n p agreement
2437103 37.60% no agreement
1272678 19.63% agreement grants
571063 8.81% agreement agencies
2201348 33.96% full agreement
Table 4: Agreement between OpenAlex and Web of Science (core publications)

n p agreement
2557641 41.46% no agreement
1379189 22.36% agreement grants
556692 9.02% agreement agencies
1674818 27.15% full agreement
Table 5: Agreement between OpenAlex and Scopus (core publications)

n p agreement
2435799 19.65% no agreement
4196086 33.85% agreement grants
935254 7.54% agreement agencies
4829226 38.96% full agreement
Table 6: Agreement between Web of Science and Scopus (core publications)

n p agreement
1178106 17.40% no agreement
440543 6.51% agreement grants
135983 2.01% agreement agencies
5017216 74.09% full agreement
Table 7: Agreement between OpenAlex and Crossref (core publications)
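
One way to read the four categories in Tables 4 to 7 is as a comparison of counts per publication. The sketch below assumes that “agreement agencies” means the two databases list the same number of funders only, and “agreement grants” the same number of grant numbers only; this is an interpretation of the labels, not a definition taken from the analysis code, and the sample values are made up.

```python
def count_agreement(funders_a, grants_a, funders_b, grants_b):
    """Compare the number of distinct funders and grant numbers
    that two databases record for the same publication."""
    same_funders = len(set(funders_a)) == len(set(funders_b))
    same_grants = len(set(grants_a)) == len(set(grants_b))
    if same_funders and same_grants:
        return "full agreement"
    if same_funders:
        return "agreement agencies"
    if same_grants:
        return "agreement grants"
    return "no agreement"

# Made-up example: database A lists one funder, database B lists two,
# and both record a single grant number.
label = count_agreement(["DFG"], ["G1"], ["DFG", "BMBF"], ["G1"])
```

Note that matching counts do not guarantee matching content: two databases can list the same number of funders and still name different ones.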

How could coverage of funding information be improved?

Having observed the lacking coverage of funding information, we now turn our interest towards global and German data sources to improve the data quality of OpenAlex. We need comparable open funding data for research and reporting before open bibliometric databases like OpenAlex will constitute a valid alternative to commercial databases.

Better integration of Crossref data

Looking at our set of core publications we can see that there is a significant number of publications that have funding information in Crossref but not in OpenAlex. Adding these 1 365 364 publications would increase funding data coverage in OpenAlex by 5.31%.

The gap between OpenAlex and Crossref, as well as the gap between OpenAlex and the commercial databases, is relatively constant across the last decade of publications, as Figure 10 shows. So one way to raise coverage is better integration of Crossref data into OpenAlex. It is important to keep in mind that we would expect a small gap to remain even with perfect integration of Crossref data, due to the data cleaning performed by OpenAlex (see Section 5).

Figure 10: Share of OpenAlex publications with funding data.
Figure 11: Share of German OpenAlex publications with funding data.

What kind of publications are missing funding data in OpenAlex?

In this section we look at the set of 10 765 209 publications that are missing funding data in OpenAlex and Crossref but have such data listed in one of the commercial databases. This gives us a better understanding of differences in funding coverage and, more interestingly, could point us to potential data sources or publication sets that could be used to improve funding data coverage and quality in OpenAlex.

Open Access Status

The obvious starting point is to look at the share of open access publications, which is 50.66%. So the number of OpenAlex publications that have funding data in commercial databases but not in OpenAlex or Crossref could be cut in half by using publicly available information.

The OpenAlex snapshot used in this study is affected by an issue that misclassified more than four million open access records as closed, so our figures here are probably an undercount. This issue has been addressed and fixed by OpenAlex for later database versions. For more information see Jahn, Haupka, and Hobert (2023).

Figure 12: Open access status of OpenAlex publications missing open funding information.

Publishers

Publishers are another potential source of funding information. Due to the high concentration of the publishing market, convincing even a small number of the larger publishers would add a considerable amount of open funding data, as we can see on the right side of Figure 13. For example, convincing Springer Nature to add funding information to Crossref would add funding data for 1 236 303 publications, 356 323 of which are open access.

Figure 13: Publications per publisher that lack funding data in OpenAlex/Crossref (only publishers with more than 1000 publications are shown).

The table below lists publishers whose publications have funding information in commercial databases but lack it in the open databases Crossref and OpenAlex, using the publisher names recorded in OpenAlex. Note that publisher names in OpenAlex are not disambiguated, so the publication counts of the top ten publishers are undercounted (Elsevier and MDPI, for example, appear under more than one name).

The fact that many publishers share none or only part of their funding data with Crossref has been shown by prior work on Crossref funding data (Mugabushaka, van Eck, and Waltman 2022, 17). Like Mugabushaka et al., we can only speculate about the reasons behind this lack of data sharing. It is notable that some of the younger fully open access publishers like MDPI, Frontiers and the Public Library of Science stand out by not sharing funding metadata openly, while building their business models on the open access transformation.

Table 8: Number of open access publications per publisher that lack funding data in OpenAlex/Crossref

Funders

Research funding bodies are not only users but also producers of funding data and have a vested interest in having the best possible funding data coverage for their funded research. Due to this double role they are uniquely positioned to contribute to open funding data.

The tables below show the degree of coverage across open and closed (commercial) databases for five large German research funders. To account for the fact that funders are not disambiguated in all databases, we crafted a search query containing different English and German spelling variants as well as abbreviations for each one.
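
Matching on spelling variants can be sketched with regular expressions. The variant lists below only contain names that appear in this article; the queries used for the analysis are not reproduced here and very likely contain more variants, so the dictionary, the function names and the sample strings are assumptions for illustration.

```python
import re

# Illustrative variant lists, limited to names mentioned in this article.
FUNDER_VARIANTS = {
    "DFG": [
        "Deutsche Forschungsgemeinschaft",
        "German Research Foundation",
        "DFG",
    ],
    "BMBF": [
        "Bundesministerium für Bildung und Forschung",
        "German Federal Ministry of Education and Research",
        "German Ministry of Education and Research",
        "BMBF",
    ],
}

def compile_funder_pattern(variants):
    """Build one case-insensitive pattern that matches any variant as whole words."""
    alternatives = "|".join(re.escape(v) for v in variants)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

PATTERNS = {key: compile_funder_pattern(v) for key, v in FUNDER_VARIANTS.items()}

def match_funders(funder_string):
    """Return the sorted keys of all funders whose variants occur in the string."""
    return sorted(key for key, pattern in PATTERNS.items()
                  if pattern.search(funder_string))

hits = match_funders("Funded by the Deutsche Forschungsgemeinschaft (DFG) and BMBF")
```

Whole-word matching keeps short abbreviations from matching inside longer tokens, but abbreviations can still be ambiguous across countries, so results of such a query usually need a manual check.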

closed DBs OpenAlex Crossref publications p
✔️ ✔️ ✔️ 148606 85.67%
✔️ ✔️ ✗ 387 0.22%
✔️ ✗ ✔️ 5614 3.24%
✔️ ✗ ✗ 16965 9.78%
✗ ✔️ ✔️ 1655 0.95%
✗ ✔️ ✗ 84 0.05%
✗ ✗ ✔️ 160 0.09%
Table 9: Funding data coverage across open and closed databases.

closed DBs OpenAlex Crossref publications p
✔️ ✔️ ✔️ 30432 60.02%
✔️ ✔️ ✗ 4830 9.53%
✔️ ✗ ✔️ 3981 7.85%
✔️ ✗ ✗ 10434 20.58%
✗ ✔️ ✔️ 686 1.35%
✗ ✔️ ✗ 251 0.50%
✗ ✗ ✔️ 86 0.17%
Table 10: Funding data coverage across open and closed databases.

closed DBs OpenAlex Crossref publications p
✔️ ✔️ ✔️ 9898 66.43%
✔️ ✔️ ✗ 66 0.44%
✔️ ✗ ✔️ 937 6.29%
✔️ ✗ ✗ 3790 25.44%
✗ ✔️ ✔️ 161 1.08%
✗ ✔️ ✗ 26 0.17%
✗ ✗ ✔️ 22 0.15%
Table 11: Funding data coverage across open and closed databases.

closed DBs OpenAlex Crossref publications p
✔️ ✔️ ✔️ 3706 77.60%
✔️ ✔️ ✗ 7 0.15%
✔️ ✗ ✔️ 172 3.60%
✔️ ✗ ✗ 807 16.90%
✗ ✔️ ✔️ 79 1.65%
✗ ✔️ ✗ 2 0.04%
✗ ✗ ✔️ 3 0.06%
Table 12: Funding data coverage across open and closed databases.

closed DBs OpenAlex Crossref publications p
✔️ ✔️ ✔️ 13947 69.55%
✔️ ✔️ ✗ 137 0.68%
✔️ ✗ ✔️ 705 3.52%
✔️ ✗ ✗ 4824 24.06%
✗ ✔️ ✔️ 403 2.01%
✗ ✔️ ✗ 27 0.13%
✗ ✗ ✔️ 11 0.05%
Table 13: Funding data coverage across open and closed databases.
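
The coverage patterns in Tables 9 to 13 can be tabulated from three sets of DOIs, one per source. The sketch below is an illustration rather than the analysis code; the function names are invented, and the Table 9 counts are entered under the assumption that the rows are ordered as combinations of (closed DBs, OpenAlex, Crossref) from all three present down to Crossref only.

```python
from collections import Counter

def coverage_patterns(closed, openalex, crossref):
    """Count publications per presence pattern (in_closed, in_openalex, in_crossref).
    Each argument is a set of DOIs with funding data for the funder in that source."""
    counts = Counter()
    for doi in closed | openalex | crossref:
        counts[(doi in closed, doi in openalex, doi in crossref)] += 1
    return counts

def share_crossref_not_openalex(counts):
    """Share of publications with funding data in Crossref but not in OpenAlex."""
    total = sum(counts.values())
    missing = sum(n for (_, in_oa, in_cr), n in counts.items() if in_cr and not in_oa)
    return missing / total

# Counts from Table 9, using the assumed row order described above.
table9 = Counter({
    (True, True, True): 148606,
    (True, True, False): 387,
    (True, False, True): 5614,
    (True, False, False): 16965,
    (False, True, True): 1655,
    (False, True, False): 84,
    (False, False, True): 160,
})
table9_share = round(100 * share_crossref_not_openalex(table9), 2)
```

Under this reading, table9_share is 3.33, which corresponds to the lower bound of the 3.3% to 8.0% range reported for publications that have funding data in Crossref but not in OpenAlex.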

We see that coverage varies considerably across funders: the share of publications with funding data available only in commercial databases ranges from less than a tenth for DFG-funded publications (9.78%, Table 9) to more than a quarter for DAAD-funded publications (25.44%, Table 11). These gaps could be closed easily if funders were to publish a list of funded projects, grant numbers and funded publications as machine-readable open data.

References

Alperin, Juan Pablo, Jason Portenoy, Kyle Demes, Vincent Larivière, and Stefanie Haustein. 2024. “An Analysis of the Suitability of OpenAlex for Bibliometric Analyses.” arXiv. https://doi.org/10.48550/arXiv.2404.17663.
Culbert, Jack, Anne Hobert, Najko Jahn, Nick Haupka, Marion Schmidt, Paul Donner, and Philipp Mayr. 2024. “Reference Coverage Analysis of OpenAlex Compared to Web of Science and Scopus.” arXiv. https://doi.org/10.48550/arXiv.2401.16359.
Haupka, Nick, Jack H. Culbert, Alexander Schniedermann, Najko Jahn, and Philipp Mayr. 2024. “Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, Pubmed and Semantic Scholar.” arXiv. https://doi.org/10.48550/arXiv.2406.15154.
Hendricks, Ginny, Dominika Tkaczyk, Jennifer Lin, and Patricia Feeney. 2020. “Crossref: The Sustainable Source of Community-Owned Scholarly Metadata.” Quantitative Science Studies 1 (1): 414–27. https://doi.org/10.1162/qss_a_00022.
Jahn, Najko, Nick Haupka, and Anne Hobert. 2023. “Scholarly Communication Analytics: Analysing and Reclassifying Open Access Information in OpenAlex.”
Mugabushaka, Alexis-Michel, Nees Jan van Eck, and Ludo Waltman. 2022. “Funding Covid-19 Research: Insights from an Exploratory Analysis Using Open Data Infrastructures.” arXiv. https://doi.org/10.48550/arXiv.2202.11639.
Okune, Angela, and Leslie Chan. 2023. “Digital Object Identifier: Privatising Knowledge Governance Through Infrastructuring.” In Routledge Handbook of Academic Knowledge Circulation. Routledge.
Priem, Jason, Heather Piwowar, and Richard Orr. 2022. “OpenAlex: A Fully-Open Index of Scholarly Works, Authors, Venues, Institutions, and Concepts.” arXiv. https://arxiv.org/abs/2205.01833.
Schmidt, Marion, Christine Rimmert, Dimity Stephen, Christopher Lenke, Paul Donner, Simone Gärtner, Niels Taubert, Thomas Bausenwein, and Stephan Stahlschmidt. n.d. “The Data Infrastructure of the German Kompetenznetzwerk Bibliometrie: An Enabling Intermediary Between Raw Data and Analysis.” https://doi.org/10.5281/ZENODO.13932927.
Turki, Houcemeddine, Grischa Fraumann, Mohamed Ali Hadj Taieb, and Mohamed Ben Aouicha. 2023. “Global Visibility of Publications Through Digital Object Identifiers.” Frontiers in Research Metrics and Analytics 8 (August). https://doi.org/10.3389/frma.2023.1207980.

Acknowledgments

The initial analysis and Quarto formatting were performed by Nikita Sorgatz during 2023 but remained unpublished. For this blog entry, the SQL queries and some analyses have been updated and restructured by Alexander Schniedermann, so that the document could be regenerated in early 2025.

Funding

This work is funded by the Bundesministerium für Bildung und Forschung (BMBF) project KBOPENBIB (16WIK2301E). We acknowledge the support of the German Competence Center for Bibliometrics.