scp_item_id | wos_item_id | oa_item_id | DOIs | perc |
---|---|---|---|---|
unique | unique | unique | 28626999 | 99.75% |
multiple | unique | unique | 50684 | 0.18% |
unique | multiple | unique | 12133 | 0.04% |
unique | unique | multiple | 6575 | 0.02% |
multiple | multiple | unique | 938 | 0.00% |
multiple | unique | multiple | 48 | 0.00% |
unique | multiple | multiple | 7 | 0.00% |
multiple | multiple | multiple | 2 | 0.00% |
Introduction
The availability of high-quality funding information is essential for academics, institutions, funders and policy makers alike to get an accurate picture of the global funding landscape of science. Up to now, such data could only be acquired from commercial database vendors at significant cost. This is changing with the emergence of free and open bibliometric databases, OpenAlex (Priem, Piwowar, and Orr 2022) among them.
Compared to commercial incumbents, OpenAlex is a very dynamic database that is under constant development. This study is part of recent efforts by bibliometricians and information scientists to assess the quality and coverage of this rapidly changing database (Alperin et al. 2024; Haupka et al. 2024; Culbert et al. 2024). Our focus is funding data, which was added to OpenAlex in May 2023. We compare and contrast the open funding data available in OpenAlex (August 2024 snapshot) with Web of Science and Scopus snapshots from the second half of 2024. We restrict the analysis to publications from 2010 and following years. This analysis relies on the infrastructure of the German Kompetenznetzwerk Bibliometrie (Schmidt et al., n.d.).
In our analysis we observe the following:
- OpenAlex lags behind commercial databases in terms of coverage. Less than half of publications with funding information in one of the commercial databases hold funding information in OpenAlex (Figure 3). This can be observed on a global scale and also for Germany (Figure 7).
- Only a few OpenAlex exclusive publications hold funding information (Figure 6).
- Unlike commercial vendors, OpenAlex does not include funding acknowledgements as raw strings, making independent validation or improvement of their disambiguation process very difficult.
- OpenAlex is more convenient than using Crossref, due to data cleaning and disambiguation. However, some data is lost. In our set of five German research funders, between 3.3% and 8.0% of publications have missing funding data in OpenAlex, even though this information is deposited in Crossref (see Section 6.2.3).
- All three databases have high levels of disagreement regarding the number of funders and grant numbers for a given publication. This is also the case for the two commercial databases, pointing to the low degree of normalisation of funding data in general.
We also explore several routes to improve OpenAlex coverage and find that better integration of Crossref, data from open access publications and research funders are promising avenues to boost the currently lacking coverage of funding information in OpenAlex.
Data and Methods
To compare the quantity and quality of funding data across databases, we first focus our analysis on core publications. Core publications are publications that are present in all three databases (OpenAlex, Web of Science and Scopus), whereas margin publications are only covered by one or two databases. Because a substantial number of publications are exclusive to OpenAlex (see Figure 1), we also cover margin publications later.
We compare publications between databases by matching their digital object identifiers (DOIs). Relying on DOI matching means that we exclude all items that do not have a DOI. Depending on the study, this practice can be problematic, because unequal access to DOI registration results in the exclusion of publications from so-called developing countries (Turki et al. 2023; Okune and Chan 2023). Another complication arises when one DOI is assigned to multiple items. Within the core publication set, the relationship between DOIs and database item IDs is unique in 99.75% of cases. This leaves us with 70387 DOIs for which there are multiple item IDs in at least one of the databases. As shown in Table 1, most of these cases originate from the Scopus database. We decided to exclude these DOIs from our analysis.
We compare the August 2024 snapshot of OpenAlex with the Web of Science snapshot from July 2024 and the Scopus snapshot from October 2024, so the values for 2024 should be compared with care. To ease the computational burden and offer a meaningful time frame for the more recent database OpenAlex, we restrict the analysis to publications published in or after 2010.
Crossref is one of the two most important data sources for OpenAlex, according to their FAQ, and is available as open data (Hendricks et al. 2020), making it another obvious point of comparison when judging the quality of OpenAlex data. The Crossref snapshot used in this study was provided by our partners from UBG and covers publications from 2012 up to October 2023.
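The DOI matching and the exclusion of ambiguous DOIs can be sketched as follows. This is a minimal illustration, not the SQL used for the study: the function names and the toy records are hypothetical, and lower-casing DOIs before matching is an assumption about normalisation.

```python
from collections import defaultdict


def ids_per_doi(records):
    """Map each DOI to the set of item IDs one database assigns to it."""
    mapping = defaultdict(set)
    for doi, item_id in records:
        # Assumption: DOIs are compared case-insensitively.
        mapping[doi.strip().lower()].add(item_id)
    return mapping


def core_unique_dois(scopus, wos, openalex):
    """DOIs present in all three databases with exactly one item ID in each."""
    maps = [ids_per_doi(db) for db in (scopus, wos, openalex)]
    core = set(maps[0]) & set(maps[1]) & set(maps[2])
    return {doi for doi in core if all(len(m[doi]) == 1 for m in maps)}


# Toy records (DOI, item ID), invented for illustration only.
scopus = [("10.1/a", "s1"), ("10.1/b", "s2"), ("10.1/b", "s3")]
wos = [("10.1/A", "w1"), ("10.1/b", "w2")]
openalex = [("10.1/a", "o1"), ("10.1/b", "o2"), ("10.1/c", "o3")]

kept = core_unique_dois(scopus, wos, openalex)
print(sorted(kept))
```

In the toy data, `10.1/b` is dropped because Scopus holds two items for it, and `10.1/c` is a margin publication because only OpenAlex covers it.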
The databases used in this analysis sometimes list different publication years, which is probably caused by differing definitions (for example print versus online first publication date). We unify publication years with the following heuristic: all publications included in OpenAlex get the OpenAlex publication year, publications included in Scopus and Web of Science are assigned the Scopus publication year and publications found only in one database get the respective year.
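The year heuristic amounts to a simple priority order. A minimal sketch, assuming a missing database entry is represented as `None` (the function name and signature are hypothetical):

```python
from typing import Optional


def unified_year(
    openalex: Optional[int], scopus: Optional[int], wos: Optional[int]
) -> Optional[int]:
    """Pick one publication year; None means the database lacks the item."""
    if openalex is not None:
        # Any publication included in OpenAlex gets the OpenAlex year.
        return openalex
    if scopus is not None:
        # Scopus plus Web of Science, or Scopus alone: use the Scopus year.
        return scopus
    # Only Web of Science is left (or nothing at all).
    return wos


print(unified_year(2020, 2019, 2019))
print(unified_year(None, 2019, 2020))
print(unified_year(None, None, 2021))
```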
Coverage of funding data
Comparing the global coverage of funding data between databases, it is apparent that OpenAlex is lagging far behind commercial databases. Only 40.15% of core publications with funding data in one of the databases have funding data in OpenAlex, 0.40% of which are unique to OpenAlex. The commercial databases have much better coverage, but each has its own blind spots, represented by the information exclusively included in the competitor’s database. At present, analyses striving for completeness need to combine information from both commercial databases (Web of Science and Scopus), whereas the unique information included in OpenAlex seems negligible (see also Figure 6).
Looking at the coverage across publication years, we see that coverage for core publications in OpenAlex improved steadily between 2013 and 2021, reaching a plateau of around 55%. A similar trend is present in the number of publications with funding data available in all databases. We are not sure why coverage in OpenAlex drops in 2022, while commercial vendors were able to improve theirs.
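The coverage shares can be expressed with plain set operations. The sketch below is illustrative only: it reads both percentages as shares of the union of publications with funding data in any database, which is an assumption about how the figures were computed, and the function name and toy identifiers are hypothetical.

```python
def funding_coverage(openalex: set, wos: set, scopus: set) -> dict:
    """Share of publications with funding data that OpenAlex covers."""
    # Denominator: publications with funding data in at least one database.
    union = openalex | wos | scopus
    # Publications whose funding data appears in OpenAlex only.
    only_openalex = openalex - wos - scopus
    return {
        "openalex_share": len(openalex) / len(union),
        "openalex_only_share": len(only_openalex) / len(union),
    }


# Toy identifiers, invented for illustration only.
oa = {"a", "b"}
wos = {"b", "c", "d"}
scp = {"c", "d", "e"}

result = funding_coverage(oa, wos, scp)
print(result)
```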
OpenAlex exclusive funding data
One of the potential benefits of using OpenAlex is that it has less restrictive inclusion criteria than the established databases and therefore covers a much wider set of publications (see Figure 1). However, OpenAlex provides relatively little funding information for core publications, as we have seen in Figure 3. For the margin publications displayed in Figure 6, the share of publications with funding information is even lower, albeit slowly increasing over time.
Coverage of German Publications
We consider all publications that have at least one German affiliation in one of the databases as a German publication.
Funding data coverage of German publications is similar to global coverage; only 38.60% of German core publications with funding data have funding data listed in OpenAlex, 2.29% of which are unique to OpenAlex.
Looking at coverage over time, we observe an improvement in OpenAlex funding coverage, largely driven by an increase in funding data available in all three databases. However, as on the global level (see Figure 5), commercial databases still hold substantially more funding information than OpenAlex.
According to OpenAlex, the Deutsche Forschungsgemeinschaft (DFG, “German Research Foundation”) and the Bundesministerium für Bildung und Forschung (BMBF, “German Ministry of Education and Research”) fund the largest share of German publications. Due to international cooperation, a wide range of foreign funders is also mentioned on publications with German affiliations, as Table 2 below shows.
Quality of funding data
One major benefit of using OpenAlex funding data is that the data is disambiguated, making it much more convenient to use than Crossref. However, neither Crossref nor OpenAlex provides raw strings of funding acknowledgements. This makes it quite difficult to assess the quality of funding data disambiguation in OpenAlex or to improve it. Therefore we are only able to do some rudimentary quality checks at full database scale.
The quality assurance process employed by OpenAlex removes a number of mistakes present in Crossref. For example, there are 351 Crossref publications that have “Error! Hyperlink reference not valid.” listed as the funder name, which is removed from OpenAlex. Another example is two-letter funder names, which are polluted by artefacts like country codes, the word “no” or simply NA. This noise is removed and not present in OpenAlex.
The OpenAlex quality assurance process is not perfect yet, and some noise still makes it into the data. For example, 104490 grant numbers (0.71%) in OpenAlex contain the words “grant” or “number”. This share is considerably higher than in Web of Science (2826 polluted numbers, 0.01%) and in Scopus (39679 polluted numbers, 0.07%).
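A check of this kind can be sketched with a regular expression. This is a hedged illustration rather than the query used for the figures above: matching case-insensitively is an assumption, the function name is hypothetical, and all sample values except the grant number from our own funding statement are invented.

```python
import re

# Assumption: "grant" and "number" are matched case-insensitively,
# anywhere in the string.
POLLUTION = re.compile(r"grant|number", re.IGNORECASE)


def polluted_share(grant_numbers):
    """Return the count and share of grant numbers containing noise words."""
    polluted = [g for g in grant_numbers if POLLUTION.search(g)]
    return len(polluted), len(polluted) / len(grant_numbers)


sample = ["16WIK2301E", "Grant No. 12345", "grant number 7", "SFB 1234"]
count, share = polluted_share(sample)
print(count, f"{share:.2%}")
```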
Minor problems with internal consistency
We also noticed that the works_count field in the OpenAlex funders table does not always match the count you get by counting unique work IDs per funder ID in the works table.
n | p | agreement |
---|---|---|
3190 | 9.83% | ❌ |
29247 | 90.17% | ✔️ |
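The consistency check behind this table can be sketched as follows. The data structures are simplified stand-ins for the OpenAlex tables, and the function name and toy identifiers are hypothetical.

```python
from collections import defaultdict


def check_works_count(funders, work_funder_pairs):
    """Compare the stored works_count with the number of distinct works.

    funders: {funder_id: works_count} as stored in the funders table.
    work_funder_pairs: iterable of (work_id, funder_id) from the works table.
    """
    works = defaultdict(set)
    for work_id, funder_id in work_funder_pairs:
        # Using a set counts each work only once per funder.
        works[funder_id].add(work_id)
    return {fid: len(works[fid]) == count for fid, count in funders.items()}


# Toy data, invented for illustration only.
funders = {"F1": 2, "F2": 3}
pairs = [("W1", "F1"), ("W2", "F1"), ("W2", "F1"), ("W3", "F2")]

agreement = check_works_count(funders, pairs)
print(agreement)
```

Here `F1` agrees (two distinct works) while `F2` does not (one work found, three reported).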
Database agreement
Another way to get an idea of funding data quality in OpenAlex is to check whether its data corresponds to data recorded in other databases. For example, the paper “Prediction of post-surgical seizure outcome in left mesial temporal lobe epilepsy”, published in the journal NeuroImage: Clinical, is one of 1,296 papers listed as funded by the German Federal Ministry of Education and Research (BMBF) and the German Research Foundation (DFG) in Crossref, whereas OpenAlex only lists DFG as a funder. The paper itself mentions BMBF and DFG.
Doing this kind of checking at scale is unfortunately not straightforward, as funding data is not standardized across databases and tends to be noisy. Many different spelling variants can exist for a single funder. Rather than trying to establish whether the funders of each publication match across databases, we decided to compare only the number of funders and the number of grant numbers per paper. We compare OpenAlex coverage of five German funders in Section 6.2.3 and find that between 3.3% and 8.0% of publications have missing funding data in OpenAlex, even though the information is deposited in Crossref.
n | p | agreement |
---|---|---|
2437103 | 37.60% | no agreement |
1272678 | 19.63% | agreement grants |
571063 | 8.81% | agreement agencies |
2201348 | 33.96% | full agreement |
n | p | agreement |
---|---|---|
2557641 | 41.46% | no agreement |
1379189 | 22.36% | agreement grants |
556692 | 9.02% | agreement agencies |
1674818 | 27.15% | full agreement |
n | p | agreement |
---|---|---|
2435799 | 19.65% | no agreement |
4196086 | 33.85% | agreement grants |
935254 | 7.54% | agreement agencies |
4829226 | 38.96% | full agreement |
n | p | agreement |
---|---|---|
1178106 | 17.40% | no agreement |
440543 | 6.51% | agreement grants |
135983 | 2.01% | agreement agencies |
5017216 | 74.09% | full agreement |
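The four agreement categories used in these tables can be sketched as a comparison of counts between two databases. This is a minimal illustration: reading “agreement agencies” as an equal number of funders and “agreement grants” as an equal number of grant numbers is our interpretation of the labels, and the function name and sample values are hypothetical.

```python
def agreement(funders_a, grants_a, funders_b, grants_b):
    """Classify one publication by comparing counts from two databases."""
    # Only the number of distinct entries is compared, not their content.
    same_funders = len(set(funders_a)) == len(set(funders_b))
    same_grants = len(set(grants_a)) == len(set(grants_b))
    if same_funders and same_grants:
        return "full agreement"
    if same_grants:
        return "agreement grants"
    if same_funders:
        return "agreement agencies"
    return "no agreement"


# Toy example: both databases list two funders, but grant counts differ.
label = agreement(["DFG", "BMBF"], ["A1"], ["DFG", "BMBF"], ["A1", "B2"])
print(label)
```

Comparing counts sidesteps the spelling variants of funder names, at the price of treating two different funders as agreeing whenever the counts happen to match.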
How could coverage of funding information be improved?
Having observed the lacking coverage of funding information, we now turn our interest towards global and German data sources to improve the data quality of OpenAlex. We need comparable open funding data for research and reporting before open bibliometric databases like OpenAlex will constitute a valid alternative to commercial databases.
Better integration of Crossref data
Looking at our set of core publications we can see that there is a significant number of publications that have funding information in Crossref but not in OpenAlex. Adding these 1 365 364 publications would increase funding data coverage in OpenAlex by 5.31%.
The gap between OpenAlex and Crossref, as well as the gap between OpenAlex and the commercial databases, is relatively constant across the last decade of publications, as Figure 10 shows. One way to raise coverage is therefore better integration of Crossref data into OpenAlex. It is important to keep in mind that we would expect a small gap to remain even with perfect integration of Crossref data, due to the data cleaning performed by OpenAlex (see Section 5).
What kind of publications are missing funding data in OpenAlex?
In this section we look at the set of 10 765 209 publications which are missing funding data in OpenAlex and Crossref but have such data listed in one of the commercial databases. This gives us a better understanding of differences in funding coverage and, more interestingly, could point us to potential data sources or publication sets that could be used to improve funding data coverage and quality in OpenAlex.
Open Access Status
The obvious starting point is to look at the share of open access publications, which is 50.66%. The number of publications which have funding data in commercial databases but not in OpenAlex or Crossref could therefore be cut in half by using publicly available information.
The OpenAlex snapshot used in this study is affected by an issue that misclassified more than four million open access records as closed, so our figures here are probably an undercount. This issue has been addressed and fixed by OpenAlex for later database versions. For more information see Jahn, Haupka, and Hobert (2023).
Publishers
Publishers are another potential source of funding information. Due to the high concentration of the publishing market, convincing only a small number of the larger publishers would add a considerable amount of open funding data, as we can see on the right side of Figure 13. For example, convincing Springer Nature to add funding information to Crossref would add funding data for 1 236 303 publications, 356 323 of which are open access.
Publishers with funding information in commercial databases but missing funding data in the open databases Crossref and OpenAlex are listed in the table below. Note that publisher names in OpenAlex are not disambiguated, so the publication counts of the top ten publishers are undercounted (try entering elsevier or mdpi into the search box).
The fact that many publishers share none or only part of their funding data with Crossref has been shown by prior work on Crossref funding data (Mugabushaka, van Eck, and Waltman 2022, 17). Like Mugabushaka et al., we can only speculate about the reasons behind this lack of data sharing. It is notable that some of the younger fully open access publishers like MDPI, Frontiers and the Public Library of Science stand out by not sharing funding metadata openly, while building their business models on the open access transformation.
Funders
Research funding bodies are not only users but also producers of funding data and have a vested interest in having the best possible funding data coverage for their funded research. Due to this double role they are uniquely positioned to contribute to open funding data.
The tables below show how coverage differs between the closed commercial databases and the open databases OpenAlex and Crossref for five large German research funders. To account for the fact that funders are not disambiguated in all databases, we crafted a search query containing different English and German spelling variants as well as abbreviations for each one.
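A variant-based funder match of this kind can be sketched with regular expressions. This is not the search query used for the tables: the variant lists are limited to the names mentioned in this post and are far from complete, and the dictionary and function names are hypothetical.

```python
import re

# Illustrative variant lists only; a real query would need many more
# spelling variants per funder.
FUNDER_VARIANTS = {
    "DFG": [
        r"Deutsche Forschungsgemeinschaft",
        r"German Research Foundation",
        r"\bDFG\b",
    ],
    "BMBF": [
        r"Bundesministerium für Bildung und Forschung",
        r"German (Federal )?Ministry of Education and Research",
        r"\bBMBF\b",
    ],
}

# One case-insensitive pattern per funder, combining all variants.
PATTERNS = {
    key: re.compile("|".join(variants), re.IGNORECASE)
    for key, variants in FUNDER_VARIANTS.items()
}


def match_funders(name):
    """Return the funder keys whose variants occur in a raw funder string."""
    return sorted(key for key, pattern in PATTERNS.items() if pattern.search(name))


print(match_funders("German Research Foundation (DFG)"))
print(match_funders("BMBF and Deutsche Forschungsgemeinschaft"))
```

The word boundaries around the abbreviations keep short strings such as DFG from matching inside longer tokens.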
closed DBs | OpenAlex | Crossref | publications | p |
---|---|---|---|---|
✔️ | ✔️ | ✔️ | 148606 | 85.67% |
✔️ | ✔️ | ❌ | 387 | 0.22% |
✔️ | ❌ | ✔️ | 5614 | 3.24% |
✔️ | ❌ | ❌ | 16965 | 9.78% |
❌ | ✔️ | ✔️ | 1655 | 0.95% |
❌ | ✔️ | ❌ | 84 | 0.05% |
❌ | ❌ | ✔️ | 160 | 0.09% |
closed DBs | OpenAlex | Crossref | publications | p |
---|---|---|---|---|
✔️ | ✔️ | ✔️ | 30432 | 60.02% |
✔️ | ✔️ | ❌ | 4830 | 9.53% |
✔️ | ❌ | ✔️ | 3981 | 7.85% |
✔️ | ❌ | ❌ | 10434 | 20.58% |
❌ | ✔️ | ✔️ | 686 | 1.35% |
❌ | ✔️ | ❌ | 251 | 0.50% |
❌ | ❌ | ✔️ | 86 | 0.17% |
closed DBs | OpenAlex | Crossref | publications | p |
---|---|---|---|---|
✔️ | ✔️ | ✔️ | 9898 | 66.43% |
✔️ | ✔️ | ❌ | 66 | 0.44% |
✔️ | ❌ | ✔️ | 937 | 6.29% |
✔️ | ❌ | ❌ | 3790 | 25.44% |
❌ | ✔️ | ✔️ | 161 | 1.08% |
❌ | ✔️ | ❌ | 26 | 0.17% |
❌ | ❌ | ✔️ | 22 | 0.15% |
closed DBs | OpenAlex | Crossref | publications | p |
---|---|---|---|---|
✔️ | ✔️ | ✔️ | 3706 | 77.60% |
✔️ | ✔️ | ❌ | 7 | 0.15% |
✔️ | ❌ | ✔️ | 172 | 3.60% |
✔️ | ❌ | ❌ | 807 | 16.90% |
❌ | ✔️ | ✔️ | 79 | 1.65% |
❌ | ✔️ | ❌ | 2 | 0.04% |
❌ | ❌ | ✔️ | 3 | 0.06% |
closed DBs | OpenAlex | Crossref | publications | p |
---|---|---|---|---|
✔️ | ✔️ | ✔️ | 13947 | 69.55% |
✔️ | ✔️ | ❌ | 137 | 0.68% |
✔️ | ❌ | ✔️ | 705 | 3.52% |
✔️ | ❌ | ❌ | 4824 | 24.06% |
❌ | ✔️ | ✔️ | 403 | 2.01% |
❌ | ✔️ | ❌ | 27 | 0.13% |
❌ | ❌ | ✔️ | 11 | 0.05% |
We see that coverage varies considerably across funders, ranging from less than a tenth of DFG-funded publications with data only available in commercial databases to more than a quarter of DAAD-funded publications. These gaps could be closed easily if funders were to publish a list of funded projects, grant numbers and funded publications as machine-readable open data.
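The shares discussed here can be recomputed directly from the table rows. The sketch below uses the first funder table, which we take to be the DFG table because its commercial-only share is just under a tenth; the variable names are hypothetical.

```python
# Rows of the first funder table:
# (closed DBs, OpenAlex, Crossref) -> number of publications
rows = {
    (True, True, True): 148606,
    (True, True, False): 387,
    (True, False, True): 5614,
    (True, False, False): 16965,
    (False, True, True): 1655,
    (False, True, False): 84,
    (False, False, True): 160,
}
total = sum(rows.values())

# Funding data deposited in Crossref but missing from OpenAlex.
crossref_not_openalex = sum(
    n for (closed, openalex, crossref), n in rows.items()
    if crossref and not openalex
)
share_missing = crossref_not_openalex / total

# Funding data available only in the commercial databases.
share_closed_only = rows[(True, False, False)] / total

print(f"{share_missing:.2%}")      # 3.33%
print(f"{share_closed_only:.2%}")  # 9.78%
```

The first value corresponds to the lower bound of the share of publications with missing OpenAlex funding data despite a Crossref deposit; the second is the gap that only funder or publisher data could close.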
References
Acknowledgments
The initial analysis and Quarto formatting were performed by Nikita Sorgatz during 2023 but remained unpublished. For this blog entry, the SQL queries and some analyses have been updated and restructured by Alexander Schniedermann, so that the document could be re-generated in early 2025.
Funding
This work is funded by the Bundesministerium für Bildung und Forschung (BMBF) project KBOPENBIB (16WIK2301E). We acknowledge the support of the German Competence Center for Bibliometrics.