Identifying journal article types in OpenAlex using OPENBIB data

analysis
Using open bibliometric data from PubMed and Crossref, I constructed a document type classifier with a focus on improving the identification of research items in bibliometric databases such as OpenAlex. In this blog post, I will evaluate the classifier by comparing the results with OpenAlex and Scopus.
Author: Nick Haupka

Affiliation: State and University Library Göttingen

Published: July 17, 2025

In October 2024, I published a brief blog post about a document type classifier that helps to distinguish between research and non-research journal publications in bibliometric databases such as OpenAlex. This was achieved by using open bibliometric metadata from PubMed and Crossref. The classifier was developed as part of the OPENBIB project, maintained by the Kompetenznetzwerk Bibliometrie (KB). In this blog post, I want to update my results and compare them with OpenAlex and Scopus. Instead of the OpenAlex September 2024 snapshot that I used in my last blog post, I will use the classifier results from the OPENBIB data release.

Problem statement

Bibliometric databases use distinct approaches to classify journal items. For example, the same book review can be classified as an article in one database and as a book review in another. However, a precise distinction between different types of works is important in many cases, for example in decision-making, university rankings or other bibliometric analyses. A correct assignment is also sometimes necessary when you want to compare different data sources with each other. As one of my analyses has shown, the assignment and the typology of document types in bibliometric databases can vary (Haupka et al. 2024). That analysis showed that OpenAlex categorised over 99% of journal items as articles, while databases such as Scopus, Web of Science and PubMed counted only 87% to 89% of journal items as articles.

To improve this situation, I have developed a classifier that is intended to make it easier to distinguish between research and non-research articles. Although the classifier was initially developed for OpenAlex, it is conceivable that it can also be used for other databases. The results of the classifier were published as part of the OPENBIB project of the Kompetenznetzwerk Bibliometrie. These can be downloaded free of charge from Zenodo via the OPENBIB data release. The source code of the classifier is freely available on GitHub.

Data and Methods

The OPENBIB data release includes all articles and reviews from journals in OpenAlex (Snapshot August 2024) that were published between 2014 and 2024. For each of these items, a score is determined that describes the probability that a work is a research contribution. If this score is over 0.5, the item is classified as a research contribution. For more information about the classifier, you can visit my previous blog post, where I described the approach in more detail.
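The thresholding rule described above can be illustrated with a minimal Python sketch. The class `ScoredItem`, its fields `doi` and `score`, and the example DOIs are hypothetical names chosen for illustration; they are not taken from the OPENBIB data release schema. Only the cut-off of 0.5 comes from the text.

```python
from dataclasses import dataclass

# Threshold stated in the post: a score over 0.5 counts as research.
RESEARCH_THRESHOLD = 0.5


@dataclass
class ScoredItem:
    doi: str      # hypothetical identifier field
    score: float  # hypothetical name for the classifier probability


def is_research(item: ScoredItem, threshold: float = RESEARCH_THRESHOLD) -> bool:
    """Return True if the score is strictly above the threshold."""
    return item.score > threshold


# Made-up example items, not real classifier output.
items = [
    ScoredItem("10.1234/example.1", 0.93),
    ScoredItem("10.1234/example.2", 0.12),
    ScoredItem("10.1234/example.3", 0.50),
]
labels = {item.doi: is_research(item) for item in items}
```

Note that in this sketch a score of exactly 0.5 is treated as non-research, following the wording "over 0.5".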

This blog post aims to compare the results of the OpenAlex metadata-based classifier with the document type identification of OpenAlex (Snapshot 08/24) and Scopus (Snapshot 10/24). Data were queried in a custom SQL database hosted by the Kompetenznetzwerk Bibliometrie. In this analysis, I only considered journal items that appear in both OpenAlex and Scopus and can be matched via DOI. An item is considered a journal publication if its primary source type in OpenAlex is ‘journal’. Additionally, I restricted the analysis to the publication years 2014 to 2023.
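The selection criteria above (DOI match, source type ‘journal’, publication years 2014 to 2023) could be sketched in Python as follows. This is only an illustration under assumptions: the actual analysis was run as SQL queries on the KB database, and the dictionary keys `doi`, `source_type` and `publication_year` as well as the DOI normalisation step are hypothetical, not the real table schema.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common prefixes; return None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def overlap_by_doi(openalex_rows, scopus_rows, first_year=2014, last_year=2023):
    """Keep OpenAlex journal items in the year range whose DOI also occurs in Scopus."""
    scopus_dois = {normalize_doi(row["doi"]) for row in scopus_rows} - {None}
    return [
        row
        for row in openalex_rows
        if row["source_type"] == "journal"
        and first_year <= row["publication_year"] <= last_year
        and normalize_doi(row["doi"]) in scopus_dois
    ]


# Made-up rows for illustration only.
openalex_rows = [
    {"doi": "https://doi.org/10.1234/A", "source_type": "journal", "publication_year": 2018},
    {"doi": "https://doi.org/10.1234/b", "source_type": "repository", "publication_year": 2018},
    {"doi": "https://doi.org/10.1234/c", "source_type": "journal", "publication_year": 2024},
    {"doi": None, "source_type": "journal", "publication_year": 2019},
]
scopus_rows = [{"doi": "10.1234/a"}, {"doi": "10.1234/b"}, {"doi": "10.1234/c"}, {"doi": None}]

matched = overlap_by_doi(openalex_rows, scopus_rows)
```

Items without a DOI are dropped on both sides, so they can never match each other by accident.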

Results

For the period 2014 to 2023, the classifier categorised 4,517,879 out of 42,249,505 articles and reviews from journals in OpenAlex as non-research items. This corresponds to a share of 10.7%. When restricting the analysis to publications that are included in both OpenAlex and Scopus, 681,489 (2.8%) articles and reviews from journals between 2014 and 2023 in OpenAlex are recognised as non-research items by the classifier. In Scopus, 22,439,280 articles and reviews are counted for the publication years 2014 to 2023. Of these 22 million articles and reviews, 1.5% (332,099 publications) are considered non-research by the classifier.

A comparison of the share of research items in journals between OpenAlex, Scopus and the classifier can be found in Figure 1. Here, the grey line indicates all items that are categorised as articles and reviews in journals in OpenAlex. The purple line indicates all articles and reviews from journals in Scopus. The yellow line displays the share of items that are considered research items by the classifier. Finally, the green line displays all articles and reviews in journals in OpenAlex for which at least one reference and one citation were found.

On average, OpenAlex classified more than 95.8% of items in journals as research between 2014 and 2023. Scopus counted fewer items as articles and reviews than OpenAlex (on average about 88.3%). The classifier lies between OpenAlex and Scopus and counted 93.1% of items in journals as research. The green line, which also serves as a baseline, counted on average 79.2% of items as research. For this baseline, the proportion of publications categorised as research decreases over the years, probably because more recent publications have received fewer citations so far.
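The shares reported above follow directly from the counts. As a quick check, the arithmetic can be reproduced with a few lines of Python; the helper name `share_percent` is introduced here for illustration only, while the counts are the ones given in the text.

```python
def share_percent(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Non-research items among all OpenAlex journal articles and reviews, 2014 to 2023.
non_research_openalex = share_percent(4_517_879, 42_249_505)

# Non-research items among the Scopus articles and reviews, 2014 to 2023.
non_research_scopus = share_percent(332_099, 22_439_280)

print(non_research_openalex, non_research_scopus)
```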

Figure 1: Comparison of the classification of articles and reviews in journals using OpenAlex, Scopus and OPENBIB.

Figure 2 compares the classification of research items in OpenAlex, Scopus and the introduced classifier using topics from OpenAlex. Here it can be seen that Scopus counts fewer articles and reviews than OpenAlex, especially in the physical and health sciences. In total, Scopus counted about 8.7% (901,533) fewer items as articles and reviews than OpenAlex in the physical sciences. In the health sciences, around 7% (441,423) fewer articles were counted as research in Scopus compared to OpenAlex. When using the classifier, 3.8% (238,124) of articles and reviews in the health sciences are considered non-research items. In the physical sciences, about 1.8% (183,941) of articles and reviews in OpenAlex are recognised as non-research items by the classifier.

Figure 2: Comparison of the classification of articles and reviews in journals using topics from OpenAlex.

Discussion and Data Access

This blog post compares the results of my classifier for identifying original research articles with the results from OpenAlex and Scopus. The results indicate an improvement in the document type classification of OpenAlex when using the classifier. However, manual validation is necessary to measure the actual quality of the assignments. In the meantime, the classifier can be used to flag possible cases in which the document types review or article have not been correctly assigned, for example because an item labelled as an article is actually a book review or an abstract.

It is important to note that the quality of the classifier also depends on the quality of the metadata in a database. For example, the classifier makes use of the number of references and citations of a journal item (see my previous blog post). If this information is incomplete, the classifier may draw incorrect conclusions.

If you want to test the classifier, you can download the results from Zenodo. Alternatively, the data can be accessed through the data infrastructure of the Kompetenznetzwerk Bibliometrie and in the Open Scholarly Data Warehouse of the SUB Göttingen. More information about data access can be found in the data documentation.

References

Haupka, Nick, Jack H. Culbert, Alexander Schniedermann, Najko Jahn, and Philipp Mayr. 2024. “Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, PubMed and Semantic Scholar.” https://arxiv.org/abs/2406.15154.