Introduction
bibliometrix package provides a set of tools for
quantitative research in bibliometrics and scientometrics.
Bibliometrics turns the main tool of science, quantitative analysis,
on itself. Essentially, bibliometrics is the application of quantitative
analysis and statistics to publications such as journal articles and
their accompanying citation counts. Quantitative evaluation of
publication and citation data is now used in almost all scientific
fields to evaluate growth, maturity, leading authors, conceptual and
intellectual maps, trends of a scientific community.
Bibliometrics is also used in research performance evaluation,
especially in university and government labs, and also by policymakers,
research directors and administrators, information specialists and
librarians, and scholars themselves.
bibliometrix supports scholars in three key phases
of analysis:
Data importing and conversion to R format;
Bibliometric analysis of a publication dataset;
Building matrices for co-citation, coupling, collaboration, and
co-word analysis. Matrices are the input data for performing network
analysis, multiple correspondence analysis, and any other data reduction
techniques.
Bibliographic databases
bibliometrix works with data extracted from the four
main bibliographic databases: SCOPUS, Clarivate Analytics Web of
Science, Cochrane Database of Systematic Reviews (CDSR) and RISmed
PubMed/MedLine.
SCOPUS (https://www.scopus.com), founded in 2004, offers a great
deal of flexibility for the bibliometric user. It permits to query for
different fields, such as titles, abstracts, keywords, references and so
on. SCOPUS allows for relatively easy downloading data-queries, although
there are some limits on very large results sets with over 2,000
items.
Clarivate Analytics Web of Science (WoS) (https://www.webofknowledge.com), owned by Clarivate
Analytics, was founded by Eugene Garfield, one of the pioneers of
bibliometrics.
This platform includes many different collections.
Cochrane Database of Systematic Reviews (https://www.cochranelibrary.com/cdsr/about-cdsr) is the
leading resource for systematic reviews in health care. The CDSR
includes Cochrane Reviews (the systematic reviews) and protocols for
Cochrane Reviews as well as editorials. The CDSR also has occasional
supplements. The CDSR is updated regularly as Cochrane Reviews are
published “when ready” and form monthly issues; see publication
schedule.
PubMed comprises more than 28 million citations for
biomedical literature from MEDLINE, life science journals, and online
books. Citations may include links to full-text content from PubMed
Central and publisher websites.
Data acquisition
Bibliographic data may be obtained by querying the SCOPUS or
Clarivate Analytics Web of Science (WoS) database by diverse fields,
such as topic, author, journal, timespan, and so on.
In this example, we show how to download data, querying a term in the
manuscript title field.
We choose the generic term “bibliometrics”.
Querying from Clarivate Analytics WoS
At the link https://www.webofknowledge.com, select Web of Science
Core Collection database.
Write the keyword “bibliometrics” in the search field and select
title from the drop-down menu (see figure 1).
Choose SCI-EXPANDED and SSCI citation indexes.
The search yielded 291 results on May 09, 2016.
Results can be refined using options on the left side of the page
(the type of manuscript, source, subject category, etc.).
After refining the query, you can add records to your Marked List by
clicking the button “add to marked list” at the end of the page and
selecting the records to save (see figure 2).
The Marked List page provides you with a list of publications
selected and various means of exporting data.
To export the data you desire, choose the export tool and follow the
three intuitive steps (see figure 3).
The export tool allows you to select the diverse fields to save. So,
select the fields you are interested in (for example all the available
data about marked records).
To download an export file, in an appropriate format for the
bibliometrix package, make sure to select the option
“Save to Other File Formats” and choose Bibtex or Plain
Text.
The WoS platform permits to export only 500 records at a time.
The Clarivate Analytics Web of Science export tool creates an export
file with a default name “savedrecs” with an extension “.txt” or “.bib”
for plain text or BibTeX format respectively. Export files can be
separately stored.
Querying from SCOPUS
The access to SCOPUS is via https://www.scopus.com.
To find all articles whose title includes the term “bibliometrics”,
simply write this keyword in the field and select “Article Title” (see
figure 4)
The search yielded 414 results on May 09, 2016.
You can download the references (up to 2,000 full records) by
checking the ‘Select All’ box and clicking on the link ‘Export’. Choose
the file type “BibTeX export” and “all available information” (see
figure 5).
The SCOPUS export tool creates an export file with the default name
“scopus.bib”.
bibliometrix installation
Download and install the most recent version of R (https://cran.r-project.org)
Download and install the most recent version of Rstudio (https://rstudio.com)
Open Rstudio, in the console window, digit:
install.packages(“bibliometrix”, dependencies=TRUE) ### installs
bibliometrix package and dependencies
library(bibliometrix) ### load bibliometrix package
## To cite bibliometrix in publications, please use:
##
## Aria, M. & Cuccurullo, C. (2017) bibliometrix: An R-tool for comprehensive science mapping analysis,
## Journal of Informetrics, 11(4), pp 959-975, Elsevier.
##
##
## https://www.bibliometrix.org
##
##
## For information and bug reports:
## - Send an email to info@bibliometrix.org
## - Write a post on https://github.com/massimoaria/bibliometrix/issues
##
## Help us to keep Bibliometrix free to download and use by contributing with a small donation to support our research team (https://bibliometrix.org/donate.html)
##
##
## To start with the shiny web-interface, please digit:
## biblioshiny()
Data loading and converting
The export file can be read and converted using by R using the
function convert2df:
convert2df(file, dbsource,
format)
The argument file is a character vector containing the name
of export files downloaded from SCOPUS, Clarivate Analytics WOS, Digital
Science Dimensions, PubMed or Cochrane CDSR website. file can
also contains the name of a json/xlm object download using Digital
Science Dimenions or PubMed APIs (through the packages
dimensionsR and pubmedR.
es. file <- c(“file1.txt”,“file2.txt”, …)
file <- "https://www.bibliometrix.org/datasets/savedrecs.bib"
M <- convert2df(file = file, dbsource = "isi", format = "bibtex")
##
## Converting your isi collection into a bibliographic dataframe
##
## Done!
##
##
## Generating affiliation field tag AU_UN from C1: Done!
convert2df creates a bibliographic data frame with cases
corresponding to manuscripts and variables to Field Tag in the original
export file.
convert2df accepts two additional arguments:
dbsource and format.
The argument dbsource indicates from which database the
collection has been downloaded.
It can be:
“isi” or “wos” (for Clarivate Analytics Web of Science
database),
“scopus” (for SCOPUS database),
“dimensions” (for DS Dimensions database)
“pubmed” (for PubMed/Medline database),
“cochrane” (for Cochrane Library database of systematic
reviews).
The argument format indicates the file format of the
imported collection. It can be “plaintext” or “bibtex” for WOS
collection and mandatorily “bibtext” for SCOPUS collection. The argument
is ignored if the collection comes from Pubmed or Cochrane.
Each manuscript contains several elements, such as authors’ names,
title, keywords and other information. All these elements constitute the
bibliographic attributes of a document, also called metadata.
Data frame columns are named using the standard Clarivate Analytics
WoS Field Tag codify.
The main field tags are:
AU |
Authors |
TI |
Document Title |
SO |
Publication Name (or Source) |
JI |
ISO Source Abbreviation |
DT |
Document Type |
DE |
Authors’ Keywords |
ID |
Keywords associated by SCOPUS or ISI database |
AB |
Abstract |
C1 |
Author Address |
RP |
Reprint Address |
CR |
Cited References |
TC |
Times Cited |
PY |
Year |
SC |
Subject Category |
UT |
Unique Article Identifier |
DB |
Bibliographic Database |
For a complete list of field tags see https://www.bibliometrix.org/documents/Field_Tags_bibliometrix.pdf
Bibliometric Analysis
The first step is to perform a descriptive analysis of the
bibliographic data frame.
The function biblioAnalysis calculates main bibliometric
measures using this syntax:
results <- biblioAnalysis(M, sep = ";")
The function biblioAnalysis returns an object of class
“bibliometrix”.
An object of class “bibliometrix” is a list containing the following
components:
Articles |
the total number of manuscripts |
Authors |
the authors’ frequency distribution |
AuthorsFrac |
the authors’ frequency distribution (fractionalized) |
FirstAuthors |
corresponding author of each manuscript |
nAUperPaper |
the number of authors per manuscript |
Appearances |
the number of author appearances |
nAuthors |
the number of authors |
AuMultiAuthoredArt |
the number of authors of multi-authored articles |
MostCitedPapers |
the list of manuscripts sorted by citations |
Years |
publication year of each manuscript |
FirstAffiliation |
the affiliation of the corresponding author |
Affiliations |
the frequency distribution of affiliations (of all co-authors for
each paper) |
Aff_frac |
the fractionalized frequency distribution of affiliations (of all
co-authors for each paper) |
CO |
the affiliation country of the corresponding author |
Countries |
the affiliation countries’ frequency distribution |
CountryCollaboration |
the intra-country (SCP) and inter-country (MCP) collaboration
indices |
TotalCitation |
the number of times each manuscript has been cited |
TCperYear |
the yearly average number of times each manuscript has been
cited |
Sources |
the frequency distribution of sources (journals, books, etc.) |
DE |
the frequency distribution of authors’ keywords |
ID |
the frequency distribution of keywords associated to the manuscript
by SCOPUS and Thomson Reuters’ ISI Web of Knowledge databases |
Functions summary and plot
To summarize main results of the bibliometric analysis, use the
generic function summary. It displays main information about
the bibliographic data frame and several tables, such as annual
scientific production, top manuscripts per number of citations, most
productive authors, most productive countries, total citation per
country, most relevant sources (journals) and most relevant
keywords.
Main information table describes the collection size in terms of
number of documents, number of authors, number of sources, number of
keywords, timespan, and average number of citations.
Furthermore, many different co-authorship indices are shown. In
particular, the Authors per Article index is calculated
as the ratio between the total number of authors and the total number of
articles. The Co-Authors per Articles index is
calculated as the average number of co-authors per article. In this
case, the index takes into account the author appearances while for the
“authors per article” an author, even if he has published more than one
article, is counted only once. For that reasons, Authors per Article
index \(\le\) Co-authors per Article
index.
The Collaboration Index (CI) is calculated as Total
Authors of Multi-Authored Articles/Total Multi-Authored Articles (Elango
and Rajendran, 2012; Koseoglu, 2016). In other word, the Collaboration
Index is a Co-authors per Article index calculated only using the
multi-authored article set.
Elango, B., & Rajendran, P. (2012). Authorship trends and
collaboration pattern in the marine sciences literature: a scientometric
study. International Journal of Information Dissemination and
Technology, 2(3), 166.
Koseoglu, M. A. (2016). Mapping the institutional collaboration
network of strategic management research: 1980–2014. Scientometrics,
109(1), 203-226.
summary accepts two additional arguments. k is a
formatting value that indicates the number of rows of each table.
pause is a logical value (TRUE or FALSE) used to allow (or not)
pause in screen scrolling. Choosing k=10 you decide to see the first 10
Authors, the first 10 sources, etc.
options(width=100)
S <- summary(object = results, k = 10, pause = FALSE)
##
##
## MAIN INFORMATION ABOUT DATA
##
## Timespan 1985 : 2015
## Sources (Journals, Books, etc) 141
## Documents 291
## Annual Growth Rate % 7.18
## Document Average Age 16.7
## Average citations per doc 11.73
## Average citations per year per doc 0.6489
## References 6768
##
## DOCUMENT TYPES
## art exhibit review 1
## article 160
## article; proceedings paper 7
## biographical-item 1
## book review 32
## correction, addition 1
## editorial material 41
## letter 16
## meeting abstract 4
## note 3
## review 25
##
## DOCUMENT CONTENTS
## Keywords Plus (ID) 475
## Author's Keywords (DE) 365
##
## AUTHORS
## Authors 523
## Author Appearances 635
## Authors of single-authored docs 121
##
## AUTHORS COLLABORATION
## Single-authored docs 144
## Documents per Author 0.556
## Co-Authors per Doc 2.18
## International co-authorships % 12.37
##
##
## Annual Scientific Production
##
## Year Articles
## 1985 4
## 1986 3
## 1987 6
## 1988 7
## 1989 8
## 1990 6
## 1991 7
## 1992 6
## 1993 5
## 1994 7
## 1995 1
## 1996 8
## 1997 4
## 1998 5
## 1999 2
## 2000 7
## 2001 8
## 2002 5
## 2003 1
## 2004 3
## 2005 12
## 2006 5
## 2007 5
## 2008 8
## 2009 14
## 2010 17
## 2011 20
## 2012 25
## 2013 21
## 2014 29
## 2015 32
##
## Annual Percentage Growth Rate 7.177346
##
##
## Most Productive Authors
##
## Authors Articles Authors Articles Fractionalized
## 1 BORNMANN L 8 BORNMANN L 4.67
## 2 KOSTOFF RN 8 WHITE HD 3.50
## 3 MARX W 6 MARX W 3.17
## 4 HUMENIK JA 5 ATKINSON R 3.00
## 5 ABRAMO G 4 BROADUS RN 3.00
## 6 D'ANGELO CA 4 CRONIN B 3.00
## 7 GARG KC 4 BORGMAN CL 2.50
## 8 GLANZEL W 4 MCCAIN KW 2.50
## 9 WHITE HD 4 PERITZ BC 2.50
## 10 ATKINSON R 3 KOSTOFF RN 2.10
##
##
## Top manuscripts per citations
##
## Paper DOI TC TCperYear NTC
## 1 DAIM TU, 2006, TECHNOL FORECAST SOC CHANG 10.1016/j.techfore.2006.04.004 211 12.41 3.44
## 2 WHITE HD, 1989, ANNU REV INFORM SCI TECHNOL NA 196 5.76 5.58
## 3 BORGMAN CL, 2002, ANNU REV INFORM SCI TECHNOL NA 192 9.14 3.60
## 4 WEINGART P, 2005, SCIENTOMETRICS 10.1007/s11192-005-0007-7 151 8.39 6.79
## 5 NARIN F, 1994, SCIENTOMETRICS 10.1007/BF02017219 141 4.86 4.72
## 6 CRONIN B, 2001, J INF SCI NA 129 5.86 2.78
## 7 CHEN YC, 2011, SCIENTOMETRICS 10.1007/s11192-010-0289-2 101 8.42 5.92
## 8 HOOD WW, 2001, SCIENTOMETRICS 10.1023/A:1017919924342 71 3.23 1.53
## 9 D'ANGELO CA, 2011, J AM SOC INF SCI TECHNOL 10.1002/asi.21460 64 5.33 3.75
## 10 NARIN F, 1994, EVAL REV 10.1177/0193841X9401800107 62 2.14 2.08
##
##
## Corresponding Author's Countries
##
## Country Articles Freq SCP MCP MCP_Ratio
## 1 USA 81 0.3057 76 5 0.0617
## 2 UNITED KINGDOM 27 0.1019 27 0 0.0000
## 3 GERMANY 17 0.0642 12 5 0.2941
## 4 FRANCE 13 0.0491 11 2 0.1538
## 5 BRAZIL 12 0.0453 10 2 0.1667
## 6 CHINA 10 0.0377 8 2 0.2000
## 7 INDIA 10 0.0377 10 0 0.0000
## 8 AUSTRALIA 8 0.0302 6 2 0.2500
## 9 CANADA 8 0.0302 7 1 0.1250
## 10 SPAIN 8 0.0302 8 0 0.0000
##
##
## SCP: Single Country Publications
##
## MCP: Multiple Country Publications
##
##
## Total Citations per Country
##
## Country Total Citations Average Article Citations
## 1 USA 1831 22.60
## 2 GERMANY 330 19.41
## 3 ITALY 163 32.60
## 4 AUSTRALIA 134 16.75
## 5 UNITED KINGDOM 125 4.63
## 6 CANADA 111 13.88
## 7 INDIA 85 8.50
## 8 IRAN 74 37.00
## 9 SPAIN 73 9.12
## 10 BELGIUM 70 10.00
##
##
## Most Relevant Sources
##
## Sources Articles
## 1 SCIENTOMETRICS 49
## 2 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY 14
## 3 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE 8
## 4 JOURNAL OF DOCUMENTATION 6
## 5 JOURNAL OF INFORMATION SCIENCE 6
## 6 JOURNAL OF INFORMETRICS 6
## 7 BRITISH JOURNAL OF ANAESTHESIA 5
## 8 LIBRI 5
## 9 SOCIAL WORK IN HEALTH CARE 5
## 10 TECHNOLOGICAL FORECASTING AND SOCIAL CHANGE 5
##
##
## Most Relevant Keywords
##
## Author Keywords (DE) Articles Keywords-Plus (ID) Articles
## 1 BIBLIOMETRICS 63 SCIENCE 38
## 2 CITATION ANALYSIS 11 INDICATORS 24
## 3 SCIENTOMETRICS 7 IMPACT 23
## 4 IMPACT FACTOR 5 CITATION 20
## 5 INFORMATION RETRIEVAL 5 CITATION ANALYSIS 15
## 6 PEER REVIEW 5 JOURNALS 14
## 7 CITATION 4 H-INDEX 13
## 8 CITATIONS 4 PUBLICATION 12
## 9 H-INDEX 4 INFORMATION-SCIENCE 10
## 10 IMPACT FACTORS 4 IMPACT FACTORS 8
Some basic plots can be drawn using the generic function :
plot(x = results, k = 10, pause = FALSE)
Analysis of Cited References
The function citations generates the frequency table of the
most cited references or the most cited first authors (of
references).
For each manuscript, cited references are in a single string stored
in the column “CR” of the data frame.
For a correct extraction, you need to identify the separator field
among different references, used by ISI or SCOPUS database. Usually, the
default separator is “;” or ". "
(a dot with double
space).
# M$CR[1]
The figure shows the reference string of the first manuscript. In
this case, the separator field is sep = ";"
.
To obtain the most frequent cited manuscripts:
CR <- citations(M, field = "article", sep = ";")
cbind(CR$Cited[1:10])
## [,1]
## HIRSCH JE, 2005, P NATL ACAD SCI USA, V102, P16569, DOI 10.1073/PNAS.0507655102 29
## PRITCHAR.A, 1969, J DOC, V25, P348 22
## SMALL H, 1973, J AM SOC INFORM SCI, V24, P265, DOI 10.1002/ASI.4630240406 20
## DE SOLLA PRICE DJ, 1963, LITTLE SCI BIG SCI 19
## BRADFORD S. C, 1934, ENGINEERING-LONDON, V137, P85 14
## GARFIELD E, 2006, JAMA-J AM MED ASSOC, V295, P90, DOI 10.1001/JAMA.295.1.90 12
## MOED H. F., 2005, CITATION ANAL RES EV 12
## COLE FRANCIS J., 1917, SCI PROGR, V11, P578 11
## GROSS P L, 1927, SCIENCE, V66, P385, DOI 10.1126/SCIENCE.66.1713.385 11
## LOTKA A. Y., 1926, J WASHINGTON ACAD SC, V16, P317 11
To obtain the most frequent cited first authors:
CR <- citations(M, field = "author", sep = ";")
cbind(CR$Cited[1:10])
## [,1]
## GARFIELD E 167
## KOSTOFF RN 129
## BORNMANN L 92
## SMALL H 75
## WHITE HD 74
## CRONIN B 67
## NARIN F 51
## GLANZEL W 50
## LEYDESDORFF L 47
## EGGHE L 45
The function localCitations generates the frequency table of
the most local cited authors. Local citations measure how many times an
author (or a document) included in this collection have been cited by
other authors also in the collection.
To obtain the most frequent local cited authors:
CR <- localCitations(M, sep = ";")
CR$Authors[1:10,]
## Author LocalCitations
## 21 BARKER K 9
## 190 HOLDEN G 9
## 386 ROSENBERG G 9
## 1 ABRAMO G 8
## 83 D'ANGELO CA 8
## 50 BROADUS RN 7
## 199 HUMENIK JA 6
## 233 KOSTOFF RN 6
## 354 PFEIL KM 6
## 455 TSHITEYA R 6
CR$Papers[1:10,]
## Paper DOI Year LCS GCS
## 9 BROADUS RN, 1987, SCIENTOMETRICS 10.1007/BF02016680 1987 5 38
## 108 HOLDEN G, 2005, SOC WORK HEALTH CARE 10.1300/J010v41n03\\_01 2005 5 22
## 141 ABRAMO G, 2009, RES POLICY 10.1016/j.respol.2008.11.001 2009 5 43
## 45 SENGUPTA IN, 1992, LIBRI 10.1515/libr.1992.42.2.75 1992 4 20
## 83 CRONIN B, 2000, J DOC 10.1108/EUM0000000007123 2000 4 20
## 95 KOSTOFF RN, 2002, J POWER SOURCES 10.1016/S0378-7753(02)00233-1 2002 4 34
## 109 HOLDEN G, 2005, SOC WORK HEALTH CARE-a 10.1300/J010v41n03\\_03 2005 4 34
## 6 PERSSON O, 1986, SCIENTOMETRICS 10.1007/BF02016861 1986 3 15
## 7 DEGLAS F, 1986, LIBRI 10.1515/libr.1986.36.1.40 1986 2 8
## 10 BROADUS RN, 1987, J EDUC LIBR INF SCI 10.2307/40323625 1987 2 2
Authors’ Dominance ranking
The function dominance calculates the authors’ dominance
ranking as proposed by Kumar & Kumar, 2008.
Kumar, S., & Kumar, S. (2008). Collaboration in research
productivity in oil seed research institutes of India. In Proceedings of
Fourth International Conference on Webometrics, Informetrics and
Scientometrics.
Function arguments are: results (object of class
bibliometrix) obtained by biblioAnalysis; and
k (the number of authors to consider in the analysis).
DF <- dominance(results, k = 10)
DF
## Author Dominance Factor Tot Articles Single-Authored Multi-Authored First-Authored Rank by Articles Rank by DF
## 1 KOSTOFF RN 1.0000000 8 0 8 8 1 1
## 2 WHITE HD 1.0000000 4 3 1 1 4 1
## 3 BORGMAN CL 1.0000000 3 2 1 1 9 1
## 4 HOLDEN G 1.0000000 3 0 3 3 9 1
## 5 BORNMANN L 0.8333333 8 2 6 5 1 5
## 6 ABRAMO G 0.7500000 4 0 4 3 4 6
## 7 GARG KC 0.7500000 4 0 4 3 4 6
## 8 GLANZEL W 0.7500000 4 0 4 3 4 6
## 9 D'ANGELO CA 0.2500000 4 0 4 1 4 9
## 10 MARX W 0.2000000 6 1 5 1 3 10
The Dominance Factor is a ratio indicating the fraction of
multi-authored articles in which a scholar appears as the first
author.
In this example, Kostoff and Holden dominate their research team
because they appear as the first authors in all their papers (8 for
Kostoff and 3 for Holden).
Authors’ h-index
The h-index is an author-level metric that attempts to measure both
the productivity and citation impact of the publications of a scientist
or scholar.
The index is based on the set of the scientist’s most cited papers
and the number of citations that they have received in other
publications.
The function Hindex calculates the authors’ H-index or the
sources’ H-index and its variants (g-index and m-index) in a
bibliographic collection.
Function arguments are: M a bibliographic data frame;
field is character element that defines the unit of analysis in
terms of authors (field = “auhtor”) or sources (field = “source”);
elements a character vector containing the authors’ names (or
the sources’ names) for which you want to calculate the H-index. The
argument has the form c(“SURNAME1 N”,“SURNAME2 N”,…).
In other words, for each author: surname and initials are separated
by one blank space. i.e for the authors ARIA MASSIMO and CUCCURULLO
CORRADO, elements argument is elements = c(“ARIA M”,
“CUCCURULLO C”).
To calculate the h-index of Lutz Bornmann in this collection:
indices <- Hindex(M, field = "author", elements="BORNMANN L", sep = ";", years = 10)
# Bornmann's impact indices:
indices$H
## Element h_index g_index m_index TC NP PY_start
## 1 BORNMANN L 4 7 0.3636364 50 7 2012
# Bornmann's citations
indices$CitationList
## $`BORNMANN L`
## AU
## BORNMANN L, 2015, J INFORMETR BORNMANN L;MARX W
## BORNMANN L, 2014, RES EVALUAT BORNMANN L
## BORNMANN L, 2013, J AM SOC INF SCI TECHNOL BORNMANN L
## BORNMANN L, 2013, J INFORMETR BORNMANN L;WILLIAMS R
## BORNMANN L, 2013, J INFORMETR-a BORNMANN L;MARX W
## BORNMANN L, 2012, Z EVAL BORNMANN L;BOWMAN BF;BAUER J;MARX W;SCHIER H;PALZENBERGER M
## BORNMANN L, 2014, J INFORMETR BORNMANN L;LEYDESDORFF L
## SO PY
## BORNMANN L, 2015, J INFORMETR JOURNAL OF INFORMETRICS 2015
## BORNMANN L, 2014, RES EVALUAT RESEARCH EVALUATION 2014
## BORNMANN L, 2013, J AM SOC INF SCI TECHNOL JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY 2013
## BORNMANN L, 2013, J INFORMETR JOURNAL OF INFORMETRICS 2013
## BORNMANN L, 2013, J INFORMETR-a JOURNAL OF INFORMETRICS 2013
## BORNMANN L, 2012, Z EVAL ZEITSCHRIFT FUR EVALUATION 2012
## BORNMANN L, 2014, J INFORMETR JOURNAL OF INFORMETRICS 2014
## TC DI
## BORNMANN L, 2015, J INFORMETR 5 10.1016/j.joi.2015.01.006
## BORNMANN L, 2014, RES EVALUAT 3 10.1093/reseval/rvu002
## BORNMANN L, 2013, J AM SOC INF SCI TECHNOL 18 10.1002/asi.22792
## BORNMANN L, 2013, J INFORMETR 10 10.1016/j.joi.2013.02.005
## BORNMANN L, 2013, J INFORMETR-a 11 10.1016/j.joi.2012.09.003
## BORNMANN L, 2012, Z EVAL 2 <NA>
## BORNMANN L, 2014, J INFORMETR 1 10.1016/j.joi.2013.12.006
To calculate the h-index of the first 10 most productive authors (in
this collection):
authors=gsub(","," ",names(results$Authors)[1:10])
indices <- Hindex(M, field = "author", elements=authors, sep = ";", years = 50)
indices$H
## Element h_index g_index m_index TC NP PY_start
## 1 ABRAMO G 4 4 0.28571429 158 4 2009
## 2 BORNMANN L 4 7 0.36363636 50 7 2012
## 3 D'ANGELO CA 4 4 0.28571429 158 4 2009
## 4 GARG KC 4 4 0.12903226 41 4 1992
## 5 GLANZEL W 1 1 0.05882353 41 1 2006
## 6 HUMENIK JA 5 5 0.21739130 213 5 2000
## 7 KOSTOFF RN 8 8 0.33333333 276 8 1999
## 8 MARX W 3 5 0.25000000 36 5 2011
## 9 WHITE HD 4 4 0.11764706 248 4 1989
Top-Authors’ Productivity over the Time
The function AuthorProdOverTime calculates and plots the
authors’ production (in terms of number of publications, and total
citations per year) over the time.
Function arguments are: M a bibliographic data frame;
k is the number of k Top Authors; graph is a
logical. If graph=TRUE, the function plots the author
production over time graph.
topAU <- authorProdOverTime(M, k = 10, graph = TRUE)
## Table: Author's productivity per year
head(topAU$dfAU)
## Author year freq TC TCpY
## 1 ABRAMO G 2009 1 43 3.0714286
## 2 ABRAMO G 2011 3 115 9.5833333
## 3 ATKINSON R 2012 3 0 0.0000000
## 4 BORNMANN L 2012 1 2 0.1818182
## 5 BORNMANN L 2013 3 39 3.9000000
## 6 BORNMANN L 2014 2 4 0.4444444
## Table: Auhtor's documents list
#head(topAU$dfPapersAU)
Lotka’s Law coefficient estimation
The function lotka estimates Lotka’s law coefficients for
scientific productivity (Lotka A.J., 1926).
Lotka’s law describes the frequency of publication by authors in any
given field as an inverse square law, where the number of authors
publishing a certain number of articles is a fixed ratio to the number
of authors publishing a single article. This assumption implies that the
theoretical beta coefficient of Lotka’s law is equal to 2.
Using lotka function is possible to estimate the Beta
coefficient of our bibliographic collection and assess, through a
statistical test, the similarity of this empirical distribution with the
theoretical one.
L <- lotka(results)
# Author Productivity. Empirical Distribution
L$AuthorProd
## N.Articles N.Authors Freq
## 1 1 453 0.866156788
## 2 2 48 0.091778203
## 3 3 13 0.024856597
## 4 4 5 0.009560229
## 5 5 1 0.001912046
## 6 6 1 0.001912046
## 7 8 2 0.003824092
# Beta coefficient estimate
L$Beta
## [1] 3.043039
# Constant
L$C
## [1] 0.688399
# Goodness of fit
L$R2
## [1] 0.917464
# P-value of K-S two sample test
L$p.value
## [1] 0.2031888
The table L$AuthorProd shows the observed distribution of scientific
productivity in our example.
The estimated Beta coefficient is 3.05 with a goodness of fit equal
to 0.94. Kolmogorov-Smirnoff two sample test provides a p-value 0.09
that means there is not a significant difference between the observed
and the theoretical Lotka distributions.
You can compare the two distributions using plot
function:
# Observed distribution
Observed=L$AuthorProd[,3]
# Theoretical distribution with Beta = 2
Theoretical=10^(log10(L$C)-2*log10(L$AuthorProd[,1]))
plot(L$AuthorProd[,1],Theoretical,type="l",col="red",ylim=c(0, 1), xlab="Articles",ylab="Freq. of Authors",main="Scientific Productivity")
lines(L$AuthorProd[,1],Observed,col="blue")
legend(x="topright",c("Theoretical (B=2)","Observed"),col=c("red","blue"),lty = c(1,1,1),cex=0.6,bty="n")
Bibliographic network matrices
Manuscript’s attributes are connected to each other through the
manuscript itself: author(s) to journal, keywords to publication date,
etc.
These connections of different attributes generate bipartite networks
that can be represented as rectangular matrices (Manuscripts x
Attributes).
Furthermore, scientific publications regularly contain references to
other scientific works. This generates a further network, namely,
co-citation or coupling network.
These networks are analyzed in order to capture meaningful properties
of the underlying research system, and in particular to determine the
influence of bibliometric units such as scholars and journals.
Bipartite networks
cocMatrix is a general function to compute a bipartite
network selecting one of the metadata attributes.
For example, to create a network Manuscript x Publication
Source you have to use the field tag “SO”:
A <- cocMatrix(M, Field = "SO", sep = ";")
A is a rectangular binary matrix, representing a bipartite network
where rows and columns are manuscripts and sources respectively.
The generic element \(a_{ij}\) is 1
if the manuscript \(i\) has been
published in source \(j\), 0
otherwise.
The \(j-th\) column sum \(a_j\) is the number of manuscripts
published in source \(j\).
Sorting, in decreasing order, the column sums of A, you can see the
most relevant publication sources:
sort(Matrix::colSums(A), decreasing = TRUE)[1:5]
## SCIENTOMETRICS
## 49
## JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY
## 14
## JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE
## 8
## JOURNAL OF DOCUMENTATION
## 6
## JOURNAL OF INFORMATION SCIENCE
## 6
Following this approach, you can compute several bipartite
networks:
A <- cocMatrix(M, Field = "CR", sep = ". ")
A <- cocMatrix(M, Field = "AU", sep = ";")
Authors’ Countries is not a standard attribute of the bibliographic
data frame. You need to extract this information from affiliation
attribute using the function metaTagExtraction.
M <- metaTagExtraction(M, Field = "AU_CO", sep = ";")
# A <- cocMatrix(M, Field = "AU_CO", sep = ";")
metaTagExtraction allows to extract the following additional
field tags: Authors’ countries (Field = "AU_CO"
);
First Author’s countries (Field = "AU_CO"
);
First author of each cited reference
(Field = "CR_AU"
); Publication source of each cited
reference (Field = "CR_SO"
); and Authors’
affiliations (Field = "AU_UN"
).
A <- cocMatrix(M, Field = "DE", sep = ";")
A <- cocMatrix(M, Field = "ID", sep = ";")
Bibliographic coupling
Two articles are said to be bibliographically coupled if at least one
cited source appears in the bibliographies or reference lists of both
articles (Kessler, 1963).
A coupling network can be obtained using the general formulation:
\[
B = A \times A^T
\] where A is a bipartite network.
Element \(b_{ij}\) indicates how
many bibliographic couplings exist between manuscripts \(i\) and \(j\). In other words, \(b_{ij}\) gives the number of paths of
length 2, via which one moves from \(i\) along the arrow and then to \(j\) in the opposite direction.
\(B\) is a symmetrical matrix \(B = B^T\).
The strength of the coupling of two articles, \(i\) and \(j\) is defined simply by the number of
references that the articles have in common, as given by the element
\(b_{ij}\) of matrix \(B\).
The function biblioNetwork calculates, starting from a
bibliographic data frame, the most frequently used coupling networks:
Authors, Sources, and Countries.
biblioNetwork uses two arguments to define the network to
compute:
analysis argument can be “co-citation”, “coupling”,
“collaboration”, or “co-occurrences”.
network argument can be “authors”, “references”,
“sources”, “countries”, “universities”, “keywords”, “author_keywords”,
“titles” and “abstracts”.
The following code calculates a classical article coupling
network:
NetMatrix <- biblioNetwork(M, analysis = "coupling", network = "references", sep = ". ")
Articles with only a few references, therefore, would tend to be more
weakly bibliographically coupled, if coupling strength is measured
simply according to the number of references that articles contain in
common.
This suggests that it might be more practical to switch to a relative
measure of bibliographic coupling.
normalizeSimilarity function calculates Association
strength, Inclusion, Jaccard or Salton similarity among vertices of a
network. normalizeSimilarity can be recalled directly from
networkPlot() function using the argument
normalize.
NetMatrix <- biblioNetwork(M, analysis = "coupling", network = "authors", sep = ";")
#net=networkPlot(NetMatrix, normalize = "salton", weighted=NULL, n = 100, Title = "Authors' Coupling", type = "fruchterman", size=5,size.cex=T,remove.multiple=TRUE,labelsize=0.8,label.n=10,label.cex=F)
Bibliographic co-citation
We talk about co-citation of two articles when both are cited in a
third article. Thus, co-citation can be seen as the counterpart of
bibliographic coupling.
A co-citation network can be obtained using the general
formulation:
\[
C = A^T \times A
\] where A is a bipartite network.
Like matrix \(B\), matrix \(C\) is also symmetric. The main diagonal of
\(C\) contains the number of cases in
which a reference is cited in our data frame.
In other words, the diagonal element \(c_{i}\) is the number of local citations of
the reference \(i\).
Using the function biblioNetwork, you can calculate a
classical reference co-citation network:
# NetMatrix <- biblioNetwork(M, analysis = "co-citation", network = "references", sep = ". ")
Bibliographic collaboration
Scientific collaboration network is a network where nodes are authors
and links are co-authorships as the latter is one of the most
well-documented forms of scientific collaboration (Glanzel, 2004).
An author collaboration network can be obtained using the general
formulation:
\[
AC = A^T \times A
\] where A is a bipartite network Manuscripts x
Authors.
The diagonal element \(ac_{i}\) is
the number of manuscripts authored or co-authored by researcher \(i\).
Using the function biblioNetwork, you can calculate an
authors’ collaboration network:
# NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "authors", sep = ";")
or a country collaboration network:
# NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "countries", sep = ";")
Descriptive analysis of network graph characteristics
The function networkStat calculates several summary
statistics.
In particular, starting from a bibliographic matrix (or an
igraph object), two groups of descriptive measures are
computed:
# An example of a classical keyword co-occurrences network
NetMatrix <- biblioNetwork(M, analysis = "co-occurrences", network = "keywords", sep = ";")
netstat <- networkStat(NetMatrix)
The summary statistics of the network
This group of statistics allows to describe the structural properties
of a network:
Size is the number of vertices composing the
network;
Density is the proportion of present edges from
all possible edges in the network;
Transitivity is the ratio of triangles to
connected triples;
Diameter is the longest geodesic distance
(length of the shortest path between two nodes) in the network;
Degree distribution is the cumulative
distribution of vertex degrees;
Degree centralization is the normalized degree
of the overall network;
Closeness centralization is the normalized
inverse of the vertex average geodesic distance to others in the
network;
Eigenvector centralization is the first
eigenvector of the graph matrix;
Betweenness centralization is the normalized
number of geodesics that pass through the vertex;
Average path length is the mean of the shortest
distance between each pair of vertices in the network.
names(netstat$network)
## [1] "networkSize" "networkDensity" "networkTransitivity" "networkDiameter"
## [5] "networkDegreeDist" "networkCentrDegree" "networkCentrCloseness" "networkCentrEigen"
## [9] "networkCentrbetweenness" "NetworkAverPathLeng"
The main indices of centrality and prestige of vertices
These measures help to identify the most important vertices in a
network and the propensity of two vertices that are connected to be both
connected to a third vertex.
The statistics, at vertex level, returned by networkStat
are:
Degree centrality
Closeness centrality measures how many steps are
required to access every other vertex from a given vertex;
Eigenvector centrality is a measure of being
well-connected connected to the well-connected;
Betweenness centrality measures brokerage or
gatekeeping potential. It is (approximately) the number of shortest
paths between vertices that pass through a particular vertex;
PageRank score approximates probability that any
message will arrive to a particular vertex. This algorithm was developed
by Google founders, and originally applied to website links;
Hub Score estimates the value of the links
outgoing from the vertex. It was initially applied to the web
pages;
Authority Score is another measure of centrality
initially applied to the Web. A vertex has high authority when it is
linked by many other vertices that are linking many other
vertices;
Vertex Ranking is an overall vertex ranking
obtained as a linear weighted combination of the centrality and prestige
vertex measures. The weights are proportional to the loadings of the
first component of the Principal Component Analysis.
names(netstat$vertex)
## NULL
To summarize the main results of the networkStat function,
use the generic function summary. It displays the main
information about the network and vertex description through several
tables.
summary accepts one additional argument. k is a
formatting value that indicates the number of rows of each table.
Choosing k=10, you decide to see the first 10 vertices.
summary(netstat, k=10)
##
##
## Main statistics about the network
##
## Size 475
## Density 0.024
## Transitivity 0.335
## Diameter 5
## Degree Centralization 0.301
## Average path length 2.743
##
Visualizing bibliographic networks
All bibliographic networks can be graphically visualized or
modeled.
Here, we show how to visualize networks using function
networkPlot and VOSviewer software by Nees Jan van Eck
and Ludo Waltman (https://www.vosviewer.com).
Using the function networkPlot, you can plot a network
created by biblioNetwork using R routines or using
VOSviewer.
The main argument of networkPlot is type. It indicates the
network map layout: circle, kamada-kawai, mds, etc. Choosing
type=“vosviewer”, the function automatically: (i) saves the network into
a pajek network file, named “vosnetwork.net”; (ii) starts an instance of
VOSviewer which will map the file “vosnetwork.net”. You need to declare,
using argument vos.path, the full path of the folder where
VOSviewer software is located (es.
vos.path=‘c:/software/VOSviewer’).
Country Scientific Collaboration
# Create a country collaboration network
M <- metaTagExtraction(M, Field = "AU_CO", sep = ";")
NetMatrix <- biblioNetwork(M, analysis = "collaboration", network = "countries", sep = ";")
# Plot the network
net=networkPlot(NetMatrix, n = dim(NetMatrix)[1], Title = "Country Collaboration", type = "circle", size=TRUE, remove.multiple=FALSE,labelsize=0.7,cluster="none")
Co-Citation Network
# Create a co-citation network
# NetMatrix <- biblioNetwork(M, analysis = "co-citation", network = "references", sep = ";")
# Plot the network
#net=networkPlot(NetMatrix, n = 30, Title = "Co-Citation Network", type = "fruchterman", size=T, remove.multiple=FALSE, labelsize=0.7,edgesize = 5)
Keyword co-occurrences
# Create keyword co-occurrences network
NetMatrix <- biblioNetwork(M, analysis = "co-occurrences", network = "keywords", sep = ";")
# Plot the network
net=networkPlot(NetMatrix, normalize="association", weighted=T, n = 30, Title = "Keyword Co-occurrences", type = "fruchterman", size=T,edgesize = 5,labelsize=0.7)
Co-Word Analysis: The conceptual structure of a field
The aim of the co-word analysis is to map the conceptual structure of
a framework using the word co-occurrences in a bibliographic
collection.
The analysis can be performed through dimensionality reduction
techniques such as Multidimensional Scaling (MDS), Correspondence
Analysis (CA) or Multiple Correspondence Analysis (MCA).
Here, we show an example using the function
conceptualStructure that performs a CA or MCA to draw a
conceptual structure of the field and K-means clustering to identify
clusters of documents which express common concepts. Results are plotted
on a two-dimensional map.
conceptualStructure includes natural language processing
(NLP) routines (see the function termExtraction) to extract
terms from titles and abstracts. In addition, it implements the Porter’s
stemming algorithm to reduce inflected (or sometimes derived) words to
their word stem, base or root form.
# Conceptual Structure using keywords (method="CA")
CS <- conceptualStructure(M,field="ID", method="CA", minDegree=4, clust=5, stemming=FALSE, labelsize=10, documents=10)
Historical Direct Citation Network
The historiographic map is a graph proposed by E. Garfield (2004) to
represent a chronological network map of most relevant direct citations
resulting from a bibliographic collection.
Garfield, E. (2004). Historiographic mapping of knowledge domains
literature. Journal of Information Science, 30(2), 119-145.
The function generates a chronological direct citation network matrix
which can be plotted using histPlot:
# Create a historical citation network
options(width=130)
histResults <- histNetwork(M, min.citations = 1, sep = ";")
##
## WOS DB:
## Searching local citations (LCS) by reference items (SR) and DOIs...
##
## Analyzing 8702 reference items...
##
## Found 49 documents with no empty Local Citations (LCS)
# Plot a historical co-citation network
net <- histPlot(histResults, n=15, size = 10, labelsize=5)
##
## Legend
##
## Label DOI Year LCS GCS
## 1 DEGLAS F, 1986, LIBRI DOI 10.1515/LIBR.1986.36.1.40 10.1515/libr.1986.36.1.40 1986 2 8
## 2 BROADUS RN, 1987, SCIENTOMETRICS DOI 10.1007/BF02016680 10.1007/BF02016680 1987 5 38
## 3 BROADUS RN, 1987, J EDUC LIBR INF SCI DOI 10.2307/40323625 10.2307/40323625 1987 2 2
## 4 PERITZ BC, 1990, SCIENTOMETRICS DOI 10.1007/BF02020148 10.1007/BF02020148 1990 2 5
## 5 SENGUPTA IN, 1992, LIBRI DOI 10.1515/LIBR.1992.42.2.75 10.1515/libr.1992.42.2.75 1992 4 20
## 6 CRONIN B, 2000, J DOC DOI 10.1108/EUM0000000007123 10.1108/EUM0000000007123 2000 4 20
## 7 HOOD WW, 2001, SCIENTOMETRICS DOI 10.1023/A:1017919924342 10.1023/A:1017919924342 2001 2 71
## 8 KOSTOFF RN, 2002, J POWER SOURCES DOI 10.1016/S0378-7753(02)00233-1 10.1016/S0378-7753(02)00233-1 2002 4 34
## 9 KOSTOFF RN, 2005, ENERGY DOI 10.1016/J.ENERGY.2004.04.058 10.1016/j.energy.2004.04.058 2005 2 39
## 10 HOLDEN G, 2005, SOC WORK HEALTH CARE DOI 10.1300/J010V41N03\\_01 10.1300/J010v41n03\\_01 2005 5 22
## 11 HOLDEN G, 2005, SOC WORK HEALTH CARE DOI 10.1300/J010V41N03\\_03 10.1300/J010v41n03\\_03 2005 4 34
## 12 ABRAMO G, 2009, RES POLICY DOI 10.1016/J.RESPOL.2008.11.001 10.1016/j.respol.2008.11.001 2009 5 43
## 13 ABRAMO G, 2011, SCIENTOMETRICS DOI 10.1007/S11192-011-0352-7 10.1007/s11192-011-0352-7 2011 2 35
Main Authors’ references (about bibliometrics)
Aria, M. & Cuccurullo, C. (2017). bibliometrix: An
R-tool for comprehensive science mapping analysis, Journal of
Informetrics, 11(4), pp 959-975, Elsevier, DOI:
10.1016/j.joi.2017.08.007 (https://doi.org/10.1016/j.joi.2017.08.007).
Aria M., Misuraca M., Spano M. (2020) Mapping the evolution of social
research and data science on 30 years of Social Indicators Research,
Social Indicators Research. (DOI: )https://doi.org/10.1007/s11205-020-02281-3)
Aria, M., Cuccurullo, C., D’Aniello, L., Misuraca, M., & Spano,
M. (2022). Thematic Analysis as a New Culturomic Tool: The Social Media
Coverage on COVID-19 Pandemic in Italy. Sustainability, 14(6),
3643, (https://doi.org/10.3390/su14063643).
Aria M., Alterisio A., Scandurra A, Pinelli C., D’Aniello B, (2021)
The scholar’s best friend: research trends in dog cognitive and
behavioural studies, Animal Cognition. (https://doi.org/10.1007/s10071-020-01448-2)
Cuccurullo, C., Aria, M., & Sarto, F. (2016). Foundations and
trends in performance management. A twenty-five years bibliometric
analysis in business and public administration domains,
Scientometrics, DOI: 10.1007/s11192-016-1948-8 (https://doi.org/10.1007/s11192-016-1948-8).
Cuccurullo, C., Aria, M., & Sarto, F. (2015). Twenty years of
research on performance management in business and public administration
domains. Presentation at the Correspondence Analysis and Related
Methods conference (CARME 2015) in September 2015 (https://www.bibliometrix.org/documents/2015Carme_cuccurulloetal.pdf).
Sarto, F., Cuccurullo, C., & Aria, M. (2014). Exploring
healthcare governance literature: systematic review and paths for future
research. Mecosan (https://www.francoangeli.it/Riviste/Scheda_Rivista.aspx?IDarticolo=52780&lingua=en).
Cuccurullo, C., Aria, M., & Sarto, F. (2013). Twenty years of
research on performance management in business and public administration
domains. In Academy of Management Proceedings (Vol. 2013,
No. 1, p. 14270). Academy of Management (https://doi.org/10.5465/AMBPP.2013.14270abstract).