APIs for Dummies (don’t read the python code section. We are using R instead of python)
The OpenAlex API will return JSON in a web browser which can be read by a human.
You simply type the API route and the directory (entities) you are looking up into your browser. The directories are massive, so it is necessary to drill down into a known subdirectory or filter. Directories (entities) = institutions; works; authors; sources; topics; publishers; funders. Examples:
Institutions
https://api.openalex.org/institutions?search=Southampton https://api.openalex.org/institutions/https://ror.org/01ryk1543
Works (research outputs)
Authors
e.g. https://api.openalex.org/authors/https://orcid.org/0000-0002-4726-8018
Results are returned in pages of 25 results as standard.
To find research outputs with a UoS affiliation published in May 2024, paste this into Firefox:
Count = total number of works (research outputs). There are 25 results per page (so 524/25 = 21 pages in total. Each result is one research output and you can click to see the authors, title, DOI, funding info etc.
Open R Studio - New file - R script
When you are writing an R script, you can use # at the start of a line to write comments to explain and document your code. Everything on the line following # is ignored by R when it runs the code.
Install and load the packages httr (for making requests to APIs)and jsonlite (for working with JSON data). Why these 2 packages? Because Steven knows they are useful! There are great resources online explaining packages, including R for Librarians written for our team.
Note that you need to enclose the package name in “” to
install.packages but not with the library
command to load.
# install and load httr and jsonlite
install.packages("httr")
install.packages("jsonlite")
library(httr)
library(jsonlite)
You can also install packages by clicking on Packages > Install and type a package name.
Then we define our search parameters variables date1, date2, and institution. Remember that in this example we are looking for research outputs with a UoS affiliation published in May 2024.
# Define parameters
date1<-"2024-05-01"
date2<-"2024-05-31"
ror<-"https://ror.org/01ryk1543"
The next step is to add your email address to the request header to add you to the API “polite list”. I have used my email address, you will need to change it to yours.
email <- "nsc@soton.ac.uk"
Data frames in R provide a structured, tabular format (like a spreadsheet) to organise and efficiently analyse potentially mixed data types. This section also builds the initial API endpoint URL for the first page of results.
#setup dataframe for a full API call
publication_data <- data.frame(
Authors = character(),
Title = character(),
Publisher = character(),
Sourcetitle = character(),
Type = character(),
DOI = character(),
Hyperlink.DOI = character(),
CollaboratingHEIs = character(),
Pubdate = character(),
Is.OA = character(),
OA.colour = character(),
License = character(),
Any.repository = character(),
Repositorylist = character(),
Datasets = character(),
SDG = character(),
Funder = character()
)
api_endpoint<-(paste0("https://api.openalex.org/works?filter=from_publication_date:",date1,",to_publication_date:",date2,",institutions.ror:",ror))
response<-GET(api_endpoint,(add_headers(paste0("mailto=",email))))
You can use {} to wrap chunks of code and run them as a block.
This section determines how many pages of results there are (in the May 2024 example there are 21). A successful API response has the status code 200. It then works out how many pages we need to loop through (remember we only get 25 records per page).
{
#Read the first page of an api call to establish the number of pages
if (status_code(response) == 200) {
# Extract the content as text
content <- content(response, "text", encoding = "UTF-8")
# Parse the JSON content
json_data <- fromJSON(content)
}
}
# Calculate the total number of pages (imax) based on the API response
if (round(json_data[["meta"]][["count"]]/json_data[["meta"]][["per_page"]])<(json_data[["meta"]][["count"]]/json_data[["meta"]][["per_page"]])){
# If the division results in a decimal (meaning there are more pages due to partial page), round up and add 1 to get the total number of pages
imax<-round(json_data[["meta"]][["count"]]/json_data[["meta"]][["per_page"]])+1
}else{
# If the answer is a whole number, calculate the number of pages by rounding down
# Set the starting page number
imin<-1
# Calculate the total number of pages
imax<-round(json_data[["meta"]][["count"]]/json_data[["meta"]][["per_page"]])
}
# Loop through all the pages
for (i in imin:imax){
api_endpoint <- paste0("https://api.openalex.org/works?filter=from_publication_date:",date1,",to_publication_date:",date2,",institutions.ror:",ror,"&page=",i)
response <- GET(api_endpoint, add_headers(paste0("mailto=",email)))
if (status_code(response) == 200) {
# Extract the content as text
content <- content(response, "text", encoding = "UTF-8")
# Parse the JSON content
json_data <- fromJSON(content)
for (n in 1:(sum(sapply(json_data[["results"]][["id"]],length)))){
Then we trust the process and use the work Steven put in to make the data display in an appropriate way:
Authors <- paste(c(json_data[["results"]][["authorships"]][[n]][["raw_author_name"]]), collapse = ", ")
Title<-json_data[["results"]][["title"]][[n]]
Sourcetitle<- if (is.list(json_data[["results"]][["locations"]][[n]][["source"]])) {paste(c(json_data[["results"]][["locations"]][[n]][["source"]][["display_name"]][[1]]))}else{NA}
doctype<-json_data[["results"]][["type_crossref"]][[n]]
publisher<-json_data[["results"]][["primary_location"]][["source"]][["host_organization_name"]][[n]]
Hyperlink.DOI<-json_data[["results"]][["doi"]][[n]]
DOI<-gsub("^.*?(10\\.)", "10.", Hyperlink.DOI)
CollaboratingHEIs<-json_data[["results"]][["institutions_distinct_count"]][[n]]
publication.date<-json_data[["results"]][["publication_date"]][[n]]
Is.OA<-json_data[["results"]][["best_oa_location"]][["is_oa"]][[n]]
OA.colour<-paste(c(json_data[["results"]][["open_access"]][["oa_status"]][[n]]))
license<-paste(c(json_data[["results"]][["best_oa_location"]][["license"]][[n]]))
Any.repository<-json_data[["results"]][["open_access"]][["any_repository_has_fulltext"]][[n]]
repositorylist<-if (is.list(json_data[["results"]][["locations"]][[n]][["source"]])) {paste0(json_data[["results"]][["locations"]][[n]][["source"]][["display_name"]][-1],collapse = "; ")}else{NA}
Datasets<-if (length(json_data[["results"]][["datasets"]][[n]])==0){NA}else{paste(json_data[["results"]][["datasets"]][[n]], collapse = "; ")}
SDG<- if (is.null(json_data[["results"]][["sustainable_development_goals"]][[n]][["display_name"]])){NA}else{paste(c(json_data[["results"]][["sustainable_development_goals"]][[n]][["display_name"]]),collapse = "; ")}
Funder<-paste(c(json_data[["results"]][["grants"]][[n]][["funder_display_name"]]), collapse = "; ")
Then the final piece of magic ‘binds’ the data to the date frame instead of overwriting each page of results. In this example you should end up with 524 rows of publication information, just as you did in Firefox.
publication_data <- rbind(publication_data,data.frame(Authors=Authors,Title=Title,Sourcetitle=Sourcetitle,Publisher=publisher,Type=doctype,DOI=DOI,Hyperlink.DOI=Hyperlink.DOI,CollaboratingHEIs=CollaboratingHEIs,Pubdate=publication.date,Is.OA=Is.OA,OA.colour=OA.colour,License=license,Any.repository=Any.repository,Repositorylist=repositorylist,Datasets=Datasets,SDG=SDG,Funder=Funder,stringsAsFactors = FALSE))
}
}}
The results will populate the Environment section, showing 524 obs (research outputs) of 17 variables (columns: author, title, publisher etc)
Click on publication-data to see it display in the tabular format.
Export your results to csv with this (replace “My Filename” with whatever you want to call it)
write.csv(publication_data,"My Filename")