I've got a table of journal publications, and I'd like to extract the first, second-to-last, and last authors.
Unfortunately, the number of authors varies a lot: some publications have one author and some have as many as 35.
A publication with one author should yield only a first author; with two authors, a first and a last author; with three or more authors, a first, second-to-last, and last author.
Here's the original dataset:
pub1 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4",
"pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3",
"author1, author2, author3, author4", "author1, author2, author3, author4, author5",
"author1, author2, author3, author4, author5, author6")),
class = "data.frame", row.names = c(NA, -6L))
And here's an expected output:
pub2 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4",
"pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3",
"author1, author2, author3, author4", "author1, author2, author3, author4, author5",
"author1, author2, author3, author4, author5, author6"),
author_first = c("author1", "author1", "author1", "author1", "author1", "author1"),
author_second_last = c("", ""," author2", " author3", " author4", " author5"),
author_last = c("", " author2", " author3", " author4", " author5", " author6")),
class = "data.frame", row.names = c(NA, -6L))
I have no idea how to go about this.
Answer:
Here's one way to do it using dplyr and stringr:
library(dplyr)
library(stringr)
author_position <- function(str, p, position) {
  stopifnot(is.numeric(position))
  # Split the string into pieces on the pattern `p` (in this case `,`)
  # and trim the surrounding white space
  s <- str_trim(str_split(str, p, simplify = TRUE))
  len <- length(s)
  if (abs(position) >= len) {
    # The requested position is out of range, or it would coincide with the
    # first/last author, so return NA.
    # Exception: position 1 always returns the first author.
    if (position == 1) {
      first(s)
    } else {
      NA_character_
    }
  } else {
    # Otherwise return the value at the selected position
    # (negative positions count from the end)
    nth(s, position)
  }
}
pub1 %>%
  rowwise() %>% # group by row
  mutate(author_first = author_position(authors, ",", 1),
         author_second_last = author_position(authors, ",", -2),
         author_last = author_position(authors, ",", -1))
# # A tibble: 6 × 5
# # Rowwise:
#   publication authors                                              author_first author_second_last author_last
#   <chr>       <chr>                                                <chr>        <chr>              <chr>
# 1 pub1        author1                                              author1      NA                 NA
# 2 pub2        author1, author2                                     author1      NA                 author2
# 3 pub3        author1, author2, author3                            author1      author2            author3
# 4 pub4        author1, author2, author3, author4                   author1      author3            author4
# 5 pub5        author1, author2, author3, author4, author5          author1      author4            author5
# 6 pub6        author1, author2, author3, author4, author5, author6 author1      author5            author6
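For comparison, the same rule can be sketched in base R without dplyr or stringr. This is only a minimal sketch, not part of the answer above: `pick_author` is a hypothetical helper name, the small `pubs` data frame is made up to mirror the shape of `pub1`, and the final step that turns NA into "" is an assumption based on the empty strings in the question's expected output.

```r
# Hypothetical helper (not from the answer above): same rule, base R only.
pick_author <- function(authors, position) {
  stopifnot(is.numeric(position), length(position) == 1, position != 0)
  # Split each string on commas and trim the surrounding white space
  pieces <- lapply(strsplit(authors, ",", fixed = TRUE), trimws)
  vapply(pieces, function(s) {
    len <- length(s)
    if (abs(position) >= len) {
      # Out of range, or the slot coincides with the first/last author
      if (position == 1 && len >= 1) s[1] else NA_character_
    } else if (position > 0) {
      s[position]
    } else {
      s[len + position + 1]  # negative positions count from the end
    }
  }, character(1))
}

# Small example data in the same shape as the question's `pub1`
pubs <- data.frame(
  publication = c("pub1", "pub2", "pub3"),
  authors = c("author1", "author1, author2", "author1, author2, author3"),
  stringsAsFactors = FALSE
)
pubs$author_first       <- pick_author(pubs$authors, 1)
pubs$author_second_last <- pick_author(pubs$authors, -2)
pubs$author_last        <- pick_author(pubs$authors, -1)

# Assumption: the question's expected output uses "" rather than NA
for (col in c("author_first", "author_second_last", "author_last")) {
  pubs[[col]][is.na(pubs[[col]])] <- ""
}
print(pubs)
```

Because the helper is vectorized over `authors`, no `rowwise()` step is needed here; whether that matters depends on the size of the table.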
Edit: the function now accepts any author position, and I added comments.
The only constraint is that the first and last author slots are fixed. If you ask for the third-to-last author of a publication with exactly three authors, the function returns NA, because that person already counts as the first author. Likewise, asking for the third author of a three-author publication returns NA, because that person already counts as the last author.
pub1 %>%
  rowwise() %>% # group by row
  mutate(author_third = author_position(authors, ",", 3),
         author_third_last = author_position(authors, ",", -3))
# # A tibble: 6 × 4
# # Rowwise:
#   publication authors                                              author_third author_third_last
#   <chr>       <chr>                                                <chr>        <chr>
# 1 pub1        author1                                              NA           NA
# 2 pub2        author1, author2                                     NA           NA
# 3 pub3        author1, author2, author3                            NA           NA
# 4 pub4        author1, author2, author3, author4                   author3      author2
# 5 pub5        author1, author2, author3, author4, author5          author3      author3
# 6 pub6        author1, author2, author3, author4, author5, author6 author3      author4
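To see why the first/last slots are held fixed, it may help to compare against a plain positional lookup that has no such rule. The sketch below is for contrast only: `nth_author_plain` is a hypothetical helper, not something the answer uses, and it is written in base R so it runs on its own.

```r
# Hypothetical helper for contrast only: plain positional lookup,
# without the fixed first/last rule described above.
nth_author_plain <- function(authors, position) {
  s <- trimws(strsplit(authors, ",", fixed = TRUE)[[1]])
  # Translate a negative position into an index counted from the end
  idx <- if (position > 0) position else length(s) + position + 1
  # Out-of-range indices yield NA instead of an error
  if (idx < 1 || idx > length(s)) NA_character_ else s[idx]
}

three <- "author1, author2, author3"
nth_author_plain(three, 3)   # "author3", the same person as the last author
nth_author_plain(three, -3)  # "author1", the same person as the first author
nth_author_plain(three, 4)   # NA, there is no fourth author
```

With the plain lookup, a three-author publication would report the same person in two columns, which is the duplication that `author_position` avoids by returning NA.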