R - generating new columns from a column with a variable number of delimited entries

Time:11-17

I've got a table of journal publications, and I'd like to extract the 1st, 2nd last and last authors.

Unfortunately, the number of authors varies a lot, with some having one and some as many as 35.

If a publication has one author, I expect to get just one first author. I hope to get a first and last author if there are two authors. If there are three authors, I expect a first, second last and last author and so on.

Here's the original dataset:

pub1 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4", 
        "pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3", 
        "author1, author2, author3, author4", "author1, author2, author3, author4, author5", 
        "author1, author2, author3, author4, author5, author6")), 
        class = "data.frame", row.names = c(NA, -6L))

And here's an expected output:

pub2 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4", 
        "pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3", 
        "author1, author2, author3, author4", "author1, author2, author3, author4, author5", 
        "author1, author2, author3, author4, author5, author6"),
        author_first = c("author1", "author1", "author1", "author1", "author1", "author1"),
        author_second_last = c("", ""," author2", " author3", " author4", " author5"),
        author_last = c("", " author2", " author3", " author4", " author5", " author6")),
        class = "data.frame", row.names = c(NA, -6L))

I have no idea how to go about this.

Answer:

Here's one way to do it using dplyr and stringr:

library(dplyr)
library(stringr)

author_position <- function(str, p, position) {
  stopifnot(is.numeric(position))
  # Split the string into pieces on the pattern `p` (in this case `,`)
  # and trim the surrounding white space
  s <- str_trim(str_split(str, p)[[1]])
  len <- length(s)

  if (abs(position) >= len) {
    # The position reaches the other end of the vector (or goes past it):
    # return the first author when position is 1, otherwise NA
    if (position == 1) {
      first(s)
    } else {
      NA_character_
    }
  } else {
    # Otherwise return the value at the selected position
    # (negative positions count from the end)
    nth(s, position)
  }
}

pub1 %>%
  rowwise() %>% # group by row
  mutate(author_first = author_position(authors,",",1),
         author_second_last = author_position(authors,",",-2),
         author_last = author_position(authors,",",-1))

# # A tibble: 6 × 5
# # Rowwise: 
#   publication authors                                              author_first author_second_last author_last
#   <chr>       <chr>                                                <chr>        <chr>              <chr>      
# 1 pub1        author1                                              author1      NA                 NA         
# 2 pub2        author1, author2                                     author1      NA                 author2    
# 3 pub3        author1, author2, author3                            author1      author2            author3    
# 4 pub4        author1, author2, author3, author4                   author1      author3            author4    
# 5 pub5        author1, author2, author3, author4, author5          author1      author4            author5    
# 6 pub6        author1, author2, author3, author4, author5, author6 author1      author5            author6 
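
The expected output in the question uses empty strings rather than NA for missing authors. If that matters, one option is to replace the NAs afterwards with `coalesce()`. This is only a minimal sketch: `res` is a small stand-in for the result above and `res_blank` is a made-up name, neither is part of the original answer.

```r
library(dplyr)

# Stand-in for the result above: same kind of columns, fewer rows (assumption)
res <- tibble(
  publication  = c("pub1", "pub2"),
  author_first = c("author1", "author1"),
  author_last  = c(NA, "author2")
)

# Replace missing author slots with empty strings in every author_* column
res_blank <- res %>%
  mutate(across(starts_with("author_"), ~ coalesce(.x, "")))
```

On the real result you would probably call `ungroup()` first, since the tibble is still rowwise after the `mutate()` above.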

Edit: the function now accepts any author position, and I've added comments.

The only constraint is that the first and last author slots are fixed. If you ask for the 3rd-to-last author and the publication has only 3 authors, you get NA, because that author already counts as the first author. The same goes for the 3rd author, who counts as the last author when there are only 3.

pub1 %>%
  rowwise() %>% # group by row
  mutate(author_third = author_position(authors,",",3),
         author_third_last = author_position(authors, ",", -3))


# # A tibble: 6 × 4
# # Rowwise: 
#   publication authors                                              author_third author_third_last
#   <chr>       <chr>                                                <chr>        <chr>            
# 1 pub1        author1                                              NA           NA               
# 2 pub2        author1, author2                                     NA           NA               
# 3 pub3        author1, author2, author3                            NA           NA               
# 4 pub4        author1, author2, author3, author4                   author3      author2          
# 5 pub5        author1, author2, author3, author4, author5          author3      author3          
# 6 pub6        author1, author2, author3, author4, author5, author6 author3      author4  
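
If you'd rather avoid `rowwise()`, the same rule can be applied to the whole column at once in base R. The following is a rough sketch, not a drop-in replacement: `pick_author()` is a hypothetical helper name, and it assumes the separator is a literal string rather than a regular expression.

```r
# Hypothetical helper (not from the answer above): return the author at
# `position` for every element of `authors`; negative positions count
# from the end. Uses the same first/last rule as author_position().
pick_author <- function(authors, position, sep = ",") {
  stopifnot(is.numeric(position), length(position) == 1, position != 0)
  # Split each string on the literal separator and trim white space
  parts <- lapply(strsplit(authors, sep, fixed = TRUE), trimws)
  vapply(parts, function(s) {
    n <- length(s)
    if (abs(position) >= n) {
      # Position reaches the other end: only the first author is returned
      if (position == 1 && n >= 1) s[1] else NA_character_
    } else if (position > 0) {
      s[position]
    } else {
      s[n + position + 1]
    }
  }, character(1))
}

authors <- c("author1", "author1, author2", "author1, author2, author3")
pick_author(authors, 1)   # "author1" "author1" "author1"
pick_author(authors, -2)  # NA        NA        "author2"
pick_author(authors, -1)  # NA        "author2" "author3"
```

With the data from the question, this could be used as `pub1$author_last <- pick_author(pub1$authors, -1)`.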