Format author's names column in an R data frame

Time:11-22

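This follows on from a previous question (Format author's name with stringr). I would like to apply the same formatting to a whole column in which each row holds a different string.

The previous solution doesn't work here: it collapses the names from every row into one string and repeats that string in each row.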

library(dplyr)

x <- data.frame(
  names = c("Daenerys Targaryen, George R. R. Martin, Luís Inácio Lula da Silva",
            "Hadley Alexander Wickham, Joseph J. Allaire",
            "Stack Overflow"
            )
)

format_names <- function(variable) {
  variable %>%
    strsplit(", ") %>%
    unlist() %>% 
    gsub("(.*?) (\\w+$)", "\\U\\2\\E, \\1", ., perl = TRUE) %>%
    gsub(" ([A-Z])\\w*\\.?", " \\1.", .) %>%
    paste(collapse = "; ")
}

x %>% 
  mutate(new_names = format_names(names))

#>                                                                names
#> 1 Daenerys Targaryen, George R. R. Martin, Luís Inácio Lula da Silva
#> 2                        Hadley Alexander Wickham, Joseph J. Allaire
#> 3                                                     Stack Overflow
#>                                                                                           new_names
#> 1 TARGARYEN, D.; MARTIN, G. R. R.; SILVA, L. I. L. da; WICKHAM, H. A.; ALLAIRE, J. J.; OVERFLOW, S.
#> 2 TARGARYEN, D.; MARTIN, G. R. R.; SILVA, L. I. L. da; WICKHAM, H. A.; ALLAIRE, J. J.; OVERFLOW, S.
#> 3 TARGARYEN, D.; MARTIN, G. R. R.; SILVA, L. I. L. da; WICKHAM, H. A.; ALLAIRE, J. J.; OVERFLOW, S.

Created on 2022-11-21 with reprex v2.0.2

CodePudding user response:

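You'll want to replace the unlist() with something that preserves the per-row groups. strsplit() returns a list with one element per row, and sapply() can apply the remaining steps to each element separately: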

format_names <- function(variable) {
  variable %>%
    strsplit(", ") %>%
    sapply(. %>%
      gsub("(.*?) (\\w+$)", "\\U\\2\\E, \\1", ., perl = TRUE) %>%
      gsub(" ([A-Z])\\w*\\.?", " \\1.", .) %>%
      paste(collapse = "; "))
}
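
To see why unlist() causes the problem, and what a pipe-free version might look like, here is a minimal base R sketch. It swaps sapply() for vapply(), which is not what the answer above uses, only a stricter alternative that fails loudly if a row does not produce exactly one string. The names format_one, format_names_base and names_col are made up for this illustration.

```r
# Hypothetical helper: formats the authors belonging to a single row.
format_one <- function(authors) {
  # Move the last word to the front in upper case: "A B Smith" -> "SMITH, A B"
  out <- gsub("(.*?) (\\w+$)", "\\U\\2\\E, \\1", authors, perl = TRUE)
  # Shorten every capitalised given name to its initial
  out <- gsub(" ([A-Z])\\w*\\.?", " \\1.", out)
  # Join the authors of this row only
  paste(out, collapse = "; ")
}

# Hypothetical wrapper: one formatted string per input row.
format_names_base <- function(variable) {
  parts <- strsplit(variable, ", ")  # list with one character vector per row
  vapply(parts, format_one, character(1))
}

names_col <- c("Hadley Alexander Wickham, Joseph J. Allaire", "Stack Overflow")

# The list from strsplit() still knows which names belong to which row ...
rows_kept <- lengths(strsplit(names_col, ", "))
# ... while unlist() flattens everything into one vector of names.
flat_len <- length(unlist(strsplit(names_col, ", ")))

result <- format_names_base(names_col)
print(result)
```

Because the result has one element per row, it can be used directly inside mutate() without rowwise() or grouping.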

CodePudding user response:

One workaround is to make sure you are working by row. You can use either rowwise() from dplyr or group_by(names). rowwise() treats each row as its own group, so both give the same result here, where every value in names is unique.

Solution with rowwise() from dplyr

library(dplyr)

x %>% 
   rowwise() %>% 
   mutate(new_names = format_names(names))

Output

# A tibble: 3 × 2
# Rowwise: 
  names                                                              new_names                  
  <chr>                                                              <chr>                      
1 Daenerys Targaryen, George R. R. Martin, Luís Inácio Lula da Silva TARGARYEN, D.; MARTIN, G. …
2 Hadley Alexander Wickham, Joseph J. Allaire                        WICKHAM, H. A.; ALLAIRE, J…
3 Stack Overflow                                                     OVERFLOW, S.          
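
One thing that may be worth checking before choosing group_by(names) over rowwise(): grouping puts identical values into the same group, so the function would receive all of them at once. The sketch below imitates that in base R with a pipe-free copy of the question's original function (still using unlist()). The name format_names_flat is hypothetical, and the two calls only simulate what each grouping would pass in.

```r
# Base R copy of the question's original function (it still uses unlist()).
format_names_flat <- function(variable) {
  parts <- unlist(strsplit(variable, ", "))
  parts <- gsub("(.*?) (\\w+$)", "\\U\\2\\E, \\1", parts, perl = TRUE)
  parts <- gsub(" ([A-Z])\\w*\\.?", " \\1.", parts)
  paste(parts, collapse = "; ")
}

# Simulates rowwise(): the function sees one value at a time.
one_row <- format_names_flat("Stack Overflow")

# Simulates group_by(names) when the same value appears in two rows:
# both rows reach the function together and are pasted into one string.
dup_group <- format_names_flat(c("Stack Overflow", "Stack Overflow"))

print(one_row)
print(dup_group)
```

If the column can contain repeated values, rowwise() looks like the safer of the two options.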