Pivoting wide to long conditionally for certain columns-CodePudding

I'm working with a dataset of author collaborations. Here's a subset of five articles with article ID, corresponding author (RP), corresponding author's country (Country), author full names (AF_1:AF_3), and author short names (AU_1:AU_3):

articles <- structure(list(
ArtID = 1:5,
RP = c("DE GARRIDO, L","CURSEU, PL","HENIKE, T","DI VINCENZO, F","OMIGIE, D"),
Country = c("spain", "romania", "germany", "italy", NA),
AF_1 = c("DE GARRIDO, LUIS","CURSEU, PETRU L.","STIELER, MAXIMILIAN","DI VINCENZO, FAUSTO","OMIGIE, DIANA"),
AF_2 = c(NA,"SCHRUIJER, SANDRA G. L.","HENIKE, TASSILO","IACOPINO, VALENTINA","RICCI, JESSICA"),
AF_3 = c(NA, "FODOR, OANA C.", NA, NA, NA),
AU_1 = c("DE GARRIDO L", "CURSEU PL", "STIELER M","DI VINCENZO F", "OMIGIE D"),
AU_2 = c(NA, "SCHRUIJER SGL", "HENIKE T","IACOPINO V", "RICCI J"),
AU_3 = c(NA, "FODOR OC", NA, NA, NA)),
row.names = c(NA,-5L),class = c("data.frame"))

> articles
  ArtID             RP Country                AF_1                    AF_2           AF_3          AU_1          AU_2     AU_3
1     1  DE GARRIDO, L   spain    DE GARRIDO, LUIS                    <NA>           <NA>  DE GARRIDO L          <NA>     <NA>
2     2     CURSEU, PL romania    CURSEU, PETRU L. SCHRUIJER, SANDRA G. L. FODOR, OANA C.     CURSEU PL SCHRUIJER SGL FODOR OC
3     3      HENIKE, T germany STIELER, MAXIMILIAN         HENIKE, TASSILO           <NA>     STIELER M      HENIKE T     <NA>
4     4 DI VINCENZO, F   italy DI VINCENZO, FAUSTO     IACOPINO, VALENTINA           <NA> DI VINCENZO F    IACOPINO V     <NA>
5     5      OMIGIE, D    <NA>       OMIGIE, DIANA          RICCI, JESSICA           <NA>      OMIGIE D       RICCI J     <NA>

When pivoting the data from wide (articles) to long (authors), it automatically copies other column values for all authors (i.e., country). pivot_longer code and data:

authors <- articles %>% pivot_longer(cols=starts_with(c("AF","AU")),names_to=c(".value","ArtAthID"),names_sep="_",values_drop_na=T)

> authors
# A tibble: 10 × 6
   ArtID RP             Country ArtAthID AF                      AU           
   <int> <chr>          <chr>   <chr>    <chr>                   <chr>        
 1     1 DE GARRIDO, L  spain   1        DE GARRIDO, LUIS        DE GARRIDO L 
 2     2 CURSEU, PL     romania 1        CURSEU, PETRU L.        CURSEU PL    
 3     2 CURSEU, PL     romania 2        SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
 4     2 CURSEU, PL     romania 3        FODOR, OANA C.          FODOR OC     
 5     3 HENIKE, T      germany 1        STIELER, MAXIMILIAN     STIELER M    
 6     3 HENIKE, T      germany 2        HENIKE, TASSILO         HENIKE T     
 7     4 DI VINCENZO, F italy   1        DI VINCENZO, FAUSTO     DI VINCENZO F
 8     4 DI VINCENZO, F italy   2        IACOPINO, VALENTINA     IACOPINO V   
 9     5 OMIGIE, D      NA      1        OMIGIE, DIANA           OMIGIE D     
10     5 OMIGIE, D      NA      2        RICCI, JESSICA          RICCI J

However, when pivoting, I want some columns (Country, Uni, Email) to stay with the corresponding author and be NA for the other authors. For example, Country should look like this:

> authors
# A tibble: 10 × 6
   ArtID RP             Country ArtAthID AF                      AU           
   <int> <chr>          <chr>   <chr>    <chr>                   <chr>        
 1     1 DE GARRIDO, L  spain   1        DE GARRIDO, LUIS        DE GARRIDO L 
 2     2 CURSEU, PL     romania 1        CURSEU, PETRU L.        CURSEU PL    
 3     2 CURSEU, PL     NA      2        SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
 4     2 CURSEU, PL     NA      3        FODOR, OANA C.          FODOR OC     
 5     3 HENIKE, T      NA      1        STIELER, MAXIMILIAN     STIELER M    
 6     3 HENIKE, T      germany 2        HENIKE, TASSILO         HENIKE T     
 7     4 DI VINCENZO, F italy   1        DI VINCENZO, FAUSTO     DI VINCENZO F
 8     4 DI VINCENZO, F NA      2        IACOPINO, VALENTINA     IACOPINO V   
 9     5 OMIGIE, D      NA      1        OMIGIE, DIANA           OMIGIE D     
10     5 OMIGIE, D      NA      2        RICCI, JESSICA          RICCI J

I've tried a few things (trying to use matching between RP and AU), but am hitting a wall. Any suggestions for how to do this?

Thanks for any help!

CodePudding user response：

I think the pivot is working as designed, as it is keeping the data associated with all authors. What you're asking to do is de-duplicate Country (and perhaps others) within a particular ArtID.

Try this:

authors %>%
  group_by(ArtID) %>%
  mutate(across(c(Country), ~ replace(., duplicated(.), .[NA][1]))) %>%
  ungroup()
# # A tibble: 10 x 6
#    ArtID RP             Country ArtAthID AF                      AU           
#    <int> <chr>          <chr>   <chr>    <chr>                   <chr>        
#  1     1 DE GARRIDO, L  spain   1        DE GARRIDO, LUIS        DE GARRIDO L 
#  2     2 CURSEU, PL     romania 1        CURSEU, PETRU L.        CURSEU PL    
#  3     2 CURSEU, PL     NA      2        SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
#  4     2 CURSEU, PL     NA      3        FODOR, OANA C.          FODOR OC     
#  5     3 HENIKE, T      germany 1        STIELER, MAXIMILIAN     STIELER M    
#  6     3 HENIKE, T      NA      2        HENIKE, TASSILO         HENIKE T     
#  7     4 DI VINCENZO, F italy   1        DI VINCENZO, FAUSTO     DI VINCENZO F
#  8     4 DI VINCENZO, F NA      2        IACOPINO, VALENTINA     IACOPINO V   
#  9     5 OMIGIE, D      NA      1        OMIGIE, DIANA           OMIGIE D     
# 10     5 OMIGIE, D      NA      2        RICCI, JESSICA          RICCI J

Notes:

You mentioned doing the same for Uni and Email, but those columns were not in the original data (which is fine). For this reason, I chose to use mutate(across(c(Country), ..)) instead of the more conventional mutate(Country = replace(..)): include your other columns in that c(.) vector, perhaps just mutate(across(c(Country, Uni, Email), ~ ...)).
The replace should be clear enough, but the .[NA][1] is to make sure that the replacement (of NA) is the same class of NA as the original column. There are at least six different classes of NA, and some R tools -- notably many within dplyr and tidyr -- complain when trying to combine an integer NA (NA_integer_) with a logical (NA) or string (NA_character_) or real/floating-point (NA_real_), to name a few. The use of .[NA] will always give the correct class. The addition of [1] is to get around the fact that replace(.) requires the third argument to be the same length as the number of values to replace, not necessarily the same length as the input argument x; this same-length is relaxed when recycling, so I truncate this to be always length-1.

Your desired output for ArtID=3 showed the NA before "germany", which seemed like just a typo in the sample data. If you want it matched (grepl) between RP and the other fields, then perhaps this works better (though some articles with this example clear out all countries):

authors %>%
  mutate(tmp = mapply(function(x, ...) any(grepl(x, unlist(list(...)))), RP, AF, AU), across(c(Country), ~ if_else(tmp, ., .[NA]))) %>%
  select(-tmp)
# # A tibble: 10 x 6
#    ArtID RP             Country ArtAthID AF                      AU           
#    <int> <chr>          <chr>   <chr>    <chr>                   <chr>        
#  1     1 DE GARRIDO, L  spain   1        DE GARRIDO, LUIS        DE GARRIDO L 
#  2     2 CURSEU, PL     NA      1        CURSEU, PETRU L.        CURSEU PL    
#  3     2 CURSEU, PL     NA      2        SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
#  4     2 CURSEU, PL     NA      3        FODOR, OANA C.          FODOR OC     
#  5     3 HENIKE, T      NA      1        STIELER, MAXIMILIAN     STIELER M    
#  6     3 HENIKE, T      germany 2        HENIKE, TASSILO         HENIKE T     
#  7     4 DI VINCENZO, F italy   1        DI VINCENZO, FAUSTO     DI VINCENZO F
#  8     4 DI VINCENZO, F NA      2        IACOPINO, VALENTINA     IACOPINO V   
#  9     5 OMIGIE, D      NA      1        OMIGIE, DIANA           OMIGIE D     
# 10     5 OMIGIE, D      NA      2        RICCI, JESSICA          RICCI J

This second method is flawed because of the mismatch in author name formats ("HENIKE, T" != "HENIKE T", for instance). If you can come up with a better way to match RP with the other columns, then this may be more to your liking.

CodePudding user response：

I don't think you can do what you want within pivot_longer, but I think this gives you what you need for one "additional" variable (Which is all you've given us in your test data...)

articles %>% 
  group_by(Country) %>% 
  pivot_longer(
    cols=starts_with(c("AF","AU")),
    names_to=c(".value","ArtAthID"),
    names_sep="_",
    values_drop_na=T
  ) %>% 
  mutate(Country=ifelse(row_number() == 1, Country, NA)) %>% 
  ungroup()
# A tibble: 10 × 6
# Groups:   Country [5]
   ArtID RP             Country ArtAthID AF                      AU           
   <int> <chr>          <chr>   <chr>    <chr>                   <chr>        
 1     1 DE GARRIDO, L  spain   1        DE GARRIDO, LUIS        DE GARRIDO L 
 2     2 CURSEU, PL     romania 1        CURSEU, PETRU L.        CURSEU PL    
 3     2 CURSEU, PL     NA      2        SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
 4     2 CURSEU, PL     NA      3        FODOR, OANA C.          FODOR OC     
 5     3 HENIKE, T      germany 1        STIELER, MAXIMILIAN     STIELER M    
 6     3 HENIKE, T      NA      2        HENIKE, TASSILO         HENIKE T     
 7     4 DI VINCENZO, F italy   1        DI VINCENZO, FAUSTO     DI VINCENZO F
 8     4 DI VINCENZO, F NA      2        IACOPINO, VALENTINA     IACOPINO V   
 9     5 OMIGIE, D      NA      1        OMIGIE, DIANA           OMIGIE D     
10     5 OMIGIE, D      NA      2        RICCI, JESSICA          RICCI J

if you need more than one additional variable, add each to the group_by() call and then use across() in the mutate(). Something like (untested code):

...
%>% mutate(
      across(
        c(Country, Uni, Email), 
        function(x) ifelse(row_number() == 1, x, NA)
      )
    )