I'm working with a dataset of author collaborations. Here's a subset of five articles with article ID, corresponding author (RP), corresponding author's country (Country), author full names (AF_1:AF_3), and author short names (AU_1:AU_3):
articles <- structure(list(
ArtID = 1:5,
RP = c("DE GARRIDO, L","CURSEU, PL","HENIKE, T","DI VINCENZO, F","OMIGIE, D"),
Country = c("spain", "romania", "germany", "italy", NA),
AF_1 = c("DE GARRIDO, LUIS","CURSEU, PETRU L.","STIELER, MAXIMILIAN","DI VINCENZO, FAUSTO","OMIGIE, DIANA"),
AF_2 = c(NA,"SCHRUIJER, SANDRA G. L.","HENIKE, TASSILO","IACOPINO, VALENTINA","RICCI, JESSICA"),
AF_3 = c(NA, "FODOR, OANA C.", NA, NA, NA),
AU_1 = c("DE GARRIDO L", "CURSEU PL", "STIELER M","DI VINCENZO F", "OMIGIE D"),
AU_2 = c(NA, "SCHRUIJER SGL", "HENIKE T","IACOPINO V", "RICCI J"),
AU_3 = c(NA, "FODOR OC", NA, NA, NA)),
row.names = c(NA,-5L),class = c("data.frame"))
> articles
ArtID RP Country AF_1 AF_2 AF_3 AU_1 AU_2 AU_3
1 1 DE GARRIDO, L spain DE GARRIDO, LUIS <NA> <NA> DE GARRIDO L <NA> <NA>
2 2 CURSEU, PL romania CURSEU, PETRU L. SCHRUIJER, SANDRA G. L. FODOR, OANA C. CURSEU PL SCHRUIJER SGL FODOR OC
3 3 HENIKE, T germany STIELER, MAXIMILIAN HENIKE, TASSILO <NA> STIELER M HENIKE T <NA>
4 4 DI VINCENZO, F italy DI VINCENZO, FAUSTO IACOPINO, VALENTINA <NA> DI VINCENZO F IACOPINO V <NA>
5 5 OMIGIE, D <NA> OMIGIE, DIANA RICCI, JESSICA <NA> OMIGIE D RICCI J <NA>
When pivoting the data from wide (articles) to long (authors), it automatically copies other column values for all authors (i.e., country). pivot_longer code and data:
authors <- articles %>% pivot_longer(cols=starts_with(c("AF","AU")),names_to=c(".value","ArtAthID"),names_sep="_",values_drop_na=T)
> authors
# A tibble: 10 × 6
ArtID RP Country ArtAthID AF AU
<int> <chr> <chr> <chr> <chr> <chr>
1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
3 2 CURSEU, PL romania 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
4 2 CURSEU, PL romania 3 FODOR, OANA C. FODOR OC
5 3 HENIKE, T germany 1 STIELER, MAXIMILIAN STIELER M
6 3 HENIKE, T germany 2 HENIKE, TASSILO HENIKE T
7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
8 4 DI VINCENZO, F italy 2 IACOPINO, VALENTINA IACOPINO V
9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
However, when pivoting, I want some columns (Country, Uni, Email) to stay with the corresponding author and be NA for the other authors. For example, Country should look like this:
> authors
# A tibble: 10 × 6
ArtID RP Country ArtAthID AF AU
<int> <chr> <chr> <chr> <chr> <chr>
1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
5 3 HENIKE, T NA 1 STIELER, MAXIMILIAN STIELER M
6 3 HENIKE, T germany 2 HENIKE, TASSILO HENIKE T
7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
I've tried a few things (trying to use matching between RP and AU), but am hitting a wall. Any suggestions for how to do this?
Thanks for any help!
CodePudding user response:
I think the pivot is working as designed, as it is keeping the data associated with all authors. What you're asking to do is de-duplicate Country
(and perhaps others) within a particular ArtID
.
Try this:
authors %>%
group_by(ArtID) %>%
mutate(across(c(Country), ~ replace(., duplicated(.), .[NA][1]))) %>%
ungroup()
# # A tibble: 10 x 6
# ArtID RP Country ArtAthID AF AU
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
# 2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
# 3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
# 4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
# 5 3 HENIKE, T germany 1 STIELER, MAXIMILIAN STIELER M
# 6 3 HENIKE, T NA 2 HENIKE, TASSILO HENIKE T
# 7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
# 8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
# 9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
# 10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
Notes:
- You mentioned doing the same for
Uni
andEmail
, but those columns were not in the original data (which is fine). For this reason, I chose to usemutate(across(c(Country), ..))
instead of the more conventionalmutate(Country = replace(..))
: include your other columns in thatc(.)
vector, perhaps justmutate(across(c(Country, Uni, Email), ~ ...))
. - The
replace
should be clear enough, but the.[NA][1]
is to make sure that the replacement (ofNA
) is the same class ofNA
as the original column. There are at least six different classes ofNA
, and some R tools -- notably many withindplyr
andtidyr
-- complain when trying to combine an integer NA (NA_integer_
) with a logical (NA
) or string (NA_character_
) or real/floating-point (NA_real_
), to name a few. The use of.[NA]
will always give the correct class. The addition of[1]
is to get around the fact thatreplace(.)
requires the third argument to be the same length as the number of values to replace, not necessarily the same length as the input argumentx
; this same-length is relaxed when recycling, so I truncate this to be always length-1.
Your desired output for ArtID=3
showed the NA
before "germany"
, which seemed like just a typo in the sample data. If you want it matched (grepl
) between RP
and the other fields, then perhaps this works better (though some articles with this example clear out all countries):
authors %>%
mutate(tmp = mapply(function(x, ...) any(grepl(x, unlist(list(...)))), RP, AF, AU), across(c(Country), ~ if_else(tmp, ., .[NA]))) %>%
select(-tmp)
# # A tibble: 10 x 6
# ArtID RP Country ArtAthID AF AU
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
# 2 2 CURSEU, PL NA 1 CURSEU, PETRU L. CURSEU PL
# 3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
# 4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
# 5 3 HENIKE, T NA 1 STIELER, MAXIMILIAN STIELER M
# 6 3 HENIKE, T germany 2 HENIKE, TASSILO HENIKE T
# 7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
# 8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
# 9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
# 10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
This second method is flawed because of the mismatch in author name formats ("HENIKE, T" != "HENIKE T"
, for instance). If you can come up with a better way to match RP
with the other columns, then this may be more to your liking.
CodePudding user response:
I don't think you can do what you want within pivot_longer
, but I think this gives you what you need for one "additional" variable (Which is all you've given us in your test data...)
articles %>%
group_by(Country) %>%
pivot_longer(
cols=starts_with(c("AF","AU")),
names_to=c(".value","ArtAthID"),
names_sep="_",
values_drop_na=T
) %>%
mutate(Country=ifelse(row_number() == 1, Country, NA)) %>%
ungroup()
# A tibble: 10 × 6
# Groups: Country [5]
ArtID RP Country ArtAthID AF AU
<int> <chr> <chr> <chr> <chr> <chr>
1 1 DE GARRIDO, L spain 1 DE GARRIDO, LUIS DE GARRIDO L
2 2 CURSEU, PL romania 1 CURSEU, PETRU L. CURSEU PL
3 2 CURSEU, PL NA 2 SCHRUIJER, SANDRA G. L. SCHRUIJER SGL
4 2 CURSEU, PL NA 3 FODOR, OANA C. FODOR OC
5 3 HENIKE, T germany 1 STIELER, MAXIMILIAN STIELER M
6 3 HENIKE, T NA 2 HENIKE, TASSILO HENIKE T
7 4 DI VINCENZO, F italy 1 DI VINCENZO, FAUSTO DI VINCENZO F
8 4 DI VINCENZO, F NA 2 IACOPINO, VALENTINA IACOPINO V
9 5 OMIGIE, D NA 1 OMIGIE, DIANA OMIGIE D
10 5 OMIGIE, D NA 2 RICCI, JESSICA RICCI J
if you need more than one additional variable, add each to the group_by()
call and then use across()
in the mutate()
. Something like (untested code):
...
%>% mutate(
across(
c(Country, Uni, Email),
function(x) ifelse(row_number() == 1, x, NA)
)
)