Concatenating variable number of rows in a group to create a single row and cast from long to wide f-CodePudding

I am using R tidyverse to clean data downloaded from pubmed to get author affiliations from some articles. I have gotten this far and have data that looks like this:

pubmed_id   variable_id entry
29542687    PMID    29542687
29542687    FAU Kato
29542687    FAU Gregory J
29542687    AU  Kato GJ
29542687    AD  Heart
29542687    AD  Lung and Blood Vascular Medicine Institute and the Division of
29542687    AD  Hematology-Oncology
29542687    AD  Department of Medicine
29542687    AD  University of Pittsburgh
29542687    AD  200 Lothrop
29542687    AD  Street
29542687    AD  Pittsburgh
29542687    AD  PA 15261
29542687    AD  USA.
29542687    FAU Piel
29542687    FAU Fredrich
29542687    AU  Piel FB
29542687    AD  PHE Centre for Environment and Health
29542687    AD  Department of Epidemiology and
29542687    AD  Biostatistics
29542687    AD  School of Public Health
29542687    AD  Faculty of Medicine
29542687    AD  Imperial College
29542687    AD  London
29542687    AD  London
29542687    AD  UK.
29542687    FAU Reid
29542687    FAU Clarice D
29542687    AU  Reid CD
29542687    AD  Sickle Cell Disease Branch
29542687    AD  National Heart
29542687    AD  Lung and Blood Institute
29542687    AD  NIH
29542687    AD  Bethesda
29542687    AD  
29542687    AD  MD
29542687    AD  USA.
29542687    FAU Gaston
29542687    FAU Marilyn H
29542687    AU  Gaston MH
29542687    AD  The Gaston and Porter Health Improvement Center
29542687    AD  Potomac
29542687    AD  MD
29542687    AD  USA.

I want to convert this to data that looks like this:

pubmed_id   full_author         author       address
29542687    Kato, Gregory J     Kato GJ      Heart Lung and Blood Vascular Medicine Institute and the Division of Hematology-Oncology Department of Medicine University of Pittsburgh 200 Lothrop Street Pittsburgh PA 15261 USA.
29542687    Piel, Fredrich B    Piel FB      PHE Centre for Environment and Health Department of Epidemiology and Biostatistics School of Public Health Faculty of Medicine Imperial College London London UK.
29542687    Reid, Clarice D     Reid CD      National Heart Lung and Blood Institute NIH Bethesda MD USA.
29542687    Gaston, Marilyn H   Gaston MH    The Gaston and Porter Health Improvement Center Potomac MD USA.

Initial thought was to somehow rank each author before combining consecutive rows but struggling to create a ranking that can handle a variable number of AD (address rows). The FAU row is the full author name but is listed on two lines. The pubmed_id field is used to track all the authors that are listed on a single article. Sample data:

structure(list(pubmed_id = c(29542687L, 29542687L, 29542687L, 
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 
29542687L, 29542687L, 29542687L, 29542687L, 29542687L), variable_id = c("PMID", 
"FAU", "FAU", "AU", "AD", "AD", "AD", "AD", "AD", "AD", "AD", 
"AD", "AD", "AD", "FAU", "FAU", "AU", "AD", "AD", "AD", "AD", 
"AD", "AD", "AD", "AD", "AD", "FAU", "FAU", "AU", "AD", "AD", 
"AD", "AD", "AD", "AD", "AD", "AD", "FAU", "FAU", "AU", "AD", 
"AD", "AD", "AD"), entry = c("29542687", "Kato", "Gregory J", 
"Kato GJ", "Heart", "Lung and Blood Vascular Medicine Institute and the Division of", 
"Hematology-Oncology", "Department of Medicine", "University of Pittsburgh", 
"200 Lothrop", "Street", "Pittsburgh", "PA 15261", "USA.", "Piel", 
"Fredrich", "Piel FB", "PHE Centre for Environment and Health", 
"Department of Epidemiology and", "Biostatistics", "School of Public Health", 
"Faculty of Medicine", "Imperial College", "London", "London", 
"UK.", "Reid", "Clarice D", "Reid CD", "Sickle Cell Disease Branch", 
"National Heart", "Lung and Blood Institute", "NIH", "Bethesda", 
"", "MD", "USA.", "Gaston", "Marilyn H", "Gaston MH", "The Gaston and Porter Health Improvement Center", 
"Potomac", "MD", "USA.")), class = "data.frame", row.names = c(NA, 
-44L))

Greatly appreciate any pointers on cleaning this data. Thanks!

CodePudding user response：

One option to achieve your desired result may look like so:

As your data contains multiple authors per pubmed_id I first add an identifier for the authors
After doing so we could use group_by summarise to collapse the entries for the full author name and the address into single strings where I use a ; as the separator
Now we are ready to convert to wide format
Finally replace ; in the full author column by a ,, remove it from the adress column and rename the columns

library(dplyr)
library(tidyr)

d %>%
  filter(!variable_id %in% "PMID") %>% 
  group_by(pubmed_id) %>% 
  mutate(author_id = cumsum(variable_id == "FAU" & !lag(variable_id) %in% "FAU")) %>% 
  group_by(variable_id, author_id, .add = TRUE) %>% 
  summarise(entry = paste(entry, collapse = "; "), .groups = "drop") %>% 
  pivot_wider(names_from = variable_id, values_from = entry) %>% 
  mutate(FAU = gsub("; ", ", ", FAU),
         AD = gsub("; ", " ", AD)) %>% 
  select(pubmed_id, author_id, full_author = FAU, author = AU, adress = AD)
#> # A tibble: 4 × 5
#>   pubmed_id author_id full_author       author    adress                        
#>       <int>     <int> <chr>             <chr>     <chr>                         
#> 1  29542687         1 Kato, Gregory J   Kato GJ   Heart Lung and Blood Vascular…
#> 2  29542687         2 Piel, Fredrich    Piel FB   PHE Centre for Environment an…
#> 3  29542687         3 Reid, Clarice D   Reid CD   Sickle Cell Disease Branch Na…
#> 4  29542687         4 Gaston, Marilyn H Gaston MH The Gaston and Porter Health …

CodePudding user response：

This one is ugly... but a possible solution, or at least a step in a direction to solve your data issue:

library(tidyverse)
library(data.table)

left_join(
df %>% 
  mutate(temp = data.table::rleid(variable_id == "FAU")) %>% 
  group_by(temp) %>% 
  mutate(full_author = case_when(variable_id == "FAU" ~ paste0(lag(entry), ", ", entry)),
         author = case_when(variable_id == "FAU" ~ paste0(lag(entry), " ", sapply(str_extract_all(entry, '[A-Z] '),paste0, collapse = '')))) %>% 
  slice(2), 
    
df %>% 
  mutate(temp = data.table::rleid(variable_id == "FAU")) %>% 
  group_by(temp) %>%
  filter(variable_id == "AD") %>% 
  summarise(address = paste(entry, collapse = " ")),

by = "temp"
) %>% 
  ungroup() %>% 
  fill(address, .direction = "up") %>% 
  select(pubmed_id, full_author, author, address) %>% 
  drop_na

This gives us:

# A tibble: 4 x 4
  pubmed_id full_author      author    address                                                                                                                              
      <int> <chr>            <chr>     <chr>                                                                                                                                
1  29542687 Kato, Gregory J  Kato GJ   Heart Lung and Blood Vascular Medicine Institute and the Division of Hematology-Oncology Department of Medicine University of Pittsb~
2  29542687 Piel, Fredrich   Piel F    PHE Centre for Environment and Health Department of Epidemiology and Biostatistics School of Public Health Faculty of Medicine Imper~
3  29542687 Reid, Clarice D  Reid CD   Sickle Cell Disease Branch National Heart Lung and Blood Institute NIH Bethesda  MD USA.                                             
4  29542687 Gaston, Marilyn~ Gaston MH The Gaston and Porter Health Improvement Center Potomac MD USA.

CodePudding user response：

Here is one more alternative!

library(tidyverse)
df %>% 
  mutate(variable_id1 = lead(variable_id)) %>% 
  group_by(id_Group = cumsum(variable_id=="FAU" & variable_id1=="FAU")) %>% 
  filter(id_Group!=0) %>% 
  mutate(entry = ifelse(variable_id=="FAU", paste(first(entry), first(entry,2), sep=", "),entry)) %>% 
  slice(-1) %>%
  group_by(id_Group, variable_id) %>% 
  mutate(entry = ifelse(variable_id=="AD", paste(entry, collapse = " "), entry)) %>% 
  group_by(id_Group) %>% 
  slice(1:3) %>% 
  select(-variable_id1, -id_Group) %>% 
  mutate(variable_id = case_when(variable_id == "FAU"~ "full_author",
                                 variable_id == "AU" ~ "author",
                                 TRUE ~ "address")) %>% 
  pivot_wider(
    names_from = variable_id,
    values_from = entry
    ) %>% 
  select(-id_Group)

id_Group pubmed_id full_author       author    address                                              
     <int>     <int> <chr>             <chr>     <chr>                                                
1        1  29542687 Kato, Gregory J   Kato GJ   Heart Lung and Blood Vascular Medicine Institute and~
2        2  29542687 Piel, Fredrich    Piel FB   PHE Centre for Environment and Health Department of ~
3        3  29542687 Reid, Clarice D   Reid CD   Sickle Cell Disease Branch National Heart Lung and B~
4        4  29542687 Gaston, Marilyn H Gaston MH The Gaston and Porter Health Improvement Center Poto~