I am using R tidyverse to clean data downloaded from pubmed to get author affiliations from some articles. I have gotten this far and have data that looks like this:
pubmed_id variable_id entry
29542687 PMID 29542687
29542687 FAU Kato
29542687 FAU Gregory J
29542687 AU Kato GJ
29542687 AD Heart
29542687 AD Lung and Blood Vascular Medicine Institute and the Division of
29542687 AD Hematology-Oncology
29542687 AD Department of Medicine
29542687 AD University of Pittsburgh
29542687 AD 200 Lothrop
29542687 AD Street
29542687 AD Pittsburgh
29542687 AD PA 15261
29542687 AD USA.
29542687 FAU Piel
29542687 FAU Fredrich
29542687 AU Piel FB
29542687 AD PHE Centre for Environment and Health
29542687 AD Department of Epidemiology and
29542687 AD Biostatistics
29542687 AD School of Public Health
29542687 AD Faculty of Medicine
29542687 AD Imperial College
29542687 AD London
29542687 AD London
29542687 AD UK.
29542687 FAU Reid
29542687 FAU Clarice D
29542687 AU Reid CD
29542687 AD Sickle Cell Disease Branch
29542687 AD National Heart
29542687 AD Lung and Blood Institute
29542687 AD NIH
29542687 AD Bethesda
29542687 AD
29542687 AD MD
29542687 AD USA.
29542687 FAU Gaston
29542687 FAU Marilyn H
29542687 AU Gaston MH
29542687 AD The Gaston and Porter Health Improvement Center
29542687 AD Potomac
29542687 AD MD
29542687 AD USA.
I want to convert this to data that looks like this:
pubmed_id full_author author address
29542687 Kato, Gregory J Kato GJ Heart Lung and Blood Vascular Medicine Institute and the Division of Hematology-Oncology Department of Medicine University of Pittsburgh 200 Lothrop Street Pittsburgh PA 15261 USA.
29542687 Piel, Fredrich B Piel FB PHE Centre for Environment and Health Department of Epidemiology and Biostatistics School of Public Health Faculty of Medicine Imperial College London London UK.
29542687 Reid, Clarice D Reid CD National Heart Lung and Blood Institute NIH Bethesda MD USA.
29542687 Gaston, Marilyn H Gaston MH The Gaston and Porter Health Improvement Center Potomac MD USA.
Initial thought was to somehow rank each author before combining consecutive rows but struggling to create a ranking that can handle a variable number of AD (address rows). The FAU row is the full author name but is listed on two lines. The pubmed_id field is used to track all the authors that are listed on a single article. Sample data:
structure(list(pubmed_id = c(29542687L, 29542687L, 29542687L,
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L,
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L,
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L,
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L,
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L,
29542687L, 29542687L, 29542687L, 29542687L, 29542687L, 29542687L,
29542687L, 29542687L, 29542687L, 29542687L, 29542687L), variable_id = c("PMID",
"FAU", "FAU", "AU", "AD", "AD", "AD", "AD", "AD", "AD", "AD",
"AD", "AD", "AD", "FAU", "FAU", "AU", "AD", "AD", "AD", "AD",
"AD", "AD", "AD", "AD", "AD", "FAU", "FAU", "AU", "AD", "AD",
"AD", "AD", "AD", "AD", "AD", "AD", "FAU", "FAU", "AU", "AD",
"AD", "AD", "AD"), entry = c("29542687", "Kato", "Gregory J",
"Kato GJ", "Heart", "Lung and Blood Vascular Medicine Institute and the Division of",
"Hematology-Oncology", "Department of Medicine", "University of Pittsburgh",
"200 Lothrop", "Street", "Pittsburgh", "PA 15261", "USA.", "Piel",
"Fredrich", "Piel FB", "PHE Centre for Environment and Health",
"Department of Epidemiology and", "Biostatistics", "School of Public Health",
"Faculty of Medicine", "Imperial College", "London", "London",
"UK.", "Reid", "Clarice D", "Reid CD", "Sickle Cell Disease Branch",
"National Heart", "Lung and Blood Institute", "NIH", "Bethesda",
"", "MD", "USA.", "Gaston", "Marilyn H", "Gaston MH", "The Gaston and Porter Health Improvement Center",
"Potomac", "MD", "USA.")), class = "data.frame", row.names = c(NA,
-44L))
Greatly appreciate any pointers on cleaning this data. Thanks!
CodePudding user response:
One option to achieve your desired result may look like so:
- As your data contains multiple authors per
pubmed_id
I first add an identifier for the authors - After doing so we could use group_by summarise to collapse the entries for the full author name and the address into single strings where I use a
;
as the separator - Now we are ready to convert to wide format
- Finally replace
;
in the full author column by a,
, remove it from the adress column and rename the columns
library(dplyr)
library(tidyr)
d %>%
filter(!variable_id %in% "PMID") %>%
group_by(pubmed_id) %>%
mutate(author_id = cumsum(variable_id == "FAU" & !lag(variable_id) %in% "FAU")) %>%
group_by(variable_id, author_id, .add = TRUE) %>%
summarise(entry = paste(entry, collapse = "; "), .groups = "drop") %>%
pivot_wider(names_from = variable_id, values_from = entry) %>%
mutate(FAU = gsub("; ", ", ", FAU),
AD = gsub("; ", " ", AD)) %>%
select(pubmed_id, author_id, full_author = FAU, author = AU, adress = AD)
#> # A tibble: 4 × 5
#> pubmed_id author_id full_author author adress
#> <int> <int> <chr> <chr> <chr>
#> 1 29542687 1 Kato, Gregory J Kato GJ Heart Lung and Blood Vascular…
#> 2 29542687 2 Piel, Fredrich Piel FB PHE Centre for Environment an…
#> 3 29542687 3 Reid, Clarice D Reid CD Sickle Cell Disease Branch Na…
#> 4 29542687 4 Gaston, Marilyn H Gaston MH The Gaston and Porter Health …
CodePudding user response:
This one is ugly... but a possible solution, or at least a step in a direction to solve your data issue:
library(tidyverse)
library(data.table)
left_join(
df %>%
mutate(temp = data.table::rleid(variable_id == "FAU")) %>%
group_by(temp) %>%
mutate(full_author = case_when(variable_id == "FAU" ~ paste0(lag(entry), ", ", entry)),
author = case_when(variable_id == "FAU" ~ paste0(lag(entry), " ", sapply(str_extract_all(entry, '[A-Z] '),paste0, collapse = '')))) %>%
slice(2),
df %>%
mutate(temp = data.table::rleid(variable_id == "FAU")) %>%
group_by(temp) %>%
filter(variable_id == "AD") %>%
summarise(address = paste(entry, collapse = " ")),
by = "temp"
) %>%
ungroup() %>%
fill(address, .direction = "up") %>%
select(pubmed_id, full_author, author, address) %>%
drop_na
This gives us:
# A tibble: 4 x 4
pubmed_id full_author author address
<int> <chr> <chr> <chr>
1 29542687 Kato, Gregory J Kato GJ Heart Lung and Blood Vascular Medicine Institute and the Division of Hematology-Oncology Department of Medicine University of Pittsb~
2 29542687 Piel, Fredrich Piel F PHE Centre for Environment and Health Department of Epidemiology and Biostatistics School of Public Health Faculty of Medicine Imper~
3 29542687 Reid, Clarice D Reid CD Sickle Cell Disease Branch National Heart Lung and Blood Institute NIH Bethesda MD USA.
4 29542687 Gaston, Marilyn~ Gaston MH The Gaston and Porter Health Improvement Center Potomac MD USA.
CodePudding user response:
Here is one more alternative!
library(tidyverse)
df %>%
mutate(variable_id1 = lead(variable_id)) %>%
group_by(id_Group = cumsum(variable_id=="FAU" & variable_id1=="FAU")) %>%
filter(id_Group!=0) %>%
mutate(entry = ifelse(variable_id=="FAU", paste(first(entry), first(entry,2), sep=", "),entry)) %>%
slice(-1) %>%
group_by(id_Group, variable_id) %>%
mutate(entry = ifelse(variable_id=="AD", paste(entry, collapse = " "), entry)) %>%
group_by(id_Group) %>%
slice(1:3) %>%
select(-variable_id1, -id_Group) %>%
mutate(variable_id = case_when(variable_id == "FAU"~ "full_author",
variable_id == "AU" ~ "author",
TRUE ~ "address")) %>%
pivot_wider(
names_from = variable_id,
values_from = entry
) %>%
select(-id_Group)
id_Group pubmed_id full_author author address
<int> <int> <chr> <chr> <chr>
1 1 29542687 Kato, Gregory J Kato GJ Heart Lung and Blood Vascular Medicine Institute and~
2 2 29542687 Piel, Fredrich Piel FB PHE Centre for Environment and Health Department of ~
3 3 29542687 Reid, Clarice D Reid CD Sickle Cell Disease Branch National Heart Lung and B~
4 4 29542687 Gaston, Marilyn H Gaston MH The Gaston and Porter Health Improvement Center Poto~