I have a messy csv dataset that contains several (but not all) rows that unfortunately contains multiple entries. For each row, I'd like to separate each entry out so that i can create a list of the unique values (in this case, a list of specific clinical trial sites). The multiple entries are separated by "|". To make life even more fun, I'd like to exclude any entry that isn't from the US.
I'm just having a very tough time conceptualizing how to even start. if each line only had one value I think i could work through it with base R. Maybe something from tidyverse...separate or separate_rows, or use regex to extract everything bounded by "|"?
Example data:
Locations
1
University of Pennsylvania, Philadelphia, Pennsylvania, United States | University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States | Houston Methodist Cancer Center, Houston, Texas, United States
2
Hem-Onc Associates of the Treasure Coast, Port Saint Lucie, Florida, United States | Moffitt Cancer Center, Tampa, Florida, United States | Biomira Inc.
Edmonton, Alberta, Canada
3
Massachusetts General Hospital, Boston, Massachusetts, United States
4
Moffitt Cancer Center, Tampa, Florida, United States | Sunnybrook Health Sciences Centre
Toronto, Ontario, Canada
5
Memorial Sloan Kettering Cancer Center, New York, New York, United States
6
Duke University Medical Center, Durham, North Carolina, United States
7
Moffitt Cancer Center, Tampa, Florida, United States
8
Moffitt Cancer Center, Tampa, Florida, United States | Tom Baker Cancer Centre
Calgary, Alberta, Canada
9
Houston Methodist Cancer Center, Houston, Texas, United States
10
University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States
Desired output:
University of Pennsylvania, Philadelphia, Pennsylvania, United States
University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States
Houston Methodist Cancer Center, Houston, Texas, United States
Hem-Onc Associates of the Treasure Coast, Port Saint Lucie, Florida, United States
Moffitt Cancer Center, Tampa, Florida, United States
(etc etc etc)
CodePudding user response:
Duh, turned out to be almost trivial.
df %>%
tidyr::separate_rows(Locations,sep="\\|",convert=T)
Tricky thing was escaping out the "|" symbol!
CodePudding user response:
library(stringr)
z <- c("Thing one | another thing | yet another", "different thing | other thing")
z
#> [1] "Thing one | another thing | yet another"
#> [2] "different thing | other thing"
str_split(z, "\\|") %>%
unlist
#> [1] "Thing one " " another thing " " yet another" "different thing "
#> [5] " other thing"