Home > Net >  extracting strings from a dataframe row containing multiple entries?
extracting strings from a dataframe row containing multiple entries?

Time:01-26

I have a messy csv dataset that contains several (but not all) rows that unfortunately contains multiple entries. For each row, I'd like to separate each entry out so that i can create a list of the unique values (in this case, a list of specific clinical trial sites). The multiple entries are separated by "|". To make life even more fun, I'd like to exclude any entry that isn't from the US.

I'm just having a very tough time conceptualizing how to even start. if each line only had one value I think i could work through it with base R. Maybe something from tidyverse...separate or separate_rows, or use regex to extract everything bounded by "|"?

Example data:

Locations  
1
University of Pennsylvania, Philadelphia, Pennsylvania, United States | University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States | Houston Methodist Cancer Center, Houston, Texas, United States  
2
Hem-Onc Associates of the Treasure Coast, Port Saint Lucie, Florida, United States | Moffitt Cancer Center, Tampa, Florida, United States  | Biomira Inc.
Edmonton, Alberta, Canada   
3
Massachusetts General Hospital, Boston, Massachusetts, United States  
4
Moffitt Cancer Center, Tampa, Florida, United States  | Sunnybrook Health Sciences Centre
Toronto, Ontario, Canada  
5
Memorial Sloan Kettering Cancer Center, New York, New York, United States  
6
Duke University Medical Center, Durham, North Carolina, United States  
7
Moffitt Cancer Center, Tampa, Florida, United States  
8
Moffitt Cancer Center, Tampa, Florida, United States | Tom Baker Cancer Centre
Calgary, Alberta, Canada    
9
Houston Methodist Cancer Center, Houston, Texas, United States  
10
University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States  

Desired output:

University of Pennsylvania, Philadelphia, Pennsylvania, United States   
University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States  
Houston Methodist Cancer Center, Houston, Texas, United States   
Hem-Onc Associates of the Treasure Coast, Port Saint Lucie, Florida, United States   
Moffitt Cancer Center, Tampa, Florida, United States   
(etc etc etc)

CodePudding user response:

Duh, turned out to be almost trivial.

df %>%
  tidyr::separate_rows(Locations,sep="\\|",convert=T) 

Tricky thing was escaping out the "|" symbol!

CodePudding user response:

library(stringr)
z <- c("Thing one | another thing | yet another", "different thing | other thing")
z
#> [1] "Thing one | another thing | yet another"
#> [2] "different thing | other thing"

str_split(z, "\\|") %>%
  unlist
#> [1] "Thing one "       " another thing "  " yet another"     "different thing "
#> [5] " other thing"
  • Related