Home > Software engineering >  Replace strings in one column based on two columns from another data frame
Replace strings in one column based on two columns from another data frame

Time:08-22

I have one column in a multi-column data frame, let's call that df1, that consists of some text, some example rows below:

"896 European ancestry cases, 2,455 European ancestry controls"
"591 European individuals, 687 European males"
"1,968 African American cases, 3,928 African American controls"

Then I have two columns in another data frame, let's call that df2, that look like this:

"European ancestry cases", "European"
"European ancestry controls", "European"
"European individuals", "European"
"European males", "European"
"African American cases", "African"
"African American controls", "African"

The quotation marks aren't actually there, I just used those to clarify the number of columns.

I would like to replace, in df1, all the instances of the first column of df2 with the second column of df2. In other words, I want to change df1 only, but using information in df2, which would give me back the following result:

"896 European, 2,455 European"
"591 European, 687 European"
"1,968 African, 3,928 African"

Any ideas on how to accomplish this in R?

(PS. The real data has thousands of rows for both df1 and df2, with many more variations, these example rows are a simplification of the problem)

CodePudding user response:

You could try

# 1
gsub("(European|African)([^,] )", "\\1", df1$txt)

# 2
gsub("(?<=European|African)[^,] ", "", df1$txt, perl = TRUE)

# [1] "896 European, 2,455 European"
# [2] "591 European, 687 European"  
# [3] "1,968 African, 3,928 African"

You could also use str_replace_all() from stringr to perform multiple replacements by passing a named vector (c(pattern1 = replacement1)) to it.

library(tidyverse)

df1 %>%
  mutate(txt = str_replace_all(txt, deframe(df2)))

#                            txt
# 1 896 European, 2,455 European
# 2   591 European, 687 European
# 3 1,968 African, 3,928 African

Data
df1 <- data.frame(txt = c("896 European ancestry cases, 2,455 European ancestry controls",
                          "591 European individuals, 687 European males",
                          "1,968 African American cases, 3,928 African American controls"))

df2 <- structure(list(
V1 = c("European ancestry cases", "European ancestry controls", "European individuals", "European males", "African American cases", "African American controls"),
V2 = c("European", "European", "European", "European", "African", "African")),
class = "data.frame", row.names = c(NA, -6L))

CodePudding user response:

Here is one more:

#library(dplyr)
#library(stringr)
#library(tidyr)
#library(readr)
library(tidyverse)

df1 %>% 
  mutate(id = row_number()) %>% 
  separate_rows(col1, sep = "\\, ") %>% 
  mutate(col1 = str_c(parse_number(col1), 
                      str_extract(col1, str_c(unique(df2$col2), collapse = "|")), sep = " ")) %>% 
  group_by(id) %>% 
  mutate(col1 = toString(col1)) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(-id)
 col1                       
  <chr>                      
1 896 European, 2455 European
2 591 European, 687 European 
3 1968 African, 3928 African 

data:

> dput(df1)
structure(list(col1 = c("896 European ancestry cases, 2,455 European ancestry controls", 
"591 European individuals, 687 European males", "1,968 African American cases, 3,928 African American controls"
)), class = "data.frame", row.names = c(NA, -3L))
> dput(df2)
structure(list(col1 = c("European ancestry cases", "European ancestry controls", 
"European individuals", "European males", "African American cases", 
"African American controls"), col2 = c("European", "European", 
"European", "European", "African", "African")), class = "data.frame", row.names = c(NA, 
-6L))
  • Related