Home > Software engineering >  Conditionally Concatenating Strings in R
Conditionally Concatenating Strings in R

Time:11-19

I have this dataset in R:

id = 1:5
col1 = c("12 ABC", "123", "AB", "123344567", "1345677.")
col2 = c("gggw", "12", "567", "abc 123", "p")
col3 = c("abw", "abi", "klo", "poy", "17df")
col4 = c("13 AB", "344", "Huh8", "98", "b")
    
my_data = data.frame(id, col1, col2, col3, col4)

 id      col1    col2 col3  col4
1  1    12 ABC    gggw  abw 13 AB
2  2       123      12  abi   344
3  3        AB     567  klo  Huh8
4  4 123344567 abc 123  poy    98
5  5  1345677.       p 17df     b

I then used the following code to check to see if a specific cell contains AT LEAST one number:

my_data$col1_check = grepl("\\d", my_data$col1)
my_data$col2_check = grepl("\\d", my_data$col2)
my_data$col3_check = grepl("\\d", my_data$col3)
my_data$col4_check = grepl("\\d", my_data$col4)

  id      col1    col2 col3  col4 col1_check col2_check col3_check col4_check
1  1    12 ABC    gggw  abw 13 AB       TRUE      FALSE      FALSE       TRUE
2  2       123      12  abi   344       TRUE       TRUE      FALSE       TRUE
3  3        AB     567  klo  Huh8      FALSE       TRUE      FALSE       TRUE
4  4 123344567 abc 123  poy    98       TRUE       TRUE      FALSE       TRUE
5  5  1345677.       p 17df     b       TRUE      FALSE       TRUE      FALSE

What I am trying to do, is for each row : I would like to take all columns in which the value is FALSE, and paste (with a space) the contents of these columns into a single cell.

This would look something like this:

 id  new_col
1  1 gggw abw
2  2      abi
3  3   AB klo
4  4      poy
5  5      p b

I have been trying to read about "conditional concatenation" (e.g. conditional concatenation in R), but so far nothing I have read matches the problem I am working on.

Can someone please suggest what to do from here?

Thanks!

CodePudding user response:

Here is one option in tidyverse - loop across the columns col1 to col4, get the corresponding value from the logical column by pasteing the _check on the column names (cur_column()), convert the TRUE values to NA in case_when and unite those columns to new_col

library(stringr)
library(dplyr)
library(tidyr)
 my_data %>%
   transmute(id, across(col1:col4, 
    ~ case_when(!get(str_c(cur_column(), "_check"))~ .x))) %>% 
   unite(new_col, col1:col4, sep = " ", na.rm = TRUE)

-output

 id  new_col
1  1 gggw abw
2  2      abi
3  3   AB klo
4  4      poy
5  5      p b

If we want to skip creating the _check, it will be easier as we can directly convert the elements that are not needed to NA and unite

my_data %>%
   mutate(across(col1:col4,
    ~ case_when(str_detect(.x, "\\d ", negate = TRUE)  ~.x))) %>% 
   unite(new_col, col1:col4, sep = " ", na.rm = TRUE)

-output

  id  new_col
1  1 gggw abw
2  2      abi
3  3   AB klo
4  4      poy
5  5      p b

Or using base R

cbind(my_data[1], new_col = gsub("\\s{2,}", " ", 
    trimws(do.call(paste, replace(my_data[2:5], 
     as.matrix(my_data[6:9]), '')))))

-output

 id  new_col
1  1 gggw abw
2  2      abi
3  3   AB klo
4  4      poy
5  5      p b

CodePudding user response:

A base R approach

data.frame(id = my_data$id, new_col = apply(my_data[,-1], 1, function(x) 
  paste(x[!grepl("[[:digit:]]", x)], collapse=" ")))
  id  new_col
1  1 gggw abw
2  2      abi
3  3   AB klo
4  4      poy
5  5      p b

CodePudding user response:

Starting from my_data you could use

library(dplyr)
library(tidyr)
library(stringr)

my_data %>% 
  pivot_longer(-id) %>% 
  filter(!str_detect(value, "\\d")) %>% 
  group_by(id) %>% 
  summarise(new_col = paste(value, collapse = " "))

This returns

# A tibble: 5 × 2
     id new_col 
  <int> <chr>   
1     1 gggw abw
2     2 abi     
3     3 AB klo  
4     4 poy     
5     5 p b  

CodePudding user response:

Updated improved code (thanks to @Martin Gal)

my_data %>% 
  transmute(across(-id, ~case_when(!str_detect(., '\\d') ~ .))) %>% 
  unite("New_col", col1:col4, na.rm = TRUE, sep = " ")

One more: Similar to @akrun's solution but not identical:

library(dplyr)
library(tidyr)
library(stringr)

my_data %>% 
  transmute(across(-id, ~case_when(!str_detect(., '\\d')== TRUE ~ .), .names = 'new_{col}')) %>% 
  unite(New_col, starts_with('new'), na.rm = TRUE, sep = ' ')
   New_col
1 gggw abw
2      abi
3   AB klo
4      poy
5      p b

CodePudding user response:

You can do this without using any package. It might looks very tedious, but easy to follow if you have knowledge of apply function:

data.frame(id, new_col = apply(my_data[, -1], 1, FUN = function(x) {
  paste(x[!grepl("\\d", x)], collapse = " ") }))
  
my_data

  id  new_col
1  1 gggw abw
2  2      abi
3  3   AB klo
4  4      poy
5  5      p b
  • Related