How to split column by some rules in R?-CodePudding

I need to split the column into two columns.

Here is the input data set

df_in <- data.table(
  product = c("Product 1 - 100g", "Product 2 - 150g", "Product 3", "Product 4 - keep it"),
  tags = c(NA, NA, NA, NA))

The data after " - " should be extracted into the "tags" column. But not for all rows. Sometimes there are products with "-" we need to keep. I put these strings into separate list

list_of_keep_data <- c("keep it")

So, the output data set should be the following:

df_out <- data.table(
  product = c("Product 1", "Product 2", "Product 3", "Product 4 - keep it"),
  tags = c("100g", "150g", NA, NA))

Will be glad to see possible solutions by using "str_split", "str_location" or any other functions in a tidyverse style.

Thanks!

CodePudding user response：

One option would be to use an ifelse to find any rows that do not have an item in list_of_keep_data , then replace the hyphen with something else (like ;>) and leave the keep it rows alone. Then, we can use separate by using the new delimiter (;>). This simultaneously removes the text from product and puts the other text into the tags column.

library(tidyverse)

df_in %>%
  mutate(product = ifelse(
    !str_detect(product, str_c(list_of_keep_data, collapse = "|")),
    str_replace_all(product, pattern = " - ", " ;> "),
    product
  )) %>%
  separate(product, into = c("product", "tags"),  sep = " ;> ")

Output

               product tags
1:           Product 1 100g
2:           Product 2 150g
3:           Product 3 <NA>
4: Product 4 - keep it <NA>

Another option could be to filter to the rows that you do want to separate, separate on the -, then bind the rows back to the other rows.

df_in %>% 
  filter(!str_detect(product, str_c(list_of_keep_data, collapse = "|"))) %>% 
  separate(product, into = c("product", "tags"),  sep = " - ") %>% 
  bind_rows(filter(df_in, str_detect(product, str_c(list_of_keep_data, collapse = "|"))))

Or here is another option using data.table:

library(data.table)

df_in[, tags := as.character(tags)
      ][!str_detect(product, str_c(list_of_keep_data, collapse = "|")),
                             c("product", "tags") := tstrsplit(product, " - ")][]

CodePudding user response：

Using tidyr's function extract allows for a much simpler solution:

df_in %>%
  extract(product,
          into = c("product", "tags"),
          regex = "(Product\\s\\d )( - \\d g)?") %>%
  mutate(tags = sub("- ", "", tags),
         tags = na_if(tags, ""))
     product    tags
1: Product 1    100g
2: Product 2    150g
3: Product 3        
4: Product 4

If you need to have NA instead of blank cells, just add:

%>% mutate(tags = na_if(tags, ""))