I need to split the column into two columns.
Here is the input data set
df_in <- data.table(
product = c("Product 1 - 100g", "Product 2 - 150g", "Product 3", "Product 4 - keep it"),
tags = c(NA, NA, NA, NA))
The data after " - " should be extracted into the "tags" column. But not for all rows. Sometimes there are products with "-" we need to keep. I put these strings into separate list
list_of_keep_data <- c("keep it")
So, the output data set should be the following:
df_out <- data.table(
product = c("Product 1", "Product 2", "Product 3", "Product 4 - keep it"),
tags = c("100g", "150g", NA, NA))
Will be glad to see possible solutions by using "str_split", "str_location" or any other functions in a tidyverse style.
Thanks!
CodePudding user response:
One option would be to use an ifelse
to find any rows that do not have an item in list_of_keep_data
, then replace the hyphen with something else (like ;>
) and leave the keep it rows alone. Then, we can use separate
by using the new delimiter (;>
). This simultaneously removes the text from product
and puts the other text into the tags
column.
library(tidyverse)
df_in %>%
mutate(product = ifelse(
!str_detect(product, str_c(list_of_keep_data, collapse = "|")),
str_replace_all(product, pattern = " - ", " ;> "),
product
)) %>%
separate(product, into = c("product", "tags"), sep = " ;> ")
Output
product tags
1: Product 1 100g
2: Product 2 150g
3: Product 3 <NA>
4: Product 4 - keep it <NA>
Another option could be to filter
to the rows that you do want to separate, separate on the -
, then bind the rows back to the other rows.
df_in %>%
filter(!str_detect(product, str_c(list_of_keep_data, collapse = "|"))) %>%
separate(product, into = c("product", "tags"), sep = " - ") %>%
bind_rows(filter(df_in, str_detect(product, str_c(list_of_keep_data, collapse = "|"))))
Or here is another option using data.table
:
library(data.table)
df_in[, tags := as.character(tags)
][!str_detect(product, str_c(list_of_keep_data, collapse = "|")),
c("product", "tags") := tstrsplit(product, " - ")][]
CodePudding user response:
Using tidyr
's function extract
allows for a much simpler solution:
df_in %>%
extract(product,
into = c("product", "tags"),
regex = "(Product\\s\\d )( - \\d g)?") %>%
mutate(tags = sub("- ", "", tags),
tags = na_if(tags, ""))
product tags
1: Product 1 100g
2: Product 2 150g
3: Product 3
4: Product 4
If you need to have NA
instead of blank cells, just add:
%>% mutate(tags = na_if(tags, ""))