Stringr pattern matching: Is there a way to identify multiple product descriptions?-CodePudding

In my dataset, i have product descriptions that appear as:

Product A, Product A, Product A

and in other rows as

Product A, Product B, Product A, Product B

and in some rows, as just

Product A

Initially, my dataset had strings in the format:

Product A, Product B, Product A, Product B, Product A, Product B

and

Product A, Product A, Product A

Since I wanted just one instance of each product, I resolved this issue by using the following code:

df$lengths <- str_length(df$items)

df$new_items <- str_sub(df$items, 1, df$lengths/3)

Is there a way to solve the above problem by modifying this code?

df <-
structure(list(Product_name = c("Samsung Galaxy A03s (4 64), Samsung Galaxy A03s (4 64)", 
"Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32)", 
"Samsung A32 (6 128), Samsung A32 (6 128), Samsung A32 (6 128)", 
"samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s(3 32)", 
"Xiaomi Redmi 10 (6 128), Xiaomi Redmi 10 (6 128)", "Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128)"
)), class = "data.frame", row.names = c(NA, -6L))

CodePudding user response：

EDIT:

If the comma-separated strings do not always contain identical elements, more complex solutions are in order:

Data:

Product_name = c("Samsung Galaxy A03s (4 64), Samsung Galaxy A03s (3 32)", "Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (4 64)", "Samsung A32 (6 128), Samsung A32 (6 128), Samsung A32 (6 128)", "samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32)", "Xiaomi Redmi 10 (6 128), Xiaomi Redmi 10 (6 128)", "Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128)")

Solution 1: A regex solution based on negative character class, negative lookahead, and backreference -- basically, a one-liner:

library(dplyr)
library(stringr)
data.frame(Product_name) %>%
  mutate(Product_name = str_extract_all(Product_name, "([^,] )(?!.*\\1)"))
                                              Product_name
1  Samsung Galaxy A03s (4 64),  Samsung Galaxy A03s (3 32)
2  Samsung Galaxy A03s (3 32),  Samsung Galaxy A03s (4 64)
3                                      Samsung A32 (6 128)
4                                      samsung A02s (3 32)
5                                  Xiaomi Redmi 10 (6 128)
6                                Redmi Note 10 Pro (6 128)

Solution 2: Based on tidyrfunctionality

library(tidyr)
library(dplyr)
data.frame(Product_name) %>%
  # create identifier:
  mutate(row = row_number()) %>%
  # separate rows into individual elements:
  separate_rows(Product_name, sep = ", ") %>%
  group_by(row) %>%
  # remove duplicated elements:
  filter(!duplicated(Product_name)) %>%
  # put distinct elements back into the same row:
  summarise(Product_name = toString(Product_name))
# A tibble: 6 x 2
    row Product_name                                          
  <int> <chr>                                                 
1     1 Samsung Galaxy A03s (4 64), Samsung Galaxy A03s (3 32)
2     2 Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (4 64)
3     3 Samsung A32 (6 128)                                   
4     4 samsung A02s (3 32)                                   
5     5 Xiaomi Redmi 10 (6 128)                               
6     6 Redmi Note 10 Pro (6 128)

Before edit:

This solution is based on the assumption that the comma-separated elements in the strings are always identical:

library(stringr)
str_extract(Product_name, "[^,] ")
[1] "Samsung Galaxy A03s (4 64)" "Samsung Galaxy A03s (3 32)"
[3] "Samsung A32 (6 128)"        "samsung A02s (3 32)"       
[5] "Xiaomi Redmi 10 (6 128)"    "Redmi Note 10 Pro (6 128)"

Data:

Product_name = c("Samsung Galaxy A03s (4 64), Samsung Galaxy A03s (4 64)", "Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32)", "Samsung A32 (6 128), Samsung A32 (6 128), Samsung A32 (6 128)", "samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32)", "Xiaomi Redmi 10 (6 128), Xiaomi Redmi 10 (6 128)", "Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128)")