In my dataset, i have product descriptions that appear as:
- Product A, Product A, Product A
and in other rows as
- Product A, Product B, Product A, Product B
and in some rows, as just
- Product A
Initially, my dataset had strings in the format:
- Product A, Product B, Product A, Product B, Product A, Product B
and
- Product A, Product A, Product A
Since I wanted just one instance of each product, I resolved this issue by using the following code:
df$lengths <- str_length(df$items)
df$new_items <- str_sub(df$items, 1, df$lengths/3)
Is there a way to solve the above problem by modifying this code?
df <-
structure(list(Product_name = c("Samsung Galaxy A03s (4 64), Samsung Galaxy A03s (4 64)",
"Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32)",
"Samsung A32 (6 128), Samsung A32 (6 128), Samsung A32 (6 128)",
"samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s(3 32)",
"Xiaomi Redmi 10 (6 128), Xiaomi Redmi 10 (6 128)", "Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128)"
)), class = "data.frame", row.names = c(NA, -6L))
CodePudding user response:
EDIT:
If the comma-separated strings do not always contain identical elements, more complex solutions are in order:
Data:
Product_name = c("Samsung Galaxy A03s (4 64), Samsung Galaxy A03s (3 32)", "Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (4 64)", "Samsung A32 (6 128), Samsung A32 (6 128), Samsung A32 (6 128)", "samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32)", "Xiaomi Redmi 10 (6 128), Xiaomi Redmi 10 (6 128)", "Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128)")
Solution 1: A regex solution based on negative character class, negative lookahead, and backreference -- basically, a one-liner:
library(dplyr)
library(stringr)
data.frame(Product_name) %>%
mutate(Product_name = str_extract_all(Product_name, "([^,] )(?!.*\\1)"))
Product_name
1 Samsung Galaxy A03s (4 64), Samsung Galaxy A03s (3 32)
2 Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (4 64)
3 Samsung A32 (6 128)
4 samsung A02s (3 32)
5 Xiaomi Redmi 10 (6 128)
6 Redmi Note 10 Pro (6 128)
Solution 2: Based on tidyr
functionality
library(tidyr)
library(dplyr)
data.frame(Product_name) %>%
# create identifier:
mutate(row = row_number()) %>%
# separate rows into individual elements:
separate_rows(Product_name, sep = ", ") %>%
group_by(row) %>%
# remove duplicated elements:
filter(!duplicated(Product_name)) %>%
# put distinct elements back into the same row:
summarise(Product_name = toString(Product_name))
# A tibble: 6 x 2
row Product_name
<int> <chr>
1 1 Samsung Galaxy A03s (4 64), Samsung Galaxy A03s (3 32)
2 2 Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (4 64)
3 3 Samsung A32 (6 128)
4 4 samsung A02s (3 32)
5 5 Xiaomi Redmi 10 (6 128)
6 6 Redmi Note 10 Pro (6 128)
Before edit:
This solution is based on the assumption that the comma-separated elements in the strings are always identical:
library(stringr)
str_extract(Product_name, "[^,] ")
[1] "Samsung Galaxy A03s (4 64)" "Samsung Galaxy A03s (3 32)"
[3] "Samsung A32 (6 128)" "samsung A02s (3 32)"
[5] "Xiaomi Redmi 10 (6 128)" "Redmi Note 10 Pro (6 128)"
Data:
Product_name = c("Samsung Galaxy A03s (4 64), Samsung Galaxy A03s (4 64)", "Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32), Samsung Galaxy A03s (3 32)", "Samsung A32 (6 128), Samsung A32 (6 128), Samsung A32 (6 128)", "samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32), samsung A02s (3 32)", "Xiaomi Redmi 10 (6 128), Xiaomi Redmi 10 (6 128)", "Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128), Redmi Note 10 Pro (6 128)")