how to group by unique part of text rows in r-CodePudding

I have a challenge. I would want to group brands of products by amount purchased. The problem is that a lot of brands in my dataset have "double" names with padding's "Bio" and "Non Bio" like this:

Amount purchased Brand
2 Bakoma Bio
1 Bakoma Non Bio
2 Dorritos Bio
3 Dorritos Non Bio

I would want to create a function that will group brands by its name no matter if they have an adding Bio or not bio.

Example output:
Amount purchased Brand
3 Bakoma
5 Dorritos

Thank you for help :)

CodePudding user response：

A possible solution:

library(tidyverse)

df <- data.frame(
  qtd = c(2L, 1L, 2L, 3L),
  product = c("Bakoma Bio","Bakoma Non Bio",
              "Dorritos Bio","Dorritos Non Bio")
)

df %>% 
  group_by(product = str_remove(product, "\\s (Bio|Non Bio)")) %>% 
  summarise(qtd = sum(qtd))

#> # A tibble: 2 × 2
#>   product    qtd
#>   <chr>    <int>
#> 1 Bakoma       3
#> 2 Dorritos     5

CodePudding user response：

A base R option

aggregate(qtd ~ product, transform(df, product = trimws(product, 
   whitespace = "\\s (Bio|Non Bio).*")), sum)
   product qtd
1   Bakoma   3
2 Dorritos   5

CodePudding user response：

Here is an alternative approach based on the pattern in your data is constant (e.g brand is first word): data from @PaulS (many thanks):

library(dplyr)
library(stringr)
df %>% 
  group_by(brand = word(product, 1)) %>% 
  summarise(Sum = sum(qtd))

  brand      Sum
  <chr>    <int>
1 Bakoma       3
2 Dorritos     5