R filtering itemsets to include only the less complicated baskets

Time:01-06

I have this dataframe:

structure(list(CATEGORY = c("Edible, Vape", "Concentrate, Flower", 
"Concentrate, Flower", "Concentrate, Flower", "Edible", "Concentrate, Flower", 
"Edible, Vape", "Edible", "Concentrate, Flower", "Concentrate, Flower", 
"Edible", "Edible", "Edible", "Concentrate, Flower", "Edible", 
"Edible", "Edible", "Edible, Vape", "Edible", "Edible", "Concentrate, Flower", 
"Edible", "Concentrate, Flower", "Concentrate, Flower", "Concentrate, Flower", 
"Edible", "Concentrate, Flower", "Concentrate, Edible, Flower", 
"Concentrate, Flower", "Edible", "Concentrate, Edible, Flower", 
"Edible", "Concentrate, Edible, Flower", "Concentrate, Edible, Flower, Vape", 
"Concentrate, Edible, Flower", "Concentrate, Flower", "Edible", 
"Edible", "Edible", "Concentrate, Edible, Flower, Vape", "Concentrate, Flower", 
"Concentrate, Flower", "Edible", "Concentrate, Flower", "Concentrate, Flower", 
"Concentrate, Flower", "Concentrate, Flower", "Concentrate, Flower", 
"Concentrate, Flower", "Edible, Vape", "Concentrate, Flower", 
"Edible, Vape", "Concentrate, Edible, Flower", "Edible, Vape", 
"Concentrate, Flower", "Edible", "Concentrate, Flower", "Concentrate, Flower", 
"Edible", "Concentrate, Flower", "Edible, Vape", "Edible", "Concentrate, Edible, Flower, Vape", 
"Edible", "Edible", "Concentrate, Flower", "Concentrate, Flower", 
"Edible, Vape", "Concentrate, Flower", "Edible", "Edible", "Edible, Vape", 
"Edible", "Edible", "Edible", "Concentrate, Flower", "Edible", 
"Edible", "Concentrate, Flower", "Edible, Vape", "Concentrate, Flower", 
"Edible", "Edible", "Edible", "Edible", "Concentrate, Flower", 
"Edible, Vape", "Edible", "Concentrate, Flower", "Edible, Vape", 
"Concentrate, Flower", "Concentrate, Flower", "Concentrate, Flower", 
"Concentrate, Flower", "Edible", "Edible", "Edible", "Edible, Vape", 
"Concentrate, Flower", "Edible")), row.names = c(NA, -100L), class = c("tbl_df", 
"tbl", "data.frame"))


Some entries in the CATEGORY column contain a single item, while others contain two, three or four. (Some are larger still; this is just a sample of a bigger data frame.)

How can I filter the data frame to keep only the rows whose CATEGORY contains no more than a chosen number of items (say, two or three)?

For example, if I type this:

unique(interesting_baskets_df$CATEGORY)

I see these categories.

[1] "Edible, Vape"                      "Concentrate, Flower"               "Edible"                            "Concentrate, Edible, Flower"      
[5] "Concentrate, Edible, Flower, Vape"

But I only want to include "Edible, Vape" or "Concentrate, Flower" or "Edible".

I know in this case I could input a specific filter in dplyr with a set of items, but my dataset is much larger and I would need a more flexible solution. I would appreciate something that would be flexible in choosing the number of items, two or three or four, since I don't exactly know what will be most useful in association rule learning.

CodePudding user response:

We may count the number of words with str_count and build a logical expression from that count (< 3) so that filter keeps only the rows where 'CATEGORY' has fewer than 3 words:

library(dplyr)
library(stringr)
df1 %>% 
   filter(str_count(CATEGORY, "\\w+") < 3)

-output

# A tibble: 92 × 1
   CATEGORY           
   <chr>              
 1 Edible, Vape       
 2 Concentrate, Flower
 3 Concentrate, Flower
 4 Concentrate, Flower
 5 Edible             
 6 Concentrate, Flower
 7 Edible, Vape       
 8 Edible             
 9 Concentrate, Flower
10 Concentrate, Flower
# … with 82 more rows
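
Since the question asks for a flexible cutoff, the same idea can be wrapped in a small function. This is only a sketch: filter_by_item_count and the baskets data frame are hypothetical names made up for illustration, not part of dplyr or stringr. Note that "\\w+" counts words, so an item name made of two words would be counted as two items.

```r
library(dplyr)
library(stringr)

# Hypothetical helper (not part of dplyr or stringr): keep the rows whose
# CATEGORY contains at most `max_items` words.
filter_by_item_count <- function(data, max_items) {
  data %>%
    filter(str_count(CATEGORY, "\\w+") <= max_items)
}

# Small illustrative data frame in the same shape as the question's data.
baskets <- tibble(CATEGORY = c("Edible",
                               "Edible, Vape",
                               "Concentrate, Edible, Flower",
                               "Concentrate, Edible, Flower, Vape"))

filter_by_item_count(baskets, 2)  # keeps "Edible" and "Edible, Vape"
filter_by_item_count(baskets, 3)  # also keeps "Concentrate, Edible, Flower"
```

Changing the second argument is then enough to try baskets of up to two, three or four items.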

CodePudding user response:

Another option is to count the number of commas, add 1 to get the number of items, and keep the rows where that is less than 3:

library(stringr)
library(dplyr)
filter(df, str_count(CATEGORY, ",") + 1 < 3)
#> # A tibble: 92 × 1
#>    CATEGORY           
#>    <chr>              
#>  1 Edible, Vape       
#>  2 Concentrate, Flower
#>  3 Concentrate, Flower
#>  4 Concentrate, Flower
#>  5 Edible             
#>  6 Concentrate, Flower
#>  7 Edible, Vape       
#>  8 Edible             
#>  9 Concentrate, Flower
#> 10 Concentrate, Flower
#> # … with 82 more rows

Created on 2023-01-05 with reprex v2.0.2
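
Counting separators rather than words also copes with item names that contain a space. If both a lower and an upper bound are wanted, something along these lines might work. It is a sketch under assumptions: the baskets data, the made-up item "Pre Roll", and the min_items / max_items variables are all illustrative and not from the question.

```r
library(dplyr)
library(stringr)

# Illustrative data; "Pre Roll" is a made-up two-word item name.
baskets <- tibble(CATEGORY = c("Edible",
                               "Pre Roll, Vape",
                               "Concentrate, Edible, Flower",
                               "Concentrate, Edible, Flower, Vape"))

# Assumed bounds on basket size; adjust as needed.
min_items <- 2
max_items <- 3

# Items per basket = number of commas + 1, then keep sizes inside the bounds.
result <- baskets %>%
  mutate(n_items = str_count(CATEGORY, fixed(",")) + 1) %>%
  filter(between(n_items, min_items, max_items))

result
```

Keeping n_items as a column makes it easy to check which basket sizes survived the filter.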

CodePudding user response:

With the regex \W+ (one or more non-word characters) and strsplit, you can split each entry into words and filter by however many words you want. The example below keeps entries with fewer than three words.

With R base:

df[lengths(strsplit(df$CATEGORY, "\\W+")) < 3, ]

Or dplyr:

library(dplyr)
df %>% filter(lengths(strsplit(CATEGORY, "\\W+")) < 3)
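
To help decide on a cutoff, it may be useful to tabulate the basket sizes first. The sketch below uses base R only; the small df and the max_items value are illustrative assumptions, and the split pattern here is a comma followed by optional whitespace rather than \W+.

```r
# Illustrative data in the same shape as the question's data frame.
df <- data.frame(CATEGORY = c("Edible",
                              "Edible, Vape",
                              "Edible, Vape",
                              "Concentrate, Edible, Flower"),
                 stringsAsFactors = FALSE)

# Number of items per basket, splitting on a comma plus optional whitespace.
n_items <- lengths(strsplit(df$CATEGORY, ",\\s*"))

# How many baskets there are of each size.
size_counts <- table(n_items)
size_counts

# Keep baskets with at most `max_items` items (assumed cutoff).
# drop = FALSE keeps the result a data frame even with a single column.
max_items <- 2
small_baskets <- df[n_items <= max_items, , drop = FALSE]
small_baskets
```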