I have this dataframe:
structure(list(CATEGORY = c("Edible, Vape", "Concentrate, Flower",
"Concentrate, Flower", "Concentrate, Flower", "Edible", "Concentrate, Flower",
"Edible, Vape", "Edible", "Concentrate, Flower", "Concentrate, Flower",
"Edible", "Edible", "Edible", "Concentrate, Flower", "Edible",
"Edible", "Edible", "Edible, Vape", "Edible", "Edible", "Concentrate, Flower",
"Edible", "Concentrate, Flower", "Concentrate, Flower", "Concentrate, Flower",
"Edible", "Concentrate, Flower", "Concentrate, Edible, Flower",
"Concentrate, Flower", "Edible", "Concentrate, Edible, Flower",
"Edible", "Concentrate, Edible, Flower", "Concentrate, Edible, Flower, Vape",
"Concentrate, Edible, Flower", "Concentrate, Flower", "Edible",
"Edible", "Edible", "Concentrate, Edible, Flower, Vape", "Concentrate, Flower",
"Concentrate, Flower", "Edible", "Concentrate, Flower", "Concentrate, Flower",
"Concentrate, Flower", "Concentrate, Flower", "Concentrate, Flower",
"Concentrate, Flower", "Edible, Vape", "Concentrate, Flower",
"Edible, Vape", "Concentrate, Edible, Flower", "Edible, Vape",
"Concentrate, Flower", "Edible", "Concentrate, Flower", "Concentrate, Flower",
"Edible", "Concentrate, Flower", "Edible, Vape", "Edible", "Concentrate, Edible, Flower, Vape",
"Edible", "Edible", "Concentrate, Flower", "Concentrate, Flower",
"Edible, Vape", "Concentrate, Flower", "Edible", "Edible", "Edible, Vape",
"Edible", "Edible", "Edible", "Concentrate, Flower", "Edible",
"Edible", "Concentrate, Flower", "Edible, Vape", "Concentrate, Flower",
"Edible", "Edible", "Edible", "Edible", "Concentrate, Flower",
"Edible, Vape", "Edible", "Concentrate, Flower", "Edible, Vape",
"Concentrate, Flower", "Concentrate, Flower", "Concentrate, Flower",
"Concentrate, Flower", "Edible", "Edible", "Edible", "Edible, Vape",
"Concentrate, Flower", "Edible")), row.names = c(NA, -100L), class = c("tbl_df",
"tbl", "data.frame"))
Some of the values in the CATEGORY column contain only one item, and some contain two, three or four. (Some contain more; this is just a section of a bigger data frame.)
How can I filter the data frame to keep only the rows whose CATEGORY has no more than a chosen number of items, for example two or three?
For example, if I type this:
unique(interesting_baskets_df$CATEGORY)
I see these categories.
[1] "Edible, Vape" "Concentrate, Flower" "Edible" "Concentrate, Edible, Flower"
[5] "Concentrate, Edible, Flower, Vape"
But I only want to include "Edible, Vape" or "Concentrate, Flower" or "Edible".
I know that in this case I could write a specific filter in dplyr with a fixed set of values, but my dataset is much larger and I need a more flexible solution. Ideally I could choose the number of items (two, three or four), since I don't yet know what will be most useful for association rule learning.
CodePudding user response:
We can count the number of words with str_count and build a logical expression on that count (< 3) to filter only the rows whose 'CATEGORY' has fewer than 3 words:
library(dplyr)
library(stringr)
df1 %>%
  filter(str_count(CATEGORY, "\\w+") < 3)
-output
# A tibble: 92 × 1
CATEGORY
<chr>
1 Edible, Vape
2 Concentrate, Flower
3 Concentrate, Flower
4 Concentrate, Flower
5 Edible
6 Concentrate, Flower
7 Edible, Vape
8 Edible
9 Concentrate, Flower
10 Concentrate, Flower
# … with 82 more rows
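Since the question asks for a flexible number of items, here is a minimal sketch of the same word-count idea with a lower and an upper bound. The names min_items and max_items are made up for illustration, and the small tibble is only a stand-in built from the categories shown in the question.

```r
library(dplyr)
library(stringr)

# Stand-in data: the five distinct categories listed in the question
df1 <- tibble(CATEGORY = c("Edible", "Edible, Vape", "Concentrate, Flower",
                           "Concentrate, Edible, Flower",
                           "Concentrate, Edible, Flower, Vape"))

# Hypothetical bounds; change these to try other basket sizes
min_items <- 2
max_items <- 3

# Keep rows whose word count lies in [min_items, max_items]
res <- df1 %>%
  filter(between(str_count(CATEGORY, "\\w+"), min_items, max_items))

res$CATEGORY
```

This counts words, so it assumes every category name is a single word.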
CodePudding user response:
Another option is to count the number of commas, add 1 to get the number of items, and keep the rows where that number is less than 3, like this:
library(stringr)
library(dplyr)
filter(df, str_count(CATEGORY, ",") + 1 < 3)
#> # A tibble: 92 × 1
#> CATEGORY
#> <chr>
#> 1 Edible, Vape
#> 2 Concentrate, Flower
#> 3 Concentrate, Flower
#> 4 Concentrate, Flower
#> 5 Edible
#> 6 Concentrate, Flower
#> 7 Edible, Vape
#> 8 Edible
#> 9 Concentrate, Flower
#> 10 Concentrate, Flower
#> # … with 82 more rows
Created on 2023-01-05 with reprex v2.0.2
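Counting commas and counting words give the same result on the data shown, but they can differ if a category name contains a space. A small sketch of that difference; note that "Pre Roll" is a made-up category that does not appear in the question's data.

```r
library(stringr)

# "Pre Roll" is a hypothetical two-word category, used only for illustration
x <- c("Edible, Vape", "Pre Roll, Vape")

# Counting runs of word characters treats "Pre Roll" as two items
word_counts <- str_count(x, "\\w+")

# Counting separators and adding 1 treats it as one item
comma_counts <- str_count(x, ",") + 1

word_counts
comma_counts
```

If the real data may contain multi-word names, counting the separator is probably the safer choice.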
CodePudding user response:
With the regex \\W+ (one or more non-word characters) and strsplit you can split each value into words and filter your data by the number of words you want. The next example keeps rows with fewer than three words.
With R base:
df[lengths(strsplit(df$CATEGORY, "\\W+")) < 3, ]
Or dplyr:
library(dplyr)
df %>% filter(lengths(strsplit(CATEGORY, "\\W+")) < 3)
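To make the number of items easy to change, the base R version could be wrapped in a small function. This is only a sketch: filter_by_item_count and its arguments are hypothetical names, it splits on commas rather than on non-word characters, and the data frame below is a stand-in built from the categories in the question.

```r
# Hypothetical helper: keep rows whose column holds between
# min_items and max_items comma-separated entries.
filter_by_item_count <- function(data, min_items = 1, max_items = Inf,
                                 column = "CATEGORY") {
  # Split each value on literal commas and count the pieces
  pieces <- strsplit(as.character(data[[column]]), ",", fixed = TRUE)
  n_items <- lengths(pieces)
  data[n_items >= min_items & n_items <= max_items, , drop = FALSE]
}

# Stand-in data: the five distinct categories listed in the question
df <- data.frame(CATEGORY = c("Edible", "Edible, Vape", "Concentrate, Flower",
                              "Concentrate, Edible, Flower",
                              "Concentrate, Edible, Flower, Vape"))

# At most two items, as in the question's example
small <- filter_by_item_count(df, max_items = 2)

# Exactly two or three items
mid <- filter_by_item_count(df, min_items = 2, max_items = 3)
```

Here small keeps "Edible", "Edible, Vape" and "Concentrate, Flower", which matches the three categories the question wants to retain.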