How to count observations matching the values of a vector of characters

I have a dataframe with numerous observations and different types of variables. Here's a sample of my dataframe:

mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product", 
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet", 
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate", 
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2, 
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket", 
"Supermarket", "Supermarket", "Little Store", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Gas Station", 
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA, 
-16L), class = "data.frame")

# of observation	Product	Price in $	Place
1	Pizza	2	Supermarket
2	Cleaning Product	3.5	Supermarket
3	Chocolate	1	Supermarket
4	Fruit	1	Little Store
5	Red Meat	2.5	Supermarket
6	Cleaning Product	3.5	Supermarket
7	Bracelet	3	Little Store
8	Trucker Hat	5	Gas Station
9	Shirt	15	Supermarket
10	Shirt	20	Supermarket
11	Chicken Breast	2.5	Little Store
12	Chocolate	1	Gas Station
13	Cereal	2	Gas Station
14	Fruit	1	Little Store
15	Cleaning Product	3.5	Supermarket
16	Trucker Hat	4	Supermarket

I also have a vector of characters:

non.food <- c("Cleaning", "Hat", "Shirt", "Bracelet")

I have to eliminate observations that match any of the words from the vector non.food. For this I use the following code:

library(dplyr)
library(stringr)

non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = '|')
mydf <- mydf %>%
  filter(!str_detect(Product, non.food))

It works pretty well, but I have the impression that I lose more observations than I should. For instance, looking at the sample, I should lose 8 observations, but in reality I end up losing 10. (I can't show this with the sample: my real data has 8916 observations, and the sample only illustrates the kind of dataframe I'm dealing with.)

So, I would like to first count the number of observations that match any of the words inside the vector, to be sure that my code doesn't eliminate more observations than it should. I cannot use commands such as which(mydf$Product == non.food) or sum(mydf$Product == non.food). I could do the inverse of my code and keep only the observations that match my strings to verify, but that takes more time and creates more data that I don't want. Does anybody have an idea?

Also, if my code is in fact eliminating more observations than it should, does somebody have a solution?

Thank you in advance.

CodePudding user response:

You could add a count variable that flags the rows to be deleted using case_when, and then sum it to get the number of deleted rows, e.g.

    library(tidyverse)
    df <- tribble(
      ~"# of observation", ~Product, ~"Price in $", ~Place,
      1, "Pizza", 2, "Supermarket",
      2, "Cleaning Product", 3.5, "Supermarket",
      3, "Chocolate", 1, "Supermarket",
      4, "Fruit", 1, "Little Store",
      5, "Red Meat", 2.5, "Supermarket",
      6, "Cleaning Product", 3.5, "Supermarket",
      7, "Bracelet", 3, "Little Store",
      8, "Trucker Hat", 5, "Gas Station",
      9, "Shirt", 15, "Supermarket",
      10, "Shirt", 20, "Supermarket",
      11, "Chicken Breast", 2.5, "Little Store",
      12, "Chocolate", 1, "Gas Station",
      13, "Cereal", 2, "Gas Station",
      14, "Fruit", 1, "Little Store",
      15, "Cleaning Product", 3.5, "Supermarket",
      16, "Trucker Hat", 4, "Supermarket"
    )

    non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")
    mydf <- df %>%
      mutate(count = case_when(
        str_detect(Product, non.food) ~ 1,
        TRUE ~ 0
      )) %>%
      mutate(sum_deleted = sum(count)) %>% 
      filter(!str_detect(Product, non.food))
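To see which products are actually being matched, and not only how many, it may help to tabulate the matches before filtering. The following is a minimal sketch, assuming dplyr and stringr are installed; the `products` tibble and the "Hatch Chile" item are made up for illustration and are not from the question's data.

```r
library(dplyr)
library(stringr)

# Made-up sample; "Hatch Chile" is a hypothetical food item that
# happens to contain the pattern "Hat".
products <- tibble(Product = c("Pizza", "Trucker Hat", "Shirt", "Shirt",
                               "Hatch Chile", "Cleaning Product"))

non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")

# One row per distinct matched product, with the number of rows it accounts for.
matched <- products %>%
  filter(str_detect(Product, non.food)) %>%
  count(Product, sort = TRUE)

matched
```

Scanning the `Product` column of `matched` should reveal any item that was caught by accident, and `sum(matched$n)` gives the total number of rows the filter would remove.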

CodePudding user response:

To count matching or non-matching elements, you can use

num_foods <- nrow(mydf[!str_detect(mydf$Product, non.food),])
num_non_foods <- nrow(mydf[str_detect(mydf$Product, non.food),])

You can see that num_foods == 8 and num_non_foods == 8, so on the sample your code seems to do what it should.
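On the real data, one possible reason for losing extra rows is missing values: str_detect() returns NA for an NA input, and filter() drops rows where the condition is NA. The counts can also be taken directly from the logical vector, without building subsets. This is only a sketch on a made-up `sample_df` (the NA row is an assumption, not something shown in the question), and it assumes dplyr and stringr are installed.

```r
library(dplyr)
library(stringr)

# Made-up sample with one missing product name (an assumption for illustration).
sample_df <- data.frame(
  Product = c("Pizza", "Trucker Hat", NA, "Shirt", "Fruit"),
  stringsAsFactors = FALSE
)
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")

is_non_food <- str_detect(sample_df$Product, non.food)  # NA where Product is NA

n_matched <- sum(is_non_food, na.rm = TRUE)  # rows that match the pattern
n_missing <- sum(is.na(is_non_food))         # rows with a missing Product

# filter() keeps only rows where the condition is TRUE, so NA rows are dropped too.
n_dropped <- nrow(sample_df) - nrow(filter(sample_df, !is_non_food))
```

If n_dropped is larger than n_matched, the difference should equal n_missing. To keep rows with a missing product, the condition could be written as `filter(!str_detect(Product, non.food) | is.na(Product))`.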

data

mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product", 
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet", 
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate", 
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2, 
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket", 
"Supermarket", "Supermarket", "Little Store", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Gas Station", 
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA, 
-16L), class = "data.frame")
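If the counts on the full data show that the pattern also catches products it should not, for example a food whose name merely contains "Hat", one option is to require whole-word matches with word boundaries. A minimal sketch, assuming stringr is installed; the `products` vector, the "Hatch Chile" item, and the name `non.food.whole` are hypothetical.

```r
library(stringr)

words <- c("Cleaning", "Hat", "Shirt", "Bracelet")

# \\b marks a word boundary, so "Hat" matches "Trucker Hat" but not "Hatch Chile".
non.food.whole <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

products <- c("Trucker Hat", "Hatch Chile", "Shirt", "Pizza")
result <- str_detect(products, non.food.whole)
result
```

The trade-off is that whole-word matching no longer catches variants such as plurals ("Shirts"), so those would have to be added to the word list explicitly.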