How to count observations matching the values of a vector of characters

Time:05-03

I have a dataframe with numerous observations and different types of variables. Here's a sample of my dataframe:

mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product", 
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet", 
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate", 
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2, 
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket", 
"Supermarket", "Supermarket", "Little Store", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Supermarket", 
"Supermarket", "Little Store", "Gas Station", "Gas Station", 
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA, 
-16L), class = "data.frame")
# of observation  Product           Price in $  Place
1                 Pizza             2           Supermarket
2                 Cleaning Product  3.5         Supermarket
3                 Chocolate         1           Supermarket
4                 Fruit             1           Little Store
5                 Red Meat          2.5         Supermarket
6                 Cleaning Product  3.5         Supermarket
7                 Bracelet          3           Little Store
8                 Trucker Hat       5           Gas Station
9                 Shirt             15          Supermarket
10                Shirt             20          Supermarket
11                Chicken Breast    2.5         Little Store
12                Chocolate         1           Gas Station
13                Cereal            2           Gas Station
14                Fruit             1           Little Store
15                Cleaning Product  3.5         Supermarket
16                Trucker Hat       4           Supermarket

I also have a vector of characters:

non.food <- c("Cleaning", "Hat", "Shirt", "Bracelet")

I have to eliminate observations that match any of the words from the vector non.food. For this I use the following code:

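library(dplyr)    # %>% and filter()
library(stringr)  # str_detect()
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")  # one regex: "Cleaning|Hat|Shirt|Bracelet"
mydf <- mydf %>%
  filter(!str_detect(Product, non.food))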

It works pretty well, but I have the impression that I lose more observations than I should. For instance, if the sample behaved like my real data, I would expect to lose 8 observations but would end up losing 10. (The sample does not reproduce this: my real data has 8916 observations, so the sample only illustrates the kind of dataframe I am working with.)

So, I would like to first count the number of observations that match any of the words in the vector, to be sure that my code doesn't eliminate more observations than it should. I cannot use commands such as which(mydf$Product == non.food) or sum(mydf$Product == non.food). I could do the inverse of my code and keep only the observations that match my strings to verify, but that takes more time and creates more data that I don't want. Does anybody have an idea?

Also, if my code is in fact eliminating more observations than it should, does somebody have a solution?

Thank you in advance.

CodePudding user response:

You could add a count variable that flags the rows to be deleted using case_when, and sum it before filtering, e.g.

library(tidyverse)

# Rebuild the sample data as a tibble
df <- tribble(
  ~"# of observation", ~Product, ~"Price in $", ~Place,
  1, "Pizza", 2, "Supermarket",
  2, "Cleaning Product", 3.5, "Supermarket",
  3, "Chocolate", 1, "Supermarket",
  4, "Fruit", 1, "Little Store",
  5, "Red Meat", 2.5, "Supermarket",
  6, "Cleaning Product", 3.5, "Supermarket",
  7, "Bracelet", 3, "Little Store",
  8, "Trucker Hat", 5, "Gas Station",
  9, "Shirt", 15, "Supermarket",
  10, "Shirt", 20, "Supermarket",
  11, "Chicken Breast", 2.5, "Little Store",
  12, "Chocolate", 1, "Gas Station",
  13, "Cereal", 2, "Gas Station",
  14, "Fruit", 1, "Little Store",
  15, "Cleaning Product", 3.5, "Supermarket",
  16, "Trucker Hat", 4, "Supermarket"
)

non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")
mydf <- df %>%
  # count is 1 for rows whose Product matches the pattern, 0 otherwise
  mutate(count = case_when(
    str_detect(Product, non.food) ~ 1,
    TRUE ~ 0
  )) %>%
  # total number of matching rows, computed before they are dropped
  mutate(sum_deleted = sum(count)) %>%
  filter(!str_detect(Product, non.food))
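
If you would rather not carry helper columns around, one possible alternative is to compute the match once as a logical vector and sum it. This is only a sketch in base R (grepl instead of str_detect), using the Product values from the question; the names products, pattern, is_non_food, n_removed, n_kept and kept_products are made up for the illustration.

```r
# Product values copied from the question's sample data
products <- c("Pizza", "Cleaning Product", "Chocolate", "Fruit", "Red Meat",
              "Cleaning Product", "Bracelet", "Trucker Hat", "Shirt", "Shirt",
              "Chicken Breast", "Chocolate", "Cereal", "Fruit",
              "Cleaning Product", "Trucker Hat")

# One regular expression: "Cleaning|Hat|Shirt|Bracelet"
pattern <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")

# grepl() returns one TRUE/FALSE per element; TRUE means the pattern matched
is_non_food <- grepl(pattern, products)

n_removed <- sum(is_non_food)   # rows the filter would drop
n_kept    <- sum(!is_non_food)  # rows that would remain

# The same logical vector can drive the filter itself
kept_products <- products[!is_non_food]
```

Because TRUE counts as 1 when summed, no extra data frame is created just to get the counts.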

CodePudding user response:

To count matching or non-matching elements, you can use

num_foods <- nrow(mydf[!str_detect(mydf$Product, non.food),])
num_non_foods <- nrow(mydf[str_detect(mydf$Product, non.food),])

You can see that, for the sample, num_foods == 8 and num_non_foods == 8, so your code seems to do what it should.
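
If the real data does lose more rows than expected, one possible cause (an assumption, since the full data is not shown) is that the pattern matches substrings inside longer words. The sketch below uses hypothetical product names that are not in the question's data, and compares the original pattern with a variant wrapped in \\b word boundaries. It uses base grepl with perl = TRUE; the same pattern strings can also be passed to str_detect.

```r
# Hypothetical product names, chosen only to show substring over-matching
products <- c("Trucker Hat", "Hatch Chile Salsa", "Shirt", "T-Shirt",
              "Cleaning Product", "Pizza")

words <- c("Cleaning", "Hat", "Shirt", "Bracelet")

# Original pattern: matches the words anywhere, even inside another word
loose <- paste(words, collapse = "|")

# Variant: the words must appear as whole words (\\b is a word boundary)
strict <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

hit_loose  <- grepl(loose, products)
hit_strict <- grepl(strict, products, perl = TRUE)

n_loose  <- sum(hit_loose)   # count with the original pattern
n_strict <- sum(hit_strict)  # count with whole-word matching

# Products matched only because a word occurs inside a longer word
over_matched <- unique(products[hit_loose & !hit_strict])
print(over_matched)
```

Listing the distinct matched values this way on the real Product column should make it easy to spot which names are being removed unintentionally.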
