I have a dataframe with numerous observations and different types of variables. Here's a sample of my dataframe:
mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product",
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet",
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate",
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2,
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket",
"Supermarket", "Supermarket", "Little Store", "Supermarket",
"Supermarket", "Little Store", "Gas Station", "Supermarket",
"Supermarket", "Little Store", "Gas Station", "Gas Station",
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA,
-16L), class = "data.frame")
# of observation | Product | Price in $ | Place |
---|---|---|---|
1 | Pizza | 2 | Supermarket |
2 | Cleaning Product | 3.5 | Supermarket |
3 | Chocolate | 1 | Supermarket |
4 | Fruit | 1 | Little Store |
5 | Red Meat | 2.5 | Supermarket |
6 | Cleaning Product | 3.5 | Supermarket |
7 | Bracelet | 3 | Little Store |
8 | Trucker Hat | 5 | Gas Station |
9 | Shirt | 15 | Supermarket |
10 | Shirt | 20 | Supermarket |
11 | Chicken Breast | 2.5 | Little Store |
12 | Chocolate | 1 | Gas Station |
13 | Cereal | 2 | Gas Station |
14 | Fruit | 1 | Little Store |
15 | Cleaning Product | 3.5 | Supermarket |
16 | Trucker Hat | 4 | Supermarket |
I also have a character vector:
non.food <- c("Cleaning", "Hat", "Shirt", "Bracelet")
I have to eliminate observations that match any of the words from the vector non.food. For this I use the following code:
library(dplyr)
library(stringr)
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = '|')
mydf <- mydf %>%
  filter(!str_detect(Product, non.food))
It works pretty well, but I have the impression that I lose more observations than I should. For instance, looking at the sample, I should lose 8 observations, but in reality I end up losing 10. (I can't show this with the sample: my real data has 8916 observations, and the sample only illustrates the kind of dataframe I'm working with.)
So, I would like to first count the number of observations that match any of the words inside the vector, to be sure that my code doesn't eliminate more observations than it should. I cannot use commands such as which(mydf$Product == non.food) or sum(mydf$Product == non.food). I could do the inverse of my code and keep only the observations that match my strings to verify, but that takes more time and creates more data that I don't want. Does anybody have an idea?
Also, if my code is in fact eliminating more observations than it should, does somebody have a solution?
Thank you in advance.
CodePudding user response:
You could add a count variable that counts the number of deleted rows using case_when, e.g.:
library(tidyverse)
df <- tribble(
~"# of observation", ~Product, ~"Price in $", ~Place,
1, "Pizza", 2, "Supermarket",
2, "Cleaning Product", 3.5, "Supermarket",
3, "Chocolate", 1, "Supermarket",
4, "Fruit", 1, "Little Store",
5, "Red Meat", 2.5, "Supermarket",
6, "Cleaning Product", 3.5, "Supermarket",
7, "Bracelet", 3, "Little Store",
8, "Trucker Hat", 5, "Gas Station",
9, "Shirt", 15, "Supermarket",
10, "Shirt", 20, "Supermarket",
11, "Chicken Breast", 2.5, "Little Store",
12, "Chocolate", 1, "Gas Station",
13, "Cereal", 2, "Gas Station",
14, "Fruit", 1, "Little Store",
15, "Cleaning Product", 3.5, "Supermarket",
16, "Trucker Hat", 4, "Supermarket"
)
non.food <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")
mydf <- df %>%
  # flag rows that match any non-food word
  mutate(count = case_when(
    str_detect(Product, non.food) ~ 1,
    TRUE ~ 0
  )) %>%
  # total number of rows about to be removed
  mutate(sum_deleted = sum(count)) %>%
  filter(!str_detect(Product, non.food))
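If adding columns to the data is not wanted, the count can also be taken directly from the logical vector that str_detect returns. The following is a minimal sketch on the Product column of the sample; the names products, pattern, is_non_food, n_removed, n_kept and matched_products are illustrative and not part of the original code.

```r
library(stringr)

# Product column from the sample data in the question
products <- c("Pizza", "Cleaning Product", "Chocolate", "Fruit", "Red Meat",
              "Cleaning Product", "Bracelet", "Trucker Hat", "Shirt", "Shirt",
              "Chicken Breast", "Chocolate", "Cereal", "Fruit",
              "Cleaning Product", "Trucker Hat")

pattern <- paste(c("Cleaning", "Hat", "Shirt", "Bracelet"), collapse = "|")

# str_detect() returns a logical vector; sum() counts the TRUE values
is_non_food <- str_detect(products, pattern)
n_removed <- sum(is_non_food, na.rm = TRUE)   # rows the filter would remove
n_kept <- sum(!is_non_food, na.rm = TRUE)     # rows the filter would keep

# Tabulate which distinct products matched, to spot unexpected hits
matched_products <- table(products[which(is_non_food)])
```

On the full data, printing matched_products is probably the quickest way to see whether a product that should have been kept is being caught by one of the words.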
CodePudding user response:
To count matching or non-matching elements, you can use
num_foods <- nrow(mydf[!str_detect(mydf$Product, non.food),])
num_non_foods <- nrow(mydf[str_detect(mydf$Product, non.food),])
You can see that num_foods == 8 and num_non_foods == 8, so your code seems to do what it should.
data
mydf <- structure(list(id = 1:16, Product = c("Pizza", "Cleaning Product",
"Chocolate", "Fruit", "Red Meat", "Cleaning Product", "Bracelet",
"Trucker Hat", "Shirt", "Shirt", "Chicken Breast", "Chocolate",
"Cereal", "Fruit", "Cleaning Product", "Trucker Hat"), price = c(2,
3.5, 1, 1, 2.5, 3.5, 3, 5, 15, 20, 2.5, 1, 2, 1, 3.5, 4), place = c("Supermarket",
"Supermarket", "Supermarket", "Little Store", "Supermarket",
"Supermarket", "Little Store", "Gas Station", "Supermarket",
"Supermarket", "Little Store", "Gas Station", "Gas Station",
"Little Store", "Supermarket", "Supermarket")), row.names = c(NA,
-16L), class = "data.frame")
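If the full data still loses more rows than expected, two possible causes are worth checking, though neither can be confirmed from the sample alone. First, the pattern matches substrings, so "Hat" would also match a longer word that merely contains it. Second, filter() drops rows where the condition is NA, so rows with a missing Product vanish as well. The sketch below illustrates both on invented data: the toy data frame, its "Hatch Chili" and NA rows, and all variable names are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical data: "Hatch Chili" and the NA row are invented for illustration
toy <- data.frame(
  Product = c("Pizza", "Trucker Hat", "Hatch Chili", "Shirt", NA,
              "Cleaning Product"),
  stringsAsFactors = FALSE
)

words <- c("Cleaning", "Hat", "Shirt", "Bracelet")
loose <- paste(words, collapse = "|")      # matches substrings
strict <- paste0("\\b(", loose, ")\\b")    # matches whole words only

n_loose <- sum(str_detect(toy$Product, loose), na.rm = TRUE)
n_strict <- sum(str_detect(toy$Product, strict), na.rm = TRUE)
n_missing <- sum(is.na(toy$Product))

# filter() drops rows where the condition is NA, so the NA row is lost too
kept_default <- toy %>% filter(!str_detect(Product, strict))

# Treat NA as "not a match" to keep rows with a missing Product
kept_with_na <- toy %>%
  filter(!coalesce(str_detect(Product, strict), FALSE))
```

Comparing n_loose with n_strict shows how many rows are caught only through a partial-word match, and n_missing shows how many rows would disappear because Product is missing. Whether whole-word matching is the right rule depends on the real product names.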