I am looking for advice on the principle of filtering a dataset in R. I currently have the below code which allows for easy filtering of records where a value in column 'Value' is within the required list that I have created:
ValuesNumber <-
read.table(textConnection("CustomerID Value
1 Ball
1 Cat
2 Ball
2 Ball
3 Dog
4 Ball
4 Blitz"), header=TRUE)
#Filter for required values only
Values_List <- "Ball|Twist|Tester"
ValuesNumberFiltered <- ValuesNumber[grep(Values_List, ValuesNumber$Value
),]
I am looking to amend this so that the below criteria are met:
- 'CustomerID' appears in the dataset at least twice
- The entry in 'Value' column for the second entry does not appear within a list of my choosing.
So for example if working with this dataset:
CustomerID | Value |
---|---|
1 | Ball |
1 | Cat |
2 | Ball |
2 | Ball |
3 | Dog |
4 | Ball |
4 | Blitz |
I would then like to create a new column entitled 'Y/N' which has:
'1' if the value in all occurrences after the first occurrence does not match my list or '0' if it does not.
So the output would look like this:
CustomerID | Value | Y/N |
---|---|---|
1 | Ball | 0 |
1 | Cat | 1 |
2 | Ball | 0 |
2 | Ball | 0 |
3 | Dog | 0 |
4 | Ball | 0 |
4 | Blitz | 1 |
CodePudding user response:
tidyverse
solution:
library(dplyr)
Values_List <- c("Ball", "Twist", "Tester")
ValuesNumber %>%
group_by(CustomerID) %>%
mutate(`Y/N` = (n() >= 2 & !(Value %in% Values_List)))
CustomerID Value `Y/N`
1 1 Ball 0
2 1 Cat 1
3 2 Ball 0
4 2 Ball 0
5 3 Dog 0
6 4 Ball 0
7 4 Blitz 1
CodePudding user response:
library(dplyr)
ValuesNumber %>%
group_by(CustomerID) %>%
mutate(`Y/N` = case_when(
row_number() == 1 ~ 0,
grepl(Values_List, Value) ~ 0,
TRUE ~ 1
)) %>%
ungroup()
# # A tibble: 7 × 3
# CustomerID Value `Y/N`
# <int> <chr> <dbl>
# 1 1 Ball 0
# 2 1 Cat 1
# 3 2 Ball 0
# 4 2 Ball 0
# 5 3 Dog 0
# 6 4 Ball 0
# 7 4 Blitz 1
CodePudding user response:
rm(list = ls())
library(tidyverse)
values_number <- read.table(
textConnection("CustomerID Value
1 Ball
1 Cat
2 Ball
2 Ball
3 Dog
4 Ball
4 Blitz"), header = TRUE)
# Filter for required values only
value_list <- c("Ball", "Twist", "Tester")
count_id <- values_number |>
group_by(CustomerID) |>
summarise(count = length(CustomerID)) |> # count the occurance of each customer id
right_join(values_number, by = "CustomerID") |> # combined to the original data
mutate("Y/N" = case_when(
count > 1 & !(Value %in% value_list) ~ 1, # if the occurance of customer id > 1 and
TRUE ~ 0) # the entry did not involved in the list
) # mark as 1, the others mark as 0