I am working with the R programming language. Suppose I have the following data:
var1 <- rnorm(100,10,1)
var2 <- rnorm(100,10,10)
var3 <- sample( LETTERS[1:2], 100, replace=TRUE, prob=c(0.5, 0.5) )
var4 <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.5, 0.25, 0.25, 0.25) )
var5 <- sample( LETTERS[1:3], 100, replace=TRUE, prob=c(0.5, 0.25, 0.25) )
d1 = data.frame(var1, var2, var3, var4, var5)
d2 = data.frame(var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0)
old_data = rbind(d1,d2)
      var1       var2 var3 var4 var5
1 8.932176 11.1764660    B    B    C
2 9.782025  0.5252539    A    A    B
3 8.973996  5.0944256    A    B    B
4 9.271109  7.4390781    B    A    A
5 9.374961 28.4386201    A    D    A
6 8.313307  3.4805010    A    B    C
Now, suppose I have a new row of data:
new_data = data.frame(var1 = 2, var2 = 3, var3 = "A", var4 = "GG", var5 = "L")
My question: for every factor variable (i.e. var3, var4, var5) in "new_data", if the variable contains a value that does not appear in the same variable of "old_data", I want to replace that value with 0.
I tried to do this manually:
#first, list all values of each factor variable
> table(old_data$var3)
0 A B
1 52 48
> table(old_data$var4)
0 A B C D
1 43 18 15 24
> table(old_data$var5)
0 A B C
1 50 25 25
#next, write the replacement function
ifelse(new_data$var3 != "A" & new_data$var3 != "B", new_data$var3 == 0, new_data$var3 == new_data$var3 )
ifelse(new_data$var4!= "A" & new_data$var4 != "B" & new_data$var4 != "C" & new_data$var4 != "D", new_data$var4 == 0, new_data$var4 == new_data$var4 )
ifelse(new_data$var5!= "A" & new_data$var5 != "B" & new_data$var5 != "C" , new_data$var5 == 0, new_data$var5 == new_data$var5 )
But I don't think the above approach worked. It also would not scale well if there were many factor variables, each with many possible levels.
Can someone please show me how to do this?
Thanks!
CodePudding user response:
This code may return what you want (in your example, var1, var2 and var3 are unchanged, and var4 and var5 equal 0).
library(dplyr)
new_data %>%
  mutate(across(where(is.character), ~ ifelse(. %in% as.list(old_data)[[cur_column()]], ., 0)))
mutate(across(where(is.character), ~ )) allows you to apply a function to each character column of your data frame.
The function is ifelse(. %in% as.list(old_data)[[cur_column()]], ., 0). Here . refers to the values of the column currently being processed, and cur_column() returns that column's name, which is used to look up the column of the same name in old_data. Values found there are kept; all others are replaced with 0.
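For comparison, the same per-column lookup can be sketched in base R without dplyr. This is only an illustration, not part of the original answer: old_data here is a small stand-in with three rows rather than the 101 rows from the question, and the replacement is written as the string "0" so the columns stay character.

```r
# Stand-in data (assumption: a reduced version of the question's old_data).
old_data <- data.frame(
  var3 = c("A", "B", "0"),
  var4 = c("A", "D", "0"),
  var5 = c("C", "B", "0"),
  stringsAsFactors = FALSE
)
new_data <- data.frame(
  var1 = 2, var2 = 3, var3 = "A", var4 = "GG", var5 = "L",
  stringsAsFactors = FALSE
)

# Names of the character columns in new_data.
char_cols <- names(new_data)[vapply(new_data, is.character, logical(1))]

for (col in char_cols) {
  # TRUE where the value never occurs in the same-named column of old_data.
  unseen <- !(new_data[[col]] %in% old_data[[col]])
  new_data[[col]][unseen] <- "0"
}

new_data
```

With this stand-in data, var3 keeps "A" while var4 and var5 become "0", which matches what the dplyr version is described as returning.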
edit: corrected across() to only check character columns.
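One caveat with checking only character columns: where(is.character) skips columns stored as factors, which is what data.frame() produced by default before R 4.0.0 (stringsAsFactors = TRUE). If your columns really are factors, a variation along these lines might work. It is a sketch under assumptions: is_categorical is a hypothetical helper name, the data is a small stand-in, and the result columns come back as character rather than factor.

```r
library(dplyr)

# Stand-in data with factor columns (assumption: not the question's full data).
old_data <- data.frame(
  var3 = factor(c("A", "B", "A")),
  var4 = factor(c("A", "B", "D"))
)
new_data <- data.frame(
  var1 = 2,
  var3 = factor("A"),
  var4 = factor("GG")
)

# Hypothetical helper: TRUE for character or factor columns.
is_categorical <- function(x) is.character(x) || is.factor(x)

result <- new_data %>%
  mutate(across(
    where(is_categorical),
    ~ {
      values <- as.character(.x)                         # compare labels, not integer codes
      known  <- as.character(old_data[[cur_column()]])   # same-named column in old_data
      ifelse(values %in% known, values, "0")
    }
  ))

result
```

Converting to character before the comparison avoids surprises from ifelse() returning a factor's underlying integer codes instead of its labels.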