Ifelse across multiple columns matching on similar attributes

Time:04-20

I need to create a binary variable called dum (perhaps using an ifelse statement), pairing the f and l columns by the number in their names.

ifelse(f[number] %in% c(4:6) & l[number] == 1, 1, 0)

f1<-c(3,2,1,6,5)
f2<-c(4,1,5,NA,NA)
f3<-c(5,3,4,NA,NA)
f4<-c(1,2,4,NA,NA)
l1<-c(1,0,1,0,0)
l2<-c(1,1,1,NA,NA)
l3<-c(1,0,0,NA,NA)
l4<-c(0,0,0,NA,NA)

mydata<-data.frame(f1,f2,f3,f4,l1,l2,l3,l4)

dum is 1 if f1 contains one of the values 4, 5, or 6 AND l1 contains a value of 1, OR f2 contains one of the values 4, 5, or 6 AND l2 contains a value of 1, and so on. Otherwise dum is 0.

In essence, the expected output should be

  f1 f2 f3 f4 l1 l2 l3 l4 dum
1  3  4  5  1  1  1  1  0   1
2  2  1  3  2  0  1  0  0   0
3  1  5  4  4  1  1  0  0   1
4  6 NA NA NA  0 NA NA NA   0
5  5 NA NA NA  0 NA NA NA   0

I can only think of doing it in a very long way such as

mydata %>%
  mutate(dum = ifelse(f1 %in% c(4:6) & l1 == 1, 1,
               ifelse(f2 %in% c(4:6) & l2 == 1, 1,
               ifelse(f3 %in% c(4:6) & l3 == 1, 1,
               ifelse(f4 %in% c(4:6) & l4 == 1, 1, 0)))))

But this is burdensome since the real data has many more columns than that and can go up to f20 and l20.

Is there a more efficient way to do this?

CodePudding user response:

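Here is one suggestion, although the requirement is not entirely clear. It assumes you want a single column dum that indicates whether, in a given row, any column holds a value equal to the number in that column's name. Under this reading row 2 gets dum = 1, which differs from the expected output in the question.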

library(dplyr)
library(readr)

mydata %>% 
  mutate(across(f1:l4, ~case_when(. == parse_number(cur_column()) ~ 1,
                                  TRUE ~ 0), .names = 'new_{col}')) %>%
  mutate(sumNew = rowSums(.[9:16])) %>% 
  mutate(dum = ifelse(sumNew >=1, 1, 0)) %>% 
  select(1:8, dum)

  f1 f2 f3 f4 l1 l2 l3 l4 dum
1  3  4  5  1  1  1  1  0   1
2  2  1  3  2  0  1  0  0   1
3  1  5  4  4  1  1  0  0   1
4  6 NA NA NA  0 NA NA NA   0
5  5 NA NA NA  0 NA NA NA   0

CodePudding user response:

Here is one option with across. Loop across the 'f' columns with the first condition, loop across the 'l' columns with the second condition, and join the two results with & to return a logical matrix. Then take the row-wise sum of that matrix (TRUE -> 1 and FALSE -> 0), check whether the sum is greater than 0 (i.e. whether there is any TRUE in that row), and coerce the logical to binary with + or as.integer.

library(dplyr)
mydata %>% 
  mutate(dum = +(rowSums(across(starts_with('f'),  ~.x %in% 4:6) & 
                     across(starts_with('l'), ~ .x %in% 1)) > 0))

  f1 f2 f3 f4 l1 l2 l3 l4 dum
1  3  4  5  1  1  1  1  0   1
2  2  1  3  2  0  1  0  0   0
3  1  5  4  4  1  1  0  0   1
4  6 NA NA NA  0 NA NA NA   0
5  5 NA NA NA  0 NA NA NA   0
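
The steps described above can also be traced one at a time. The following is a minimal sketch in base R rather than dplyr, so that each intermediate object can be inspected; the names f_ok, l_ok, both and hits are illustrative only and do not appear in the answer.

```r
# Sample data from the question
mydata <- data.frame(
  f1 = c(3, 2, 1, 6, 5),   f2 = c(4, 1, 5, NA, NA),
  f3 = c(5, 3, 4, NA, NA), f4 = c(1, 2, 4, NA, NA),
  l1 = c(1, 0, 1, 0, 0),   l2 = c(1, 1, 1, NA, NA),
  l3 = c(1, 0, 0, NA, NA), l4 = c(0, 0, 0, NA, NA)
)

f_cols <- mydata[startsWith(names(mydata), "f")]
l_cols <- mydata[startsWith(names(mydata), "l")]

# %in% returns FALSE for NA, so missing values never count as a match
f_ok <- sapply(f_cols, function(x) x %in% 4:6)  # logical matrix, one column per f*
l_ok <- sapply(l_cols, function(x) x %in% 1)    # logical matrix, one column per l*

both <- f_ok & l_ok           # element-wise: k-th f column with k-th l column
hits <- rowSums(both)         # number of matching f/l pairs in each row
dum  <- as.integer(hits > 0)  # 1 if at least one pair matched, else 0
```

Because the two matrices are combined by position, this relies on the f and l columns appearing in the same order in the data frame.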

We could also use base R:

# Map pairs each f column with the l column in the same position; Reduce ORs the pairs
mydata$dum <- +(Reduce(`|`, Map(function(x, y) x %in% 4:6 & 
          y %in% 1, mydata[startsWith(names(mydata), "f")], 
                    mydata[startsWith(names(mydata), "l")])))

CodePudding user response:

Here's an approach that multiplies the results of two mapply calls, with the columns identified by grep, and then checks whether rowSums > 0. If you set na.rm=F instead, rows containing NA can yield NA rather than 0.

as.integer(rowSums(mapply(`%in%`, mydata[grep('^f', names(mydata))], list(4:6))*
            mapply(`==`, mydata[grep('^l', names(mydata))], 1), na.rm=T) > 0)
# [1] 1 0 1 0 0
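
To illustrate the na.rm remark, here is a small sketch that runs the same computation both ways on the sample data. The intermediate names f_in, l_eq, dum_drop_na and dum_keep_na are made up for this example.

```r
# Sample data from the question
mydata <- data.frame(
  f1 = c(3, 2, 1, 6, 5),   f2 = c(4, 1, 5, NA, NA),
  f3 = c(5, 3, 4, NA, NA), f4 = c(1, 2, 4, NA, NA),
  l1 = c(1, 0, 1, 0, 0),   l2 = c(1, 1, 1, NA, NA),
  l3 = c(1, 0, 0, NA, NA), l4 = c(0, 0, 0, NA, NA)
)

# %in% never returns NA, but == does when the l value is missing
f_in <- mapply(`%in%`, mydata[grep("^f", names(mydata))], list(4:6))
l_eq <- mapply(`==`, mydata[grep("^l", names(mydata))], 1)

# Dropping NAs treats a missing pair as "no match"
dum_drop_na <- as.integer(rowSums(f_in * l_eq, na.rm = TRUE) > 0)

# Keeping NAs propagates them to any row with a missing l value
dum_keep_na <- as.integer(rowSums(f_in * l_eq, na.rm = FALSE) > 0)
```

With the sample data, rows 4 and 5 come out as 0 in dum_drop_na and as NA in dum_keep_na, because l2 to l4 are missing in those rows.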

If the f* and l* columns do not each appear in consecutive, matching order, select them with sort(grep(., value=T)) instead.
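
As a sketch of that last point, assume the columns arrive in an arbitrary order. The data frame shuffled and the vectors f_names and l_names below are hypothetical and only serve to show the sort(grep(., value=T)) selection.

```r
# Sample data from the question
mydata <- data.frame(
  f1 = c(3, 2, 1, 6, 5),   f2 = c(4, 1, 5, NA, NA),
  f3 = c(5, 3, 4, NA, NA), f4 = c(1, 2, 4, NA, NA),
  l1 = c(1, 0, 1, 0, 0),   l2 = c(1, 1, 1, NA, NA),
  l3 = c(1, 0, 0, NA, NA), l4 = c(0, 0, 0, NA, NA)
)

# Hypothetical reordering: f and l columns interleaved and out of sequence
shuffled <- mydata[c("l3", "f2", "l1", "f4", "f1", "l4", "f3", "l2")]

# Select by name and sort, so the k-th f name lines up with the k-th l name
f_names <- sort(grep("^f", names(shuffled), value = TRUE))
l_names <- sort(grep("^l", names(shuffled), value = TRUE))

dum <- as.integer(rowSums(
  mapply(`%in%`, shuffled[f_names], list(4:6)) *
    mapply(`==`, shuffled[l_names], 1),
  na.rm = TRUE) > 0)
```

sort() orders the names as strings, so with many columns f10 would sort before f2. That is harmless here as long as the f and l columns carry the same set of numeric suffixes, because both name vectors are then reordered identically and the pairs still line up.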
