Home > Blockchain >  perform filtering on pairs of columns in tidyverse
perform filtering on pairs of columns in tidyverse

Time:09-30

I have a large dataframe with one categorical and many numerical columns.

big_df <- data.frame(category = LETTERS[1:10]
                     ,Q_COL1 = c(0,1,0,2,0,0,17,0,12,19)
                     ,Q_COL2 = c(0,2,3,0,1,12,0,1,0,12)
                     )

It looks like this:

   category Q_COL1 q_COL2
1         A      0      0
2         B      1      2
3         C      0      3
4         D      2      0
5         E      0      1
6         F      0     12
7         G     17      0
8         H      0      1
9         I     12      0
10        J     19     12

For each numerical column, I would like to create a character vector as follows:

col1_char <- big_df %>% 
  select(category, Q_COL1) %>% 
  filter(Q_COL1 > 0) %>% 
  select(category) %>% 
  deframe()

col2_char <- big_df %>% 
  select(category, Q_COL2) %>% 
  filter(Q_COL2 > 0) %>% 
  select(category) %>% 
  deframe()

This code allows to list categories (from the category column) for which there were no 0 values in each of the numerical column. The output vectors look like this:

> col1_char
[1] "B" "D" "G" "I" "J"
> col2_char
[1] "B" "C" "E" "F" "H" "J"

Each of these will be of a different length. If possible, the optimal output would store them in a dataframe, with NAs to account for different lengths. A list would also be good.

I can produce these objects one-by-one, but it's not very elegant. I could probably write a loop to do this, but I wonder is there a neater, perhaps tidyverse, way?

CodePudding user response:

This can be a one-liner with good ol' base R,

sapply(big_df[-1], \(i) big_df$category[i > 0])

$Q_COL1
[1] "B" "D" "G" "I" "J"

$Q_COL2
[1] "B" "C" "E" "F" "H" "J"

To put the in a data frame, lot of options can be found here

stringi::stri_list2matrix(sapply(big_df[-1], \(i)big_df$category[i > 0]))

     [,1] [,2]
[1,] "B"  "B" 
[2,] "D"  "C" 
[3,] "G"  "E" 
[4,] "I"  "F" 
[5,] "J"  "H" 
[6,] NA   "J" 

CodePudding user response:

Using tidyverse:

big_df %>%
  mutate(across(where(is.numeric), ~ifelse(.x > 0, category, NA)))

category    Q_COL1  Q_COL2
A   NA  NA
B   B   B
C   NA  C
D   D   NA
E   NA  E
F   NA  F
G   G   NA
H   NA  H
I   I   NA
J   J   J
  • Related