Home > Back-end >  Subsetting dataframe based on lots of columns in R
Subsetting dataframe based on lots of columns in R

Time:09-20

thanks in advance for any help! I have a dataframe with lots of columns, and I'd like to filter it to only display rows that match specific values in each of those columns. I can easily produce the conditions I'd like to filter by, but I feel like there's an easier way to actually filter the data than by filtering for each value individually.

For a reproducible example, I'll generate a dataset that has lots of columns that will produce matching entries using anagrams. I'm not just looking for all duplicates, but each of them have duplicates -- I'm looking for duplicates of a given test value (test below). Glad for any help!

# load my important libraries
library(dplyr); library(stringr)


corpus <- tibble(word  = c("tables", "stables",
                           "elbow", "below", "bowel",
                           "ascot", "coats", "coast", "tacos",
                           "aridest", "astride", "staider", "tardies", "tirades"))

corpus[letters] <- NA

corpus %<>%
  pivot_longer(cols = letters) %>%
  mutate(value = str_count(word, name)) %>%
  pivot_wider(names_from = name, values_from = value)


# I'm choosing a completely arbitrary row (8) as an example value to look for.
test <- corpus[8, letters]

I want to determine which rows in my dataframe, the corpus, have the same combination of letters as my test value. I can certainly filter the corpus one letter at a time, but it's so inelegant:

corpus %>%
  filter(a == test$a,
         b == test$b,
         c == test$c,
         d == test$d,
         e == test$e,
         # etc . . .
         x == test$x,
         y == test$y,
         z == test$z)

I feel like there has to be an easier way to do this using either the match, which, or %in% commands, so that it could match even 100 columns if necessary without me writing out each individual one. My ideal solution would look something like this:

corpus %>%
  which(test)

or

which(corpus[letters] == test[letters])

and give me the results:

# A tibble: 4 × 27
  word      a     b     c     d     e     f     g     h     i     j     k     l     m     n     o     p     q     r     s
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ascot     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1
2 coats     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1
3 coast     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1
4 tacos     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1
# … with 7 more variables: t <int>, u <int>, v <int>, w <int>, x <int>, y <int>, z <int>

CodePudding user response:

We could use across

library(dplyr)
corpus %>% 
      filter(across(-1, ~ .x == test[[cur_column()]]))

-output

# A tibble: 4 × 27
  word      a     b     c     d     e     f     g     h     i     j     k     l     m     n     o     p     q     r     s     t     u
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ascot     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1     1     0
2 coats     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1     1     0
3 coast     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1     1     0
4 tacos     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1     1     0
# … with 5 more variables: v <int>, w <int>, x <int>, y <int>, z <int>

Or another option with imap/filter

library(purrr)
imap(corpus[-1], ~ .x == test[[.y]]) %>%
   reduce(`&`) %>% 
   filter(corpus, .)

-output

# A tibble: 4 × 27
  word      a     b     c     d     e     f     g     h     i     j     k     l     m     n     o     p     q     r     s     t     u
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ascot     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1     1     0
2 coats     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1     1     0
3 coast     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1     1     0
4 tacos     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1     1     0
# … with 5 more variables: v <int>, w <int>, x <int>, y <int>, z <int>

CodePudding user response:

Here is tidyverse approach: We could compare test row with corpus row after using bind_rows, unite all to one column and compare the column for duplicates. Finally filter the duplicates and remove first row by drop_na (because word gives NA in test row):

library(dplyr)
library(tidyr)

test %>% 
  bind_rows(corpus) %>% 
  unite(., helper, -word, remove = FALSE) %>% 
  mutate(helper = ifelse(helper == first(helper), 1, 0)) %>% 
  filter(helper == 1) %>% 
  drop_na() %>% 
  select(word, everything(), -helper)
  word      a     b     c     d     e     f     g     h     i     j     k
  <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ascot     1     0     1     0     0     0     0     0     0     0     0
2 coats     1     0     1     0     0     0     0     0     0     0     0
3 coast     1     0     1     0     0     0     0     0     0     0     0
4 tacos     1     0     1     0     0     0     0     0     0     0     0
# ... with 15 more variables: l <int>, m <int>, n <int>, o <int>,
#   p <int>, q <int>, r <int>, s <int>, t <int>, u <int>, v <int>,
#   w <int>, x <int>, y <int>, z <int>
# i Use `colnames()` to see all variable names

CodePudding user response:

Using the (paste(x, collapse="")) function to collapse all the columns into a single string and then compare.

corpus[apply(X=corpus[-1], MARGIN=1, FUN=function(x) {paste(test, collapse = "")==paste(x, collapse = "")}),]
# A method without the anonymous function and the comparison outside of apply
# Small speed improvement:
corpus[apply(corpus[-1],1,paste,collapse="")==paste(test, collapse = ""),]
# Do.Call method, Fastest yet:
corpus[ do.call(paste0, corpus[-1]) == paste(test, collapse = ""),]
  word      a     b     c     d     e     f     g     h     i     j     k     l     m     n     o     p
1 ascot     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0
2 coats     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0
3 coast     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0
4 tacos     1     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0
  • Related