thanks in advance for any help! I have a dataframe with lots of columns, and I'd like to filter it to only display rows that match specific values in each of those columns. I can easily produce the conditions I'd like to filter by, but I feel like there's an easier way to actually filter the data than by filtering for each value individually.
For a reproducible example, I'll generate a dataset that has lots of columns that will produce matching entries using anagrams. I'm not just looking for all duplicates, but each of them have duplicates -- I'm looking for duplicates of a given test value (test
below). Glad for any help!
# load my important libraries
library(dplyr); library(stringr)
corpus <- tibble(word = c("tables", "stables",
"elbow", "below", "bowel",
"ascot", "coats", "coast", "tacos",
"aridest", "astride", "staider", "tardies", "tirades"))
corpus[letters] <- NA
corpus %<>%
pivot_longer(cols = letters) %>%
mutate(value = str_count(word, name)) %>%
pivot_wider(names_from = name, values_from = value)
# I'm choosing a completely arbitrary row (8) as an example value to look for.
test <- corpus[8, letters]
I want to determine which rows in my dataframe, the corpus, have the same combination of letters as my test value. I can certainly filter the corpus one letter at a time, but it's so inelegant:
corpus %>%
filter(a == test$a,
b == test$b,
c == test$c,
d == test$d,
e == test$e,
# etc . . .
x == test$x,
y == test$y,
z == test$z)
I feel like there has to be an easier way to do this using either the match
, which
, or %in%
commands, so that it could match even 100 columns if necessary without me writing out each individual one. My ideal solution would look something like this:
corpus %>%
which(test)
or
which(corpus[letters] == test[letters])
and give me the results:
# A tibble: 4 × 27
word a b c d e f g h i j k l m n o p q r s
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ascot 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
2 coats 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
3 coast 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
4 tacos 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
# … with 7 more variables: t <int>, u <int>, v <int>, w <int>, x <int>, y <int>, z <int>
CodePudding user response:
We could use across
library(dplyr)
corpus %>%
filter(across(-1, ~ .x == test[[cur_column()]]))
-output
# A tibble: 4 × 27
word a b c d e f g h i j k l m n o p q r s t u
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ascot 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
2 coats 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
3 coast 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
4 tacos 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
# … with 5 more variables: v <int>, w <int>, x <int>, y <int>, z <int>
Or another option with imap/filter
library(purrr)
imap(corpus[-1], ~ .x == test[[.y]]) %>%
reduce(`&`) %>%
filter(corpus, .)
-output
# A tibble: 4 × 27
word a b c d e f g h i j k l m n o p q r s t u
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ascot 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
2 coats 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
3 coast 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
4 tacos 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0
# … with 5 more variables: v <int>, w <int>, x <int>, y <int>, z <int>
CodePudding user response:
Here is tidyverse approach:
We could compare test row with corpus row after using bind_rows
, unite
all to one column and compare the column for duplicates. Finally filter the duplicates and remove first row by drop_na (because word gives NA in test row):
library(dplyr)
library(tidyr)
test %>%
bind_rows(corpus) %>%
unite(., helper, -word, remove = FALSE) %>%
mutate(helper = ifelse(helper == first(helper), 1, 0)) %>%
filter(helper == 1) %>%
drop_na() %>%
select(word, everything(), -helper)
word a b c d e f g h i j k
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ascot 1 0 1 0 0 0 0 0 0 0 0
2 coats 1 0 1 0 0 0 0 0 0 0 0
3 coast 1 0 1 0 0 0 0 0 0 0 0
4 tacos 1 0 1 0 0 0 0 0 0 0 0
# ... with 15 more variables: l <int>, m <int>, n <int>, o <int>,
# p <int>, q <int>, r <int>, s <int>, t <int>, u <int>, v <int>,
# w <int>, x <int>, y <int>, z <int>
# i Use `colnames()` to see all variable names
CodePudding user response:
Using the (paste(x, collapse=""))
function to collapse all the columns into a single string and then compare.
corpus[apply(X=corpus[-1], MARGIN=1, FUN=function(x) {paste(test, collapse = "")==paste(x, collapse = "")}),]
# A method without the anonymous function and the comparison outside of apply
# Small speed improvement:
corpus[apply(corpus[-1],1,paste,collapse="")==paste(test, collapse = ""),]
# Do.Call method, Fastest yet:
corpus[ do.call(paste0, corpus[-1]) == paste(test, collapse = ""),]
word a b c d e f g h i j k l m n o p 1 ascot 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 2 coats 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 3 coast 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 4 tacos 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0