Subset table based on strings present in multiple columns-CodePudding

I have a dataframe that looks like this

> df
               Col1                 Col2             P_value  
              Cell1                Cell2               0.001      
              Cell2                Cell1                0.05      
              Cell4                Cell1                0.01     
              Cell5                Cell2                0.03        
              Cell2                Cell3               0.008      
              Cell1                Cell4               0.008

I want to subset to a new dataframe, in which only strings present in BOTH Col1 and Col2 in both orders. So here, Cell1 and Cell2, when matched, appear in both Col1 and Col2.

> df
               Col1                 Col2             P_value  
              Cell1                Cell2               0.001      
              Cell2                Cell1                0.05 
              Cell1                Cell4               0.008 
              Cell4                Cell1                0.01

So here, Cell1 and Cell2, when matched, appear in both Col1 and Col2. Same for Cell1 and Cell4. This never happens for other strings.

CodePudding user response：

Perhaps this is over-simplifying, but ...

df %>%
  filter(Col1 %in% Col2 & Col2 %in% Col1)
#    Col1  Col2 P_value
# 1 Cell1 Cell2   <0.05
# 2 Cell2 Cell1   <0.05
# 3 Cell4 Cell1   <0.05
# 4 Cell1 Cell4   <0.05

CodePudding user response：

We may use

library(dplyr)
library(stringr)
df %>% 
  mutate(Col = str_c(pmin(Col1, Col2), pmax(Col1, Col2))) %>%
  filter(duplicated(Col)|duplicated(Col, fromLast = TRUE)) %>% 
  select(-Col)

Or may do

 df %>%
   add_count(pmin(Col1, Col2), pmax(Col1, Col2)) %>% 
   filter(n > 1) %>% 
   select(names(df))

-output

   Col1  Col2 P_value
1 Cell1 Cell2   <0.05
2 Cell2 Cell1   <0.05
3 Cell4 Cell1   <0.05
4 Cell1 Cell4   <0.05

data

df <- structure(list(Col1 = c("Cell1", "Cell2", "Cell4", "Cell5", "Cell2", 
"Cell1"), Col2 = c("Cell2", "Cell1", "Cell1", "Cell2", "Cell3", 
"Cell4"), P_value = c("<0.05", "<0.05", "<0.05", "<0.05", "<0.05", 
"<0.05")), class = "data.frame", row.names = c(NA, -6L))

CodePudding user response：

A possible solution, based on inner_join:

library(dplyr)

inner_join(df, df[-3], by = c("Col2" = "Col1", "Col1" = "Col2")) 

#>    Col1  Col2 P_value
#> 1 Cell1 Cell2   0.001
#> 2 Cell2 Cell1   0.050
#> 3 Cell4 Cell1   0.010
#> 4 Cell1 Cell4   0.008