I have a dataframe that looks like this
> df
Col1 Col2 P_value
Cell1 Cell2 0.001
Cell2 Cell1 0.05
Cell4 Cell1 0.01
Cell5 Cell2 0.03
Cell2 Cell3 0.008
Cell1 Cell4 0.008
I want to subset to a new dataframe, in which only strings present in BOTH Col1
and Col2
in both orders. So here, Cell1
and Cell2
, when matched, appear in both Col1 and Col2.
> df
Col1 Col2 P_value
Cell1 Cell2 0.001
Cell2 Cell1 0.05
Cell1 Cell4 0.008
Cell4 Cell1 0.01
So here, Cell1
and Cell2
, when matched, appear in both Col1 and Col2. Same for Cell1 and Cell4. This never happens for other strings.
CodePudding user response:
Perhaps this is over-simplifying, but ...
df %>%
filter(Col1 %in% Col2 & Col2 %in% Col1)
# Col1 Col2 P_value
# 1 Cell1 Cell2 <0.05
# 2 Cell2 Cell1 <0.05
# 3 Cell4 Cell1 <0.05
# 4 Cell1 Cell4 <0.05
CodePudding user response:
We may use
library(dplyr)
library(stringr)
df %>%
mutate(Col = str_c(pmin(Col1, Col2), pmax(Col1, Col2))) %>%
filter(duplicated(Col)|duplicated(Col, fromLast = TRUE)) %>%
select(-Col)
Or may do
df %>%
add_count(pmin(Col1, Col2), pmax(Col1, Col2)) %>%
filter(n > 1) %>%
select(names(df))
-output
Col1 Col2 P_value
1 Cell1 Cell2 <0.05
2 Cell2 Cell1 <0.05
3 Cell4 Cell1 <0.05
4 Cell1 Cell4 <0.05
data
df <- structure(list(Col1 = c("Cell1", "Cell2", "Cell4", "Cell5", "Cell2",
"Cell1"), Col2 = c("Cell2", "Cell1", "Cell1", "Cell2", "Cell3",
"Cell4"), P_value = c("<0.05", "<0.05", "<0.05", "<0.05", "<0.05",
"<0.05")), class = "data.frame", row.names = c(NA, -6L))
CodePudding user response:
A possible solution, based on inner_join
:
library(dplyr)
inner_join(df, df[-3], by = c("Col2" = "Col1", "Col1" = "Col2"))
#> Col1 Col2 P_value
#> 1 Cell1 Cell2 0.001
#> 2 Cell2 Cell1 0.050
#> 3 Cell4 Cell1 0.010
#> 4 Cell1 Cell4 0.008