I have a vector such as:
The_list=c('SP1','SP2','SP3')
And I have a data frame such as:
Names Groups
SP1 G1
SP2 G1
SP3 G1
SP1 G2
SP4 G3
SP5 G4
SP2 G5
SP3 G5
SP6 G5
SP2 G6
SP7 G6
I would like to keep only the Groups where at least 2 elements of Names are present in The_list.
Here I should get:
Names Groups
SP1 G1
SP2 G1
SP3 G1
SP2 G5
SP3 G5
SP6 G5
Here is the df, if it helps:
df <- structure(list(Names = c("SP1", "SP2", "SP3", "SP1", "SP4", "SP5",
"SP2", "SP3", "SP6", "SP2", "SP7"), Groups = c("G1", "G1", "G1",
"G2", "G3", "G4", "G5", "G5", "G5", "G6", "G6")), class = "data.frame", row.names = c(NA,
-11L))
CodePudding user response:
Using data.table:
library(data.table)
# Per group, collect row indices (.I) when at least 2 elements of The_list occur in Names
setDT(df)[df[, .I[sum(The_list %in% Names) >= 2], by = Groups]$V1]
Output:
Names Groups
<char> <char>
1: SP1 G1
2: SP2 G1
3: SP3 G1
4: SP2 G5
5: SP3 G5
6: SP6 G5
CodePudding user response:
One solution you can use is:
library(dplyr)
df |>
group_by(Groups) |>
filter(sum(Names %in% The_list) >= 2)
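Correction: Names %in% The_list counts matching rows rather than distinct names, so a group in which the same listed name appears more than once could pass the filter by mistake. Testing The_list %in% Names instead counts how many elements of The_list occur in the group: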
df |>
group_by(Groups) |>
filter(sum(The_list %in% Names) >= 2)
Names Groups
<chr> <chr>
1 SP1 G1
2 SP2 G1
3 SP3 G1
4 SP2 G5
5 SP3 G5
6 SP6 G5
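To see why the direction of %in% matters, here is a minimal base R sketch. The data frame dup_df and the groups G7 and G8 are hypothetical and not part of the question; they are made up so that one listed name is duplicated within a group. tapply is used only to print the per-group counts that each filter condition would compare against 2.

```r
# Hypothetical data (not from the question): SP1 appears twice in G7,
# while G8 contains SP2 and SP3 once each.
The_list <- c("SP1", "SP2", "SP3")
dup_df <- data.frame(
  Names  = c("SP1", "SP1", "SP2", "SP3"),
  Groups = c("G7", "G7", "G8", "G8")
)

# Counts rows whose name is in The_list, so duplicated names count twice.
rows_matched <- tapply(dup_df$Names, dup_df$Groups,
                       function(nm) sum(nm %in% The_list))

# Counts how many elements of The_list occur in the group at least once.
list_matched <- tapply(dup_df$Names, dup_df$Groups,
                       function(nm) sum(The_list %in% nm))

rows_matched  # G7 = 2, G8 = 2
list_matched  # G7 = 1, G8 = 2
```

With a threshold of 2, the first count would keep G7 even though it contains only one distinct name from The_list, while the second would keep only G8. This assumes The_list itself has no duplicated entries.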