I would like to use dplyr syntax for subset a data frame based on unique variable in one column. This seems simple and I have read a few other similar posts, but for some reason I am not getting correct results. Any help is appreciated.
The head of my dataset is below. So basically I would like to select the row 3 with "Bacteria_unclassified" as it is only found in the 'manual' method and not the 'automated'.
> head(gen_gr1_byMeth)
# A tibble: 6 × 3
# Groups: Rank6 [4]
Rank6 method avg_abundance
<chr> <chr> <dbl>
1 Alistipes automated 0.0358
2 Alistipes manual 0.0341
3 Bacteria_unclassified manual 0.0106
4 Bacteroides automated 0.245
5 Bacteroides manual 0.270
6 Blautia automated 0.0462
Below is the simple script I have tried. But it produces a table with has many of the rows (based on Rank6) that were found in manual and automated.
gen_gr1_byMeth %>%
distinct(Rank6,.keep_all = TRUE)
However, this produces which includes a table with many of the rows (based on Rank6) that were found in manual and automated.
> gen_gr1_byMeth %>%
distinct(Rank6,.keep_all = TRUE)
# A tibble: 15 × 3
# Groups: Rank6 [15]
Rank6 method avg_abundance
<chr> <chr> <dbl>
1 Alistipes automated 0.0358
2 Bacteria_unclassified manual 0.0106
3 Bacteroides automated 0.245
4 Blautia automated 0.0462
5 CAG-352 automated 0.0506
6 Clostridia_unclassified automated 0.0282
7 Faecalibacterium automated 0.195
8 Lachnospiraceae_ge automated 0.0166
9 Lachnospiraceae_unclassified automated 0.174
10 NK4A214_group automated 0.0217
11 Oscillospiraceae_unclassified automated 0.0180
12 Oscillospirales_ge automated 0.0144
13 Paraclostridium automated 0.0111
14 Ruminococcus automated 0.0171
15 UCG-005 automated 0.0215
For reproducibility I have put the whole dataset below:
> dput(gen_gr1_byMeth)
structure(list(Rank6 = c("Alistipes", "Alistipes", "Bacteria_unclassified",
"Bacteroides", "Bacteroides", "Blautia", "Blautia", "CAG-352",
"CAG-352", "Clostridia_unclassified", "Clostridia_unclassified",
"Faecalibacterium", "Faecalibacterium", "Lachnospiraceae_ge",
"Lachnospiraceae_unclassified", "Lachnospiraceae_unclassified",
"NK4A214_group", "NK4A214_group", "Oscillospiraceae_unclassified",
"Oscillospiraceae_unclassified", "Oscillospirales_ge", "Oscillospirales_ge",
"Paraclostridium", "Ruminococcus", "Ruminococcus", "UCG-005",
"UCG-005"), method = c("automated", "manual", "manual", "automated",
"manual", "automated", "manual", "automated", "manual", "automated",
"manual", "automated", "manual", "automated", "automated", "manual",
"automated", "manual", "automated", "manual", "automated", "manual",
"automated", "automated", "manual", "automated", "manual"), avg_abundance = c(0.0357965743806507,
0.0340890659276923, 0.0105762557139674, 0.244734155186976, 0.269538498829569,
0.0461962074026682, 0.0169201443577588, 0.050612656423668, 0.019822919435038,
0.0282244950866087, 0.0310652000945921, 0.194521227486183, 0.21300935245232,
0.0166445120721528, 0.174293792702295, 0.211302558495368, 0.0217093848542542,
0.020169671439762, 0.0179923614027502, 0.0272722839880321, 0.0143540383884467,
0.0111018468865368, 0.0111410776036011, 0.0170589412004154, 0.0255666306413381,
0.0214549726838332, 0.0228852703991436)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -27L), groups = structure(list(
Rank6 = c("Alistipes", "Bacteria_unclassified", "Bacteroides",
"Blautia", "CAG-352", "Clostridia_unclassified", "Faecalibacterium",
"Lachnospiraceae_ge", "Lachnospiraceae_unclassified", "NK4A214_group",
"Oscillospiraceae_unclassified", "Oscillospirales_ge", "Paraclostridium",
"Ruminococcus", "UCG-005"), .rows = structure(list(1:2, 3L,
4:5, 6:7, 8:9, 10:11, 12:13, 14L, 15:16, 17:18, 19:20,
21:22, 23L, 24:25, 26:27), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
CodePudding user response:
A possible solution:
library(dplyr)
df %>%
add_count(Rank6) %>%
ungroup %>%
filter(n == 1) %>%
select(-n)
#> # A tibble: 3 x 3
#> Rank6 method avg_abundance
#> <chr> <chr> <dbl>
#> 1 Bacteria_unclassified manual 0.0106
#> 2 Lachnospiraceae_ge automated 0.0166
#> 3 Paraclostridium automated 0.0111
CodePudding user response:
If you want to select rows where Rank6 == "Bacteria_unclassified"
and method == "manual"
, you can use filter
and the &
operator to concatenate the conditions:
df %>%
filter(Rank6 == "Bacteria_unclassified" & method == "manual")
# A tibble: 1 × 3
# Groups: Rank6 [1]
Rank6 method avg_abundance
<chr> <chr> <dbl>
1 Bacteria_unclassified manual 0.0106