Home > Mobile >  using dplyr and pipes to select rows from unique entries
using dplyr and pipes to select rows from unique entries

Time:06-30

I would like to use dplyr syntax for subset a data frame based on unique variable in one column. This seems simple and I have read a few other similar posts, but for some reason I am not getting correct results. Any help is appreciated.

The head of my dataset is below. So basically I would like to select the row 3 with "Bacteria_unclassified" as it is only found in the 'manual' method and not the 'automated'.

> head(gen_gr1_byMeth) 
# A tibble: 6 × 3
# Groups:   Rank6 [4]
  Rank6                 method    avg_abundance
  <chr>                 <chr>             <dbl>
1 Alistipes             automated        0.0358
2 Alistipes             manual           0.0341
3 Bacteria_unclassified manual           0.0106
4 Bacteroides           automated        0.245 
5 Bacteroides           manual           0.270 
6 Blautia               automated        0.0462

Below is the simple script I have tried. But it produces a table with has many of the rows (based on Rank6) that were found in manual and automated.

gen_gr1_byMeth %>%
  distinct(Rank6,.keep_all = TRUE) 

However, this produces which includes a table with many of the rows (based on Rank6) that were found in manual and automated.

> gen_gr1_byMeth %>%
    distinct(Rank6,.keep_all = TRUE) 
# A tibble: 15 × 3
# Groups:   Rank6 [15]
   Rank6                         method    avg_abundance
   <chr>                         <chr>             <dbl>
 1 Alistipes                     automated        0.0358
 2 Bacteria_unclassified         manual           0.0106
 3 Bacteroides                   automated        0.245 
 4 Blautia                       automated        0.0462
 5 CAG-352                       automated        0.0506
 6 Clostridia_unclassified       automated        0.0282
 7 Faecalibacterium              automated        0.195 
 8 Lachnospiraceae_ge            automated        0.0166
 9 Lachnospiraceae_unclassified  automated        0.174 
10 NK4A214_group                 automated        0.0217
11 Oscillospiraceae_unclassified automated        0.0180
12 Oscillospirales_ge            automated        0.0144
13 Paraclostridium               automated        0.0111
14 Ruminococcus                  automated        0.0171
15 UCG-005                       automated        0.0215

For reproducibility I have put the whole dataset below:

> dput(gen_gr1_byMeth)
structure(list(Rank6 = c("Alistipes", "Alistipes", "Bacteria_unclassified", 
"Bacteroides", "Bacteroides", "Blautia", "Blautia", "CAG-352", 
"CAG-352", "Clostridia_unclassified", "Clostridia_unclassified", 
"Faecalibacterium", "Faecalibacterium", "Lachnospiraceae_ge", 
"Lachnospiraceae_unclassified", "Lachnospiraceae_unclassified", 
"NK4A214_group", "NK4A214_group", "Oscillospiraceae_unclassified", 
"Oscillospiraceae_unclassified", "Oscillospirales_ge", "Oscillospirales_ge", 
"Paraclostridium", "Ruminococcus", "Ruminococcus", "UCG-005", 
"UCG-005"), method = c("automated", "manual", "manual", "automated", 
"manual", "automated", "manual", "automated", "manual", "automated", 
"manual", "automated", "manual", "automated", "automated", "manual", 
"automated", "manual", "automated", "manual", "automated", "manual", 
"automated", "automated", "manual", "automated", "manual"), avg_abundance = c(0.0357965743806507, 
0.0340890659276923, 0.0105762557139674, 0.244734155186976, 0.269538498829569, 
0.0461962074026682, 0.0169201443577588, 0.050612656423668, 0.019822919435038, 
0.0282244950866087, 0.0310652000945921, 0.194521227486183, 0.21300935245232, 
0.0166445120721528, 0.174293792702295, 0.211302558495368, 0.0217093848542542, 
0.020169671439762, 0.0179923614027502, 0.0272722839880321, 0.0143540383884467, 
0.0111018468865368, 0.0111410776036011, 0.0170589412004154, 0.0255666306413381, 
0.0214549726838332, 0.0228852703991436)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -27L), groups = structure(list(
    Rank6 = c("Alistipes", "Bacteria_unclassified", "Bacteroides", 
    "Blautia", "CAG-352", "Clostridia_unclassified", "Faecalibacterium", 
    "Lachnospiraceae_ge", "Lachnospiraceae_unclassified", "NK4A214_group", 
    "Oscillospiraceae_unclassified", "Oscillospirales_ge", "Paraclostridium", 
    "Ruminococcus", "UCG-005"), .rows = structure(list(1:2, 3L, 
        4:5, 6:7, 8:9, 10:11, 12:13, 14L, 15:16, 17:18, 19:20, 
        21:22, 23L, 24:25, 26:27), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -15L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE))

CodePudding user response:

A possible solution:

library(dplyr)

df %>% 
  add_count(Rank6) %>% 
  ungroup %>% 
  filter(n == 1) %>% 
  select(-n)

#> # A tibble: 3 x 3
#>   Rank6                 method    avg_abundance
#>   <chr>                 <chr>             <dbl>
#> 1 Bacteria_unclassified manual           0.0106
#> 2 Lachnospiraceae_ge    automated        0.0166
#> 3 Paraclostridium       automated        0.0111

CodePudding user response:

If you want to select rows where Rank6 == "Bacteria_unclassified" and method == "manual", you can use filter and the & operator to concatenate the conditions:

df %>%
  filter(Rank6 == "Bacteria_unclassified" & method == "manual")
# A tibble: 1 × 3
# Groups:   Rank6 [1]
  Rank6                 method avg_abundance
  <chr>                 <chr>          <dbl>
1 Bacteria_unclassified manual        0.0106
  • Related