How can I retrieve the most represented value in a column of a dataframe?

Time:12-07

I'm working on a dataframe similar to this:

df <- data.frame(seqid = c("A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "D", "D", "D"),
             value = c("100", "50", "20", "7", "7", "7", "100", "100", "50", "50", "7", "7", "100"))

I would like to get the names of the seqids in which the value 100 and the value 7 occur several times. In this case the output would be "B" and "C".

It would also be useful to have a command that lets me select the seqids in which those values occur more than n times, for example the value 100 more than 10 times and the value 7 more than 10 or 5 times.

I've already tried dplyr with group_by(seqid), and data.table, but I can't get the output I want. Any advice is welcome.

CodePudding user response:

You could do

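library(dplyr)

# value is a character column, so match against "7" and "100"
df %>%
  group_by(value, seqid) %>%
  filter(value %in% c("7", "100") & n() > 1) %>%  # pairs seen more than once
  count()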
#> # A tibble: 3 x 3
#> # Groups:   value, seqid [3]
#>   value seqid     n
#>  <chr> <chr> <int>
#> 1 100   B         2
#> 2 7     B         3
#> 3 7     D         2

Or if you just want the unique seqid values then

df %>%
  group_by(value, seqid) %>%
  filter(value %in% c("7", "100") & n() > 1) %>%
  count() %>%
  pull(seqid) %>%  # extract the seqid column as a plain vector
  unique()
#> [1] "B" "D"
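
For the "more than n times" part of the question, one option is to keep the cut-offs in a small lookup table and join it to the per-seqid counts. This is only a sketch: the names `thresholds`, `min_n`, `hits` and `all_hit` are made up for illustration, and the cut-offs (1 and 2) are chosen to fit the small example data rather than the 5 or 10 mentioned in the question.

```r
library(dplyr)

df <- data.frame(
  seqid = c("A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "D", "D", "D"),
  value = c("100", "50", "20", "7", "7", "7", "100", "100", "50", "50", "7", "7", "100"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per value of interest and the
# count it has to exceed (here: "100" more than once, "7" more than twice)
thresholds <- data.frame(
  value = c("100", "7"),
  min_n = c(1, 2),
  stringsAsFactors = FALSE
)

hits <- df %>%
  count(seqid, value) %>%                    # occurrences of each value per seqid
  inner_join(thresholds, by = "value") %>%   # keep only the values of interest
  filter(n > min_n)                          # strictly more than min_n times

# seqids where at least one listed value passes its cut-off
unique(hits$seqid)
#> [1] "B"

# seqids where every listed value passes its cut-off
all_hit <- hits %>%
  group_by(seqid) %>%
  filter(n_distinct(value) == nrow(thresholds)) %>%
  pull(seqid) %>%
  unique()
all_hit
#> [1] "B"
```

Setting both `min_n` entries to 1 gives the same "B" "D" as the answer above, and changing them to 5 or 10 applies the stricter cut-offs from the question to the real data.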