How can I subset the top 10 by frequency of a 63,000 row dataset in R-CodePudding

I have a dataset of analysis in the form:

| Compound | Concentration |SampleID|

There are 200 sample IDs, with 9000 unique Compounds giving a df with 63,000 rows. (not every compound is found in every sample)

What I would like to do is take the ten most frequently occurring compounds and create a subset so I can graph their concentrations using a boxplot or similar

I've tried using the below, but this leads to an error, and also only filters the top ten (so they're all the same compound)

df %>% 
  arrange(desc(df$Concentration)) %>%
  slice(1:10, preserve=T) %>%
  ggplot(., aes(x=df$Compound,y=df$Concentration)) 
  geom_point() 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
  labs(x="Compound", y="Frequency")

My other thought would be

  arrange(desc(as.data.frame(table(df$Compound)))) %>%
  slice(1:50, preserve=T) %>%
  ggplot(., aes(x=df$Compound,y=df$Concentration)) 
  geom_point() 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
  labs(x="Compound", y="Frequency")

Neither work. I feel like I need to make a df with a list of the top 10, then filter my df to give dftop and then subset my df to just those elements

Can anyone help simplify this?

CodePudding user response：

This data.table approach identifies the top 10, and uses a join:

library(data.table)
setDT(df)[df[,.N,Compounds][order(-N)][1:10],on="Compounds"]

The equivalent in tidyverse is:

inner_join(df, count(df,Compounds,sort = T)[1:10])

On a 63K dataset with informal benchmarking, I find the data.table approach to be about 20 times faster.

CodePudding user response：

Create a vector of the top 10 compounds, then filter the dataframe based on this vector. Illustration using most frequent homeworlds in dplyr::starwars:

library(dplyr)

top10 <- starwars %>% 
  count(homeworld, sort = TRUE) %>%
  head(10) %>%
  pull(homeworld)

starwars %>%
  filter(homeworld %in% top10)

# A tibble: 48 × 14
   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
 1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
 2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
 3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
 4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
 5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
 6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
 7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
 8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
 9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
10 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
# … with 38 more rows, 4 more variables: species <chr>, films <list>,
#   vehicles <list>, starships <list>, and abbreviated variable names
#   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld