Home > Enterprise >  How can I subset the top 10 by frequency of a 63,000 row dataset in R
How can I subset the top 10 by frequency of a 63,000 row dataset in R

Time:11-01

I have a dataset of analysis in the form:

| Compound | Concentration |SampleID|

There are 200 sample IDs, with 9000 unique Compounds giving a df with 63,000 rows. (not every compound is found in every sample)

What I would like to do is take the ten most frequently occurring compounds and create a subset so I can graph their concentrations using a boxplot or similar

I've tried using the below, but this leads to an error, and also only filters the top ten (so they're all the same compound)

df %>% 
  arrange(desc(df$Concentration)) %>%
  slice(1:10, preserve=T) %>%
  ggplot(., aes(x=df$Compound,y=df$Concentration)) 
  geom_point() 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
  labs(x="Compound", y="Frequency")

My other thought would be

  arrange(desc(as.data.frame(table(df$Compound)))) %>%
  slice(1:50, preserve=T) %>%
  ggplot(., aes(x=df$Compound,y=df$Concentration)) 
  geom_point() 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
  labs(x="Compound", y="Frequency")

Neither work. I feel like I need to make a df with a list of the top 10, then filter my df to give dftop and then subset my df to just those elements

Can anyone help simplify this?

CodePudding user response:

This data.table approach identifies the top 10, and uses a join:

library(data.table)
setDT(df)[df[,.N,Compounds][order(-N)][1:10],on="Compounds"]

The equivalent in tidyverse is:

inner_join(df, count(df,Compounds,sort = T)[1:10])

On a 63K dataset with informal benchmarking, I find the data.table approach to be about 20 times faster.

CodePudding user response:

Create a vector of the top 10 compounds, then filter the dataframe based on this vector. Illustration using most frequent homeworlds in dplyr::starwars:

library(dplyr)

top10 <- starwars %>% 
  count(homeworld, sort = TRUE) %>%
  head(10) %>%
  pull(homeworld)

starwars %>%
  filter(homeworld %in% top10)
# A tibble: 48 × 14
   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
 1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
 2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
 3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
 4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
 5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
 6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
 7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
 8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
 9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
10 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
# … with 38 more rows, 4 more variables: species <chr>, films <list>,
#   vehicles <list>, starships <list>, and abbreviated variable names
#   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
  • Related