I have a dataset of analysis in the form:
| Compound | Concentration |SampleID|
There are 200 sample IDs, with 9000 unique Compounds giving a df with 63,000 rows. (not every compound is found in every sample)
What I would like to do is take the ten most frequently occurring compounds and create a subset so I can graph their concentrations using a boxplot or similar
I've tried using the below, but this leads to an error, and also only filters the top ten (so they're all the same compound)
df %>%
arrange(desc(df$Concentration)) %>%
slice(1:10, preserve=T) %>%
ggplot(., aes(x=df$Compound,y=df$Concentration))
geom_point()
theme(axis.text.x = element_text(angle = 45, hjust = 1))
labs(x="Compound", y="Frequency")
My other thought would be
arrange(desc(as.data.frame(table(df$Compound)))) %>%
slice(1:50, preserve=T) %>%
ggplot(., aes(x=df$Compound,y=df$Concentration))
geom_point()
theme(axis.text.x = element_text(angle = 45, hjust = 1))
labs(x="Compound", y="Frequency")
Neither work. I feel like I need to make a df with a list of the top 10, then filter my df to give dftop and then subset my df to just those elements
Can anyone help simplify this?
CodePudding user response:
This data.table
approach identifies the top 10, and uses a join:
library(data.table)
setDT(df)[df[,.N,Compounds][order(-N)][1:10],on="Compounds"]
The equivalent in tidyverse is:
inner_join(df, count(df,Compounds,sort = T)[1:10])
On a 63K dataset with informal benchmarking, I find the data.table approach to be about 20 times faster.
CodePudding user response:
Create a vector of the top 10 compounds, then filter the dataframe based on this vector. Illustration using most frequent homeworlds in dplyr::starwars
:
library(dplyr)
top10 <- starwars %>%
count(homeworld, sort = TRUE) %>%
head(10) %>%
pull(homeworld)
starwars %>%
filter(homeworld %in% top10)
# A tibble: 48 × 14
name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
7 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
8 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
9 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
10 Anakin Sky… 188 84 blond fair blue 41.9 male mascu… Tatooi…
# … with 38 more rows, 4 more variables: species <chr>, films <list>,
# vehicles <list>, starships <list>, and abbreviated variable names
# ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld