I have a long tibble with 3 columns. Example:
Spectrum | Mass | Intensity |
---|---|---|
Spectrum A | 14.1 | 5 |
Spectrum A | 14.2 | 3 |
Spectrum A | 14.1 | 20 |
Spectrum A | 20.3 | 7 |
Spectrum B | 11.7 | 4 |
Spectrum B | 13.4 | 5 |
Spectrum B | 33.0 | 5 |
Spectrum B | 33.1 | 10 |
… | … | … |
Spectrum Z | n | n |
For each Spectrum, I wish to round the Mass values to the nearest integer. Then, if there are multiple entries of the same integer Mass, only keep the row with the highest Intensity value.
Example desired output:
Spectrum | Mass | Intensity |
---|---|---|
Spectrum A | 14 | 20 |
Spectrum A | 20 | 7 |
Spectrum B | 12 | 4 |
Spectrum B | 13 | 5 |
Spectrum B | 33 | 10 |
… | … | … |
Spectrum Z | n | n |
To do this, I thought this code should work:
df_binned <- df %>%
mutate(Mass = round(Mass, 0)) %>%
group_by(Spectrum, Mass) %>%
summarise(Intensity = max(Intensity)) %>%
ungroup()
Unfortunately, the actual output's Intensity is almost fully NAs. For many Spectrums, it's 100% NAs. I have verified that the Mass and Intensity columns are indeed numerics, not characters.
Does anyone know why this is happening? And/or how can I implement this "binning" transformation?
CodePudding user response:
You need to do
df %>%
mutate(Mass = round(Mass, 0)) %>%
group_by(Spectrum, Mass) %>%
summarise(Intensity = max(Intensity, na.rm = TRUE)) %>%
ungroup()
Otherwise even a single missing value in a group will cause the whole group to be NA