Finding the Top X percentile of values and combining all values below that percentile into an Others-CodePudding

I'm new to coding in R and I am having some trouble. I am trying to find the values in a column above a certain percentile (X%ile) per group, and then combining all rows below that percentile into a row others for every group.

My situation is very similar to the question here: How to use fct_lump() to get the top n levels by group and put the rest in 'other'?

Where I am grouping by two columns and trying to add rows in the second and third columns which add the name "Others" in the second column and sums all values below a percentile in the third column in the same row.

I am working with a large dataframe (df) where I have the following columns: Year (class = Integer, ie 2007, 2008, ...), SciName (class = character), and Flowers (class = numeric)

I am able to filter and only show rows with a value above a certain percentile using:

df_filter <- df %>%
filter(Flowers > quantile(Flowers, 0.7))
view(df_filter)

However, I have not been able to find a way to add the others row I need

Following the accepted answer from the similar question I linked above, I have tried:

df_Others <- df %>%
  ungroup() %>%
  group_by(Year) %>%
  arrange(desc(Flowers)) %>%
  mutate(a = row_number(-Flowers)) %>%
  mutate(SciName = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(SciName))) %>%
  mutate(a = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(a))) %>%
  group_by(Year, SciName, a) %>%
  summarize(Flowers = sum(Flowers)) %>%
  arrange(Year, a) %>%
  select(-a)

View(df_Others)

...but this does not work

Any suggestions on how to do this would be greatly appreciated!

EDIT:

Input:

Year    SciName    Flowers
2004    Liliac     2000
2004    Rose       3000
2004    Daisy      10
2004    Lily       5
2005    Liliac     20
2005    Rose       3
2005    Daisy      1000
2005    Lily       5000
...     ...        ...
 

Expected Output:
Year    SciName    Flowers
2004    Liliac     2000
2004    Rose       3000
2004    Others      15
2005    Daisy      1000
2005    Lily       5000
2005    Others     23
...     ...        ...

CodePudding user response：

Without your input data and your expected output, it is difficult to determine why your current approach is not working. I recommend something like the following:

library(dplyr)
# access iris, a build in dataset
data(iris)

df = iris %>%
  # all the groups that you want to work within should be listed here
  group_by(Species) %>%
  # this creates a new column containing the threshold for each group
  # replace Sepal.Length with the numeric column you want the percentile of
  mutate(threshold = quantile(Sepal.Length, 0.7)) %>%
  # in case of duplicate values in Sepal.Length
  # creating row-numbers to avoid merging them
  mutate(rn = row_number()) %>%
  # setup variable that contains the desired output groups
  mutate(new_group = ifelse(Sepal.Length < threshold, "other", rn))

# pause here to inspect and confirm the new_group column gives the preferred groups

df = df  %>%
  group_by(Species, new_group) %>%
  # summarise values as desired
  summarise(num = n()) %>%
  select(Species, Sepal.Length, num)

EDIT: In response to your input structure and expected output, I recommend the following:

output = df %>%
  group_by(Year) %>%
  mutate(threshold = quantile(Flowers, 0.7)) %>%
  mutate(new_group = ifelse(Flowers < threshold, "Others", SciName)) %>%
  group_by(Year, new_group) %>%
  summarise(Flowers = sum(Flowers)) %>%
  select(Year, SciName = new_group, Flowers)

As I am not using row numbers here, I am assuming SciName identities are unique.