Home > Software design >  Finding the Top X percentile of values and combining all values below that percentile into an Others
Finding the Top X percentile of values and combining all values below that percentile into an Others

Time:06-16

I'm new to coding in R and I am having some trouble. I am trying to find the values in a column above a certain percentile (X%ile) per group, and then combining all rows below that percentile into a row others for every group.

My situation is very similar to the question here: How to use fct_lump() to get the top n levels by group and put the rest in 'other'?

Where I am grouping by two columns and trying to add rows in the second and third columns which add the name "Others" in the second column and sums all values below a percentile in the third column in the same row.

I am working with a large dataframe (df) where I have the following columns: Year (class = Integer, ie 2007, 2008, ...), SciName (class = character), and Flowers (class = numeric)

I am able to filter and only show rows with a value above a certain percentile using:

df_filter <- df %>%
filter(Flowers > quantile(Flowers, 0.7))
view(df_filter)

However, I have not been able to find a way to add the others row I need

Following the accepted answer from the similar question I linked above, I have tried:

df_Others <- df %>%
  ungroup() %>%
  group_by(Year) %>%
  arrange(desc(Flowers)) %>%
  mutate(a = row_number(-Flowers)) %>%
  mutate(SciName = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(SciName))) %>%
  mutate(a = case_when(a < (quantile(df$Flowers, 0.7)) ~ "Others", TRUE ~ as.character(a))) %>%
  group_by(Year, SciName, a) %>%
  summarize(Flowers = sum(Flowers)) %>%
  arrange(Year, a) %>%
  select(-a)

View(df_Others)

...but this does not work

Any suggestions on how to do this would be greatly appreciated!

EDIT:

Input:

Year    SciName    Flowers
2004    Liliac     2000
2004    Rose       3000
2004    Daisy      10
2004    Lily       5
2005    Liliac     20
2005    Rose       3
2005    Daisy      1000
2005    Lily       5000
...     ...        ...
 

Expected Output:
Year    SciName    Flowers
2004    Liliac     2000
2004    Rose       3000
2004    Others      15
2005    Daisy      1000
2005    Lily       5000
2005    Others     23
...     ...        ...
 

CodePudding user response:

Without your input data and your expected output, it is difficult to determine why your current approach is not working. I recommend something like the following:

library(dplyr)
# access iris, a build in dataset
data(iris)

df = iris %>%
  # all the groups that you want to work within should be listed here
  group_by(Species) %>%
  # this creates a new column containing the threshold for each group
  # replace Sepal.Length with the numeric column you want the percentile of
  mutate(threshold = quantile(Sepal.Length, 0.7)) %>%
  # in case of duplicate values in Sepal.Length
  # creating row-numbers to avoid merging them
  mutate(rn = row_number()) %>%
  # setup variable that contains the desired output groups
  mutate(new_group = ifelse(Sepal.Length < threshold, "other", rn))

# pause here to inspect and confirm the new_group column gives the preferred groups

df = df  %>%
  group_by(Species, new_group) %>%
  # summarise values as desired
  summarise(num = n()) %>%
  select(Species, Sepal.Length, num)

EDIT: In response to your input structure and expected output, I recommend the following:

output = df %>%
  group_by(Year) %>%
  mutate(threshold = quantile(Flowers, 0.7)) %>%
  mutate(new_group = ifelse(Flowers < threshold, "Others", SciName)) %>%
  group_by(Year, new_group) %>%
  summarise(Flowers = sum(Flowers)) %>%
  select(Year, SciName = new_group, Flowers)

As I am not using row numbers here, I am assuming SciName identities are unique.

  • Related