Home > Enterprise >  How to calculate percentile in R using group by?
How to calculate percentile in R using group by?

Time:06-28

I have the following dataframe:

test <- tibble(
  period = c(
    '2019_q1',
    '2019_q1',
    '2019_q1',
    '2019_q1',
    '2019_q1',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2'
    
  ),
  company = c(
    'google',
    'facebook',
    'amazon',
    'ebay',
    'wikipedia',
    'google',
    'youtube',
    'amazon',
    'wikipedia',
    'yelp',
    'yahoo',
    'tide',
    'target'
  ),
  source = c('website',
             'website',
             'website',
             'website',
             'website',
             'phone',
             'phone',
             'phone',
             'phone',
             'phone',
             'phone',
             'phone',
             'phone'),
  
  values = c(10,
             20,
             30,
             50,
             90,
             6,
             12,
             45,
             52,
             80,
             92,
             8,
             17)
)

I would like to calculate the percentiles for this dataframe using the following grouping:

group_by(period, source) 

However whenever i try and do this with the following code i get the error message:

Error in mutate(): ! Problem while computing percentile =
quantile(values, probs = seq(0, 1, 0.25)). x percentile must be size 1, not 5. i The error occurred in group 1: period = "2019_q1",
source = "website".

When using this code:

test %>%    
    group_by(period, source) %>%     
    arrange(period, source) %>%    
    filter(!is.na(values)) %>%    
    mutate(percentile = quantile(values, probs = seq(0,1,0.25)))

I'm trying to find out what the percentile is for each of these values by the grouping of period and source.

e.g. for the first grouping the new column would look like this:

period company values percentile
2019_q1 google 10 25
2019_q1 facebook 20 25
2019_q1 amazon 30 50
2019_q1 ebay 10 50
2019_q1 wikipedia 90 75

new update = there is grouping by both period and source columns

CodePudding user response:

To find the percentiles you shouldn't use the quantile function but rather its inverse – cumulative distribution function (CDF).

test$percentile <- unlist(tapply(test$values, test$period, function(x) {
  f <- ecdf(x)
  f(x)*100
}))

Here I used an empirical distribution function (ecdf) but if you have any assumptions on the distribution of the values, you could first fit the correct distribution to the data and then use its parameters in a distributions-specific CDF (dnorm, dbeta, etc.).

  • Related