How to calculate percentile in R using group by?-CodePudding

I have the following dataframe:

test <- tibble(
  period = c(
    '2019_q1',
    '2019_q1',
    '2019_q1',
    '2019_q1',
    '2019_q1',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2',
    '2019_q2'
    
  ),
  company = c(
    'google',
    'facebook',
    'amazon',
    'ebay',
    'wikipedia',
    'google',
    'youtube',
    'amazon',
    'wikipedia',
    'yelp',
    'yahoo',
    'tide',
    'target'
  ),
  source = c('website',
             'website',
             'website',
             'website',
             'website',
             'phone',
             'phone',
             'phone',
             'phone',
             'phone',
             'phone',
             'phone',
             'phone'),
  
  values = c(10,
             20,
             30,
             50,
             90,
             6,
             12,
             45,
             52,
             80,
             92,
             8,
             17)
)

I would like to calculate the percentiles for this dataframe using the following grouping:

group_by(period, source)

However whenever i try and do this with the following code i get the error message:

Error in mutate(): ! Problem while computing percentile =
quantile(values, probs = seq(0, 1, 0.25)). x percentile must be size 1, not 5. i The error occurred in group 1: period = "2019_q1",
source = "website".

When using this code:

test %>%    
    group_by(period, source) %>%     
    arrange(period, source) %>%    
    filter(!is.na(values)) %>%    
    mutate(percentile = quantile(values, probs = seq(0,1,0.25)))

I'm trying to find out what the percentile is for each of these values by the grouping of period and source.

e.g. for the first grouping the new column would look like this:

period	company	values	percentile
2019_q1	google	10	25
2019_q1	facebook	20	25
2019_q1	amazon	30	50
2019_q1	ebay	10	50
2019_q1	wikipedia	90	75

new update = there is grouping by both period and source columns

CodePudding user response：

To find the percentiles you shouldn't use the quantile function but rather its inverse – cumulative distribution function (CDF).

test$percentile <- unlist(tapply(test$values, test$period, function(x) {
  f <- ecdf(x)
  f(x)*100
}))

Here I used an empirical distribution function (ecdf) but if you have any assumptions on the distribution of the values, you could first fit the correct distribution to the data and then use its parameters in a distributions-specific CDF (dnorm, dbeta, etc.).