I have the following dataframe:
test <- tibble(
period = c(
'2019_q1',
'2019_q1',
'2019_q1',
'2019_q1',
'2019_q1',
'2019_q2',
'2019_q2',
'2019_q2',
'2019_q2',
'2019_q2',
'2019_q2',
'2019_q2',
'2019_q2'
),
company = c(
'google',
'facebook',
'amazon',
'ebay',
'wikipedia',
'google',
'youtube',
'amazon',
'wikipedia',
'yelp',
'yahoo',
'tide',
'target'
),
source = c('website',
'website',
'website',
'website',
'website',
'phone',
'phone',
'phone',
'phone',
'phone',
'phone',
'phone',
'phone'),
values = c(10,
20,
30,
50,
90,
6,
12,
45,
52,
80,
92,
8,
17)
)
I would like to calculate the percentiles for this dataframe using the following grouping:
group_by(period, source)
However, whenever I try to do this with the code below, I get the following error message:
Error in mutate(): ! Problem while computing percentile =
quantile(values, probs = seq(0, 1, 0.25)). x percentile must be size 1, not 5. i The error occurred in group 1: period = "2019_q1",
source = "website".
When using this code:
test %>%
group_by(period, source) %>%
arrange(period, source) %>%
filter(!is.na(values)) %>%
mutate(percentile = quantile(values, probs = seq(0,1,0.25)))
I'm trying to find out what the percentile is for each of these values by the grouping of period and source.
e.g. for the first grouping the new column would look like this:
period | company | values | percentile
---|---|---|---
2019_q1 | google | 10 | 25
2019_q1 | facebook | 20 | 25
2019_q1 | amazon | 30 | 50
2019_q1 | ebay | 50 | 50
2019_q1 | wikipedia | 90 | 75
Update: the grouping is by both the period and source columns.
CodePudding user response:
To find the percentile of each value, you shouldn't use the quantile function but rather its inverse, the cumulative distribution function (CDF).
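# Group by both period and source; ave() returns results in the original row order
grp <- interaction(test$period, test$source, drop = TRUE)
test$percentile <- ave(test$values, grp, FUN = function(x) {
f <- ecdf(x)  # empirical CDF of this group's values
f(x) * 100    # share of the group's values <= x, as a percentage
})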
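Since the question already uses dplyr, the same idea can probably be written as a grouped mutate. This is only a minimal sketch, assuming dplyr is installed; the data frame `df`, the helper variable `n_quantiles`, and the name `result` are mine, and `df` just re-creates the question's values in compact form (without the company column).

```r
library(dplyr)

# Compact re-creation of the question's data (same values, same row order)
df <- tibble(
  period = rep(c("2019_q1", "2019_q2"), times = c(5, 8)),
  source = rep(c("website", "phone"), times = c(5, 8)),
  values = c(10, 20, 30, 50, 90, 6, 12, 45, 52, 80, 92, 8, 17)
)

# quantile() returns one value per requested probability (5 here),
# not one value per row, so it does not line up with the rows of a group
n_quantiles <- length(quantile(df$values, probs = seq(0, 1, 0.25)))

result <- df %>%
  filter(!is.na(values)) %>%
  group_by(period, source) %>%
  # ecdf(values) builds the group's empirical CDF; evaluating it at the
  # same values gives exactly one number per row
  mutate(percentile = ecdf(values)(values) * 100) %>%
  ungroup()

print(result)
```

For the first group (2019_q1, website) this gives 20, 40, 60, 80 and 100, i.e. the percentage of values in the group that are less than or equal to each value.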
Here I used an empirical distribution function (ecdf), but if you have any assumptions about the distribution of the values, you could first fit the appropriate distribution to the data and then use its parameters in a distribution-specific CDF (pnorm, pbeta, etc.). Note that the d-prefixed functions (dnorm, dbeta) are densities, not CDFs.
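As an illustration of the parametric route, here is a rough sketch that assumes, purely for the sake of example, that the values in each group are normally distributed. The normal is fitted with the group mean and standard deviation, and `pnorm` (the normal CDF in R) is then evaluated at each value. The names `df`, `result_norm` and `percentile_norm` are mine, and nothing in the question suggests that normality actually holds for this data.

```r
library(dplyr)

# Compact re-creation of the question's data (same values, same row order)
df <- tibble(
  period = rep(c("2019_q1", "2019_q2"), times = c(5, 8)),
  source = rep(c("website", "phone"), times = c(5, 8)),
  values = c(10, 20, 30, 50, 90, 6, 12, 45, 52, 80, 92, 8, 17)
)

result_norm <- df %>%
  group_by(period, source) %>%
  # ASSUMPTION: values within a group are roughly normal.
  # Fit by the group's mean and sd, then evaluate the normal CDF.
  mutate(
    percentile_norm = pnorm(values, mean = mean(values), sd = sd(values)) * 100
  ) %>%
  ungroup()

print(result_norm)
```

Unlike the ecdf version, the largest value in a group does not get 100 here, because the fitted distribution leaves some probability mass above it.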