Suppose we have a simple data frame:
structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11), dim = c(6L,
2L), dimnames = list(NULL, c("lo", "li")))
How can I find the percentile for each observation for both variables?
CodePudding user response:
Here is a dplyr
approach to get the median, 5- and 95% quantiles.
library(tidyverse)
data = structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11), dim = c(6L,
2L), dimnames = list(NULL, c("lo", "li")))
data %>%
as.data.frame() %>% # Coerce to dataframe
pivot_longer(cols = everything()) %>% # Pivot to long format
group_by(name) %>% # For each unique group..
summarise(perc5 = quantile(value, 0.05), # Calculate 5% quantile
median = median(value), # Calculate median
perc95 = quantile(value, 0.95)) # Calculate 95% quantile
#> # A tibble: 2 × 4
#> name perc5 median perc95
#> <chr> <dbl> <dbl> <dbl>
#> 1 li 2.5 7 53
#> 2 lo 1.25 4.5 7.5
Created on 2023-01-27 with reprex v2.0.2
CodePudding user response:
data.table solution
library(data.table)
data <- data.table(data)
q <- c(0.05, 0.95)
melt(data, measure.vars = names(data))[, setNames(as.list(quantile(value, q)), paste("q", q * 100, sep = "_")), variable]
results
variable q_5 q_95
1: lo 1.25 7.5
2: li 2.50 53.0
data
data = structure(
c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11),
dim = c(6L, 2L),
dimnames = list(NULL, c("lo", "li"))
)
CodePudding user response:
The most R friendly approach would be to (i) convert this to a dataframe (or tibble), (ii) reshape the data into long format, (iii) groupby lo and li, and (iv) calculate the percent rank.
Here's the code:
df%>%
as_tibble() %>% # convert to dataframe
gather(key=variable,value=value) %>% # gather into long form
group_by(variable)%>%. # group by lo and li
mutate(percentile=percent_rank(val)*100) # make new column
variable val percentile
<chr> <dbl> <dbl>
1 lo 2 20
2 lo 4 40
3 lo 5 60
4 lo 6 80
5 lo 8 100
6 lo 1 0
7 li 2 0
8 li 4 20
9 li 6 40
10 li 67 100
11 li 8 60
12 li 11 80
If you don't want to make the dataframe long, just do the two columns seperately:
df%>%
as_tibble()%>%
mutate(lo_pr=percent_rank(lo)*100)%>%
mutate(li_percentile=percent_rank(li)*100)
lo li lo_pr li_percentile
<dbl> <dbl> <dbl> <dbl>
1 2 2 20 0
2 4 4 40 20
3 5 6 60 40
4 6 67 80 100
5 8 8 100 60
6 1 11 0 80