How to find the percentile for each observation in a data frame in R?-CodePudding

Suppose we have a simple data frame:

structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11), dim = c(6L, 
2L), dimnames = list(NULL, c("lo", "li")))

How can I find the percentile for each observation for both variables?

CodePudding user response：

Here is a dplyr approach to get the median, 5- and 95% quantiles.

library(tidyverse)
data = structure(c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11), dim = c(6L, 
                                                           2L), dimnames = list(NULL, c("lo", "li")))

data %>% 
  as.data.frame() %>% # Coerce to dataframe
  pivot_longer(cols = everything()) %>%  # Pivot to long format
  group_by(name) %>% # For each unique group..
  summarise(perc5 = quantile(value, 0.05), # Calculate 5% quantile
            median = median(value), # Calculate median
            perc95 = quantile(value, 0.95)) # Calculate 95% quantile

#> # A tibble: 2 × 4
#>   name  perc5 median perc95
#>   <chr> <dbl>  <dbl>  <dbl>
#> 1 li     2.5     7     53  
#> 2 lo     1.25    4.5    7.5

^{Created on 2023-01-27 with reprex v2.0.2}

CodePudding user response：

data.table solution

library(data.table)

data <- data.table(data)

q <- c(0.05, 0.95)

melt(data, measure.vars = names(data))[, setNames(as.list(quantile(value, q)), paste("q", q * 100, sep = "_")), variable]

results

variable  q_5 q_95
1:       lo 1.25  7.5
2:       li 2.50 53.0

data

data = structure(
  c(2, 4, 5, 6, 8, 1, 2, 4, 6, 67, 8, 11),
  dim = c(6L, 2L), 
  dimnames = list(NULL, c("lo", "li"))
)

CodePudding user response：

The most R friendly approach would be to (i) convert this to a dataframe (or tibble), (ii) reshape the data into long format, (iii) groupby lo and li, and (iv) calculate the percent rank.

Here's the code:

df%>%
  as_tibble() %>% # convert to dataframe
  gather(key=variable,value=value) %>% # gather into long form
  group_by(variable)%>%. # group by lo and li
  mutate(percentile=percent_rank(val)*100) # make new column

variable   val percentile
   <chr>    <dbl>      <dbl>
 1 lo           2         20
 2 lo           4         40
 3 lo           5         60
 4 lo           6         80
 5 lo           8        100
 6 lo           1          0
 7 li           2          0
 8 li           4         20
 9 li           6         40
10 li          67        100
11 li           8         60
12 li          11         80

If you don't want to make the dataframe long, just do the two columns seperately:

df%>%
  as_tibble()%>%
  mutate(lo_pr=percent_rank(lo)*100)%>%
  mutate(li_percentile=percent_rank(li)*100)


lo    li lo_pr li_percentile
  <dbl> <dbl> <dbl>         <dbl>
1     2     2    20             0
2     4     4    40            20
3     5     6    60            40
4     6    67    80           100
5     8     8   100            60
6     1    11     0            80