Show percentiles of Variable A, while the classification of percentiles is based on Variable B

I have a dataset that looks like the following:

INCOME	WEALTH
10.000	100000
15.000	111000
14.200	123456
12.654	654321

I have many more rows.

I now want to find out how much INCOME a household in a specific WEALTH percentile has. The following quantiles are relevant:

c(0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99)

I have always used the following code to get specific percentile values:

a <- quantile(WEALTH, probs = c(0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99))

But now I want to base my percentiles on WEALTH but get the respective INCOME. I have tried the following code but the results are not plausible:

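library(dplyr)

# Assign each household to one of 100 WEALTH bins
df$percentile <- ntile(df$WEALTH, 100)
# Keep only the bins of interest
df <- df[df$percentile %in% c(1,5,10,25,50,75,90,95,99), ]

# Highest INCOME within each retained WEALTH bin
a <- df %>%
  group_by(percentile) %>%
  summarise(max = max(INCOME))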

The results that I get are not consistent with other parts of the analysis that I have done. I assume that the percentiles returned by the "quantile" function are calculated differently than by simply taking the maximum.

CodePudding user response：

I'm not sure I understood your question correctly, but quantile() offers several calculation methods, selected with its type argument. I, for example, always use type 6, since that is what I was taught in my statistics courses.

type: an integer between 1 and 9 selecting one of the nine quantile algorithms detailed below to be used.

You can read more about the different types on the help page for quantile, which you can open with ?quantile.
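As a rough illustration of how much the type can matter on small samples, here is a minimal sketch using only the four WEALTH values shown in the question. The variable names wealth, q7 and q6 are just for this example; type 7 is the default of quantile().

```r
# WEALTH values from the four example rows in the question
wealth <- c(100000, 111000, 123456, 654321)

# 25% quantile with the default algorithm (type = 7)
q7 <- quantile(wealth, probs = 0.25)

# Same quantile with type = 6
q6 <- quantile(wealth, probs = 0.25, type = 6)

print(q7)  # 108250
print(q6)  # 102750
```

With many rows the different types usually give very similar values, so this alone may not explain a large discrepancy.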

CodePudding user response：

If you have fewer than 100 rows in your dataset, dplyr::ntile(x, 100) won’t yield accurate percentiles; it will only give you bins numbered from 1 up to the total number of rows:

library(dplyr)

df %>% 
  mutate(percentile = ntile(WEALTH, 100))

# A tibble: 4 × 3
  INCOME WEALTH percentile
   <dbl>  <dbl>      <int>
1   10   100000          1
2   15   111000          2
3   14.2 123456          3
4   12.7 654321          4

To get true percentiles, you can rescale the result, manually or with scales::rescale():

library(scales)

df %>% 
  mutate(percentile = rescale(
    ntile(WEALTH, 100),
    c(1, 100)
  ))

# A tibble: 4 × 3
  INCOME WEALTH percentile
   <dbl>  <dbl>      <dbl>
1   10   100000          1
2   15   111000         34
3   14.2 123456         67
4   12.7 654321        100
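
For the original question with many rows, one possible approach is to bin households by WEALTH and then summarise INCOME inside each bin with the median instead of the maximum, since the maximum picks the most extreme household in the bin. The following is only a sketch under assumptions: the data is simulated (it is not the asker's dataset), the relationship between INCOME and WEALTH is made up, and the names n_rows, sim, median_income, max_income and households are hypothetical.

```r
library(dplyr)

# Simulated data for illustration only (assumption, not the real dataset)
set.seed(1)
n_rows <- 10000
WEALTH <- rlnorm(n_rows, meanlog = 11, sdlog = 1)
INCOME <- 10 + 2 * log(WEALTH) + rnorm(n_rows)
sim <- data.frame(INCOME = INCOME, WEALTH = WEALTH)

# WEALTH percentiles of interest
probs <- c(1, 5, 10, 25, 50, 75, 90, 95, 99)

income_by_wealth_pct <- sim %>%
  mutate(percentile = ntile(WEALTH, 100)) %>%  # 100 equal-sized WEALTH bins
  filter(percentile %in% probs) %>%            # keep the bins of interest
  group_by(percentile) %>%
  summarise(
    median_income = median(INCOME),  # typical INCOME in the WEALTH bin
    max_income    = max(INCOME),     # what the attempt in the question reports
    households    = n()              # number of rows in the bin
  )

print(income_by_wealth_pct)
```

Comparing median_income with max_income shows how far the maximum can sit above a typical household in the same WEALTH bin. Whether the median, the mean, or something else is the right summary depends on what the other parts of the analysis use.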