Show percentiles of Variable A, while the classification of percentiles is based on Variable B

Time:11-28

I have a dataset that looks like the following:

INCOME WEALTH
10.000 100000
15.000 111000
14.200 123456
12.654 654321

I have many more rows.

I now want to find how much INCOME a household in a specific WEALTH percentile has. The following quantiles are relevant:

c(0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99)

I have always used the following code to get specific percentile values:

a <- quantile(WEALTH, probs = c(0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99))

But now I want to base my percentiles on WEALTH but get the respective INCOME. I have tried the following code but the results are not plausible:

df$percentile = ntile(df$WEALTH,100)
df <- df[df$percentile %in% c(1,5,10,25,50,75,90,95,99), ]

a <- df %>% 
  group_by(percentile) %>% 
  summarise(max = max(INCOME))

The results that I get are not consistent with other parts of the analysis that I have done. I assume that percentiles from the "quantile" function are calculated differently than by simply taking the maximum.

CodePudding user response:

I'm not sure if I understood your question correctly, but quantile() has different methods of calculation. I, for example, always go for type 6, since this is what I was taught in my stats courses.

type: an integer between 1 and 9 selecting one of the nine quantile algorithms detailed below to be used.

Read more about the different types by running ?quantile (the help page for quantile).
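
As a rough illustration of how the type argument changes the result, here is a minimal base-R sketch using only the four sample rows from the question. The lookup of INCOME via type = 1 is an assumption about what the question is after, not a method confirmed in the thread; the names q1, q6, q7, income_at_q and result are made up for this example.

```r
# Sample data copied from the question (only the four rows shown there).
df <- data.frame(
  INCOME = c(10.000, 15.000, 14.200, 12.654),
  WEALTH = c(100000, 111000, 123456, 654321)
)
probs <- c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99)

# The same probabilities under two interpolating definitions.
q7 <- quantile(df$WEALTH, probs = probs, type = 7)  # type 7 is R's default
q6 <- quantile(df$WEALTH, probs = probs, type = 6)

# type = 1 uses the inverse of the empirical CDF without interpolation,
# so every returned quantile is a WEALTH value that occurs in the data.
q1 <- quantile(df$WEALTH, probs = probs, type = 1)

# Assumption: report the INCOME of the row whose WEALTH equals each quantile.
# match() returns the first matching row, so tied WEALTH values need extra care.
income_at_q <- df$INCOME[match(q1, df$WEALTH)]

result <- data.frame(prob = probs, WEALTH = unname(q1), INCOME = income_at_q)
print(result)
```

With interpolating types (4 to 9) the quantile usually falls between two observations, so there is no single row to read INCOME from; that is why this sketch switches to type = 1 for the lookup.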

CodePudding user response:

If you have fewer than 100 rows in your dataset, dplyr::ntile(x, 100) won’t yield accurate percentiles; it will only give you bins numbered from 1 up to the total number of rows:

library(dplyr)

df %>% 
  mutate(percentile = ntile(WEALTH, 100))
# A tibble: 4 × 3
  INCOME WEALTH percentile
   <dbl>  <dbl>      <int>
1   10   100000          1
2   15   111000          2
3   14.2 123456          3
4   12.7 654321          4

To get true percentiles, you can rescale the result, manually or with scales::rescale():

library(scales)

df %>% 
  mutate(percentile = rescale(
    ntile(WEALTH, 100),
    c(1, 100)
  ))
# A tibble: 4 × 3
  INCOME WEALTH percentile
   <dbl>  <dbl>      <dbl>
1   10   100000          1
2   15   111000         34
3   14.2 123456         67
4   12.7 654321        100
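
The manual rescaling mentioned above can be sketched in base R as a plain min-max transformation. This is an illustration only: rescale_manual is a hypothetical helper, and rank() is used as a stand-in for dplyr::ntile(WEALTH, 100), assuming fewer than 100 rows and no tied WEALTH values.

```r
# WEALTH values copied from the question.
wealth <- c(100000, 111000, 123456, 654321)

# Stand-in for ntile(wealth, 100) under the assumptions above:
# each row gets its own bin, numbered 1..n in order of WEALTH.
bins <- rank(wealth, ties.method = "first")

# Hypothetical helper: linearly map x from its own range onto `to`.
# It does not guard against a zero-width range (all values identical).
rescale_manual <- function(x, to = c(1, 100)) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1]) * (to[2] - to[1]) + to[1]
}

percentile <- rescale_manual(bins)
print(data.frame(WEALTH = wealth, bin = bins, percentile = percentile))
```

For the four sample rows this gives approximately 1, 34, 67 and 100, in line with the scales::rescale() output shown above.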