Home > Software engineering >  Create a column based on changes between levels in another column in R
Create a column based on changes between levels in another column in R

Time:10-26

I am looking to create a new column based off of factors present in another column. For example I have the following data frame:

> df <- data.frame(ID = c("1","1","2","2"), Month = c("01", "02", "01", "02"), Percentile = c("P50", "P95", "P97", "P85"))
> df
  ID Month Percentile
1  1    01        P50
2  1    02        P95
3  2    01        P97
4  2    02        P85

I want to create a new column, PercentileChange to say that the for an individual ID, look at the first months percentile and compare it to the next months percentile and see if it increased or decreased and by how many levels so I would get an output like this:

> df2
  ID Month Percentile PercentileChange
1  1    01        P50               NA
2  1    02        P95                4
3  2    01        P97               NA
4  2    02        P85               -3

These are the levels present in the Percentile column

df$Percentile <- factor(df$Percentile,levels=c("P01","P1","P3","P5","P10","P15","P25","P50","P75","P85","P90","P95","P97","P99","P999"))

Edit: Adding longer df example:

> df <- data.frame(ID = c("1","1","1","1","2","2","3","3","3"), Month = c("01", "02", "03", "04", "01", "02", "02", "03", "05"), Percentile = c("P50", "P95", "P97", "P85", "P01","P01", "P5","P5","P3"))
> df
  ID Month Percentile
1  1    01        P50
2  1    02        P95
3  1    03        P97
4  1    04        P85
5  2    01        P01
6  2    02        P01
7  3    02         P5
8  3    03         P5
9  3    05         P3


Output:
  ID Month Percentile PercentileChange
1  1    01        P50               NA
2  1    02        P95                4
3  1    03        P97                1
4  1    04        P85                3
5  2    01        P01               NA
6  2    02        P01                0
7  3    02         P5               NA
8  3    03         P5                0
9  3    05         P3               -1

CodePudding user response:

You can group_by ID and get the distance between the indexes of the factor levels in the Percentile column:

library(tidyverse)

df %>%
  group_by(ID) %>%
  mutate(PercentileChange = levels(Percentile) %>%
         {match(Percentile, .) - match(lag(Percentile), .)})

Further explanation:

levels(Percentile) %>%
             {match(Percentile, .) - match(lag(Percentile), .)}

Is equivalent to

match(Percentile, levels(Percentile)) - match(lag(Percentile), levels(Percentile))

And, match(Percentile, levels(Percentile)) will yied in which position on the levels(Percentile) vector the values on Percentile lie.

  • Related