R replace NA with last value for group ID ordered by date

Time:02-16

Sample dataframe:

eg_data <- data.frame(
  custID = c('655321' , '655321' , '655321' , '655321' , '655321' , '655321' , '655321' , '655321' , '655321' , '655321' , '655321' , '655321' , '655321', '655321' , '655321' , '655321' , 
             '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' , '102013' ) ,
  year = c('2018' , '2018' , '2018' , '2018' , '2019' , '2019' , '2019' , '2019' , '2020' , '2020' , '2020' , '2020' , '2021' , '2021' , '2021' , '2021' , 
           '2018' , '2018' , '2018' , '2018' , '2019' , '2019' , '2019' , '2019' , '2020' , '2020' , '2020' , '2020' , '2021' , '2021' , '2021' , '2021') , 
  quarter = c('1' , '2' , '3' , '4' , '1' , '2' , '3' , '4' , '1' , '2' , '3' , '4' , '1' , '2' , '3' , '4', 
              '1' , '2' , '3' , '4' , '1' , '2' , '3' , '4' , '1' , '2' , '3' , '4' , '1' , '2' , '3' , '4') ,
  orderType = c('retail' , NA , 'wholesale' , NA , 'commercial' , 'retail' , NA , NA , NA , 'wholesale' , NA , NA , 'retail' , NA , NA , NA ,
                'wholesale' , NA , NA , 'retail' , 'commercial' , NA , 'commercial' , NA , NA , 'wholesale' , NA , 'retail' , 'retail' , NA , 'retail' , NA ) )

Context: The data are orders for rubber chickens broken out by year and quarter. The data shown are for two customers, but you can assume there are more; each has a unique ID number.

There are three order types - retail, wholesale, and commercial. The value in the order type field is only updated when the order type changes.

For example, for ID 655321 the 2019 Q2 order type is retail, and the next recorded value is wholesale in 2020 Q2. The missing values for 2019 Q3, 2019 Q4, and 2020 Q1 are also retail; they were left as NA because the order type stayed the same from quarter to quarter.

I need to fill in the NA values accurately. It needs to be specific to the unique ID number and be in chronological order.

I have tried multiple methods of grouping but I haven't yet been successful.

Within each ID, I need the solution to look at the orderType field and replace every NA with the most recent non-NA value before it.

Any help is appreciated. Solutions that make use of tidyverse et al / dplyr and avoid using functions are GREATLY appreciated but if the only way to do it is creating a function, OK fine.

Thank you

CodePudding user response:

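You can use zoo::na.locf along with dplyr verbs. Group by customer ID so that values are never carried over from one customer to the next, and sort by year and quarter so that "last value" means the most recent one. Pass na.rm = FALSE: by default na.locf drops leading NAs, which makes mutate() fail with a length mismatch for any customer whose first quarter is NA.

library(dplyr)

eg_data %>% 
  group_by(custID) %>% 
  # year and quarter are character columns; string sorting is fine here
  # because years have four digits and quarters have one
  arrange(custID, year, quarter) %>% 
  # carry the last observed orderType forward within each customer
  mutate(orderType = zoo::na.locf(orderType, na.rm = FALSE))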

#> # A tibble: 32 x 4
#> # Groups:   custID [2]
#>    custID year  quarter orderType 
#>    <chr>  <chr> <chr>   <chr>     
#>  1 655321 2018  1       retail    
#>  2 655321 2018  2       retail    
#>  3 655321 2018  3       wholesale 
#>  4 655321 2018  4       wholesale 
#>  5 655321 2019  1       commercial
#>  6 655321 2019  2       retail    
#>  7 655321 2019  3       retail    
#>  8 655321 2019  4       retail    
#>  9 655321 2020  1       retail    
#> 10 655321 2020  2       wholesale 
#> # ... with 22 more rows
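
Since the question asks for a tidyverse-only approach, tidyr::fill() may be worth a look as an alternative that does not need zoo. The sketch below runs on a small made-up data frame (toy and its values are hypothetical, chosen only to mirror the column layout of eg_data) and assumes year and quarter can be safely converted to integers for sorting.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for eg_data: same columns, fewer rows,
# deliberately out of order and with a leading NA for customer "B"
toy <- data.frame(
  custID    = c("A", "A", "A", "B", "B", "B"),
  year      = c("2018", "2018", "2019", "2018", "2018", "2019"),
  quarter   = c("2", "1", "1", "1", "2", "1"),
  orderType = c(NA, "retail", NA, NA, "wholesale", NA),
  stringsAsFactors = FALSE
)

filled <- toy %>%
  # sort chronologically within each customer (numeric, not string, order)
  arrange(custID, as.integer(year), as.integer(quarter)) %>%
  # grouping keeps fill() from carrying a value across customers
  group_by(custID) %>%
  # replace each NA with the previous non-NA value in the same group
  fill(orderType, .direction = "down") %>%
  ungroup()

filled
```

In this sketch, customer B's first quarter stays NA because there is no earlier value for that customer to carry forward; it does not inherit customer A's last order type.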
