Let's say we have data frame x
defined as
x <- data.frame(a = c('Start : 20220101', '1', '1', '1', 'Start : 20220102', '2', '2', 'Start : 20220103', '3', '3'),
b = c(NA, 200, 200, 200, NA, 200, 200, NA, 200, 200),
c = c(NA, 1, 3, 5, NA, 2, 4, NA, 3, 5))
a b c
1 Start : 20220101 NA NA
2 1 200 1
3 1 200 3
4 1 200 5
5 Start : 20220102 NA NA
6 2 200 2
7 2 200 4
8 Start : 20220103 NA NA
9 3 200 3
10 3 200 5
I need to replace column a
's value by previous Start : ...
's ...
which indicates it's full date.
My desired output might make by problem more clear.
a b c
1 20220101 200 1
2 20220101 200 3
3 20220101 200 5
4 20220102 200 2
5 20220102 200 4
6 20220103 200 3
7 20220103 200 5
Data x
always have patern with Start : YMD
and D
follows.
Original x
have more than 10^8
rows, so I think it need to be very efficient.
Any help would be grateful.
What I tried is
library(dplyr)
library(data.table)
library(readr)
x %>%
mutate(d = floor((rleid(a) 1)/2)) %>%
group_by(d) %>%
mutate(a = first(parse_number(a))) %>%
na.omit() %>%
ungroup %>%
select(-d)
CodePudding user response:
Here is one data.table
solution with zoo::na.locf
to fill NA
values.
library(data.table)
setDT(x)
#Change all the a values to NA except the ones that start with "Start"
x[, a := replace(a, !grepl('^Start', a), NA)]
#Remove "Start" from a so only the date remains.
x[, a := sub('Start\\s*:\\s*', '', a)]
#Replace NA with latest non-NA values.
zoo::na.locf(x)
# a b c
#1: 20220101 200 1
#2: 20220101 200 3
#3: 20220102 200 3
#4: 20220102 200 2
#5: 20220102 200 4
#6: 20220103 200 4
#7: 20220103 200 3
A tidyverse
solution for the same would be -
library(dplyr)
library(tidyr)
x %>%
mutate(a = replace(a, !grepl('^Start', a), NA)) %>%
fill(everything(), .direction = "downup") %>%
mutate(a = sub('Start\\s*:\\s*', '', a))