I have the following two columns:
ID Number
A 1
A 1
B NA
C NA
C NA
D 3
D 3
D 3
F NA
G 6
H NA
I want the NAs in the Number column to be replaced by the next integer number following the Non-NA number that precedes them. That new number should then stay the same as long as the ID in the ID column doesn't change.
So for example, using the example columns above, if the number associated with ID "A" is 1, and ID "B" below it has NA's, I want those NA's replaced with the number 2. Then, if ID "C" has NA's, they should be replaced with 3. We move down to ID "D". This ID has the number 3 in the Number column, so nothing changes. ID "E" below it has NA's so they get replaced with 4, etc.
Here is what my data frame should look like:
ID Number
A 1
A 1
B 2
C 3
C 3
D 3
D 3
D 3
F 4
G 6
H 7
How would I be able to code this in R, preferably using dplyr?
CodePudding user response:
I came up with the following, though I am not totally sure if the logic is correct, so please test it:
df <- data.frame(ID=c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
Number=c(1, 1, NA, NA, NA, 3, 3, 3, NA, 6, NA))
library(zoo)
a <- as.integer(factor(df$ID))
b <- zoo::na.locf(a - df$Number)
df$Number <- a - b
Resulting in:
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
Some explanation:
a
simply relabels the groups with ascending integers:1 1 2 3 3 4 4 4 5 6 7
. This is almost the desired result, but we have to account for cases where the values indf$Number
get ahead of/behind this running integer label.b
tracks the difference betweena
anddf$Number
with a forward fill (na.locf
):0 0 0 0 0 1 1 1 1 0 0
. Places with a non-zero indicate the correction that should be applied toa
to "reset" the running labels, based on the values observed indf$Number
.a - b
applies the correction alluded to in the above point:1 1 2 3 3 3 3 3 4 6 7
.
One hiccup I noted is if the values start with NA
; in that case using na.locf
will return something smaller than the length of the dataframe. The fix I came up with was to manually prepend a value (0
), forward fill, then remove the value, like so:
# Note: the first 5 values are NA
df <- data.frame(ID=c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
Number=c(NA, NA, NA, NA, NA, 3, 3, 3, NA, 6, NA))
library(zoo)
a <- as.integer(factor(df$ID))
b <- na.locf(c(0, a - df$Number))[2:(length(a) 1)]
df$Number <- a - b
Result:
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
CodePudding user response:
Data
ID <- c("A","A","B","C","C","D","D","D","F","G","H")
Number <- c(1,1,NA_real_,NA_real_,NA_real_,3,3,3,NA_real_,6,NA_real_)
df <- data.frame(ID,Number,stringsAsFactors = F)
dplyr approach
df2 <- df[!df$ID%>%duplicated(),]%>%
mutate(Number2=ifelse(is.na(Number),1,0))%>%
group_by(grp=cumsum(Number2==0))%>%
mutate(cumulative=cumsum(Number2))%>%
ungroup%>%
fill(Number)%>%
mutate(Number=Number cumulative)%>%
select(ID,Number)
base::merge(df%>%select(-Number),df2,by="ID",all.x=T)
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7
or in one LONG line:
df%>%select(-Number)%>%merge(df[!df$ID%>%duplicated(),]%>%
mutate(Number2=ifelse(is.na(Number),1,0))%>%
group_by(grp=cumsum(Number2==0))%>%
mutate(cumulative=cumsum(Number2))%>%
ungroup%>%
fill(Number)%>%
mutate(Number=Number cumulative)%>%
select(ID,Number),
by="ID",all.x=T)
original answer:
df2 <- df[!df$ID%>%duplicated(),]
while(sum(is.na(df2$Number))!=0){
df2$Number[is.na(df2$Number)] <- c(lag(df2$Number) 1)[is.na(df2$Number)]
}
base::merge(df%>%select(-Number),df2,by="ID",all.x=T)
ID Number
1 A 1
2 A 1
3 B 2
4 C 3
5 C 3
6 D 3
7 D 3
8 D 3
9 F 4
10 G 6
11 H 7