Home > database >  Replacing NA's with numbers based on the numbers preceding them in R data frame
Replacing NA's with numbers based on the numbers preceding them in R data frame

Time:12-14

I have the following two columns:

ID Number
A    1
A    1
B    NA
C    NA
C    NA
D    3
D    3
D    3
F    NA
G    6
H    NA

I want the NAs in the Number column to be replaced by the next integer number following the Non-NA number that precedes them. That new number should then stay the same as long as the ID in the ID column doesn't change.

So for example, using the example columns above, if the number associated with ID "A" is 1, and ID "B" below it has NA's, I want those NA's replaced with the number 2. Then, if ID "C" has NA's, they should be replaced with 3. We move down to ID "D". This ID has the number 3 in the Number column, so nothing changes. ID "E" below it has NA's so they get replaced with 4, etc.

Here is what my data frame should look like:

ID Number
A    1
A    1
B    2
C    3
C    3
D    3
D    3
D    3
F    4
G    6
H    7

How would I be able to code this in R, preferably using dplyr?

CodePudding user response:

I came up with the following, though I am not totally sure if the logic is correct, so please test it:

df <- data.frame(ID=c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
                 Number=c(1, 1, NA, NA, NA, 3, 3, 3, NA, 6, NA))

library(zoo)

a <- as.integer(factor(df$ID))
b <- zoo::na.locf(a - df$Number)
df$Number <- a - b

Resulting in:

   ID Number
1   A      1
2   A      1
3   B      2
4   C      3
5   C      3
6   D      3
7   D      3
8   D      3
9   F      4
10  G      6
11  H      7

Some explanation:

  • a simply relabels the groups with ascending integers: 1 1 2 3 3 4 4 4 5 6 7. This is almost the desired result, but we have to account for cases where the values in df$Number get ahead of/behind this running integer label.
  • b tracks the difference between a and df$Number with a forward fill (na.locf): 0 0 0 0 0 1 1 1 1 0 0. Places with a non-zero indicate the correction that should be applied to a to "reset" the running labels, based on the values observed in df$Number.
  • a - b applies the correction alluded to in the above point: 1 1 2 3 3 3 3 3 4 6 7.

One hiccup I noted is if the values start with NA; in that case using na.locf will return something smaller than the length of the dataframe. The fix I came up with was to manually prepend a value (0), forward fill, then remove the value, like so:

# Note: the first 5 values are NA
df <- data.frame(ID=c('A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'F', 'G', 'H'),
                 Number=c(NA, NA, NA, NA, NA, 3, 3, 3, NA, 6, NA))

library(zoo)

a <- as.integer(factor(df$ID))
b <- na.locf(c(0, a - df$Number))[2:(length(a)   1)]
df$Number <- a - b

Result:

  ID Number
1   A      1
2   A      1
3   B      2
4   C      3
5   C      3
6   D      3
7   D      3
8   D      3
9   F      4
10  G      6
11  H      7

CodePudding user response:

Data

ID <- c("A","A","B","C","C","D","D","D","F","G","H")
Number <- c(1,1,NA_real_,NA_real_,NA_real_,3,3,3,NA_real_,6,NA_real_)
df <- data.frame(ID,Number,stringsAsFactors = F)

dplyr approach

df2 <- df[!df$ID%>%duplicated(),]%>%
  mutate(Number2=ifelse(is.na(Number),1,0))%>%
  group_by(grp=cumsum(Number2==0))%>%
  mutate(cumulative=cumsum(Number2))%>%
  ungroup%>%
  fill(Number)%>%
  mutate(Number=Number cumulative)%>%
  select(ID,Number)

base::merge(df%>%select(-Number),df2,by="ID",all.x=T)

   ID Number
1   A      1
2   A      1
3   B      2
4   C      3
5   C      3
6   D      3
7   D      3
8   D      3
9   F      4
10  G      6
11  H      7

or in one LONG line:

df%>%select(-Number)%>%merge(df[!df$ID%>%duplicated(),]%>%
                               mutate(Number2=ifelse(is.na(Number),1,0))%>%
                               group_by(grp=cumsum(Number2==0))%>%
                               mutate(cumulative=cumsum(Number2))%>%
                               ungroup%>%
                               fill(Number)%>%
                               mutate(Number=Number cumulative)%>%
                               select(ID,Number),
                             by="ID",all.x=T)

original answer:

df2 <- df[!df$ID%>%duplicated(),]
while(sum(is.na(df2$Number))!=0){
  df2$Number[is.na(df2$Number)] <- c(lag(df2$Number) 1)[is.na(df2$Number)]
}
base::merge(df%>%select(-Number),df2,by="ID",all.x=T)

   ID Number
1   A      1
2   A      1
3   B      2
4   C      3
5   C      3
6   D      3
7   D      3
8   D      3
9   F      4
10  G      6
11  H      7
  • Related