Home > OS >  Fill the actual data with duplicated data and remove duplicated data
Fill the actual data with duplicated data and remove duplicated data

Time:02-21

I have the following data:-

library(data.table)
data <- data <- data.table(address = c("AA", "BB", "AA", "CC", "DD", "EE", "DD"),
                   revenue = c(NA, 121, 22, 33, 44, 33, NA),
                   castord = c(21, 22, NA, 3, NA, 223, 33),
                   versaze = c(NA, 22, 124, 33, NA, 44, 43))
data

#    address revenue castord versaze
# 1:      AA      NA      21      NA
# 2:      BB     121      22      22
# 3:      AA      22      NA     124
# 4:      CC      33       3      33
# 5:      DD      44      NA      NA
# 6:      EE      33     223      44
# 7:      DD      NA      33      43

Now this data has AA and DD addresses duplicated. And their first occurrence has some NA data as you can see in the row 1 and 5. What I want to do is fill this data by using the duplicated rows of these addresses. If the duplicated row has NA values, then NA shouldn't replace the values in the row of first occurrence. This will give me the following data:-

    data <- data.table(address = c("AA", "BB", "AA", "CC", "DD", "EE", "DD"),
                   revenue = c(22, 121, 22, 33, 44, 33, NA),
                   castord = c(21, 22, NA, 3, 33, 223, 33),
                   versaze = c(124, 22, 124, 33, 43, 44, 43))

#    address revenue castord versaze
# 1:      AA      22      21     124
# 2:      BB     121      22      22
# 3:      AA      22      NA     124
# 4:      CC      33       3      33
# 5:      DD      44      33      43
# 6:      EE      33     223      44
# 7:      DD      NA      33      43

And then remove those duplicated rows:-

data <- data.table(address = c("AA", "BB", "CC", "DD", "EE"),
                   revenue = c(22, 121, 33, 44, 33),
                   castord = c(21, 22, 3, 33, 223),
                   versaze = c(124, 22, 33, 43, 44))

#    address revenue castord versaze
# 1:      AA      22      21     124
# 2:      BB     121      22      22
# 3:      CC      33       3      33
# 4:      DD      44      33      43
# 5:      EE      33     223      44

CodePudding user response:

Using data.table, group by address, loop through columns, remove NAs, get the first value:

data[, lapply(.SD, function(i) (na.omit(i)[ 1 ])), by = address]
#    address revenue castord versaze
# 1:      AA      22      21     124
# 2:      BB     121      22      22
# 3:      CC      33       3      33
# 4:      DD      44      33      43
# 5:      EE      33     223      44

CodePudding user response:

If you use dplyr::group_by and summarise where you take the first value that is not NA using na.omit, it fills out the empty values for you if the first row is NA.

data <- data %>% group_by(address) %>% 
  summarise(
    revenue = first(na.omit(revenue)),
    castord = first(na.omit(castord)),
    versaze = first(na.omit(versaze))
    )
  • Related