Home > Software engineering >  R: complete/expand a dataset with a new column added
R: complete/expand a dataset with a new column added

Time:10-25

I have a dataset looks like this:

(Visualising the datasets below may help you to understand the question)

original <- data.frame(
  ID = c(rep("John", 3), "Steve"),
  A = c(rep(3, 3), 1),
  B = c(rep(4, 3), 2),
  b = c(2, 3, 2, 2),
  detail = c(rep("GOOOOD", 4))
)

Values in variable A, B, and b are all integers. Variable b is incomplete in this dataset and it actually has values from 1 to the value of B.

I need to complete this dataset with a new variable a added, the completed dataset will look like this:

completed1 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  A = c(rep(3, 12), rep(1, 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  B = c(rep(4, 12), rep(2, 2)),
  b = c(rep(1:4, 3), 1, 2),
  detail = c(NA, "GOOOOD", "GOOOOD", NA, NA, "GOOOOD", rep(NA, 7), "GOOOOD")
)

Values in variable a are integers too and a has values from 1 to the value of A. Values in b are nested in each value of a, and values in a are nested in each factor of ID.

I think the most relevant functions to complete a dataset in this way are tidyr::complete() and tidyr::expand(), but they can only complete combinations of values in existing variables, they cannot add a new column(variable).

I know the challenge is that there are multiple locations to allocate values in detail correspondingly to values in the newly added a through the nested relationship, for example, the completed dataset can also be this:

completed2 <- data.frame(
  ID = c(rep("John", 12), rep("Steve", 2)),
  A = c(rep(3, 12), rep(1, 2)),
  a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
  B = c(rep(4, 12), rep(2, 2)),
  b = c(rep(1:4, 3), 1, 2),
  detail = c(NA, "GOOOOD", rep(NA, 4), "GOOOOD", NA, NA, "GOOOOD", rep(NA, 3), "GOOOOD")
)

Where the values in detail got located in the completed dataset does not matter to me, I just need to complete the dataset in this way. My actual dataset has more than 40,000 rows, so I really need something to automate it.

Is it possible to do this? Thanks very much!!!

CodePudding user response:

I wonder whether doing the complete twice, first for the a and then for b can be a solution. You can adjust different nesting, or group_by if needed.

Depending if the maximum a shall be from A within the ID group or not you shall adjust/remove the group_by (similar for b within the a group)

library(dplyr)
library(tidyr)

original %>%
  dplyr::mutate(a = 1) %>%
  dplyr::group_by( ID ) %>%
  tidyr::complete( a = 1:max(A), nesting(ID, A, B, b), fill = list( detail = NA_character_)) %>%
  group_by( a ) %>%
  tidyr::complete( b = 1:max(B), nesting(ID, A, B, a), fill = list( detail = NA_character_)) %>%
  dplyr::ungroup()

CodePudding user response:

It's pretty messy using for loop, and it will give very random position of GOOOOD

comp_dummy <- original %>%
  group_by(ID) %>%
  expand(A = A, a = 1:A, B = B, b = 1:B)

original <- original %>%
  group_by(ID, A, B, b) %>%
  summarise(n = n())

vec <- rep(NA_character_, nrow(comp_dummy))

for (i in 1:nrow(original)){
  x <- original[i,]
  
  y <- comp_dummy %>%
    rownames_to_column(., "row") %>%
    filter(ID == x$ID, A == x$A, B == x$B, b == x$b)  %>%
    pull(row)
  z <- sample(y, x$n, replace = FALSE)  %>% as.numeric()
  print(z)
  vec[{z}] <- "GOOOOD"
}

comp_dummy$detail <- vec
comp_dummy

   ID        A     a     B     b detail
   <chr> <dbl> <int> <dbl> <int> <chr> 
 1 John      3     1     4     1 NA    
 2 John      3     1     4     2 GOOOOD
 3 John      3     1     4     3 NA    
 4 John      3     1     4     4 NA    
 5 John      3     2     4     1 NA    
 6 John      3     2     4     2 NA    
 7 John      3     2     4     3 NA    
 8 John      3     2     4     4 NA    
 9 John      3     3     4     1 NA    
10 John      3     3     4     2 GOOOOD
11 John      3     3     4     3 GOOOOD
12 John      3     3     4     4 NA    
13 Steve     1     1     2     1 NA    
14 Steve     1     1     2     2 GOOOOD

CodePudding user response:

A base R solution

do.call(
  rbind,
  by(original,list(original$ID),function(x){
    tmp=merge(
      unique(x),
      setNames(
        expand.grid(
          unique(x$ID),
          x$A[1],
          1:max(x$A),
          x$B[1],
          1:max(x$B)
        ),
        c("ID","A","a","B","b")
      ),
      by=c("ID","A","B","b"),
      all=T
    )
    tmp[order(tmp$a,tmp$b),c("ID","A","a","B","b","detail")]
  })
)

resulting in

           ID A a B b detail
John.1   John 3 1 4 1   <NA>
John.5   John 3 1 4 2 GOOOOD
John.8   John 3 1 4 3 GOOOOD
John.11  John 3 1 4 4   <NA>
John.2   John 3 2 4 1   <NA>
John.4   John 3 2 4 2 GOOOOD
John.9   John 3 2 4 3 GOOOOD
John.12  John 3 2 4 4   <NA>
John.3   John 3 3 4 1   <NA>
John.6   John 3 3 4 2 GOOOOD
John.7   John 3 3 4 3 GOOOOD
John.10  John 3 3 4 4   <NA>
Steve.1 Steve 1 1 2 1   <NA>
Steve.2 Steve 1 1 2 2 GOOOOD
  • Related