I have a dataset looks like this:
(Visualising the datasets below may help you to understand the question)
original <- data.frame(
ID = c(rep("John", 3), "Steve"),
A = c(rep(3, 3), 1),
B = c(rep(4, 3), 2),
b = c(2, 3, 2, 2),
detail = c(rep("GOOOOD", 4))
)
Values in variable A
, B
, and b
are all integers. Variable b
is incomplete in this dataset and it actually has values from 1 to the value of B
.
I need to complete this dataset with a new variable a
added, the completed dataset will look like this:
completed1 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
A = c(rep(3, 12), rep(1, 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
B = c(rep(4, 12), rep(2, 2)),
b = c(rep(1:4, 3), 1, 2),
detail = c(NA, "GOOOOD", "GOOOOD", NA, NA, "GOOOOD", rep(NA, 7), "GOOOOD")
)
Values in variable a
are integers too and a
has values from 1 to the value of A
. Values in b
are nested in each value of a
, and values in a
are nested in each factor of ID
.
I think the most relevant functions to complete a dataset in this way are tidyr::complete()
and tidyr::expand()
, but they can only complete combinations of values in existing variables, they cannot add a new column(variable).
I know the challenge is that there are multiple locations to allocate values in detail
correspondingly to values in the newly added a
through the nested relationship, for example, the completed dataset can also be this:
completed2 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
A = c(rep(3, 12), rep(1, 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
B = c(rep(4, 12), rep(2, 2)),
b = c(rep(1:4, 3), 1, 2),
detail = c(NA, "GOOOOD", rep(NA, 4), "GOOOOD", NA, NA, "GOOOOD", rep(NA, 3), "GOOOOD")
)
Where the values in detail
got located in the completed dataset does not matter to me, I just need to complete the dataset in this way. My actual dataset has more than 40,000 rows, so I really need something to automate it.
Is it possible to do this? Thanks very much!!!
CodePudding user response:
I wonder whether doing the complete
twice, first for the a
and then for b
can be a solution. You can adjust different nesting, or group_by
if needed.
Depending if the maximum a
shall be from A
within the ID
group or not you shall adjust/remove the group_by
(similar for b
within the a
group)
library(dplyr)
library(tidyr)
original %>%
dplyr::mutate(a = 1) %>%
dplyr::group_by( ID ) %>%
tidyr::complete( a = 1:max(A), nesting(ID, A, B, b), fill = list( detail = NA_character_)) %>%
group_by( a ) %>%
tidyr::complete( b = 1:max(B), nesting(ID, A, B, a), fill = list( detail = NA_character_)) %>%
dplyr::ungroup()
CodePudding user response:
It's pretty messy using for
loop, and it will give very random position of GOOOOD
comp_dummy <- original %>%
group_by(ID) %>%
expand(A = A, a = 1:A, B = B, b = 1:B)
original <- original %>%
group_by(ID, A, B, b) %>%
summarise(n = n())
vec <- rep(NA_character_, nrow(comp_dummy))
for (i in 1:nrow(original)){
x <- original[i,]
y <- comp_dummy %>%
rownames_to_column(., "row") %>%
filter(ID == x$ID, A == x$A, B == x$B, b == x$b) %>%
pull(row)
z <- sample(y, x$n, replace = FALSE) %>% as.numeric()
print(z)
vec[{z}] <- "GOOOOD"
}
comp_dummy$detail <- vec
comp_dummy
ID A a B b detail
<chr> <dbl> <int> <dbl> <int> <chr>
1 John 3 1 4 1 NA
2 John 3 1 4 2 GOOOOD
3 John 3 1 4 3 NA
4 John 3 1 4 4 NA
5 John 3 2 4 1 NA
6 John 3 2 4 2 NA
7 John 3 2 4 3 NA
8 John 3 2 4 4 NA
9 John 3 3 4 1 NA
10 John 3 3 4 2 GOOOOD
11 John 3 3 4 3 GOOOOD
12 John 3 3 4 4 NA
13 Steve 1 1 2 1 NA
14 Steve 1 1 2 2 GOOOOD
CodePudding user response:
A base R solution
do.call(
rbind,
by(original,list(original$ID),function(x){
tmp=merge(
unique(x),
setNames(
expand.grid(
unique(x$ID),
x$A[1],
1:max(x$A),
x$B[1],
1:max(x$B)
),
c("ID","A","a","B","b")
),
by=c("ID","A","B","b"),
all=T
)
tmp[order(tmp$a,tmp$b),c("ID","A","a","B","b","detail")]
})
)
resulting in
ID A a B b detail
John.1 John 3 1 4 1 <NA>
John.5 John 3 1 4 2 GOOOOD
John.8 John 3 1 4 3 GOOOOD
John.11 John 3 1 4 4 <NA>
John.2 John 3 2 4 1 <NA>
John.4 John 3 2 4 2 GOOOOD
John.9 John 3 2 4 3 GOOOOD
John.12 John 3 2 4 4 <NA>
John.3 John 3 3 4 1 <NA>
John.6 John 3 3 4 2 GOOOOD
John.7 John 3 3 4 3 GOOOOD
John.10 John 3 3 4 4 <NA>
Steve.1 Steve 1 1 2 1 <NA>
Steve.2 Steve 1 1 2 2 GOOOOD