Here's a simple question about completing data. Probably a duplicate but I didn't figure out an answer.
Consider the following df
which has combination of letters (a
to e
) & numbers (1
to 10
), but not all possible combinations are there.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
set.seed(123)
df <-
expand_grid(lttrs = letters[1:5], num = 1:10) |>
group_by(lttrs) |>
slice_sample(n = 5) |>
ungroup() |>
arrange(lttrs, num)
df |>
print(n = 25)
#> # A tibble: 25 × 2
#> lttrs num
#> <chr> <int>
#> 1 a 2
#> 2 a 3
#> 3 a 6
#> 4 a 8
#> 5 a 10
#> 6 b 1
#> 7 b 4
#> 8 b 5
#> 9 b 6
#> 10 b 8
#> 11 c 1
#> 12 c 3
#> 13 c 5
#> 14 c 8
#> 15 c 10
#> 16 d 2
#> 17 d 3
#> 18 d 5
#> 19 d 9
#> 20 d 10
#> 21 e 3
#> 22 e 4
#> 23 e 6
#> 24 e 7
#> 25 e 9
Created on 2022-05-29 by the reprex package (v2.0.1)
I want to complete each letter to 1:10
, where they're missing, while accounting for this missingness. That is, I'm looking for the following desired output:
tibble::tribble(
~lttrs, ~orig_num, ~completed_num,
"a", NA, 1,
"a", 2L, 2,
"a", 3L, 3,
"a", NA, 4,
"a", NA, 5,
"a", 6L, 6,
"a", NA, 7,
"a", 8L, 8,
"a", NA, 9,
"a", 10L, 10,
"b", 1L, 1,
"b", NA, 2,
"b", NA, 3,
"b", 4L, 4,
"b", 5L, 5,
"b", 6L, 6,
"b", NA, 7,
"b", 8L, 8,
"b", NA, 9,
"b", NA, 10,
"c", 1L, 1,
"c", NA, 2,
"c", 3L, 3,
"c", NA, 4,
"c", 5L, 5,
"c", NA, 6,
"c", NA, 7,
"c", 8L, 8,
"c", NA, 9,
"c", 10L, 10,
"d", NA, 1,
"d", 2L, 2,
"d", 3L, 3,
"d", NA, 4,
"d", 5L, 5,
"d", NA, 6,
"d", NA, 7,
"d", NA, 8,
"d", 9L, 9,
"d", 10L, 10,
"e", NA, 1,
"e", NA, 2,
"e", 3L, 3,
"e", 4L, 4,
"e", NA, 5,
"e", 6L, 6,
"e", 7L, 7,
"e", NA, 8,
"e", 9L, 9,
"e", NA, 10
)
#> # A tibble: 50 × 3
#> lttrs orig_num completed_num
#> <chr> <int> <dbl>
#> 1 a NA 1
#> 2 a 2 2
#> 3 a 3 3
#> 4 a NA 4
#> 5 a NA 5
#> 6 a 6 6
#> 7 a NA 7
#> 8 a 8 8
#> 9 a NA 9
#> 10 a 10 10
#> # … with 40 more rows
Created on 2022-05-29 by the reprex package (v2.0.1)
It's probably super simple with tidyr
, but I couldn't figure out how to get from df
directly to the desired output.
CodePudding user response:
Here is a tidyverse solution:
First we create a copy of num
then we use complete
together with nesting
:
library(dplyr)
library(tidyr)
df %>%
mutate(num_new = num) %>%
complete(lttrs, nesting(num_new)) %>%
data.frame()
lttrs num_new num
1 a 1 1
2 a 2 2
3 a 3 NA
4 a 4 4
5 a 5 5
6 a 6 NA
7 a 7 7
8 a 8 NA
9 a 9 NA
10 a 10 NA
11 b 1 1
12 b 2 2
13 b 3 3
14 b 4 NA
15 b 5 NA
16 b 6 NA
17 b 7 7
18 b 8 NA
19 b 9 9
20 b 10 NA
21 c 1 NA
22 c 2 NA
23 c 3 3
24 c 4 NA
25 c 5 5
26 c 6 6
27 c 7 7
28 c 8 NA
29 c 9 NA
30 c 10 10
31 d 1 NA
32 d 2 2
33 d 3 NA
34 d 4 4
35 d 5 5
36 d 6 NA
37 d 7 NA
38 d 8 8
39 d 9 9
40 d 10 NA
41 e 1 1
42 e 2 2
43 e 3 3
44 e 4 NA
45 e 5 NA
46 e 6 NA
47 e 7 NA
48 e 8 8
49 e 9 9
50 e 10 NA
CodePudding user response:
OK, seems like I've figure something out:
df |>
mutate(missing_num = num) |>
group_by(lttrs) |>
complete(num = full_seq(1:10, 1))
#> # A tibble: 50 × 3
#> # Groups: lttrs [5]
#> lttrs num missing_num
#> <chr> <dbl> <int>
#> 1 a 1 NA
#> 2 a 2 2
#> 3 a 3 3
#> 4 a 4 NA
#> 5 a 5 NA
#> 6 a 6 6
#> 7 a 7 NA
#> 8 a 8 8
#> 9 a 9 NA
#> 10 a 10 10
#> # … with 40 more rows