I have data as follows:
dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A",
"Area B"), class = "factor"), type= c("cat_1", "cat_1",
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1",
"cat_1", NA, "cat_2", NA, NA, "cat_2", "cat_2", "cat_2", "cat_2",
"cat_2")), class = c("data.table", "data.frame"), row.names = c(NA,
-20L))
amount_of_categories municipality type
1: 2 Area A cat_1
2: 2 Area A cat_1
3: 2 Area A cat_1
4: 2 Area A cat_1
5: 2 Area A cat_1
6: 2 Area A cat_1
7: 2 Area A cat_1
8: 2 Area A cat_1
9: 2 Area A cat_1
10: 2 Area A cat_1
11: 2 Area A cat_1
12: 2 Area A <NA>
13: 2 Area A cat_2
14: 1 Area B <NA>
15: 1 Area B <NA>
16: 1 Area B cat_2
17: 1 Area B cat_2
18: 1 Area B cat_2
19: 1 Area B cat_2
20: 1 Area B cat_2
The idea is to create a new column type_estimation
, which replaces the NA
's in the type
column, with the right type. The right type can only be established if there is only one category (amount_of_categories==1
) for that Area
. So it should fill the last two NA
's but not the first.
I tried:
dat <- setDT(dat)[is.na(type) & amount_of_categories==1, type_estimation:= shift(type), by="municipality"]
But this does not work. What is right syntax here?
Desired outcome:
amount_of_categories municipality type type_estimation
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_2
15: 1 Area B <NA> cat_2
16: 1 Area B cat_2 cat_2
17: 1 Area B cat_2 cat_2
18: 1 Area B cat_2 cat_2
19: 1 Area B cat_2 cat_2
20: 1 Area B cat_2 cat_2
EDIT:
I tried to come up with a situation in which the solution provided by Waldi could lead to an issue. After thinking about it for a bit I realised this would be the case if:
dat[,estimation:=zoo::na.locf(type)]
fills with the wrong type, because the last observation is ofArea A
is carried forward, to the first observation ofArea B
andArea B
has only one category, so that[amount_of_categories!=1&is.na(type) ,estimation:=NA][]
does make this valueNA
.
In the example data:
dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A",
"Area B"), class = "factor"), type= c("cat_1", "cat_1",
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1",
"cat_1", NA, "cat_2", NA, NA, "cat_3", "cat_3", "cat_3", "cat_3",
"cat_3")), class = c("data.table", "data.frame"), row.names = c(NA,
-20L))
amount_of_categories municipality type estimation
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_2
15: 1 Area B <NA> cat_2
16: 1 Area B cat_3 cat_3
17: 1 Area B cat_3 cat_3
18: 1 Area B cat_3 cat_3
19: 1 Area B cat_3 cat_3
20: 1 Area B cat_3 cat_3
As Waldi already pointed out, this issue cannot be fixed by using:
dat[,estimation:=zoo::na.locf(type), by="municipality"][amount_of_categories!=1&is.na(type) ,estimation:=NA][]
Any solution for this issue would still be greatly appreciated.
CodePudding user response:
In two steps:
dat[,estimation:=zoo::na.locf(type)][amount_of_categories!=1&is.na(type) ,estimation:=NA][]
amount_of_categories municipality type estimation
<int> <fctr> <char> <char>
1: 2 Area A cat_1 cat_1
2: 2 Area A cat_1 cat_1
3: 2 Area A cat_1 cat_1
4: 2 Area A cat_1 cat_1
5: 2 Area A cat_1 cat_1
6: 2 Area A cat_1 cat_1
7: 2 Area A cat_1 cat_1
8: 2 Area A cat_1 cat_1
9: 2 Area A cat_1 cat_1
10: 2 Area A cat_1 cat_1
11: 2 Area A cat_1 cat_1
12: 2 Area A <NA> <NA>
13: 2 Area A cat_2 cat_2
14: 1 Area B <NA> cat_2
15: 1 Area B <NA> cat_2
16: 1 Area B cat_2 cat_2
17: 1 Area B cat_2 cat_2
18: 1 Area B cat_2 cat_2
19: 1 Area B cat_2 cat_2
20: 1 Area B cat_2 cat_2
amount_of_categories municipality type estimation
Note that I used zoo::na.locf
because data.table::nafill(type='locf')
doesn't yet work with characters.
CodePudding user response:
Does this approach leveraging unique()
and join help with both cases?
unique(
dat[amount_of_categories==1 & !is.na(type), .(municipality,type_estimation=type)]
)[dat, on=.(municipality)][is.na(type_estimation),type_estimation:=type][]
Output for example 1:
municipality type_estimation amount_of_categories type
<fctr> <char> <int> <char>
1: Area A cat_1 2 cat_1
2: Area A cat_1 2 cat_1
3: Area A cat_1 2 cat_1
4: Area A cat_1 2 cat_1
5: Area A cat_1 2 cat_1
6: Area A cat_1 2 cat_1
7: Area A cat_1 2 cat_1
8: Area A cat_1 2 cat_1
9: Area A cat_1 2 cat_1
10: Area A cat_1 2 cat_1
11: Area A cat_1 2 cat_1
12: Area A <NA> 2 <NA>
13: Area A cat_2 2 cat_2
14: Area B cat_2 1 <NA>
15: Area B cat_2 1 <NA>
16: Area B cat_2 1 cat_2
17: Area B cat_2 1 cat_2
18: Area B cat_2 1 cat_2
19: Area B cat_2 1 cat_2
20: Area B cat_2 1 cat_2
Output for example 2:
municipality type_estimation amount_of_categories type
<fctr> <char> <int> <char>
1: Area A cat_1 2 cat_1
2: Area A cat_1 2 cat_1
3: Area A cat_1 2 cat_1
4: Area A cat_1 2 cat_1
5: Area A cat_1 2 cat_1
6: Area A cat_1 2 cat_1
7: Area A cat_1 2 cat_1
8: Area A cat_1 2 cat_1
9: Area A cat_1 2 cat_1
10: Area A cat_1 2 cat_1
11: Area A cat_1 2 cat_1
12: Area A <NA> 2 <NA>
13: Area A cat_2 2 cat_2
14: Area B cat_3 1 <NA>
15: Area B cat_3 1 <NA>
16: Area B cat_3 1 cat_3
17: Area B cat_3 1 cat_3
18: Area B cat_3 1 cat_3
19: Area B cat_3 1 cat_3
20: Area B cat_3 1 cat_3