Home > front end >  Filling a column based on the value of another column in data.table
Filling a column based on the value of another column in data.table

Time:05-05

I have data as follows:

dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A", 
"Area B"), class = "factor"), type= c("cat_1", "cat_1", 
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", 
"cat_1", NA, "cat_2", NA, NA, "cat_2", "cat_2", "cat_2", "cat_2", 
"cat_2")), class = c("data.table", "data.frame"), row.names = c(NA, 
-20L))

    amount_of_categories municipality  type
 1:                    2       Area A cat_1
 2:                    2       Area A cat_1
 3:                    2       Area A cat_1
 4:                    2       Area A cat_1
 5:                    2       Area A cat_1
 6:                    2       Area A cat_1
 7:                    2       Area A cat_1
 8:                    2       Area A cat_1
 9:                    2       Area A cat_1
10:                    2       Area A cat_1
11:                    2       Area A cat_1
12:                    2       Area A  <NA>
13:                    2       Area A cat_2
14:                    1       Area B  <NA>
15:                    1       Area B  <NA>
16:                    1       Area B cat_2
17:                    1       Area B cat_2
18:                    1       Area B cat_2
19:                    1       Area B cat_2
20:                    1       Area B cat_2

The idea is to create a new column type_estimation, which replaces the NA's in the type column, with the right type. The right type can only be established if there is only one category (amount_of_categories==1) for that Area. So it should fill the last two NA's but not the first.

I tried:

dat <- setDT(dat)[is.na(type) & amount_of_categories==1, type_estimation:= shift(type), by="municipality"]

But this does not work. What is right syntax here?

Desired outcome:

    amount_of_categories municipality  type  type_estimation
 1:                    2       Area A cat_1            cat_1
 2:                    2       Area A cat_1            cat_1
 3:                    2       Area A cat_1            cat_1
 4:                    2       Area A cat_1            cat_1
 5:                    2       Area A cat_1            cat_1
 6:                    2       Area A cat_1            cat_1
 7:                    2       Area A cat_1            cat_1
 8:                    2       Area A cat_1            cat_1
 9:                    2       Area A cat_1            cat_1
10:                    2       Area A cat_1            cat_1
11:                    2       Area A cat_1            cat_1
12:                    2       Area A  <NA>             <NA> 
13:                    2       Area A cat_2            cat_2
14:                    1       Area B  <NA>            cat_2
15:                    1       Area B  <NA>            cat_2
16:                    1       Area B cat_2            cat_2
17:                    1       Area B cat_2            cat_2
18:                    1       Area B cat_2            cat_2
19:                    1       Area B cat_2            cat_2
20:                    1       Area B cat_2            cat_2

EDIT:

I tried to come up with a situation in which the solution provided by Waldi could lead to an issue. After thinking about it for a bit I realised this would be the case if:

  1. dat[,estimation:=zoo::na.locf(type)] fills with the wrong type, because the last observation is of Area A is carried forward, to the first observation of Area B and
  2. Area B has only one category, so that [amount_of_categories!=1&is.na(type) ,estimation:=NA][] does make this value NA.

In the example data:

dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A", 
"Area B"), class = "factor"), type= c("cat_1", "cat_1", 
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", 
"cat_1", NA, "cat_2", NA, NA, "cat_3", "cat_3", "cat_3", "cat_3", 
"cat_3")), class = c("data.table", "data.frame"), row.names = c(NA, 
-20L))

   amount_of_categories municipality  type estimation
 1:                    2       Area A cat_1      cat_1
 2:                    2       Area A cat_1      cat_1
 3:                    2       Area A cat_1      cat_1
 4:                    2       Area A cat_1      cat_1
 5:                    2       Area A cat_1      cat_1
 6:                    2       Area A cat_1      cat_1
 7:                    2       Area A cat_1      cat_1
 8:                    2       Area A cat_1      cat_1
 9:                    2       Area A cat_1      cat_1
10:                    2       Area A cat_1      cat_1
11:                    2       Area A cat_1      cat_1
12:                    2       Area A  <NA>       <NA>
13:                    2       Area A cat_2      cat_2
14:                    1       Area B  <NA>      cat_2
15:                    1       Area B  <NA>      cat_2
16:                    1       Area B cat_3      cat_3
17:                    1       Area B cat_3      cat_3
18:                    1       Area B cat_3      cat_3
19:                    1       Area B cat_3      cat_3
20:                    1       Area B cat_3      cat_3

As Waldi already pointed out, this issue cannot be fixed by using:

dat[,estimation:=zoo::na.locf(type), by="municipality"][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

Any solution for this issue would still be greatly appreciated.

CodePudding user response:

In two steps:

dat[,estimation:=zoo::na.locf(type)][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

    amount_of_categories municipality   type estimation
                   <int>       <fctr> <char>     <char>
 1:                    2       Area A  cat_1      cat_1
 2:                    2       Area A  cat_1      cat_1
 3:                    2       Area A  cat_1      cat_1
 4:                    2       Area A  cat_1      cat_1
 5:                    2       Area A  cat_1      cat_1
 6:                    2       Area A  cat_1      cat_1
 7:                    2       Area A  cat_1      cat_1
 8:                    2       Area A  cat_1      cat_1
 9:                    2       Area A  cat_1      cat_1
10:                    2       Area A  cat_1      cat_1
11:                    2       Area A  cat_1      cat_1
12:                    2       Area A   <NA>       <NA>
13:                    2       Area A  cat_2      cat_2
14:                    1       Area B   <NA>      cat_2
15:                    1       Area B   <NA>      cat_2
16:                    1       Area B  cat_2      cat_2
17:                    1       Area B  cat_2      cat_2
18:                    1       Area B  cat_2      cat_2
19:                    1       Area B  cat_2      cat_2
20:                    1       Area B  cat_2      cat_2
    amount_of_categories municipality   type estimation

Note that I used zoo::na.locf because data.table::nafill(type='locf') doesn't yet work with characters.

CodePudding user response:

Does this approach leveraging unique() and join help with both cases?

unique(
  dat[amount_of_categories==1 & !is.na(type), .(municipality,type_estimation=type)]
)[dat, on=.(municipality)][is.na(type_estimation),type_estimation:=type][]

Output for example 1:

    municipality type_estimation amount_of_categories   type
          <fctr>          <char>                <int> <char>
 1:       Area A           cat_1                    2  cat_1
 2:       Area A           cat_1                    2  cat_1
 3:       Area A           cat_1                    2  cat_1
 4:       Area A           cat_1                    2  cat_1
 5:       Area A           cat_1                    2  cat_1
 6:       Area A           cat_1                    2  cat_1
 7:       Area A           cat_1                    2  cat_1
 8:       Area A           cat_1                    2  cat_1
 9:       Area A           cat_1                    2  cat_1
10:       Area A           cat_1                    2  cat_1
11:       Area A           cat_1                    2  cat_1
12:       Area A            <NA>                    2   <NA>
13:       Area A           cat_2                    2  cat_2
14:       Area B           cat_2                    1   <NA>
15:       Area B           cat_2                    1   <NA>
16:       Area B           cat_2                    1  cat_2
17:       Area B           cat_2                    1  cat_2
18:       Area B           cat_2                    1  cat_2
19:       Area B           cat_2                    1  cat_2
20:       Area B           cat_2                    1  cat_2

Output for example 2:

    municipality type_estimation amount_of_categories   type
          <fctr>          <char>                <int> <char>
 1:       Area A           cat_1                    2  cat_1
 2:       Area A           cat_1                    2  cat_1
 3:       Area A           cat_1                    2  cat_1
 4:       Area A           cat_1                    2  cat_1
 5:       Area A           cat_1                    2  cat_1
 6:       Area A           cat_1                    2  cat_1
 7:       Area A           cat_1                    2  cat_1
 8:       Area A           cat_1                    2  cat_1
 9:       Area A           cat_1                    2  cat_1
10:       Area A           cat_1                    2  cat_1
11:       Area A           cat_1                    2  cat_1
12:       Area A            <NA>                    2   <NA>
13:       Area A           cat_2                    2  cat_2
14:       Area B           cat_3                    1   <NA>
15:       Area B           cat_3                    1   <NA>
16:       Area B           cat_3                    1  cat_3
17:       Area B           cat_3                    1  cat_3
18:       Area B           cat_3                    1  cat_3
19:       Area B           cat_3                    1  cat_3
20:       Area B           cat_3                    1  cat_3
  • Related