Home > Software engineering >  How to sum two categories in panel-data using data.table
How to sum two categories in panel-data using data.table

Time:09-15

I'm having the particular case in R where two categories are displayed in a yearly panel data but do not correspond to the level I am using, this is, there are two cities that belong to the same state (A) but are on their individual city levels (a and b) and the rest of the data is on State level (Dep).

year Dep Homicides
2000 C 20
2000 a 5
2000 b 3
2000 D 17
2000 E 18
2001 C 21
2001 a 6
2001 b 4
2001 D 18
2001 E 18

So I want to form a new category by sum using Dep == "a" and Dep =="b", that is, to sum my two cities that belongs to Dep A (also this is not exist, so it needs to be created) by each year, where the final output looks like:

year Dep Homicides
2000 C 20
2000 A 8
2000 D 17
2000 E 18
2001 C 21
2001 A 10
2001 D 18
2001 E 18

Thank you in advance!

CodePudding user response:

We could recode the 'Dep' levels to 'A' and use as grouping variable to get the sum

library(forcats)
library(data.table)
dt1[, .(Homicides = sum(Homicides)), 
   .(year, Dep = fct_other(Dep, drop = c("a", "b"), other_level = "A"))]

-output

    year    Dep Homicides
    <int> <fctr>     <int>
1:  2000      C        20
2:  2000      A         8
3:  2000      D        17
4:  2000      E        18
5:  2001      C        21
6:  2001      A        10
7:  2001      D        18
8:  2001      E        18

Or could use fct_collapse as well

dt1[, .(Homicides = sum(Homicides)), .(year, 
       Dep = fct_collapse(Dep, A = c("a", "b")))]

-output

   year    Dep Homicides
   <int> <fctr>     <int>
1:  2000      C        20
2:  2000      A         8
3:  2000      D        17
4:  2000      E        18
5:  2001      C        21
6:  2001      A        10
7:  2001      D        18
8:  2001      E        18

data

dt1 <- structure(list(year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 
2001L, 2001L, 2001L, 2001L), Dep = c("C", "a", "b", "D", "E", 
"C", "a", "b", "D", "E"), Homicides = c(20L, 5L, 3L, 17L, 18L, 
21L, 6L, 4L, 18L, 18L)), class = c("data.table", "data.frame"
), row.names = c(NA, -10L))
  • Related