Operations with different lines in R-CodePudding

structure(list(id = c(1L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 7L, 8L, 
8L, 8L), firm = c(111L, 222L, 111L, 333L, 444L, 111L, 444L, 555L, 
999L, 12L, 13L, 14L), quarter_1 = c(0L, 1L, 0L, 1L, 1L, 1L, 0L, 
0L, 1L, 0L, 1L, 0L), quarter_2 = c(0L, 1L, 0L, 0L, 1L, 1L, 0L, 
1L, 0L, 1L, 0L, 0L), quarter_3 = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 
0L, 0L, 1L, 0L, 0L), quarter_4 = c(1L, 0L, 0L, 0L, 0L, 1L, 1L, 
0L, 0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, 
-12L))


   id firm quarter_1 quarter_2 quarter_3 quarter_4
1   1  111         0         0         1         1
2   1  222         1         1         0         0
3   2  111         0         0         1         0
4   3  333         1         0         0         0
5   4  444         1         1         1         0
6   5  111         1         1         1         1
7   6  444         0         0         1         1
8   7  555         0         1         0         0
9   7  999         1         0         0         0
10  8   12         0         1         1         0
11  8   13         1         0         0         0
12  8   14         0         0         0         1

I would like to create 4 new columns (firm_q_1, firm_q_2, firm_q_3, firm_q_4) that associate - for each id - the firm the person was employed at each quarter (if any). I potentially have more than one observation for each person (id). Each dummy for quarter indicate wether the person was employed at the firm in that quarter (==1 if yes, ==0 otherwise).

For example, the first two lines refer to person 1. In quarters 3 and 4, the dummies quarter_3 and quarter_4 indicate that we has employed at firm 111. In quarters 1 and 2, he was employed at firm 222. I want to create 4 new columns that display the firm for each quarter as follows:

structure(list(id = c(1L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 7L, 8L, 
8L, 8L), firm = c(111L, 222L, 111L, 333L, 444L, 111L, 444L, 555L, 
999L, 12L, 13L, 14L), quarter_1 = c(0L, 1L, 0L, 1L, 1L, 1L, 0L, 
0L, 1L, 0L, 1L, 0L), quarter_2 = c(0L, 1L, 0L, 0L, 1L, 1L, 0L, 
1L, 0L, 1L, 0L, 0L), quarter_3 = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 
0L, 0L, 1L, 0L, 0L), quarter_4 = c(1L, 0L, 0L, 0L, 0L, 1L, 1L, 
0L, 0L, 0L, 0L, 1L), firm_q_1 = c(222L, NA, 0L, 333L, 444L, 111L, 
0L, 999L, NA, 13L, NA, NA), firm_q_2 = c(222L, NA, 0L, 0L, 444L, 
111L, 0L, 555L, NA, 12L, NA, NA), firm_q_3 = c(111L, NA, 111L, 
0L, 444L, 111L, 444L, 0L, NA, 12L, NA, NA), firm_q_4 = c(111L, 
NA, 0L, 0L, 0L, 111L, 444L, 0L, NA, 14L, NA, NA)), class = "data.frame", row.names = c(NA, 
-12L))

If the person is not employed in any firm at a given quarter, I would like to make it NA.

In short, I wish to sum up information for each id that may be in different lines into only one line since I later want to transform my data in a panel.

I may potentially have even more than two lines for each id.

CodePudding user response：

I would reshape the data to long form, then back to wide as follows. The intermediate dataset has one row for ever combination of person/firm/time period, then we select only those rows where the person was employed in that quarter at that firm, before reshaping wide.

If you want to merge it back into your origindal dataset then you can, but I'm not sure why you would.

library(dplyr)
library(tidyr)

dat |>
  pivot_longer(c(quarter_1, quarter_2,quarter_3,quarter_4),names_prefix = "quarter_") |>
  filter(value==1) |>
  select(-value) |>
  arrange(id,name) |>
  pivot_wider(names_from = name, values_from = firm, names_prefix = "firm_q_")



# A tibble: 8 x 5
     id firm_q_1 firm_q_2 firm_q_3 firm_q_4
  <int>    <int>    <int>    <int>    <int>
1     1      222      222      111      111
2     2       NA       NA      111       NA
3     3      333       NA       NA       NA
4     4      444      444      444       NA
5     5      111      111      111      111
6     6       NA       NA      444      444
7     7      999      555       NA       NA
8     8       NA       12       12       NA