I have a large wide-format dataset which contains many variables measured at multiple waves, for which there is one column per variable-wave combination (e.g. age1,age2,age3, age4), as well as a few time-fixed variables (e.g. ID, sex). After pivoting, I want each variable to be represented by a single column, alongside a new 'wave' column.
It's working almost perfectly, except that I just can't get waves 1-9 and 10-13 represented in the same column.
df <- data.frame(
ID = c(10001,10002),
Sex = c(1,2),
Age1 = c(73,25),
Age2 = c(74,26),
Age3=c(75,27),
Age4 = c(76,28),
Age5 = c(77,29),
Age6=c(78,30),
Age7 = c(79,31),
Age8 = c(80,31),
Age9=c(81,33),
Age10=c(82,34),
Age11 = c(83,35),
Age12 = c(84,36),
Age13=c(85,37)
)
names_test<-names(df)
no_numb<-grep("*[A-Za-z]$", names_test) #to identify all the column names ending with a letter, which I do NOT want to pivot into longer form
df_long<-pivot_longer(df,cols = !no_numb, names_to = c('.value',"wave"),
names_pattern = "(.*)(\\d )$")
Long data output:
> df_long
# A tibble: 20 x 5
ID Sex wave Age Age1
<dbl> <dbl> <chr> <dbl> <dbl>
1 10001 1 1 73 83
2 10001 1 2 74 84
3 10001 1 3 75 85
4 10001 1 4 76 NA
5 10001 1 5 77 NA
6 10001 1 6 78 NA
7 10001 1 7 79 NA
8 10001 1 8 80 NA
9 10001 1 9 81 NA
10 10001 1 0 NA 82
11 10002 2 1 25 35
12 10002 2 2 26 36
13 10002 2 3 27 37
14 10002 2 4 28 NA
15 10002 2 5 29 NA
16 10002 2 6 30 NA
17 10002 2 7 31 NA
18 10002 2 8 31 NA
19 10002 2 9 33 NA
20 10002 2 0 NA 34
As you can see, there is a column Age, which has values for waves 1-9, and also a column Age1, which has values for 1,2,3,and 0 (ie Age10,Age11,Age12,Age13). I assume the problem here is either with the names_to argument or the names_pattern argument. Any help would be much appreciated!
CodePudding user response:
The .*
is greedy, so it takes the max matching string. You could use
pivot_longer(df,cols = !no_numb, names_to = c('.value',"wave"),
names_pattern = "(Age)(\\d )$")
instead or make it non greedy by adding a ?
pivot_longer(df,cols = !no_numb, names_to = c('.value',"wave"),
names_pattern = "(.*?)(\\d )$")
This returns
# A tibble: 26 x 4
ID Sex wave Age
<dbl> <dbl> <chr> <dbl>
1 10001 1 1 73
2 10001 1 2 74
3 10001 1 3 75
4 10001 1 4 76
5 10001 1 5 77
6 10001 1 6 78
7 10001 1 7 79
8 10001 1 8 80
9 10001 1 9 81
10 10001 1 10 82
# ... with 16 more rows