Home > Mobile >  Pivot_longer with regex for two-digit numbers in names_pattern
Pivot_longer with regex for two-digit numbers in names_pattern

Time:10-03

I have a large wide-format dataset which contains many variables measured at multiple waves, for which there is one column per variable-wave combination (e.g. age1,age2,age3, age4), as well as a few time-fixed variables (e.g. ID, sex). After pivoting, I want each variable to be represented by a single column, alongside a new 'wave' column.

It's working almost perfectly, except that I just can't get waves 1-9 and 10-13 represented in the same column.

df <- data.frame(
  ID = c(10001,10002),
  Sex = c(1,2),
  Age1 = c(73,25),
  Age2 = c(74,26),
  Age3=c(75,27),
  Age4 = c(76,28),
  Age5 = c(77,29),
  Age6=c(78,30),
  Age7 = c(79,31),
  Age8 = c(80,31),
  Age9=c(81,33),
  Age10=c(82,34),
  Age11 = c(83,35),
  Age12 = c(84,36),
  Age13=c(85,37)
)

names_test<-names(df) 
no_numb<-grep("*[A-Za-z]$", names_test) #to identify all the column names ending with a letter, which I do NOT want to pivot into longer form

df_long<-pivot_longer(df,cols = !no_numb, names_to = c('.value',"wave"),
                  names_pattern = "(.*)(\\d )$")

Long data output:

> df_long
# A tibble: 20 x 5
      ID   Sex wave    Age  Age1
   <dbl> <dbl> <chr> <dbl> <dbl>
 1 10001     1 1        73    83
 2 10001     1 2        74    84
 3 10001     1 3        75    85
 4 10001     1 4        76    NA
 5 10001     1 5        77    NA
 6 10001     1 6        78    NA
 7 10001     1 7        79    NA
 8 10001     1 8        80    NA
 9 10001     1 9        81    NA
10 10001     1 0        NA    82
11 10002     2 1        25    35
12 10002     2 2        26    36
13 10002     2 3        27    37
14 10002     2 4        28    NA
15 10002     2 5        29    NA
16 10002     2 6        30    NA
17 10002     2 7        31    NA
18 10002     2 8        31    NA
19 10002     2 9        33    NA
20 10002     2 0        NA    34

As you can see, there is a column Age, which has values for waves 1-9, and also a column Age1, which has values for 1,2,3,and 0 (ie Age10,Age11,Age12,Age13). I assume the problem here is either with the names_to argument or the names_pattern argument. Any help would be much appreciated!

CodePudding user response:

The .* is greedy, so it takes the max matching string. You could use

pivot_longer(df,cols = !no_numb, names_to = c('.value',"wave"),
                  names_pattern = "(Age)(\\d )$")

instead or make it non greedy by adding a ?

pivot_longer(df,cols = !no_numb, names_to = c('.value',"wave"),
                  names_pattern = "(.*?)(\\d )$")

This returns

# A tibble: 26 x 4
      ID   Sex wave    Age
   <dbl> <dbl> <chr> <dbl>
 1 10001     1 1        73
 2 10001     1 2        74
 3 10001     1 3        75
 4 10001     1 4        76
 5 10001     1 5        77
 6 10001     1 6        78
 7 10001     1 7        79
 8 10001     1 8        80
 9 10001     1 9        81
10 10001     1 10       82
# ... with 16 more rows
  • Related