Home > database >  Convert tibble to long form when the variables consist of several parts
Convert tibble to long form when the variables consist of several parts

Time:10-31

I have such data

library(tidyverse)

df = tribble(
  ~id, ~a1, ~a2, ~a3, ~b1, ~b2, ~b3, ~c1, ~c2, ~c3,
    1,   1,   4,   7,  11,  14,  17,  21,  24,  27,
    2,   2,   5,   8,  12,  15,  18,  22,  25,  28,
    3,   3,   6,   8,  13,  16,  19,  23,  26,  29,
)

I would like to convert it to a long form, where the variable names contain two parts name (a, b, c) and number (1, 2, 3) which should become new variables in the long version of the table as below.

   id name nr data
1   1    a  1    1
2   2    a  1    2
3   3    a  1    3
4   1    a  2    4
5   2    a  2    5
6   3    a  2    6
7   1    a  3    7
8   2    a  3    8
9   3    a  3    8
10  1    b  1   11
11  2    b  1   12
12  3    b  1   13
13  1    b  2   14
14  2    b  2   15
15  3    b  2   16
16  1    b  3   17
17  2    b  3   18
18  3    b  3   19
19  1    C  1   21
20  2    C  1   22
21  3    C  1   23
22  1    C  2   24
23  2    C  2   25
24  3    C  2   26
25  1    C  3   27
26  2    C  3   28
27  3    C  3   29

Can it be done simply by using the functions in the dplyr package? I tried the pivot_longer effect was disappointing.

Any prompts are welcome.

CodePudding user response:

I know this question has been asked before, but I can't find a good duplicate target. In the meantime, if you specify the regex to differentiate between the name portion and the nr portion of your column names, you can do it in one function call:


df %>%
  pivot_longer(-id, names_to = c("name", "nr"), 
               values_to = "data", 
               names_pattern = "(^[a-z])(\\d$)")
#> # A tibble: 27 × 4
#>       id name  nr     data
#>    <dbl> <chr> <chr> <dbl>
#>  1     1 a     1         1
#>  2     1 a     2         4
#>  3     1 a     3         7
#>  4     1 b     1        11
#>  5     1 b     2        14
#>  6     1 b     3        17
#>  7     1 c     1        21
#>  8     1 c     2        24
#>  9     1 c     3        27
#> 10     2 a     1         2
#> # … with 17 more rows

Adapt the regex as needed if you have different column names in practice, but this separates them so that the first piece comes from a single lowercase letter at the beginning of the string, and the second piece comes from a single number at the end of the string.

CodePudding user response:

We may do this in a couple of ways - i.e. first reshape to 'long' format with pivot_longer excluding the 'id' column, then just separate the 'name' column into two by specifying the sep as a regex lookaround i.e. (as there is only a single lower case letter), split after the first occurrence of the letter ((?<=[a-z]))

library(dplyr)
library(tidyr)
df %>% 
   pivot_longer(cols = -id, names_to = 'name', values_to = 'data') %>%
   separate(name, into = c("name", 'nr'), sep = "(?<=[a-z])")

-output

 A tibble: 27 × 4
      id name  nr     data
   <dbl> <chr> <chr> <dbl>
 1     1 a     1         1
 2     1 a     2         4
 3     1 a     3         7
 4     1 b     1        11
 5     1 b     2        14
 6     1 b     3        17
 7     1 c     1        21
 8     1 c     2        24
 9     1 c     3        27
10     2 a     1         2
# … with 17 more rows

Or another optioin is to append a suffix in the column names and then use pivot_longer

library(stringr)
df %>% 
   rename_with(~ str_c(., "_data"), -id) %>% 
   pivot_longer(cols = -id, names_to = c("name", "nr", ".value"), 
       names_pattern = "^(.)(.)_(.*)")
  • Related