Home > other >  How to pivot_longer with multiple column categories (none numeric) in R
How to pivot_longer with multiple column categories (none numeric) in R

Time:09-03

I have a df similar to the following:

df <- data.frame("name" = c("sue","joe","tom"),
                 "x0_date" = c("2020-1-20", "2020-2-20", "2020-3-20"),
                 "x0_class" = c("biology", "physics", "chemistry"),
                 "x1_date" = c("2021-4-18", NA, "2021-6-18"),
                 "x1_class" = c("english", NA, "drama"))
df
  name   x0_date  x0_class   x1_date x1_class
1  sue 2020-1-20   biology 2021-4-18  english
2  joe 2020-2-20   physics      <NA>     <NA>
3  tom 2020-3-20 chemistry 2021-6-18    drama

I am trying to pivot longer such that my outcome is as follows:

  name      date     class
1  sue 2020-1-20   biology
2  sue 2021-4-18   english
3  joe 2020-2-20   physics
4  tom 2020-3-20 chemistry
5  tom 2021-6-18     drama

I guess I've only ever pivoted when some of the data is numeric, so I'm having trouble. I've tried something like:

test <- df %>% 
  pivot_longer(matches("date|class"),
    names_to = c("date", "class"), 
    names_pattern = "(\\d )_(. )",
    values_drop_na = TRUE)

But the output isn't quite right. If it's relevant, my wide data is extremely wide ("x0" to "x_400"). Any help is appreciated!

CodePudding user response:

We need names_to as .value. According to ?pivot_longer

names_to - If length >1, multiple columns will be created. In this case, one of names_sep or names_pattern must be supplied to specify how the column names should be split. There are also two additional character values you can take advantage of:

NA will discard the corresponding component of the column name.

".value" indicates that the corresponding component of the column name defines the name of the output column containing the cell values, overriding values_to entirely.

Thus, specify that as .value and use names_pattern to capture only the characters that succeeds the _ i.e. (_(.*))

library(dplyr)
library(tidyr)
df %>% 
   pivot_longer(cols = -name, names_to = c(".value"),
       names_pattern = ".*_(.*)", values_drop_na = TRUE)

-output

# A tibble: 5 × 3
  name  date      class    
  <chr> <chr>     <chr>    
1 sue   2020-1-20 biology  
2 sue   2021-4-18 english  
3 joe   2020-2-20 physics  
4 tom   2020-3-20 chemistry
5 tom   2021-6-18 drama    

If we need the x0, x1 prefix as another column, then we need to specify the column name for the prefix and also to capture the prefix part

df %>%
  pivot_longer(cols = -name, names_to = c("categ", ".value"),
     names_pattern = "(.*)_(.*)", values_drop_na = TRUE)

-output

# A tibble: 5 × 4
  name  categ date      class    
  <chr> <chr> <chr>     <chr>    
1 sue   x0    2020-1-20 biology  
2 sue   x1    2021-4-18 english  
3 joe   x0    2020-2-20 physics  
4 tom   x0    2020-3-20 chemistry
5 tom   x1    2021-6-18 drama    

The reason the OP's code didn't work

df %>% 
    pivot_longer(matches("date|class"),
      names_to = c("date", "class"), 
      names_pattern = "(\\d )_(. )",
      values_drop_na = TRUE)
# A tibble: 10 × 4
   name  date  class value    
   <chr> <chr> <chr> <chr>    
 1 sue   0     date  2020-1-20
 2 sue   0     class biology  
 3 sue   1     date  2021-4-18
 4 sue   1     class english  
 5 joe   0     date  2020-2-20
 6 joe   0     class physics  
 7 tom   0     date  2020-3-20
 8 tom   0     class chemistry
 9 tom   1     date  2021-6-18
10 tom   1     class drama    

is because it is capturing one or more digits (\\d ) before the _ and then the rest (. ) as the column prefix, suffix, thus we get the 0, 1, from 'x0', 'x1' in 'date', the suffix part in 'class' and as we didn't specify the .value, the default column value will return with the values of the columns

CodePudding user response:

  • We can use Base R rbind function by splitting df to list of data.frames with column names name , data , class
lapply(1:((ncol(df)-1)/2) , \(i) df[c(1,2*i,2*i 1)]) |> 
lapply(\(x) {colnames(x) <- c("name" , "data" , "class") ; x}) |> 
do.call(rbind , args = _) |> na.omit()
  • Output
  name      data     class
1  sue 2020-1-20   biology
2  joe 2020-2-20   physics
3  tom 2020-3-20 chemistry
4  sue 2021-4-18   english
6  tom 2021-6-18     drama
  • Related